Lead Site Reliability Engineer (Remote)

Lead Site Reliability Engineer (GCP - Google Cloud)

Are you a seasoned SRE with a proven track record of driving reliability and scalability in production environments? 

Our client is seeking a highly skilled and motivated Lead Site Reliability Engineer specialising in Google Cloud Platform to join their team. 

This is a permanent, full-time, remote position. It will allow you to lean into your GCP skills and grow as a leader while you support a fast-paced UK-based client and look after a distributed team of SREs.

You'll hone your DevOps skills on challenging projects, lead a diverse team of multi-skilled people, gain international exposure, and work with one of the leaders in the cloud tech and SaaS space in Africa.

About the Role:

As the Lead SRE (GCP - Google Cloud), you will be instrumental in leading a high-performing SRE team and implementing robust monitoring, automation, and DevOps practices on Google Cloud Platform. Your expertise will ensure system uptime, efficiency, and performance, and you will also mentor others and foster a culture of engineering excellence.

Key Responsibilities & Outputs:

  • Lead and mentor a team of SREs, promoting knowledge sharing and growth.
  • Act as the technical authority on SRE practices for GCP, ensuring system reliability and uptime across environments.
  • Oversee team workload distribution and manage stakeholder expectations.
  • Champion and implement DevOps and SRE best practices with emphasis on automation and scalability.
  • Drive monitoring and observability initiatives, leveraging tools like Grafana, Prometheus, and Stackdriver.
  • Design, maintain, and optimise CI/CD pipelines using GCP-native tools and industry standards.
  • Troubleshoot complex production incidents, carrying out root cause analysis and implementing long-term fixes.
  • Collaborate with cross-functional teams to ensure consistent platform performance.
  • Apply Infrastructure as Code (IaC) principles using tools such as Terraform or Deployment Manager.
  • Stay abreast of emerging technologies to continually evolve our tooling and architecture.
  • Foster a proactive and blameless incident management culture.

Education & Experience:

  • Degree or Diploma in Information Technology, Computer Science, or equivalent experience.
  • Google Cloud certifications (e.g., Professional Cloud DevOps Engineer, Professional Cloud Architect) are highly advantageous.
  • Minimum of 3 years in a management/leadership capacity within SRE/DevOps teams.
  • Strong experience working on GCP infrastructure and services.
  • Experience with Kubernetes, Docker, and container orchestration at scale.
  • Familiarity with incident management, post-mortem processes, and production monitoring tools.
  • Hands-on experience with IaC tools such as Terraform, Ansible, or Deployment Manager.
  • Experience working with CI/CD pipelines and automation tools.
  • UNIX/Linux administration expertise.
  • Familiarity with security, compliance, and cost optimisation on GCP.

Technical Skills/Knowledge:

  • Strong scripting and automation skills (Python, Bash, Shell).
  • Familiarity with configuration management (Chef, Puppet, or Ansible).
  • Use of observability platforms (Grafana, Prometheus, Stackdriver, etc.).
  • Deep understanding of system performance and reliability engineering.

Abilities / Behaviours:

  • Strong leadership and team development capabilities.
  • High level of professionalism and ownership.
  • Excellent communication and stakeholder management skills.
  • Ability to manage priorities in high-pressure environments.
  • Passion for continuous improvement and driving engineering excellence.
  • Innovative thinker willing to challenge the status quo.

Apply now to set up some time to speak about how this role could take your career to the next level!