Lead Site Reliability Engineer (Remote)

R 50,000.00 - 85,000.00 (South African Rand)

Lead Site Reliability Engineer (GCP - Google Cloud)

Are you a seasoned SRE with a proven track record of driving reliability and scalability in production environments?

Our client is seeking a highly skilled and motivated Lead Site Reliability Engineer specialising in Google Cloud Platform to join their team.

This is a permanent, full-time, remote position. It will let you lean into your GCP skills and grow as a leader while you support a fast-paced UK-based client and look after a distributed team of SREs.

You'll hone your DevOps skills on challenging projects, lead a diverse team of multi-skilled people, gain international exposure, and work with one of the leaders in the cloud tech and SaaS space in Africa.

About the Role:

As the Lead SRE (GCP - Google Cloud), you will be instrumental in leading a high-performing SRE team, implementing robust monitoring, automation, and DevOps practices on Google Cloud Platform. Your expertise will ensure system uptime, efficiency, and performance, while also mentoring others and fostering a culture of engineering excellence.

Key Responsibilities & Outputs:

Lead and mentor a team of SREs, promoting knowledge sharing and growth.
Act as the technical authority on SRE practices for GCP, ensuring system reliability and uptime across environments.
Oversee team workload distribution and manage stakeholder expectations.
Champion and implement DevOps and SRE best practices with emphasis on automation and scalability.
Drive monitoring and observability initiatives, leveraging tools like Grafana, Prometheus, and Stackdriver (now Google Cloud Observability).
Design, maintain, and optimise CI/CD pipelines using GCP-native tools and industry standards.
Troubleshoot complex production incidents, ensuring root cause analysis and long-term fixes.
Collaborate with cross-functional teams to ensure consistent platform performance.
Apply Infrastructure as Code (IaC) principles using tools such as Terraform or Deployment Manager.
Stay abreast of emerging technologies to continually evolve our tooling and architecture.
Foster a proactive and blameless incident management culture.

Education & Experience:

Degree or Diploma in Information Technology, Computer Science, or equivalent experience.
Google Cloud certifications (e.g., Professional Cloud DevOps Engineer, Professional Cloud Architect) are highly advantageous.
Minimum of 3 years in a management/leadership capacity within SRE/DevOps teams.
Strong experience working on GCP infrastructure and services.
Experience with Kubernetes, Docker, and container orchestration at scale.
Familiarity with incident management, post-mortem processes, and production monitoring tools.
Hands-on experience with IaC tools such as Terraform, Ansible, or Deployment Manager.
Experience working with CI/CD pipelines and automation tools.
UNIX/Linux administration expertise.
Familiarity with security, compliance, and cost optimisation on GCP.

Technical Skills/Knowledge:

Strong scripting and automation skills (Python, Bash or other shell scripting).
Familiarity with configuration management (Chef, Puppet, or Ansible).
Experience with observability platforms (Grafana, Prometheus, Stackdriver, etc.).
Deep understanding of system performance and reliability engineering.

Abilities / Behaviours:

Strong leadership and team development capabilities.
High level of professionalism and ownership.
Excellent communication and stakeholder management skills.
Ability to manage priorities in high-pressure environments.
Passion for continuous improvement and driving engineering excellence.
Innovative thinker willing to challenge the status quo.

Apply now to set up a time to discuss how this role could take your career to the next level!