About the job: Site Reliability Lead

Job Description

  • Lead, mentor, and manage a team of Site Reliability Engineers, ensuring coverage across shifts and on-call rotations.
  • Define team goals, KPIs, and performance metrics aligned with service reliability and business continuity.
  • Conduct regular coaching, performance reviews, and skills development planning.
  • Oversee workload distribution, escalation protocols, and incident ownership across the team.
  • Champion a culture of documentation, knowledge sharing, and operational discipline.

Key Responsibilities

  • Own the architecture and lifecycle of monitoring, alerting, and logging systems.
  • Ensure early detection, triage, and escalation of service degradation based on SLAs.
  • Lead major incident response, root cause analysis (RCA), and postmortem documentation.
  • Review and approve SOPs, runbooks, and playbooks created by the team.
  • Analyze incident trends and drive systemic fixes to reduce recurrence and improve MTTR.
  • Work closely with DevOps, Infrastructure, QA, and Development teams to improve deployment readiness and system resilience.
  • Represent the SRE function in planning meetings, audits, and compliance reviews.
  • Collaborate with ITSM teams to align incident, problem, and change management processes.

Skills and Competencies

  • Proven leadership experience managing technical operations or SRE teams.
  • Strong command of ITSM platforms (e.g., ServiceNow, Jira Service Management).
  • Deep understanding of monitoring tools (e.g., Prometheus, Grafana, ELK, Datadog).
  • Familiarity with ITIL principles and regulatory frameworks (e.g., BSP, PDIC, ISO 27001).
  • Expertise in incident response, escalation protocols, and RCA methodologies.
  • Excellent communication and stakeholder management skills.
  • Ability to synthesize operational data into actionable insights and team strategies.

Qualifications and Experience

  • Bachelor's degree in Computer Science, Information Technology, Electronics Engineering, or equivalent.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
  • 2+ years in a leadership or team management capacity.
  • Hands-on experience with cloud platforms (AWS, GCP, Azure).
  • Working knowledge of scripting (Python, Bash) and Linux systems.
  • Experience in fintech, banking, or SaaS environments with high-availability SLAs.