Site Reliability Lead
Job Description
Lead, mentor, and manage a team of Site Reliability Engineers, ensuring coverage across shifts and on-call rotations.
Define team goals, KPIs, and performance metrics aligned with service reliability and business continuity.
Conduct regular coaching, performance reviews, and skills development planning.
Oversee workload distribution, escalation protocols, and incident ownership across the team.
Champion a culture of documentation, knowledge sharing, and operational discipline.
Key Responsibilities
Own the architecture and lifecycle of monitoring, alerting, and logging systems.
Ensure early detection, triage, and escalation of service degradation in line with SLAs.
Lead major incident response, root cause analysis (RCA), and postmortem documentation.
Review and approve SOPs, runbooks, and playbooks created by the team.
Analyze incident trends and drive systemic fixes to reduce recurrence and improve MTTR.
Work closely with DevOps, Infrastructure, QA, and Development teams to improve deployment readiness and system resilience.
Represent the SRE function in planning meetings, audits, and compliance reviews.
Collaborate with ITSM teams to align incident, problem, and change management processes.
Skills and Competencies
Proven leadership experience in managing technical operations or SRE teams.
Strong command of ITSM platforms (e.g., ServiceNow, Jira Service Management).
Deep understanding of monitoring tools (e.g., Prometheus, Grafana, ELK, Datadog).
Familiarity with ITIL principles and regulatory frameworks (e.g., BSP, PDIC, ISO 27001).
Expertise in incident response, escalation protocols, and RCA methodologies.
Excellent communication and stakeholder management skills.
Ability to synthesize operational data into actionable insights and team strategies.
Qualifications and Experience
Bachelor's degree in Computer Science, Information Technology, Electronics Engineering, or equivalent.
5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
2+ years in a leadership or team management capacity.
Hands-on experience with cloud platforms (AWS, GCP, Azure).
Knowledge of scripting (Python, Bash) and Linux systems.
Experience in fintech, banking, or SaaS environments with high-availability SLAs.