Site Reliability Lead
Job Description
- Lead, mentor, and manage a team of Site Reliability Engineers, ensuring coverage across shifts and on-call rotations.
- Define team goals, KPIs, and performance metrics aligned with service reliability and business continuity.
- Conduct regular coaching, performance reviews, and skills development planning.
- Oversee workload distribution, escalation protocols, and incident ownership across the team.
- Champion a culture of documentation, knowledge sharing, and operational discipline.
Key Responsibilities
- Own the architecture and lifecycle of monitoring, alerting, and logging systems.
- Ensure early detection, triage, and escalation of service degradation based on SLAs.
- Lead major incident response, root cause analysis (RCA), and postmortem documentation.
- Review and approve SOPs, runbooks, and playbooks created by the team.
- Analyze incident trends and drive systemic fixes to reduce recurrence and shorten MTTR.
- Work closely with DevOps, Infrastructure, QA, and Development teams to improve deployment readiness and system resilience.
- Represent the SRE function in planning meetings, audits, and compliance reviews.
- Collaborate with ITSM teams to align incident, problem, and change management processes.
Skills and Competencies
- Proven leadership experience in managing technical operations or SRE teams.
- Strong command of ITSM platforms (e.g., ServiceNow, Jira Service Management).
- Deep understanding of monitoring tools (e.g., Prometheus, Grafana, ELK, Datadog).
- Familiarity with ITIL principles and regulatory frameworks (e.g., BSP, PDIC, ISO 27001).
- Expertise in incident response, escalation protocols, and RCA methodologies.
- Excellent communication and stakeholder management skills.
- Ability to synthesize operational data into actionable insights and team strategies.
Qualifications and Experience
- Bachelor's degree in Computer Science, Information Technology, Electronics Engineering, or equivalent.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles.
- 2+ years in a leadership or team management capacity.
- Hands-on experience with cloud platforms (AWS, GCP, Azure).
- Proficiency in scripting (Python, Bash) and Linux systems.
- Experience in fintech, banking, or SaaS environments with high-availability SLAs.