Site Reliability Engineer - Cloud & Platform Resilience

Pasig, Metro Manila, Philippines

Job Summary:

We are seeking an experienced and proactive Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of critical digital platforms in a high-availability environment. This role combines software engineering and systems administration principles to automate operations, optimize monitoring, and manage incident response within cloud-native banking and fintech ecosystems.

The ideal candidate has a deep understanding of distributed systems, automation frameworks, and modern observability stacks, with experience operating production-grade environments in regulated industries.

Key Responsibilities:

Maintain and improve platform reliability through automation, monitoring, and performance tuning.
Build tools and services that reduce manual operations and improve developer productivity.
Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for key services.
Design and maintain CI/CD pipelines, infrastructure as code (IaC), and container orchestration (e.g., Kubernetes).
Monitor application health, conduct incident response and postmortems, and drive root cause analysis.
Collaborate with software engineers, infrastructure teams, and cybersecurity teams to implement secure, resilient system designs.
Optimize cloud resource usage (e.g., AWS, Azure, GCP) to improve cost efficiency and fault tolerance.
Develop and maintain runbooks, operational playbooks, and documentation for ongoing support.

Qualifications:

Bachelor's degree in Computer Science, Information Systems, or a related field.
3-5 years of experience in SRE, DevOps, or platform engineering roles in enterprise or cloud-based environments.
Proficiency in infrastructure automation tools (Terraform, Ansible, Helm) and scripting languages (Python, Bash, Go).
Hands-on experience with Docker, Kubernetes, and other container orchestration platforms.
Strong familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
Experience with incident management, monitoring strategies, and system performance tuning.
Understanding of CI/CD pipelines (e.g., Jenkins, GitLab CI, ArgoCD) and GitOps principles.
Knowledge of system security, compliance, and availability in regulated industries is a plus.