Site Reliability Engineer - Cloud & Platform Resilience
Job Summary:
We are seeking an experienced and proactive Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of critical digital platforms in a high-availability environment. This role combines software engineering and systems administration principles to automate operations, optimize monitoring, and manage incident response within cloud-native banking and fintech ecosystems.
The ideal candidate has a deep understanding of distributed systems, automation frameworks, and modern observability stacks, with experience operating production-grade environments in regulated industries.
Key Responsibilities:
- Maintain and improve platform reliability through automation, monitoring, and performance tuning.
- Build tools and services that reduce manual operations and improve developer productivity.
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for key services.
- Design and maintain CI/CD pipelines, infrastructure as code (IaC), and container orchestration (e.g., Kubernetes).
- Monitor application health, conduct incident response and postmortems, and drive root cause analysis.
- Collaborate with software engineers, infrastructure teams, and cybersecurity teams to implement secure, resilient system designs.
- Optimize cloud resource usage (e.g., AWS, Azure, GCP) to improve cost efficiency and fault tolerance.
- Develop and maintain runbooks, operational playbooks, and documentation for ongoing support.
Qualifications:
- Bachelor's degree in Computer Science, Information Systems, or a related field.
- 3-5 years of experience in SRE, DevOps, or platform engineering roles in enterprise or cloud-based environments.
- Proficiency in infrastructure automation tools (Terraform, Ansible, Helm) and scripting languages (Python, Bash, Go).
- Hands-on experience with Docker, Kubernetes, and other container orchestration platforms.
- Strong familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
- Experience with incident management, monitoring strategies, and system performance tuning.
- Understanding of CI/CD pipelines (e.g., Jenkins, GitLab CI, ArgoCD) and GitOps principles.
- Knowledge of system security, compliance, and availability in regulated industries is a plus.