Site Reliability Engineer - Cloud & Platform Resilience

About the job

Job Summary:

We are seeking an experienced and proactive Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of critical digital platforms in a high-availability environment. This role combines software engineering and systems administration principles to automate operations, optimize monitoring, and manage incident response within cloud-native banking and fintech ecosystems.

The ideal candidate has a deep understanding of distributed systems, automation frameworks, and modern observability stacks, with experience operating production-grade environments in regulated industries.

Key Responsibilities:

  • Maintain and improve platform reliability through automation, monitoring, and performance tuning.
  • Build tools and services that reduce manual operations and improve developer productivity.
  • Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for key services.
  • Design and maintain CI/CD pipelines, infrastructure as code (IaC), and container orchestration (e.g., Kubernetes).
  • Monitor application health, conduct incident response and postmortems, and drive root cause analysis.
  • Collaborate with software engineers, infrastructure teams, and cybersecurity teams to implement secure, resilient system designs.
  • Optimize cloud resource usage across providers (e.g., AWS, Azure, GCP) to improve cost efficiency and fault tolerance.
  • Develop and maintain runbooks, operational playbooks, and documentation for ongoing support.

Qualifications:

  • Bachelor's degree in Computer Science, Information Systems, or a related field.
  • 3-5 years of experience in SRE, DevOps, or platform engineering roles in enterprise or cloud-based environments.
  • Proficiency in infrastructure automation tools (Terraform, Ansible, Helm) and scripting languages (Python, Bash, Go).
  • Hands-on experience with Docker, Kubernetes, and other container orchestration platforms.
  • Strong familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
  • Experience with incident management, monitoring strategies, and system performance tuning.
  • Understanding of CI/CD pipelines (e.g., Jenkins, GitLab CI, ArgoCD) and GitOps principles.
  • Knowledge of system security, compliance, and availability in regulated industries is a plus.