Senior Site Reliability Engineer (Cloud Native & Observability)

Putrajaya, Malaysia

Role Overview:

Responsible for building and operating highly resilient, scalable, and cost-optimized systems in multi-cloud environments. The role focuses on infrastructure as code, observability, chaos engineering, and Kubernetes ecosystem stability across distributed systems used by millions of users.

Design and implement scalable and reliable systems.
Monitor system health using tools like Prometheus, Grafana, or Datadog.
Manage CI/CD pipelines and infrastructure automation (e.g., Jenkins, GitHub Actions).
Troubleshoot incidents and ensure root cause analysis is completed.
Work with DevOps and development teams to improve system performance.
Build tools to automate operations and reduce manual intervention (IaC).

Key Responsibilities:

Architect and maintain multi-region Kubernetes clusters (AKS/EKS/GKE) with Istio/Linkerd service mesh.
Implement full-stack observability using OpenTelemetry, Grafana Loki, and Jaeger.
Build self-healing infrastructure with tools such as KEDA, Argo CD, and Crossplane.
Design and manage CI/CD pipelines with a GitOps approach (FluxCD/Argo CD).
Conduct chaos testing using Gremlin or LitmusChaos to validate system resilience.
Work with finance and ops teams on FinOps strategies for optimizing cloud usage (Spot instances, autoscaling policies).
Implement policy-as-code for security compliance via OPA/Gatekeeper.

Technology Stack:

Languages: Go, Python, Bash
Cloud: AWS, Azure, GCP
IaC Tools: Terraform, Helm, Pulumi
Observability: Prometheus, Grafana, ELK, New Relic
Certifications Preferred: CKA, CKAD, Terraform Associate, Google SRE, FinOps Certified Practitioner

Requirements:

Minimum 8 years of experience in DevOps/SRE, with at least 5 years specifically in Site Reliability Engineering roles.
Bachelor's degree in Computer Science, Engineering, or a related field.
Familiarity with Terraform, Ansible, or other automation tools.
Experience with public cloud (AWS, Azure, or GCP).
Strong scripting skills (Python, Bash, or Go).
