Senior Site Reliability Engineer (Cloud Native & Observability)
Role Overview:
Responsible for building and operating highly resilient, scalable, and cost-optimized systems across multi-cloud environments. The role focuses on infrastructure as code, observability, chaos engineering, and Kubernetes ecosystem stability for distributed systems used by millions of users.
- Design and implement scalable and reliable systems.
- Monitor system health using tools like Prometheus, Grafana, or Datadog.
- Manage CI/CD pipelines and infrastructure automation (e.g., Jenkins, GitHub Actions).
- Troubleshoot incidents and ensure root cause analysis is completed.
- Work with DevOps and development teams to improve system performance.
- Build tools and infrastructure as code (IaC) to automate operations and reduce manual intervention.
Key Responsibilities:
- Architect and maintain multi-region Kubernetes clusters (AKS/EKS/GKE) with Istio/Linkerd service mesh.
- Implement full-stack observability using OpenTelemetry, Grafana Loki, and Jaeger.
- Build self-healing infrastructure with tools such as KEDA, Argo CD, and Crossplane.
- Design and manage CI/CD pipelines using a GitOps approach (FluxCD/Argo CD).
- Conduct chaos testing using Gremlin or LitmusChaos to validate system resilience.
- Work with finance and ops teams on FinOps strategies for optimizing cloud usage (e.g., Spot instances, autoscaling policies).
- Implement policy-as-code for security compliance via OPA/Gatekeeper.
Technology Stack:
- Languages: Go, Python, Bash
- Cloud: AWS, Azure, GCP
- IaC Tools: Terraform, Helm, Pulumi
- Observability: Prometheus, Grafana, ELK, New Relic
- Certifications Preferred: CKA, CKAD, Terraform Associate, Google SRE, FinOps Certified Practitioner
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Familiarity with Terraform, Ansible, or other automation tools.
- Experience with public cloud (AWS, Azure, or GCP).
- Strong scripting skills (Python, Bash, or Go).