SRE Engineer

About the job

We are looking for an experienced SRE to ensure the reliability, scalability, and performance of our AWS-based infrastructure.

You will focus on building resilient systems, driving observability, and reducing operational toil through automation, while collaborating closely with development teams to maintain high service levels.

This role balances infrastructure engineering with reliability practices. While you will support internal teams with deployment and platform needs, the primary focus is on system reliability, automation, and infrastructure excellence rather than day-to-day operational firefighting.

Key Responsibilities
Reliability & Observability
-Define and maintain SLOs, SLIs, and error budgets across services and environments
-Implement comprehensive monitoring, alerting, and observability (metrics, logs, traces)
-Drive incident response processes, post-mortems, and reliability improvements
-Identify and reduce operational toil through automation and self-service tooling

Infrastructure & Networking
-Design, deploy, and maintain AWS infrastructure across production, staging, and sandbox environments
-Manage Kubernetes clusters (EKS): scaling, upgrades, troubleshooting, and reliability
-Maintain networking components: VPCs, VPN connections, site-to-site connectivity, load balancing
-Implement and maintain Infrastructure as Code using Terraform
-Design high-availability, fault-tolerant, and auto-scaling architectures

Automation & Developer Experience
-Build and maintain CI/CD pipelines using Jenkins
-Create self-service tooling and workflows to improve developer productivity
-Support test environment provisioning and deployment automation
-Collaborate with development and product teams to improve deployment reliability

Platform Support
-Support and scale Node.js and PHP-based ecosystems
-Work with MongoDB and DynamoDB: performance tuning, backup strategies, and reliability
-Maintain and optimize serverless solutions (AWS Lambda, event-driven architectures)
-Ensure security best practices, cost optimization, and operational excellence

Required Skills & Experience
-Strong hands-on experience in an AWS-centric environment
-Proven experience with Kubernetes (EKS): cluster operations, troubleshooting, scaling
-Solid knowledge of Terraform for infrastructure provisioning
-Experience with SRE practices: SLOs, incident management, post-mortems, toil reduction
-Hands-on experience with networking: VPCs, VPNs, site-to-site connections, DNS, load balancing
-Experience with monitoring and observability tooling
-Proven experience with CI/CD automation, specifically Jenkins
-Experience with high-availability and fault-tolerant architectures
-Strong understanding of security, networking, and cloud best practices
-Experience operating production systems with high traffic and transaction volumes

Nice to Have
-Experience with Datadog or the Prometheus / Grafana stack
-Experience supporting Node.js and PHP application ecosystems
-Experience with MongoDB and/or DynamoDB
-Experience with serverless architectures (AWS Lambda)
-Experience in high-volume transaction systems (payments, fintech, gaming)
-Familiarity with PCI, security, and compliance-driven environments
-Scripting experience (Bash, Python)
