About the Job: Staff DevOps Engineer

This is a hands-on Staff DevOps Engineer role responsible for designing, operating, and evolving a highly available, multi-tenant platform on AWS. You will work closely with software engineering to deploy, operate, and scale production systems while driving improvements in reliability, automation, and performance.

This role requires strong ownership of infrastructure and production systems. You will also provide technical leadership and mentorship to other DevOps engineers.

In addition, you will help introduce and operationalize AI/LLM capabilities within the platform.

AI / LLM Systems (Emerging Area)

  • Experience operating or integrating LLM/AI services in production environments, including tracing and evaluation (OpenTelemetry, LangSmith, LangFuse or equivalent)
  • Experience managing performance, cost, and reliability of LLM workloads (latency, token usage, rate limiting, fallbacks)
  • Experience using AI/agentic developer tools (e.g., Claude Code, Cursor, or similar) to accelerate DevOps workflows and improve engineering efficiency

What You'll Be Responsible For

  • Design, build, and operate scalable, highly available infrastructure in AWS
  • Own and evolve infrastructure as code (Terraform) across all environments
  • Operate and optimize Aurora PostgreSQL (replication, failover, performance tuning)
  • Operate ECS (Fargate), ECR, and containerized services
  • Operate Kafka-based event streaming systems
  • Manage Auto Scaling Groups and EC2-based workloads
  • Design and maintain CI/CD pipelines (Buildkite)
  • Build automation to eliminate manual operational work
  • Manage and secure secrets and access (Vault, AWS Secrets Manager, IAM)
  • Partner with engineering teams to improve system reliability and performance
  • Provide technical leadership and mentorship to DevOps engineers
  • Drive cost optimization across AWS infrastructure
  • Operate systems behind Cloudflare (WAF, CDN, traffic management)

Production Reliability & Incident Ownership

  • Own production incident response end-to-end (triage, mitigation, coordination)
  • Lead high-severity outage response under pressure
  • Drive root cause analysis (RCA) and enforce follow-ups
  • Continuously improve system resilience and recovery mechanisms

Observability & System Insight

  • Design and operate end-to-end observability (metrics, logs, tracing)
  • Build high-signal monitoring, alerting, and dashboards
  • Define and enforce SLIs/SLOs and alerting standards
  • Reduce alert fatigue and improve signal-to-noise ratio


What You Bring

  • Deep experience operating production systems on AWS (ECS/Fargate, EC2, networking, IAM)
  • Expert-level Terraform experience managing infrastructure at scale
  • Strong experience with containerized applications and distributed systems (e.g., Kafka)
  • Experience operating multi-tenant, highly available systems
  • Proven ownership of production on-call and a track record of resolving critical incidents
  • Strong systems fundamentals (Linux, networking, debugging)
  • Strong scripting ability (Bash, Python or equivalent)
  • Experience designing and operating CI/CD systems
  • Strong understanding of security best practices (IAM, secrets management)

Nice to Have

  • Experience operating multi-region or globally distributed systems
  • Experience working with Cloudflare at scale
  • Experience optimizing high-throughput or event-driven systems