Staff DevOps Engineer

San Francisco, California, United States

About the job

This is a hands-on Staff DevOps Engineer role responsible for designing, operating, and evolving a highly available, multi-tenant platform on AWS. You will work closely with software engineering to deploy, operate, and scale production systems while driving improvements in reliability, automation, and performance.

This role requires strong ownership of infrastructure and production systems. You will also provide technical leadership and mentorship to other DevOps engineers.

In addition, you will help introduce and operationalize AI/LLM capabilities within the platform.

AI / LLM Systems (Emerging Area)

Experience operating or integrating LLM/AI services in production environments, including tracing and evaluation (OpenTelemetry, LangSmith, LangFuse or equivalent)
Experience managing performance, cost, and reliability of LLM workloads (latency, token usage, rate limiting, fallbacks)
Experience using AI/agentic developer tools (e.g., Claude Code, Cursor or similar) to accelerate DevOps workflows and improve engineering efficiency

What You'll Be Responsible For

Design, build, and operate scalable, highly available infrastructure in AWS
Own and evolve infrastructure as code (Terraform) across all environments
Operate and optimize Aurora PostgreSQL (replication, failover, performance tuning)
Operate ECS (Fargate), ECR, and containerized services
Operate Kafka-based event streaming systems
Manage Auto Scaling Groups and EC2-based workloads
Design and maintain CI/CD pipelines (Buildkite)
Build automation to eliminate manual operational work
Manage and secure secrets and access (Vault, AWS Secrets Manager, IAM)
Partner with engineering teams to improve system reliability and performance
Provide technical leadership and mentorship to DevOps engineers
Drive cost optimization across AWS infrastructure
Operate systems behind Cloudflare (WAF, CDN, traffic management)

Production Reliability & Incident Ownership

Own production incident response end-to-end (triage, mitigation, coordination)
Lead high-severity outage response under pressure
Drive root cause analysis (RCA) and enforce follow-ups
Continuously improve system resilience and recovery mechanisms

Observability & System Insight

Design and operate end-to-end observability (metrics, logs, tracing)
Build high-signal monitoring, alerting, and dashboards
Define and enforce SLIs/SLOs and alerting standards
Reduce alert fatigue and improve signal-to-noise ratio

What You Bring

Deep experience operating production systems on AWS (ECS/Fargate, EC2, networking, IAM)
Expert-level Terraform experience managing infrastructure at scale
Strong experience with containerized applications and distributed systems (e.g., Kafka)
Experience operating multi-tenant, highly available systems
Proven ownership of production on-call and critical incident resolution
Strong systems fundamentals (Linux, networking, debugging)
Strong scripting ability (Bash, Python or equivalent)
Experience designing and operating CI/CD systems
Strong understanding of security best practices (IAM, secrets management)

Nice to Have

Experience operating multi-region or globally distributed systems
Experience working with Cloudflare at scale
Experience optimizing high-throughput or event-driven systems
