Analytics Observability Engineer

About the job

As a remote Analytics Observability Engineer, you will design, implement, and maintain end-to-end observability solutions that provide visibility into analytics systems, data pipelines, and application performance. You will work cross-functionally with engineering, data science, DevOps, and SRE teams to enable proactive monitoring, alerting, logging, and tracing across data infrastructure.

This role is critical to data reliability, system uptime, and actionable insights: you will build tools and dashboards that improve system transparency and performance.

Key Responsibilities:

Architect and implement observability solutions for analytics platforms and data pipelines (ETL/ELT, streaming, batch)

Integrate monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic) into analytics environments (Spark, Airflow, dbt, etc.)

Design real-time dashboards and alerts that capture system health, job failures, data anomalies, and latency issues

Analyze telemetry data to identify performance degradation, failures, or capacity bottlenecks

Enable distributed tracing for data flows using OpenTelemetry, Jaeger, or similar technologies

Collaborate with Data Engineering and Site Reliability Engineering (SRE) teams to build scalable and fault-tolerant observability stacks

Drive observability best practices and help teams adopt instrumentation standards

Write infrastructure-as-code to deploy monitoring systems (Terraform, Helm, Kubernetes, etc.)

Required Qualifications:

Bachelor's degree in Computer Science, Data Engineering, or a related field

2+ years of experience in observability, SRE, DevOps, or DataOps

Deep knowledge of observability tools such as Grafana, Prometheus, Datadog, Splunk, New Relic, or Honeycomb

Familiarity with monitoring cloud-based data systems (AWS/GCP/Azure) and platforms such as Snowflake, BigQuery, Redshift, or Databricks

Proficiency in scripting and automation (e.g., Python, Bash)

Experience with infrastructure management and orchestration tools (Kubernetes, Terraform, Helm)

Strong analytical and debugging skills using telemetry data