Senior Data Pipeline & Backend Engineer
About the job
We're looking for a Senior Data Pipeline & Backend Engineer who is equally strong in Airflow DAGs, modern data ingestion tools (Airbyte, Fivetran, custom connectors), ETL/ELT pipelines, and backend engineering. This person will own and scale the full data pipeline (ingestion → augmentation → AI processing → storage) that powers Autoplay's session analysis system.
You'll work across ingestion, transformation, augmentation, and backend systems to keep the pipeline robust, scalable, observable, and cost-efficient.
What You'll Own
1. Data Pipeline Architecture & Management
- Design, orchestrate, and maintain our multi-source data pipeline (RRWeb events, analytics events, video frames, metadata).
- Manage and optimize Airflow DAGs (scheduling, retries, dependency management, error handling, backfilling); a sketch follows this list.
- Integrate and scale Airbyte connectors to pull data from tools like PostHog, Mixpanel, Pendo, and custom APIs.
- Build high-reliability pipelines that can handle large, bursty session replay data.
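To give candidates a flavor of this work, here is a minimal sketch of the DAG patterns involved, assuming Airflow's TaskFlow API; the task names, schedule, and retry settings are illustrative placeholders, not our actual pipeline:

```python
# Minimal sketch: scheduling, retries, and dependency management in one DAG.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,  # enable deliberately when backfilling
    default_args={
        "retries": 3,  # automatic retries before a task is marked failed
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
    },
)
def session_pipeline():
    @task
    def ingest_sessions() -> list[str]:
        # Pull raw RRWeb / analytics events and return batch IDs.
        return ["batch-001", "batch-002"]

    @task
    def augment(batch_id: str) -> str:
        # Run augmentation / AI processing on one batch.
        return f"{batch_id}-augmented"

    @task
    def store(augmented_id: str) -> None:
        # Persist outputs to storage.
        print(f"stored {augmented_id}")

    # Dependencies: ingest once, fan out augmentation per batch, then store.
    batches = ingest_sessions()
    store.expand(augmented_id=augment.expand(batch_id=batches))


session_pipeline()
```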
2. Pipeline Reliability & Observability
- Implement end-to-end monitoring: logs, metrics, alerts, data quality checks, schema validations.
- Reduce pipeline failures and rate-limit issues (e.g., PostHog ingestion constraints).
- Introduce automatic retries, dead-letter queues, and backpressure strategies.
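Concretely, the retry + dead-letter pattern might look like this library-agnostic sketch; in production it would sit on Pub/Sub or a task queue, and `process` and `dead_letter` are hypothetical stand-ins for the real integrations:

```python
# Sketch: retry a handler with exponential backoff, then dead-letter the
# event instead of dropping it or blocking the pipeline.
import json
import time
from typing import Any, Callable


def process_with_dlq(
    event: dict[str, Any],
    process: Callable[[dict[str, Any]], None],
    dead_letter: Callable[[str], None],
    max_attempts: int = 3,
) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            process(event)
            return
        except Exception as exc:
            if attempt == max_attempts:
                # Exhausted retries: park the event for later inspection.
                dead_letter(json.dumps({"event": event, "error": str(exc)}))
                return
            # Exponential backoff doubles as crude backpressure.
            time.sleep(2 ** attempt)
```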
3. Backend Engineering
- Build and optimize backend services (Python/FastAPI, Node, etc.) that consume and expose pipeline outputs.
- Improve the performance of data storage (Postgres/Neon, vector DBs, GCS).
- Implement caching layers for metadata, summaries, and user-level insights.
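For example, a caching layer in front of stored summaries might look roughly like this sketch; the in-process dict stands in for Redis or similar, and `fetch_summary_from_db` is a hypothetical placeholder for the real Postgres/GCS lookup:

```python
# Sketch: a FastAPI endpoint serving pipeline outputs with a cache in front
# of storage, so hot session summaries skip the database entirely.
from fastapi import FastAPI, HTTPException

app = FastAPI()
_cache: dict[str, dict] = {}  # swap for Redis/memcached in production


def fetch_summary_from_db(session_id: str) -> dict | None:
    # Placeholder for the real Postgres / GCS read.
    return {"session_id": session_id, "summary": "example"}


@app.get("/sessions/{session_id}/summary")
def get_summary(session_id: str) -> dict:
    if session_id in _cache:
        return _cache[session_id]  # cache hit
    summary = fetch_summary_from_db(session_id)
    if summary is None:
        raise HTTPException(status_code=404, detail="unknown session")
    _cache[session_id] = summary
    return summary
```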
4. Scalability & Performance
- Architect systems that can scale across:
  - High-volume session replays
  - Large embeddings
  - JSON augmentation workloads
  - Batch and real-time computation
- Identify bottlenecks and implement optimizations across the pipeline (I/O, compute, caching, parallelization).
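As one concrete example of a parallelization lever, a bounded worker pool over a JSON augmentation batch; `augment_record` is a hypothetical stand-in for the real augmentation step:

```python
# Sketch: bounded parallelism over a JSON augmentation workload. The pool is
# capped so memory stays predictable on large, bursty batches.
import json
from concurrent.futures import ThreadPoolExecutor


def augment_record(record: dict) -> dict:
    record["augmented"] = True  # placeholder for the real work
    return record


def augment_batch(raw_lines: list[str], max_workers: int = 8) -> list[dict]:
    records = [json.loads(line) for line in raw_lines]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(augment_record, records))
```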
5. Ownership of the Full Augmentation Flow
Directly manage all backend systems that produce:
- Augmented interactions
- Markdown summaries
- Session highlights
- User intent & frictions
- Session tags
- One-liner summaries
- Product sections
- User flow
- GCS output storage
You'll own this pipeline end to end: ingestion → augmentation → storage.
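As a sketch of that final hop, validating an output before it lands in GCS; the bucket name and required keys here are assumptions for illustration, not our actual schema:

```python
# Sketch: schema-check an augmentation output, then write it to GCS.
import json

from google.cloud import storage

REQUIRED_KEYS = {"session_id", "summary", "tags"}  # assumed output schema


def store_augmentation(output: dict, bucket_name: str = "example-augmented") -> None:
    missing = REQUIRED_KEYS - output.keys()
    if missing:
        # Fail loudly before bad records reach downstream consumers.
        raise ValueError(f"augmentation output missing keys: {missing}")
    blob = storage.Client().bucket(bucket_name).blob(
        f"sessions/{output['session_id']}.json"
    )
    blob.upload_from_string(json.dumps(output), content_type="application/json")
```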
Ideal Profile
Required Experience
- 5+ years in data engineering or backend engineering
- Deep experience with Apache Airflow, DAG design, and orchestration at scale
- Strong familiarity with Airbyte, ETL/ELT patterns, and connector configuration
- Strong Python engineering background (FastAPI / Django / async patterns)
- Experience processing large JSON datasets or high-volume event streams
- Proven track record of building scalable, cost-efficient, well-monitored data systems
- Familiarity with GCP (GCS, Cloud Run, Pub/Sub), AWS, or similar cloud environments
Nice to Have
- Experience with RRWeb or session replay data
- Background in AI/ML data pipelines
- Experience with vector databases, embeddings, or semantic search
- Understanding of clickstream analytics
- DevOps exposure (Docker, Terraform, CI/CD)
What Success Looks Like (First 90 Days)
- Pipeline reliability reaches a 99%+ success rate
- Airflow DAGs are fully structured, well-documented, modular, and observable
- Rate-limit issues with PostHog and others are solved via batching + queuing (sketched after this list)
- Airbyte pipelines are stable, monitored, and error-recoverable
- Backend bottlenecks (payload size, memory usage, API latency) are reduced
- Augmentation pipeline outputs are consistent, validated, and cached intelligently
- You've shipped several major improvements to data throughput and cost efficiency
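For reference, the batching + queuing approach to rate limits might look roughly like this sketch; the per-minute budget and the `send_batch` call are illustrative, not PostHog's actual limits or API:

```python
# Sketch: queue events, send them in batches, and space out sends so bursty
# session-replay traffic stays under a provider's rate limit.
import time
from collections import deque


def send_batch(batch: list[dict]) -> None:
    # Placeholder for the real HTTP call to the ingestion API.
    print(f"sent {len(batch)} events")


class RateLimitedBatcher:
    def __init__(self, batch_size: int = 100, requests_per_minute: int = 240):
        self.batch_size = batch_size
        self.min_interval = 60.0 / requests_per_minute
        self.queue: deque[dict] = deque()
        self._last_send = 0.0

    def enqueue(self, event: dict) -> None:
        self.queue.append(event)
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.queue:
            return
        # Wait just long enough to respect the per-request budget.
        wait = self.min_interval - (time.monotonic() - self._last_send)
        if wait > 0:
            time.sleep(wait)
        batch = [
            self.queue.popleft()
            for _ in range(min(self.batch_size, len(self.queue)))
        ]
        send_batch(batch)
        self._last_send = time.monotonic()
```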