Founding Engineer – Full Stack ML DevTools & Systems
Location: San Francisco, CA
Type: Full-Time
Base Compensation: $150,000 – $250,000
Equity: Competitive Series A Equity Package
Overview
This is a founding-level engineering role within a Series A AI infrastructure company building core developer tools and platform primitives for post-training, evaluation, and reinforcement learning workflows.
The platform enables ML engineers and researchers to:
- Create structured training data
- Run reinforcement fine-tuning workflows
- Evaluate model performance reliably and reproducibly at scale
This is a high-ownership role at the center of the product. You will operate across the Python SDK, backend systems, infrastructure, and developer experience, partnering directly with frontier labs, enterprise AI teams, and AI-native startups.
This is not a narrow feature role. You will shape foundational platform architecture and developer workflows that power advanced model training systems.
Core Responsibilities
Platform & Backend Systems
- Design and implement backend systems supporting post-training workflows, dataset primitives, run tracking, and artifact management
- Build reliable execution and orchestration systems with strong isolation and reproducibility
- Improve observability, debugging capabilities, and performance across job execution and distributed data pipelines
- Contribute to containerized infrastructure and Kubernetes-based deployment patterns
Python SDK & Developer Experience
- Own and evolve the Python SDK with clean APIs, strong documentation, intuitive defaults, and extensibility
- Design developer-friendly abstractions for reinforcement learning, evaluation loops, and training workflows
- Develop evaluation-native workflows connecting capability measurement, data creation, training, and re-evaluation loops
- Improve CLI tools, developer interfaces, and local-to-cloud workflows
Infrastructure & Cloud Systems
- Work across compute, networking, storage, and IAM configurations
- Design systems that are scalable, reproducible, and secure
- Collaborate on distributed systems design and execution infrastructure
Customer & Research Collaboration
- Partner directly with ML engineers and researchers to translate real-world workflows into platform improvements
- Incorporate structured customer feedback into roadmap decisions
- Operate at the intersection of research needs and production reliability
Requirements
- Strong production experience in Python
- Comfort operating across the stack, including APIs, backend systems, data systems, and frontend integration
- Deep understanding of Docker and Linux environments
- Cloud fundamentals: compute, networking, storage, IAM
- Strong product instincts with a bias toward shipping
- Demonstrated end-to-end ownership of production systems
Required Candidate Q&A
- LinkedIn Profile
- GitHub URL
- Publications URL (Google Scholar or similar, if applicable)
Interview Process
- Initial Screen
- Technical Evaluation
- Work Trial
- Final Discussion
- Offer Decision