Beijing, China

Senior AI Infrastructure Engineer

Job Description:

Responsibilities

1. Full-Stack AI Infrastructure Architecture & Development:

  • Build a full-stack AI infrastructure system for quantitative scenarios based on Kubernetes, unifying the management of heterogeneous computing resources such as GPU pooling (a minimal sketch follows this list).
  • Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of AI training/inference platforms and GPU operation/maintenance platforms.
  • Streamline the end-to-end workflow from resource scheduling to model deployment, enhancing system efficiency and stability.
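
For illustration, a minimal sketch of how pooled GPU capacity is typically surfaced to workloads on Kubernetes (assuming the NVIDIA device plugin advertises the nvidia.com/gpu extended resource; the namespace, image, and resource figures below are placeholders, not part of this role description):

```python
# Sketch: submitting a GPU-backed training pod against a Kubernetes GPU pool.
# Assumes the NVIDIA device plugin is installed so nodes advertise "nvidia.com/gpu";
# namespace, image, and resource figures are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="quant-train-worker", namespace="quant-ai"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="pytorch/pytorch:latest",  # placeholder image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # One GPU drawn from the cluster-wide heterogeneous pool.
                    limits={"nvidia.com/gpu": "1", "cpu": "8", "memory": "32Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="quant-ai", body=pod)
```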

2. Intelligent Computing Power Scheduling System Design:

  • Design a global scheduling mechanism supporting multi-task types and priority strategies, leveraging Volcano scheduler capabilities.
  • Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization based on the dynamic demands of quantitative tasks (a minimal scheduling sketch follows this list).
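
As a sketch of the priority-aware, gang-scheduled submission path this item refers to (assuming Volcano's scheduling.volcano.sh/v1beta1 CRDs are installed; queue names, weights, capacities, and the PriorityClass are illustrative assumptions):

```python
# Sketch: creating a weighted Volcano Queue and a gang-scheduled PodGroup.
# Assumes Volcano is installed in the cluster; all names and numbers are illustrative.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION = "scheduling.volcano.sh", "v1beta1"

# Weighted queue: a higher weight yields a larger share when the cluster is contended.
queue = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Queue",
    "metadata": {"name": "quant-research"},
    "spec": {
        "weight": 8,
        "reclaimable": True,
        "capability": {"nvidia.com/gpu": 64},  # hard cap for this queue
    },
}
api.create_cluster_custom_object(GROUP, VERSION, "queues", queue)

# PodGroup: all minMember pods must be schedulable before any start (gang scheduling).
podgroup = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "PodGroup",
    "metadata": {"name": "distributed-train", "namespace": "quant-ai"},
    "spec": {
        "minMember": 8,                        # e.g. eight training workers
        "queue": "quant-research",
        "priorityClassName": "high-priority",  # assumed PriorityClass
    },
}
api.create_namespaced_custom_object(GROUP, VERSION, "quant-ai", "podgroups", podgroup)
```

Workload pods then point at this queue and PodGroup and are placed by the Volcano scheduler rather than the default kube-scheduler.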

3. Hardware-Software Co-Optimization & System Reliability:

  • Develop an intermediate layer bridging underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).
  • Build GPU elastic resource pools, fault self-healing mechanisms, and unified observability platforms (e.g., monitoring dashboards; a minimal exporter sketch follows this list).
  • Ensure efficient iteration and high availability of large-scale model training through performance tuning and automated operations.
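
As a rough sketch of the observability piece (assuming NVML via the pynvml bindings and a Prometheus scrape endpoint; the metric names and port are invented for illustration):

```python
# Sketch: a tiny per-GPU utilization exporter feeding dashboards and self-healing checks.
# Assumes nvidia-ml-py (pynvml) and prometheus_client are available;
# metric names and the scrape port are illustrative choices.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["index"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["index"])

def collect() -> None:
    """Sample utilization and memory usage for every visible GPU."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        GPU_UTIL.labels(index=str(i)).set(util.gpu)
        GPU_MEM.labels(index=str(i)).set(mem.used)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrape endpoint; port is arbitrary
    while True:
        collect()
        time.sleep(15)
```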

4. Technical Foresight & Architecture Evolution:

  • Drive long-term AI Infra roadmap planning, anticipating quantitative business needs in computing scale, training efficiency, and cost control.
  • Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to strengthen infrastructure capabilities and the team's technical edge.

Qualifications

1. Bachelor's or Master's degree in Computer Science or a related field, with 5-10 years of experience; strong self-motivation and the execution ability to identify and resolve technical bottlenecks.

2. Deep expertise in AI infrastructure: Kubernetes, GPU resource management, RDMA/high-performance networking, and large-scale distributed AI system design/deployment.

3. Proficient in Golang/Python with solid system programming and automation skills. Priority given to candidates with experience in Volcano/Kueue schedulers, K8s Operator development, or open-source contributions.

4. Familiar with core resource scheduling principles, GPU lifecycle management (allocation, isolation, elasticity, fault tolerance), and designing high-availability, low-latency strategies for quantitative tasks.

5. Knowledge of mainstream AI frameworks (PyTorch/TensorFlow), with experience in training/inference performance optimization and cross-team collaboration for framework-infra co-optimization.

6. Preferred: Experience in FinTech/quantitative AI infrastructure, understanding of business-critical computing demands, and ability to drive cross-team collaboration and value delivery.

Required Skills:

Infrastructure