Beijing, China

Senior AI Infrastructure Engineer

Job Description:

Responsibilities

1. Full-Stack AI Infrastructure Architecture & Development:

- Build a full-stack AI infrastructure system for quantitative scenarios based on Kubernetes, unifying the management of heterogeneous computing resources (e.g., GPU pooling).

- Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of AI training/inference platforms and GPU operation/maintenance platforms.

- Streamline the end-to-end workflow from resource scheduling to model deployment, enhancing system efficiency and stability.

2. Intelligent Computing Power Scheduling System Design:

- Design a global scheduling mechanism supporting multi-task types and priority strategies, leveraging Volcano scheduler capabilities.

- Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization based on dynamic demands of quantitative tasks.
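As an illustrative sketch (not part of the posting), a priority-aware Volcano Job for a quantitative training task might be assembled like this in Python; the queue name, priority class, image, and sizes are hypothetical examples:

```python
# Illustrative sketch: building a Volcano Job manifest (batch.volcano.sh/v1alpha1)
# for a gang-scheduled GPU training task. Queue/priority/image names are
# hypothetical examples, not values from the posting.

def make_volcano_job(name: str, queue: str, priority_class: str,
                     gpus_per_worker: int, workers: int) -> dict:
    """Return a Volcano Job manifest as a plain dict (it could then be
    submitted via the Kubernetes CustomObjectsApi)."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",       # hand pod placement to Volcano
            "queue": queue,                   # maps to a capacity/priority queue
            "priorityClassName": priority_class,
            "minAvailable": workers,          # gang scheduling: all-or-nothing
            "tasks": [{
                "name": "worker",
                "replicas": workers,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "trainer",
                            "image": "pytorch/pytorch:latest",
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_worker},
                            },
                        }],
                        "restartPolicy": "Never",
                    },
                },
            }],
        },
    }

job = make_volcano_job("quant-train", "quant-high", "high-priority",
                       gpus_per_worker=4, workers=8)
```

The `minAvailable` field is what gives Volcano its gang-scheduling behavior: the job is only admitted when all eight workers can be placed at once, which avoids deadlocked partial allocations in a shared GPU pool.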

3. Hardware-Software Co-Optimization & System Reliability:

- Develop an intermediate layer bridging underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).

- Build GPU elastic resource pools, fault self-healing mechanisms, and unified observability platforms (e.g., monitoring dashboards).

- Ensure high-efficiency iteration and high availability of large-scale model training through performance tuning and automated operations.
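A minimal sketch of the fault self-healing idea described above: unhealthy GPU nodes are drained from the elastic pool and their tasks requeued for the scheduler. The health flag and node/task names are stand-ins for real NVML or Kubernetes probes:

```python
# Minimal self-healing sketch: unhealthy nodes are drained from the elastic
# GPU pool and their tasks requeued. The `healthy` flag stands in for a real
# health probe (e.g. NVML/Xid checks); names are illustrative only.

from dataclasses import dataclass, field

@dataclass
class GpuNode:
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)

def heal(pool: list, requeue: list) -> list:
    """Remove unhealthy nodes from the pool, moving their tasks to requeue."""
    survivors = []
    for node in pool:
        if node.healthy:
            survivors.append(node)
        else:
            requeue.extend(node.tasks)   # tasks go back to the scheduler
            node.tasks.clear()           # node is drained, awaiting repair
    return survivors

pool = [GpuNode("gpu-a", tasks=["t1"]),
        GpuNode("gpu-b", healthy=False, tasks=["t2", "t3"])]
requeue = []
pool = heal(pool, requeue)
# pool now holds only "gpu-a"; t2 and t3 are requeued for rescheduling
```

In a production system this loop would be driven by a controller reconciling observed node health against desired pool state, but the core invariant is the same: no task is lost when a node leaves the pool.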

4. Technical Foresight & Architecture Evolution:

- Drive long-term AI Infra roadmap planning, anticipating quantitative business needs in computing scale, training efficiency, and cost control.

- Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to enhance infrastructure capabilities and strengthen the platform's technical moat.

Qualifications

1. Bachelor's/Master's degree in Computer Science or a related field, 5-10 years of experience, with strong self-motivation and the execution ability to identify and resolve technical bottlenecks.

2. Deep expertise in AI infrastructure: Kubernetes, GPU resource management, RDMA/high-performance networking, and large-scale distributed AI system design/deployment.

3. Proficient in Golang/Python with solid system programming and automation skills. Priority given to candidates with experience in Volcano/Kueue schedulers, K8s Operator development, or open-source contributions.

4. Familiar with core resource scheduling principles, GPU lifecycle management (allocation, isolation, elasticity, fault tolerance), and designing high-availability, low-latency strategies for quantitative tasks.

5. Knowledge of mainstream AI frameworks (PyTorch/TensorFlow), with experience in training/inference performance optimization and cross-team collaboration for framework-infra co-optimization.

6. Preferred: experience in FinTech/quantitative AI infrastructure, an understanding of business-critical computing demands, and the ability to drive cross-team collaboration and value delivery.
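The scheduling principles the qualifications refer to (priority strategies, GPU allocation, FIFO within a priority level) can be sketched as a toy dispatcher; the task names, GPU counts, and priorities below are illustrative only:

```python
# Toy GPU scheduler sketch: tasks request GPUs; higher priority dispatches
# first, FIFO within a priority level. All names/numbers are illustrative.

import heapq
from itertools import count

class GpuScheduler:
    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self._heap = []           # entries: (-priority, seq, name, gpus)
        self._seq = count()       # tie-breaker preserves submission order

    def submit(self, name: str, gpus: int, priority: int):
        heapq.heappush(self._heap, (-priority, next(self._seq), name, gpus))

    def dispatch(self) -> list:
        """Start every queued task that fits, highest priority first."""
        started, deferred = [], []
        while self._heap:
            prio, seq, name, gpus = heapq.heappop(self._heap)
            if gpus <= self.free:
                self.free -= gpus
                started.append(name)
            else:
                deferred.append((prio, seq, name, gpus))  # keeps waiting
        for item in deferred:
            heapq.heappush(self._heap, item)
        return started

sched = GpuScheduler(total_gpus=8)
sched.submit("backtest", gpus=6, priority=1)
sched.submit("live-signal", gpus=4, priority=10)
running = sched.dispatch()   # "live-signal" preempts the queue; "backtest" waits
```

A real system (e.g. Volcano) layers gang scheduling, preemption, and queue quotas on top of this core priority ordering, but the heap-plus-capacity-check structure is the same idea.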

Required Skills:

TensorFlow, High Availability, Operations, Operators, Collaboration, Resource Management, Cost Control, Reliability, Storage, Architecture, Optimization, Kubernetes, Infrastructure, Availability, Automation, Networking, Programming, Computer Science, Scheduling, Planning, Maintenance, Design, Business, Science, Training, Communication, Management