Senior AI Infrastructure Engineer
Job Description:
Responsibilities
1. Full-Stack AI Infrastructure Architecture & Development:
- Build a Kubernetes-based, full-stack AI infrastructure system for quantitative scenarios, unifying the management of heterogeneous compute resources (e.g., GPU pooling).
- Integrate high-performance communication layers (e.g., RDMA) and drive the unified development of the AI training/inference platform and the GPU operations and maintenance (O&M) platform.
- Streamline the end-to-end workflow from resource scheduling to model deployment, enhancing system efficiency and stability.
2. Intelligent Computing Power Scheduling System Design:
- Design a global scheduling mechanism that supports multiple task types and priority strategies, leveraging Volcano scheduler capabilities.
- Lead the customization and maintenance of Volcano and core Operators, optimizing elastic scaling and resource utilization against the dynamic demands of quantitative tasks (a minimal illustrative scheduling sketch follows this item).
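The sketch below is illustrative only: a toy priority-plus-gang admission decision of the kind a Volcano-style scheduler applies (highest priority first, all-or-nothing GPU placement). The job names, priorities, and GPU counts are hypothetical, and this is not Volcano's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

type Job struct {
	Name     string
	Priority int // higher runs first
	GPUs     int // gang requirement: all-or-nothing
}

// schedule admits jobs in priority order, but only when the full gang of
// GPUs for a job fits in the remaining pool (no partial placement).
func schedule(jobs []Job, freeGPUs int) (admitted, pending []Job) {
	sort.SliceStable(jobs, func(i, j int) bool {
		return jobs[i].Priority > jobs[j].Priority
	})
	for _, j := range jobs {
		if j.GPUs <= freeGPUs {
			freeGPUs -= j.GPUs
			admitted = append(admitted, j)
		} else {
			pending = append(pending, j) // waits for elastic scale-up or preemption
		}
	}
	return admitted, pending
}

func main() {
	jobs := []Job{
		{"backtest-sweep", 5, 8},
		{"factor-train", 9, 16},
		{"adhoc-notebook", 1, 2},
	}
	admitted, pending := schedule(jobs, 20)
	fmt.Println("admitted:", admitted)
	fmt.Println("pending:", pending)
}
```

In a real deployment this policy would be expressed through Volcano queues, PriorityClasses, and minAvailable gang settings rather than hand-rolled code; the sketch only shows the decision shape.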
3. Hardware-Software Co-Optimization & System Reliability:
- Develop an intermediate layer bridging underlying hardware (GPU/networking/storage) and AI frameworks (PyTorch/TensorFlow).
- Build GPU elastic resource pools, fault self-healing mechanisms, and a unified observability platform (e.g., monitoring dashboards); see the illustrative self-healing sketch after this item.
- Ensure efficient iteration and high availability for large-scale model training through performance tuning and automated operations.
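The following is a minimal, illustrative sketch of a self-healing pass over a GPU pool. The node names, error counts, and threshold are hypothetical; a production mechanism would watch node conditions and device-plugin/DCGM metrics via the Kubernetes API and then cordon and drain affected nodes.

```go
package main

import "fmt"

type GPUNode struct {
	Name      string
	XIDErrors int  // assumed signal, e.g. counted from driver logs
	Ready     bool // node readiness as reported by the cluster
}

// remediate flags nodes whose error count crosses a threshold (or that have
// lost readiness) so the scheduler stops placing new training tasks on them,
// and keeps healthy nodes schedulable.
func remediate(nodes []GPUNode, maxErrors int) (cordon, keep []string) {
	for _, n := range nodes {
		if !n.Ready || n.XIDErrors > maxErrors {
			cordon = append(cordon, n.Name)
		} else {
			keep = append(keep, n.Name)
		}
	}
	return cordon, keep
}

func main() {
	pool := []GPUNode{
		{"gpu-node-01", 0, true},
		{"gpu-node-02", 7, true},  // repeated XID errors
		{"gpu-node-03", 0, false}, // lost readiness
	}
	cordon, keep := remediate(pool, 3)
	fmt.Println("cordon & drain:", cordon)
	fmt.Println("schedulable:", keep)
}
```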
4. Technical Foresight & Architecture Evolution:
- Drive long-term AI Infra roadmap planning, anticipating quantitative business needs in computing scale, training efficiency, and cost control.
- Explore and validate cutting-edge architectures (e.g., heterogeneous computing fusion, compute-storage separation, Serverless AI) to strengthen infrastructure capabilities and the team's technical moat.
Qualifications
1. Bachelor's/Master's degree in Computer Science or a related field, 5-10 years of experience, strong self-motivation, and the execution ability to identify and resolve technical bottlenecks.
2. Deep expertise in AI infrastructure: Kubernetes, GPU resource management, RDMA/high-performance networking, and large-scale distributed AI system design/deployment.
3. Proficient in Golang/Python with solid systems programming and automation skills. Priority given to candidates with experience in Volcano/Kueue schedulers, K8s Operator development, or open-source contributions.
4. Familiar with core resource scheduling principles, GPU lifecycle management (allocation, isolation, elasticity, fault tolerance), and designing high-availability, low-latency strategies for quantitative tasks.
5. Knowledge of mainstream AI frameworks (PyTorch/TensorFlow), with experience in training/inference performance optimization and cross-team collaboration for framework-infra co-optimization.
6. Preferred: experience in FinTech/quantitative AI infrastructure, an understanding of business-critical computing demands, and the ability to drive cross-team collaboration and value delivery.
Required Skills:
Infrastructure