Machine Learning Operations Engineer
About the job
Gentis Solutions is seeking a Machine Learning Operations Engineer to join our team. This contract-to-hire position is with one of our Fortune 50 clients, which is interested in full-time flex/remote consultants. Ideal candidates will have the required skills listed below and will be eligible for, and open to, being hired by our client at the end of the project. The position works alongside an existing team and leverages enterprise-level technologies and processes. If you would like to work at a company that has been recognized for its diversity and inclusion, its work to drive positive social change, and its environmental leadership, apply below.
Requirements
- Proficiency in Linux administration; deep expertise in Linux environments is strongly preferred
- Windows experience is acceptable, but a solid grasp of Linux is essential
- Demonstrated ability to install, modify, and provide support for Linux applications
- Familiarity with cluster management, particularly negotiating resources across multiple machines simultaneously
- Familiarity with Slurm for job scheduling; any prior hands-on experience is an advantage
- Competence in container management, including experience using Docker to build, push, and pull containers
- Knowledge of maintaining High-Performance Computing (HPC) systems, including the various components that make up this infrastructure
Desirable Skills
- Experience with JupyterHub is a plus
- Knowledge of the Bright software is highly desirable
Typical Duties
- Collaborate with the AI team to customize the environment, ensuring it is optimized for AI development
- Work closely with the infrastructure team to configure and manage physical hardware and the underlying operating system
- Implement and manage partitioning on the supernode, allocating resources for different environments (Jupyter, Slurm, Linux shell, Docker containers, etc.)
- Provide support and administration for Kubernetes, aiding in the integration of various providers
- Continuously evolve processes and ways of working to maximize the platform's efficiency, ultimately reducing the need for external support