Machine Learning Operations Engineer

About the job

Gentis Solutions is seeking a Machine Learning Operations Engineer to join our team. This contract-to-hire position is with one of our Fortune 50 clients, which is interested in full-time flex/remote consultants. Ideal candidates will have the required skills listed below and will be eligible for, and open to, being hired by our client at the end of the project. The position works alongside an existing team and uses enterprise-level technologies and processes. If you would like to work at a company that has been recognized for its diversity and inclusion, for its work to drive positive social change, and as an environmental leader, make sure you apply below.

Requirements

  • Proficiency in Linux administration; deep expertise in Linux environments is strongly preferred
    • Windows experience is acceptable, but a solid grasp of Linux is essential
  • Demonstrated ability to install, modify, and provide support for Linux applications
  • Familiarity with cluster management, particularly in negotiating resources across multiple computers simultaneously
  • Proficiency in Slurm for job scheduling; any prior experience is an advantage
  • Competence in container management, including expertise with Docker for containerizing applications and for pushing and pulling containers
  • Knowledge of maintaining High-Performance Computing (HPC) systems and the components that make up this infrastructure

Desirable Skills

  • Experience with JupyterHub is a plus
  • Knowledge of the Bright software is highly desirable

Typical Duties

  • Collaborate with the AI team to customize the environment, ensuring it is optimized for AI development
  • Work closely with the infrastructure team to configure and manage physical hardware and the underlying operating system
  • Implement and manage partitioning on the supernode, allocating resources for different environments (Jupyter, Slurm, Linux shell, Docker containers, etc.)
  • Provide support and administration for Kubernetes, aiding in the integration of various providers
  • Continuously evolve processes and ways of working to maximize the platform's efficiency, ultimately reducing the need for external support