About the job: AI Operations Engineer

Years of Experience: 3-5 years

Responsibilities:

AI Model Deployment & Integration:

  • Deploy and manage AI/ML models, including traditional machine learning and GenAI solutions (e.g., LLMs, RAG systems).
  • Implement automated CI/CD pipelines for seamless deployment and scaling of AI models.
  • Collaborate with AI Engineers to integrate models efficiently into existing enterprise applications and workflows.
  • Optimize AI infrastructure for performance and cost efficiency in cloud environments (AWS, Azure, GCP).

Monitoring & Performance Management:

  • Develop and implement monitoring solutions to track model performance, latency, drift, and cost metrics.
  • Set up alerts and automated workflows to manage performance degradation and retraining triggers.
  • Support responsible AI by monitoring GenAI outputs for issues such as bias, hallucinations, and security vulnerabilities.
  • Collaborate with Data Scientists to establish feedback loops for continuous model improvement.

Automation & MLOps Best Practices:

  • Establish scalable MLOps practices to support the continuous deployment and maintenance of AI models.
  • Automate model retraining, versioning, and rollback strategies to ensure reliability and compliance.
  • Use infrastructure-as-code tools (e.g., Terraform, CloudFormation) to provision and manage AI pipeline infrastructure.

Security & Compliance:

  • Implement security measures to prevent prompt injection, data leakage, and unauthorized model access.
  • Work closely with compliance teams to ensure AI solutions adhere to privacy and regulatory standards (HIPAA, GDPR).
  • Regularly audit AI pipelines for adherence to ethical AI practices and data governance policies.

Collaboration & Process Improvement:

  • Work closely with AI Engineers, Product Managers, and IT teams to align AI operational processes with business needs.
  • Contribute to the development of AI Ops documentation, playbooks, and best practices.
  • Continuously evaluate emerging GenAI operational tools and processes to drive innovation.

Skills/Qualifications:

Education:

  • Bachelor's or Master's degree in Computer Science, Data Engineering, AI, or a related field.
  • Relevant certifications in cloud platforms (AWS, Azure, GCP) or MLOps frameworks are a plus.

Experience:

  • 3+ years of experience in AI/ML operations, MLOps, or DevOps for AI-driven solutions.
  • Hands-on experience deploying and managing AI models, including LLMs and GenAI solutions, in production environments.
  • Experience working with cloud AI platforms such as Azure AI, AWS SageMaker, or Google Vertex AI.

Technical Skills:

  • Proficiency in MLOps and workflow orchestration tools such as MLflow, Kubeflow, or Airflow.
  • Hands-on experience with monitoring tools (Prometheus, Grafana, ELK Stack) for AI performance tracking.
  • Experience with containerization and orchestration tools (Docker, Kubernetes) to support AI workloads.
  • Familiarity with automation scripting using Python, Bash, or PowerShell.
  • Understanding of GenAI-specific operational challenges such as response monitoring, token management, and prompt optimization.
  • Knowledge of CI/CD pipelines (Jenkins, GitHub Actions) for AI model deployment.
  • Strong understanding of AI security principles, including data privacy and governance considerations.

Soft Skills:

  • Strong problem-solving skills with the ability to troubleshoot complex AI operational issues.
  • Excellent communication skills for collaborating effectively with cross-functional stakeholders.
  • Proactive and results-driven mindset with a focus on operational efficiency and scalability.
  • Ability to work effectively in a fast-paced, dynamic environment.