AI Operations Engineer

Pune, Maharashtra, India

About the job

Job Responsibilities

Years of Experience: 3-5 years

Responsibilities:

AI Model Deployment & Integration:

Deploy and manage AI/ML models, including traditional machine learning and GenAI solutions (e.g., LLMs, RAG systems).

Implement automated CI/CD pipelines for seamless deployment and scaling of AI models.

Ensure efficient model integration into existing enterprise applications and workflows in collaboration with AI Engineers.

Optimize AI infrastructure for performance and cost efficiency in cloud environments (AWS, Azure, GCP).

Monitoring & Performance Management:

Develop and implement monitoring solutions to track model performance, latency, drift, and cost metrics.

Set up alerts and automated workflows to manage performance degradation and retraining triggers.

Ensure responsible AI by monitoring for issues such as bias, hallucinations, and security vulnerabilities in GenAI outputs.

Collaborate with Data Scientists to establish feedback loops for continuous model improvement.

Automation & MLOps Best Practices:

Establish scalable MLOps practices to support the continuous deployment and maintenance of AI models.

Automate model retraining, versioning, and rollback strategies to ensure reliability and compliance.

Utilize infrastructure-as-code (Terraform, CloudFormation) to manage AI pipelines.

Security & Compliance:

Implement security measures to prevent prompt injections, data leakage, and unauthorized model access.

Work closely with compliance teams to ensure AI solutions adhere to privacy and regulatory standards (HIPAA, GDPR).

Regularly audit AI pipelines for ethical AI practices and data governance.

Collaboration & Process Improvement:

Work closely with AI Engineers, Product Managers, and IT teams to align AI operational processes with business needs.

Contribute to the development of AI Ops documentation, playbooks, and best practices.

Continuously evaluate emerging GenAI operational tools and processes to drive innovation.

Skills/Qualifications:

Education:

Bachelor's or Master's degree in Computer Science, Data Engineering, AI, or a related field.

Relevant certifications in cloud platforms (AWS, Azure, GCP) or MLOps frameworks are a plus.

Experience:

3+ years of experience in AI/ML operations, MLOps, or DevOps for AI-driven solutions.

Hands-on experience deploying and managing AI models, including LLMs and GenAI solutions, in production environments.

Experience working with cloud AI platforms such as Azure AI, AWS SageMaker, or Google Vertex AI.

Technical Skills:

Proficiency in MLOps tools and frameworks such as MLflow, Kubeflow, or Airflow.

Hands-on experience with monitoring tools (Prometheus, Grafana, ELK Stack) for AI performance tracking.

Experience with containerization and orchestration tools (Docker, Kubernetes) to support AI workloads.

Familiarity with automation scripting using Python, Bash, or PowerShell.

Understanding of GenAI-specific operational challenges such as response monitoring, token management, and prompt optimization.

Knowledge of CI/CD pipelines (Jenkins, GitHub Actions) for AI model deployment.

Strong understanding of AI security principles, including data privacy and governance considerations.

Soft Skills:

Strong problem-solving skills with the ability to troubleshoot complex AI operational issues.

Excellent communication skills for effective collaboration with cross-functional stakeholders.

Proactive and results-driven mindset with a focus on operational efficiency and scalability.

Ability to work effectively in a fast-paced, dynamic environment.
