About the job Senior / Lead Machine Learning Engineer (AI & LLM Systems) - Hybrid (2 days in office)
ABOUT THE OPPORTUNITY
Join a leading global technology platform as a Senior or Lead Machine Learning Engineer and drive the design, development, and production deployment of advanced AI and Large Language Model (LLM) systems that power intelligent product features and workflows at scale.
Reporting to the Head of AI or VP of Engineering, you'll shape the architecture of scalable machine learning solutions, guide technical direction across AI initiatives, and collaborate with world-class engineering, product, and research teams to deliver high-impact AI products. This role blends technical leadership with hands-on engineering: you'll architect and implement robust LLM-based systems, lead training and fine-tuning of large-scale models, and develop sophisticated workflows built on retrieval-augmented generation (RAG), vector databases, and agentic AI components.
You'll have the opportunity to translate cutting-edge research into production-ready systems while mentoring engineers and driving best practices across the organization. Working at the intersection of research and engineering, you'll transform advanced AI concepts into scalable, reliable production systems that deliver measurable business value to millions of users worldwide.
Critical Requirements: This is a senior to lead-level position requiring 5+ years of experience in machine learning and AI. Candidates must bring proven expertise in Python and ML frameworks (PyTorch or TensorFlow), hands-on production experience training and deploying LLMs and generative models, deep knowledge of modern ML tooling including RAG pipelines and vector search, and a comprehensive understanding of ML system design, optimization, and evaluation. Advanced English (C1 level) is essential for cross-functional collaboration.
PROJECT & CONTEXT
You'll architect and build production-grade machine learning and LLM systems that power intelligent features, automation, and AI-driven workflows for a global technology platform serving millions of users. Your core responsibilities center on designing robust, scalable ML infrastructure, making critical architectural decisions about system design, model selection, deployment strategies, and technical approaches that balance innovation with production reliability, scalability, and performance requirements.
A significant portion of your work involves leading training, fine-tuning, and optimization of large-scale language models for real production use cases. This includes selecting appropriate base models, implementing fine-tuning strategies using domain-specific data, optimizing model performance for inference efficiency, and ensuring models meet quality, safety, and compliance requirements for production deployment. You'll develop and maintain sophisticated AI workflows including RAG systems that ground LLM outputs in accurate information, information retrieval systems, vector databases for semantic search, and agentic AI components that can reason, plan, and execute multi-step tasks autonomously.
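To make the RAG workflow described above concrete, here is a deliberately minimal sketch: hand-written toy vectors and an in-memory dictionary stand in for a real embedding model and vector database, and `DOCS`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration, not part of any actual stack used here.

```python
import math

# Toy document store: in production these vectors would come from an
# embedding model and live in a vector database (e.g. Pinecone, Weaviate).
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "account security": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(DOCS.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

def build_prompt(question, query_vec):
    """Ground the LLM prompt in retrieved context (the 'RAG' step)."""
    context = "; ".join(retrieve(query_vec, k=1))
    return (f"Context: {context}\n"
            f"Question: {question}\n"
            f"Answer using only the context.")

print(build_prompt("How long do refunds take?", [0.95, 0.05, 0.0]))
```

The grounding step is the whole point: the generator only ever sees context that retrieval surfaced, which is what keeps LLM outputs anchored in accurate information.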
Experimentation and validation are critical: you'll drive comprehensive experimentation frameworks, design and execute benchmarks to evaluate model performance, and implement rigorous evaluation methodologies measuring both technical metrics (accuracy, latency, throughput) and business outcomes (user satisfaction, task completion). Cross-functional collaboration defines your daily work as you partner with engineering teams to integrate ML systems, work with product managers to translate business requirements into technical solutions, and collaborate with research teams to apply cutting-edge approaches to production problems.
Technical leadership and mentorship are key expectations at the Senior/Lead level. You'll mentor engineers through code reviews, architecture reviews, and knowledge sharing, establish best practices for ML engineering including testing strategies and deployment patterns, and guide the team in model development, system design, and production ML practices. You'll translate research insights into production systems by evaluating emerging research, prototyping promising approaches, and implementing production-ready versions with clear success metrics.
The role demands expertise in modern ML tooling including vector databases (Pinecone, Weaviate, Chroma), embedding models, semantic search, model serving infrastructure, and ML observability tools. You'll make critical architectural decisions about fine-tuning versus few-shot prompting, efficient RAG retrieval, multi-step agentic workflow orchestration, and ensuring systems scale efficiently. Working in a hybrid environment with 2 days per week in the Lisbon office enables both focused individual work and collaborative team activities.
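One idea behind the fine-tuning decisions mentioned above is parameter-efficient adaptation, as in LoRA: instead of updating a full weight matrix, train a small low-rank product and add it to the frozen base. The pure-Python sketch below is a toy numeric illustration under that assumption; `matmul` and `lora_effective_weight` are hypothetical helper names, and real implementations use PyTorch tensors.

```python
# Low-rank adapter idea (as in LoRA): for a d x d frozen weight W, train
# B (d x r) and A (r x d) with r << d, and use W_eff = W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain-Python matrix multiply, to keep the sketch dependency-free."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_effective_weight(W, B, A, alpha=1.0):
    """Combine a frozen base weight with a scaled low-rank update."""
    r = len(A)          # adapter rank
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 3x3 frozen base weight, rank-1 adapter: 9 frozen vs only 6 trainable params.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [0.0]]   # 3 x 1
A = [[0.0, 0.5, 0.0]]       # 1 x 3

W_eff = lora_effective_weight(W, B, A)
print(W_eff[0])  # first row now carries the adapter's contribution
# → [1.0, 0.5, 0.0]
```

The parameter savings are the point: at realistic sizes (d in the thousands, r of 8 or 16), the trainable adapter is a tiny fraction of the base model, which is why fine-tuning can compete with few-shot prompting on cost.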
Core Tech Stack: Python (primary), PyTorch (preferred) or TensorFlow, LLM frameworks (LangChain, LlamaIndex), vector databases, embedding models
ML Infrastructure: Model training frameworks, experiment tracking, model serving platforms, feature stores, ML monitoring tools
AI Focus: Large language models (LLMs), generative AI, RAG, vector search, agentic AI, multi-agent systems
Engineering Practices: Production ML system design, model evaluation, A/B testing, monitoring and observability, scalable deployment
Scale: Global platform serving millions of users with high-volume, low-latency AI inference requirements
WHAT WE'RE LOOKING FOR (Required)
Machine Learning Experience: 5+ years of hands-on experience in machine learning, AI, or closely related fields, with a proven track record of delivering production ML systems; this is the core requirement
Python Proficiency: Strong proficiency in Python for machine learning engineering with deep understanding of Python best practices, libraries, and frameworks relevant to ML development
ML Frameworks Expertise: Production experience with machine learning frameworks such as PyTorch (strongly preferred) or TensorFlow, including model development, training, fine-tuning, and deployment
LLM Production Experience: Hands-on experience training and deploying Large Language Models (LLMs) and generative models in production environments, with a solid understanding of LLM architectures, training techniques, inference optimization, and production deployment challenges
Modern ML Tooling: Strong knowledge of modern ML tooling including RAG (Retrieval-Augmented Generation) pipelines, vector search and embeddings, vector databases, model serving infrastructure, and related technologies for building production LLM applications
ML System Design: Deep understanding of ML system design principles including data processing pipelines, feature engineering, model architecture selection, evaluation metrics design, optimization strategies, and production deployment patterns
Data Processing: Experience with large-scale data processing for ML training and inference, understanding data quality requirements, and implementing efficient data pipelines
Model Evaluation: Expertise in designing and implementing comprehensive model evaluation frameworks, selecting appropriate metrics, conducting benchmarks, and validating production readiness
Optimization Skills: Strong skills in model optimization including hyperparameter tuning, training efficiency, inference performance, cost optimization, and resource utilization
Production ML Deployment: Demonstrated experience deploying ML models to production including model serving, versioning, monitoring, A/B testing, and continuous improvement cycles
Technical Leadership: Ability to guide technical direction, make architectural decisions, drive technical strategy, and influence engineering practices across teams
Collaboration Skills: Excellent collaboration abilities working effectively with cross-functional teams including engineering, product, research, and business stakeholders
Communication Excellence: Outstanding communication skills capable of articulating complex technical concepts to both technical and non-technical audiences, writing clear technical documentation, and presenting to stakeholders
Mentorship Capability: Experience and willingness to mentor other engineers through code reviews, knowledge sharing, best practices establishment, and technical guidance
Problem-Solving: Strong analytical and problem-solving skills for debugging complex ML systems, identifying performance bottlenecks, and resolving production issues
English Proficiency: C1 level (Advanced) or higher in English for technical communication, documentation, collaboration with international teams, and stakeholder engagement; this is mandatory
Work Authorization: Eligibility to work in Portugal with availability for hybrid work model (2 days per week in Lisbon office)
NICE TO HAVE (Preferred)
Agentic AI Experience: Hands-on experience with agentic AI systems, multi-agent workflows, autonomous agents, planning and reasoning systems, and tool-using LLMs
Developer Tooling: Experience building developer-facing tools, APIs, SDKs, or platforms that enable other engineers to leverage ML capabilities
Applied Research Background: Background in applied research, with publications in NLP, LLMs, generative AI, information retrieval, or related machine learning conferences/journals
Research to Production: Track record of successfully translating academic research into production systems with clear business impact
Cloud Platform Expertise: Hands-on experience with cloud platforms including AWS (SageMaker, Bedrock, EC2, S3), GCP (Vertex AI, Cloud ML), or Azure (Azure ML, OpenAI Service)
Scalable Deployment Frameworks: Experience with scalable ML deployment frameworks including Kubernetes for ML workloads, model serving platforms (TensorFlow Serving, TorchServe, Triton), and container orchestration
Vector Database Deep Expertise: Advanced knowledge of vector databases like Pinecone, Weaviate, Chroma, Milvus, FAISS including optimization strategies and production deployment
Additional ML Frameworks: Experience with complementary ML frameworks like JAX, scikit-learn, Hugging Face Transformers, or specialized libraries for NLP and generative AI
Advanced LLM Fine-Tuning: Deep expertise in advanced fine-tuning techniques including LoRA, QLoRA, PEFT methods, instruction tuning, and RLHF (Reinforcement Learning from Human Feedback)
Prompt Engineering: Advanced skills in prompt engineering, few-shot learning, chain-of-thought prompting, and prompt optimization techniques
Embeddings Expertise: Deep understanding of embedding models, similarity search, semantic search optimization, and embedding space analysis
Multi-Modal AI: Experience with multi-modal models handling text, images, audio, or other modalities
MLOps Practices: Strong MLOps experience including ML pipeline automation, CI/CD for ML, model monitoring and observability, feature store implementation, and experiment tracking
Distributed Training: Experience with distributed training of large models across multiple GPUs or machines using frameworks like DeepSpeed, Megatron-LM, or FSDP
Model Compression: Knowledge of model compression techniques including quantization, pruning, distillation, and knowledge transfer
Information Retrieval: Deep expertise in information retrieval systems, search ranking, relevance scoring, and retrieval optimization
Natural Language Processing: Strong NLP background including tokenization, text preprocessing, named entity recognition, semantic similarity, and language understanding
A/B Testing & Experimentation: Experience designing and analyzing ML experiments, A/B tests for model performance, and statistical analysis of results
Cost Optimization: Skills in optimizing ML inference costs through efficient model architecture, caching strategies, batching, and resource management
Monitoring & Observability: Experience implementing comprehensive monitoring for ML systems including model performance metrics, drift detection, and alerting
Data Science Skills: Strong data analysis and visualization skills for understanding model behavior, analyzing errors, and communicating insights
API Design: Experience designing clean, scalable APIs for ML services that are intuitive for other engineers to use
Database Technologies: Familiarity with both SQL and NoSQL databases relevant to ML applications including PostgreSQL, MongoDB, Redis
Real-Time Systems: Experience building real-time ML inference systems with low-latency requirements
Security & Privacy: Understanding of ML security including model security, data privacy, differential privacy, and secure model deployment
Open Source Contributions: Active contributions to open source ML projects or maintainer experience with ML libraries
Technical Writing: Strong technical writing skills including documentation, blog posts, or technical publications
Startup Experience: Previous experience in fast-paced startup or scale-up environments with ability to operate with ambiguity
Domain Knowledge: Background in marketplace platforms, workforce solutions, freelancing ecosystems, or related domains
Location: Lisbon, Portugal (Hybrid - 2 days per week in office)