About the job Senior / Lead Machine Learning Engineer (AI & LLM Systems) - Hybrid (2 days in office)
ABOUT THE OPPORTUNITY
Join a leading global technology platform as a Senior or Lead Machine Learning Engineer and drive the design, development, and production deployment of advanced AI and Large Language Model (LLM) systems that power intelligent product features and workflows at scale.
Reporting to the Head of AI or VP of Engineering, you'll shape the architecture of scalable machine learning solutions, guide technical direction across AI initiatives, and collaborate with world-class engineering, product, and research teams to deliver high-impact AI products. This role blends technical leadership with hands-on engineering: you'll architect and implement robust LLM-based systems, lead training and fine-tuning of large-scale models, and develop sophisticated workflows built on retrieval-augmented generation (RAG), vector databases, and agentic AI components.
You'll have the opportunity to translate cutting-edge research into production-ready systems while mentoring engineers and driving best practices across the organization. Working at the intersection of research and engineering, you'll transform advanced AI concepts into scalable, reliable production systems that deliver measurable business value to millions of users worldwide.
Critical Requirements: This is a senior to lead-level position requiring 5+ years of experience in machine learning and AI. Candidates must bring proven expertise in Python and ML frameworks (PyTorch or TensorFlow), hands-on production experience training and deploying LLMs and generative models, deep knowledge of modern ML tooling including RAG pipelines and vector search, and a comprehensive understanding of ML system design, optimization, and evaluation. Advanced English (C1 level) is essential for cross-functional collaboration.
PROJECT & CONTEXT
You'll architect and build production-grade machine learning and LLM systems that power intelligent features, automation, and AI-driven workflows for a global technology platform serving millions of users. Your core responsibilities center on designing robust, scalable ML infrastructure, making critical architectural decisions about system design, model selection, deployment strategies, and technical approaches that balance innovation with production reliability, scalability, and performance requirements.
A significant portion of your work involves leading training, fine-tuning, and optimization of large-scale language models for real production use cases. This includes selecting appropriate base models, implementing fine-tuning strategies using domain-specific data, optimizing model performance for inference efficiency, and ensuring models meet quality, safety, and compliance requirements for production deployment. You'll develop and maintain sophisticated AI workflows including RAG systems that ground LLM outputs in accurate information, information retrieval systems, vector databases for semantic search, and agentic AI components that can reason, plan, and execute multi-step tasks autonomously.
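To make the RAG workflow described above concrete, here is a deliberately minimal sketch: hand-written toy vectors and an in-memory dictionary stand in for a real embedding model and vector database, and `DOCS`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration, not part of any actual stack used here.

```python
import math

# Toy document store: in production these vectors would come from an
# embedding model and live in a vector database (e.g. Pinecone, Weaviate).
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "account security": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(DOCS.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

def build_prompt(question, query_vec):
    """Ground the LLM prompt in retrieved context (the 'RAG' step)."""
    context = "; ".join(retrieve(query_vec, k=1))
    return (f"Context: {context}\n"
            f"Question: {question}\n"
            f"Answer using only the context.")

print(build_prompt("How long do refunds take?", [0.95, 0.05, 0.0]))
```

The grounding step is the whole point: the generator only ever sees context that retrieval surfaced, which is what keeps LLM outputs anchored in accurate information.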
Experimentation and validation are critical: you'll drive comprehensive experimentation frameworks, design and execute benchmarks to evaluate model performance, and implement rigorous evaluation methodologies measuring both technical metrics (accuracy, latency, throughput) and business outcomes (user satisfaction, task completion). Cross-functional collaboration defines your daily work as you partner with engineering teams to integrate ML systems, work with product managers to translate business requirements into technical solutions, and collaborate with research teams to apply cutting-edge approaches to production problems.
Technical leadership and mentorship are key expectations at the Senior/Lead level. You'll mentor engineers through code reviews, architecture reviews, and knowledge sharing, establish best practices for ML engineering including testing strategies and deployment patterns, and guide the team in model development, system design, and production ML practices. You'll translate research insights into production systems by evaluating emerging research, prototyping promising approaches, and implementing production-ready versions with clear success metrics.
The role demands expertise in modern ML tooling including vector databases (Pinecone, Weaviate, Chroma), embedding models, semantic search, model serving infrastructure, and ML observability tools. You'll make critical architectural decisions about fine-tuning versus few-shot prompting, efficient RAG retrieval, multi-step agentic workflow orchestration, and ensuring systems scale efficiently. Working in a hybrid environment with 2 days per week in the Lisbon office enables both focused individual work and collaborative team activities.
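One idea behind the fine-tuning decisions mentioned above is parameter-efficient adaptation, as in LoRA: instead of updating a full weight matrix, train a small low-rank product and add it to the frozen base. The pure-Python sketch below is a toy numeric illustration under that assumption; `matmul` and `lora_effective_weight` are hypothetical helper names, and real implementations use PyTorch tensors.

```python
# Low-rank adapter idea (as in LoRA): for a d x d frozen weight W, train
# B (d x r) and A (r x d) with r << d, and use W_eff = W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain-Python matrix multiply, to keep the sketch dependency-free."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_effective_weight(W, B, A, alpha=1.0):
    """Combine a frozen base weight with a scaled low-rank update."""
    r = len(A)          # adapter rank
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 3x3 frozen base weight, rank-1 adapter: 9 frozen vs only 6 trainable params.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [0.0]]   # 3 x 1
A = [[0.0, 0.5, 0.0]]       # 1 x 3

W_eff = lora_effective_weight(W, B, A)
print(W_eff[0])  # first row now carries the adapter's contribution
# → [1.0, 0.5, 0.0]
```

The parameter savings are the point: at realistic sizes (d in the thousands, r of 8 or 16), the trainable adapter is a tiny fraction of the base model, which is why fine-tuning can compete with few-shot prompting on cost.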
Core Tech Stack: Python (primary), PyTorch (preferred) or TensorFlow, LLM frameworks (LangChain, LlamaIndex), vector databases, embedding models
ML Infrastructure: Model training frameworks, experiment tracking, model serving platforms, feature stores, ML monitoring tools
AI Focus: Large language models (LLMs), generative AI, RAG, vector search, agentic AI, multi-agent systems
Engineering Practices: Production ML system design, model evaluation, A/B testing, monitoring and observability, scalable deployment
Scale: Global platform serving millions of users with high-volume, low-latency AI inference requirements
WHAT WE'RE LOOKING FOR (Required)
Machine Learning Experience: 5+ years of hands-on experience in machine learning, AI, or closely related fields, with a proven track record of delivering production ML systems; this is the core requirement
Python Proficiency: Strong proficiency in Python for machine learning engineering with deep understanding of Python best practices, libraries, and frameworks relevant to ML development
ML Frameworks Expertise: Production experience with machine learning frameworks such as PyTorch (strongly preferred) or TensorFlow, including model development, training, fine-tuning, and deployment
LLM Production Experience: Hands-on experience training and deploying Large Language Models (LLMs) and generative models in production environments, with a solid understanding of LLM architectures, training techniques, inference optimization, and production deployment challenges
Modern ML Tooling: Strong knowledge of modern ML tooling including RAG (Retrieval-Augmented Generation) pipelines, vector search and embeddings, vector databases, model serving infrastructure, and related technologies for building production LLM applications
ML System Design: Deep understanding of ML system design principles including data processing pipelines, feature engineering, model architecture selection, evaluation metrics design, optimization strategies, and production deployment patterns
Data Processing: Experience with large-scale data processing for ML training and inference, understanding data quality requirements, and implementing efficient data pipelines
Model Evaluation: Expertise in designing and implementing comprehensive model evaluation frameworks, selecting appropriate metrics, conducting benchmarks, and validating production readiness
Optimization Skills: Strong skills in model optimization including hyperparameter tuning, training efficiency, inference performance, cost optimization, and resource utilization
Production ML Deployment: Demonstrated experience deploying ML models to production including model serving, versioning, monitoring, A/B testing, and continuous improvement cycles
Technical Leadership: Ability to guide technical direction, make architectural decisions, drive technical strategy, and influence engineering practices across teams
Collaboration Skills: Excellent collaboration abilities working effectively with cross-functional teams including engineering, product, research, and business stakeholders
Communication Excellence: Outstanding communication skills capable of articulating complex technical concepts to both technical and non-technical audiences, writing clear technical documentation, and presenting to stakeholders
Mentorship Capability: Experience and willingness to mentor other engineers through code reviews, knowledge sharing, best practices establishment, and technical guidance
Problem-Solving: Strong analytical and problem-solving skills for debugging complex ML systems, identifying performance bottlenecks, and resolving production issues
English Proficiency: C1 level (Advanced) or higher in English for technical communication, documentation, collaboration with international teams, and stakeholder engagement; this is mandatory
Work Authorization: Eligibility to work in Portugal with availability for hybrid work model (2 days per week in Lisbon office)
NICE TO HAVE (Preferred)
Agentic AI Experience: Hands-on experience with agentic AI systems, multi-agent workflows, autonomous agents, planning and reasoning systems, and tool-using LLMs
Developer Tooling: Experience building developer-facing tools, APIs, SDKs, or platforms that enable other engineers to leverage ML capabilities
Applied Research Background: Background in applied research, with publications in NLP, LLMs, generative AI, information retrieval, or related machine learning conferences/journals
Research to Production: Track record of successfully translating academic research into production systems with clear business impact
Cloud Platform Expertise: Hands-on experience with cloud platforms including AWS (SageMaker, Bedrock, EC2, S3), GCP (Vertex AI, Cloud ML), or Azure (Azure ML, OpenAI Service)
Scalable Deployment Frameworks: Experience with scalable ML deployment frameworks including Kubernetes for ML workloads, model serving platforms (TensorFlow Serving, TorchServe, Triton), and container orchestration
Vector Database Deep Expertise: Advanced knowledge of vector databases like Pinecone, Weaviate, Chroma, Milvus, FAISS including optimization strategies and production deployment
Additional ML Frameworks: Experience with complementary ML frameworks like JAX, scikit-learn, Hugging Face Transformers, or specialized libraries for NLP and generative AI
Advanced LLM Fine-Tuning: Deep expertise in advanced fine-tuning techniques including LoRA, QLoRA, PEFT methods, instruction tuning, and RLHF (Reinforcement Learning from Human Feedback)
Prompt Engineering: Advanced skills in prompt engineering, few-shot learning, chain-of-thought prompting, and prompt optimization techniques
Embeddings Expertise: Deep understanding of embedding models, similarity search, semantic search optimization, and embedding space analysis
Multi-Modal AI: Experience with multi-modal models handling text, images, audio, or other modalities
MLOps Practices: Strong MLOps experience including ML pipeline automation, CI/CD for ML, model monitoring and observability, feature store implementation, and experiment tracking
Distributed Training: Experience with distributed training of large models across multiple GPUs or machines using frameworks like DeepSpeed, Megatron-LM, or FSDP
Model Compression: Knowledge of model compression techniques including quantization, pruning, distillation, and knowledge transfer
Information Retrieval: Deep expertise in information retrieval systems, search ranking, relevance scoring, and retrieval optimization
Natural Language Processing: Strong NLP background including tokenization, text preprocessing, named entity recognition, semantic similarity, and language understanding
A/B Testing & Experimentation: Experience designing and analyzing ML experiments, A/B tests for model performance, and statistical analysis of results
Cost Optimization: Skills in optimizing ML inference costs through efficient model architecture, caching strategies, batching, and resource management
Monitoring & Observability: Experience implementing comprehensive monitoring for ML systems including model performance metrics, drift detection, and alerting
Data Science Skills: Strong data analysis and visualization skills for understanding model behavior, analyzing errors, and communicating insights
API Design: Experience designing clean, scalable APIs for ML services that are intuitive for other engineers to use
Database Technologies: Familiarity with both SQL and NoSQL databases relevant to ML applications including PostgreSQL, MongoDB, Redis
Real-Time Systems: Experience building real-time ML inference systems with low-latency requirements
Security & Privacy: Understanding of ML security including model security, data privacy, differential privacy, and secure model deployment
Open Source Contributions: Active contributions to open source ML projects or maintainer experience with ML libraries
Technical Writing: Strong technical writing skills including documentation, blog posts, or technical publications
Startup Experience: Previous experience in fast-paced startup or scale-up environments with ability to operate with ambiguity
Domain Knowledge: Background in marketplace platforms, workforce solutions, freelancing ecosystems, or related domains
Location: Lisbon, Portugal (Hybrid - 2 days per week in office)