Data Engineer (Databricks) - Porto (2 days/month on-site)
ABOUT THE OPPORTUNITY
Join a global custom software solutions company with 40 years of experience delivering innovative data engineering solutions to clients worldwide. We're looking for talented Data Engineers to work on modern data architectures supporting both traditional analytics and emerging AI/ML workloads. Within a collaborative, flat management structure where you're valued as an integral team member, you'll design and build scalable data pipelines using cutting-edge tools across the AWS, Azure, and Databricks platforms. With only 2 days per month required in the Porto office, you'll enjoy outstanding work-life balance alongside knowledge sharing, social events, catered lunches, and a culture of continuous learning and recognition through awards and performance bonuses across 7 international hubs. The role offers exposure to modern data lakehouse architectures, real-time streaming, and MLOps practices, working with passionate professionals on challenging projects that turn raw data into actionable business insights.
PROJECT & CONTEXT
You'll play a pivotal role in enabling data-driven decision-making by designing, building, and maintaining efficient ETL/ELT pipelines with Python, SQL, and Apache Spark. Your work will focus on implementing modern data architectures, including the Data Lakehouse and Medallion Architecture (Bronze/Silver/Gold layers), supporting both business reporting and advanced analytics use cases. You'll manage and optimize cloud-based infrastructure on AWS and Azure, ensuring cost-effectiveness, performance, and scalability while processing massive data volumes. Responsibilities include implementing data governance and quality standards with frameworks such as Great Expectations and Unity Catalog, ensuring data integrity and compliance across the data lifecycle. You'll collaborate closely with Data Scientists, AI Engineers, and Business Analysts to understand requirements and deliver high-quality datasets, and you'll support MLOps practices for model deployment and monitoring. The role also involves orchestrating complex workflows with Apache Airflow, handling real-time data streams with Kafka, managing Delta Lake features, and driving automation through CI/CD practices to continuously improve pipeline performance and reliability.
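To make the Medallion pattern concrete, here is a minimal PySpark sketch of a Bronze-to-Silver step on Delta Lake; the bucket paths, column names, and cleaning rules are hypothetical placeholders, not the actual project's pipeline.

```python
# Minimal Bronze -> Silver step in the Medallion pattern.
# Paths and columns are hypothetical; cleaning rules are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("bronze-to-silver")
    # Delta Lake support for open-source Spark (requires the delta-spark
    # package on the classpath); on Databricks this is preconfigured.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Bronze: raw records landed as-is from the source system.
bronze = spark.read.format("delta").load("s3://example-lake/bronze/orders")

# Silver: deduplicated, typed, and filtered to valid records.
silver = (
    bronze
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") > 0)
)

(silver.write
    .format("delta")
    .mode("overwrite")
    .save("s3://example-lake/silver/orders"))
```

The same pattern repeats from Silver to Gold, with each layer adding structure, validation, and business-level aggregation.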
WHAT WE'RE LOOKING FOR (Required)
- Data Engineering Experience: Minimum of 4-5 years' proven experience as a Data Engineer building production data pipelines
- Python Proficiency: Strong programming skills in Python for data manipulation, scripting, and pipeline development
- Apache Spark: Extensive hands-on experience with Apache Spark (PySpark) for batch and streaming data processing
- Workflow Orchestration: Proficiency with Apache Airflow for scheduling and managing complex data workflows (see the DAG sketch after this list)
- SQL Expertise: Expert-level SQL skills for data analysis, transformation, and optimization
- Big Data Formats: Deep experience with the Parquet and Avro file formats and the Delta Lake table format, including their optimization
- Data Lake Design: Proven experience designing Data Lakes and implementing Medallion Architecture patterns
- Real-Time Streaming: Hands-on experience with Apache Kafka or similar platforms for real-time data processing
- Version Control: Proficiency with Git for collaborative development and code management
- Data Quality: Experience implementing data quality frameworks like Great Expectations or Soda
- AWS Storage: Deep knowledge of Amazon S3 for data lake storage, lifecycle policies, and security configurations
- Cloud Platforms: Practical experience with AWS and/or Azure cloud-based data infrastructure
- Data Modeling: Strong understanding of data modeling principles and dimensional design
- DevOps Practices: Familiarity with CI/CD pipelines for data infrastructure deployment
- Language: B2 English (Upper Intermediate) minimum; the entire interview process is conducted in English
- Location: Based in Porto/Northern Portugal region with availability for 2 on-site days per month
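As a reference for the orchestration requirement above, here is a minimal sketch of an Airflow DAG using the TaskFlow API, assuming Airflow 2.4+; the DAG id, schedule, and task bodies are hypothetical stand-ins for real extract/transform/load steps.

```python
# Minimal Airflow DAG wiring an extract -> transform -> load chain.
# DAG id, schedule, and task bodies are hypothetical.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract() -> list[dict]:
        # Stand-in for pulling raw records from a source system.
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Stand-in for cleaning/enrichment logic.
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # Stand-in for writing to the Silver layer.
        print(f"loaded {len(rows)} rows")

    # TaskFlow infers the dependency chain from the data flow.
    load(transform(extract()))


orders_pipeline()
```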
NICE TO HAVE (Preferred)
- AWS Services: AWS Glue (Crawlers, Jobs, Data Catalog), Lake Formation, Kinesis (Data Streams, Firehose), Lambda, IAM, CloudWatch
- Azure Services: Azure Data Lake Storage Gen2, Azure Data Factory, Azure Synapse Analytics, Microsoft Purview, Event Hubs, Azure Stream Analytics
- Databricks Platform: Workspace management, Unity Catalog, Databricks Jobs, Delta Live Tables (DLT), cluster optimization
- Delta Lake Advanced: Time travel, schema enforcement, optimization techniques, ACID transactions (see the sketch after this list)
- MLOps Tools: MLflow for experiment tracking and model registry, supporting ML model deployment
- AI/ML Context: Exposure to Generative AI concepts (LLMs, RAG, Vector Search) and data requirements for AI workloads
- Mosaic AI: Experience with Model Serving, Vector Search, AI Gateway for LLM workloads
- Alternative Languages: Scala or Java programming experience
- Alternative Orchestration: Prefect, Dagster, or Azure Data Factory experience
- BI Integration: Experience serving data to Power BI, Tableau, or Looker
- Data Governance: Understanding of data lineage, security principles, and compliance requirements
- Streaming Analytics: Azure Stream Analytics or advanced Kafka Streams processing
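To illustrate the advanced Delta Lake features listed above, here is a minimal sketch of time travel, schema enforcement, and file compaction, assuming delta-spark 2.x and a Spark session with Delta already enabled; the table path is hypothetical.

```python
# Minimal Delta Lake time travel and maintenance sketch.
# Table path is hypothetical; assumes a Delta-enabled Spark session.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3://example-lake/silver/orders"

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema enforcement: an append with a mismatched schema fails unless
# schema evolution is explicitly enabled, e.g.:
#   df.write.format("delta").mode("append").save(path)              # enforced
#   df.write.format("delta").option("mergeSchema", "true") \
#       .mode("append").save(path)                                  # evolved

# Maintenance: compact small files and inspect the transaction history.
DeltaTable.forPath(spark, path).optimize().executeCompaction()
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)
```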
Certifications (Advantageous):
- Databricks Certified Data Engineer Professional or Associate
- AWS Certified Data Engineer Associate (DEA-C01) or Solutions Architect Associate
- Microsoft Certified: Azure Data Engineer Associate (DP-203) or Azure Solutions Architect Expert
Location: Porto, Portugal (2 days/month on-site)