XTN-EF5F239 | SENIOR DEVOPS ENGINEER - CLOUD & HPC INFRASTRUCTURE
Position Overview
We are seeking a highly experienced Senior DevOps Engineer to lead the design, deployment, automation, and operational excellence of our AWS-based cloud infrastructure and high-performance computing (HPC) environments. This role requires deep expertise in AWS architecture, Linux systems administration, server deployment, containerization, virtualization, license server management, and cloud networking.
The ideal candidate is hands-on, automation-driven, security-focused, and comfortable operating in complex hybrid environments supporting research, engineering, and compute-intensive workloads.
Benefits
- Health Insurance/HMO
- Unlimited MadMax Coffee
- Diverse learning & growth opportunities
- Accessible Cloud HR platform (Sprout)
- Above-standard leave
Key Responsibilities
Cloud Infrastructure & AWS Architecture
- Design, deploy, and manage scalable, secure AWS infrastructure.
- Architect and maintain VPCs, subnets, route tables, NAT gateways, transit gateways, and peering.
- Manage AWS networking components including Route 53, Load Balancers (ALB/NLB), CloudFront, and PrivateLink.
- Implement infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar.
- Optimize cloud cost, performance, and resource utilization.
- Implement AWS best practices for security, resilience, and high availability.
Server Deployment & Systems Engineering
- Architect and automate server provisioning across cloud and hybrid environments.
- Deploy and manage EC2, Auto Scaling Groups, Launch Templates, and AMIs.
- Build hardened Linux server images (CIS benchmarks preferred).
- Implement configuration management using tools such as Ansible, Puppet, or Chef.
- Manage patching, lifecycle management, and OS hardening strategies.
Expert Linux Administration
- Administer RHEL, Rocky Linux, Ubuntu, or similar distributions at an advanced level.
- Tune kernels and optimize performance for compute-intensive workloads.
- Troubleshoot system-level performance issues (CPU, memory, I/O, networking).
- Manage system services, storage, RAID, LVM, NFS, and distributed filesystems.
- Script and automate routine operations using Bash and Python.
Containerization & Virtualization
- Design and manage containerized workloads using Docker.
- Deploy and maintain Kubernetes (EKS preferred).
- Implement CI/CD pipelines for container-based applications.
- Manage virtualization platforms (VMware, KVM, or similar).
- Optimize container orchestration for HPC and compute workloads.
HPC Infrastructure Management
- Deploy and maintain High Performance Computing clusters.
- Manage job schedulers (Slurm, PBS, or similar).
- Optimize cluster performance, storage throughput, and node scaling.
- Integrate HPC workloads with AWS services (e.g., ParallelCluster).
- Manage high-speed networking (InfiniBand or equivalent, where applicable).
- Support GPU-based workloads where applicable.
License Server Administration
- Deploy and manage FlexLM or similar license servers.
- Ensure high availability and redundancy for engineering license services.
- Monitor license usage and optimize allocation.
- Troubleshoot license connectivity and performance issues.
Cloud Networking & Security
- Apply a deep understanding of TCP/IP, DNS, routing protocols, and firewall design.
- Implement secure connectivity (site-to-site VPN, Direct Connect).
- Manage security groups, NACLs, and IAM roles, and apply zero-trust principles.
- Implement logging, monitoring, and alerting (CloudWatch, Prometheus, Grafana).
- Support compliance frameworks and infrastructure security controls.
Automation & CI/CD
- Build and maintain CI/CD pipelines (GitHub Actions, GitLab, Jenkins, etc.).
- Automate infrastructure deployments and configuration management.
- Implement DevSecOps best practices.
- Develop reusable infrastructure modules and standards.
Monitoring & Observability
- Implement centralized logging solutions.
- Configure performance monitoring and alerting systems.
- Perform root cause analysis and incident response.
- Develop dashboards and operational metrics.
Required Qualifications
- 7+ years of experience in DevOps, Infrastructure Engineering, or Systems Engineering.
- 5+ years of hands-on AWS architecture experience.
- Deep expertise in Linux systems administration.
- Strong experience with containerization and Kubernetes.
- Proven experience managing HPC environments.
- Experience managing enterprise license servers.
- Strong scripting skills (Bash, Python).
- Experience with Infrastructure as Code (Terraform preferred).
- Strong understanding of networking fundamentals and cloud networking.
Preferred Qualifications
- AWS Certified Solutions Architect - Professional or AWS Certified DevOps Engineer - Professional certification.
- Experience with AWS ParallelCluster.
- Experience with GPU workloads and AI/ML infrastructure.
- Experience with enterprise storage solutions (NetApp, Isilon, etc.).
- Experience supporting research or engineering compute environments.
Soft Skills
- Strong troubleshooting and analytical skills.
- Ability to work independently in high-complexity environments.
- Clear documentation and communication skills.
- Experience collaborating across engineering, security, and research teams.
- Strategic mindset with hands-on execution capability.
What Success Looks Like
- Highly available, secure, and automated AWS & HPC infrastructure.
- Optimized cloud costs and compute performance.
- Reliable license server infrastructure with minimal downtime.
- Fully automated server deployments.
- Secure, scalable cloud networking architecture.
- Improved deployment velocity through CI/CD automation.