XTN-EF5F239 | SENIOR DEVOPS ENGINEER - CLOUD & HPC INFRASTRUCTURE

Position Overview

We are seeking a highly experienced Senior DevOps Engineer to lead the design, deployment, automation, and operational excellence of our AWS-based cloud infrastructure and high-performance computing (HPC) environments. This role requires deep expertise in AWS architecture, Linux systems administration, server deployment, containerization, virtualization, license server management, and cloud networking.

The ideal candidate is hands-on, automation-driven, security-focused, and comfortable operating in complex hybrid environments supporting research, engineering, and compute-intensive workloads.

Benefits

  • Health Insurance/HMO
  • Unlimited MadMax Coffee
  • Diverse learning and growth opportunities
  • Accessible cloud HR platform (Sprout)
  • Above-standard leaves

Key Responsibilities

Cloud Infrastructure & AWS Architecture

  • Design, deploy, and manage scalable, secure AWS infrastructure.
  • Architect and maintain VPCs, subnets, route tables, NAT gateways, transit gateways, and peering.
  • Manage AWS networking components including Route 53, load balancers (ALB/NLB), CloudFront, and PrivateLink.
  • Implement infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar.
  • Optimize cloud cost, performance, and resource utilization.
  • Implement AWS best practices for security, resilience, and high availability.

Server Deployment & Systems Engineering

  • Architect and automate server provisioning across cloud and hybrid environments.
  • Deploy and manage EC2, Auto Scaling Groups, Launch Templates, and AMIs.
  • Build hardened Linux server images (CIS benchmarks preferred).
  • Implement configuration management using tools such as Ansible, Puppet, or Chef.
  • Manage patching, lifecycle management, and OS hardening strategies.

Expert Linux Administration

  • Administer RHEL, Rocky Linux, Ubuntu, or similar distributions at an advanced level.
  • Tune the kernel and optimize performance for compute-intensive workloads.
  • Troubleshoot system-level performance (CPU, memory, I/O, networking).
  • Manage system services, storage, RAID, LVM, NFS, and distributed filesystems.
  • Write shell scripts and automation (Bash, Python).

Containerization & Virtualization

  • Design and manage containerized workloads using Docker.
  • Deploy and maintain Kubernetes (EKS preferred).
  • Implement CI/CD pipelines for container-based applications.
  • Manage virtualization platforms (VMware, KVM, or similar).
  • Optimize container orchestration for HPC and compute workloads.

HPC Infrastructure Management

  • Deploy and maintain High Performance Computing clusters.
  • Manage job schedulers (Slurm, PBS, or similar).
  • Optimize cluster performance, storage throughput, and node scaling.
  • Integrate HPC workloads with AWS services (e.g., ParallelCluster).
  • Manage high-speed networking (InfiniBand or equivalent), where applicable.
  • Support GPU-based workloads where applicable.

License Server Administration

  • Deploy and manage FlexLM or similar license servers.
  • Ensure high availability and redundancy for engineering license services.
  • Monitor license usage and optimize allocation.
  • Troubleshoot license connectivity and performance issues.

Cloud Networking & Security

  • Apply a deep understanding of TCP/IP, DNS, routing protocols, and firewall design.
  • Implement secure connectivity (site-to-site VPN, AWS Direct Connect).
  • Manage security groups, NACLs, and IAM roles, and apply zero-trust principles.
  • Implement logging, monitoring, and alerting (CloudWatch, Prometheus, Grafana).
  • Support compliance frameworks and infrastructure security controls.

Automation & CI/CD

  • Build and maintain CI/CD pipelines (GitHub Actions, GitLab, Jenkins, etc.).
  • Automate infrastructure deployments and configuration management.
  • Implement DevSecOps best practices.
  • Develop reusable infrastructure modules and standards.

Monitoring & Observability

  • Implement centralized logging solutions.
  • Configure performance monitoring and alerting systems.
  • Perform root cause analysis and incident response.
  • Develop dashboards and operational metrics.

Required Qualifications

  • 7+ years of experience in DevOps, Infrastructure Engineering, or Systems Engineering.
  • 5+ years of hands-on AWS architecture experience.
  • Deep expertise in Linux systems administration.
  • Strong experience with containerization and Kubernetes.
  • Proven experience managing HPC environments.
  • Experience managing enterprise license servers.
  • Strong scripting skills (Bash, Python).
  • Experience with Infrastructure as Code (Terraform preferred).
  • Strong understanding of networking fundamentals and cloud networking.

Preferred Qualifications

  • AWS Solutions Architect Professional or DevOps Professional certification.
  • Experience with AWS ParallelCluster.
  • Experience with GPU workloads and AI/ML infrastructure.
  • Experience with enterprise storage solutions (NetApp, Isilon, etc.).
  • Experience supporting research or engineering compute environments.

Soft Skills

  • Strong troubleshooting and analytical skills.
  • Ability to work independently in high-complexity environments.
  • Clear documentation and communication skills.
  • Experience collaborating across engineering, security, and research teams.
  • Strategic mindset with hands-on execution capability.

What Success Looks Like

  • Highly available, secure, and automated AWS & HPC infrastructure.
  • Optimized cloud costs and compute performance.
  • Reliable license server infrastructure with minimal downtime.
  • Fully automated server deployments.
  • Secure, scalable cloud networking architecture.
  • Improved deployment velocity through CI/CD automation.