Job Description:
Purpose of the Role
Design, implement, and maintain secure, scalable, and cost-effective cloud infrastructure. This role ensures long-term cloud sustainability through FinOps, cost optimization, automation, and resilient architectures that support business growth, reliability, and operational efficiency.
Key Responsibilities
- Design and implement scalable, secure, cost-efficient cloud infrastructure.
- Lead cloud cost optimization using FinOps principles and long-term commitment discounts (e.g., Reserved Instances, Savings Plans).
- Architect cloud solutions for sustainability and economies of scale.
- Configure and manage compute, networking, storage, and monitoring tools.
- Automate provisioning, deployment, and maintenance using IaC.
- Work closely with DevOps and Engineering to ensure performance and high availability.
- Monitor infrastructure health, optimize resource usage, and resolve performance issues.
- Implement strong cloud security, encryption, and compliance standards.
- Evaluate and recommend new cloud services and technologies.
Minimum Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 7+ years of experience in infrastructure engineering or a similar role.
- 3–5 or more years of hands‑on experience designing and managing secure, scalable cloud environments (AWS, Azure, or GCP).
- Strong understanding of cloud architecture, networking, security, and FinOps.
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation, ARM/Bicep).
- Relevant certifications are beneficial (e.g., AWS Certified Solutions Architect, Microsoft Certified: Azure Solutions Architect Expert, FinOps Certified Practitioner).
- Strong analytical, problem‑solving, communication, and collaboration skills.
Key Performance Measures
- Cloud Cost Efficiency: Savings via right‑sizing, Reserved Instances/Savings Plans, and FinOps reporting.
- Scalability & Elasticity: Ability to scale environments with minimal manual intervention.
- Security & Compliance: Effective implementation of security controls and audit readiness.
- System Uptime: Meeting or exceeding cloud uptime SLAs.
- Incident Response (MTTR): Mean time to resolve incidents, reflecting how quickly and effectively they are detected and resolved.
- Automation: Degree to which automation improves deployment velocity and reduces manual tasks.
- Resource Utilization: Efficient CPU, memory, storage, and network usage.
- Disaster Recovery Readiness: Achieving target RPO/RTO and successful DR test results.
- 360° Internal Feedback: Collaboration and stakeholder satisfaction across teams.