Site Reliability Engineer (SRE)

Overview

We are seeking an experienced Site Reliability Engineer (SRE) to join a dynamic technology team supporting large-scale infrastructure and AML systems. This role combines software engineering, systems engineering, automation, and operational excellence to ensure high availability, scalability, and reliability across critical platforms.

The ideal candidate is passionate about infrastructure automation, system performance, cloud-native technologies, and operational reliability in fast-paced environments.

Key Responsibilities

  • Design, build, and maintain highly available, scalable, and fault-tolerant systems
  • Collaborate closely with software engineering teams to improve system reliability and performance
  • Develop and maintain automation tools and operational procedures to improve efficiency and reduce manual intervention
  • Monitor infrastructure and application performance to proactively identify and resolve issues
  • Implement and maintain monitoring, alerting, and observability solutions, and define and track SLIs, SLOs, and SLAs
  • Participate in 24/7 on-call rotations, incident management, root-cause analysis, and blameless post-mortems
  • Ensure infrastructure security, compliance, and operational best practices
  • Support large-scale web traffic and machine learning data processing environments

Requirements

Technical Skills

  • Proficiency in at least one programming language such as Python, Go, Java, or C++
  • Strong scripting and automation skills
  • Good understanding of Linux operating systems and network architecture
  • Experience with Docker and Kubernetes
  • Hands-on experience with monitoring tools such as Prometheus and Grafana
  • Knowledge of relational databases and database modeling

Preferred Skills

  • Exposure to machine learning frameworks such as TensorFlow, PyTorch, MXNet, or PaddlePaddle
  • Strong analytical and problem-solving abilities
  • Excellent communication and collaboration skills
  • Ability to work effectively in a fast-paced and cross-functional environment

Qualifications

  • Bachelor's or Master's degree in Computer Science, Information Technology, Computer Engineering, or a related field
  • At least 3 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering

Why Join Us

  • Opportunity to work on large-scale distributed systems and modern infrastructure technologies
  • Exposure to cloud-native environments and advanced automation practices
  • Collaborative and technology-driven working environment
  • Career growth and continuous learning opportunities
  • Competitive salary and benefits package