Site Reliability Engineer (Hybrid, Lahore, Remittance Salary)

About the job

Requirements:

  • 5+ years of experience in an SRE, DevOps, or infrastructure engineering role.
  • Strong experience with AWS or GCP, including services such as EC2, Lambda, S3, and RDS (AWS) or GKE (GCP).
  • Experience with automation tools like Terraform.
  • Proficient in at least one scripting language (Python, Bash, Go, etc.).
  • Solid understanding of Linux systems, networking, and cloud-based architectures.
  • Experience working with container orchestration platforms like Kubernetes.
  • Proficient with CI/CD pipelines, preferably with cloud-native tools (e.g., GitHub).
  • Ability to troubleshoot complex, distributed systems and provide solutions in high-pressure environments.
  • Ability to communicate effectively with both technical and non-technical stakeholders.

Nice to have:

  • Exposure to Execution Management Systems (EMS) / Portfolio Management Systems (PMS).
  • Experience with client-impact triage, working cross-functionally with account managers or product teams.
  • Proficiency with Datadog or similar observability platforms.
  • Knowledge of serverless architectures (e.g., AWS Lambda, GCP Cloud Functions).
  • Familiarity with RDBMS and NoSQL databases, such as RDS, Cloud SQL, and DynamoDB.
  • Prior experience in fintech, trading platforms, or 24/7 financial infrastructure.
  • Strong understanding of API integrations and how infrastructure issues might manifest in client environments.
  • Excellent problem-solving and communication skills, with the ability to translate technical incidents into clear client updates.
  • Experience working with client-facing teams.

Responsibilities:

  • Ensure the reliability, availability, and performance of production systems, particularly during weekends.
  • Take ownership of monitoring, troubleshooting, and incident response during weekends and off-hours.
  • Troubleshoot and resolve critical issues in a fast-paced, high-availability environment.
  • Automate manual processes and workflows, reducing operational overhead.
  • Work closely with engineering teams to design and deploy scalable, fault-tolerant infrastructure solutions on AWS or GCP.
  • Improve observability using monitoring, logging, and alerting systems (e.g., CloudWatch, Datadog).
  • Lead post-incident reviews, contribute to the continuous improvement of system reliability, and follow up on strategic fixes.
  • Develop and update runbooks, incident response playbooks, and documentation.
  • Work closely with Engineering, Product, and Client teams to proactively identify infrastructure pain points that could affect the user experience.
  • Monitor alert channels, logs, and infrastructure load for the entire stack.
  • Set up automation for alerting.