Site Reliability Engineer (Hybrid, Lahore, Remittance Salary)
About the job
Requirements:
- 5+ years of experience in an SRE, DevOps, or infrastructure engineering role.
- Strong experience with AWS or GCP, including services like EC2, Lambda, S3, and RDS (for AWS) or GKE (for GCP).
- Experience with automation tools like Terraform.
- Proficient in at least one scripting language (Python, Bash, Go, etc.).
- Solid understanding of Linux systems, networking, and cloud-based architectures.
- Experience working with container orchestration platforms like Kubernetes.
- Proficient with CI/CD pipelines, preferably with cloud-native tools (e.g., GitHub).
- Ability to troubleshoot complex, distributed systems and provide solutions in high-pressure environments.
- Ability to communicate effectively with both technical and non-technical stakeholders.
Nice to have:
- Exposure to Execution Management Systems (EMS) / Portfolio Management Systems (PMS).
- Experience with client-impact triage, working cross-functionally with account managers or product teams.
- Proficiency with Datadog or similar observability platforms.
- Knowledge of serverless architectures (e.g., AWS Lambda, GCP Cloud Functions).
- Familiarity with RDBMS and NoSQL databases, such as RDS, Cloud SQL, and DynamoDB.
- Prior experience in fintech, trading platforms, or 24/7 financial infrastructure.
- Strong understanding of API integrations and how infrastructure issues might manifest in client environments.
- Excellent problem-solving and communication skills, with the ability to translate technical incidents into clear client updates.
- Experience working with client-facing teams.
Responsibilities:
- Ensure the reliability, availability, and performance of production systems, particularly during weekends.
- Take ownership of monitoring, troubleshooting, and incident response during weekends and off-hours.
- Troubleshoot and resolve critical issues in a fast-paced, high-availability environment.
- Automate manual processes and workflows, reducing operational overhead.
- Work closely with engineering teams to design and deploy scalable, fault-tolerant infrastructure solutions on AWS or GCP.
- Improve observability by utilizing monitoring, logging, and alerting systems (e.g., CloudWatch, Datadog).
- Lead post-incident reviews, contribute to the continuous improvement of system reliability, and follow up on strategic fixes.
- Develop and update runbooks, incident response playbooks, and documentation.
- Work closely with Engineering, Product, and Client teams to proactively identify infrastructure pain points that could affect the user experience.
- Monitor alert channels, logs, and infrastructure load for the entire stack.
- Set up automation for alerting.