Site Reliability Engineer

Overview
An organization is seeking a Site Reliability Engineer to strengthen the reliability, scalability, monitoring, and performance of its on-premises services.

The role focuses on building and maintaining a robust observability and monitoring ecosystem while collaborating closely with development and infrastructure teams to ensure high system availability and operational efficiency.

Responsibilities

  • Design and maintain monitoring and observability infrastructure

  • Develop custom dashboards, alerting mechanisms, and visualization tools

  • Implement distributed tracing and centralized log aggregation solutions

  • Define and maintain monitoring standards, including SLI/SLO frameworks

  • Ensure security compliance of monitoring tools in an on-premises environment

  • Automate infrastructure deployment and configuration management

  • Work with development teams to implement application instrumentation

  • Participate in on-call rotations for incident response

Requirements

Core Technologies

  • Grafana (advanced usage)

  • Prometheus (including PromQL)

  • OpenTelemetry

  • Elasticsearch

Infrastructure

  • Linux system administration

  • Networking fundamentals

  • Security practices for on-premises environments

Programming / Automation

  • Python

  • Bash or Go

Experience

  • Minimum 3 years of experience in monitoring and observability

  • At least 2 years of hands-on experience with Grafana and Prometheus in production environments

  • Strong experience managing Linux-based systems

  • Demonstrated experience working with on-premises infrastructure solutions

Security

  • Familiarity with enterprise security practices

  • Understanding of compliance requirements

Additional Skills

  • Ability to balance technical trade-offs with business priorities

  • Strong prioritization and problem-solving capabilities

  • Willingness to participate in 24/7 on-call rotations for incident support