Site Reliability Engineer
Overview
An organization is seeking a Site Reliability Engineering professional to strengthen the reliability, scalability, monitoring, and performance of its on-premises services within an engineering environment.
The role focuses on building and maintaining a robust observability and monitoring ecosystem while collaborating closely with development and infrastructure teams to ensure high system availability and operational efficiency.
Responsibilities
- Design and maintain monitoring and observability infrastructure
- Develop custom dashboards, alerting mechanisms, and visualization tools
- Implement distributed tracing and centralized log aggregation solutions
- Define and maintain monitoring standards, including SLI/SLO frameworks
- Ensure security compliance of monitoring tools in an on-premises environment
- Automate infrastructure deployment and configuration management
- Work with development teams to implement application instrumentation
- Participate in on-call rotations for incident response
Requirements
Core Technologies
- Grafana (advanced usage)
- Prometheus (including PromQL)
- OpenTelemetry
- Elasticsearch
Infrastructure
- Linux system administration
- Networking fundamentals
- Security practices for on-premises environments
Programming / Automation
- Python
- Bash or Go
Experience
- Minimum 3 years of experience in monitoring and observability
- At least 2 years of hands-on experience with Grafana and Prometheus in production environments
- Strong experience managing Linux-based systems
- Demonstrated experience working with on-premises infrastructure solutions
Security
- Familiarity with enterprise security practices
- Understanding of compliance requirements
Additional Skills
- Ability to balance technical trade-offs with business priorities
- Strong prioritization and problem-solving capabilities
- Willingness to participate in 24/7 on-call rotations for incident support