About the job Senior SRE
Job Description: Senior Site Reliability Engineer (SRE)
Position Title: Senior Site Reliability Engineer (SRE)
Job Type: Full-Time
Candidate location: Mexico or Colombia ONLY
Communication: Excellent English communication skills; the role is client-facing at times
Role Overview: We are seeking a Senior Site Reliability Engineer (SRE) with extensive experience in Logstash, Kibana, and the ELK stack. The ideal candidate will be proficient in creating and managing rules and watchers and have strong knowledge of Elasticsearch. Familiarity with Ansible, Terraform, and Python is also required. This role includes participation in an on-call rotation to ensure the reliability and availability of our systems. The candidate will also be responsible for providing thought leadership, maintaining documentation, and promoting best practices.
Key Responsibilities:
- Log Management and Monitoring:
- Design, implement, and manage Logstash pipelines for data ingestion and transformation for new and existing observability clients.
- Develop and maintain Kibana dashboards for data visualization and monitoring.
- Create and manage Kibana rules and watchers to detect and alert on anomalies and critical events.
- Infrastructure as Code:
- Utilize Ansible for configuration management and automation.
- Implement and maintain infrastructure as code (IaC) using Terraform.
- Demonstrate a strong understanding of Jenkins.
- System Reliability and Performance:
- Monitor and ensure the performance, availability, and security of Elasticsearch clusters and the ELK stack.
- Troubleshoot and resolve issues with Logstash pipelines, Kibana dashboards, and Elasticsearch.
- Implement best practices for logging, monitoring, and alerting.
- Implement and optimize Beats configurations across VMs and containers.
- Implement and optimize APM configurations for Python, Java, and .NET.
- Cloud Expertise:
- Extensive experience with Google Cloud Platform (GCP) and Amazon Web Services (AWS), including management and deployment of cloud infrastructure.
- Linux administration experience.
- Windows administration experience.
- Kubernetes (K8s) experience, including GKE and EKS.
- Thought Leadership:
- Provide thought leadership and contribute to the strategic direction of our SRE practices.
- Advocate for and implement new technologies and methodologies to enhance system reliability.
- Documentation and Best Practices:
- Develop and maintain comprehensive documentation for systems, processes, and procedures.
- Establish and promote best practices for system reliability, monitoring, and incident response.
- Collaboration and Mentorship:
- Work closely with development and operations teams to ensure system reliability.
- Mentor junior team members and provide technical guidance.
- On-Call Rotation:
- Participate in an on-call rotation to provide 24/7 support for critical systems.
- Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence.
- Continuous Improvement:
- Identify areas for improvement and drive initiatives to enhance system reliability, scalability, and performance.
- Stay up to date with the latest industry trends and technologies.
Qualifications:
- Technical Skills:
- Proven experience with Logstash, Kibana, and the ELK stack.
- Strong understanding of Elasticsearch, including cluster management and query optimization.
- Proficiency with Ansible, Terraform, and Jenkins.
- Strong programming skills in Python.
- Experience with Linux/Unix systems.
- Problem-Solving:
- Excellent troubleshooting and analytical skills.
- Ability to diagnose and resolve complex system issues.
- Communication:
- Strong written and verbal communication skills.
- Ability to work collaboratively in a team environment.
- Experience:
- 5+ years of experience in a Site Reliability Engineer (SRE) or similar role.
- Experience working in a high-availability, 24/7 production environment.
- Certifications (Preferred):
- Relevant certifications in Elasticsearch, Terraform, or related technologies are a plus.
- GCP, AWS, or Azure cloud certifications are also a plus.