Site Reliability Engineer
Team Segment: Solutions Business
KKCompany Technologies, Asia's leading AI multimedia technology group, is dedicated to creating value for customers through its core businesses of multimedia technologies, digital cloud, and AI applications.
At KKCompany, we believe in Innovation Made Simple, and that technology is the answer to the struggles faced by every industry. Since its establishment two decades ago, KKCompany has expanded its portfolio to include KKBOX, BlendVision, and Going Cloud. KKBOX is the world's first platform to bring a legal music streaming service to the public. It uses state-of-the-art streaming technology to deliver an excellent user experience. Our flagship brands and base of international clients enable us to accumulate extensive data and advance our analytical capabilities. These strengths, along with our extensive experience in brand management, help businesses achieve digital transformation. We serve tens of millions of consumers and enterprise clients in Asia across a broad spectrum of industries such as telecommunications, multimedia, online education, fitness, smart retail, and more.
KKCompany now has nearly 500 employees across offices in Tokyo, Singapore, Taipei, Kaohsiung, and Hong Kong.
Job Overview:
We are seeking a Site Reliability Engineer (SRE) to join our team supporting services with millions of active users. This role ensures service availability, performance, and scalability through automation, monitoring, incident response, and collaboration with DevOps and application development teams.
As an SRE, you will be embedded in the full lifecycle of our systems, from architecture design, deployment pipelines, and observability frameworks through to incident resolution. This is a high-impact position that requires both technical depth and operational ownership.
Responsibilities:
- Monitoring & Incident Management
  - Participate in on-call rotations to respond to critical incidents and ensure high service availability.
  - Build and maintain monitoring and alerting tools using AWS CloudWatch or third-party platforms.
  - Set up effective alerting rules, triage anomalies, and lead service recovery efforts during incidents.
- Architecture Understanding & Collaboration
  - Work with Web, Backend, and DevOps teams to gain a deep understanding of the service architecture.
  - Support integration of operational and reliability best practices into the software development process.
- Deployment & Release Validation
  - Monitor new deployments and evaluate their impact on service SLAs.
  - Make quick rollback decisions when deployments threaten reliability or availability.
- Infrastructure & Automation
  - Automate infrastructure provisioning for the core encoding service using Infrastructure-as-Code tools such as Terraform, AWS CDK, or CloudFormation.
  - Ensure highly available and scalable system design using AWS and Kubernetes.
- Toil Reduction & Operational Efficiency
  - Identify repetitive manual tasks (toil) in operations and incident management.
  - Design and implement automation or process improvements to reduce manual effort and increase engineering velocity.
- Documentation
  - Create and maintain detailed documentation, including architecture diagrams, runbooks, and postmortems.
Requirements:
- Bachelor's degree in Computer Science or a related technical field involving software or systems engineering, or equivalent practical experience.
- Willingness to take part in on-call rotations and respond quickly to incidents.
- Strong collaboration and communication skills across cross-functional teams.
- Ability to write scripts in languages such as Python or Shell.
- Familiarity with AWS high availability architecture and services.
- Experience with Git and CI/CD pipelines, preferably using GitLab CI/CD.
- Experience operating and debugging Kubernetes in production.
Nice to Have:
- Experience optimizing service performance and reliability in cloud-native environments.
- Experience managing observability tools such as CloudWatch.
- Familiarity with managing large-scale systems supporting millions of active users.
- Knowledge of auditing and compliance processes related to ISO 27001.