X9546VV3 | [Mandarin-speaking Role] Senior Operations Engineer (SRE/AI Platform)

Kuala Lumpur, Malaysia

About the job

Location: Kuala Lumpur (KL)
Salary range: RM14,700 - RM17,700
Work visa: Not provided

Job Highlights

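Join the international team of a leading global AI infrastructure service provider and help build and operate cutting-edge AI platforms.
Independently own the production environments serving global users, directly affecting the reliability and performance of core services.
Work in depth with multi-cloud architecture, GPU computing, and automated operations, building high-value technical experience.
Collaborate across cultures, working closely with engineering teams in China and the United States while strengthening bilingual (Chinese and English) technical communication skills.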

Key Responsibilities

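End-to-end operational ownership: Take full responsibility for the availability, latency, performance, and efficiency of the AI infrastructure products (Model-API, Serverless, GPU Instances).
Incident response and management: Act as the first responder for production incidents, investigate root causes in depth (RCA), implement preventive measures, and take part in the on-call rotation.
Automation and tooling: Design and maintain automation scripts and tools that turn operational tasks, deployments, and failure recovery into repeatable processes.
Monitoring and alerting: Build and refine monitoring and alerting systems (e.g., Prometheus/Grafana) so that problems are detected proactively.
Infrastructure as Code (IaC): Manage cloud infrastructure with tools such as Terraform and Ansible to keep environments consistent and reproducible.
Performance and cost optimization: Continuously analyze system performance and resource usage, identify bottlenecks, and optimize cloud platform (AWS/GCP/Azure) costs.
Cross-functional collaboration: Work closely with the engineering team in China to understand new features, provide operational feedback, and ensure new services are production-ready.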

Must-Have Requirements

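5+ years of experience in DevOps, SRE, or cloud operations; a background at a technology or cloud service company is preferred.
Expertise in at least one major cloud platform (AWS/GCP/Azure); hands-on experience with containerization and orchestration (Docker/Kubernetes required).
Proficiency in at least one scripting language (e.g., Python, Go, Shell); working command of IaC tools such as Terraform and Ansible.
Hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK).
Systematic troubleshooting skills, with the ability to stay calm under pressure while handling complex distributed-system problems.
Fluency in both Mandarin and English (written and spoken), sufficient for cross-team technical communication.
A strong sense of responsibility and self-drive, and the ability to work independently in a remote/distributed team.
Nice to have: experience with GPU-accelerated computing environments; familiarity with MLOps tools (e.g., Kubeflow, MLflow); knowledge of serverless technologies and CI/CD pipelines.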

How to Apply?

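Click 'Apply' or send your resume to [apply@ttukoffer.co.uk] with the subject line [申请 WBX9546VV3]. Referral bonus: a referral reward is paid for each successfully referred candidate. Details: https://ttukoffer.co.uk/refer-a-friend-bonus/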

[Mandarin-speaking Role] Senior Operations Engineer (SRE/AI Platform)

Location: Kuala Lumpur
Compensation: RM10,000 - RM15,000
Visa Sponsorship: Not Available

Job Highlights

Join the international team of a leading global AI infrastructure service provider to build and operate cutting-edge AI platforms.
Take end-to-end ownership of production environments for global users, directly impacting core service reliability and performance.
Gain deep exposure to multi-cloud architecture, GPU computing, and automated operations in a high-impact role.
Collaborate in a multicultural environment with engineering teams across China and North America, enhancing bilingual technical communication skills.

Key Responsibilities

End-to-End Service Ownership: Assume primary responsibility for the availability, latency, performance, and efficiency of AI infrastructure products (Model-API, Serverless, GPU Instances).
Incident Management & Response: Act as the first responder for production incidents, perform root cause analysis (RCA), and implement preventive measures. Participate in an on-call rotation.
Automation & Tooling: Design, build, and maintain automation scripts and tools to streamline operational tasks, deployments, and failure recovery.
Monitoring & Alerting: Develop and refine monitoring and alerting systems (e.g., Prometheus/Grafana) to enable proactive issue detection.
Infrastructure as Code (IaC): Manage and provision cloud infrastructure using IaC tools (e.g., Terraform, Ansible) to ensure consistency and repeatability.
Performance & Cost Optimization: Continuously analyze system performance and resource utilization to identify bottlenecks and optimize cloud platform (AWS/GCP/Azure) costs.
Cross-Functional Collaboration: Work closely with engineering teams in China to understand new features, provide operational feedback, and ensure production readiness of new services.

Must-Have Requirements

5+ years of hands-on experience in DevOps, SRE, or cloud operations, preferably in a tech or cloud service company.
Expertise in at least one major cloud provider (AWS/GCP/Azure); practical experience with containerization and orchestration technologies (Docker/Kubernetes required).
Proficiency in at least one scripting language (e.g., Python, Go, Shell); solid understanding of IaC tools like Terraform/Ansible.
Hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
Systematic problem-solving skills with the ability to troubleshoot complex distributed systems under pressure.
Professional fluency in both English and Mandarin (written and spoken) for effective cross-regional collaboration.
Strong sense of ownership and self-drive, with the ability to work independently in a remote/distributed team setting.
Nice to Have: Experience with GPU-accelerated computing; knowledge of MLOps tools (e.g., Kubeflow, MLflow); familiarity with serverless technologies and CI/CD pipelines.

How to Apply?

Click 'Apply' or send your resume to [apply@ttukoffer.co.uk] with the subject line [Apply to WBX9546VV3]. Refer a friend for this role and earn referral bonuses! See details: https://ttukoffer.co.uk/refer-a-friend-bonus/

By applying, you acknowledge that TT UKoffer Ltd may process your personal data for recruitment purposes under the lawful basis of legitimate interest. This includes sharing your CV with potential employers. We comply with UK GDPR regulations, and you may request data removal at any time by contacting apply@ttukoffer.co.uk.