Supervisor - Server Repair Engineering

About the job

Description

This is a foundational role responsible for architecting, defining, and continuously improving the entire technical framework for diagnosing and repairing our complex, high-value AI server infrastructure.

More than a traditional supervisor, you are the lead repair engineer and process owner.

You will leverage your deep hardware expertise to develop systematic, data-driven, and scalable repair processes from the ground up.

You will not only lead a team of technicians and junior engineers but also act as their primary technical mentor and the engineering liaison to our core Product Design and Quality teams.

Your mission is to transform our repair facility into a center of excellence by embedding engineering discipline into every aspect of our service operations.

Key Responsibilities

1. Process Architecture & Definition (Primary Focus):

* Architect and Author: Design, document, and deploy the end-to-end technical workflow for AI server repair. This includes creating detailed Standard Operating Procedures (SOPs), diagnostic flowcharts, decision trees, and work instructions.

* Test Plan Development: Define and validate comprehensive test plans and validation criteria for all repaired components and full systems, ensuring they meet strict performance and reliability standards before being returned to service.

* Tooling & Automation: Identify, develop, and implement diagnostic scripts, software tools, and physical fixtures to improve the accuracy, consistency, and efficiency of the troubleshooting and repair process.

* Process Control: Establish critical control points within the repair process to ensure quality and gather vital failure data.

2. Advanced Engineering Support & Failure Analysis (Primary Focus):

* Technical Authority: Serve as the ultimate escalation point for the most complex hardware failures that elude standard diagnostic procedures.

* Root Cause Analysis (RCA): Lead systematic deep dives into new and recurring failure modes. Perform board-level analysis, interpret schematics, and collaborate with the team to isolate the root cause.

* Engineering Feedback Loop: Act as the primary technical interface between the repair center and core Hardware Engineering/R&D. Consolidate, analyze, and present failure data and RCA findings to influence future product design for improved serviceability and reliability (Design for Serviceability).

3. Operational Leadership & Team Enablement:

* Technical Mentorship: Lead and develop the technical capabilities of the repair team. Provide hands-on training on new products, advanced diagnostic techniques, and established repair processes.

* Enablement, Not Just Delegation: Empower the team by ensuring they have the processes, tools, and knowledge required to succeed. Focus on removing technical roadblocks and fostering an environment of structured problem-solving.

* Performance Management: Set clear technical objectives, manage workflow priorities based on engineering needs, and guide the professional growth of team members.

4. Data-Driven Continuous Improvement:

* Analyze Repair Data: Systematically collect and analyze repair data (failure modes, component usage, test yields) to identify trends and opportunities for process optimization.

* Drive Improvements: Initiate and lead engineering change requests (ECRs) and process improvement projects based on data analysis to enhance repair quality, reduce turnaround time, and lower costs.

Qualifications & Skills

Required Qualifications (Must-Haves):

* Education: Bachelor's degree in Electrical Engineering, Computer Engineering, Manufacturing Engineering, or a closely related field.

* Experience:

* 4+ years in a technical engineering role such as Test Engineering, Manufacturing Engineering, Hardware Sustaining, or high-level Repair Engineering.

* Proven track record of developing and documenting technical processes (SOPs, test plans, work instructions) from scratch in a manufacturing or repair environment.

* 3+ years in a technical leadership role, mentoring junior engineers or technicians.

* Technical Expertise:

* Expert-level ability to read and interpret electronic schematics, board layout files, and product specifications.

* Strong, hands-on experience with systematic hardware troubleshooting methodologies for complex systems (e.g., servers, networking equipment).

* Demonstrated proficiency in scripting (Python, Bash, or similar) to automate diagnostic tests and parse data logs.

* Deep knowledge of server components and architecture, including GPUs, high-speed interconnects (InfiniBand/Ethernet), CPUs, and power systems.

Working Conditions

Must be able to tolerate moderate to high noise levels in production and testing rooms. The role combines office conditions with warehouse conditions that resemble the outdoors: hot in the summer and cold in the winter. The work may require walking the facilities for extended periods; reaching above shoulder height; bending or stooping below the waist; repetitive wrist, hand, or finger movement; and occasionally lifting up to 50 pounds.

To apply, send your resume to ana@employeemagnets.com.