Senior SQA Engineer (LLM) Remote Pakistan
About the job
- Design and own the end-to-end QA strategy for the Conversational Banking Platform, covering functional, regression, performance, security, and AI-specific evaluation.
- Build and maintain golden datasets, eval suites, and LLM-as-judge frameworks to validate conversational quality across intents, languages, and tenants.
- Define the tenant onboarding QA gate: the certification checklist every new business unit must pass before going live.
- Establish regression strategies for prompt changes, model upgrades, retrieval index updates, and guardrail policy changes.
- Use Langfuse traces to drive evaluation: mine production failures, convert them into test cases, and close the loop with engineering.
- Test NeMo Guardrails configurations against jailbreaks, prompt injection, off-topic drift, and false-positive over-blocking.
- Validate governance and compliance behaviors: data residency, PII handling, regulated-product disclosures, and off-limits topics.
- Build automated test harnesses for Spring AI services, including tool-calling validation, RAG groundedness, and integration with Cosmos DB and MongoDB data layers.
- Partner with the Platform team on quality metrics, SLOs, and the platform eval scorecard.
- Coach feature engineers and tenant teams on writing their own evals, making platform-grade quality self-service over time.
Tech Stack
- AI/Application: Spring AI, Java/Spring Boot
- Data: Cosmos DB (vector and operational), MongoDB
- Observability and evaluation: Langfuse
- Governance and safety: NVIDIA NeMo Guardrails
- CI/CD: standard enterprise pipelines with automated quality gates
Required Experience and Skills
- 6+ years in software QA, with at least 1–2 years testing LLM-based, RAG, or conversational AI systems in production.
- Hands-on experience with LLM observability and evaluation tools such as Langfuse, LangSmith, Arize, or Phoenix.
- Working knowledge of eval frameworks such as Ragas, DeepEval, Promptfoo, or TruLens, including metrics like faithfulness, groundedness, answer relevance, and context precision.
- Practical understanding of how to test non-deterministic systems: golden datasets, semantic similarity, LLM-as-judge, and statistical regression detection.
- Experience testing guardrail or policy frameworks (NeMo Guardrails, Guardrails AI, or similar).
- Solid foundation in API testing, automation frameworks (e.g., pytest, JUnit, Karate, RestAssured), and CI/CD integration.
- Familiarity with Spring and Spring Boot applications and JVM-based services.
- Comfortable writing queries against NoSQL stores (MongoDB, Cosmos DB) for test data setup and trace inspection.
- Strong written communication: able to produce clear test plans, defect reports, and tenant readiness assessments.
Good to Have
- Experience in banking, financial services, or another regulated industry.
- Exposure to multi-tenant platforms: understanding how shared infrastructure changes the testing problem.
- Familiarity with red-teaming, adversarial prompt testing, and prompt injection defense.
- Working knowledge of vector databases, embedding models, and retrieval evaluation.
- Experience with multi-language conversational systems.
- Performance and load testing experience for AI workloads (token throughput, latency percentiles, cost per conversation).
- Contributions to open-source eval or AI testing tooling.
- Experience working with compliance, risk, or audit teams on AI assurance.