Senior SQA Engineer (LLM), Remote, Pakistan

About the job

  • Design and own the end-to-end QA strategy for the Conversational Banking Platform, covering functional, regression, performance, security, and AI-specific evaluation.
  • Build and maintain golden datasets, eval suites, and LLM-as-judge frameworks to validate conversational quality across intents, languages, and tenants.
  • Define the tenant onboarding QA gate: the certification checklist every new business unit must pass before going live.
  • Establish regression strategies for prompt changes, model upgrades, retrieval index updates, and guardrail policy changes.
  • Use Langfuse traces to drive evaluation: mine production failures, convert them into test cases, and close the loop with engineering.
  • Test NeMo Guardrails configurations against jailbreaks, prompt injection, off-topic drift, and false-positive over-blocking.
  • Validate governance and compliance behaviors: data residency, PII handling, regulated-product disclosures, and off-limits topics.
  • Build automated test harnesses for Spring AI services, including tool-calling validation, RAG groundedness, and integration with Cosmos DB and MongoDB data layers.
  • Partner with the Platform team on quality metrics, SLOs, and the platform eval scorecard.
  • Coach feature engineers and tenant teams on writing their own evals, making platform-grade quality self-service over time.

Tech Stack

  • AI/Application: Spring AI, Java/Spring Boot
  • Data: Cosmos DB (vector and operational), MongoDB
  • Observability and evaluation: Langfuse
  • Governance and safety: NVIDIA NeMo Guardrails
  • CI/CD: standard enterprise pipelines with automated quality gates

Required Experience and Skills

  • 6+ years in software QA, with at least 1–2 years testing LLM-based, RAG, or conversational AI systems in production.
  • Hands-on experience with LLM observability and evaluation tools such as Langfuse, LangSmith, Arize, or Phoenix.
  • Working knowledge of eval frameworks such as Ragas, DeepEval, Promptfoo, or TruLens, including metrics like faithfulness, groundedness, answer relevance, and context precision.
  • Practical understanding of how to test non-deterministic systems: golden datasets, semantic similarity, LLM-as-judge, and statistical regression detection.
  • Experience testing guardrail or policy frameworks (NeMo Guardrails, Guardrails AI, or similar).
  • Solid foundation in API testing, automation frameworks (e.g., pytest, JUnit, Karate, RestAssured), and CI/CD integration.
  • Familiarity with Spring and Spring Boot applications and JVM-based services.
  • Comfortable writing queries against NoSQL stores (MongoDB, Cosmos DB) for test data setup and trace inspection.
  • Strong written communication: able to produce clear test plans, defect reports, and tenant readiness assessments.

Good to Have

  • Experience in banking, financial services, or another regulated industry.
  • Exposure to multi-tenant platforms: understanding how shared infrastructure changes the testing problem.
  • Familiarity with red-teaming, adversarial prompt testing, and prompt injection defense.
  • Working knowledge of vector databases, embedding models, and retrieval evaluation.
  • Experience with multi-language conversational systems.
  • Performance and load testing experience for AI workloads (token throughput, latency percentiles, cost per conversation).
  • Contributions to open-source eval or AI testing tooling.
  • Experience working with compliance, risk, or audit teams on AI assurance.