AI Engineer: Text‑to‑Speech (TTS)

About the job

Create high‑quality, natural‑sounding synthetic speech for Awarri’s platforms, supporting multiple Nigerian languages and voice styles. Design neural TTS architectures, manage voice data pipelines, and deliver scalable TTS services for internal and client‑facing products.

Key Responsibilities

  • Design, train, and fine-tune TTS models such as Tacotron 2, FastSpeech 2, VITS, or diffusion-based architectures.
  • Build data preprocessing pipelines for text normalization, phonemization, silence trimming, and alignment.
  • Implement Montreal Forced Aligner (MFA)-based alignment or alternative alignment strategies for TTS pipelines.
  • Create multi-speaker and multi-language training frameworks with speaker embeddings and conditioning.
  • Optimize models for naturalness, prosody, accent accuracy, and low-latency inference.
  • Build evaluation suites for MOS, prosody accuracy, pronunciation correctness, stability, and noise artifacts.
  • Diagnose failure modes such as misalignment, character-mapping issues, noisy datasets, and inference instability.
  • Implement deployment-ready inference pipelines with vocoders like HiFi-GAN or WaveGlow.
  • Work on optimization, quantization, and streaming inference to reduce GPU cost.
  • Collaborate with data engineering on strategies for building large-scale multilingual TTS datasets.
  • Document experiments, dataset versions, hyperparameters, and reproducibility pipelines.
  • Develop phoneme inventories, pronunciation lexicons, and custom G2P models for underrepresented languages.
  • Handle tonal, orthographic, and dialectal complexity in languages such as Yoruba, Igbo, and Hausa.
  • Use data augmentation, noise reduction, and denoising pipelines to improve training quality.
  • Build tools for voice cloning, speaker adaptation, and style or emotion transfer.

Person Profile

  • Strong background in speech synthesis, signal processing, and deep learning.
  • Hands-on experience training Tacotron, FastSpeech, VITS, or diffusion-based TTS models.
  • Skilled with PyTorch, GPU training, and implementing custom TTS modules.
  • Familiar with MFA, text normalization, phonemization, and G2P systems.

  • Experience with vocoders like HiFi-GAN, WaveGlow, or Parallel WaveGAN.
  • Strong debugging skills for alignment errors, noisy training data, and inference instability.

  • Experience building or cleaning multilingual datasets, especially for low-resource languages.
  • Ability to tune prosody, tone, rhythm, and style using conditioning or embeddings.
  • Writes clean, production-ready code for TTS inference services.
  • Communicates clearly, flags blockers early, and owns tasks end-to-end.
  • Bonus: experience with emotional TTS, diffusion-based TTS, speaker embeddings, and cross-linguistic modeling.