RingDispatch

Glossary

TTS (Text-to-Speech)

Also known as: text-to-speech, AI voice synthesis, TTS engine

Definition

Text-to-speech (TTS) is the technology that converts the AI receptionist's text response into spoken audio played to the caller. Modern TTS engines (ElevenLabs Flash, OpenAI TTS, Google Cloud Text-to-Speech) produce voices that are often hard to distinguish from human recordings across most accents and emotional ranges.

Why it matters

TTS quality is what callers notice first. A robotic, slow-to-respond voice makes callers hang up; a natural, fast one keeps them engaged. The 2020-era 'press 1 for sales' robotic voice is in a different category from a 2026 AI receptionist, which callers often don't recognize as AI until told. TTS also enables voice cloning: synthesizing a custom voice from a short recorded sample of the business owner.

How it works

When the LLM returns a text response, the TTS engine converts it to audio in real time and streams it back to the caller through Twilio. ElevenLabs Flash (RingDispatch's choice) generates speech in 32 languages with native accents, delivers the first audio chunk in under 400 ms, and covers a wide emotional range. Voice cloning (ElevenLabs Instant Voice Cloning) uses a recorded sample of roughly 30 seconds to 1 minute to create a custom voice that is then used on every subsequent call.
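
The streaming step can be illustrated with a minimal Python sketch. It is not RingDispatch's implementation and does not call ElevenLabs or Twilio: `fake_tts_stream` is a hypothetical stand-in for a streaming TTS engine, `send` is a hypothetical callback standing in for the call's audio channel, and the 40-character chunk size is arbitrary. The point it shows is that each audio chunk is forwarded as soon as it is produced, and that the delay before the first chunk is the latency the caller actually perceives.

```python
import time
from typing import Callable, Iterator, Tuple


def fake_tts_stream(text: str, chunk_chars: int = 40) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine.

    Yields placeholder "audio" (just the encoded text) one chunk at a time,
    the way a real engine would yield audio frames.
    """
    for start in range(0, len(text), chunk_chars):
        yield text[start:start + chunk_chars].encode("utf-8")


def stream_reply(
    text: str,
    synthesize: Callable[[str], Iterator[bytes]],
    send: Callable[[bytes], None],
) -> Tuple[float, int]:
    """Forward audio chunks as they arrive; return (first-chunk latency in s, chunk count)."""
    started = time.perf_counter()
    first_chunk_latency = None
    chunk_count = 0
    for chunk in synthesize(text):
        if first_chunk_latency is None:
            # Time until the caller could start hearing audio.
            first_chunk_latency = time.perf_counter() - started
        send(chunk)  # forward immediately instead of waiting for the full reply
        chunk_count += 1
    if first_chunk_latency is None:
        raise ValueError("TTS engine produced no audio")
    return first_chunk_latency, chunk_count


if __name__ == "__main__":
    sent = []
    latency, count = stream_reply(
        "Thanks for calling. How can I help you today?", fake_tts_stream, sent.append
    )
    print(f"{count} chunks sent, first chunk after {latency * 1000:.3f} ms")
```

With a real engine, the same measurement could be compared against a budget such as the 400 ms figure quoted above; the stand-in here returns almost instantly, so its numbers say nothing about real-world latency.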

Examples

  • A solo locksmith's cloned voice answers lockout calls at 2am — callers hear him personally even though he's home asleep.
  • A bilingual salon greets in English then switches to Spanish mid-call — the same TTS voice handles both languages with native accents.
  • A dental practice's friendly female voice books cleaning appointments — patients comment that the AI sounds more pleasant than previous human receptionists.
