Glossary
STT (Speech-to-Text)
Also known as: speech-to-text, ASR (automatic speech recognition), STT engine
Definition
Speech-to-text (STT), also called automatic speech recognition (ASR), is the technology that converts a caller's spoken audio into text transcripts that the large language model (LLM) can read. Deepgram, OpenAI Whisper, and Google Cloud Speech-to-Text are the dominant engines for AI receptionist applications in 2026.
Why it matters
STT determines whether the AI understands what the caller said. A weak STT engine misses words, mishears proper names, drops the ends of sentences when audio quality is poor, and fails on accents. The result is an AI that keeps asking the caller to repeat themselves, a frustrating experience that costs conversions. Strong STT (Deepgram Nova, the 2026 baseline) handles accents, background noise, mumbling, and partial words. Streaming STT lets the AI start formulating its response while the caller is still talking, which is critical for the sub-second response latency callers expect.
How it works
During a call, the caller's audio is streamed in real time from Twilio to the STT engine. The engine returns transcript chunks every 100-300 ms, which feed into the LLM's running context. When the caller pauses, the LLM generates a response; the response is streamed to the text-to-speech (TTS) engine and played back to the caller. RingDispatch uses Deepgram Nova for English and most Latin-script languages, with language-specific models for non-Latin scripts (Mandarin, Korean, Arabic).
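The chunk-and-pause flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not RingDispatch's implementation: `TranscriptChunk`, `TurnBuffer`, and the 700 ms pause threshold are hypothetical names and values chosen for the example, and no real Twilio or Deepgram API is called. The buffer collects streamed transcript chunks and hands a complete utterance to the LLM step once the caller has been silent long enough.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranscriptChunk:
    """One partial transcript from the STT engine (hypothetical shape)."""
    text: str
    end_ms: int  # call-relative time at which this chunk's audio ended


@dataclass
class TurnBuffer:
    """Accumulates chunks and releases an utterance after a pause."""
    pause_threshold_ms: int = 700  # assumed value, tune per deployment
    pending: List[str] = field(default_factory=list)
    last_end_ms: Optional[int] = None

    def add_chunk(self, chunk: TranscriptChunk) -> None:
        # Each chunk extends the running context for the current turn.
        self.pending.append(chunk.text)
        self.last_end_ms = chunk.end_ms

    def poll(self, now_ms: int) -> Optional[str]:
        # Return the full utterance once the caller has paused, else None.
        if not self.pending or self.last_end_ms is None:
            return None
        if now_ms - self.last_end_ms < self.pause_threshold_ms:
            return None
        utterance = " ".join(self.pending)
        self.pending = []
        self.last_end_ms = None
        return utterance


if __name__ == "__main__":
    buf = TurnBuffer()
    buf.add_chunk(TranscriptChunk("I need", end_ms=200))
    buf.add_chunk(TranscriptChunk("a plumber", end_ms=450))
    print(buf.poll(now_ms=600))   # caller may still be talking
    print(buf.poll(now_ms=1200))  # pause detected, utterance is ready
```

In a live system the `poll` check would typically run on a timer or be replaced by the STT engine's own end-of-speech signal; the fixed threshold here only illustrates the idea of waiting for a pause before generating a response.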
Examples
- A caller with a thick regional accent (Boston, New Orleans, Deep South) is transcribed correctly: modern STT models handle US regional accents well.
- A caller in a noisy environment (a job site with power tools running) is still understood: Deepgram's noise-robust models pick out the speech.
- A caller mumbles their phone number: the AI asks for confirmation and re-transcribes the corrected version, much as a human would.
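
The phone-number example can be illustrated with a small confidence check. This is a hedged sketch under several assumptions: the STT engine is assumed to return per-word confidence scores and to emit spoken digits as numeric tokens, the input is modeled as plain `(text, confidence)` pairs rather than any vendor's actual response format, numbers are assumed to be 10-digit US numbers, and `CONFIRM_BELOW` and `next_prompt` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple

CONFIRM_BELOW = 0.85  # assumed confidence cutoff, not a vendor default

Word = Tuple[str, float]  # (transcribed text, confidence between 0 and 1)


def extract_digits(words: List[Word]) -> Tuple[str, Optional[float]]:
    """Collect digits from the transcript and the lowest confidence among them."""
    digits = []
    confidences = []
    for text, confidence in words:
        found = re.sub(r"\D", "", text)  # keep only the digit characters
        if found:
            digits.append(found)
            confidences.append(confidence)
    lowest = min(confidences) if confidences else None
    return "".join(digits), lowest


def next_prompt(words: List[Word]) -> Optional[str]:
    """Decide what the AI should say next, or None if no confirmation is needed."""
    digits, lowest = extract_digits(words)
    if len(digits) != 10:
        return "Sorry, I didn't catch the full number. Could you repeat it?"
    if lowest is not None and lowest < CONFIRM_BELOW:
        spoken = " ".join(digits)  # read the number back digit by digit
        return f"I heard {spoken}. Is that right?"
    return None
```

A transcript with one mumbled group, such as `[("555", 0.95), ("867", 0.62), ("5309", 0.90)]`, triggers the read-back prompt, while a transcript where every digit group scores above the cutoff proceeds without interrupting the caller.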