Glossary

STT (Speech-to-Text)

Also known as: speech-to-text, ASR (automatic speech recognition), STT engine

Definition

Speech-to-text (STT), also called automatic speech recognition (ASR), is the technology that converts a caller's spoken audio into a text transcript that the LLM can read. Deepgram, OpenAI Whisper, and Google Cloud Speech-to-Text are the dominant engines for AI receptionist applications in 2026.

Why it matters

STT determines whether the AI understands what the caller said. A weak STT engine misses words, mishears proper names, drops the ends of sentences when audio quality is poor, and fails on accents. The result is an AI that asks callers to repeat themselves, a frustrating experience that hurts conversion. A strong engine (Deepgram Nova, the 2026 baseline) handles accents, background noise, mumbling, and partial words. Streaming STT also lets the AI start formulating its response while the caller is still talking, which is critical for the sub-second response latency callers expect.

How it works

During a call, the caller's audio is streamed in real time from Twilio to the STT engine. The engine returns transcript chunks every 100 to 300 ms, which are appended to the LLM's running context. When the caller pauses, the LLM generates a response, which is streamed to the text-to-speech (TTS) engine and played back to the caller. RingDispatch uses Deepgram Nova for English and most Latin-script languages, with language-specific models for languages written in non-Latin scripts (Mandarin, Korean, Arabic).
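
The turn-taking step described above can be sketched in a few lines of Python. This is a simplified illustration, not RingDispatch's or Deepgram's actual implementation: the TranscriptChunk shape, the split_into_turns helper, and the 700 ms pause threshold are all assumptions made for the example. It treats an empty chunk as silence and closes the caller's turn once silence has lasted long enough.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class TranscriptChunk:
    """One partial result from the STT engine (hypothetical shape)."""
    text: str      # words recognized in this chunk; empty means silence
    start_ms: int  # chunk start, in ms since the call began
    end_ms: int    # chunk end, in ms since the call began


def split_into_turns(chunks: Iterable[TranscriptChunk],
                     pause_ms: int = 700) -> Iterator[str]:
    """Yield one caller utterance each time silence lasts at least pause_ms.

    pause_ms is an assumed threshold chosen for illustration only.
    """
    buffer: List[str] = []
    last_speech_end: Optional[int] = None
    for chunk in chunks:
        words = chunk.text.strip()
        if words:
            # Speech: extend the running transcript for this turn.
            buffer.append(words)
            last_speech_end = chunk.end_ms
        elif buffer and chunk.end_ms - last_speech_end >= pause_ms:
            # Enough silence since the last words: the turn is complete
            # and would be handed to the LLM at this point.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # The stream ended mid-turn; flush whatever was heard.
        yield " ".join(buffer)


chunks = [
    TranscriptChunk("hi I need", 0, 200),
    TranscriptChunk("a plumber", 200, 450),
    TranscriptChunk("", 450, 700),
    TranscriptChunk("", 700, 950),
    TranscriptChunk("", 950, 1200),
    TranscriptChunk("", 1200, 1450),
    TranscriptChunk("tomorrow morning", 1450, 1700),
]
turns = list(split_into_turns(chunks))
```

With the sample chunks, the silence after "a plumber" crosses the threshold at 1200 ms, so the first turn closes there and "tomorrow morning" becomes a second turn. A production system would more likely rely on the STT engine's own end-of-speech signal than on a fixed timer.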

Examples

A caller with a thick regional accent (Boston, New Orleans, the Deep South) is transcribed correctly; modern STT models handle US regional accents well.
A caller in a noisy environment, such as a job site with power tools running, is still understood; Deepgram's noise-robust models pick out the speech.
A caller mumbles their phone number; the AI asks for confirmation and transcribes the corrected version, much as a human would.
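
The phone number case can be illustrated with a small read-back check. This is a minimal sketch under stated assumptions: the helper names extract_digits and confirmation_prompt are hypothetical, the word list covers only single English digits, and the 10-digit length assumes a US number. If the transcript does not yield a complete number, the function returns None so the caller can be asked to repeat it.

```python
import re
from typing import Optional

# Spoken forms of single digits; "oh" is commonly said for zero.
_WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def extract_digits(transcript: str) -> str:
    """Pull digits out of a transcript, whether spoken as words or numerals."""
    tokens = re.findall(r"[a-z]+|\d", transcript.lower())
    digits = []
    for token in tokens:
        if token.isdigit():
            digits.append(token)
        elif token in _WORD_TO_DIGIT:
            digits.append(_WORD_TO_DIGIT[token])
    return "".join(digits)


def confirmation_prompt(transcript: str,
                        expected_len: int = 10) -> Optional[str]:
    """Return a read-back question, or None if the number looks incomplete.

    expected_len=10 assumes a US phone number (an assumption for this sketch).
    """
    digits = extract_digits(transcript)
    if len(digits) != expected_len:
        return None
    # Read the digits back one at a time so the caller can catch mistakes.
    return f"I heard {' '.join(digits)}. Is that right?"


prompt = confirmation_prompt("it's 415 555 0123")
incomplete = confirmation_prompt("five five five")
```

Reading the digits back individually gives the caller a chance to correct a single misheard digit without restating the whole number.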

Related