Express course · No. 35
Voice changes everything about an AI product. Instead of reading and writing, it has to listen, understand, think, and speak — fast enough to feel like a conversation, not a walkie-talkie. There's a pipeline of three pieces, plus the genuinely hard part: doing it all in real time. Learn the stages, why latency is the central challenge, and what makes a voice agent feel natural instead of robotic.
Essence only · One picture per idea · Latency is everything
A voice AI isn't one thing — it's a chain of three jobs working together. Seeing the pipeline first makes everything else in this course fall into place.
Ear, brain, mouth
A person in a conversation: they hear what you say, think about it, and speak a reply — three distinct acts that feel like one seamless thing.
A voice AI does three things in sequence: it listens (turns your speech into text), it thinks (a model reasons over that text and produces a reply), and it speaks (turns the reply back into audio). Ear, brain, mouth. The middle step is the language model you already know; the new parts are the ear and the mouth on either side. Understanding voice AI starts with seeing it as this three-stage chain, not a single magic box.
Speech-to-text, model, text-to-speech
An assembly line with three stations: one transcribes, one decides, one voices — the work flows from each to the next.
The three stages have names. Speech-to-text (STT) transcribes the user's spoken words into text. The model takes that text and generates a response, exactly the text-in-text-out you've learned throughout. Text-to-speech (TTS) converts the model's text reply into spoken audio. Voice in, transcribe, reason, synthesise, voice out. Most voice AI follows this classic pipeline, and once you can name the three stages you understand the bulk of how it works.
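The three-stage chain can be sketched as three functions called in order. This is a minimal illustration, not a real implementation: `speech_to_text`, `model_reply`, and `text_to_speech` are hypothetical stand-ins, and the "audio" here is just UTF-8 bytes so the example runs without any speech library.

```python
# Hypothetical stand-ins for real STT, model, and TTS services.
def speech_to_text(audio: bytes) -> str:
    """Pretend transcription: the 'audio' is simply UTF-8 bytes of the words."""
    return audio.decode("utf-8")


def model_reply(text: str) -> str:
    """Pretend reasoning step: text in, text out."""
    return f"You said: {text}"


def text_to_speech(text: str) -> bytes:
    """Pretend synthesis: returns bytes standing in for an audio clip."""
    return text.encode("utf-8")


def voice_turn(audio_in: bytes) -> bytes:
    """One conversational turn: ear, then brain, then mouth."""
    transcript = speech_to_text(audio_in)
    reply = model_reply(transcript)
    return text_to_speech(reply)


reply_audio = voice_turn(b"what time is it")
print(reply_audio)
```

Only the middle function deals in text on both sides; the other two are the audio edges.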
The model is the same; the edges are new
You already know how to think and reply in words — learning to converse by voice just adds listening and speaking around that same core.
The reasoning core is the same language model from every other course — context, prompting, tools, structured output, all of it still applies to the text in the middle. What's new is the audio on both ends: turning sound into text, and text into sound, reliably and fast. So you don't relearn the brain; you add the ear and mouth, and learn the one thing that makes voice genuinely hard — doing all three in the tight timing a conversation demands.
A voice AI is a pipeline: speech-to-text (the ear), the model (the brain), and text-to-speech (the mouth). The reasoning core is the model you know; the new parts are the audio edges, and the timing between them.
The first stage turns sound into words. It's remarkably good now, but its imperfections shape everything downstream, so it's worth understanding what it can and can't do.
Transcription: sound becomes text
A court stenographer typing every word as it's spoken — turning the flow of speech into a written record the rest of the system can read.
Speech-to-text (also called transcription or automatic speech recognition) converts spoken audio into written text. It's the ear of the system: the user talks, and STT produces the words the model will reason over. Modern STT is impressively accurate and handles natural, messy speech far better than the clunky voice recognition of years past. This is the step that lets a person simply talk instead of type — the entry point to the whole voice experience.
Accents, noise, and errors flow downstream
If the stenographer mishears one word, every reader after them inherits the mistake — the error is baked in at the start.
STT isn't perfect: accents, background noise, crosstalk, technical terms, and names can all cause mis-transcriptions. And because it's the first stage, its errors propagate — the model reasons over whatever text it was given, so a misheard word can quietly derail the whole response. A voice system is only as good as its transcription. So you design for imperfect input: the model should handle slightly garbled text gracefully, and you watch for the conditions (noise, accents) where the ear struggles most.
Streaming transcription as the user speaks
A stenographer who types continuously as you talk, not one who waits for you to finish the whole speech before starting — the text appears in step with the words.
For a responsive feel, STT can be streaming — transcribing continuously as the user speaks, rather than waiting for them to finish and then processing the whole clip. This matters for speed: streaming lets the system start working sooner and react the moment the user stops, instead of adding a long pause. The choice between transcribe-at-the-end and transcribe-as-you-go is one of the first latency decisions in a voice system, and streaming is usually what makes it feel live.
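The difference between the two styles can be sketched with a generator. This is an illustrative toy, assuming each incoming audio chunk has already been recognised as one word; `transcribe_at_end` and `transcribe_streaming` are hypothetical names, not any library's API.

```python
from typing import Iterable, Iterator, List


def transcribe_at_end(chunks: Iterable[str]) -> str:
    """Batch style: consume the whole clip, then return one transcript."""
    return " ".join(chunks)


def transcribe_streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Streaming style: yield a growing partial transcript after every chunk."""
    words: List[str] = []
    for chunk in chunks:
        words.append(chunk)
        yield " ".join(words)


spoken = ["book", "a", "table", "for", "two"]
partials = list(transcribe_streaming(spoken))
for partial in partials:
    print(partial)
```

The last partial equals the batch transcript, but the streaming version has already handed useful text downstream several times before the user stops talking.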
Speech-to-text is the ear: it transcribes speech into text for the model. Its errors flow downstream, so design for imperfect input — and stream the transcription so the system reacts as the user speaks.
The last stage turns the model's text reply back into a voice. It's the part the user actually hears, so its quality shapes how the whole product feels.
Synthesis: text becomes a voice
A narrator reading a script aloud — turning written words into natural, expressive speech the listener experiences as a voice, not a recital.
Text-to-speech (TTS) converts the model's text response into spoken audio — the mouth of the system. Modern TTS produces remarkably natural voices, with realistic intonation and rhythm, far beyond the flat robotic speech of old. This is what the user hears, so it carries a lot of the product's personality and feel. A great pipeline with a robotic voice still feels cheap; a natural voice makes the whole interaction feel human, which is much of why voice products have become viable.
The voice carries the experience
The same words said warmly or coldly land completely differently — the delivery, not just the content, is what the listener responds to.
In a voice product, how something is said matters as much as what's said. The chosen voice, its tone, its pacing — these are product decisions that shape trust and comfort, the way visual design does in a screen product. A voice that's too fast, too flat, or oddly inflected undermines an otherwise good answer. So TTS isn't an afterthought bolted on at the end; the voice is a core part of the experience, and worth choosing and tuning as deliberately as any other interface element.
Stream the speech as it's generated
Someone who starts speaking their answer as they think of it, rather than silently composing the whole thing first and then delivering it in one block.
Just as STT can stream in, TTS can stream out — beginning to speak the start of the reply while the rest is still being generated, instead of waiting for the full text and full audio before making a sound. This is crucial for responsiveness: streaming the voice means the user hears a reply beginning almost immediately, rather than sitting in silence. Combined with the model streaming its text, this is how a voice agent answers without an awkward gap — the speech starts as soon as there's something to say.
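One common way to stream speech out is to cut the model's token stream into sentences and synthesise each sentence as soon as it is complete. The sketch below assumes that approach; `synthesize` is a hypothetical placeholder for a real TTS call, and splitting on `.`, `!`, and `?` is a deliberately crude rule.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed model tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush trailing text that lacks final punctuation


def synthesize(sentence: str) -> bytes:
    """Hypothetical TTS call; here it just encodes the text."""
    return sentence.encode("utf-8")


def speak_streaming(tokens: Iterable[str]) -> Iterator[bytes]:
    """Yield one audio chunk per completed sentence, without waiting for the rest."""
    for sentence in sentences_from_tokens(tokens):
        yield synthesize(sentence)


tokens = ["Sure", ".", " Your", " table", " is", " booked", " for", " eight", "."]
audio_chunks = list(speak_streaming(tokens))
print(audio_chunks)
```

The first chunk is ready after two tokens, so playback could begin while the model is still writing the second sentence.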
Text-to-speech is the mouth: it voices the model's reply, and modern voices sound natural enough to carry the whole experience. Stream the speech as it's generated so the reply begins almost immediately, not after a silence.
Here is what makes voice genuinely hard. A conversation has a rhythm, and the whole three-stage pipeline has to complete within it — or the magic breaks.
A conversation has a tight timing budget
In real talk, a reply that comes a beat late feels awkward; one that comes after several seconds feels broken. The window for "natural" is small.
People expect a conversational reply to come within a fraction of a second. A delay that would be invisible in a chat app is glaring in speech — a couple of seconds of silence after you finish talking feels like the system froze. This tight latency budget is the defining constraint of voice AI: the entire round trip — hear, transcribe, think, synthesise, speak — has to fit inside the small window a human conversation allows. Miss it, and even a perfect answer feels broken.
The delays of all three stages add up
A relay race where the total time is every runner's leg combined — a slow handoff anywhere makes the whole team late.
The cruel part is that latency accumulates across the pipeline: the time to transcribe, plus the time for the model to think, plus the time to synthesise speech, all stack into the total delay the user feels. Each stage being "fast enough" alone isn't sufficient if their sum blows the budget. This is why voice is harder than text: a text chat only waits on the model, but voice waits on three stages in series, and the conversational timing window is far tighter.
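The accumulation is simple addition, which a few lines make concrete. The stage timings and the one-second budget below are illustrative assumptions, not measurements of any real system.

```python
# Illustrative per-stage delays in milliseconds (assumed, not measured).
STAGE_LATENCY_MS = {
    "speech_to_text": 300,
    "model": 700,
    "text_to_speech": 400,
}
BUDGET_MS = 1000  # assumed conversational budget


def total_latency_ms(stages: dict) -> int:
    """Stages run in series, so their delays add."""
    return sum(stages.values())


def fits_budget(stages: dict, budget_ms: int) -> bool:
    return total_latency_ms(stages) <= budget_ms


each_stage_ok = all(ms < BUDGET_MS for ms in STAGE_LATENCY_MS.values())
print(each_stage_ok)                             # every stage alone is under budget
print(total_latency_ms(STAGE_LATENCY_MS))        # but the sum is what the user feels
print(fits_budget(STAGE_LATENCY_MS, BUDGET_MS))
```

With these numbers every stage passes on its own, yet the round trip overshoots the budget by 400 ms.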
Streaming everything is how you hit the budget
An interpreter who translates as you speak and starts voicing the reply mid-thought — overlapping the stages instead of doing them strictly one after another.
The key technique is to stream and overlap the stages rather than run them strictly in sequence: transcribe as the user speaks, let the model start working on the transcript as it arrives instead of waiting for a finished batch, and begin speaking the reply before it's fully written. Overlapping the three stages, instead of waiting for each to fully finish, is how a voice system collapses the total delay into something conversational. The whole art of low-latency voice is keeping the pipeline flowing continuously rather than stopping and starting at each stage.
A conversation has a tight latency budget, and the three stages' delays add up. Stream and overlap them, so that transcribing, thinking, and speaking run concurrently rather than strictly one after another, to fit the response inside the conversational window.
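A rough timing model shows why overlap helps. It assumes a simplification: when stages run strictly in sequence the user waits for every stage to finish completely, whereas with streaming each stage starts on the first piece its predecessor emits, so the wait until the first audible sound is roughly the sum of the "first output" delays. All numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    first_output_ms: int  # delay until the stage emits its first usable piece
    full_output_ms: int   # delay until the stage has emitted everything


# Illustrative numbers only, not measurements of any real system.
stages = [
    Stage("speech_to_text", first_output_ms=100, full_output_ms=300),
    Stage("model", first_output_ms=200, full_output_ms=900),
    Stage("text_to_speech", first_output_ms=150, full_output_ms=600),
]


def sequential_delay_ms(stages: List[Stage]) -> int:
    """Each stage waits for the previous one to finish completely."""
    return sum(s.full_output_ms for s in stages)


def overlapped_delay_ms(stages: List[Stage]) -> int:
    """Each stage starts on the first piece the previous stage emits."""
    return sum(s.first_output_ms for s in stages)


print(sequential_delay_ms(stages))
print(overlapped_delay_ms(stages))
```

Under these assumptions the same three stages go from 1800 ms of silence to 450 ms, without any single stage getting faster.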
Beyond speed, natural conversation has a choreography of who speaks when. Getting that flow right is what separates a real-feeling voice agent from a clumsy one.
Knowing when the user has finished
A good listener can tell the difference between a pause for breath and the end of a sentence — they don't jump in every time you take a moment.
A voice system has to detect when the user has actually finished speaking versus just paused mid-thought — the problem of knowing when it's the system's turn. Jump in too early and you cut the user off; wait too long and the conversation drags. This turn-taking judgement is surprisingly hard and central to feeling natural: humans do it effortlessly with subtle cues, and a voice agent has to approximate it well enough that the back-and-forth flows instead of stumbling.
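The simplest approximation of turn-taking is a silence timer: treat the turn as finished once the user has been quiet for longer than a threshold. The sketch below assumes the audio has already been split into fixed-length frames labelled speech or silence; the 20 ms frame size and 600 ms threshold are hypothetical values, and real systems tend to use richer cues than silence alone.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    frames: Iterable[bool],
    frame_ms: int = 20,
    min_silence_ms: int = 600,
) -> Optional[int]:
    """Return the time in ms at which the turn is judged finished, or None.

    frames: True where a frame contains speech, False where it is silent.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return (i + 1) * frame_ms
    return None


# 200 ms of speech, a 200 ms pause for breath, 200 ms more speech, then silence.
frames = [True] * 10 + [False] * 10 + [True] * 10 + [False] * 40
print(find_end_of_turn(frames))
```

The short pause does not trigger a reply, but the longer silence does. Lowering `min_silence_ms` makes the agent snappier and more likely to cut the user off; raising it does the opposite.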
Handling interruptions: barge-in
When you start talking over someone, they stop and listen — they don't keep plowing through their sentence as if you said nothing.
Real conversation includes interruption. If the user starts speaking while the AI is talking — barge-in — a natural system stops, listens, and responds to the new input, rather than finishing its scripted reply over the top of them. Supporting interruption is a hallmark of a good voice agent and noticeably absent in a poor one, which talks over you obliviously. Handling barge-in means the system is always listening even while it speaks, ready to yield the floor the instant the user takes it.
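Barge-in can be sketched as a playback loop that checks the microphone before each chunk of the reply. This is a simplified, synchronous illustration: `user_is_speaking` is a hypothetical callback standing in for a live voice detector, and a real system would more likely check continuously on a separate task.

```python
from typing import Callable, List, Sequence


def speak_with_barge_in(
    chunks: Sequence[str],
    user_is_speaking: Callable[[], bool],
) -> List[str]:
    """Play reply chunks one at a time, checking the microphone before each.

    Returns the chunks that were actually played.
    """
    played: List[str] = []
    for chunk in chunks:
        if user_is_speaking():
            break  # yield the floor as soon as the user starts talking
        played.append(chunk)
    return played


reply = ["Your order", "will arrive", "on Tuesday", "between nine and noon"]
mic_readings = iter([False, False, True, True])  # user interrupts before chunk three
played = speak_with_barge_in(reply, lambda: next(mic_readings))
print(played)
```

The smaller the chunks, the faster the agent can stop, which is one more reason to stream speech in short pieces.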
The flow is part of the product
A good conversation partner and an awkward one say similar words — the difference is timing, listening, and giving way. The flow is the experience.
Much of whether a voice agent feels natural or robotic comes down to this choreography — sensing turns, handling interruptions, not leaving dead air or talking over the user. These flow details are as important as the accuracy of the answer, because a correct response delivered with clumsy timing still feels broken. So designing a voice product means designing the conversation, not just the answers: the rhythm of who speaks when is a first-class part of the experience, not a polish step.
Natural voice needs turn-taking — knowing when the user finished — and barge-in — stopping when interrupted. The flow of who speaks when is as much the product as the answers themselves.
The classic three-stage pipeline is being joined by a newer approach: models that handle voice directly, end to end, collapsing the stages to cut latency. It's worth knowing where the field is heading.
Models that take and produce voice directly
Instead of a translator writing down your words, passing the note to a thinker, and handing their answer to a speaker — one person who hears, thinks, and replies all at once.
A newer class of realtime (or speech-to-speech) models handles audio directly: they take voice in and produce voice out, without the separate transcribe-then-reason-then-synthesise steps. Instead of three models in a chain, one model does the whole job. By collapsing the pipeline, this approach can dramatically cut the latency that stacking three stages creates, and can preserve tone and emotion that get lost when speech is flattened into plain text and back.
Fewer stages, less latency
A direct flight instead of two connections — fewer handoffs, far less total time, and nothing lost in the transfers.
The big win of end-to-end realtime models is latency: removing the boundaries between stages removes the delays of passing data between them, getting much closer to the instant back-and-forth of human conversation. It also avoids losing information at each conversion — the model can hear how something was said, not just the words, and reply with matching expression. For the most demanding, natural-feeling voice interactions, this collapsed pipeline is increasingly the approach.
Pipeline or end-to-end: a real choice
You can travel by a series of trains you control each leg of, or one direct service that's faster but gives you less say over the route — different trade-offs, both valid.
The classic pipeline and the end-to-end model are both real options with trade-offs. The pipeline gives you control and flexibility at each stage — swap the STT, inspect the transcript, use any model — at the cost of latency and complexity. The end-to-end model gives speed and naturalness but less visibility and control over the middle. Knowing both exist lets you choose deliberately: the transparent, flexible pipeline, or the fast, fluid end-to-end approach, depending on what your product needs most.
Realtime end-to-end models take voice in and out directly, collapsing the pipeline to slash latency and preserve tone. Pipeline versus end-to-end is a real trade-off: control and flexibility, or speed and naturalness.
Using voice well means choosing it for the right situations and respecting that latency, more than anything, decides whether it feels magical or broken.
Reach for voice when it genuinely fits
You'd rather speak directions while driving and type a long form at your desk — the right input depends entirely on the situation.
Voice is powerful where it fits — hands-free use, accessibility, natural conversation, situations where talking is faster or safer than typing — and a poor fit where text is better, like anything needing precise input, privacy in public, or careful review. The skill is matching the modality to the moment, not adding voice because it's impressive. A voice interface used where text would serve better is a downgrade; used where speaking is the natural act, it's transformative. Choose it for the situation, not the novelty.
Budget for latency from the start
A chef who plans the whole meal around the dish that takes longest — designing around the binding constraint, not discovering it at the end.
Because latency is the make-or-break constraint, design for it from the beginning: choose streaming at every stage, consider an end-to-end model where speed is critical, and have a plan for when a stage is slow. Voice features that ignore latency until late tend to feel sluggish and broken no matter how good the answers are. Treat the response-time budget as the central design constraint of a voice product — the one number that, missed, ruins everything else — and build everything around hitting it.
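One possible plan for a slow stage is to put a time limit on it and say a short holding phrase if the limit passes. The sketch below assumes that approach using `asyncio.wait_for`; `model_reply`, the `FILLER` phrase, and the timings are hypothetical.

```python
import asyncio

FILLER = "One moment while I check that."  # hypothetical holding phrase


async def model_reply(prompt: str, delay_s: float) -> str:
    """Stand-in for a model call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return f"Answer to: {prompt}"


async def reply_within_budget(prompt: str, delay_s: float, budget_s: float) -> str:
    """Return the model reply if it arrives in time, otherwise a holding phrase."""
    try:
        return await asyncio.wait_for(model_reply(prompt, delay_s), timeout=budget_s)
    except asyncio.TimeoutError:
        return FILLER


fast = asyncio.run(reply_within_budget("opening hours", delay_s=0.01, budget_s=0.5))
slow = asyncio.run(reply_within_budget("opening hours", delay_s=0.5, budget_s=0.05))
print(fast)
print(slow)
```

Note that `asyncio.wait_for` cancels the slow call on timeout. A production design would more likely keep the call running in the background and speak the real answer once it arrives, which this sketch leaves out for brevity.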
- Does voice genuinely fit the situation, or would text serve the user better?
- Is each stage streaming, with transcribing, thinking, and speaking overlapping rather than running in strict sequence?
- Does the total latency feel conversational, or is there an awkward gap?
- Does it handle turn-taking (knowing when the user finished) and barge-in?
- Is the voice natural enough to carry the experience?
- Pipeline or end-to-end: have I chosen the architecture for my latency and control needs?

- speech-to-text (STT) / transcription: the ear, turning speech into text.
- text-to-speech (TTS) / synthesis: the mouth, turning the reply into a voice.
- the voice pipeline: STT, model, TTS in sequence.
- streaming: processing each stage continuously rather than waiting for it to finish.
- latency budget: the tight response-time window a conversation allows.
- turn-taking / barge-in: knowing when the user finished, and stopping when interrupted.
- realtime / end-to-end (speech-to-speech) model: handling voice directly, collapsing the pipeline.

- You choose voice for situations where speaking genuinely fits, not for novelty.
- Every stage streams and overlaps, so the reply starts almost instantly.
- The total latency feels conversational, with no awkward silence.
- The agent handles turn-taking and barge-in, so the flow feels natural.
- You chose pipeline or end-to-end deliberately, designing around the latency budget.
Voice means listen, think, and speak fast enough to feel like a conversation. Stream and overlap the pipeline, handle turn-taking and interruptions, choose pipeline or end-to-end deliberately — and treat latency as the constraint everything else bends to.