Building a Voice AI Agent: LiveKit, Deepgram, and the Latency Problem Nobody Talks About

Voice AI sounds straightforward until you're staring at 800ms of lag between a user's question and the agent's first word. Here's how I actually got it under 400ms end-to-end.

Afzal Zubair · January 12, 2026 · 5 min read

Last year I built a voice AI survey agent at Fortell AI. The premise was simple: users answer survey questions by speaking, the agent listens, understands, asks intelligent follow-ups, and synthesises everything into structured analytics. Simple in concept. Brutal in execution.

The hardest part wasn't the LLM. It wasn't even the speech recognition. It was latency — that painfully human thing where a conversation feels wrong if there's more than about 500ms between your last word and the other person's first.

Here's what I actually learned.

The Pipeline That Seems Obvious (But Isn't)

The naive approach is to treat voice AI as a linear pipeline:

Audio in → STT → LLM → TTS → Audio out

And it works. Until you measure it. On a good day, a basic implementation looks like:

  • Deepgram STT: ~200–400ms (including end-of-utterance detection)
  • GPT-4o API call (first token): ~300–600ms
  • ElevenLabs TTS (generate + stream start): ~400–800ms

That's 900ms to 1.8 seconds before the user hears a single syllable. That's not a conversation — that's a phone tree.

The Architecture That Actually Works

The insight that changed everything: don't wait for steps to finish before starting the next one.

# Bad: sequential — each stage waits for the previous one to finish
transcript = await stt.transcribe(audio)
response = await llm.complete(transcript)
audio = await tts.synthesize(response)

# Better: stream at every stage
async def handle_utterance(audio_stream):
    # Start transcribing as audio comes in
    async for transcript in stt.stream(audio_stream):
        if transcript.is_final:
            sentence_so_far = ""
            # Start the LLM as soon as we have the transcript
            async for token in llm.stream(transcript.text):
                sentence_so_far += token
                # Dispatch TTS at each sentence boundary, not on the full response
                if is_sentence_end(sentence_so_far):
                    await tts.stream_sentence(sentence_so_far)
                    sentence_so_far = ""

The key is sentence-level TTS dispatch. Don't wait for the LLM to finish generating the full response. As soon as you have a complete sentence ("That's a great point.", "Let me ask you something."), pipe it to TTS immediately. The user starts hearing audio while the LLM is still generating the rest.
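The boundary check itself can stay simple. Here's a rough punctuation heuristic for streaming text (`is_sentence_end` and its abbreviation list are my own illustrative choices, not from any library, and it assumes English punctuation):

```python
import re

# Sentence ends with . ! or ?, optionally followed by a closing quote/bracket.
_SENTENCE_END = re.compile(r"[.!?]['\")\]]?\s*$")

def is_sentence_end(text: str) -> bool:
    """True if the accumulated text ends at a plausible sentence boundary."""
    stripped = text.rstrip()
    if not _SENTENCE_END.search(stripped):
        return False
    # Avoid firing on common abbreviations like "e.g." or "Dr."
    words = stripped.split()
    last_word = words[-1].lower() if words else ""
    return last_word not in {"e.g.", "i.e.", "dr.", "mr.", "ms.", "etc."}
```

It will still misfire occasionally (decimal numbers, unusual abbreviations), but for TTS dispatch a false boundary just means a slightly earlier audio chunk, which is a cheap failure mode.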

With this approach my pipeline went from ~1.4s to ~380ms. Most of that is irreducible network time.

LiveKit Makes the Real-Time Layer Manageable

Before LiveKit I was doing raw WebSocket audio streaming. It works, but you end up reimplementing a lot: track management, reconnection logic, echo cancellation configuration, TURN server setup.

LiveKit abstracts all of that. Here's the minimal setup for a voice agent:

import asyncio

from livekit import agents
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import cartesia, deepgram, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),               # Voice activity detection
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(voice="sonic-english"),
        chat_ctx=initial_context,            # your pre-built chat context
    )

    assistant.start(ctx.room)
    await asyncio.sleep(1)
    await assistant.say("Hey! Ready when you are.", allow_interruptions=True)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

What I like about this: allow_interruptions=True. The agent will stop speaking mid-sentence if the user starts talking. That's the difference between a voice assistant and a voice chatbot.
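Under the hood, interruption handling boils down to cancelling in-flight playback the moment VAD detects the user speaking. A minimal sketch of the pattern (the `Speaker` class and its method names are hypothetical, not LiveKit's API):

```python
import asyncio

class Speaker:
    """Cancel in-flight TTS playback when the user starts talking."""

    def __init__(self):
        self._playback: asyncio.Task | None = None

    def say(self, playback_coro):
        # Play audio in the background so we stay responsive to VAD events
        self._playback = asyncio.create_task(playback_coro)

    def on_user_speech_start(self):
        # Barge-in: stop speaking mid-sentence
        if self._playback is not None and not self._playback.done():
            self._playback.cancel()
```

The real complexity lives downstream of that `cancel()`: you also have to discard queued TTS audio and roll the conversation state back to what the user actually heard.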

End-of-Utterance Detection is Harder Than It Sounds

Deepgram gives you is_final on transcripts, but "final" means the model is confident in the words — not that the person has finished speaking. A user saying "I think the product is... good" will trigger is_final on "I think the product is" before they finish the sentence.

You need VAD (Voice Activity Detection) on top. Silero VAD is lightweight and accurate enough for most use cases. The heuristic that worked for me:

import asyncio

SILENCE_THRESHOLD_MS = 700  # Wait 700ms of silence before treating as utterance end

class UtteranceDetector:
    def __init__(self):
        self.buffer = []
        self._pending_flush: asyncio.Task | None = None

    def on_vad_event(self, event: VADEvent):
        if event.type == "start":
            # User resumed speaking — cancel any pending flush
            if self._pending_flush is not None:
                self._pending_flush.cancel()
                self._pending_flush = None
        elif event.type == "end":
            # Speech stopped — flush only if the silence actually lasts
            self._pending_flush = asyncio.create_task(self._flush_after_silence())

    async def _flush_after_silence(self):
        await asyncio.sleep(SILENCE_THRESHOLD_MS / 1000)
        self.flush_utterance()  # only reached if no new speech cancelled us

700ms is the sweet spot I landed on after a lot of user testing. Less than that and you interrupt people who pause to think. More than that and the conversation feels sluggish.

The Model That Surprised Me

I assumed GPT-4o was the right model for this — smart enough to handle nuanced follow-up questions. It is smart enough. But it's also slow enough to hurt latency noticeably.

After testing, GPT-4o-mini with a well-crafted system prompt performed comparably for the survey domain and shaved ~150ms off the TTFT (time to first token). For most conversational use cases where the LLM is essentially doing comprehension + response generation (not complex reasoning), the smaller model is fine.

Groq with Llama 3.1 70B is worth a look if latency is critical — inference is fast enough to feel meaningfully different.
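If you want to run the same comparison yourself, TTFT is easy to measure against any token stream. This hypothetical helper works with whatever generator your SDK's streaming response gives you (the `measure_ttft` name and signature are my own, not part of any SDK):

```python
import time
from typing import Iterator

def measure_ttft(token_stream: Iterator[str]) -> float:
    """Milliseconds from starting to consume the stream until the first
    non-empty chunk arrives. Wrap a streaming LLM response in a generator
    that yields text chunks and pass it in."""
    start = time.monotonic()
    for token in token_stream:
        if token:
            return (time.monotonic() - start) * 1000
    return float("inf")  # stream ended without producing text
```

Run it a few dozen times per model and compare medians, not single samples — TTFT variance across requests is often larger than the gap between models.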

What I'd Do Differently

  • Start with Cartesia TTS, not ElevenLabs. ElevenLabs is higher quality but latency is noticeably worse for streaming. Cartesia is built for real-time and it shows.
  • Log every stage's latency from day one. You can't optimise what you can't measure. I added per-stage timing to every request before anything else.
  • Build the interruption handling early. It's much harder to retrofit. Interruption support changes how you buffer audio, how you track state, and how you handle partial TTS — all of it.
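Per-stage timing doesn't need much machinery. A small context manager around each stage is enough — a sketch, where `stage_timings` and `timed` are my own names:

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock latency for one pipeline stage, in milliseconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        stage_timings[stage] = (time.monotonic() - start) * 1000

# Usage inside the pipeline:
# with timed("stt"):
#     transcript = await stt.transcribe(audio)
# with timed("llm_ttft"):
#     first_token = await anext(llm.stream(transcript))
```

In production you'd ship these to your metrics backend per request rather than a dict, but the shape is the same: one timer per stage, recorded even when the stage raises.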

Voice AI is genuinely one of the harder problems in software right now. The technology is capable. The hard part is making it feel natural, and that's almost entirely a latency and UX problem, not a model problem.
