Why Your LLM App Feels Slow (And It's Not the Model)
An LLM generating 50 tokens/second isn't slow — but if your UI makes the user stare at a spinner for the first 2 seconds, it feels slow. Most LLM latency is a UX problem, not an infrastructure problem.
There's a category of feedback I've gotten multiple times on AI features I've shipped: "it feels slow." And when I dig into it, the actual latency numbers are fine — p50 under 600ms, streaming working correctly, model inference not the bottleneck.
The problem is always the same: perceived performance, the time between a user doing something and getting feedback that something is happening.
I've fixed this enough times that I now have a mental checklist. Here it is.
Start Streaming Immediately, Skip the Thinking Spinner
The classic mistake: show a loading spinner while you wait for the first token, then swap to streaming text.
// What most people build
function ChatMessage() {
  const { isLoading, message } = useChat()
  if (isLoading) return <Spinner /> // 1–2 seconds of nothing
  return <StreamingText text={message} /> // Then text appears
}
The fix: stream from the very first token. There's no reason to show a spinner once you're streaming — the appearing text is the loading indicator.
// What actually feels fast
function ChatMessage() {
  const { isLoading, streamingText, isComplete } = useStreamingChat()
  if (!streamingText && isLoading) {
    // Only show this for the brief moment before first token
    return <TypingIndicator />
  }
  return (
    <div>
      <StreamingText text={streamingText} />
      {isComplete && <MessageActions />}
    </div>
  )
}
The typing indicator (three bouncing dots, or similar) is fine for the first few hundred milliseconds. After the first token arrives, it disappears and text starts rendering. This alone makes most AI chat UIs feel dramatically faster.
Optimistic UI for User Actions
When a user submits a message, add it to the conversation immediately — before the API call even starts. Don't wait for confirmation.
function ChatInput() {
  const { addOptimisticMessage, submitMessage } = useChat()
  const [draft, setDraft] = useState('')

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault()
    // Add user message instantly — don't wait for the API
    addOptimisticMessage({ role: 'user', content: draft })
    setDraft('')
    // Then start the actual request
    await submitMessage(draft)
  }

  return <form onSubmit={handleSubmit}>...</form>
}
With the Vercel AI SDK's useChat, this is built in — user messages appear instantly and the assistant response streams in. If you're building custom, you need to wire this yourself.
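If you are building it yourself, the core of that wiring is local state bookkeeping: append the message immediately with a pending flag, then reconcile once the server responds. A minimal sketch (the Message shape and helper names here are my own, not any SDK's API):

```typescript
type Role = 'user' | 'assistant'

interface Message {
  id: string
  role: Role
  content: string
  pending?: boolean
}

// Called synchronously on submit, before any network request starts
function addOptimistic(messages: Message[], id: string, text: string): Message[] {
  return [...messages, { id, role: 'user', content: text, pending: true }]
}

// Called when the server acknowledges the message; on failure you would
// instead remove the message or mark it as errored for retry
function confirmMessage(messages: Message[], id: string): Message[] {
  return messages.map((m) => (m.id === id ? { ...m, pending: false } : m))
}
```

The pending flag lets you render the optimistic message slightly dimmed until it's confirmed, which keeps the instant feedback honest.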
Pre-warm the Context for Common Flows
If you know a user is likely to trigger an LLM call next, you can start the request early.
Example: a user opens a document editor. There's a ~70% chance they'll click "Generate Summary" in the next 30 seconds. You can speculatively start generating it on document open, cancel if they don't click, use it instantly if they do.
class SpeculativeLoader {
  private pending: AbortController | null = null
  private cached: Promise<string> | null = null

  preload(documentId: string): void {
    // Start the request speculatively
    this.pending = new AbortController()
    this.cached = generateSummary(documentId, {
      signal: this.pending.signal
    })
  }

  async get(documentId: string): Promise<string> {
    if (this.cached) {
      // Already started — just await it
      return this.cached
    }
    // User clicked before we could preload
    return generateSummary(documentId)
  }

  cancel(): void {
    this.pending?.abort()
    this.cached = null
    this.pending = null
  }
}
This is aggressive and you need to be selective about where you use it — speculative LLM calls cost real money. But for high-confidence "next action" predictions, it's a significant UX win.
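One way to enforce that selectivity is to gate preloads behind a confidence threshold and a per-session cap on speculative calls. A sketch of that gating (class name and numbers are illustrative assumptions, not from the loader above):

```typescript
// Gate speculative preloads: require a minimum predicted probability that the
// user will actually trigger the action, and cap total speculative calls to
// bound spend. Both thresholds are placeholders to tune per feature.
class SpeculationBudget {
  private used = 0

  constructor(
    private readonly maxCalls: number,
    private readonly minConfidence: number
  ) {}

  // Returns true (and consumes budget) only when speculating is worth it
  shouldSpeculate(confidence: number): boolean {
    if (confidence < this.minConfidence) return false
    if (this.used >= this.maxCalls) return false
    this.used++
    return true
  }
}
```

Check shouldSpeculate before calling preload, so low-confidence predictions never spend tokens.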
Cache Deterministic Responses
Not all LLM responses are conversational. Some are effectively deterministic: given the same input document, the summary should be the same. Cache those.
import { unstable_cache } from 'next/cache'

const getCachedSummary = unstable_cache(
  async (documentHash: string, content: string) => {
    return generateSummary(content)
  },
  ['document-summary'],
  {
    revalidate: 86400, // 24 hours
    tags: ['summaries'],
  }
)

// Usage
async function DocumentSummary({ doc }: { doc: Document }) {
  // Same document content = cached response, instant load
  const summary = await getCachedSummary(doc.contentHash, doc.content)
  return <p>{summary}</p>
}
For server-side caching in Next.js, unstable_cache is the right tool. For cross-service caching (useful when you have multiple instances), Redis with a content-addressed key:
import hashlib

# Assumes a module-level async Redis client (e.g. redis.asyncio, created with
# decode_responses=True) plus an LLM client and SUMMARY_PROMPT defined elsewhere
async def get_or_generate_summary(content: str) -> str:
    # Deterministic cache key based on content
    cache_key = f"summary:{hashlib.sha256(content.encode()).hexdigest()}"
    cached = await redis.get(cache_key)
    if cached:
        return cached
    summary = await llm.complete(SUMMARY_PROMPT.format(content=content))
    await redis.setex(cache_key, 86400, summary)
    return summary
Render Markdown Incrementally
A streaming LLM response often contains Markdown — headers, bold text, code blocks. If you wait until streaming is complete to render Markdown, users see raw **bold** text turning into formatted text at the end, which is jarring.
Render incrementally. The trick is that most Markdown renderers are designed for complete documents. For streaming, you want a renderer that handles partial content gracefully.
import ReactMarkdown from 'react-markdown'

function StreamingMarkdown({ source }: { source: string }) {
  return (
    <ReactMarkdown
      // react-markdown handles partial markdown reasonably well
      // Code blocks that haven't closed yet render as inline code
      components={{
        code: ({ node, inline, children, ...props }) => {
          if (inline) return <code {...props}>{children}</code>
          return (
            <pre>
              <code {...props}>{children}</code>
            </pre>
          )
        },
      }}
    >
      {source}
    </ReactMarkdown>
  )
}
For code blocks specifically: a half-rendered code block (the LLM is mid-way through generating it) should show what's there so far, not wait for the closing fence. Most Markdown libraries handle this acceptably out of the box.
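If your renderer doesn't cope with an unterminated fence, a cheap workaround is to balance the fences yourself before each render pass. A sketch (the helper name is mine; parity-counting is naive but fine for display purposes):

```typescript
// Close an unterminated triple-backtick fence in a streaming Markdown buffer
// so the partial code block renders as code instead of swallowing later text.
const FENCE = '`'.repeat(3) // the three-backtick fence marker

function closeOpenFence(partial: string): string {
  // An odd number of fence markers means a code block is still open
  const fenceCount = partial.split(FENCE).length - 1
  return fenceCount % 2 === 1 ? partial + '\n' + FENCE : partial
}
```

Run the streamed buffer through closeOpenFence on every token before handing it to the Markdown component; the appended fence is harmless and disappears once the real closing fence arrives.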
The One Metric That Matters for Perceived Speed
Time to First Token (TTFT) — the gap between the user's action and the first piece of content appearing — is the number to optimise. Everything else is secondary.
Users tolerate slow overall generation if they get feedback quickly. They don't tolerate staring at a spinner. A response that starts in 200ms and takes 8 seconds to complete feels faster than one that starts in 2 seconds and takes 4 seconds, even though the latter finishes sooner.
Measure TTFT in production. Put it on a dashboard. It's the most honest signal of how your AI feature actually feels to use.
// Wrap any token stream to record time-to-first-token
async function* withTTFT<T>(stream: AsyncIterable<T>): AsyncGenerator<T> {
  const start = performance.now()
  let firstTokenAt: number | null = null
  for await (const chunk of stream) {
    if (!firstTokenAt) {
      firstTokenAt = performance.now()
      metrics.record('llm.ttft_ms', firstTokenAt - start)
    }
    yield chunk
  }
}
LLM performance work is mostly UX work. The models are fast enough. What needs engineering is the gap between "user acts" and "user sees something."