Why Your LLM App Feels Slow (And It's Not the Model)
An LLM generating 50 tokens/second isn't slow — but if your UI makes the user stare at a spinner for the first 2 seconds, it feels slow. Most LLM latency is a UX problem, not an infrastructure problem.
There's a category of feedback I've gotten multiple times on AI features I've shipped: "it feels slow." And when I dig into it, the actual latency numbers are fine — p50 under 600ms, streaming working correctly, model inference not the bottleneck.
The problem is always the same: perceived performance, the time between a user doing something and getting feedback that something is happening.
I've fixed this enough times that I now have a mental checklist. Here it is.
Start Streaming Immediately, Skip the Thinking Spinner
The classic mistake: show a loading spinner while you wait for the first token, then swap to streaming text.
// What most people build
function ChatMessage() {
  const { isLoading, message } = useChat()
  if (isLoading) return <Spinner /> // 1–2 seconds of nothing
  return <StreamingText text={message} /> // Then text appears
}
The fix: stream from the very first token. There's no reason to show a spinner once you're streaming — the appearing text is the loading indicator.
// What actually feels fast
function ChatMessage() {
  const { isLoading, streamingText, isComplete } = useStreamingChat()
  if (!streamingText && isLoading) {
    // Only show this for the brief moment before first token
    return <TypingIndicator />
  }
  return (
    <div>
      <StreamingText text={streamingText} />
      {isComplete && <MessageActions />}
    </div>
  )
}
The typing indicator (three bouncing dots, or similar) is fine for the first few hundred milliseconds. After the first token arrives, it disappears and text starts rendering. This alone makes most AI chat UIs feel dramatically faster.
Optimistic UI for User Actions
When a user submits a message, add it to the conversation immediately — before the API call even starts. Don't wait for confirmation.
function ChatInput() {
  const { addOptimisticMessage, submitMessage } = useChat()
  const [draft, setDraft] = useState('')

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault()
    // Add user message instantly — don't wait for the API
    addOptimisticMessage({ role: 'user', content: draft })
    setDraft('')
    // Then start the actual request
    await submitMessage(draft)
  }

  return <form onSubmit={handleSubmit}>...</form>
}
With the Vercel AI SDK's useChat, this is built in — user messages appear instantly and the assistant response streams in. If you're building custom, you need to wire this yourself.
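If you are building it yourself, the core of that wiring is local state bookkeeping: append the message immediately with a pending flag, then reconcile once the server responds. A minimal sketch (the Message shape and helper names here are my own, not any SDK's API):

```typescript
type Role = 'user' | 'assistant'

interface Message {
  id: string
  role: Role
  content: string
  pending?: boolean
}

// Called synchronously on submit, before any network request starts
function addOptimistic(messages: Message[], id: string, text: string): Message[] {
  return [...messages, { id, role: 'user', content: text, pending: true }]
}

// Called when the server acknowledges the message; on failure you would
// instead remove the message or mark it as errored for retry
function confirmMessage(messages: Message[], id: string): Message[] {
  return messages.map((m) => (m.id === id ? { ...m, pending: false } : m))
}
```

The pending flag lets you render the optimistic message slightly dimmed until it's confirmed, which keeps the instant feedback honest.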
Pre-warm the Context for Common Flows
If you know a user is likely to trigger an LLM call next, you can start the request early.
Example: a user opens a document editor. There's a ~70% chance they'll click "Generate Summary" in the next 30 seconds. You can speculatively start generating it on document open, cancel if they don't click, use it instantly if they do.
class SpeculativeLoader {
  private pending: AbortController | null = null
  private cached: Promise<string> | null = null

  preload(documentId: string): void {
    // Start the request speculatively
    this.pending = new AbortController()
    this.cached = generateSummary(documentId, {
      signal: this.pending.signal
    })
  }

  async get(documentId: string): Promise<string> {
    if (this.cached) {
      // Already started — just await it
      return this.cached
    }
    // User clicked before we could preload
    return generateSummary(documentId)
  }

  cancel(): void {
    this.pending?.abort()
    this.cached = null
    this.pending = null
  }
}
This is aggressive and you need to be selective about where you use it — speculative LLM calls cost real money. But for high-confidence "next action" predictions, it's a significant UX win.
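One way to enforce that selectivity is to gate preloads behind a confidence threshold and a per-session cap on speculative calls. A sketch of that gating (class name and numbers are illustrative assumptions, not from the loader above):

```typescript
// Gate speculative preloads: require a minimum predicted probability that the
// user will actually trigger the action, and cap total speculative calls to
// bound spend. Both thresholds are placeholders to tune per feature.
class SpeculationBudget {
  private used = 0

  constructor(
    private readonly maxCalls: number,
    private readonly minConfidence: number
  ) {}

  // Returns true (and consumes budget) only when speculating is worth it
  shouldSpeculate(confidence: number): boolean {
    if (confidence < this.minConfidence) return false
    if (this.used >= this.maxCalls) return false
    this.used++
    return true
  }
}
```

Check shouldSpeculate before calling preload, so low-confidence predictions never spend tokens.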
Cache Deterministic Responses
Not all LLM responses are conversational. Some are effectively deterministic: given the same input document, the summary should be the same. Cache those.
import { unstable_cache } from 'next/cache'

const getCachedSummary = unstable_cache(
  async (documentHash: string, content: string) => {
    return generateSummary(content)
  },
  ['document-summary'],
  {
    revalidate: 86400, // 24 hours
    tags: ['summaries'],
  }
)

// Usage
async function DocumentSummary({ doc }: { doc: Document }) {
  // Same document content = cached response, instant load
  const summary = await getCachedSummary(doc.contentHash, doc.content)
  return <p>{summary}</p>
}
For server-side caching in Next.js, unstable_cache is the right tool. For cross-service caching (useful when you have multiple instances), Redis with a content-addressed key:
import hashlib

# Assumes a module-level async Redis client (e.g. redis.asyncio, created with
# decode_responses=True) plus an LLM client and SUMMARY_PROMPT defined elsewhere
async def get_or_generate_summary(content: str) -> str:
    # Deterministic cache key based on content
    cache_key = f"summary:{hashlib.sha256(content.encode()).hexdigest()}"
    cached = await redis.get(cache_key)
    if cached:
        return cached
    summary = await llm.complete(SUMMARY_PROMPT.format(content=content))
    await redis.setex(cache_key, 86400, summary)
    return summary
Render Markdown Incrementally
A streaming LLM response often contains Markdown — headers, bold text, code blocks. If you wait until streaming is complete to render Markdown, users see raw **bold** text turning into formatted text at the end, which is jarring.
Render incrementally. The trick is that most Markdown renderers are designed for complete documents. For streaming, you want a renderer that handles partial content gracefully.
import ReactMarkdown from 'react-markdown'

function StreamingMarkdown({ source }: { source: string }) {
  return (
    <ReactMarkdown
      // react-markdown handles partial markdown reasonably well
      // Code blocks that haven't closed yet render as inline code
      components={{
        code: ({ node, inline, children, ...props }) => {
          if (inline) return <code {...props}>{children}</code>
          return (
            <pre>
              <code {...props}>{children}</code>
            </pre>
          )
        },
      }}
    >
      {source}
    </ReactMarkdown>
  )
}
For code blocks specifically: a half-rendered code block (the LLM is mid-way through generating it) should show what's there so far, not wait for the closing fence. Most Markdown libraries handle this acceptably out of the box.
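If your renderer doesn't cope with an unterminated fence, a cheap workaround is to balance the fences yourself before each render pass. A sketch (the helper name is mine; parity-counting is naive but fine for display purposes):

```typescript
// Close an unterminated triple-backtick fence in a streaming Markdown buffer
// so the partial code block renders as code instead of swallowing later text.
const FENCE = '`'.repeat(3) // the three-backtick fence marker

function closeOpenFence(partial: string): string {
  // An odd number of fence markers means a code block is still open
  const fenceCount = partial.split(FENCE).length - 1
  return fenceCount % 2 === 1 ? partial + '\n' + FENCE : partial
}
```

Run the streamed buffer through closeOpenFence on every token before handing it to the Markdown component; the appended fence is harmless and disappears once the real closing fence arrives.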
The One Metric That Matters for Perceived Speed
Time to First Token (TTFT) — the gap between the user's action and the first piece of content appearing — is the number to optimise. Everything else is secondary.
Users tolerate slow overall generation if they get feedback quickly. They don't tolerate staring at a spinner. A response that starts in 200ms and takes 8 seconds to complete feels faster than one that starts in 2 seconds and takes 4 seconds, even though the latter finishes sooner.
Measure TTFT in production. Put it on a dashboard. It's the most honest signal of how your AI feature actually feels to use.
// Wrap any token stream to record time-to-first-token
async function* withTTFT<T>(stream: AsyncIterable<T>): AsyncGenerator<T> {
  const start = performance.now()
  let firstTokenAt: number | null = null
  for await (const chunk of stream) {
    if (!firstTokenAt) {
      firstTokenAt = performance.now()
      metrics.record('llm.ttft_ms', firstTokenAt - start)
    }
    yield chunk
  }
}
LLM performance work is mostly UX work. The models are fast enough. What needs engineering is the gap between "user acts" and "user sees something."