The Hidden Cost of Context Windows: Managing Tokens in Production
128k tokens sounds like infinite space until you're paying $0.40 per conversation and users are hitting limits mid-session. Here's how I actually manage context in long-running AI applications.
When you're building an LLM feature for the first time, the context window feels enormous. GPT-4o gives you 128k tokens. Claude gives you 200k. You stuff the conversation history in and move on.
Then you get to production and your p99 costs are six times what you budgeted. A user who's been chatting for 45 minutes hits a wall. Your API latency is creeping up as conversations get longer. Context management goes from "thing I'll deal with later" to "thing that's actively on fire."
Here's how I think about it now.
Count Tokens Before You Send, Not After You Pay
The first thing to instrument: token counts on every request, before the call goes out.
```python
import anthropic

client = anthropic.Anthropic()

def count_tokens(messages: list[dict], system: str = "") -> int:
    response = client.beta.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        system=system,
        messages=messages,
    )
    return response.input_tokens
```
```python
# In your request handler
async def build_request(conversation: Conversation) -> dict:
    messages = conversation.to_messages()
    token_count = count_tokens(messages, system=SYSTEM_PROMPT)

    # Log for cost tracking
    logger.info("llm_request", tokens=token_count, conversation_id=conversation.id)

    if token_count > TOKEN_BUDGET:
        messages = await compress_conversation(
            messages, target_tokens=TOKEN_BUDGET, system_prompt=SYSTEM_PROMPT
        )

    return {"messages": messages, "model": MODEL}
```
count_tokens is a lightweight API call — no generation, costs almost nothing. Build this in from day one so you have the data to make good decisions later.
The Sliding Window: Simple but Effective
The easiest strategy: keep only the last N messages. Works well for chatbots where recent context matters more than full history.
```python
def sliding_window(
    messages: list[dict],
    max_tokens: int,
    system_prompt: str,
    min_messages: int = 4,  # Always keep at least the last few turns
) -> list[dict]:
    if count_tokens(messages, system_prompt) <= max_tokens:
        return messages

    # Drop oldest messages (but preserve at least min_messages)
    while (
        len(messages) > min_messages
        and count_tokens(messages, system_prompt) > max_tokens
    ):
        # Remove the oldest user+assistant pair
        messages = messages[2:]

    return messages
```
The gotcha: if the first message established critical context ("I'm a doctor, these are my patients"), dropping it silently changes the model's behaviour. Keep an eye on what you're trimming.
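One mitigation is to pin the opening message and trim from the middle instead. This is a sketch, not from the original post — `sliding_window_pinned` and its injected `count` callable are hypothetical names; the token counter is passed in so you can test offline with a cheap heuristic instead of an API call:

```python
from typing import Callable

def sliding_window_pinned(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[list[dict]], int],  # injected token counter (API call or heuristic)
    pin: int = 1,        # number of leading messages to always keep
    min_tail: int = 4,   # never trim below this many recent messages
) -> list[dict]:
    head, tail = messages[:pin], messages[pin:]
    # Trim oldest pairs from the tail; the pinned head is never dropped
    while len(tail) > min_tail and count(head + tail) > max_tokens:
        tail = tail[2:]  # drop the oldest user+assistant pair
    return head + tail
```

In production, `count` would wrap the count_tokens call from earlier; in tests, a character-count heuristic is enough.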
Summarisation: Compress Rather Than Drop
Better than the sliding window for most cases: summarise old parts of the conversation rather than dropping them. You preserve the semantic content without paying for the full token count.
SUMMARISE_PROMPT = """
The following is the beginning of a longer conversation.
Summarise it in 2–3 sentences, capturing any important context,
decisions made, or facts established. Be concise.
Conversation:
{conversation}
Summary:"""
async def compress_conversation(
messages: list[dict],
target_tokens: int,
system_prompt: str,
) -> list[dict]:
# Split: old messages to summarise + recent messages to keep verbatim
midpoint = len(messages) // 2
to_summarise = messages[:midpoint]
to_keep = messages[midpoint:]
# Generate a summary of the old messages
summary_text = await llm.complete(
SUMMARISE_PROMPT.format(
conversation=format_messages(to_summarise)
)
)
# Replace old messages with a single summary message
summary_message = {
"role": "user",
"content": f"[Earlier conversation summary: {summary_text}]",
}
compressed = [summary_message] + to_keep
# Recurse if still over budget
if count_tokens(compressed, system_prompt) > target_tokens:
return await compress_conversation(compressed, target_tokens, system_prompt)
return compressedThis costs one extra LLM call but users don't hit hard cutoffs mid-conversation. At Allia Health, where therapy sessions could run 90+ minutes with a lot of back-and-forth, this was the only approach that held up.
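The format_messages helper isn't shown above; a minimal version just flattens role/content pairs into plain text for the summarisation prompt:

```python
def format_messages(messages: list[dict]) -> str:
    # Render each message as a "role: content" line for the summarisation prompt
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```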
Cache Aggressively for Repeated Context
If part of your prompt is the same across many requests — a long system prompt, a static knowledge base, reference documents — you're paying to re-process those tokens on every call.
Both Anthropic and OpenAI support prompt caching. With Anthropic, you mark cacheable blocks with a cache_control field on the content block:
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STATIC_SYSTEM_PROMPT,  # 10k tokens of medical guidelines
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=conversation_messages,
)

# Check cache performance
cache_read = response.usage.cache_read_input_tokens        # tokens served from cache
cache_written = response.usage.cache_creation_input_tokens # tokens written to cache (a miss)
```
Cached reads cost ~10% of the normal input token price. For a system prompt that's 8,000 tokens and gets used 1,000 times a day, that's significant. The ephemeral cache lasts 5 minutes, refreshed each time it's hit — enough to cover most conversational use cases.
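To make "significant" concrete, here's a back-of-envelope calculation. The rates are illustrative (Claude 3.5 Sonnet's published prices at the time of writing: $3/MTok input, $0.30/MTok cache reads, $3.75/MTok cache writes — verify against current pricing), and the 50 cache re-writes a day is an assumption about how often the 5-minute TTL expires between conversations:

```python
# Back-of-envelope savings for the 8k-token system prompt above.
# Prices in $ per million input tokens; check your provider's current rates.
PRICE_INPUT = 3.00        # uncached input
PRICE_CACHE_READ = 0.30   # cache hits (~10% of input price)
PRICE_CACHE_WRITE = 3.75  # writing the cache (25% premium over input)

prompt_tokens = 8_000
calls_per_day = 1_000
cache_writes_per_day = 50  # assumed: the 5-min cache expires between bursts

uncached = prompt_tokens * calls_per_day / 1e6 * PRICE_INPUT
cached = (
    prompt_tokens * cache_writes_per_day / 1e6 * PRICE_CACHE_WRITE
    + prompt_tokens * (calls_per_day - cache_writes_per_day) / 1e6 * PRICE_CACHE_READ
)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
```

Under those assumptions, that's roughly $24/day for the system prompt alone uncached versus under $4/day with caching.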
Set a Token Budget Per Feature, Not Per Call
The mistake I made early on: thinking about tokens per API call. The right mental model is tokens per user session or per feature invocation.
```python
from dataclasses import dataclass
from enum import Enum

class Feature(Enum):
    CHAT = "chat"
    CLINICAL_NOTE = "clinical_note"
    DOCUMENT_ANALYSIS = "document_analysis"

@dataclass
class TokenBudget:
    feature: Feature
    input_limit: int
    output_limit: int
    daily_limit_per_user: int

BUDGETS = {
    Feature.CHAT: TokenBudget(
        feature=Feature.CHAT,
        input_limit=16_000,
        output_limit=2_000,
        daily_limit_per_user=100_000,
    ),
    Feature.CLINICAL_NOTE: TokenBudget(
        feature=Feature.CLINICAL_NOTE,
        input_limit=8_000,
        output_limit=1_500,
        daily_limit_per_user=50_000,
    ),
}

async def check_budget(user_id: str, feature: Feature, estimated_tokens: int) -> bool:
    budget = BUDGETS[feature]
    used_today = await redis.get(f"tokens:{user_id}:{feature.value}:{today()}")
    return int(used_today or 0) + estimated_tokens <= budget.daily_limit_per_user
```
This gives you per-feature cost visibility and lets you rate-limit at the token level instead of the request level — which is what actually maps to cost.
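check_budget only reads the counter; something has to write it. A sketch of the recording side — record_usage is a hypothetical counterpart, not from the original, with the redis client and feature key passed in explicitly so it's easy to stub in tests; the 48-hour expiry just stops old daily keys from piling up:

```python
import datetime

async def record_usage(redis, user_id: str, feature_key: str, tokens_used: int) -> int:
    """Increment the user's daily token counter after a successful call."""
    key = f"tokens:{user_id}:{feature_key}:{datetime.date.today().isoformat()}"
    total = await redis.incrby(key, tokens_used)
    await redis.expire(key, 48 * 3600)  # let yesterday's counters age out
    return total
```

Call it after each response, e.g. with `Feature.CHAT.value` and `usage.input_tokens + usage.output_tokens` from the API result.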
The Number That Should Be On Your Dashboard
If I had to pick one metric to watch: cost per active user per day. Not total token spend, not average tokens per request. Cost per active user per day.
It tells you whether your context management is working (should be stable or declining as you optimise), whether certain user segments are dramatically more expensive (power users vs. casual users), and gives you something concrete to set a budget against.
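Computing it from the per-request logs instrumented earlier is only a few lines. A sketch over one day's records — the record shape and the prices here are assumptions for illustration, not from the original:

```python
from collections import defaultdict

# Illustrative rates in $ per million tokens; substitute your model's actual pricing
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

def cost_per_active_user(records: list[dict]) -> dict[str, float]:
    """records: one dict per request for the day,
    e.g. {"user_id": ..., "input_tokens": ..., "output_tokens": ...}"""
    costs = defaultdict(float)
    for r in records:
        costs[r["user_id"]] += (
            r["input_tokens"] / 1e6 * INPUT_PRICE
            + r["output_tokens"] / 1e6 * OUTPUT_PRICE
        )
    return dict(costs)
```

Sorting the result descending is also the quickest way to spot the expensive power-user segment.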
Context management isn't glamorous. But getting it wrong is one of the fastest ways to make an AI product economically unviable. Get the instrumentation in early and you'll always know where you stand.