The Hidden Cost of Context Windows: Managing Tokens in Production
128k tokens sounds like infinite space until you're paying $0.40 per conversation and users are hitting limits mid-session. Here's how I actually manage context in long-running AI applications.
When you're building an LLM feature for the first time, the context window feels enormous. GPT-4o gives you 128k tokens. Claude gives you 200k. You stuff the conversation history in and move on.
Then you get to production and your p99 costs are six times what you budgeted. A user who's been chatting for 45 minutes hits a wall. Your API latency is creeping up as conversations get longer. Context management goes from "thing I'll deal with later" to "thing that's actively on fire."
Here's how I think about it now.
Count Tokens Before You Send, Not After You Pay
The first thing to instrument: token counts on every request, before the call goes out.
```python
import anthropic

client = anthropic.Anthropic()

def count_tokens(messages: list[dict], system: str = "") -> int:
    response = client.beta.messages.count_tokens(
        model="claude-3-5-sonnet-20241022",
        system=system,
        messages=messages,
    )
    return response.input_tokens
```
```python
# In your request handler
async def build_request(conversation: Conversation) -> dict:
    messages = conversation.to_messages()
    token_count = count_tokens(messages, system=SYSTEM_PROMPT)

    # Log for cost tracking
    logger.info("llm_request", tokens=token_count, conversation_id=conversation.id)

    if token_count > TOKEN_BUDGET:
        messages = await compress_conversation(
            messages, target_tokens=TOKEN_BUDGET, system_prompt=SYSTEM_PROMPT
        )

    return {"messages": messages, "model": MODEL}
```
count_tokens is a lightweight API call — no generation, costs almost nothing. Build this in from day one so you have the data to make good decisions later.
The Sliding Window: Simple but Effective
The easiest strategy: keep only the last N messages. Works well for chatbots where recent context matters more than full history.
```python
def sliding_window(
    messages: list[dict],
    max_tokens: int,
    system_prompt: str,
    min_messages: int = 4,  # Always keep at least the last few turns
) -> list[dict]:
    if count_tokens(messages, system_prompt) <= max_tokens:
        return messages

    # Drop oldest messages (but preserve at least min_messages)
    while (
        len(messages) > min_messages
        and count_tokens(messages, system_prompt) > max_tokens
    ):
        # Remove the oldest user+assistant pair
        messages = messages[2:]

    return messages
```
The gotcha: if the first message established critical context ("I'm a doctor, these are my patients"), dropping it silently changes the model's behaviour. Keep an eye on what you're trimming.
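One mitigation is to pin the opening message and trim from the middle instead. This is a sketch, not from the original post — `sliding_window_pinned` and its injected `count` callable are hypothetical names; the token counter is passed in so you can test offline with a cheap heuristic instead of an API call:

```python
from typing import Callable

def sliding_window_pinned(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[list[dict]], int],  # injected token counter (API call or heuristic)
    pin: int = 1,        # number of leading messages to always keep
    min_tail: int = 4,   # never trim below this many recent messages
) -> list[dict]:
    head, tail = messages[:pin], messages[pin:]
    # Trim oldest pairs from the tail; the pinned head is never dropped
    while len(tail) > min_tail and count(head + tail) > max_tokens:
        tail = tail[2:]  # drop the oldest user+assistant pair
    return head + tail
```

In production, `count` would wrap the count_tokens call from earlier; in tests, a character-count heuristic is enough.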
Summarisation: Compress Rather Than Drop
Better than the sliding window for most cases: summarise old parts of the conversation rather than dropping them. You preserve the semantic content without paying for the full token count.
SUMMARISE_PROMPT = """
The following is the beginning of a longer conversation.
Summarise it in 2–3 sentences, capturing any important context,
decisions made, or facts established. Be concise.
Conversation:
{conversation}
Summary:"""
async def compress_conversation(
messages: list[dict],
target_tokens: int,
system_prompt: str,
) -> list[dict]:
# Split: old messages to summarise + recent messages to keep verbatim
midpoint = len(messages) // 2
to_summarise = messages[:midpoint]
to_keep = messages[midpoint:]
# Generate a summary of the old messages
summary_text = await llm.complete(
SUMMARISE_PROMPT.format(
conversation=format_messages(to_summarise)
)
)
# Replace old messages with a single summary message
summary_message = {
"role": "user",
"content": f"[Earlier conversation summary: {summary_text}]",
}
compressed = [summary_message] + to_keep
# Recurse if still over budget
if count_tokens(compressed, system_prompt) > target_tokens:
return await compress_conversation(compressed, target_tokens, system_prompt)
return compressedThis costs one extra LLM call but users don't hit hard cutoffs mid-conversation. At Allia Health, where therapy sessions could run 90+ minutes with a lot of back-and-forth, this was the only approach that held up.
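The format_messages helper isn't shown above; a minimal version just flattens role/content pairs into plain text for the summarisation prompt:

```python
def format_messages(messages: list[dict]) -> str:
    # Render each message as a "role: content" line for the summarisation prompt
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```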
Cache Aggressively for Repeated Context
If part of your prompt is the same across many requests — a long system prompt, a static knowledge base, reference documents — you're paying to re-process those tokens on every call.
Both Anthropic and OpenAI support prompt caching. With Anthropic, you mark cacheable blocks with a cache_control field on the content block:
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STATIC_SYSTEM_PROMPT,  # 10k tokens of medical guidelines
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=conversation_messages,
)

# Check cache performance
cache_read = response.usage.cache_read_input_tokens        # tokens served from cache
cache_written = response.usage.cache_creation_input_tokens # tokens written to cache (a miss)
```
Cached reads cost ~10% of the normal input token price. For a system prompt that's 8,000 tokens and gets used 1,000 times a day, that's significant. The ephemeral cache lasts 5 minutes, refreshed each time it's hit — enough to cover most conversational use cases.
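To make "significant" concrete, here's a back-of-envelope calculation. The rates are illustrative (Claude 3.5 Sonnet's published prices at the time of writing: $3/MTok input, $0.30/MTok cache reads, $3.75/MTok cache writes — verify against current pricing), and the 50 cache re-writes a day is an assumption about how often the 5-minute TTL expires between conversations:

```python
# Back-of-envelope savings for the 8k-token system prompt above.
# Prices in $ per million input tokens; check your provider's current rates.
PRICE_INPUT = 3.00        # uncached input
PRICE_CACHE_READ = 0.30   # cache hits (~10% of input price)
PRICE_CACHE_WRITE = 3.75  # writing the cache (25% premium over input)

prompt_tokens = 8_000
calls_per_day = 1_000
cache_writes_per_day = 50  # assumed: the 5-min cache expires between bursts

uncached = prompt_tokens * calls_per_day / 1e6 * PRICE_INPUT
cached = (
    prompt_tokens * cache_writes_per_day / 1e6 * PRICE_CACHE_WRITE
    + prompt_tokens * (calls_per_day - cache_writes_per_day) / 1e6 * PRICE_CACHE_READ
)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
```

Under those assumptions, that's roughly $24/day for the system prompt alone uncached versus under $4/day with caching.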
Set a Token Budget Per Feature, Not Per Call
The mistake I made early on: thinking about tokens per API call. The right mental model is tokens per user session or per feature invocation.
```python
from dataclasses import dataclass
from enum import Enum

class Feature(Enum):
    CHAT = "chat"
    CLINICAL_NOTE = "clinical_note"
    DOCUMENT_ANALYSIS = "document_analysis"

@dataclass
class TokenBudget:
    feature: Feature
    input_limit: int
    output_limit: int
    daily_limit_per_user: int

BUDGETS = {
    Feature.CHAT: TokenBudget(
        feature=Feature.CHAT,
        input_limit=16_000,
        output_limit=2_000,
        daily_limit_per_user=100_000,
    ),
    Feature.CLINICAL_NOTE: TokenBudget(
        feature=Feature.CLINICAL_NOTE,
        input_limit=8_000,
        output_limit=1_500,
        daily_limit_per_user=50_000,
    ),
}

async def check_budget(user_id: str, feature: Feature, estimated_tokens: int) -> bool:
    budget = BUDGETS[feature]
    used_today = await redis.get(f"tokens:{user_id}:{feature.value}:{today()}")
    return int(used_today or 0) + estimated_tokens <= budget.daily_limit_per_user
```
This gives you per-feature cost visibility and lets you rate-limit at the token level instead of the request level — which is what actually maps to cost.
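check_budget only reads the counter; something has to write it. A sketch of the recording side — record_usage is a hypothetical counterpart, not from the original, with the redis client and feature key passed in explicitly so it's easy to stub in tests; the 48-hour expiry just stops old daily keys from piling up:

```python
import datetime

async def record_usage(redis, user_id: str, feature_key: str, tokens_used: int) -> int:
    """Increment the user's daily token counter after a successful call."""
    key = f"tokens:{user_id}:{feature_key}:{datetime.date.today().isoformat()}"
    total = await redis.incrby(key, tokens_used)
    await redis.expire(key, 48 * 3600)  # let yesterday's counters age out
    return total
```

Call it after each response, e.g. with `Feature.CHAT.value` and `usage.input_tokens + usage.output_tokens` from the API result.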
The Number That Should Be On Your Dashboard
If I had to pick one metric to watch: cost per active user per day. Not total token spend, not average tokens per request. Cost per active user per day.
It tells you whether your context management is working (should be stable or declining as you optimise), whether certain user segments are dramatically more expensive (power users vs. casual users), and gives you something concrete to set a budget against.
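Computing it from the per-request logs instrumented earlier is only a few lines. A sketch over one day's records — the record shape and the prices here are assumptions for illustration, not from the original:

```python
from collections import defaultdict

# Illustrative rates in $ per million tokens; substitute your model's actual pricing
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

def cost_per_active_user(records: list[dict]) -> dict[str, float]:
    """records: one dict per request for the day,
    e.g. {"user_id": ..., "input_tokens": ..., "output_tokens": ...}"""
    costs = defaultdict(float)
    for r in records:
        costs[r["user_id"]] += (
            r["input_tokens"] / 1e6 * INPUT_PRICE
            + r["output_tokens"] / 1e6 * OUTPUT_PRICE
        )
    return dict(costs)
```

Sorting the result descending is also the quickest way to spot the expensive power-user segment.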
Context management isn't glamorous. But getting it wrong is one of the fastest ways to make an AI product economically unviable. Get the instrumentation in early and you'll always know where you stand.