Token costs are the dominant expense in AI agent workflows. LLM API calls account for 70–85% of total operating costs, and most teams default to the same frontier model for every task — which overpays by 40–85% on routine work.
Why do AI agent token costs get out of control?
Agentic systems multiply token costs in ways that simple chat interfaces do not. A single user request can trigger retrieval across indexes, multi-step planning loops, tool calls and retries, model-to-model communication, and repeated injection of the same context window across dozens of turns.
A session that starts at 5,000 tokens per call can reach 200,000 tokens per call by turn 50 — and you pay for the accumulated context every time. This is sometimes called context bloat, and it is the primary reason agentic costs surprise teams who estimated from single-call benchmarks.
How do I reduce token costs for Claude Code and other agents?
Enable prompt caching. When the same system prompt, code context, or retrieved documents appear across multiple turns, caching stores that content and serves subsequent hits at roughly 10% of the standard input rate. For Claude Code sessions with large codebases loaded into context, this is the single most impactful change available. Published case studies show 45–80% cost reductions from caching alone.
Route tasks to cheaper models. The price gap between frontier and mid-tier models is large. Routing 70% of requests from the most capable model to a mid-tier alternative reduces LLM costs by roughly 60% in typical agentic workflows. The key is routing by task complexity, not by convenience — simple completions, code formatting, and short Q&A rarely need the most powerful model available.
Reduce context bloat. Audit what your agent actually needs in its context window at each step. Common sources of unnecessary tokens: full conversation history when a summary would suffice, complete file contents when only relevant sections are needed, verbose tool results that can be trimmed, and system prompts that have grown over time without pruning. Cutting irrelevant context reduces cost without changing outputs.
Use retrieval instead of full context injection. Rather than injecting all memory or all retrieved documents into each call, use a retrieval step that pulls only semantically relevant entries. Both approaches produce equivalent answers, but retrieval sends hundreds fewer tokens per call. This is especially impactful for agents with persistent memory or large knowledge bases.
Set per-request and per-session token budgets. Hard limits on max_tokens per request prevent runaway generations. Session-level budgets with alerts at 50% and 80% thresholds let you adjust before costs become a problem. Without these guardrails, a single misconfigured request or retry loop can generate unexpected charges.
What are common mistakes to avoid?
- Using Opus or GPT-4-class models for every task regardless of complexity
- Not enabling prompt caching in sessions where context repeats across turns
- No
max_tokenslimit on individual requests - Letting conversation history grow indefinitely without compression or summarization
- Running Agent mode in Cursor or similar tools for tasks that Chat handles adequately