Token costs are the dominant expense in AI agent workflows. LLM API calls account for 70–85% of total operating costs, and most teams default to the same frontier model for every task — which overpays by 40–85% on routine work.
Why do AI agent token costs get out of control?
Agentic systems multiply token costs in ways that simple chat interfaces do not. A single user request can trigger retrieval across indexes, multi-step planning loops, tool calls and retries, model-to-model communication, and repeated injection of the same context window across dozens of turns.
A session that starts at 5,000 tokens per call can reach 200,000 tokens per call by turn 50 — and you pay for the accumulated context every time. This is sometimes called context bloat, and it is the primary reason agentic costs surprise teams who estimated from single-call benchmarks.
How do I reduce token costs for Claude Code and other agents?
Verify prompt caching is working. When the same system prompt, code context, or retrieved documents appear across multiple turns, caching serves subsequent hits at roughly 10% of the standard input rate. For subscription tools like Claude Code CLI and Cursor, caching is handled automatically by the tool layer — no configuration needed. For direct API deployments, behavior depends on the provider:
- OpenAI (GPT-4o and newer): Fully automatic. No code changes required. The API caches prompt prefixes of 1,024 tokens or more and applies up to 90% discounts on cached tokens. Place stable content (system prompt, instructions) before variable content (user input) to maximize hits.
- Anthropic Claude API: Requires explicit
cache_controlmarkers. Addcache_control={"type": "ephemeral"}to content blocks you want cached. Cache reads cost 10% of the standard input rate; cache writes cost 1.25x. Important 2026 change: the default TTL dropped from 60 minutes to 5 minutes — if your use case has longer gaps between calls, use the extended 1-hour TTL option at 2x write cost to avoid repeated cache misses.
Published case studies show 45–80% cost reductions from caching alone.
Route tasks to cheaper models. The price gap between frontier and mid-tier models is large. Routing 70% of requests from the most capable model to a mid-tier alternative reduces LLM costs by roughly 60% in typical agentic workflows. The key is routing by task complexity, not by convenience — simple completions, code formatting, and short Q&A rarely need the most powerful model available.
Reduce context bloat. Audit what your agent actually needs in its context window at each step. Common sources of unnecessary tokens: full conversation history when a summary would suffice, complete file contents when only relevant sections are needed, verbose tool results that can be trimmed, and system prompts that have grown over time without pruning. Cutting irrelevant context reduces cost without changing outputs.
Use retrieval instead of full context injection. Rather than injecting all memory or all retrieved documents into each call, use a retrieval step that pulls only semantically relevant entries. Both approaches produce equivalent answers, but retrieval sends hundreds fewer tokens per call. This is especially impactful for agents with persistent memory or large knowledge bases.
Set per-request and per-session token budgets. Hard limits on max_tokens per request prevent runaway generations. Session-level budgets with alerts at 50% and 80% thresholds let you adjust before costs become a problem. Without these guardrails, a single misconfigured request or retry loop can generate unexpected charges.
What are common mistakes to avoid?
- Using Opus or GPT-4-class models for every task regardless of complexity
- Not verifying cache hit rates in API deployments — subscription tools handle this automatically, but direct API integrations require correct prompt structure or explicit cache_control headers
- No
max_tokenslimit on individual requests - Letting conversation history grow indefinitely without compression or summarization
- Running Agent mode in Cursor or similar tools for tasks that Chat handles adequately
