Token waste is the gap between the tokens an agent actually needed and the tokens it consumed. In agentic workflows, that gap can be large. A session that could complete a task in 20,000 tokens sometimes uses 200,000, not because the task required more work but because of how context accumulates, how retries compound, and which models are chosen for which steps.
What is AI agent token waste?
Token waste is any token consumption that does not contribute to task completion. It takes several forms:
Context bloat occurs when an agent carries more information in its context window than the current step requires. Every tool result, every prior turn, every retrieved document that gets injected wholesale adds to the input cost of every subsequent call — even when most of it is no longer relevant.
Retry waste occurs when a tool fails or returns an ambiguous result and the agent re-invokes the same prompt without changing the approach. Each retry pays the full input cost of the accumulated context, plus additional output tokens for a response that still does not resolve the problem.
Model over-provisioning occurs when a high-cost frontier model handles tasks that a cheaper model would complete at the same quality. The price difference between the most and least expensive available models is commonly 50–100x.
Unconstrained output occurs when no max_tokens limit is set and the model generates more than the task requires. Extended thinking, verbose tool results, and open-ended system prompts all contribute to output token bloat.
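To see why context bloat compounds, it helps to do the arithmetic. The sketch below assumes a simplified agent that resends its full accumulated context on every call and appends a fixed number of tokens per turn. The function name and the figures are illustrative, not measurements from a real system.

```python
def cumulative_input_tokens(base_tokens: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed across a session when every call
    resends the full accumulated context (simplified model)."""
    total = 0
    context = base_tokens
    for _ in range(turns):
        total += context            # this call pays for everything so far
        context += added_per_turn   # tool results and replies are appended
    return total


# Hypothetical session: 2,000-token starting prompt, 1,500 tokens added per turn.
print(cumulative_input_tokens(2_000, 1_500, 10))  # 87500
print(cumulative_input_tokens(2_000, 0, 10))      # 20000 if nothing accumulated
```

Under these assumptions, ten turns bill more than four times the input tokens of a session whose context stays flat, even though the work per turn is unchanged.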
How do I find token waste in my agents?
Log token counts at the span level. Each individual LLM call should record input tokens, output tokens, and which model was used. Without this, you can see that a session was expensive but not which steps drove the cost.
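One possible shape for span-level records is sketched below. The SpanRecord fields and the tokens_by_step helper are hypothetical names chosen for illustration. In practice the token counts would come from the usage data your model provider returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """One LLM call within a session (hypothetical schema)."""
    session_id: str
    step: str            # e.g. "plan", "search", "summarize"
    model: str
    input_tokens: int
    output_tokens: int


def tokens_by_step(spans):
    """Sum input and output tokens per step so expensive steps stand out."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        totals[span.step]["input"] += span.input_tokens
        totals[span.step]["output"] += span.output_tokens
    return dict(totals)


spans = [
    SpanRecord("s1", "plan", "model-a", 1_200, 300),
    SpanRecord("s1", "search", "model-b", 4_000, 150),
    SpanRecord("s1", "search", "model-b", 6_500, 200),
]
print(tokens_by_step(spans))
```

Grouping by step is only one option. The same records can be grouped by model or by session to answer different questions.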
Plot input token counts across turns in a session. A rising count across turns without a corresponding increase in task complexity indicates context bloat. The context window is filling with accumulated history that is not being pruned.
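If plotting is not convenient, a rough numeric check can flag the same pattern. The helpers below are a minimal sketch with hypothetical names. They take the per-turn input token counts for one session and report how much the context grew and whether it grew on every turn.

```python
from typing import Sequence


def context_growth_factor(input_tokens_per_turn: Sequence[int]) -> float:
    """Ratio of the last turn's input tokens to the first turn's."""
    if len(input_tokens_per_turn) < 2 or input_tokens_per_turn[0] == 0:
        return 1.0
    return input_tokens_per_turn[-1] / input_tokens_per_turn[0]


def grows_every_turn(input_tokens_per_turn: Sequence[int]) -> bool:
    """True when each turn sends strictly more input than the one before."""
    pairs = zip(input_tokens_per_turn, input_tokens_per_turn[1:])
    return all(later > earlier for earlier, later in pairs)


turns = [2_000, 3_500, 5_000, 6_500, 8_000]
print(context_growth_factor(turns))  # 4.0
print(grows_every_turn(turns))       # True
```

What counts as a worrying growth factor depends on the task, so any threshold applied to these values is a judgment call rather than a fixed rule.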
Look for repeated tool calls with identical arguments. The same tool appearing multiple times in a session with the same inputs is a retry loop signature. Each iteration consumes the full context cost again.
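A simple way to surface this signature is to count identical (tool name, arguments) pairs within a session. The sketch below assumes tool calls are logged as a name plus a JSON-serializable argument dictionary. The function name and the log shape are assumptions for illustration.

```python
import json
from collections import Counter


def repeated_tool_calls(calls, min_repeats=2):
    """Return (tool, serialized args) pairs seen at least min_repeats times.

    `calls` is an iterable of (tool_name, args_dict) tuples.
    Arguments are serialized with sorted keys so that key order
    does not hide an otherwise identical call.
    """
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in calls
    )
    return {key: n for key, n in counts.items() if n >= min_repeats}


session_calls = [
    ("search", {"query": "rate limits", "top_k": 5}),
    ("read_file", {"path": "config.yaml"}),
    ("search", {"top_k": 5, "query": "rate limits"}),  # same call, keys reordered
]
print(repeated_tool_calls(session_calls))
```

Some repeated calls are legitimate, for example polling a job status, so a match here is a prompt to investigate rather than proof of a retry loop.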
Check the output-to-input token ratio. For most coding and reasoning tasks, output tokens should be a fraction of input tokens. A ratio above 0.5, or one that rises over the course of a session, often signals unconstrained generation, verbose response formatting, or extended thinking that the task did not require.
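The ratio check can be automated once per-call token counts are logged. The following is a minimal sketch with hypothetical function names. It uses the 0.5 threshold mentioned above as a default, which you would likely tune for your own workload.

```python
def output_input_ratio(input_tokens: int, output_tokens: int) -> float:
    """Output tokens divided by input tokens (0.0 when input is zero,
    to avoid dividing by zero)."""
    return output_tokens / input_tokens if input_tokens else 0.0


def flag_high_ratio(calls, threshold=0.5):
    """Return the indexes of calls whose ratio exceeds the threshold.

    `calls` is a sequence of (input_tokens, output_tokens) tuples.
    """
    return [
        index
        for index, (inp, out) in enumerate(calls)
        if output_input_ratio(inp, out) > threshold
    ]


calls = [(4_000, 400), (4_500, 3_000), (5_000, 600)]
print(flag_high_ratio(calls))  # [1]
```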
Compare per-session costs to expected task complexity. A task that takes ten turns to complete should not cost ten times more than one that takes two turns, unless the later turns require significantly more context. Tracking cost per completed task (not cost per session) reveals efficiency problems that per-session averages obscure.
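The difference between the two metrics is easy to show with a small calculation. The sketch below assumes each session record carries a cost and a flag for whether the task was completed. Both field names are hypothetical.

```python
def cost_per_session(sessions):
    """Average cost across all sessions, completed or not."""
    if not sessions:
        return None
    return sum(s["cost_usd"] for s in sessions) / len(sessions)


def cost_per_completed_task(sessions):
    """Total spend divided by the number of tasks that actually finished.

    Failed sessions still count toward spend, so they raise this figure.
    """
    completed = sum(1 for s in sessions if s["completed"])
    if completed == 0:
        return None
    return sum(s["cost_usd"] for s in sessions) / completed


sessions = [
    {"cost_usd": 0.5, "completed": True},
    {"cost_usd": 1.5, "completed": False},  # expensive session that failed
    {"cost_usd": 1.0, "completed": True},
]
print(cost_per_session(sessions))         # 1.0
print(cost_per_completed_task(sessions))  # 1.5
```

In this made-up example the per-session average looks acceptable, while the per-completed-task figure shows that the failed session's spend is being absorbed by the tasks that succeeded.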
What are common mistakes to avoid?
- Using aggregate monthly billing as the only cost signal
- Not logging which model handled each span within a session
- Allowing conversation history to grow without periodic summarization or pruning
- No max_tokens set on individual requests in agentic workflows
- Treating high session costs as normal without investigating whether the work required it