We truncated tool output to fit the context window. It silently corrupted the agent's answers.
Every tool-calling agent eventually hits the same wall. A tool returns more data than you want to put back into the model's context: an alarm query comes back with a few hundred rows, a list endpoint returns every device on a site, and a single tool result is suddenly larger than the budget you can afford to spend on it.
The obvious fix is to truncate. It is also the one that quietly breaks the agent.
Why truncation corrupts answers
We started by capping each tool result at a fixed character count. The output stayed under budget and the agent kept responding, so it looked like it worked. It did not. Truncating a JSON payload mid-structure leaves the model holding a fragment that no longer parses as the data it represents. The model does not raise an error. It improvises, answers confidently from the half of the records it can see, and presents that as the whole picture. For a monitoring agent reporting how many sites had alarms, "47 of the rows I received" silently becomes "47 sites," and nobody downstream can tell the count is wrong.
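The failure is easy to reproduce. The sketch below assumes a hypothetical alarm row shape and a made-up character cap; it only illustrates the mechanism, not our actual tool output.

```typescript
// Hypothetical alarm rows; the field names are assumptions for illustration.
interface AlarmRow {
  site: string;
  severity: string;
}

const rows: AlarmRow[] = Array.from({ length: 100 }, (_, i) => ({
  site: `site-${i}`,
  severity: "major",
}));

const payload = JSON.stringify(rows);

// Naive budget enforcement: cut the serialized text at a fixed character count.
const CHAR_CAP = 1000; // hypothetical cap
const truncated = payload.slice(0, CHAR_CAP);

// The fragment is no longer valid JSON, so a strict consumer would reject it.
let parses = true;
try {
  JSON.parse(truncated);
} catch {
  parses = false;
}

// A language model does not fail this way; it reads whatever survives the cut.
// Counting the "site" keys left in the fragment gives an upper bound on the
// rows it can still see.
const visibleRows = (truncated.match(/"site":/g) ?? []).length;

console.log({ total: rows.length, visibleRows, parses });
```

A parser fails loudly on the fragment. The model has no such reflex, so the only signal that rows are missing is one nobody sees.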
A byte limit is the wrong thing to optimize for. What has to survive is the data's meaning, and character truncation is indifferent to it.
Compress fields, not characters
The fix is to compress by field. For each tool we wrote a compactResult step that knows the shape of that tool's output and keeps only the business-relevant fields, dropping internal IDs, repeated enum labels, and metadata the model never reasons over. It preserves the full record count. A hundred alarm records that would be tens of kilobytes of raw JSON compress to roughly 15 KB of the fields that carry meaning, and the model handles that comfortably without losing rows.
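As a rough illustration, a compactResult step for an alarm tool might look like the sketch below. The raw record shape and every field name are assumptions made up for the example; only the idea (keep the meaningful fields, keep every row, report the full count) comes from the approach described above.

```typescript
// Raw shape a hypothetical alarm tool might return (all field names assumed).
interface RawAlarm {
  internalId: string;
  site: string;
  device: string;
  severity: string;
  severityLabel: string; // repeated enum label the model never needs
  raisedAt: string;
  meta: Record<string, unknown>;
}

// The compact shape keeps only the fields the model reasons over.
type CompactAlarm = Pick<RawAlarm, "site" | "device" | "severity" | "raisedAt">;

interface Compacted<T> {
  count: number; // always the full record count
  records: T[];
}

// Drops fields, never rows: one compact record per raw record.
function compactResult(raw: RawAlarm[]): Compacted<CompactAlarm> {
  const records = raw.map(({ site, device, severity, raisedAt }) => ({
    site,
    device,
    severity,
    raisedAt,
  }));
  return { count: raw.length, records };
}

// Usage with synthetic data.
const rawAlarms: RawAlarm[] = Array.from({ length: 100 }, (_, i) => ({
  internalId: `a3f9-${i}`,
  site: `site-${i % 20}`,
  device: `dev-${i}`,
  severity: "major",
  severityLabel: "Major alarm",
  raisedAt: "2024-01-01T00:00:00Z",
  meta: { source: "poller", schemaVersion: 3 },
}));

const compacted = compactResult(rawAlarms);
console.log(
  compacted.count,
  JSON.stringify(rawAlarms).length,
  JSON.stringify(compacted).length,
);
```

Because the output is built record by record, it is always valid JSON and the row count the model sees matches the row count the tool returned.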
This enforces a useful discipline. You have to know, per tool, which fields actually inform the answer, and that knowledge has to live somewhere explicit. A field registry per tool is a small price for output you can trust.
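One way to make that knowledge explicit is a registry keyed by tool name. The sketch below is an assumption about how such a registry could look: the tool names, field lists, and the choice to throw on an unregistered tool are all hypothetical.

```typescript
type ToolRecord = Record<string, unknown>;

// Hypothetical registry: tool name -> the fields that inform the answer.
const FIELD_REGISTRY = new Map<string, readonly string[]>([
  ["query_alarms", ["site", "device", "severity", "raisedAt"]],
  ["list_devices", ["site", "device", "model", "online"]],
]);

function compactByRegistry(tool: string, records: ToolRecord[]): ToolRecord[] {
  const fields = FIELD_REGISTRY.get(tool);
  if (fields === undefined) {
    // Design choice in this sketch: an unregistered tool fails loudly instead
    // of passing raw output through unnoticed.
    throw new Error(`No field registry entry for tool "${tool}"`);
  }
  // One output record per input record; only registered fields are copied.
  return records.map((record) => {
    const kept: ToolRecord = {};
    for (const field of fields) {
      if (field in record) kept[field] = record[field];
    }
    return kept;
  });
}

// Usage with a synthetic record.
const sample = compactByRegistry("query_alarms", [
  { internalId: "x1", site: "A", severity: "major", traceId: "t-9" },
]);
console.log(JSON.stringify(sample));
```

Keeping the lists in one place means adding a tool forces a decision about which of its fields matter, which is the discipline described above.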
A budget with four levels, not one switch
Field compression handles most cases, but a complex turn can still run over budget. Rather than fall back to truncation, we made the budget a pipeline that degrades in stages and applies the least-destructive option that fits:
- Field compression (compactResult): the default, applied to every tool result.
- Retrieval fallback: when prefetched data is too large to inline, store it and let the model pull back only what a step needs by semantic search.
- Hint trimming: drop low-priority guidance blocks from the prompt before touching any real data.
- Hard truncation: the last resort, applied only to the prompt tail, never to a tool result mid-structure.
The ordering is the point. Each level is more lossy than the one above it, so you apply them in sequence and stop at the first that brings you under budget. We target roughly 30,000 tokens of total context (an early 15,000 cap cost us too much grounding) and estimate tokens at about 3.5 characters per token. That is slightly stricter than the common four-characters-per-token rule of thumb for English, but it is still a heuristic: in mixed Chinese and English text a single character can cost more than one token, so the estimate can run low.
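The staged budget can be sketched as a list of stages applied in order until the estimate fits. The 30,000-token target and the 3.5-characters-per-token estimate come from the text; the PromptParts shape, the inline limit, and the placeholder that stands in for real retrieval are assumptions made for the example. Field compression is assumed to have already run on every tool result.

```typescript
const CHARS_PER_TOKEN = 3.5; // heuristic from the text
const TOKEN_BUDGET = 30_000; // target from the text
const INLINE_LIMIT_TOKENS = 5_000; // hypothetical "too large to inline" threshold

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical prompt layout; toolResults are already field-compressed.
interface PromptParts {
  system: string;
  hints: string[];
  toolResults: string[];
  tail: string; // the only part hard truncation may cut
}

interface Stage {
  label: string;
  apply: (p: PromptParts, budget: number) => PromptParts;
}

function totalTokens(p: PromptParts): number {
  return estimateTokens([p.system, ...p.hints, ...p.toolResults, p.tail].join("\n"));
}

const stages: Stage[] = [
  {
    label: "retrieval-fallback",
    // Stand-in only: a real system would store the data and expose search.
    apply: (p) => ({
      ...p,
      toolResults: p.toolResults.map((r, i) =>
        estimateTokens(r) > INLINE_LIMIT_TOKENS ? `[stored as result-${i}; fetch via search]` : r,
      ),
    }),
  },
  { label: "hint-trimming", apply: (p) => ({ ...p, hints: [] }) },
  {
    label: "hard-truncation",
    // Cuts the prompt tail only; tool results are never touched here.
    apply: (p, budget) => {
      const overTokens = totalTokens(p) - budget;
      if (overTokens <= 0) return p;
      const cutChars = Math.ceil(overTokens * CHARS_PER_TOKEN);
      return { ...p, tail: p.tail.slice(0, Math.max(0, p.tail.length - cutChars)) };
    },
  },
];

// Apply stages in order and stop as soon as the estimate fits the budget.
function fitToBudget(initial: PromptParts, pipeline: Stage[], budget: number) {
  let parts = initial;
  const applied: string[] = [];
  for (const stage of pipeline) {
    if (totalTokens(parts) <= budget) break;
    parts = stage.apply(parts, budget);
    applied.push(stage.label);
  }
  return { parts, applied };
}

// Usage: one oversized tool result pushes the prompt over budget.
const demo = fitToBudget(
  {
    system: "You are a monitoring agent.",
    hints: ["h".repeat(3_500)],
    toolResults: ["r".repeat(105_000)],
    tail: "Answer the question.",
  },
  stages,
  TOKEN_BUDGET,
);
console.log(demo.applied, totalTokens(demo.parts));
```

In this run the first stage is enough, so the hints and the prompt tail are left alone. The more destructive stages only execute when everything above them has already failed to fit.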
What this buys you
The agent's answers stay attributable to real data, because no stage silently drops records. When the system does have to shed information, it sheds the parts that were never load-bearing first, and the one irreversible operation is fenced off to a place where it cannot corrupt a tool result. That is the difference between an agent you can put in front of an operations team and one you quietly stop trusting after it miscounts something that mattered.
Where AgentKick fits
We build production AI agent systems where this kind of plumbing (context budgets, grounding that survives compression, answers that trace back to real rows) is the starting scaffolding rather than an afterthought. If you are taking a tool-calling agent from a convincing demo to something a team relies on, that is the work we do, usually as a fixed-scope AI Agent Production-Readiness Review that leads into a phased build.