A
AgentKick
Back to blog
Production AI agents·

Self-healing agent output: deterministic checks first, a judge model second

An agent that produces structured output (a report, a chart spec, a table) will sometimes produce it wrong. A column the schema requires goes missing, an enum value falls outside the allowed set, a chart ships with empty series that render as nothing. You can ask the model to re-check its own work, but doing that on every response is slow and the model is an unreliable judge of itself. Here is the two-pass repair design we use instead, and the rule that keeps it from backfiring.

Pass one uses no model at all

The first pass is deterministic. It is a set of hand-written checkers that each know one rule the output has to satisfy, paired with fixers that repair the specific failure they detect. A checker confirms an alarm table carries its category column; its fixer injects that column from the source data when it is absent. A checker confirms a chart's series are non-empty; its fixer drops or flags the empty ones. These run on every response because they are cheap, deterministic, and add no latency worth measuring. Most real defects are mechanical, so this pass clears most of them without ever calling a model.

Pass two spends a judge only where it earns its cost

Some defects are semantic and no checker can see them: an answer that is structurally valid but misreads the question, or a summary that drops the one number that mattered. For those we run a second pass with a judge model, and we gate it hard. Pass two runs only on high-priority output paths. Before it touches anything, it judges the current reply against the acceptance criteria, and if that judgement already passes, it skips the fixers and ships what exists. The judge is expensive, so we spend it only when there is a real chance of improving the answer.

The revert rule

A fixer can make output worse, and the guard against that is to judge again after patching. The after-judge runs the same criteria on the patched reply, and a patch is kept only if it improved the flagged issues without introducing new ones. If the patched version trips a hard-failure criterion the original did not, the whole patch set is reverted and the pre-patch reply ships. A repair step that cannot demonstrate it helped does not get to land.

One repair the system is never allowed to make

If the data needed to answer is missing, the fixers must not invent a plausible value to fill the gap. The required output is an explicit statement that the data was unavailable, with the reason. A fabricated number that happens to look right is the most expensive kind of wrong, because nobody downstream knows to distrust it. Suppressing the gap is a silent failure; declaring it is an honest one, and only the second is acceptable in a system people make decisions from.

Why split it this way

The split exists because the two passes have different economics. Deterministic checks are free and catch the common, mechanical failures, so they run always. The judge is costly and catches the rare, semantic failures, so it runs selectively and proves its work before and after. Treating every defect as a job for the model is how you get an agent that is both slow and untrustworthy.

Where AgentKick fits

We build production AI agent systems where output validation, repair, and honest failure are part of the architecture rather than a layer bolted on after the demo. If you are hardening an agent whose answers people actually act on, that is the work we do, usually as a fixed-scope AI Agent Production-Readiness Review into a phased build.

aillmagentsevals

Working on something like this?

Tell us the system, the timeline, and a budget range. You will get a feasibility note and rough sizing within one business day, or an honest no.

Tell us about your project