"Give the agent memory" is underspecified. An agent has at least four kinds of memory, each with a different lifetime and a different store, and conflating them is how you end up with a system that either forgets what it just did or drags an entire conversation history into every model call. Here is the layering we use and the bounds on each.

Within a run: the working set

During a single turn the agent accumulates the messages and tool results it needs to reason, and that working set lives only as long as the turn. It has to fit the token budget, which is its own engineering problem (covered in a separate writeup on context compression). The rule here is that nothing in the working set is durable by default. It exists to answer the current question and is rebuilt each turn from the layers below.
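A minimal sketch of "rebuilt each turn" follows. It assumes history is a list of (role, text) pairs and uses a crude character budget in place of real token counting; `build_working_set` and `budget_chars` are hypothetical names chosen for illustration.

```python
def build_working_set(question, history, budget_chars=8000):
    """Assemble a fresh per-turn message list. Nothing here is persisted."""
    messages = [("user", question)]
    used = len(question)
    # Walk history newest-first so the most recent turns survive a tight budget.
    for role, text in reversed(history):
        if used + len(text) > budget_chars:
            break
        # Insert before the question so chronological order is preserved.
        messages.insert(len(messages) - 1, (role, text))
        messages.insert(0, messages.pop(len(messages) - 2))
        used += len(text)
    return messages
```

The return value is handed to the model call and then discarded; anything that needs to outlive the turn has to be written to one of the layers below explicitly.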

Across turns: bounded conversation history

For follow-up questions the agent needs recent history, and the temptation is to keep all of it. We keep a bounded window instead: the last couple dozen turns, each capped in size, with the whole conversation expiring after a couple of hours and the oldest conversations evicted once we hit a ceiling on how many we hold at once. When a single turn is too long to keep whole, we keep its head and tail and mark the middle as truncated, which preserves the opening context and the most recent content while dropping the part least likely to matter. History is a cache, and a cache needs an eviction policy.
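The bounds above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not our production store: every limit is a placeholder, expiry is measured from the conversation's last activity, sizes are counted in characters, and the names `HistoryStore` and `truncate_middle` are hypothetical.

```python
import time
from collections import OrderedDict, deque

MARKER = "\n[... truncated ...]\n"


def truncate_middle(text, limit):
    """Keep the head and tail of an over-long turn and mark the dropped middle."""
    if len(text) <= limit:
        return text
    keep = max(limit - len(MARKER), 2)
    head = keep // 2
    tail = keep - head
    return text[:head] + MARKER + text[-tail:]


class HistoryStore:
    def __init__(self, max_turns=24, max_turn_chars=4000,
                 ttl_seconds=2 * 60 * 60, max_conversations=1000,
                 clock=time.monotonic):
        self.max_turns = max_turns
        self.max_turn_chars = max_turn_chars
        self.ttl_seconds = ttl_seconds
        self.max_conversations = max_conversations
        self._clock = clock
        # conv_id -> (deque of turns, last_seen), ordered oldest activity first.
        self._convs = OrderedDict()

    def append(self, conv_id, role, text):
        now = self._clock()
        self._expire(now)
        turns, _ = self._convs.pop(conv_id, (deque(maxlen=self.max_turns), now))
        turns.append((role, truncate_middle(text, self.max_turn_chars)))
        # Re-inserting moves the conversation to the most-recently-active end.
        self._convs[conv_id] = (turns, now)
        # Ceiling on held conversations: evict the least recently active.
        while len(self._convs) > self.max_conversations:
            self._convs.popitem(last=False)

    def recent(self, conv_id):
        self._expire(self._clock())
        entry = self._convs.get(conv_id)
        return list(entry[0]) if entry else []

    def _expire(self, now):
        # Entries are ordered by last activity, so stop at the first live one.
        while self._convs:
            conv_id, (_, last_seen) = next(iter(self._convs.items()))
            if now - last_seen <= self.ttl_seconds:
                break
            del self._convs[conv_id]
```

The deque's `maxlen` handles the per-conversation window, and the ordered dict gives both expiry and ceiling eviction from the same end, so each of the four bounds is one line of policy.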

Across conversations: recall, not auto-injection

Some data is worth keeping past a single conversation, but injecting it into every prompt is wasteful and pollutes context. We store the results of past tool calls as snapshots, each with an embedding, and we expose recall as a tool the model calls when it decides it needs something it fetched before. The model asks "did I already pull this?" the same way it asks for new data, and the snapshot comes back by semantic similarity. The vectors here index the agent's own past work, not a document corpus. (The distinction from document RAG is its own topic.)
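Here is a rough sketch of recall-as-a-tool. The `embed` function is a toy bag-of-words stand-in so the example runs without a model; a real system would call an embedding model there. The names `SnapshotStore` and `recall_snapshot`, the `min_score` threshold, and the tool schema shape are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


@dataclass
class Snapshot:
    tool: str
    args: str
    result: str
    vector: Counter


class SnapshotStore:
    def __init__(self):
        self._items = []

    def record(self, tool, args, result):
        """Called after every tool call; indexes the agent's own past work."""
        vector = embed(" ".join([tool, args, result]))
        self._items.append(Snapshot(tool, args, result, vector))

    def recall(self, query, k=3, min_score=0.2):
        """Return up to k snapshots most similar to the query, best first."""
        q = embed(query)
        scored = [(cosine(q, s.vector), s) for s in self._items]
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:k]]


# Hypothetical tool description handed to the model alongside its other tools.
RECALL_TOOL = {
    "name": "recall_snapshot",
    "description": "Search results of earlier tool calls before fetching again.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
```

Nothing is injected unless the model calls the tool, and an empty result is a valid answer: it tells the model to go fetch fresh data.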

Across the system: the learning signal

The fourth layer is not in-band memory at all. Every answer is logged with a quality assessment, and that ledger is what you mine between releases to improve prompts and catch regressions. It does not feed the live conversation; it feeds the next version of the agent. Keeping it separate from conversation memory matters, because the data you want for live context and the data you want for offline analysis are not the same.
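As an illustration of how separate this layer is, the sketch below writes one JSON line per answer and aggregates offline. The field names, the 0.0 to 1.0 `quality` score, and the regression tolerance are assumptions; how the quality assessment is produced is out of scope here.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class LedgerEntry:
    release: str     # agent version that produced the answer
    question: str
    answer: str
    quality: float   # hypothetical 0.0-1.0 score from whatever assessor is in use


def log_answer(sink, entry):
    """Append one entry as a JSON line. The live conversation never reads this."""
    sink.write(json.dumps(asdict(entry)) + "\n")


def mean_quality_by_release(lines):
    """Offline pass over the ledger: average quality per release."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        bucket = totals[row["release"]]
        bucket[0] += row["quality"]
        bucket[1] += 1
    return {release: total / count for release, (total, count) in totals.items()}


def regressed(by_release, old, new, tolerance=0.02):
    """True if the new release scores worse than the old by more than tolerance."""
    return by_release[new] < by_release[old] - tolerance
```

`sink` can be any writable text stream, such as an open file or a queue-backed writer; the only requirement is that it is append-only and never on the read path of a live turn.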

The design rule

For each piece of information, decide how long it should live and who reads it: this turn, this conversation, any conversation, or the engineering team. Most "memory" bugs are a piece of state stored at the wrong layer, kept too long or not long enough. Name the lifetime first, and the storage follows from it.
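One way to make the rule hard to skip is to require a declared lifetime before any state can be stored. The enum and the mapping below are a sketch; the store descriptions are shorthand for the four layers described in this post, and the names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Lifetime(Enum):
    TURN = "this turn"
    CONVERSATION = "this conversation"
    CROSS_CONVERSATION = "any conversation"
    OFFLINE = "the engineering team"


# Storage follows from lifetime, never the other way round.
STORE_FOR = {
    Lifetime.TURN: "working set",
    Lifetime.CONVERSATION: "bounded conversation history",
    Lifetime.CROSS_CONVERSATION: "snapshot recall",
    Lifetime.OFFLINE: "quality ledger",
}


@dataclass(frozen=True)
class StateItem:
    name: str
    lifetime: Lifetime  # must be named before the item can be stored


def store_for(item):
    return STORE_FOR[item.lifetime]
```

For example, `store_for(StateItem("last tool result", Lifetime.TURN))` resolves to the working set, and moving that item to another layer means changing its declared lifetime rather than quietly writing it somewhere else.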

Where AgentKick fits

We design the memory and context architecture for production agents: working set, bounded history, cross-conversation recall, and the offline quality ledger, each with an explicit lifetime. If your agent forgets what it just did, or carries too much into every call, that is the work we do, usually as a fixed-scope AI Agent Production-Readiness Review that leads into a phased build.

What an AI agent should remember, and for how long