Memory Architecture for AI Agents: How to Give Claude, GPT and Gemini Durable Memory
A production-tested three-layer memory system that turns stateless models into agents that actually remember what you told them last week.
AI agents need three things to remember across sessions: a fast index for relevance, a structured store for typed facts, and a semantic search layer for fuzzy recall. We use Obsidian plus INDEX hub-files plus Chroma for our agent Linky. Here is how it works in production.
Why Agents Forget
Every language model is stateless. Claude, GPT and Gemini start each new session with a context window that is completely empty except for the system prompt you give them. The conversation you had yesterday is gone. The correction the user made last Tuesday is gone. The architectural decision your team agreed on three weeks ago is gone unless the model is told about it again.
This is not a bug. The model has no internal disk. The weights are frozen, and its only working memory is the context window, which has to be rebuilt and resent on every inference call for as long as the session lasts. The moment a session ends, that working memory is discarded.
The naive fix is to dump everything into the prompt. Paste the last 200 conversations, the entire wiki, every Slack thread, and ask the model to figure out what is relevant. This fails for three concrete reasons. First, context windows have token limits, and even a 1M-token window costs real money at inference time. Second, relevance degrades when you stuff a prompt with noise, because the model's attention is finite and gets spread across irrelevant material. Third, latency grows, because every additional input token adds to the time-to-first-token of every single response.
What you actually need is a memory architecture that lives outside the model and is loaded selectively. The agent decides what to read into context each turn. The rest stays on disk, indexed and searchable, ready when needed.
The Three-Layer Architecture
After two years of building production agents, we converged on a memory system with three distinct layers, each solving a different recall problem.
Layer 1 is the INDEX hub-file. It is a markdown file that lists every fact the agent knows about, one fact per line, with a wikilink to the file that contains the detail. In practice we split it into one index per memory type, as described in the next section. The agent reads the index at session start. It is a fast-relevance scan that fits inside a few thousand tokens and tells the agent what exists.
Layer 2 is a set of topic-specific markdown files. Each file is a typed entry: feedback, project, reference, user note, or skill. The agent loads only the files that match the current task, pulled by wikilink from the index. This is the structured store. The file system is the database.
Layer 3 is a Chroma vector database. Every markdown file is embedded and stored at .chroma in the memory folder. When a user asks "what did we decide about X" and the answer is not findable by keyword, the agent runs a semantic search and retrieves the top-k most similar entries. This is the fuzzy recall layer for when the agent does not know what to look for.
The three layers complement each other. Index handles known-knowns. Markdown handles structured retrieval. Chroma handles unknown-unknowns. Skip any one of the three and the agent either forgets, hallucinates, or burns tokens reading the wrong files.
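The way the three layers hand off to each other can be sketched in a few lines. This is a minimal illustration, not our production code: the function name `recall`, the `load_entry` and `semantic_search` callables, and the keyword-overlap test are all assumptions made for the example. In a real agent the model itself reads the index and decides which wikilinks to follow; the keyword match here only stands in for that decision.

```python
import re
from typing import Callable

WIKILINK = re.compile(r"\[\[([^\]]+)\]\]")


def recall(
    query: str,
    index_lines: list[str],
    load_entry: Callable[[str], str],
    semantic_search: Callable[[str], list[str]],
) -> list[str]:
    """Return the text of the memory entries selected for `query`."""
    # Ignore very short words so "is" or "the" never count as a match.
    words = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 3}

    # Layer 1: scan the one-line index for entries that share a word with the query.
    targets = []
    for line in index_lines:
        link = WIKILINK.search(line)
        if link and words & set(re.findall(r"\w+", line.lower())):
            targets.append(link.group(1))

    # Layer 3: nothing matched by keyword, so fall back to semantic search.
    if not targets:
        targets = semantic_search(query)

    # Layer 2: load only the detail files that were selected.
    return [load_entry(name) for name in targets]
```

The point of the ordering is cost: the index scan is cheap and runs first, the vector search runs only when the index has nothing to offer, and detail files are read last and only for the entries that survived selection.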
INDEX Hub-Files Explained
The index is the part most teams get wrong. A flat list of "everything I remember" does not scale past about 100 entries. After that the model starts skimming and missing things.
The fix is typed indexes. We maintain five separate index hub-files, one per memory type. Currently our agent has 212 feedback entries, 65 project entries, 21 references, 8 user notes and 6 skills. Each is a one-line entry, kept under 200 characters, ending with a wikilink to the detail file.
A feedback entry looks like this:
- Always send client files directly in Telegram, not as a Drive link [[feedback_klant_files_direct_telegram]]
A project entry looks like this:
- November CV Generator: €2/CV via Moneybird, only report unique counts [[project_november_cv]]
The agent loads the index for the relevant type at session start. If the user asks something about a client project, the agent reads the project index. If the user gives correction-style feedback, the agent reads the feedback index. The full detail file is only loaded when the wikilink is followed.
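A small parser makes the index format concrete. The sketch below assumes the exact line shape shown in the examples (a leading "- ", a summary, then one trailing wikilink) and enforces the 200-character limit; the names `IndexEntry` and `parse_index` are ours for illustration.

```python
import re
from dataclasses import dataclass

# One entry per line: "- <summary> [[<detail file name>]]"
ENTRY = re.compile(r"^- (?P<summary>.+?) \[\[(?P<target>[^\[\]]+)\]\]$")
MAX_CHARS = 200  # entries are kept under 200 characters


@dataclass
class IndexEntry:
    summary: str
    target: str  # name of the detail file, without extension


def parse_index(text: str) -> list[IndexEntry]:
    """Parse an index hub-file, rejecting malformed or over-long lines."""
    entries = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"not an index entry: {line!r}")
        if len(line) >= MAX_CHARS:
            raise ValueError(f"entry too long ({len(line)} chars): {line!r}")
        entries.append(IndexEntry(match["summary"], match["target"]))
    return entries
```

Running a check like this whenever the agent writes to an index keeps the file scannable: a line that fails to parse is a line the agent will probably skim past later.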
This pattern is roughly equivalent to a B-tree index in a database. The index is small and stays hot. The data lives in pages that are paged in on demand.
Markdown File Conventions
The structured store is just markdown files in a folder. We use Obsidian as the editor because it renders wikilinks natively, but the format is plain text and any editor works. Obsidian is optional; the file structure is what matters.
Each entry file has frontmatter at the top with four fields. The name is the human-readable title. The description is a one-line summary, used for embedding and for the index entry. The type is one of feedback, project, reference, user or skill. The originSessionId records the session in which the entry was created, so you can trace memory back to its source conversation.
The body has two required sections. Why explains the reason the entry exists, the context that produced it, and what problem it solves. How to apply is the operational instruction the agent should follow when this memory becomes relevant.
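To make the conventions checkable, here is a sketch of a validator for one entry file. Several details are assumptions for the example rather than part of the conventions above: the frontmatter is delimited by `---` lines (the usual Obsidian style), each body section starts with a `## ` heading, and the `originSessionId` value and body text in the sample are placeholders. The parser handles flat `key: value` frontmatter only and is not a full YAML reader.

```python
import re

ALLOWED_TYPES = {"feedback", "project", "reference", "user", "skill"}
REQUIRED_FIELDS = ("name", "description", "type", "originSessionId")
REQUIRED_SECTIONS = ("Why", "How to apply")

# Placeholder content; only the field names and section titles come from the text.
SAMPLE = """---
name: November CV Generator
description: €2/CV via Moneybird, only report unique counts
type: project
originSessionId: session-0001
---
## Why
(placeholder) The context that produced this entry.

## How to apply
(placeholder) What the agent should do when this memory is relevant.
"""


def parse_entry(text: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split an entry file into (frontmatter fields, body sections) and validate both."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if match is None:
        raise ValueError("missing frontmatter block")

    fields = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        raise ValueError(f"missing frontmatter fields: {missing}")
    if fields["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {fields['type']}")

    # re.split with a capture group yields [preamble, title, content, title, content, ...].
    parts = re.split(r"^## (.+)$", match.group(2), flags=re.MULTILINE)
    sections = {t.strip(): c.strip() for t, c in zip(parts[1::2], parts[2::2])}
    absent = [s for s in REQUIRED_SECTIONS if not sections.get(s)]
    if absent:
        raise ValueError(f"missing body sections: {absent}")
    return fields, sections
```

Calling `parse_entry(SAMPLE)` returns the four frontmatter fields and the two body sections; dropping either section or using an unknown type raises `ValueError`.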
Daily logs are different. They live in a dated folder and are append-only. The agent never edits a past day's log; it only writes new entries to the current day. This makes the log a reliable audit trail and prevents the agent from rewriting its own history under prompt pressure.
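The append-only rule is easy to enforce in code if the only write path the agent gets is an append to the current day. The sketch below assumes a layout of one folder per day containing a single `log.md`; that layout, the timestamp format, and the function name are illustrative choices, not a description of our exact setup.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def append_to_daily_log(root: Path, entry: str, now: Optional[datetime] = None) -> Path:
    """Append one timestamped line to the current day's log and return its path."""
    # `now` is injectable for testing; in normal use the caller leaves it unset.
    now = now or datetime.now()
    # Assumed layout: <root>/<YYYY-MM-DD>/log.md, one folder per day.
    log_path = root / now.strftime("%Y-%m-%d") / "log.md"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file; earlier lines are never rewritten.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {now.strftime('%H:%M')} {entry.strip()}\n")
    return log_path
```

Because the path is derived from the clock rather than passed in by the agent, the agent has no way to name a past day's file through this function.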
If you want to see what a writable agent file system looks like in practice, our walkthrough of building an Etsy shop with AI agents in 48 hours shows the same pattern applied to a project, not a personal memory store.
Semantic Search Layer
The third layer is the one that catches the things the index does not know to surface. We run a local Chroma instance, stored at ~/memory/.chroma/, with one collection per memory type plus a global collection for cross-type search.
Every time the agent writes or updates a markdown file, a hook embeds the new content via OpenAI's text-embedding-3-small model and upserts it into Chroma. The embedding cost is negligible: about $0.02 per million tokens, and most updates are well under 1,000 tokens.
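A write hook along these lines could look as follows. Treat it as a sketch under stated assumptions: the function names, the metadata fields, and the choice to use the file name as the Chroma id are ours for illustration, and the upsert into the global cross-type collection is left out for brevity. The `hnsw:space` metadata key asks Chroma for cosine distance; check it against your installed chromadb version, since the way to configure the distance function has changed between releases.

```python
import hashlib
from pathlib import Path


def build_upsert_payload(path: Path, text: str, memory_type: str) -> dict:
    """Build the id/document/metadata arguments for Chroma's Collection.upsert()."""
    return {
        # The file stem is the id, so rewriting a file replaces its old vector.
        "ids": [path.stem],
        "documents": [text],
        "metadatas": [{
            "type": memory_type,
            "path": str(path),
            # Content hash, useful for skipping re-embedding of unchanged files.
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }],
    }


def on_file_written(path: Path, memory_type: str) -> None:
    """Hook: embed a just-written markdown file and upsert it into Chroma."""
    # Third-party imports stay local so the helper above has no dependencies.
    import chromadb
    from openai import OpenAI

    text = path.read_text(encoding="utf-8")
    # OpenAI() reads OPENAI_API_KEY from the environment.
    response = OpenAI().embeddings.create(model="text-embedding-3-small", input=[text])

    store = chromadb.PersistentClient(path=str(Path("~/memory/.chroma").expanduser()))
    collection = store.get_or_create_collection(
        name=memory_type, metadata={"hnsw:space": "cosine"}
    )
    collection.upsert(
        embeddings=[response.data[0].embedding],
        **build_upsert_payload(path, text, memory_type),
    )
```

Using upsert rather than add is what makes the hook safe to run on every save: a second write of the same file overwrites its vector instead of creating a duplicate.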
At query time, the agent constructs a search call when the user's question is open-ended. Examples: "what did Mike say about pricing", "have we talked about Bali villas before", "is there anything in memory about Chroma performance". The agent embeds the query, retrieves the top five hits with their cosine similarity scores, and then reads the matching markdown files in full.
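The read side is short. The sketch below assumes the query has already been embedded with the same model used at write time, and that the collection was created with cosine distance, in which case Chroma reports distance as 1 minus cosine similarity. The function name `top_hits` is ours; `collection` is any object with Chroma's `query` method.

```python
from typing import Any, Sequence


def top_hits(
    collection: Any, query_embedding: Sequence[float], k: int = 5
) -> list[tuple[str, float]]:
    """Return (entry id, cosine similarity) for the k nearest memory entries."""
    result = collection.query(query_embeddings=[list(query_embedding)], n_results=k)
    # Chroma returns one inner list per query; one query was sent, so take index 0.
    ids = result["ids"][0]
    distances = result["distances"][0]
    # Assumes a cosine-space collection, where distance = 1 - cosine similarity.
    return [(entry_id, 1.0 - distance) for entry_id, distance in zip(ids, distances)]
```

The returned ids are the detail file names, so the agent can go straight from a hit to reading the matching markdown file in full. If the collection uses Chroma's default distance instead of cosine, the conversion line no longer yields a similarity and should be dropped.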
The pattern is RAG, but the retrieval target is the agent's own memory rather than a customer's knowledge base. The embedding model is identical. The retrieval logic is identical. The difference is that the corpus grows organically as the agent works.
Why Not Just Use AGENTS.md or CLAUDE.md
Both AGENTS.md and CLAUDE.md are useful, but they are read-only inputs to the agent, not a memory system. They tell the agent how to behave in general; they cannot record what happened yesterday.
The official Claude Code memory documentation describes how the CLI reads a CLAUDE.md file at the project root and the user level. That is excellent for instructions: tone, conventions, file paths, hard rules. It is not designed to grow as a function of usage. If you put 500 corrections inside CLAUDE.md it becomes unreadable for both the human and the model.
A real memory system must be writable by the agent itself, typed so different memory categories can be queried separately, and indexed so the agent can find a relevant entry by scanning a few thousand tokens of one-line summaries instead of reading every detail file. AGENTS.md and CLAUDE.md are the static configuration. The three-layer system on top of them is the durable memory.
Once both exist together, the agent behaves as if it actually remembers you. Not because the model has changed, but because the architecture around the model carries the state the model itself cannot hold.
FAQ
How do AI agents store memory across sessions?
AI agents store memory across sessions using an external persistence layer outside the model itself. The model's context window resets every session, so durable memory lives in a file system, a vector database, or both. In production we use a three-layer system: a fast INDEX hub-file that lists every fact in one line, a structured markdown store with typed entries, and a Chroma vector database for semantic recall. The agent reads the index first, pulls the specific markdown files it needs, and runs vector search when the request is fuzzy.
What is the difference between context window and persistent memory?
The context window is the temporary scratch space the model uses inside a single session. It vanishes the moment the session ends. Persistent memory is an external store, typically files or a vector database, that the agent reads from at the start of every session and writes to during the session. Context window is volatile and bounded by token limits. Persistent memory is durable and effectively unbounded, but the agent has to choose what to load into the context window each turn.
Does Claude have built-in long-term memory?
Claude does not have built-in long-term memory in the model itself. Claude Code, Anthropic's CLI, reads a CLAUDE.md file at session start and supports auto-memory files in a designated memory folder. That gives the agent a starting context, but it is a read pattern, not an active memory system. To get real persistent memory, you build a writable architecture around Claude where the agent updates structured files and indexes after each meaningful interaction.
How big should an AI agent memory system be?
There is no fixed size, but the structure matters more than the volume. Our agent's index hub-file currently lists 212 feedback entries, 65 project entries, 21 references, 8 user notes and 6 skills, each as a one-line wikilink under 200 characters. The detail files behind those entries can be as long as needed. The constraint is that the index plus the daily log must fit comfortably into the model's context window at session start, typically under 25,000 tokens combined.
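The budget is easy to sanity-check with arithmetic. The sketch below uses a rough rule of thumb of four characters per token, which is an assumption and not a real tokenizer, and takes the worst case where every entry sits at the 200-character limit.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text`, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_budget(index_text: str, daily_log_text: str, budget: int = 25_000) -> bool:
    """Check that index plus daily log stay under the session-start budget."""
    return estimate_tokens(index_text) + estimate_tokens(daily_log_text) < budget


# Worst case for the counts quoted above: every entry at the 200-character limit.
entry_count = 212 + 65 + 21 + 8 + 6           # 312 entries
worst_case_index = "x" * (entry_count * 200)  # 62,400 characters
```

Under that heuristic the worst-case index comes to about 15,600 tokens, which leaves roughly 9,400 tokens of the 25,000 budget for the daily log.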
Build Your Own, or Have Us Build It
The architecture above runs in production every day for our own agent and for client agents we have deployed. The full system is replicable in any stack that supports a file system and an embedding API. If you want to see what this looks like for your team or product, take a look at what we ship and how we approach memory in client engagements.
If you would rather skip the build phase, email michaelmartina@linkaiagency.com and we will scope a deployment for your agent stack.