Every coding agent I know does some version of this: load context, estimate what the model will need, and inject it into the system prompt. Rules, memories, tool schemas, project conventions. All front-loaded into every request.
The problem is that “estimate what the model will need” is guessing. And guessing wrong has two costs. The visible one is tokens: a simple follow-up question carries the same overhead as a complex refactoring task, and that compounds across every turn. The invisible one is attention: when you inject hundreds of lines of project rules, the model processes all of them to find the two or three that matter. Most turns need zero.
This is the same mistake as context compaction, just in reverse. Compaction tries to fit everything by making it smaller. Injection tries to fit everything by putting it there upfront. Both assume the system knows what the model needs. Both are wrong.
Search, do not inject
The fix was obvious once I saw it: stop guessing. Give the model a search tool and let it find its own context.
Acolyte now has a memory toolkit with three operations: search, add, and remove. The model searches for relevant memories when it needs context. It adds memories when it learns something worth keeping. It removes memories when they become stale. This only works because of semantic recall. Every memory is embedded at write time, so the model can search by meaning, not keywords.
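The toolkit's surface can be sketched as three tool definitions. The names and parameter shapes below are illustrative assumptions, not Acolyte's actual schemas:

```typescript
// Hypothetical sketch of the three memory operations as exposed to the model.
// Names and parameter shapes are illustrative, not Acolyte's real tool schemas.
const memoryTools = [
  {
    name: "memory_search",
    description:
      "Semantic search over all stored memories. Returns the closest matches by meaning, not keywords.",
    parameters: { query: "string", limit: "number (optional)" },
  },
  {
    name: "memory_add",
    description:
      "Store a new memory. It is embedded at write time, so it is immediately searchable.",
    parameters: { text: "string" },
  },
  {
    name: "memory_remove",
    description: "Delete a stale memory by id.",
    parameters: { id: "string" },
  },
];

console.log(memoryTools.map((t) => t.name).join(", "));
// memory_search, memory_add, memory_remove
```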
There is no fallback injection. If the model does not search, it does not get the context. The system prompt shrank to the bare minimum: a short soul prompt that defines personality, core instructions for the signal contract, and tool definitions. Everything else is searchable.
This changes the failure mode. With injection, the model sometimes ignores context that is present. With search, it can miss context entirely if it does not look for it. In practice, missing context is easier to detect and fix than context being silently ignored.
Unified storage
The prerequisite was unifying the storage. Acolyte previously had two separate memory systems: markdown files for explicit stored memories and SQLite for distill records (observations and reflections). Two storage backends meant two query paths and no way to search across both.
I merged them into a single memories table in SQLite. Stored memories, observations, and reflections all live in one place with a kind column to distinguish them. Embeddings are shared. One cosine similarity search surfaces results across all memory types.
When the model searches for “how do tools work in this project,” it gets relevant stored facts, project observations, and user preferences all ranked together. No manual source selection, no configuration.
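A minimal sketch of what that unified table might look like. The column names and CHECK constraint are my guesses at the shape, not Acolyte's actual schema:

```typescript
// Hypothetical DDL for the unified memories table; column names are
// assumptions, not Acolyte's real schema.
const createMemoriesTable = `
  CREATE TABLE IF NOT EXISTS memories (
    id         TEXT PRIMARY KEY,
    kind       TEXT NOT NULL CHECK (kind IN ('stored', 'observation', 'reflection')),
    text       TEXT NOT NULL,
    embedding  BLOB NOT NULL,  -- pre-computed at write time
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
  );
`;

// One query path: every row comes back regardless of kind, and similarity
// ranking happens in TypeScript, so a single search covers all memory types.
const selectAllMemories = `SELECT id, kind, text, embedding FROM memories;`;
```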
How search works
When you save a memory, an embedding model turns the text into a list of numbers: coordinates in a high-dimensional space. Sentences with similar meaning end up near each other. “Tool execution uses runTool” and “how does tool execution work” map to nearby points even though they share few words.
At search time, the query gets embedded once. Then the system compares it against the pre-computed embeddings for every stored memory using cosine similarity, which measures how close two points are. The closest matches are the most relevant results.
The key word is pre-computed. Embeddings are calculated once at write time and stored in SQLite. Search only needs one embedding call for the query, then it is pure math. No re-embedding, no round trips per memory.
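The whole read path fits in a few lines of TypeScript. `StoredMemory` and `searchMemories` are illustrative names for this sketch, not Acolyte's API:

```typescript
// Sketch of the read path, assuming embeddings are already stored as number
// arrays. Only the query needs an embedding call; everything after is math.
interface StoredMemory {
  id: string;
  text: string;
  embedding: number[]; // computed once, at write time
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function searchMemories(
  queryEmbedding: number[],
  memories: StoredMemory[],
  limit = 5,
): StoredMemory[] {
  return memories
    .map((m) => ({ memory: m, score: cosineSimilarity(queryEmbedding, m.embedding) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, limit)
    .map((r) => r.memory);
}
```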
This scales well for hundreds of memories on a single machine. SQLite handles the storage, TypeScript handles the similarity math. For thousands of memories across a team, you would move to something like Postgres with pgvector. The MemoryStore interface already abstracts the backend, so the swap is clean when the time comes.
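One plausible shape for that abstraction, sketched synchronously for brevity. The real interface is almost certainly async and differs in detail:

```typescript
// Hypothetical shape of the storage abstraction; the real MemoryStore
// interface in Acolyte may differ, and would likely return Promises.
interface MemoryRecord {
  id: string;
  kind: "stored" | "observation" | "reflection";
  text: string;
  embedding: number[];
}

interface MemoryStore {
  insert(record: MemoryRecord): void;
  // ids of the `limit` records nearest to the query embedding
  nearest(queryEmbedding: number[], limit: number): string[];
  delete(id: string): void;
}

// The SQLite-backed store loads rows and ranks them in TypeScript; a
// pgvector-backed store would push ranking into SQL. Callers never notice.
class InMemoryStore implements MemoryStore {
  private rows: MemoryRecord[] = [];
  insert(record: MemoryRecord): void {
    this.rows.push(record);
  }
  nearest(queryEmbedding: number[], limit: number): string[] {
    // On unit-normalized embeddings, cosine similarity is just a dot product.
    const dot = (a: number[], b: number[]) =>
      a.reduce((sum, v, i) => sum + v * b[i], 0);
    return this.rows
      .map((r) => ({ id: r.id, score: dot(queryEmbedding, r.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map((r) => r.id);
  }
  delete(id: string): void {
    this.rows = this.rows.filter((r) => r.id !== id);
  }
}
```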
The minimal prompt
The system prompt now contains exactly three things:
The soul prompt is a few lines that define Acolyte’s personality and core behavior. It includes one line about searching memory: “If a question can be answered by reading the code or searching my memory, I do that before guessing.”
Core instructions are the operational rules the model must follow on every turn. The signal contract, tool usage patterns, workspace constraints. These are small and universal.
Tool definitions describe the available tools. This is the largest remaining cost, and it is next on the list: deferred tool loading so only the tools the model actually uses are sent.
Everything else is gone from the prompt. Project rules, stored memories, distill context, continuation state. All searchable, none injected.
Write path simplified
The distill pipeline still runs after every task, but it got simpler. The observer extracts facts from the conversation. Each fact is tagged with a scope (@observe project, @observe user, or @observe session) and stored as its own record with its own embedding.
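The record shape and tag parsing might look like this. The `@observe scope: fact` line format and the helper name are assumptions for illustration; only the three scope tags come from the actual design:

```typescript
// Illustrative sketch of the write path. The scope tags (project, user,
// session) are real; the "@observe scope: fact" line format is assumed.
type ObserveScope = "project" | "user" | "session";

interface Observation {
  fact: string;
  scope: ObserveScope;
}

// e.g. "@observe project: tool execution goes through runTool"
function parseObservation(line: string): Observation | null {
  const match = line.match(/^@observe (project|user|session):\s*(.+)$/);
  if (!match) return null;
  return { scope: match[1] as ObserveScope, fact: match[2] };
}
```

Each parsed observation would then be embedded and inserted as its own row, never merged with others.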
The reflector is gone. It used to consolidate multiple observations into a single compressed summary and delete the originals. That made sense for injection budgets where you needed to fit everything into a fixed token window. It makes no sense for search. Individual observations have precise embeddings that match specific queries. A consolidated blob has a fuzzy embedding that matches everything poorly.
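A toy demonstration of why, using made-up two-dimensional "embeddings": averaging two unrelated observations produces a vector that matches a specific query less strongly than the individual observation does.

```typescript
// Toy illustration with fabricated 2-d vectors standing in for embeddings.
function cosine(a: number[], b: number[]): number {
  const dot = a[0] * b[0] + a[1] * b[1];
  const norm = (v: number[]) => Math.hypot(v[0], v[1]);
  return dot / (norm(a) * norm(b));
}

const obsAboutTools = [1, 0]; // "tool execution uses runTool"
const obsAboutStyle = [0, 1]; // "the user prefers tabs"
const consolidated = [0.5, 0.5]; // one summary covering both

const query = [1, 0]; // "how does tool execution work"
console.log(cosine(query, obsAboutTools).toFixed(2)); // "1.00"
console.log(cosine(query, consolidated).toFixed(2)); // "0.71"
```

The consolidated blob lands between both topics and is a mediocre match for each; the individual observation is an exact one.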
One model call per request instead of two. Every fact stays searchable in its original form.
Cold start
With injection gone, a new project starts with zero memory. To solve this, I wrote a seed script that loads curated facts into project memory from a version-controlled JSON file. Idempotent, cheap to re-run. For Acolyte’s own development, that is 32 facts covering architecture, lifecycle, tools, and conventions.
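A sketch of what such an idempotent seed might look like. The JSON fact shape and the stable-id dedup key are assumptions; the real script would also embed each fact on insert:

```typescript
// Hypothetical idempotent seed: facts carry a stable id, so re-running
// skips anything already stored instead of duplicating it.
interface SeedFact {
  id: string;   // stable key across runs
  text: string; // e.g. "Tool execution goes through runTool"
}

function seed(facts: SeedFact[], existing: Map<string, string>): number {
  let added = 0;
  for (const fact of facts) {
    if (existing.has(fact.id)) continue; // already seeded: skip
    existing.set(fact.id, fact.text);    // real version would embed + insert
    added++;
  }
  return added; // how many new facts this run stored
}
```

Running it twice adds everything once and nothing the second time, which is what makes it cheap to wire into version control.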
This is a better model than AGENTS.md. The facts are searchable by semantic relevance instead of dumped wholesale into every prompt. The model gets the 2–3 facts it needs for the current task, not all 32 on every turn.
The next step is auto-syncing AGENTS.md itself into memory. The user writes project rules as they normally would. Acolyte detects changes and seeds the rules into project memory automatically. Same interface for the user, but the rules become searchable instead of injected. No manual seed script needed.
The difference
Memory injection alone cost ~400 tokens per turn; that overhead is gone. When the AGENTS.md rules move to searchable memory too, that is another ~700 tokens saved. The system prompt will shrink from ~2,100 tokens to under 1,000. Every turn gets cheaper, every rate limit window gets roomier.
For providers with per-token rate limits, this directly increases throughput. For users paying per token, this is immediate cost savings. For the model, this means less noise and better focus on the actual task.
The model’s behavior improved too, though that is harder to measure. With fewer irrelevant rules in the prompt, the model spends more attention on the user’s actual request. Early testing suggests it follows relevant conventions more consistently when it actively searches for them than when they are passively present in a wall of injected text.
Built for tomorrow
I built the injection system because I assumed the model needed the context upfront. It did not.
The model knows what it needs better than the system can predict.
That gap will only widen. Models are getting better at tool use, self-directed search, and knowing when they need more context. A system that gets out of the way benefits from every one of those improvements. A system that compensates for the model has to undo that work when the model catches up.
Most systems are designed around what today’s models cannot do.
Acolyte is designed for what the next ones can do.