
Follow the Thread

Mar 20, 2026 · 9 min read

The trace tool started as a script so the AI could read its own runtime behavior. It became a first-class CLI command because the same observability that helps the model debug itself helps the developer understand what happened and why.

The verify cycle was silently broken for weeks. When Acolyte ran a task through the headless CLI, the model would edit files, signal completion, and exit without running lint or tests. From the outside, everything looked correct. The output was clean. The edits were right.

The trace showed the truth in one line: verify-cycle action=done. The evaluator returned immediately instead of triggering verification: no scan, no test run, no mode switch. That led straight to the root cause: the CLI was passing verifyScope: "none" to the lifecycle, disabling the evaluator entirely.

Without the trace, this bug would have been invisible. With it, the fix took minutes.
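The shape of the bug is easy to sketch. This is a hypothetical reconstruction, not Acolyte's actual code: names like `VerifyScope` and `evaluate` are illustrative, but they show how a single scope value can silently short-circuit the whole verification step.

```typescript
// Hypothetical sketch of the bug; these names are illustrative,
// not Acolyte's real API.
type VerifyScope = "none" | "changed" | "all";

interface EvaluatorDecision {
  action: "done" | "verify";
}

function evaluate(verifyScope: VerifyScope, hasEdits: boolean): EvaluatorDecision {
  // The bug: a scope of "none" makes the evaluator return immediately,
  // so edits are never linted or tested before the task exits.
  if (verifyScope === "none") return { action: "done" };
  return hasEdits ? { action: "verify" } : { action: "done" };
}
```

From the outside, both branches look identical: the task ends cleanly either way. Only the trace event distinguishes them.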

When Acolyte runs a task, it emits structured events for every lifecycle phase: tool calls, guard decisions, evaluator actions, mode transitions, generation cycles. That is not logging for the sake of logging. It is the system explaining what it did and why.

The interesting part is how the trace tool came to exist and what that says about building observable AI systems.

The architecture that makes it possible

Observability has been a first-class principle in Acolyte from the start. Two architectural decisions make it practical.

The first is the daemon. Acolyte runs as a headless server. Every task, whether it comes from the CLI, an editor plugin, or a future API client, flows through the daemon rather than an ephemeral process. That means lifecycle events are correlated by a stable task_id and land in a persistent log stream instead of scattering across short-lived shells.

The second is the lifecycle itself. Every request passes through five phases: resolve, prepare, generate, evaluate, finalize. Each phase emits events as it runs. Tool calls, guard decisions, evaluator actions, mode transitions, generation results: all of them are structured events with typed fields. The lifecycle does not log because someone asked for observability. It logs because each phase boundary is an explicit point in the control flow where the host makes a decision worth recording.
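A minimal sketch of what such a typed event union can look like. The event names mirror the trace output shown later in the post; the exact field shapes are assumptions:

```typescript
// Illustrative event shapes; Acolyte's real definitions will differ.
type LifecycleEvent =
  | { event: "lifecycle.tool.call"; task_id: string; tool: string; path?: string }
  | { event: "lifecycle.tool.result"; task_id: string; tool: string; duration_ms: number; is_error: boolean }
  | { event: "lifecycle.mode.changed"; task_id: string; from: string; to: string; trigger: string }
  | { event: "lifecycle.eval.decision"; task_id: string; evaluator: string; action: string };

// Consumers narrow on the `event` discriminant, so each branch
// gets the fields that belong to that event and nothing else.
function describe(e: LifecycleEvent): string {
  switch (e.event) {
    case "lifecycle.tool.call":
      return `${e.tool} called${e.path ? ` on ${e.path}` : ""}`;
    case "lifecycle.tool.result":
      return `${e.tool} finished in ${e.duration_ms}ms`;
    case "lifecycle.mode.changed":
      return `mode ${e.from} -> ${e.to} (${e.trigger})`;
    case "lifecycle.eval.decision":
      return `${e.evaluator} decided ${e.action}`;
  }
}
```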

The server dual-writes all of this: once to a logfmt file for raw debugging, and once to a SQLite database (trace.db) for indexed queries. The acolyte trace command queries SQLite directly, filtered by task, ordered by timestamp, with no full-file scanning. The raw log is still there for tail -f. The trace just makes the structured data easy to read.
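The logfmt side of the dual-write is simple to picture: each structured event flattens into key=value pairs on one line. A sketch, assuming the common convention of quoting values that contain whitespace:

```typescript
// Serialize a flat event object as one logfmt line.
// Quoting rules here are an assumption: values with whitespace
// get double quotes, everything else is written bare.
function toLogfmt(fields: Record<string, string | number | boolean>): string {
  return Object.entries(fields)
    .map(([k, v]) => {
      const s = String(v);
      return /\s/.test(s) ? `${k}="${s}"` : `${k}=${s}`;
    })
    .join(" ");
}
```

For example, `toLogfmt({ event: "lifecycle.tool.call", tool: "edit-file", path: "src/foo.ts" })` yields `event=lifecycle.tool.call tool=edit-file path=src/foo.ts`, which is exactly the shape a `tail -f` reader expects.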

Together, these two choices mean that for any task, you can reconstruct the full sequence of what happened: what the model did, what the host decided, and why.

It started as a script for AI

The first version of the trace tool was a standalone script. Not a CLI command. Not a feature. A debugging aid that I wrote so the AI could read its own runtime behavior.

When something went wrong during a task (a guard blocked a tool call, an evaluator triggered an unexpected regeneration, a mode switch happened at the wrong time) I needed the model to understand what the host had done. The server log had the information, but it was noisy. Hundreds of lines of logfmt with fields the model did not need.

So I wrote a script that parsed the log, filtered by task ID, and compressed each line into a compact summary. Instead of a raw log line with twenty fields, the model would see:

lifecycle.tool.call tool=edit-file path=src/foo.ts
lifecycle.guard guard=file-churn tool=read-file action=blocked
lifecycle.eval.decision evaluator=verify action=regenerate

That was enough for the model to understand the sequence. It could see that a guard blocked a read after an edit, that the evaluator triggered verification, and that the lifecycle was working as designed. Or it could see where something went wrong.
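The first version of that script amounted to something like the following sketch: parse each logfmt line into fields, keep the lines for one task, and emit only a handful of interesting fields. The set of displayed field names is an assumption:

```typescript
// Parse one logfmt line (key=value pairs, values optionally double-quoted).
function parseLogfmt(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const re = /(\w+)=("([^"]*)"|\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(line)) !== null) {
    // m[3] is the unquoted inner value when quotes were present.
    fields[m[1]] = m[3] ?? m[2];
  }
  return fields;
}

// Compress one raw log line into the compact summary the model reads,
// or drop it if it belongs to a different task.
function compact(line: string, taskId: string): string | null {
  const f = parseLogfmt(line);
  if (f.task_id !== taskId || !f.event) return null;
  const shown = ["tool", "path", "guard", "evaluator", "action"]
    .filter((k) => k in f)
    .map((k) => `${k}=${f[k]}`);
  return [f.event, ...shown].join(" ");
}
```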

From script to command

The script worked. I used it constantly, not just for AI debugging, but for my own understanding of what happened during a run. That was the signal that it belonged in the product.

The migration was straightforward: move the parsing into a proper module, register it as a CLI command with subcommands for filtering by task, and delete the script. Later, the storage moved from logfmt parsing to SQLite. The server writes trace events into an indexed database, and the CLI queries it directly instead of scanning the full log file.

But the refactoring exposed things worth fixing. The original script matched log lines by checking the msg field, free-text strings like "task state updated" and "rpc task accepted". That coupling was fragile. Change a log message and the trace silently breaks.

The fix was to add structured event fields to every server-side log call and match on those instead. Now the trace formatter uses an exhaustive switch over a typed event union. Adding a new event without a formatter is a compile error.
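In TypeScript, that compile-time guarantee typically comes from discriminant narrowing plus a `never` check in the default branch. A sketch of the idea, with illustrative event names:

```typescript
type TraceEvent =
  | { event: "lifecycle.start"; mode: string }
  | { event: "lifecycle.signal.accepted"; signal: string };

function format(e: TraceEvent): string {
  switch (e.event) {
    case "lifecycle.start":
      return `start mode=${e.mode}`;
    case "lifecycle.signal.accepted":
      return `signal ${e.signal} accepted`;
    default: {
      // If a new variant is added to TraceEvent without a case above,
      // `e` no longer narrows to `never` and this assignment fails to compile.
      const unhandled: never = e;
      return unhandled;
    }
  }
}
```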

What the trace shows

Running acolyte trace with no arguments lists recent tasks:

Task         Model       Status  Time
task_abc123  gpt-5-mini  ok      2m ago
task_def456  gpt-5-mini  error   15m ago

That is the starting point. Pick a task and drill in with acolyte trace task (the latest task is picked automatically):

timestamp=... task_id=task_abc123 event=lifecycle.start mode=work model=gpt-5-mini
timestamp=... task_id=task_abc123 event=lifecycle.tool.call tool=read-file path=src/cli.ts
timestamp=... task_id=task_abc123 event=lifecycle.tool.result tool=read-file duration_ms=12 is_error=false
timestamp=... task_id=task_abc123 event=lifecycle.tool.call tool=edit-file path=src/cli.ts
timestamp=... task_id=task_abc123 event=lifecycle.tool.result tool=edit-file duration_ms=45 is_error=false
timestamp=... task_id=task_abc123 event=lifecycle.mode.changed from=work to=verify trigger=evaluator
timestamp=... task_id=task_abc123 event=lifecycle.generate.start model=gpt-5-mini mode=verify
timestamp=... task_id=task_abc123 event=lifecycle.signal.accepted signal=done mode=verify
timestamp=... task_id=task_abc123 event=lifecycle.summary model_calls=2 total_tool_calls=4

That is a complete picture of one task: what the model did, what the host decided, and how the lifecycle progressed. No guessing, no inference from output. Just events.

The same data is available as structured JSON with --json, so developers can build their own tooling on top: pipe traces into jq, feed them into a dashboard, or write a script that flags specific patterns across runs.

The refactoring

Promoting the script to a command was the obvious move. But using it immediately showed what was still wrong.

The compact formatter was a 35-case switch statement that built strings directly. Each event type had its own branch with hardcoded field names and string templates. Adding a new event meant touching the switch, the type union, and a runtime set: three places that had to stay in sync by hand.

I replaced the switch with a data-driven field map. Each event declares which fields to display and what to call them. The formatter just iterates the map. No cases, no string building, no manual sync. The type system enforces exhaustiveness through a Zod schema that derives both the TypeScript union and the runtime set from a single source.
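A dependency-free sketch of that single-source pattern. The real code derives the union and the runtime set from a Zod schema; here a plain `as const` map stands in, and the field lists are assumptions:

```typescript
// One declaration drives everything: which events exist
// and which fields each one displays.
const FIELD_MAP = {
  "lifecycle.tool.call": ["tool", "path"],
  "lifecycle.guard": ["guard", "tool", "action"],
  "lifecycle.eval.decision": ["evaluator", "action"],
} as const;

// Both the TypeScript union and the runtime set derive from FIELD_MAP,
// so they cannot drift apart.
type EventName = keyof typeof FIELD_MAP;
const KNOWN_EVENTS = new Set<string>(Object.keys(FIELD_MAP));

// The formatter just iterates the declared fields:
// no switch, no string templates, no manual sync.
function formatCompact(name: EventName, fields: Record<string, string>): string {
  const parts = FIELD_MAP[name]
    .filter((k) => k in fields)
    .map((k) => `${k}=${fields[k]}`);
  return [name, ...parts].join(" ");
}
```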

The original script also supported filtering by request_id, a legacy from before the task system existed, when each request was a standalone RPC call with its own ID. Once tasks became the primary unit of work, the request ID became an internal correlation detail. The command dropped request-based filtering entirely. Tasks are the only entry point now, and the trace reflects that.

Then the output itself. The formatter was producing strings and printing them immediately. That meant compact text was the only format, so adding JSON meant duplicating every code path. I borrowed a pattern from a Rust project I built earlier: buffer structured data through typed methods, then render at the end. The command calls addRow, addTable, addHeader, and render without knowing the output format. Text mode joins sections with newlines and aligns table columns. JSON mode serializes each entry.

That abstraction is not trace-specific. Any CLI command can use it to support text and JSON output from the same data. When we add more output formats later, each one is just another implementation of the same interface.
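The buffered-output pattern can be sketched in a few lines. Method names mirror the post; everything else, including the entry shapes, is an assumption:

```typescript
// Commands push typed entries; rendering to text or JSON
// happens once at the end, from the same buffered data.
type Entry =
  | { kind: "header"; text: string }
  | { kind: "row"; fields: Record<string, string> };

class Output {
  private entries: Entry[] = [];

  addHeader(text: string): void {
    this.entries.push({ kind: "header", text });
  }

  addRow(fields: Record<string, string>): void {
    this.entries.push({ kind: "row", fields });
  }

  render(format: "text" | "json"): string {
    if (format === "json") return JSON.stringify(this.entries);
    return this.entries
      .map((e) =>
        e.kind === "header"
          ? e.text
          : Object.entries(e.fields).map(([k, v]) => `${k}=${v}`).join(" ")
      )
      .join("\n");
  }
}
```

The command code never branches on the output format; it only ever calls the typed methods, and `render` decides how the buffer becomes bytes.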

Observability builds trust

Most AI coding tools give you a chat transcript. Some give you a log file. Very few give you a structured trace of every decision the runtime made on your behalf.

I think that matters more than most people realize. When behavior is opaque, trust erodes. When something goes wrong and you cannot see why, you lose confidence in the system. When you can read the trace and understand the sequence, the tool earns trust even when it makes mistakes. More bugs have been caught by reading traces than by writing tests.

The trace also changes how you debug. Instead of re-running a task with more logging or adding print statements, you look at what already happened. The events are always there. The question is just whether you have a good way to read them.

The pattern

The trace tool followed a pattern I keep seeing in this project:

  1. Build something small to solve an immediate problem.
  2. Notice that it is useful beyond the original context.
  3. Promote it to a proper interface.

The lifecycle signal started as a debugging experiment and became a core protocol. The behavior harness started as a test helper and became a development tool. The trace script started as an AI debugging aid and became a CLI command.

The useful tools are rarely planned. They emerge from paying attention to what you actually reach for when things break. That has been one of the more enjoyable parts of this project.
