I ran an autonomous AI agent workspace called cliref for twenty sprints with no human approval gates. A manager agent dispatched worker agents per sprint, the workers built pages, the manager verified output and queued the next sprint. My job was to not touch it.
When it was done there were 200 pages of CLI documentation: one page per macOS/Unix command, with synopsis, flags, and examples verified against the local man page, organized into 18 categories. Plus a complete navigation redesign. There was also a list of four things that had broken.
The four things that broke were the more valuable output.
What "autonomous" actually meant
cliref ran inside my omni agent-OS with authority: autonomous. That tag is not cosmetic. It means the manager loop self-drives without human approval gates: no "should I proceed?" checks, no per-sprint confirmation, no human in the task-selection path. The manager found the next todo item, templated a worker prompt, dispatched, polled for completion, verified output, and queued the next sprint. Twenty times over, running detached in tmux while I worked on other things.
Most "autonomous agent" setups I've seen gate each step on human review. What I was testing was whether a fully unsupervised loop could maintain quality and make independent forward progress on a real deliverable: not a toy, not a pipeline-processing task, but a site built from scratch over an extended run.
The answer was yes, with a footnote. The footnote was the point.
What it shipped
Sprints 01 through 20: complete. Eighteen command categories, ~200 pages.
Each page covers one command. Synopsis from the man page. Flags table with short descriptions. Two or three examples that actually run. No hallucinated flags (the sprint instructions required cross-checking against the local man page; the preflight check caught divergences). The navigation started as a flat list and was redesigned in sprint 09 to a collapsible docs-style sidebar with client-side search. The manager recognized the flat list was breaking at scale and redesigned without being asked.
That last detail is the one I'd quote if someone asked "but did it really work autonomously?" The manager noticed the usability problem, proposed a solution, and implemented it mid-run. No one told it to. It's in the commit log.
The hook outage that proved the design
Midway through the run, a delivery bug knocked out every tool invocation fleet-wide: a missing space in the omni hook invoke call broke all hooks for every workspace simultaneously. The cliref manager and workers were among the processes affected.
Zero work was lost.
The reason: work state lives in the filesystem. The project-management/ directory has todo/, doing/, and done/ folders. A sprint is a file. When a worker picks it up, the file moves to doing/. When the worker finishes, it moves to done/. A hook outage doesn't corrupt that state. When the bug was patched and the manager resumed, it looked at the filesystem, found the in-flight sprint still in doing/, and continued from there.
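The folder-as-state idea is small enough to sketch. This is a minimal illustration, not omni's actual code: the function names and the sprint-01.md style filenames are my assumptions, and only the todo/, doing/, and done/ folders come from the setup described above.

```python
from pathlib import Path
from typing import Optional

STATES = ("todo", "doing", "done")


def init_board(root: Path) -> None:
    """Create the three state folders if they do not exist yet."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def claim_next(root: Path) -> Optional[Path]:
    """Move the first todo sprint (sorted by filename) into doing/."""
    pending = sorted((root / "todo").iterdir())
    if not pending:
        return None
    target = root / "doing" / pending[0].name
    pending[0].rename(target)  # a rename within one filesystem is atomic on POSIX
    return target


def finish(root: Path, sprint: Path) -> Path:
    """Move a sprint file from doing/ to done/."""
    target = root / "done" / sprint.name
    sprint.rename(target)
    return target


def resume(root: Path) -> Optional[Path]:
    """After an outage, continue whatever is still in doing/ before claiming new work."""
    in_flight = sorted((root / "doing").iterdir())
    return in_flight[0] if in_flight else claim_next(root)
```

Because the only state is which folder a file sits in, resume() needs no memory of what the manager was doing before the outage. It reads the directory and carries on.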
I had designed this intentionally. What I hadn't predicted was how cleanly it would handle a fleet-wide outage I didn't engineer. The failure mode arrived unexpectedly, mid-run, and the design held. That's the kind of resilience you can't fully test until something actually breaks.
Three things that broke
The hook outage was a test the substrate passed. These are three tests it failed.
Workers arriving without their task. The omni dispatch worker --file <prompt.md> flow had a race: the pane would spawn before the prompt file was reliably attached. Workers would boot up with an empty prompt, decide they had nothing to do, and exit cleanly. The manager, polling for task completion, would see the exit and wait for a signal that was never coming. The worker finished, but it didn't finish the sprint. I call this "dead on arrival." The manager and the worker both behaved correctly given their inputs; the failure was in the handoff between them. Fix: boot-verify the pane and relay the mission explicitly before starting the completion poll.
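Here is roughly what that fix looks like as a sketch. Everything here is hypothetical: the four callables stand in for whatever actually spawns a pane, checks it, sends text to it, and confirms receipt, and the acknowledgement step is my assumption about how "relay explicitly" would be verified.

```python
import time
from typing import Callable


def dispatch_with_boot_check(
    spawn: Callable[[], str],
    pane_ready: Callable[[str], bool],
    send_mission: Callable[[str, str], None],
    mission_acked: Callable[[str], bool],
    mission: str,
    attempts: int = 5,
    delay_s: float = 1.0,
) -> str:
    """Spawn a worker and return its pane id only once it provably holds the mission."""
    pane = spawn()

    # Step 1: boot-verify. Do not send anything until the pane is up.
    for _ in range(attempts):
        if pane_ready(pane):
            break
        time.sleep(delay_s)
    else:
        raise RuntimeError("pane never became ready")

    # Step 2: relay the mission and wait for the worker to acknowledge it.
    for _ in range(attempts):
        send_mission(pane, mission)
        if mission_acked(pane):
            return pane  # only now is it safe to start the completion poll
        time.sleep(delay_s)
    raise RuntimeError("worker never acknowledged the mission")
```

The point of the design is that a failed handoff raises immediately instead of turning into a completion poll that waits forever.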
A dispatch gate that silenced the heartbeat. The pressure-gated-dispatch Stop hook fires when a worker exits and auto-dispatches the next todo sprint. On cliref, it competed with the manager's own dispatch loop, and it dispatched using a generic cargo-project template instead of cliref's vanilla-HTML template. The worker would launch with the wrong template, produce the wrong output or nothing, and the manager would have no way to detect the conflict. The autonomous heartbeat wasn't dead; it was misdirected. Fix: disable pressure-gated-dispatch for authority: autonomous workspaces where the manager owns dispatch, or make the template workspace-aware before firing.
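Both halves of that fix fit in one small decision function. This is a sketch under assumptions: the Workspace fields, the "supervised" authority value, and the template strings are illustrative names, not omni's real configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workspace:
    name: str
    authority: str               # e.g. "autonomous"; other values are assumed
    manager_owns_dispatch: bool  # hypothetical flag
    template: str                # workspace's own template, "" if none is set


def gate_template(ws: Workspace, default: str = "cargo-project") -> Optional[str]:
    """Return the template the Stop hook should dispatch with, or None to stand down."""
    if ws.authority == "autonomous" and ws.manager_owns_dispatch:
        return None  # the manager owns dispatch; the gate must not compete with it
    # Workspace-aware: prefer the workspace's template over the generic default.
    return ws.template or default
```

With this shape, a workspace like cliref gets None back and the hook does nothing, while a workspace without its own manager loop still gets auto-dispatch with the right template.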
The fragments path trap. cliref uses pagekit plus a fragments binary for shared blocks. pagekit reads _fragments/ (underscore prefix). The fragments.toml config, by default, references fragments/ (no underscore). Sprints 02 through 04 silently worked in the wrong directory. Workers made edits that looked correct in isolation, the preflight check passed (valid HTML, file count met), but the shared blocks weren't being synced. The preflight caught page-count and validity; it didn't catch "you edited the wrong directory for three sprints." Fix: preflight should verify that at least one fragment marker in the target HTML matches the expected source.
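A preflight check along those lines could look like the sketch below. The marker syntax is an assumption I made up for illustration (an HTML comment pair wrapping each shared block), and so is the one-file-per-fragment layout; the real pagekit and fragments formats may differ.

```python
import re
from pathlib import Path

# Hypothetical marker format: <!-- fragment: nav --> ... <!-- /fragment -->
MARKER = re.compile(
    r"<!--\s*fragment:\s*(?P<name>[\w-]+)\s*-->(?P<body>.*?)<!--\s*/fragment\s*-->",
    re.DOTALL,
)


def fragments_in_sync(page_html: str, source_dir: Path) -> bool:
    """True if at least one fragment block in the page matches its source file."""
    for match in MARKER.finditer(page_html):
        source = source_dir / f"{match['name']}.html"
        if source.is_file() and source.read_text().strip() == match["body"].strip():
            return True
    return False
```

Pointed at _fragments/, this passes when the sync is working. Pointed at a stale or missing directory, it fails on the first sprint instead of the fourth.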
What I learned that I couldn't have learned any other way
You cannot find these failure modes by reading the code. You cannot find them by reviewing the spec. You find them by running the system autonomously, under real conditions, long enough that the edge cases arrive.
The "dead on arrival" race surfaces maybe one time in ten dispatches. At that rate, a supervised workflow masks it. Someone notices the task didn't progress and manually retries. Unsupervised, the manager hangs, you come back the next morning, and you have a clear reproduction to fix.
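To put a number on that rate: if each dispatch independently has about a one-in-ten chance of hitting the race, the chance of seeing it at least once grows quickly with run length. The independence assumption and the figure of twenty dispatches (one per sprint) are mine, so treat the result as a rough estimate.

```python
def p_at_least_one(p_per_dispatch: float, dispatches: int) -> float:
    """Probability that a per-dispatch failure occurs at least once in a run."""
    return 1.0 - (1.0 - p_per_dispatch) ** dispatches


# One dispatch per sprint over twenty sprints, at a 10% failure rate each.
estimate = p_at_least_one(0.1, 20)  # about 0.88
```

At roughly 88%, a twenty-sprint unsupervised run was close to guaranteed to hit the race at least once.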
The same logic applies to the dispatch gate and the fragments path. Both produced "valid but wrong" output that a human reviewer in the loop would have caught and corrected without recording it. Running without that reviewer forced the failure modes to accumulate until they were visible as a pattern.
That's the actual thesis: autonomous agent workspaces are a reliability test for your substrate, not for your agents. The agents on cliref were the same agents I use everywhere. What I was testing (without knowing I was testing it) was whether my dispatch system, my state management, my fragment tooling, and my preflight checks were robust enough to run without a human catching the gaps.
They mostly were. The places they weren't are now fixed.
cliref is archived. Its 200 pages work fine; the experiment concluded. The four failure modes it surfaced became prioritized substrate fixes and a dedicated test surface. That's a better outcome than if it had run perfectly. A workspace that runs perfectly and never breaks teaches you nothing about the limits of your substrate. A workspace that runs mostly well and breaks in specific, reproducible ways hands you a roadmap.
The bugs were the point. I'd run it again exactly the same way.