619 commits in two months. That's how far I've gotten building a personal agent operating system: a layer that sits between my repositories, my tools, and a fleet of Claude Code agents, and runs them as long-lived workers against a backlog. I use it every day. This is the honest retrospective: which design choices held under real load, and which ones broke.
The useful version of this post is not a feature list. It's the line between the decisions that survived contact with an unsupervised fleet and the ones that didn't.
What holds
State lives in the filesystem. Work isn't tracked in a database or in an agent's memory. A unit of work is a file that moves between todo/, doing/, and done/ directories. This sounds primitive until the day a fleet-wide bug takes out every running process at once. That happened: one malformed call broke tool invocation across the whole system mid-run. Zero work was lost. When the fault was patched and the orchestrator restarted, it read the filesystem, found the in-flight work still sitting in doing/, and continued. A crash can't corrupt state that was never in memory to begin with. This is the choice I'd defend hardest.
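A minimal sketch of that pattern, in Python for brevity rather than the system's own code. The todo/, doing/, and done/ directories come from the description above; the function names and the idea of claiming a task with an atomic rename are my assumptions for illustration.

```python
import os
from pathlib import Path
from typing import List, Optional

STATES = ("todo", "doing", "done")


def init_queue(root: Path) -> None:
    """Create the three state directories if they are missing."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def claim_next(root: Path) -> Optional[Path]:
    """Move the first task (by file name) from todo/ to doing/."""
    for task in sorted((root / "todo").iterdir()):
        target = root / "doing" / task.name
        try:
            # On POSIX, a rename within one filesystem is atomic, so two
            # workers cannot both claim the same file.
            os.rename(task, target)
        except FileNotFoundError:
            continue  # another worker claimed it first
        return target
    return None


def complete(root: Path, task: Path) -> Path:
    """Move a finished task from doing/ to done/."""
    target = root / "done" / task.name
    os.rename(task, target)
    return target


def recover_in_flight(root: Path) -> List[Path]:
    """After a restart, whatever sits in doing/ is the unfinished work."""
    return sorted((root / "doing").iterdir())
```

On restart, an orchestrator built this way would call recover_in_flight before claiming anything new. No process has to remember what it was doing, because the directory listing is the record.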
The folder tree is the config. There's no central configuration file describing the fleet. The physical layout of the directory (which workspace owns which project, where an agent's instructions live) is the configuration. An agent's identity is its folder. New workspaces are created by making directories, not by editing a registry, and nothing drifts out of sync with a file that has to be maintained by hand.
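To make the idea concrete, here is a sketch that derives the fleet from the directory layout alone. The two-level workspace/project layout and the INSTRUCTIONS.md file name are hypothetical stand-ins; the real tree may be shaped differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Hypothetical file name for an agent's instruction file.
INSTRUCTIONS_NAME = "INSTRUCTIONS.md"


@dataclass(frozen=True)
class Agent:
    workspace: str
    project: str
    instructions: Path


def discover_agents(root: Path) -> List[Agent]:
    """Treat every <workspace>/<project>/INSTRUCTIONS.md as one agent."""
    agents = []
    for instructions in sorted(root.glob(f"*/*/{INSTRUCTIONS_NAME}")):
        project_dir = instructions.parent
        agents.append(
            Agent(
                workspace=project_dir.parent.name,
                project=project_dir.name,
                instructions=instructions,
            )
        )
    return agents
```

Adding an agent under this scheme means creating a folder with an instruction file in it. There is no registry entry to forget.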
Memory is a service, not a habit. Context lives once, in a vault the agents query, instead of being copy-pasted into markdown files per project. A decision made in one workspace is visible to an agent in another. The instruction files became thin boot loaders that declare what to load, not where the knowledge is stored.
Sequential beats parallel. I learned this the expensive way. Running eight concurrent agents thrashes the machine and produces merge conflicts that cost more time than the parallelism saves. The default is now one thing done well; parallel is reserved for work proven independent. Coordination overhead is real and it is rarely worth it for a solo operator.
What breaks
Idle efficiency. The first unsupervised overnight run spent 94.5% of its cycles polling an empty queue, because the loop checked for work on a fixed interval with no backoff. The fleet produced real merged pull requests, but it wasted most of its runtime doing nothing. Naive polling is the first thing that breaks the moment a system runs long enough for the backlog to drain.
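One common remedy is exponential backoff: poll again immediately while there is work, and double the wait each time the queue comes back empty. The sketch below assumes a poll callable that returns True when it found and handled work; the base and cap values are illustrative, not the system's real settings.

```python
import time
from typing import Callable, List


def poll_with_backoff(
    poll: Callable[[], bool],
    max_polls: int,
    base: float = 1.0,
    cap: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Poll up to max_polls times and return the delays actually slept."""
    delay = base
    slept: List[float] = []
    for _ in range(max_polls):
        if poll():
            delay = base  # work found: reset and poll again right away
            continue
        sleep(delay)  # queue empty: wait before asking again
        slept.append(delay)
        delay = min(delay * 2, cap)  # double the wait, up to the cap
    return slept
```

With a base of one second and a cap of five minutes, a drained queue costs a handful of polls per hour instead of thousands. The sleep parameter is injectable so the loop can be tested without waiting.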
Handoffs race. The failure that taught me the most: a worker would spawn before its task was reliably attached, boot up with nothing to do, and exit cleanly. The manager, watching for completion, saw the exit and waited forever for a signal that would never come. Both processes behaved correctly in isolation; the bug lived in the gap between them. Distributed handoffs need an explicit confirm step, not an assumption that a message arrived before the process started.
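A sketch of what an explicit confirm step could look like, assuming a file-based acknowledgement. The ack file, the Handoff outcomes, and the worker_alive callback are all hypothetical names introduced here. The point is that the manager distinguishes "worker confirmed the task" from "worker exited" instead of treating any exit as progress.

```python
import time
from enum import Enum
from pathlib import Path
from typing import Callable


class Handoff(Enum):
    CONFIRMED = "confirmed"
    EXITED_WITHOUT_ACK = "exited_without_ack"
    TIMED_OUT = "timed_out"


def acknowledge(task_path: Path, ack_path: Path) -> bool:
    """Worker side: write the ack only if the task is actually attached."""
    if not task_path.exists():
        return False
    ack_path.write_text(task_path.name)
    return True


def await_ack(
    ack_path: Path,
    worker_alive: Callable[[], bool],
    timeout: float,
    interval: float = 0.05,
) -> Handoff:
    """Manager side: wait for the ack, but never wait on a dead worker."""
    deadline = time.monotonic() + timeout
    while True:
        if ack_path.exists():
            return Handoff.CONFIRMED
        if not worker_alive():
            # Re-check once: the worker may have acked just before exiting.
            if ack_path.exists():
                return Handoff.CONFIRMED
            return Handoff.EXITED_WITHOUT_ACK
        if time.monotonic() >= deadline:
            return Handoff.TIMED_OUT
        time.sleep(interval)
```

A worker that boots with nothing attached now produces EXITED_WITHOUT_ACK, which the manager can retry or escalate, rather than a wait that never ends.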
Two things competing to dispatch. An automatic dispatch hook and a manager's own dispatch loop, both correct on their own, fought over the same queue on one workspace and launched work with the wrong template. The heartbeat wasn't dead; it was misdirected. Anywhere two mechanisms can both take an action, exactly one has to own it.
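One way to enforce single ownership is an exclusive marker file per queue: whichever mechanism creates it first owns dispatch, and the other backs off. This is a sketch under that assumption; the .dispatch-owner file name and the owner labels are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical marker file recording who owns dispatch for a queue.
OWNER_FILE = ".dispatch-owner"


def claim_dispatch(queue_dir: Path, owner: str) -> bool:
    """Return True only for the first caller to claim this queue."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly
        # one claimant can succeed.
        fd = os.open(queue_dir / OWNER_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)
    return True


def current_owner(queue_dir: Path) -> Optional[str]:
    """Report which mechanism owns dispatch, if any."""
    marker = queue_dir / OWNER_FILE
    return marker.read_text() if marker.exists() else None
```

A marker like this outlives a crashed owner, so a real version would need a way to detect and clear stale claims. Declaring the owner statically in the folder layout would avoid that problem entirely.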
"Valid but wrong" output. The hardest failures aren't crashes. For three runs, agents edited the wrong directory because of a one-character path mismatch; the output validated, the file counts matched, and nothing was actually being synced. A check that only confirms "is this valid HTML?" misses "did you edit the thing you were supposed to edit?" Preflight checks have to verify intent, not just well-formedness.
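A sketch of an intent check, assuming the expected directory is known before the run and the list of changed files is available after it. The function names and PreflightError are hypothetical; the idea is to compare resolved paths instead of trusting that valid output landed in the right place.

```python
from pathlib import Path
from typing import Iterable


class PreflightError(Exception):
    """Raised when a run targets, or touched, the wrong location."""


def verify_target(actual_dir: Path, expected_dir: Path) -> None:
    """Before the run: fail if the agent is pointed at the wrong directory."""
    actual, expected = actual_dir.resolve(), expected_dir.resolve()
    if actual != expected:
        raise PreflightError(f"about to edit {actual}, expected {expected}")


def verify_changes(changed: Iterable[Path], expected_dir: Path) -> None:
    """After the run: fail if nothing changed or a change landed elsewhere."""
    expected = expected_dir.resolve()
    paths = list(changed)
    if not paths:
        raise PreflightError("run reported success but changed no files")
    for path in paths:
        try:
            # relative_to compares whole path components, so site/ and
            # sites/ are different locations.
            path.resolve().relative_to(expected)
        except ValueError:
            raise PreflightError(f"{path} is outside {expected}") from None
```

A one-character mismatch such as site versus sites fails both checks, even though every file the agent wrote would still validate.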
The throughline
The pattern across the breakages is that none of them are about the agents writing bad code. They're about the substrate around the agents: dispatch, handoff, state, verification. An autonomous fleet is a reliability test for that substrate, and the only way to find these failure modes is to run the system unsupervised, long enough that the edge cases actually arrive. A human in the loop silently corrects them and you never learn they exist.
So the system logs its own friction as it works, and that log becomes the next backlog. The mechanism that finds the problems is the same one that fixes them. 619 commits in, that loop is the part I'm most confident about: not because the system runs perfectly, but because it gets less wrong every week, on its own.
I build agent infrastructure: a Rust multi-agent OS I operate daily, and context-vault, a published MCP memory layer that works across Claude Code, Cursor, and Windsurf.
Available Q3 2026. If you're building agent systems and want someone who has shipped this kind of infrastructure, let's talk.