An overnight agent fleet merged 11 PRs while I slept

By Felix Hellström · Stockholm · 800 words

On the night of 1 March 2026, I let my agent orchestrator run unsupervised for the first time. I handed it a backlog, started the loop, and went to sleep. When I woke up, eleven pull requests had been merged into my repositories. Each one was implemented, tested, and merged by a fleet of agents I wasn't awake to watch.

This is the story of that run: the numbers, the architecture that produced them, and the failure modes that mattered more than the successes.

The numbers

Over 7.5 hours the system ran 686 orchestration cycles. It pulled 54 unique issues from a backlog and dispatched 103 tasks against them. Eleven pull requests were merged automatically. Nine were flagged for manual review. Eight agents stalled mid-task and were detected, killed, and re-dispatched without me touching the keyboard.

And 94.5% of cycles were idle.

Both numbers matter. The eleven merged PRs weren't cosmetic; they included dependency upgrades, bug fixes, feature work, and new tests, each passing the suite before merge. The 94.5% idle rate is the reason this was a proof of concept and not a production system. The orchestrator did useful work in 5.5% of its cycles, roughly 38 of 686, and spent the rest polling an empty queue.
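
The arithmetic behind those figures fits in a few lines. The one assumption is that cycles were spread evenly across the run, which a fixed polling interval implies; the variable names are just labels for the numbers quoted above.

```python
# Figures quoted in the text above.
RUN_HOURS = 7.5
CYCLES = 686
IDLE_FRACTION = 0.945
ISSUES = 54
DISPATCHES = 103

# Average cycle length, assuming cycles were evenly spaced over the run.
seconds_per_cycle = RUN_HOURS * 3600 / CYCLES      # about 39.4 seconds

# Cycles that did something versus cycles that polled an empty queue.
active_cycles = round(CYCLES * (1 - IDLE_FRACTION))  # 38
idle_cycles = CYCLES - active_cycles                 # 648

# How often the average issue was handed to an agent.
dispatches_per_issue = DISPATCHES / ISSUES           # about 1.91
```

Roughly one cycle every 39 seconds, of which about 38 in total found anything to do.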

The architecture

Three layers produced that result.

A scheduler reads a backlog of issues, prioritizes by label, and assigns each one to an open agent slot.
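
A minimal sketch of that scheduling step, not the orchestrator's real code. The label ranking, the `Issue` type, and the slot names are all hypothetical; the only thing taken from the description is "prioritize by label, then fill open slots".

```python
from dataclasses import dataclass

# Hypothetical ranking: lower number means higher priority.
LABEL_PRIORITY = {"bug": 0, "dependencies": 1, "feature": 2, "tests": 3}
UNRANKED = len(LABEL_PRIORITY)  # rank for issues with no known label


@dataclass(frozen=True)
class Issue:
    number: int
    labels: tuple


def priority(issue: Issue) -> int:
    """Best (lowest) rank among the issue's labels; unlabeled issues sort last."""
    return min((LABEL_PRIORITY.get(label, UNRANKED) for label in issue.labels),
               default=UNRANKED)


def assign(backlog: list, open_slots: list) -> dict:
    """Pair the highest-priority issues with open slots, one issue per slot."""
    ranked = sorted(backlog, key=lambda issue: (priority(issue), issue.number))
    # zip stops at the shorter input, so surplus issues simply stay in the backlog.
    return dict(zip(open_slots, ranked))
```

With two open slots and four issues, the two best-ranked issues are assigned and the rest wait for the next cycle.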

The agents are Claude Code instances, each running in its own tmux pane with a single issue, a fresh checkout, and a set of instructions. An agent implements the change, runs the tests, and opens a pull request.

A supervisor watches agent health. When a pane stops producing output for long enough, the supervisor treats the agent as stalled, kills it, and hands the issue back to the scheduler.

The weakest link was that last layer. Detecting "stuck" by watching terminal output is fragile; it pattern-matches on a side effect instead of reading a real status signal. It worked eight times that night, but it is the part I trusted least, and the part I rebuilt first.
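
To make the fragility concrete, here is a minimal sketch of silence-based stall detection. It is an illustration, not the supervisor's actual code: the `PaneWatch` name and the five-minute threshold are assumptions, and the snapshot is passed in as a plain string (in practice it could come from something like `tmux capture-pane -p`).

```python
from dataclasses import dataclass
from typing import Optional

STALL_AFTER_S = 300.0  # hypothetical silence threshold, in seconds


@dataclass
class PaneWatch:
    last_output: Optional[str] = None  # most recent snapshot seen
    last_change: float = 0.0           # time the snapshot last changed

    def observe(self, output: str, now: float) -> bool:
        """Record a snapshot; return True if the pane has been silent too long."""
        if self.last_output is None or output != self.last_output:
            # First observation, or new output: reset the silence timer.
            self.last_output = output
            self.last_change = now
            return False
        return now - self.last_change >= STALL_AFTER_S
```

The weakness is visible in the signature: the only input is what the terminal happens to show. An agent that is thinking quietly looks identical to one that is hung, and an agent looping on the same error with a changing timestamp never looks stalled at all.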

What broke

The interesting output of an unsupervised run isn't the merged PRs. It's the catalogue of things that failed where no human was watching to quietly paper over them.

No adaptive polling. When the backlog drained, the loop kept checking an empty queue on a fixed interval. Hence 94.5% idle. The fix is backoff when the queue is empty and ramp-up when work arrives.
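
A minimal sketch of that backoff rule, assuming exponential growth with a cap. The base interval, cap, and growth factor are placeholder values, not the ones the orchestrator used.

```python
def next_interval(current: float, found_work: bool,
                  base: float = 5.0, cap: float = 300.0,
                  factor: float = 2.0) -> float:
    """Return how long to sleep before the next poll, in seconds."""
    if found_work:
        # Work arrived: ramp straight back up to the fastest polling rate.
        return base
    # Empty queue: wait longer each time, but never longer than the cap.
    return min(current * factor, cap)
```

Starting from 5 seconds, seven empty polls in a row give intervals of 10, 20, 40, 80, 160, 300, and 300 seconds; the first poll that finds work resets the interval to 5.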

No wall-clock timeout. One agent got stuck and ran for more than six hours, until the run itself ended. A single task burned most of the night. Any agent that spends more than 30 minutes on one issue should be killed and re-dispatched.
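
The check itself is small. This sketch assumes the orchestrator records a start time per agent from a monotonic clock such as `time.monotonic()`; the agent ids and the `overdue` helper are hypothetical.

```python
MAX_TASK_S = 30 * 60  # the 30-minute limit, in seconds


def overdue(started_at: dict, now: float, limit: float = MAX_TASK_S) -> list:
    """Return the ids of agents that have run longer than the limit.

    started_at maps agent id -> start time; now is the current time
    from the same clock.
    """
    return sorted(agent for agent, start in started_at.items()
                  if now - start > limit)
```

Everything in the returned list gets killed and its issue goes back to the scheduler, regardless of whether the pane is still producing output.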

Duplicate dispatches. 103 dispatches for 54 issues means issues were picked up more than once: 49 more dispatches than issues, of which the stall re-dispatches can explain only eight. The deduplication had a race: two cycles that started inside the same polling window could both grab the same issue. Issues need to be locked before dispatch, not after.
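
A minimal sketch of claim-before-dispatch, assuming all cycles run in one process so that a `threading.Lock` is enough. The `IssueClaims` class is hypothetical; across processes or machines the same idea needs a shared store with an atomic test-and-set.

```python
import threading


class IssueClaims:
    """Registry that lets exactly one caller claim each issue."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed = set()

    def try_claim(self, issue: int) -> bool:
        """Atomically claim an issue; return False if it is already taken."""
        with self._lock:
            # Check and insert happen under one lock, so two cycles
            # can never both see the issue as free.
            if issue in self._claimed:
                return False
            self._claimed.add(issue)
            return True

    def release(self, issue: int) -> None:
        """Give an issue back, for example after its agent was killed."""
        with self._lock:
            self._claimed.discard(issue)
```

A cycle dispatches an agent only if `try_claim` returns True. The race goes away because checking and marking are a single step instead of two.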

No quality gates. PRs auto-merged on a green test suite: no diff-size ceiling, no post-merge CI, no human gate above a complexity threshold. For a first run I got lucky. Luck is not a guardrail.
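
A sketch of what a pre-merge gate could look like. The thresholds are invented for illustration, and file count is used as a crude stand-in for complexity; post-merge CI is a separate step that this does not cover.

```python
from dataclasses import dataclass

MAX_DIFF_LINES = 400  # hypothetical diff-size ceiling
MAX_FILES = 10        # hypothetical complexity proxy


@dataclass(frozen=True)
class PullRequest:
    tests_green: bool
    lines_changed: int
    files_changed: int


def merge_decision(pr: PullRequest) -> str:
    """Return 'reject', 'manual-review', or 'auto-merge' for a pull request."""
    if not pr.tests_green:
        return "reject"
    # A green suite is necessary but not sufficient: large or sprawling
    # changes go to a human even when every test passes.
    if pr.lines_changed > MAX_DIFF_LINES or pr.files_changed > MAX_FILES:
        return "manual-review"
    return "auto-merge"
```

The point of the ordering is that a green test suite only earns an auto-merge for small, contained changes.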

Destructive restarts. When the orchestrator reconciled state on startup, it killed every running agent. Any in-flight work was lost. Reconciliation should preserve healthy agents, not nuke them.
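
A minimal sketch of non-destructive reconciliation. It assumes the orchestrator persists a map of agent to issue and can ask, on startup, which agents are still alive and healthy; both the `reconcile` function and those inputs are hypothetical.

```python
def reconcile(recorded: dict, healthy: set) -> tuple:
    """Split persisted assignments into agents to keep and issues to requeue.

    recorded maps agent id -> issue number from the last saved state.
    healthy is the set of agent ids that passed a health check on startup.
    """
    # Healthy agents keep their issue and are left running.
    kept = {agent: issue for agent, issue in recorded.items()
            if agent in healthy}
    # Only issues whose agent is gone or unhealthy return to the backlog.
    requeue = sorted(issue for agent, issue in recorded.items()
                     if agent not in healthy)
    return kept, requeue
```

Only the agents missing from `healthy` are cleaned up; in-flight work on the others survives the restart.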

Why this is the work I care about

None of these failures are about the agents being bad at writing code. The agents were fine. Every failure lived in the infrastructure around them: the scheduler, the dispatch lock, the health check, the reconciliation logic. That layer is the one I build.

An overnight fleet that produces eleven real PRs is a leverage machine: team-scale output at close to zero marginal cost, running while I sleep. The scarce skill isn't prompting a model. It's building the orchestration, state management, and recovery loops that let a fleet run unattended without corrupting itself. And being honest about exactly where those loops still break.

The morning after this run, the system's own failure log became the next backlog. It had, in effect, written its own to-do list. That is the loop I was trying to build, and it works.

I build agent infrastructure — a Rust multi-agent OS I operate daily, and context-vault, a published MCP memory layer that works across Claude Code, Cursor, and Windsurf.

Available Q3 2026. If you're building agent systems and want someone who has shipped this kind of infrastructure, let's talk.
