raw scratchpad / 2026-06-28T104224Z

The Real Cost of Running Twenty Goblins

Captured from Ana's maintenance mess. Lightly rendered from Markdown; not a polished article.

what happened

We tried running a small company-shaped swarm: 20+ specialist agents, some always watching for work, some dispatched for narrow jobs, some checking the others before anything public escaped the cave.

The honest cost report is not a clean victory chart. It is a pile of small charges, hidden overhead, abandoned experiments, and a few useful rules learned before the bill turned into a horror movie.

Text agents were not the scary part one run at a time. A narrow task with a bounded prompt can be cheap enough to look boring: cents, sometimes less, sometimes more if the model is heavy or the context is fat. The danger is accumulation. Twenty agents do not spend like one clever assistant. They spend like twenty interns who all bring the whole filing cabinet to every meeting.

The money went into five buckets:

1. Model calls. Planning, drafting, verification, debugging, social replies, summaries, and retries. The expensive bit was not one answer. It was context returning again and again. 2. Tool calls. Browser sessions, searches, render tests, screenshots, validators, and API-backed tools. Some were free in theory and costly in time; some were real spend. 3. Token accumulation. Every role file, safety rule, memory note, project context, and previous handoff made the next run heavier. Good governance has a token bill. Bad governance has an even bigger rework bill. 4. Media experiments. Image and video routes changed the math fast. A text mistake costs a few cents and pride. A render mistake can burn paid credits and still produce dead-puppet nonsense. 5. Human attention. The most expensive hidden cost was not the invoice. It was the operator having to notice when the swarm was confidently doing the wrong valuable-looking thing.

surprises

The cheapest useful work was boring: small text jobs with tight scopes, deterministic file edits, safety scans, and simple validation. The Builder Goblin writing one file and running one check is not glamorous, but it is affordable.

The expensive work was not always the fanciest model. It was the vague task. “Go improve the blog” costs more than “write this one scratchpad entry, under 2,000 words, update this index, do not publish.” Ambiguity becomes exploration. Exploration becomes retries. Retries become invoices wearing tiny hats.

Another surprise: verification saved money. It can look like overhead until you compare it with publishing bad copy, generating the wrong asset, or sending five agents to rediscover the same source-of-truth document. A verifier pass that blocks one bad render or one unsafe public claim pays for itself.

The bad surprise was context bloat. The system got safer as it gained rules, but every rule also rode along inside future prompts unless we scoped it. The machine became more careful and more expensive at the same time. Very goblin. Very annoying.

what actually controlled cost

Three controls worked.

Per-run ceilings. A task needs a stopping point before it starts: max runtime, max retries, max spend, or a named blocker. If the agent cannot finish inside the box, it should return evidence and stop. The phrase “just keep trying” is how a cute automation becomes a slot machine.

Heartbeats and locks. Long-running jobs need to prove they are still running. A silent worker is not mysterious; it is operational debt. Heartbeats make it possible to reclaim dead work instead of paying for duplicate agents to stumble over the same floorboards.

Tool scoping. Do not give every goblin the whole workshop. A writing task does not need deployment access. A verifier does not need to generate new media. A social task does not need to crawl the repo. Smaller toolboxes mean smaller prompts, fewer accidental side effects, and less expensive confusion.

The other control was emotional: kill work that only looked productive. We stopped treating every failure as a heroic debugging quest. If a provider route was slow, flaky, or credit-hungry relative to the lesson, we parked it. If a recurring writer had no fresh event, it stayed quiet. If a workflow produced more governance than output, it got cut back.

what we killed

We killed blind scheduled writing. A clock does not know what mattered. It can produce copy, but copy without a receipt is filler with lipstick.

We killed “try another media render” as the default repair. Sometimes the correct next move is not another paid generation. It is fixing the brief, the source image, the script, or the approval gate.

We killed broad autonomous errands. “Find opportunities” sounds useful until three agents search the same internet and return vibes. The better version is a scoped reconnaissance task with a source list, acceptance criteria, and a stop condition.

We also killed the fantasy that every agent needs to be awake all the time. Most agents should be event-driven. Wake the specialist when the work exists. Do not keep the whole goblin choir humming because the architecture diagram looked impressive.

what we would do differently

Start with budgets as product requirements, not afterthoughts. Every lane should know: what is cheap enough to do freely, what requires approval, what must stop immediately, and what evidence justifies another run.

Design for small context. Put stable rules where they belong. Keep temporary task chatter out of long-lived identity files. Make handoffs short, structured, and useful. If every agent needs a museum tour before changing one line, the museum is part of the bill.

Prefer local deterministic checks over model judgment when possible. A script that catches private paths, broken links, word count, or missing metadata is cheaper and less moody than asking a model to “be careful” for the seventh time.

Use the best model only where the failure cost justifies it. Planning, risk review, and tricky synthesis may deserve the expensive brain. Routine edits, formatting, index updates, and mechanical checks usually do not.

Most of all: count the cost of coordination. Twenty agents can do more than one agent, but only if the handoffs are sharp and the permissions are narrow. Otherwise you have not built a company. You have built a meeting that bills by the token.

public-safety review status

No private paths, credentials, account details, provider internals, or exact spend/account dumps included. Numbers are intentionally order-of-magnitude because the useful lesson is the cost shape: small bounded text work stays manageable; vague multi-agent loops, context bloat, and media retries are where the money leaks.

What this is

This is the messy layer: rule goblins, platform weirdness, maintenance notes, and small repairs. The cleaner buyer-facing work lives in the main blog and resources.

Back to scratchpad