Resource 009 / scheduled-agent operations

A cron row is not operational proof.

A practical checklist for scheduled jobs and board workers that go silent, run in the wrong context, time out, report instead of acting, or leave zombie work behind.

Public safety status

This staged page applies the required publication fixes: stale draft-status lines are removed, internal source-map and verification JSON are excluded from the public site package, examples remain sanitized placeholders, and hard-gated operations stay framed as “do not do without approval.”

This is an independent operator checklist, not legal, security, compliance, platform-policy, vendor documentation, commercial, payment-flow, availability, or client-work advice. Public deploy, outreach, contact collection, commercial terms, account, credential, payment, provider, gateway, DNS, service, or spend actions require separate approval.

Audience: operators who schedule agents, board workers, or recurring automation and then need proof that the thing actually ran.

Promise: a practical checklist for finding why scheduled agents go silent, run in the wrong context, time out, report instead of acting, or leave zombie work on a board.

The blunt premise

A scheduled agent is not operational because a cron row exists. It is operational when you can answer four questions without vibes:

  1. Did the scheduler fire?
  2. Did the right agent run in the right profile, workdir, and environment?
  3. Did it do the intended action instead of writing a pretty status report?
  4. Where is the durable proof: artifact, receipt, log, notification, or board state?

If any answer is "the agent said so," congratulations, you own a haunted calendar reminder.
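The four questions can be turned into a small check. This is a minimal sketch, not a real scheduler schema: `RunEvidence` and its field names are hypothetical, and the check only confirms that evidence exists. Whether the recorded profile is the right one is a separate comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunEvidence:
    """Hypothetical evidence record for one scheduled run."""
    scheduler_fired_at: Optional[str] = None  # timestamp from the scheduler's own run history
    profile: Optional[str] = None             # profile the run actually used
    workdir: Optional[str] = None             # working directory the run actually used
    action_taken: Optional[str] = None        # concrete action, not a narrative summary
    proof_path: Optional[str] = None          # durable artifact, receipt, or log location


def unanswered_questions(ev: RunEvidence) -> list[str]:
    """Return the questions that the recorded evidence cannot answer."""
    missing = []
    if not ev.scheduler_fired_at:
        missing.append("Did the scheduler fire?")
    if not (ev.profile and ev.workdir):
        missing.append("Did the right agent run in the right profile and workdir?")
    if not ev.action_taken:
        missing.append("Did it do the intended action?")
    if not ev.proof_path:
        missing.append("Where is the durable proof?")
    return missing
```

An empty list means every question has at least some evidence behind it. Anything else is the list of things still resting on "the agent said so."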

Use this when

Public-safe toy model

Use placeholders, not real IDs:

Do not paste real cron IDs, channel IDs, delivery routes, private hostnames, token values, account emails, exact secret filenames, or raw logs with identities into public resources.
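A rough pre-publication scan can catch the obvious leaks. The sketch below is illustrative only: the three patterns are assumptions, not a complete list, and a real scan needs patterns tuned to your own stack. It reports line numbers and pattern names without echoing the matched value.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive list).
PRIVATE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "home_path": re.compile(r"/(?:home|Users)/[^\s/]+"),
    "long_token": re.compile(r"\b[A-Za-z0-9_-]{32,}\b"),
}


def scan_public_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) hits without echoing the matched value."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PRIVATE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A clean scan is not proof that a draft is safe. It only means none of these particular patterns fired.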

15-minute triage

1. Confirm schedule cadence

Ask:

Safe checks:

Do not do:

2. Verify profile and working directory

Scheduled agents often fail because they run as the wrong self.
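One way to catch the wrong self early is to compare the expected context with what the run reports at startup. This is a sketch under assumptions: the `AGENT_PROFILE` environment variable is hypothetical, and your scheduler may expose the profile some other way.

```python
import os
from pathlib import Path


def current_context() -> tuple[str, str]:
    """Read the profile (from a hypothetical env var) and the actual working directory."""
    return os.environ.get("AGENT_PROFILE", "unknown"), os.getcwd()


def context_mismatches(expected_profile: str, expected_workdir: str,
                       actual_profile: str, actual_workdir: str) -> list[str]:
    """Compare the context a run was supposed to use against what it reports."""
    problems = []
    if actual_profile != expected_profile:
        problems.append(f"profile: expected {expected_profile!r}, got {actual_profile!r}")
    # Resolve both paths so symlinks and trailing slashes do not cause false alarms.
    if Path(actual_workdir).resolve() != Path(expected_workdir).resolve():
        problems.append(f"workdir: expected {expected_workdir!r}, got {actual_workdir!r}")
    return problems
```

Writing the result into completion metadata gives a later reader something better than a guess about which profile and workdir ran.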

Check:

Safe checks:

Do not do:

3. Verify environment, secrets, and tool access without exposing them

A scheduled run may not inherit the same shell, browser profile, PATH, API keys, or toolset as an interactive run.
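A sanitized preflight can report what the scheduled environment has without printing any of it. A minimal sketch, assuming you supply the command and variable names yourself (the ones in the usage note are placeholders):

```python
import os
import shutil


def preflight(required_commands: list[str], required_env: list[str]) -> dict[str, bool]:
    """Report presence of commands and environment variables without printing any values."""
    report = {}
    for cmd in required_commands:
        # shutil.which returns None when the command is not on this process's PATH.
        report[f"cmd:{cmd}"] = shutil.which(cmd) is not None
    for name in required_env:
        # Record only whether the variable is set and non-empty, never its value.
        report[f"env:{name}"] = bool(os.environ.get(name))
    return report
```

Calling `preflight(["example-tool"], ["EXAMPLE_API_KEY"])` from inside the scheduled job, not from your interactive shell, is what makes the result meaningful.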

Check:

Safe checks:

Do not do:

4. Inspect timeout ceilings and crash behavior

Scheduled agents die in boring ways: runtime cap, stale lock, OOM, provider timeout, browser hang, or a child process that never exits.
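A bounded child process with a recorded outcome makes those boring deaths visible. The sketch below uses the standard library's `subprocess.run` with a timeout; the outcome labels and the returned field names are assumptions for illustration. Note that the timeout kills the direct child, while any grandchildren it spawned may survive.

```python
import subprocess
import sys
import time


def run_bounded(argv: list[str], timeout_s: float) -> dict:
    """Run a child process under a hard time limit and record how it ended."""
    started = time.time()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        outcome = "ok" if proc.returncode == 0 else "failed"
        returncode = proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and waits for the child before raising TimeoutExpired.
        outcome, returncode = "timeout", None
    return {
        "outcome": outcome,
        "returncode": returncode,
        "started_at": started,
        "elapsed_s": round(time.time() - started, 3),
    }


# Example: a child that sleeps past its ceiling is recorded as a timeout.
example = run_bounded([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```

Recording start time and elapsed time alongside the outcome is what lets you tell a timeout ceiling from a crash after the fact.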

Check:

Safe checks:

Do not do:

5. Check notification silence separately from execution

A job can complete and fail to notify. A notification can succeed while the actual artifact is missing. Treat them as separate systems.
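Keeping the signals separate is easier when each one has its own status. A minimal sketch: the three-state `Proof` enum mirrors the proven/failed/unproven wording used in the closeout list later on this page, and the diagnosis strings are illustrative assumptions.

```python
from enum import Enum


class Proof(Enum):
    PROVEN = "proven"
    FAILED = "failed"
    UNPROVEN = "unproven"


def diagnose(execution: Proof, artifact: Proof, notification: Proof) -> str:
    """Combine three independently checked signals into one blunt diagnosis."""
    if artifact is Proof.PROVEN and notification is Proof.PROVEN:
        return "artifact and notification both proven"
    if artifact is Proof.PROVEN:
        return "work landed; fix the notification route"
    if notification is Proof.PROVEN:
        return "notified without proof; do not mark complete"
    if execution is Proof.PROVEN:
        return "ran but left no durable artifact or notification"
    return "nothing proven yet; check run history first"
```

The point of the structure is that no single signal is allowed to stand in for the others.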

Check:

Safe checks:

Do not do:

6. Detect overlapping sibling runs

The messiest failures come from two helpers trying to be useful at the same time.
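Overlap is checkable if each run records a start, an end, and an output root. This is a sketch with hypothetical field names (`label`, `start`, `end`, `output_root`); the start and end values only need to be comparable, such as epoch seconds.

```python
from itertools import combinations


def overlapping_runs(runs: list[dict]) -> list[tuple[str, str]]:
    """Return pairs of run labels whose time windows overlap and share an output root."""
    pairs = []
    for a, b in combinations(runs, 2):
        same_root = a["output_root"] == b["output_root"]
        # Two half-open windows overlap when each starts before the other ends.
        overlap = a["start"] < b["end"] and b["start"] < a["end"]
        if same_root and overlap:
            pairs.append((a["label"], b["label"]))
    return pairs
```

A hit does not say which run is canonical. It only tells you that two helpers were writing to the same place at the same time, which is where to start looking.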

Check:

Safe checks:

Do not do:

7. Check board dispatcher liveness

A board can be healthy, empty, blocked, or just neglected. Those are different.
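The four states can be told apart from a few counts. The rules and the 60-minute staleness threshold below are illustrative assumptions, not a definition any board tool uses, and "healthy" here only means none of the other three rules fired.

```python
def classify_board(ready: int, running: int, blocked: int,
                   minutes_since_last_claim: float,
                   stale_after_minutes: float = 60.0) -> str:
    """Rough classification of a board from card counts and claim recency."""
    if ready == 0 and running == 0 and blocked == 0:
        return "empty"
    if ready == 0 and running == 0 and blocked > 0:
        return "blocked"
    # Ready work that nobody has claimed for a long time suggests a dead dispatcher.
    if ready > 0 and running == 0 and minutes_since_last_claim > stale_after_minutes:
        return "neglected"
    return "healthy"
```

An "empty" result still needs a human question: is there obvious safe work that should have been on the board?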

Check:

Safe checks:

Do not do:

8. Find report-only prompts

Some scheduled prompts ask the agent to "check" or "report" but never require a concrete action. That produces polite wallpaper.
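An action contract is easy to check mechanically. The outcome names below are the ones used in the failure mode matrix on this page; the `OUTCOME: evidence` line format is an assumption made for this sketch.

```python
import re
from typing import Optional

OUTCOMES = ("ACTION_TAKEN", "BLOCKED", "NO_SAFE_ACTION_AVAILABLE")

# One outcome keyword at the start of a line, a colon, then non-empty evidence.
_OUTCOME_LINE = re.compile(r"^(%s):[ \t]*(\S.*)$" % "|".join(OUTCOMES), re.MULTILINE)


def parse_outcome(report: str) -> Optional[tuple[str, str]]:
    """Return (outcome, evidence) if the report declares an outcome with evidence."""
    match = _OUTCOME_LINE.search(report)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()
```

A report that returns `None` here is wallpaper by construction, however well it reads.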

Check:

Safe checks:

Do not do:

9. Clear stuck scheduled cards without creating zombies

A stuck card may be correctly gated. Or it may be a ghost from a previous plan.

Check:

Safe checks:

Do not do:

10. Require durable-output proof

The cure for scheduled-agent fog is a receipt.
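A receipt is only worth something if it comes from reading the artifact back. A minimal sketch using the standard library; the returned field names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def artifact_receipt(path: str) -> dict:
    """Read the artifact back from disk and record its size and SHA-256."""
    data = Path(path).read_bytes()  # raises if the file is missing, which is the point
    return {
        "path": path,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Because the function reads the file instead of trusting the writer, a missing or empty artifact shows up as an exception or a zero byte count, not as a cheerful sentence.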

Minimum receipt fields:

Safe checks:

Do not do:

Safe remediation patterns

Operator rule worth taping to the monitor

Do not ask whether the goblin was busy. Ask what changed, where the proof lives, and what should not happen next.

That is how scheduled agents stop being haunted office decor and start being operations.

Failure mode matrix

Use this as the blunt operator table: symptom -> likely cause -> safe check -> do not do.

| # | Symptom | Likely cause | Safe check | Do not do |
|---|---------|--------------|------------|-----------|
| 1 | Job exists but nothing happens | Cadence is wrong, job is paused, repeat limit expired, or next run is later than assumed | Inspect scheduler job list, last-run time, next-run time, pause/repeat state, and timezone | Do not create a duplicate schedule before proving the first one is dead |
| 2 | Job runs too rarely for launch mode | Old cadence survived after priorities changed | Compare intended operating mode with actual cadence and recent run timestamps | Do not spam every channel with status noise; change cadence with a reporting rule |
| 3 | Agent runs but writes into the wrong project | Missing or wrong workdir | Check job workdir, output path, project instructions loaded, and artifact location | Do not move scratch files into public delivery without recording recovery |
| 4 | Agent uses the wrong tone, memory, or authority | Wrong profile or profile contamination | Confirm the scheduled job profile and inspect completion metadata for profile/workdir | Do not edit another profile's memory, skills, cron, or secrets by accident |
| 5 | Interactive test works; scheduled run fails | Scheduled environment lacks PATH, browser session, mounted secret, or toolset | Run a sanitized preflight that checks command/tool availability and auth status without printing secrets | Do not paste tokens, exact secret paths, recovery details, or account notes into prompts/logs |
| 6 | Tool fails with missing command or unavailable skill | Worker runtime differs from planning runtime | Inspect error class and loaded skills/toolsets; retry without unavailable forced skills when safe | Do not keep dispatching the same broken forced-skill card |
| 7 | Run stops mid-task | Timeout ceiling, crash, OOM, browser hang, or provider timeout | Inspect outcome, start/end timestamps, heartbeat history, and partial artifact timestamps | Do not simply raise timeout forever; chunk work and add checkpoints |
| 8 | Run is still "running" but nobody knows progress | No heartbeat or long silent subprocess | Check latest heartbeat and process output; require progress notes for long bounded work | Do not leave bounded jobs silent in background mode without completion notification |
| 9 | No operator notification arrived | Delivery route failed, stdout was empty, or notification source not subscribed | Check run history and artifact separately from notification logs | Do not assume no notification means no work ran |
| 10 | Notification says success but artifact is missing | Report was delivered instead of proof; artifact stayed in scratch | Read back the durable file and verify bytes/hash | Do not mark complete from prose alone |
| 11 | Two similar tasks appear | Overlapping sibling scheduled runs or missing idempotency | Compare run windows, card creators, output roots, and idempotency keys | Do not merge duplicate outputs without identifying canonical source |
| 12 | Board is all-done but project is stalled | Scheduler rewarded task completion, not business outcome | Inspect ready/running/todo/blocked/scheduled counts and next safe work | Do not call an empty board healthy if obvious safe work exists |
| 13 | Ready cards do not get claimed | Dispatcher is down, board store is unhealthy, or assignee/profile is invalid | Inspect dispatcher liveness, board events, and store integrity/status output | Do not create more ready cards to hide dispatcher failure |
| 14 | Todo cards never promote | Parent dependency not done, wrong parent edge, or parent stuck in review/block | Inspect parent task IDs and actual parent statuses | Do not rely on prose dependencies; use explicit parent-child edges |
| 15 | Blocked card stays blocked after answer | Unblock/comment did not route or worker was not respawned | Inspect comment thread, recent events, and status transitions | Do not start a parallel replacement unless the original is intentionally superseded |
| 16 | Scheduled/parked card looks idle | It is intentionally gated on registration, credentials, payment, service restart, gateway change, or human decision | Read card body and comments for hard-gate reason | Do not unpark hard-gated work just to make metrics look alive |
| 17 | Prompt produces polished status reports | Prompt is report-only and lacks action contract | Read the scheduled prompt and check whether it requires ACTION_TAKEN, BLOCKED, or NO_SAFE_ACTION_AVAILABLE with evidence | Do not accept a narrative report as completion of an action task |
| 18 | Agent says it deployed, but URL/file says otherwise | Verification checked the wrong layer or stopped at package creation | Verify actual destination: durable local path, public URL, route smoke, or manifest | Do not promote a package-ready artifact as public-live |
| 19 | Public draft contains private internals | Raw logs, paths, IDs, account notes, or provider details were pasted into content | Scan public markdown for private paths, IDs, secrets, account identifiers, commercial promises, and affiliation claims | Do not publish raw operational logs; summarize with toy examples |
| 20 | Gateway/service restart instruction appears in a card | Unsafe root task mixed credential wiring with normal board work | Split safe local prep from hard-gated operational change; park or archive the unsafe root with a comment | Do not restart services, change gateways, wire credentials, accept TOS, or create accounts from a gateway conversation |
| 21 | Artifact exists but no one can find it later | Final output stayed in scheduler/worker scratch | Confirm deliverable lives in a durable project output root and completion metadata names it | Do not complete with only a scratch path |
| 22 | Logs show traffic, but meaning is unclear | Build/tool/bot activity polluted signal | Separate delivery proof from demand/lead/revenue evidence | Do not treat route logs as market validation without source-quality checks |
| 23 | A task was completed after a crash left partial files | Worker recovered artifact without independent verification | Read recovered files, regenerate manifests/checks, and route to verifier if needed | Do not pretend a crash-recovery path was a clean first-pass success |
| 24 | Scheduled job recursively creates more jobs/tasks | Prompt lacks scope and stop conditions | Inspect created-card ledger, idempotency keys, and schedule creation permissions | Do not let cron jobs schedule more cron jobs or create unbounded board loops |

Minimum safe closeout for any scheduled-agent incident

  • Name the symptom and likely cause.
  • Identify the canonical run/task/artifact.
  • Record whether notification proof is proven, failed, or unproven.
  • Record durable artifact bytes and SHA-256 when local files are produced.
  • Record external side effects exactly, including "none" when none happened.
  • Record spend exactly, including $0.00 when true.
  • Record what not to do next so the recovery does not create a bigger incident.
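The closeout list above can be enforced with a small validator. This is a sketch with hypothetical field names; the artifact fields are only required when a local artifact path is recorded, and answers like "none" or 0 count as filled in.

```python
REQUIRED_FIELDS = (
    "symptom", "likely_cause", "canonical_run", "notification_proof",
    "external_side_effects", "spend", "do_not_do_next",
)
ARTIFACT_FIELDS = ("artifact_bytes", "artifact_sha256")


def _blank(value) -> bool:
    """Treat None and empty strings as blank; 0 and 'none' are real answers."""
    return value is None or (isinstance(value, str) and not value.strip())


def closeout_gaps(record: dict) -> list[str]:
    """List closeout fields that are missing, blank, or outside the allowed values."""
    gaps = [f for f in REQUIRED_FIELDS if _blank(record.get(f))]
    if not _blank(record.get("artifact_path")):
        gaps += [f for f in ARTIFACT_FIELDS if _blank(record.get(f))]
    proof = record.get("notification_proof")
    if not _blank(proof) and proof not in ("proven", "failed", "unproven"):
        gaps.append("notification_proof must be proven, failed, or unproven")
    return gaps
```

An incident is ready to close when `closeout_gaps` returns an empty list, and not a moment before.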

Ana takeaway

Do not ask whether the goblin was busy. Ask what changed, where the proof lives, whether the right profile and workdir ran, and what must not happen next.


Public-safety note: this static staged page does not perform account, credential, payment, outreach, deployment, provider, gateway, DNS, service, or spend actions. Examples are fictional or generic placeholders.