Audience: operators who schedule agents, board workers, or recurring automation and then need proof that the thing actually ran.
Promise: a practical checklist for finding why scheduled agents go silent, run in the wrong context, time out, report instead of acting, or leave zombie work on a board.
The blunt premise
A scheduled agent is not operational because a cron row exists. It is operational when you can answer four questions without vibes:
- Did the scheduler fire?
- Did the right agent run in the right profile, workdir, and environment?
- Did it do the intended action instead of writing a pretty status report?
- Where is the durable proof: artifact, receipt, log, notification, or board state?
If any answer is "the agent said so," congratulations, you own a haunted calendar reminder.
Use this when
- A recurring job says it is active but nothing visible happens.
- A board goes empty/all-done while obvious safe work still exists.
- A job posts a report but does not create the required artifact.
- A scheduled agent writes into a scratch folder and calls it delivery.
- A worker runs with the wrong profile, memory, secrets, tool permissions, or working directory.
- A timeout, crash, or silent stdout makes the operator think the system is fine.
- Two sibling runs overlap and produce duplicate, stale, or zombie board work.
Public-safe toy model
Use placeholders, not real IDs:
- Job name: <scheduled-job-name>
- Profile: <agent-profile>
- Workdir: <project-root>
- Board: <board-name>
- Deliverable: <durable-output-root>/<artifact-name>
- Notification target: <operator-channel>
Do not paste real cron IDs, channel IDs, delivery routes, private hostnames, token values, account emails, exact secret filenames, or raw logs with identities into public resources.
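One way to enforce that rule before anything goes public is a crude pattern scan. The sketch below is illustrative only: the pattern set, the names `PATTERNS` and `scan_public_text`, and the 32-character token threshold are assumptions, and the patterns will miss many real identifiers and flag some harmless strings. It reports which classes of leak were found, never the matched values.

```python
import re

# Illustrative patterns only (assumed, not exhaustive); tune them for your own environment.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "home_path": re.compile(r"(?:/home/|/Users/)[^\s/]+"),
    "long_token": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),
}


def scan_public_text(text):
    """Return sorted names of the pattern classes found, never the matched values."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


# Placeholders pass; a private path and an address do not.
print(scan_public_text("Job name: <scheduled-job-name>"))
print(scan_public_text("ask ops@example.com, logs are in /home/alice/logs"))
```

A clean scan is not proof of a clean draft. Treat it as a tripwire and still read the copy.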
15-minute triage
1. Confirm schedule cadence
Ask:
- What cadence was intended: one-shot, every N minutes, hourly, daily, or event-driven?
- Is the job paused, expired, repeat-limited, or still waiting for its first tick?
- Was the scheduler updated after the task changed?
- Is the system clock/time zone what the schedule expects?
Safe checks:
- Inspect the scheduler's job list in the admin surface or CLI.
- Compare expected next-run time with current time.
- Check the most recent completed run, not just the job definition.
Do not do:
- Do not create a second duplicate schedule because the first one "looks dead" before checking last-run evidence.
- Do not shorten cadence to noisy spam without a clear operator-reporting rule.
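The "compare expected next-run time with current time" check can be sketched as follows. This is a minimal illustration, assuming you can read the last completed run time and the intended interval from your scheduler. The function name `is_overdue` and the `grace` parameter are hypothetical and not part of any scheduler's API.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_run, interval, now, grace=timedelta(minutes=5)):
    """Return True when the next expected run is more than `grace` past due.

    Timezone-aware datetimes are required so a clock or time-zone mismatch
    fails loudly instead of silently producing a wrong answer.
    """
    if last_run.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    expected_next = last_run + interval
    return now > expected_next + grace


# Hypothetical values: an hourly job whose last completed run was three hours ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(hours=3)
print(is_overdue(last_run, timedelta(hours=1), now))
```

Note that the input is the last completed run, not the job definition. An overdue result is a reason to inspect the existing job, not to create a second one.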
2. Verify profile and working directory
Scheduled agents often fail because they run as the wrong self.
Check:
- Is the job pinned to the intended profile?
- Does the job run from the intended project root?
- Does the prompt name the durable output path explicitly?
- Are project-specific instructions loaded from that workdir?
- Did the worker write into scratch instead of the required output root?
Safe checks:
- Compare the job profile/workdir against the resource or task contract.
- Inspect the created artifact path and confirm it is outside scheduler scratch space.
- Read back the file from the durable path.
Do not do:
- Do not copy artifacts from a hidden scratch workspace and pretend the original completion was fine. Treat that as recovery evidence and record the repair.
- Do not modify another profile's memories, skills, cron jobs, or secrets unless the operator explicitly asked.
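The "outside scheduler scratch space" check is easy to automate. The sketch below assumes you know both the durable output root and the scratch root. The paths shown are made-up placeholders and the function names are hypothetical.

```python
from pathlib import Path


def is_under(path, root):
    """True if `path` resolves to `root` itself or a location inside it."""
    resolved = Path(path).resolve()
    root_resolved = Path(root).resolve()
    return resolved == root_resolved or root_resolved in resolved.parents


def artifact_location_ok(artifact, durable_root, scratch_root):
    """The artifact must live under the durable root and not under scratch."""
    return is_under(artifact, durable_root) and not is_under(artifact, scratch_root)


# Placeholder paths for illustration only.
print(artifact_location_ok("/srv/project/output/report.md",
                           "/srv/project/output", "/tmp/scheduler-scratch"))
print(artifact_location_ok("/tmp/scheduler-scratch/report.md",
                           "/srv/project/output", "/tmp/scheduler-scratch"))
```

This only validates the location. It does not replace reading the file back from the durable path.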
3. Verify environment, secrets, and tool access without exposing them
A scheduled run may not inherit the same shell, browser profile, PATH, API keys, or toolset as an interactive run.
Check:
- Does the job have the needed toolset enabled?
- Does it rely on a browser session or credential that is only present interactively?
- Are required environment variables mounted for that profile?
- Are secrets available by reference only, not pasted into the prompt?
- Did a tool fail silently because it was unavailable in the worker runtime?
Safe checks:
- Use provider-neutral smoke tests that return status, not secret values.
- Check for "missing command," "permission denied," "auth required," or "tool unavailable" in run output.
- Record only pass/fail and sanitized error class.
Do not do:
- Do not paste tokens, recovery codes, private account notes, or exact secret filenames into a board card or public draft.
- Do not solve a missing credential by sending secrets through a chat.
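A smoke test that returns status rather than secret values might look like the sketch below. The command and variable names passed in are hypothetical examples, and the error-class labels (`missing_command`, `missing_env`) are an assumed convention, not a standard.

```python
import os
import shutil


def preflight(commands, env_vars):
    """Return pass/fail per requirement without exposing any secret value."""
    results = {}
    for cmd in commands:
        # shutil.which looks the command up on the PATH of *this* runtime.
        results[f"command:{cmd}"] = "pass" if shutil.which(cmd) else "fail:missing_command"
    for name in env_vars:
        # Only presence is checked; the value never enters the report.
        results[f"env:{name}"] = "pass" if os.environ.get(name) else "fail:missing_env"
    return results


# Hypothetical requirements for a scheduled worker.
for check, status in preflight(["git"], ["EXAMPLE_API_TOKEN"]).items():
    print(check, status)
```

Run it inside the scheduled runtime, not an interactive shell. The whole point is to catch the difference between the two.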
4. Inspect timeout ceilings and crash behavior
Scheduled agents die in boring ways: runtime cap, stale lock, OOM, provider timeout, browser hang, or a child process that never exits.
Check:
- What max runtime applies to the job or board worker?
- Did the process finish, timeout, crash, get reclaimed, or keep running with no heartbeat?
- Are long operations sending heartbeats with progress?
- Does the run have a final summary, artifact, or error row?
Safe checks:
- Look for run outcome, started/ended timestamps, heartbeat times, and last output.
- Compare artifact timestamps against the run window.
- For bounded jobs, use completion notifications or explicit final receipts.
Do not do:
- Do not leave long bounded work in silent background mode with no completion notification.
- Do not increase timeout forever. Chunk the task or add checkpoints.
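The timestamp checks above can be reduced to a small classifier. This is a sketch under assumptions: the four labels, the `heartbeat_gap` threshold, and the idea that your scheduler exposes start, end, and heartbeat times are all hypothetical and will differ between systems.

```python
from datetime import datetime, timedelta, timezone


def classify_run(started, ended, last_heartbeat, now, max_runtime, heartbeat_gap):
    """Classify a run from its timestamps; `ended` and `last_heartbeat` may be None."""
    if ended is not None:
        return "ended"  # still check the outcome and the artifact separately
    if now - started > max_runtime:
        return "over_runtime_cap"
    last_signal = last_heartbeat or started
    if now - last_signal > heartbeat_gap:
        return "stale_no_heartbeat"
    return "running"


# Hypothetical run: started 40 minutes ago, last heartbeat 35 minutes ago.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
now = start + timedelta(minutes=40)
print(classify_run(start, None, start + timedelta(minutes=5), now,
                   max_runtime=timedelta(hours=1),
                   heartbeat_gap=timedelta(minutes=10)))
```

A "running" row with a stale heartbeat is the case that fools operators, which is why it gets its own label here.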
5. Check notification silence separately from execution
A job can complete and fail to notify. A notification can succeed while the actual artifact is missing. Treat them as separate systems.
Check:
- Was the job supposed to deliver a final message, write a file, update a board, or all three?
- Is notification emission proven, unproven, or failed?
- Was stdout empty by design, or did an empty stdout suppress a watchdog message?
- Does the operator channel accept notifications from that profile/board?
Safe checks:
- Inspect job run status and the target artifact independently.
- Label notification proof as proven/unproven instead of guessing.
- For watchdogs, decide whether empty stdout means "silent success" or "bug."
Do not do:
- Do not treat "no notification" as "no run" until you check run history.
- Do not treat "notification sent" as proof the artifact exists.
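Keeping execution, artifact, and notification as three separate facts can be made explicit in a small record. The sketch below is one possible shape. The class names, field names, and verdict strings are hypothetical; only the proven/unproven/failed labels come from the checklist above.

```python
from dataclasses import dataclass
from enum import Enum


class Proof(Enum):
    PROVEN = "proven"
    UNPROVEN = "unproven"
    FAILED = "failed"


@dataclass
class RunEvidence:
    run_completed: bool    # from run history, not from the notification
    artifact_exists: bool  # from reading the durable path back
    notification: Proof    # from delivery logs, if any exist


def summarize(evidence):
    """Return a verdict that never lets one signal stand in for another."""
    if not evidence.run_completed:
        return "run_not_completed"
    if not evidence.artifact_exists:
        return "artifact_missing"
    if evidence.notification is not Proof.PROVEN:
        return f"artifact_ok_notification_{evidence.notification.value}"
    return "ok"


# A delivered notification does not rescue a missing artifact.
print(summarize(RunEvidence(True, False, Proof.PROVEN)))
```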
6. Detect overlapping sibling runs
The messiest failures come from two helpers trying to be useful at the same time.
Check:
- Did two scheduled jobs create similar tasks?
- Did a retry start before the first run released its lock?
- Did a producer and worker both update the same status file?
- Did a board task get reclaimed while a previous worker kept writing?
Safe checks:
- Compare run windows, task IDs, output roots, and artifact hashes.
- Identify the canonical artifact and mark duplicates as superseded.
- Use parent-child dependencies instead of prose promises.
Do not do:
- Do not merge duplicate outputs blindly.
- Do not dispatch follow-up cards until their parents are actually complete.
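Comparing run windows is a plain interval-overlap test. The sketch below assumes each run has a sanitized label plus start and end timestamps. The labels and times are invented for illustration.

```python
from datetime import datetime
from itertools import combinations


def overlapping_runs(runs):
    """Return pairs of run labels whose [start, end) windows overlap.

    `runs` maps a sanitized run label to a (start, end) tuple.
    """
    pairs = []
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(sorted(runs.items()), 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            pairs.append((a, b))
    return pairs


# Hypothetical run windows on the same day.
runs = {
    "run-a": (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    "run-b": (datetime(2024, 1, 1, 10, 20), datetime(2024, 1, 1, 10, 40)),
    "run-c": (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 10)),
}
print(overlapping_runs(runs))
```

An overlap tells you where to look. Deciding which artifact is canonical still needs the task IDs, output roots, and hashes.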
7. Check board dispatcher liveness
A board can be healthy, empty, blocked, or just neglected. Those are different.
Check:
- Are there ready cards that have not been claimed?
- Are todo cards waiting on parents that are actually done?
- Are blocked cards still blocked after the human answered?
- Are scheduled/parked cards intentionally held because they require registration, credentials, payment, or service changes?
- Did the board index or state store fail and prevent dispatch?
Safe checks:
- Inspect counts by status: ready, running, todo, blocked, scheduled, done.
- Inspect recent events for crashes, reclaimed tasks, stale locks, and creation loops.
- Verify the dispatcher can read the board store and run integrity checks if needed.
Do not do:
- Do not treat "all done" as success if there is obvious next safe work and no real blocker.
- Do not unpark hard-gate work just to make the board look active.
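The counts-by-status check can be sketched in a few lines. The card shape (a dict with a `status` key) and the `possible_dispatcher_stall` flag are assumptions for illustration; a real board will have its own export format, and the flag is a hint rather than a diagnosis.

```python
from collections import Counter

STATUSES = ("ready", "running", "todo", "blocked", "scheduled", "done")


def board_summary(cards):
    """Count cards per status and flag a board that has ready work nobody is running."""
    counts = Counter(card["status"] for card in cards)
    summary = {status: counts.get(status, 0) for status in STATUSES}
    # Ready cards with nothing running suggests the dispatcher is not claiming work.
    summary["possible_dispatcher_stall"] = summary["ready"] > 0 and summary["running"] == 0
    return summary


# Hypothetical board export.
cards = [{"status": "ready"}, {"status": "ready"}, {"status": "done"}, {"status": "blocked"}]
print(board_summary(cards))
```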
8. Find report-only prompts
Some scheduled prompts ask the agent to "check" or "report" but never require a concrete action. That produces polite wallpaper.
Check:
- Does the prompt require one of: action taken, blocked with reason, or no safe action available with evidence?
- Does it name output files and acceptance checks?
- Does it forbid vague status reports?
- Does it require the agent to create, update, verify, or route something?
Safe checks:
- Read the prompt like a contract. If it can succeed by summarizing, it is too weak.
- Add explicit action outcomes and durable artifact requirements.
Do not do:
- Do not celebrate a well-written report if the intended board/resource/deploy action did not happen.
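If a prompt is a contract, the run result can be validated against it. The sketch below assumes the agent returns a small structured result. The outcome labels match the ones named in the failure mode matrix, while the field names (`outcome`, `evidence`, `artifacts`) and the rule that an action needs an artifact are hypothetical choices.

```python
ALLOWED_OUTCOMES = {"ACTION_TAKEN", "BLOCKED", "NO_SAFE_ACTION_AVAILABLE"}


def validate_result(result):
    """Return a list of contract violations; an empty list means acceptable."""
    problems = []
    if result.get("outcome") not in ALLOWED_OUTCOMES:
        problems.append("outcome must be one of " + ", ".join(sorted(ALLOWED_OUTCOMES)))
    if not result.get("evidence"):
        problems.append("evidence is required for every outcome")
    if result.get("outcome") == "ACTION_TAKEN" and not result.get("artifacts"):
        problems.append("ACTION_TAKEN requires at least one durable artifact path")
    return problems


# A narrative summary with no outcome and no evidence fails the contract.
print(validate_result({"summary": "Everything looks healthy."}))
```

A result that can only pass by naming an outcome and its evidence cannot succeed by summarizing.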
9. Clear stuck scheduled cards without creating zombies
A stuck card may be correctly gated. Or it may be a ghost from a previous plan.
Check:
- Why is it scheduled/parked: cadence, parent dependency, human gate, registration hold, or stale root task?
- Does the card contain hard-gated operations such as account creation, credential wiring, payment, DNS, provider changes, gateway/service restarts, public posting, or client commitments?
- Is there a safe child task that can proceed without those gated operations?
Safe checks:
- Comment the reason for parking or archiving.
- Split safe local prep from hard-gated external action.
- Keep created-card IDs and dependencies explicit.
Do not do:
- Do not restart services, change gateways, wire credentials, create accounts, accept TOS, or route around registration friction from a gateway conversation.
- Do not leave a root task parked forever while safe child tasks are missing.
10. Require durable-output proof
The cure for scheduled-agent fog is a receipt.
Minimum receipt fields:
- job or task label, sanitized
- intended action
- source inputs inspected
- durable artifact paths or public URLs, if any
- byte size and SHA-256 for local artifacts
- exact verification checks and results
- approval status
- external side effects
- spend
- known gaps
Safe checks:
- Read the file back from the durable path.
- Validate JSON with a parser.
- Scan public copy for private paths, tokens, account identifiers, commercial/service promises, affiliations, and unsupported vendor claims.
- Record no external side effects and $0.00 spend when true.
Do not do:
- Do not mark the task done with only "drafted" or "looks good."
- Do not publish, post, deploy, or promote unless that is explicitly in scope and verified.
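A receipt with byte size and SHA-256 can be produced by reading the artifact back from disk. The sketch below covers only some of the minimum fields, and the field names and function names are hypothetical. The demo writes a throwaway file to a temporary directory purely so the example runs; a real receipt would point at the durable output root.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def artifact_proof(path):
    """Read the file back from its path and return its size and SHA-256."""
    data = Path(path).read_bytes()
    return {"path": str(path), "bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def build_receipt(job_label, intended_action, artifact_paths, external_side_effects, spend):
    """Assemble a partial receipt; side effects and spend must be stated explicitly."""
    return {
        "job": job_label,
        "intended_action": intended_action,
        "artifacts": [artifact_proof(p) for p in artifact_paths],
        "external_side_effects": external_side_effects,
        "spend": spend,
    }


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.txt"
    artifact.write_bytes(b"hello")
    receipt = build_receipt("<scheduled-job-name>", "write artifact", [artifact],
                            external_side_effects="none", spend="$0.00")
    # Round-tripping through a JSON parser doubles as the "validate JSON" check.
    print(json.dumps(receipt, indent=2))
```

Side effects and spend have no default values on purpose, so "none" and "$0.00" are recorded only when someone asserts them.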
Safe remediation patterns
- Wrong cadence: update the existing job after documenting old/new cadence and why. Do not create a duplicate.
- Wrong workdir/profile: update the job to pin profile and workdir, then run one smoke test and verify the artifact path.
- Missing env/tool: add a preflight check that fails loudly with a sanitized error class.
- Timeout: split into smaller tasks, add heartbeats, and save partial durable progress.
- Silent notification: separate run success from delivery proof; add receipt file or delivery audit.
- Overlap: add a lock, idempotency key, or parent dependency.
- Report-only: rewrite the prompt around action outcomes and acceptance criteria.
- Zombie card: archive only with a comment explaining supersession, or create a safe child and leave the hard-gated parent parked.
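For the overlap pattern, an idempotency key can be derived from what makes a unit of work unique. The sketch below assumes that is the job label, the scheduled window, and the action. The `CardLedger` class is a hypothetical in-memory stand-in; a real ledger would need durable storage shared by every worker, plus a lock around the check-and-create step.

```python
import hashlib


def idempotency_key(job_label, window_start_iso, action):
    """Derive a stable key so a retry of the same window maps to the same work."""
    raw = "\x1f".join([job_label, window_start_iso, action])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class CardLedger:
    """In-memory illustration only; not safe across processes."""

    def __init__(self):
        self._seen = {}

    def create_once(self, key, card_title):
        """Create a card only if the key is new; return (card_id, created)."""
        if key in self._seen:
            return self._seen[key], False
        card_id = f"card-{len(self._seen) + 1}"
        self._seen[key] = card_id
        return card_id, True


ledger = CardLedger()
key = idempotency_key("<scheduled-job-name>", "2024-01-01T10:00:00Z", "refresh-report")
first = ledger.create_once(key, "Refresh report")
retry = ledger.create_once(key, "Refresh report")
print(first, retry)
```

The retry returns the original card ID with `created` set to False, so a second sibling run has nothing new to write.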
Operator rule worth taping to the monitor
Do not ask whether the goblin was busy. Ask what changed, where the proof lives, and what should not happen next.
That is how scheduled agents stop being haunted office decor and start being operations.
Failure mode matrix
Use this as the blunt operator table: symptom -> likely cause -> safe check -> do not do.
| # | Symptom | Likely cause | Safe check | Do not do |
|---|---|---|---|---|
| 1 | Job exists but nothing happens | Cadence is wrong, job is paused, repeat limit expired, or next run is later than assumed | Inspect scheduler job list, last-run time, next-run time, pause/repeat state, and timezone | Do not create a duplicate schedule before proving the first one is dead |
| 2 | Job runs too rarely for launch mode | Old cadence survived after priorities changed | Compare intended operating mode with actual cadence and recent run timestamps | Do not spam every channel with status noise; change cadence with a reporting rule |
| 3 | Agent runs but writes into the wrong project | Missing or wrong workdir | Check job workdir, output path, project instructions loaded, and artifact location | Do not move scratch files into public delivery without recording recovery |
| 4 | Agent uses the wrong tone, memory, or authority | Wrong profile or profile contamination | Confirm the scheduled job profile and inspect completion metadata for profile/workdir | Do not edit another profile's memory, skills, cron, or secrets by accident |
| 5 | Interactive test works; scheduled run fails | Scheduled environment lacks PATH, browser session, mounted secret, or toolset | Run a sanitized preflight that checks command/tool availability and auth status without printing secrets | Do not paste tokens, exact secret paths, recovery details, or account notes into prompts/logs |
| 6 | Tool fails with missing command or unavailable skill | Worker runtime differs from planning runtime | Inspect error class and loaded skills/toolsets; retry without unavailable forced skills when safe | Do not keep dispatching the same broken forced-skill card |
| 7 | Run stops mid-task | Timeout ceiling, crash, OOM, browser hang, or provider timeout | Inspect outcome, start/end timestamps, heartbeat history, and partial artifact timestamps | Do not simply raise timeout forever; chunk work and add checkpoints |
| 8 | Run is still "running" but nobody knows progress | No heartbeat or long silent subprocess | Check latest heartbeat and process output; require progress notes for long bounded work | Do not leave bounded jobs silent in background mode without completion notification |
| 9 | No operator notification arrived | Delivery route failed, stdout was empty, or notification source not subscribed | Check run history and artifact separately from notification logs | Do not assume no notification means no work ran |
| 10 | Notification says success but artifact is missing | Report was delivered instead of proof; artifact stayed in scratch | Read back the durable file and verify bytes/hash | Do not mark complete from prose alone |
| 11 | Two similar tasks appear | Overlapping sibling scheduled runs or missing idempotency | Compare run windows, card creators, output roots, and idempotency keys | Do not merge duplicate outputs without identifying canonical source |
| 12 | Board is all-done but project is stalled | Scheduler rewarded task completion, not business outcome | Inspect ready/running/todo/blocked/scheduled counts and next safe work | Do not call an empty board healthy if obvious safe work exists |
| 13 | Ready cards do not get claimed | Dispatcher is down, board store is unhealthy, or assignee/profile is invalid | Inspect dispatcher liveness, board events, and store integrity/status output | Do not create more ready cards to hide dispatcher failure |
| 14 | Todo cards never promote | Parent dependency not done, wrong parent edge, or parent stuck in review/block | Inspect parent task IDs and actual parent statuses | Do not rely on prose dependencies; use explicit parent-child edges |
| 15 | Blocked card stays blocked after answer | Unblock/comment did not route or worker was not respawned | Inspect comment thread, recent events, and status transitions | Do not start a parallel replacement unless the original is intentionally superseded |
| 16 | Scheduled/parked card looks idle | It is intentionally gated on registration, credentials, payment, service restart, gateway change, or human decision | Read card body and comments for hard-gate reason | Do not unpark hard-gated work just to make metrics look alive |
| 17 | Prompt produces polished status reports | Prompt is report-only and lacks action contract | Read the scheduled prompt and check whether it requires ACTION_TAKEN, BLOCKED, or NO_SAFE_ACTION_AVAILABLE with evidence | Do not accept a narrative report as completion of an action task |
| 18 | Agent says it deployed, but URL/file says otherwise | Verification checked the wrong layer or stopped at package creation | Verify actual destination: durable local path, public URL, route smoke, or manifest | Do not promote a package-ready artifact as public-live |
| 19 | Public draft contains private internals | Raw logs, paths, IDs, account notes, or provider details were pasted into content | Scan public markdown for private paths, IDs, secrets, account identifiers, commercial promises, and affiliation claims | Do not publish raw operational logs; summarize with toy examples |
| 20 | Gateway/service restart instruction appears in a card | Unsafe root task mixed credential wiring with normal board work | Split safe local prep from hard-gated operational change; park or archive the unsafe root with a comment | Do not restart services, change gateways, wire credentials, accept TOS, or create accounts from a gateway conversation |
| 21 | Artifact exists but no one can find it later | Final output stayed in scheduler/worker scratch | Confirm deliverable lives in a durable project output root and completion metadata names it | Do not complete with only a scratch path |
| 22 | Logs show traffic, but meaning is unclear | Build/tool/bot activity polluted signal | Separate delivery proof from demand/lead/revenue evidence | Do not treat route logs as market validation without source-quality checks |
| 23 | A task was completed after a crash left partial files | Worker recovered artifact without independent verification | Read recovered files, regenerate manifests/checks, and route to verifier if needed | Do not pretend a crash-recovery path was a clean first-pass success |
| 24 | Scheduled job recursively creates more jobs/tasks | Prompt lacks scope and stop conditions | Inspect created-card ledger, idempotency keys, and schedule creation permissions | Do not let cron jobs schedule more cron jobs or create unbounded board loops |
Minimum safe closeout for any scheduled-agent incident
- Name the symptom and likely cause.
- Identify the canonical run/task/artifact.
- Record whether notification proof is proven, failed, or unproven.
- Record durable artifact bytes and SHA-256 when local files are produced.
- Record external side effects exactly, including "none" when none happened.
- Record spend exactly, including $0.00 when true.
- Record what not to do next so the recovery does not create a bigger incident.