Cron & Scheduled-Agent Failure Modes Checklist

Audience: operators who schedule agents, board workers, or recurring automation and then need proof that the thing actually ran.

Promise: a practical checklist for finding why scheduled agents go silent, run in the wrong context, time out, report instead of acting, or leave zombie work on a board.

The blunt premise

A scheduled agent is not operational because a cron row exists. It is operational when you can answer four questions without vibes:

Did the scheduler fire?
Did the right agent run in the right profile, workdir, and environment?
Did it do the intended action instead of writing a pretty status report?
Where is the durable proof: artifact, receipt, log, notification, or board state?

If any answer is "the agent said so," congratulations, you own a haunted calendar reminder.

Use this when

A recurring job says it is active but nothing visible happens.
A board goes empty/all-done while obvious safe work still exists.
A job posts a report but does not create the required artifact.
A scheduled agent writes into a scratch folder and calls it delivery.
A worker runs with the wrong profile, memory, secrets, tool permissions, or working directory.
A timeout, crash, or silent stdout makes the operator think the system is fine.
Two sibling runs overlap and produce duplicate, stale, or zombie board work.

Public-safe toy model

Use placeholders, not real IDs:

Job name: <scheduled-job-name>
Profile: <agent-profile>
Workdir: <project-root>
Board: <board-name>
Deliverable: <durable-output-root>/<artifact-name>
Notification target: <operator-channel>

Do not paste real cron IDs, channel IDs, delivery routes, private hostnames, token values, account emails, exact secret filenames, or raw logs with identities into public resources.

15-minute triage

1. Confirm schedule cadence

Ask:

What cadence was intended: one-shot, every N minutes, hourly, daily, or event-driven?
Is the job paused, expired, repeat-limited, or still waiting for its first tick?
Was the scheduler updated after the task changed?
Is the system clock/time zone what the schedule expects?

Safe checks:

Inspect the scheduler's job list in the admin surface or CLI.
Compare expected next-run time with current time.
Check the most recent completed run, not just the job definition.

Do not do:

Do not create a second duplicate schedule because the first one "looks dead" before checking last-run evidence.
Do not shorten the cadence into noisy spam; pair any cadence change with a clear operator-reporting rule.
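
The cadence checks above can be sketched in a few lines. This is a minimal illustration, not any particular scheduler's API: the function name `cadence_status`, the status labels, and the five-minute grace window are all assumptions. It works from the last completed run rather than the job definition, and it expects timezone-aware datetimes so a clock or time-zone mismatch shows up as an error instead of a silent off-by-hours result.

```python
from datetime import datetime, timedelta, timezone


def cadence_status(last_run, interval, now, grace=timedelta(minutes=5)):
    """Classify a job from its last completed run, not from its definition.

    last_run: timezone-aware datetime of the most recent completed run, or None.
    interval: intended cadence as a timedelta.
    grace:    hypothetical tolerance before a late tick counts as overdue.
    """
    if last_run is None:
        return "never_ran"
    expected_next = last_run + interval
    if now <= expected_next:
        return "waiting_for_next_tick"
    if now <= expected_next + grace:
        return "due"
    return "overdue"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hourly job whose last completed run was three hours ago.
    print(cadence_status(now - timedelta(hours=3), timedelta(hours=1), now))
```

An "overdue" result is a reason to read run history, not a reason to create a second schedule.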

2. Verify profile and working directory

Scheduled agents often fail because they run as the wrong self.

Check:

Is the job pinned to the intended profile?
Does the job run from the intended project root?
Does the prompt name the durable output path explicitly?
Are project-specific instructions loaded from that workdir?
Did the worker write into scratch instead of the required output root?

Safe checks:

Compare the job profile/workdir against the resource or task contract.
Inspect the created artifact path and confirm it is outside scheduler scratch space.
Read back the file from the durable path.

Do not do:

Do not copy artifacts from a hidden scratch workspace and pretend the original completion was fine. If you must copy them, treat the copy as a recovery and record the repair.
Do not modify another profile's memories, skills, cron jobs, or secrets unless the operator explicitly asked.
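
One way to make "read back the file from the durable path" mechanical is a small path check. This is a sketch under stated assumptions: `is_durable` and its three parameters are hypothetical names, and the durable and scratch roots stand in for the `<durable-output-root>` placeholder and whatever scratch space your scheduler uses. It requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def is_durable(artifact: Path, durable_root: Path, scratch_root: Path) -> bool:
    """True only if the artifact is a non-empty file inside the durable root
    and outside scheduler scratch space."""
    artifact = artifact.resolve()
    inside_durable = artifact.is_relative_to(durable_root.resolve())
    inside_scratch = artifact.is_relative_to(scratch_root.resolve())
    return (
        inside_durable
        and not inside_scratch
        and artifact.is_file()          # it actually exists on disk
        and artifact.stat().st_size > 0  # and is not an empty placeholder
    )
```

Resolving paths first means a symlink from the output root into scratch does not pass as delivery.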

3. Verify environment, secrets, and tool access without exposing them

A scheduled run may not inherit the same shell, browser profile, PATH, API keys, or toolset as an interactive run.

Check:

Does the job have the needed toolset enabled?
Does it rely on a browser session or credential that is only present interactively?
Are required environment variables mounted for that profile?
Are secrets available by reference only, not pasted into the prompt?
Did a tool fail silently because it was unavailable in the worker runtime?

Safe checks:

Use provider-neutral smoke tests that return status, not secret values.
Check for "missing command," "permission denied," "auth required," or "tool unavailable" in run output.
Record only pass/fail and sanitized error class.

Do not do:

Do not paste tokens, recovery codes, private account notes, or exact secret filenames into a board card or public draft.
Do not solve a missing credential by sending secrets through a chat.
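
A smoke test that returns status rather than secret values might look like the following. It is a minimal sketch: `preflight`, the result keys, and the error-class labels `missing_command` and `missing_env` are illustrative assumptions. It reports only whether a command is on PATH and whether an environment variable is set, and never reads the variable's value into the result.

```python
import os
import shutil


def preflight(commands, env_vars):
    """Return pass/fail per requirement with a sanitized error class.

    Values of environment variables are never included in the output.
    """
    results = []
    for cmd in commands:
        ok = shutil.which(cmd) is not None
        results.append({
            "check": f"command:{cmd}",
            "ok": ok,
            "error_class": None if ok else "missing_command",
        })
    for name in env_vars:
        ok = bool(os.environ.get(name))  # presence only, value is discarded
        results.append({
            "check": f"env:{name}",
            "ok": ok,
            "error_class": None if ok else "missing_env",
        })
    return results
```

Run it inside the scheduled runtime, not your interactive shell; the whole point is that the two may differ.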

4. Inspect timeout ceilings and crash behavior

Scheduled agents die in boring ways: runtime cap, stale lock, OOM, provider timeout, browser hang, or a child process that never exits.

Check:

What max runtime applies to the job or board worker?
Did the process finish, timeout, crash, get reclaimed, or keep running with no heartbeat?
Are long operations sending heartbeats with progress?
Does the run have a final summary, artifact, or error row?

Safe checks:

Look for run outcome, started/ended timestamps, heartbeat times, and last output.
Compare artifact timestamps against the run window.
For bounded jobs, use completion notifications or explicit final receipts.

Do not do:

Do not leave long bounded work in silent background mode with no completion notification.
Do not keep raising the timeout. Chunk the task or add checkpoints.
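
The timestamp checks above can be expressed as a small classifier. This is a sketch, assuming your run records expose a start time, an optional end time, and an optional last-heartbeat time; the function name and the five labels are hypothetical. Timestamps alone cannot tell a clean finish from a crash, so an "ended" result still needs the final summary, artifact, or error row.

```python
from datetime import datetime, timedelta


def classify_run(started, ended, last_heartbeat, now, max_runtime, heartbeat_timeout):
    """Classify a run from timestamps only.

    ended and last_heartbeat may be None. All datetimes must share the same
    timezone convention (all naive or all aware).
    """
    if ended is not None:
        if ended - started > max_runtime:
            return "ended_over_ceiling"
        return "ended"
    if now - started > max_runtime:
        return "running_over_ceiling"   # likely hung or not reclaimed
    latest_sign_of_life = last_heartbeat or started
    if now - latest_sign_of_life > heartbeat_timeout:
        return "running_silent"         # alive on paper, no recent heartbeat
    return "running"
```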

5. Check notification silence separately from execution

A job can complete and fail to notify. A notification can succeed while the actual artifact is missing. Treat them as separate systems.

Check:

Was the job supposed to deliver a final message, write a file, update a board, or all three?
Is notification emission proven, unproven, or failed?
Was stdout empty by design, or did an empty stdout suppress a watchdog message?
Does the operator channel accept notifications from that profile/board?

Safe checks:

Inspect job run status and the target artifact independently.
Label notification proof as proven/unproven instead of guessing.
For watchdogs, decide whether empty stdout means "silent success" or "bug."

Do not do:

Do not treat "no notification" as "no run" until you check run history.
Do not treat "notification sent" as proof the artifact exists.
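
To keep execution, artifact, and delivery from collapsing into one boolean, it can help to record them as three separate fields. The sketch below assumes nothing about any real scheduler; `Proof`, `RunEvidence`, and `verdict` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Proof(Enum):
    PROVEN = "proven"
    UNPROVEN = "unproven"
    FAILED = "failed"


@dataclass
class RunEvidence:
    run_completed: bool    # from run history, not from the notification
    artifact_exists: bool  # from reading the durable path
    notification: Proof    # from delivery logs, or UNPROVEN if unchecked


def verdict(evidence: RunEvidence) -> str:
    """Summarize the three systems without letting one stand in for another."""
    if not evidence.run_completed:
        return "run did not complete"
    if not evidence.artifact_exists:
        return "run completed, artifact missing"
    return f"run completed, artifact present, notification {evidence.notification.value}"
```

Defaulting to `UNPROVEN` when nobody checked delivery is the honest option; it is not the same as `FAILED`.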

6. Detect overlapping sibling runs

The messiest failures come from two helpers trying to be useful at the same time.

Check:

Did two scheduled jobs create similar tasks?
Did a retry start before the first run released its lock?
Did a producer and worker both update the same status file?
Did a board task get reclaimed while a previous worker kept writing?

Safe checks:

Compare run windows, task IDs, output roots, and artifact hashes.
Identify the canonical artifact and mark duplicates as superseded.
Use parent-child dependencies instead of prose promises.

Do not do:

Do not merge duplicate outputs blindly.
Do not dispatch follow-up cards until their parents are actually complete.
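
Comparing run windows is a standard interval-overlap test. The following is a minimal sketch, assuming each run can be reduced to an ID plus start and end values that support comparison (datetimes or plain numbers); `overlapping_runs` is a hypothetical helper name.

```python
from itertools import combinations


def overlapping_runs(runs):
    """Return pairs of run IDs whose [start, end) windows intersect.

    runs: iterable of (run_id, start, end) tuples.
    """
    pairs = []
    for (a_id, a_start, a_end), (b_id, b_start, b_end) in combinations(list(runs), 2):
        # Two windows overlap when each one starts before the other ends.
        if a_start < b_end and b_start < a_end:
            pairs.append((a_id, b_id))
    return pairs


if __name__ == "__main__":
    # Toy data: minutes since midnight.
    print(overlapping_runs([("run-a", 0, 10), ("run-b", 5, 15), ("run-c", 20, 30)]))
```

A reported pair is a prompt to pick the canonical artifact, not permission to merge the two outputs.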

7. Check board dispatcher liveness

A board can be healthy, empty, blocked, or just neglected. Those are different.

Check:

Are there ready cards that have not been claimed?
Are todo cards waiting on parents that are actually done?
Are blocked cards still blocked after the human answered?
Are scheduled/parked cards intentionally held because they require registration, credentials, payment, or service changes?
Did the board index or state store fail and prevent dispatch?

Safe checks:

Inspect counts by status: ready, running, todo, blocked, scheduled, done.
Inspect recent events for crashes, reclaimed tasks, stale locks, and creation loops.
Verify that the dispatcher can read the board store; run integrity checks on the store if needed.

Do not do:

Do not treat "all done" as success if there is obvious next safe work and no real blocker.
Do not unpark hard-gate work just to make the board look active.
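
The status counts and the parent check can be sketched against a toy card list. Everything about the card shape here is an assumption: a dict with `id`, `status`, and an optional `parents` list of card IDs. Real boards will differ, so treat the helper names and flag wording as illustrative.

```python
from collections import Counter

STATUSES = ("ready", "running", "todo", "blocked", "scheduled", "done")


def board_summary(cards):
    """Count cards per known status; unknown statuses are ignored."""
    counts = Counter(card["status"] for card in cards)
    return {status: counts.get(status, 0) for status in STATUSES}


def stuck_todos(cards):
    """IDs of todo cards whose parents are all done (or that have no parents)."""
    status_by_id = {card["id"]: card["status"] for card in cards}
    return [
        card["id"]
        for card in cards
        if card["status"] == "todo"
        and all(status_by_id.get(p) == "done" for p in card.get("parents", []))
    ]


def liveness_flags(summary):
    """Turn counts into questions for the operator, not verdicts."""
    flags = []
    if summary["ready"] > 0 and summary["running"] == 0:
        flags.append("ready cards unclaimed: check dispatcher")
    if sum(summary.values()) == summary["done"]:
        flags.append("all done or empty: confirm no safe work remains")
    return flags
```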

8. Find report-only prompts

Some scheduled prompts ask the agent to "check" or "report" but never require a concrete action. That produces polite wallpaper.

Check:

Does the prompt require one of: action taken, blocked with reason, or no safe action available with evidence?
Does it name output files and acceptance checks?
Does it forbid vague status reports?
Does it require the agent to create, update, verify, or route something?

Safe checks:

Read the prompt like a contract. If it can succeed by summarizing, it is too weak.
Add explicit action outcomes and durable artifact requirements.

Do not do:

Do not celebrate a well-written report if the intended board/resource/deploy action did not happen.
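
If the prompt requires a structured result, the contract can be checked by code instead of by mood. The sketch below uses the three outcome labels from the failure mode matrix (ACTION_TAKEN, BLOCKED, NO_SAFE_ACTION_AVAILABLE); the result keys `outcome`, `artifacts`, `reason`, and `evidence` are assumptions about how an agent might be asked to report.

```python
ALLOWED_OUTCOMES = {"ACTION_TAKEN", "BLOCKED", "NO_SAFE_ACTION_AVAILABLE"}


def validate_outcome(result: dict) -> list:
    """Return a list of contract violations; an empty list means the result passes."""
    problems = []
    outcome = result.get("outcome")
    if outcome not in ALLOWED_OUTCOMES:
        problems.append(f"outcome must be one of {sorted(ALLOWED_OUTCOMES)}")
    if outcome == "ACTION_TAKEN" and not result.get("artifacts"):
        problems.append("ACTION_TAKEN requires at least one durable artifact path")
    if outcome == "BLOCKED" and not result.get("reason"):
        problems.append("BLOCKED requires a reason")
    if outcome == "NO_SAFE_ACTION_AVAILABLE" and not result.get("evidence"):
        problems.append("NO_SAFE_ACTION_AVAILABLE requires evidence")
    return problems
```

A result that is only a narrative summary has no valid `outcome` and fails the first check, which is the intended behavior.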

9. Clear stuck scheduled cards without creating zombies

A stuck card may be correctly gated. Or it may be a ghost from a previous plan.

Check:

Why is it scheduled/parked: cadence, parent dependency, human gate, registration hold, or stale root task?
Does the card contain hard-gated operations such as account creation, credential wiring, payment, DNS, provider changes, gateway/service restarts, public posting, or client commitments?
Is there a safe child task that can proceed without those gated operations?

Safe checks:

Add a comment stating the reason for parking or archiving.
Split safe local prep from hard-gated external action.
Keep created-card IDs and dependencies explicit.

Do not do:

Do not restart services, change gateways, wire credentials, create accounts, accept TOS, or route around registration friction from a gateway conversation.
Do not leave a root task parked forever while safe child tasks are missing.
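
A crude first pass at spotting hard-gated operations in a card body is a keyword scan. This is only a sketch: the term list is a hypothetical starting point drawn from the operations named above, substring matching will produce false positives and misses, and a human still decides what is actually gated.

```python
# Hypothetical term list; extend or replace it for your own board.
HARD_GATE_TERMS = (
    "account creation",
    "credential",
    "payment",
    "dns",
    "provider change",
    "restart",
    "public post",
    "client commitment",
)


def hard_gates(card_text: str) -> list:
    """Return the hard-gate terms found in a card body (case-insensitive)."""
    text = card_text.lower()
    return [term for term in HARD_GATE_TERMS if term in text]


if __name__ == "__main__":
    print(hard_gates("Restart the gateway and wire credentials"))
    # prints ['credential', 'restart']
```

A non-empty result suggests splitting the card: safe local prep in a child, the gated action left parked with a comment.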

10. Require durable-output proof

The cure for scheduled-agent fog is a receipt.

Minimum receipt fields:

job or task label, sanitized
intended action
source inputs inspected
durable artifact paths or public URLs, if any
byte size and SHA-256 for local artifacts
exact verification checks and results
approval status
external side effects
spend
known gaps

Safe checks:

Read the file back from the durable path.
Validate JSON with a parser.
Scan public copy for private paths, tokens, account identifiers, commercial/service promises, affiliations, and unsupported vendor claims.
Record "no external side effects" and "$0.00 spend" explicitly when true.

Do not do:

Do not mark the task done with only "drafted" or "looks good."
Do not publish, post, deploy, or promote unless that is explicitly in scope and verified.
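
The minimum receipt fields map naturally onto a small JSON document. The sketch below is one possible shape, not a standard: `build_receipt` and its key names are assumptions. Side effects and spend are required arguments with no defaults so that "none" and "$0.00" are written down deliberately rather than assumed.

```python
import hashlib
import json
from pathlib import Path


def build_receipt(label, intended_action, inputs, artifact_paths, checks,
                  approval_status, external_side_effects, spend, known_gaps):
    """Build a receipt dict; byte size and SHA-256 are read back from disk."""
    artifacts = []
    for path in map(Path, artifact_paths):
        data = path.read_bytes()  # raises if the artifact is not really there
        artifacts.append({
            "path": str(path),
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {
        "label": label,                      # sanitized job or task label
        "intended_action": intended_action,
        "inputs": list(inputs),
        "artifacts": artifacts,
        "checks": list(checks),              # e.g. {"check": "...", "result": "pass"}
        "approval_status": approval_status,
        "external_side_effects": external_side_effects,
        "spend": spend,
        "known_gaps": list(known_gaps),
    }


def write_receipt(receipt, receipt_path):
    """Write the receipt, then parse it back to confirm it is valid JSON."""
    path = Path(receipt_path)
    path.write_text(json.dumps(receipt, indent=2), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))
```

Because the hash is computed from the durable path, a receipt cannot be produced for an artifact that only exists in the agent's description of it.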

Safe remediation patterns

Wrong cadence: update the existing job after documenting old/new cadence and why. Do not create a duplicate.
Wrong workdir/profile: update the job to pin profile and workdir, then run one smoke test and verify the artifact path.
Missing env/tool: add a preflight check that fails loudly with a sanitized error class.
Timeout: split into smaller tasks, add heartbeats, and save partial durable progress.
Silent notification: separate run success from delivery proof; add receipt file or delivery audit.
Overlap: add a lock, idempotency key, or parent dependency.
Report-only: rewrite the prompt around action outcomes and acceptance criteria.
Zombie card: archive only with a comment explaining supersession, or create a safe child and leave the hard-gated parent parked.
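
For the overlap pattern, the simplest lock is an exclusively created file. This is a minimal single-host sketch: `single_run_lock` is a hypothetical helper, it relies on `os.open` with `O_CREAT | O_EXCL` failing when the file already exists, and it does nothing about stale locks left behind by a crashed run, which still need an operator decision or a timestamp check.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_run_lock(lock_path):
    """Yield True if this process acquired the lock, False if another run holds it."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())  # record the holder for triage
        os.close(fd)
        yield True
    finally:
        os.remove(lock_path)  # released on normal exit and on exceptions
```

A scheduled job would wrap its body in `with single_run_lock(path) as acquired:` and exit with an explicit "skipped, sibling running" outcome when `acquired` is False, rather than doing the work twice.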

Operator rule worth taping to the monitor

Do not ask whether the goblin was busy. Ask what changed, where the proof lives, and what should not happen next.

That is how scheduled agents stop being haunted office decor and start being operations.

Failure mode matrix

Use this as the blunt operator table: symptom -> likely cause -> safe check -> do not do.

#	Symptom	Likely cause	Safe check	Do not do
1	Job exists but nothing happens	Cadence is wrong, job is paused, repeat limit expired, or next run is later than assumed	Inspect scheduler job list, last-run time, next-run time, pause/repeat state, and timezone	Do not create a duplicate schedule before proving the first one is dead
2	Job runs too rarely for launch mode	Old cadence survived after priorities changed	Compare intended operating mode with actual cadence and recent run timestamps	Do not spam every channel with status noise; change cadence with a reporting rule
3	Agent runs but writes into the wrong project	Missing or wrong workdir	Check job workdir, output path, project instructions loaded, and artifact location	Do not move scratch files into public delivery without recording recovery
4	Agent uses the wrong tone, memory, or authority	Wrong profile or profile contamination	Confirm the scheduled job profile and inspect completion metadata for profile/workdir	Do not edit another profile's memory, skills, cron, or secrets by accident
5	Interactive test works; scheduled run fails	Scheduled environment lacks PATH, browser session, mounted secret, or toolset	Run a sanitized preflight that checks command/tool availability and auth status without printing secrets	Do not paste tokens, exact secret paths, recovery details, or account notes into prompts/logs
6	Tool fails with missing command or unavailable skill	Worker runtime differs from planning runtime	Inspect error class and loaded skills/toolsets; retry without unavailable forced skills when safe	Do not keep dispatching the same broken forced-skill card
7	Run stops mid-task	Timeout ceiling, crash, OOM, browser hang, or provider timeout	Inspect outcome, start/end timestamps, heartbeat history, and partial artifact timestamps	Do not simply raise timeout forever; chunk work and add checkpoints
8	Run is still "running" but nobody knows progress	No heartbeat or long silent subprocess	Check latest heartbeat and process output; require progress notes for long bounded work	Do not leave bounded jobs silent in background mode without completion notification
9	No operator notification arrived	Delivery route failed, stdout was empty, or notification source not subscribed	Check run history and artifact separately from notification logs	Do not assume no notification means no work ran
10	Notification says success but artifact is missing	Report was delivered instead of proof; artifact stayed in scratch	Read back the durable file and verify bytes/hash	Do not mark complete from prose alone
11	Two similar tasks appear	Overlapping sibling scheduled runs or missing idempotency	Compare run windows, card creators, output roots, and idempotency keys	Do not merge duplicate outputs without identifying canonical source
12	Board is all-done but project is stalled	Scheduler rewarded task completion, not business outcome	Inspect ready/running/todo/blocked/scheduled counts and next safe work	Do not call an empty board healthy if obvious safe work exists
13	Ready cards do not get claimed	Dispatcher is down, board store is unhealthy, or assignee/profile is invalid	Inspect dispatcher liveness, board events, and store integrity/status output	Do not create more ready cards to hide dispatcher failure
14	Todo cards never promote	Parent dependency not done, wrong parent edge, or parent stuck in review/block	Inspect parent task IDs and actual parent statuses	Do not rely on prose dependencies; use explicit parent-child edges
15	Blocked card stays blocked after answer	Unblock/comment did not route or worker was not respawned	Inspect comment thread, recent events, and status transitions	Do not start a parallel replacement unless the original is intentionally superseded
16	Scheduled/parked card looks idle	It is intentionally gated on registration, credentials, payment, service restart, gateway change, or human decision	Read card body and comments for hard-gate reason	Do not unpark hard-gated work just to make metrics look alive
17	Prompt produces polished status reports	Prompt is report-only and lacks action contract	Read the scheduled prompt and check whether it requires ACTION_TAKEN, BLOCKED, or NO_SAFE_ACTION_AVAILABLE with evidence	Do not accept a narrative report as completion of an action task
18	Agent says it deployed, but URL/file says otherwise	Verification checked the wrong layer or stopped at package creation	Verify actual destination: durable local path, public URL, route smoke, or manifest	Do not promote a package-ready artifact as public-live
19	Public draft contains private internals	Raw logs, paths, IDs, account notes, or provider details were pasted into content	Scan public markdown for private paths, IDs, secrets, account identifiers, commercial promises, and affiliation claims	Do not publish raw operational logs; summarize with toy examples
20	Gateway/service restart instruction appears in a card	Unsafe root task mixed credential wiring with normal board work	Split safe local prep from hard-gated operational change; park or archive the unsafe root with a comment	Do not restart services, change gateways, wire credentials, accept TOS, or create accounts from a gateway conversation
21	Artifact exists but no one can find it later	Final output stayed in scheduler/worker scratch	Confirm deliverable lives in a durable project output root and completion metadata names it	Do not complete with only a scratch path
22	Logs show traffic, but meaning is unclear	Build/tool/bot activity polluted signal	Separate delivery proof from demand/lead/revenue evidence	Do not treat route logs as market validation without source-quality checks
23	A task was completed after a crash left partial files	Worker recovered artifact without independent verification	Read recovered files, regenerate manifests/checks, and route to verifier if needed	Do not pretend a crash-recovery path was a clean first-pass success
24	Scheduled job recursively creates more jobs/tasks	Prompt lacks scope and stop conditions	Inspect created-card ledger, idempotency keys, and schedule creation permissions	Do not let cron jobs schedule more cron jobs or create unbounded board loops

Minimum safe closeout for any scheduled-agent incident

Name the symptom and likely cause.
Identify the canonical run/task/artifact.
Record whether notification proof is proven, failed, or unproven.
Record durable artifact bytes and SHA-256 when local files are produced.
Record external side effects exactly, including "none" when none happened.
Record spend exactly, including $0.00 when true.
Record what not to do next so the recovery does not create a bigger incident.

Public-safety note: this static staged page does not perform account, credential, payment, outreach, deployment, provider, gateway, DNS, service, or spend actions. Examples are fictional or generic placeholders.

A cron row is not operational proof.
