Audience: operators evaluating whether their agent experiment is failing uniquely or hitting normal failure modes — builders who deployed something, watched it quietly go wrong, and cannot tell if the problem is their setup, the platform, or the whole premise.
Promise: a blunt taxonomy of the nine failure modes that show up first and most often in practical agents — not the theoretical risks from whitepapers, but the boring operational ones that eat your Tuesday. Each failure mode comes with one constructive check or mitigation. No fearmongering. No crisis theatre.
The blunt premise
Most practical agents do not fail because the AI is too dumb. They fail because something boring broke underneath the model and nobody was watching the right thing.
The model is usually the last thing to break. The first thing to break is almost always one of nine operational failures that look different but feel the same: your agent was working, then it stopped, and you cannot point at the exact moment it went wrong.
This resource names those nine failures, explains what they look like in practice, and gives you one constructive check for each. If your agent is broken, start here before rewriting prompts, switching models, or adding more tools.
The nine failure modes
1. Context loss: the agent forgot what it was doing mid-conversation
What it looks like: The agent starts a multi-step task and loses track of earlier steps. It repeats instructions, contradicts itself, or asks for information it already received. The longer the conversation, the worse it gets.
Why it happens: Models have finite context windows. When the conversation exceeds the window, earlier content gets truncated, summarized, or dropped. Tool call results, long file contents, and verbose error logs accelerate the problem because they consume context budget faster than actual conversation.
The constructive check: Before each major task, confirm your agent has a context budget. Know the window size. Estimate how many tool calls and file reads a task will consume. If a task needs more context than the window holds, break it into smaller steps with explicit handoffs between them.
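A minimal sketch of that budget check, under loud assumptions: the four-characters-per-token ratio is a rough heuristic rather than any model's real tokenizer, and the names `estimate_tokens` and `fits_budget` are hypothetical.

```python
# Hypothetical helper names; the ratio below is an assumption, not a tokenizer.
CHARS_PER_TOKEN = 4  # rough heuristic; real token counts vary by model and language


def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_budget(window_tokens: int, planned_inputs: list, reserve_tokens: int = 0) -> bool:
    """True if the planned inputs plus a reserve for replies fit in the window."""
    needed = sum(estimate_tokens(text) for text in planned_inputs) + reserve_tokens
    return needed <= window_tokens
```

If `fits_budget` returns False for the files and tool outputs a task is expected to read, that is the signal to split the task and write an explicit handoff note between the parts.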
2. Memory drift: the agent remembered wrong
What it looks like: The agent acts on stale, contradictory, or hallucinated "facts" from earlier sessions. It may confidently state a user preference that was never set, reference a file that does not exist, or apply instructions from a different project.
Why it happens: Durable memory stores (profile instructions, cross-session notes, user preferences) accumulate entries over time. Old entries become stale. New entries contradict old ones. The agent loads everything and has no way to tell which memory is current without explicit timestamps, trust scores, or retirement rules.
The constructive check: Audit your agent's durable memory store monthly. Remove entries older than the project's useful life. Flag contradictions. Separate stable facts (user name, project structure) from volatile facts (today's task state, temporary approvals). If your agent cannot tell the difference, it will eventually act on the wrong one.
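One way to sketch that audit, assuming memory entries can be exported as key/value pairs with a recording date. The `MemoryEntry` shape, the `volatile` flag, and the age limits are all illustrative assumptions, not the format of any particular memory store.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MemoryEntry:  # hypothetical export format for one durable memory entry
    key: str
    value: str
    recorded_on: date
    volatile: bool = False  # True for task state, temporary approvals, etc.


def audit(entries, today, max_age_days=90, volatile_max_age_days=1):
    """Return (stale_keys, contradictory_keys), each sorted and de-duplicated."""
    stale, contradictions, first_value = set(), set(), {}
    for entry in entries:
        limit = volatile_max_age_days if entry.volatile else max_age_days
        if today - entry.recorded_on > timedelta(days=limit):
            stale.add(entry.key)
        # Two entries with the same key but different values contradict each other.
        if entry.key in first_value and first_value[entry.key] != entry.value:
            contradictions.add(entry.key)
        first_value.setdefault(entry.key, entry.value)
    return sorted(stale), sorted(contradictions)
```

The output is a review list for a human, not an automatic delete: a stale stable fact may still be true, and a contradiction needs someone to decide which side is current.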
3. Provider and API drift: the tools changed without you noticing
What it looks like: A workflow that worked last week produces different output, errors, or silent failures today. Model responses shift in tone or capability. An endpoint returns a new response shape. A billed feature stops working or starts costing more.
Why it happens: AI providers update models, change default parameters, deprecate endpoints, adjust rate limits, and change cost structures — sometimes without prominent changelog entries. Routing layers add another variable: the model behind a router alias may change without the alias name changing.
The constructive check: For every paid provider and routed endpoint, record the official changelog URL and the date you last checked it. Run a cheap smoke test before production runs: not a full workflow, but one call that confirms the endpoint is reachable, authenticated, and returning the expected response shape. Log the provider version or model identifier in your proof logs so you can correlate output drift with provider changes.
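The response-shape part of such a smoke test might look like the sketch below. It assumes the response has already been fetched and parsed into a Python dict; the function name `check_response_shape` and the keys used in the usage note are hypothetical and do not describe any specific provider's API.

```python
def check_response_shape(payload, expected):
    """Return a list of problems; an empty list means the shape matches.

    `expected` maps each required key to the Python type its value should have.
    """
    if not isinstance(payload, dict):
        return [f"expected a JSON object, got {type(payload).__name__}"]
    problems = []
    for key, expected_type in expected.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"wrong type for {key}: {type(payload[key]).__name__}")
    return problems
```

For example, `check_response_shape(parsed, {"model": str, "choices": list})` would flag a renamed or retyped field before a full workflow runs. Logging the returned `model` value alongside the result gives you the identifier to correlate against later drift.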
4. Gateway and channel failures: the agent is alive but nobody can reach it
What it looks like: The agent process is running. The terminal shows activity. But messages to the bot are unanswered. Notifications never arrive. The agent writes to a log that nobody reads.
Why it happens: Gateway connections (bot tokens, webhook URLs, messenger bridge sessions) expire, get revoked, lose permissions, or hit rate limits silently. Platform updates can invalidate tokens. Bot accounts get flagged. Network issues drop connections without clean error messages.
The constructive check: Build a heartbeat test that is separate from the agent's main loop. Send a synthetic test message through each channel at least once per day. If the test message does not return a confirmation within your expected window, alert before the next real message is lost. Log connection state changes, not just errors.
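A sketch of the alerting half of that heartbeat, assuming something else sends the synthetic messages and records when each confirmation came back. The function name, the channel names, and the 24-hour default are illustrative assumptions.

```python
from datetime import datetime, timedelta


def overdue_channels(last_confirmed, now, max_silence=timedelta(hours=24)):
    """Return the channels whose last confirmed test message is too old.

    `last_confirmed` maps a channel name to the datetime of its last confirmed
    synthetic message, or None if no test message was ever confirmed.
    """
    return sorted(
        name
        for name, seen in last_confirmed.items()
        if seen is None or now - seen > max_silence
    )
```

Run this from a scheduler that is independent of the agent's main loop, so a hung agent cannot also silence its own alarm.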
5. Permissions: the agent can do less than it thinks, or more than it should
What it looks like: The agent tries to read a file, call an API, or write to a directory and fails — or worse, it succeeds at something it should not have access to. Failures show up as vague "permission denied" errors or as silent data written to the wrong location.
Why it happens: Agent processes inherit the permissions of the user or service account that started them. Secret files with loose permissions are readable by other processes. API keys may have broader scopes than the workflow requires. New team members or deployment changes alter filesystem or network access without updating the agent's expected boundaries.
The constructive check: Run a permissions audit before each deployment: which files can the agent read, which can it write, which API scopes does each key carry, and which network endpoints can it reach? Tighten secret file permissions to owner-only. Use scoped API keys where the provider supports it. Document the minimum permissions the agent needs and test that it cannot exceed them.
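The owner-only rule for secret files can be checked with the standard library. This sketch assumes a POSIX filesystem (permission bits behave differently on Windows), and the function names are hypothetical.

```python
import os
import stat


def is_owner_only(path):
    """True if group and others have no permission bits on the file (POSIX)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def tighten(path):
    """Restrict the file to owner read/write, the equivalent of chmod 600."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

A pre-deployment script could call `is_owner_only` on each secret file and fail the deploy on the first False, rather than silently fixing it, so that the loose permission gets noticed and explained.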
6. Browser and tooling gaps: the agent cannot see what a human would see
What it looks like: The agent writes HTML, generates a page, or builds a dashboard and reports success — but the page is blank, misaligned, or broken. The agent has no way to verify its own visual output.
Why it happens: Many agent environments lack browser rendering capabilities. An agent can write and validate HTML syntax but cannot see the rendered page. "The file exists and validates" is not the same as "the page looks correct." Without screenshot capability or a rendering smoke test, visual failures go undetected until a human visits the page.
The constructive check: If your agent produces visual output (web pages, dashboards, reports), require a rendering verification step: either a headless browser screenshot or a human review gate before declaring the artifact complete. "HTML exists and passes validation" is a necessary check, not a sufficient one. Add a non-blank pixel check or a layout comparison against a known-good baseline.
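The non-blank pixel check can be sketched without committing to a screenshot tool. The code below assumes a headless browser has already captured the page and that the image has been decoded into rows of (r, g, b) tuples; the function names, the white background, and the 1% threshold are all assumptions to tune.

```python
def non_blank_ratio(pixels, background=(255, 255, 255), tolerance=0):
    """Fraction of pixels that differ from the background colour.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    total = differing = 0
    for row in pixels:
        for pixel in row:
            total += 1
            if any(abs(c - b) > tolerance for c, b in zip(pixel, background)):
                differing += 1
    return differing / total if total else 0.0


def looks_rendered(pixels, min_ratio=0.01):
    """True if at least `min_ratio` of the pixels are not background."""
    return non_blank_ratio(pixels) >= min_ratio
```

This only catches fully blank pages. Misalignment and broken layout still need the baseline comparison or the human review gate.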
7. Verification gaps: the agent said it worked, but did it?
What it looks like: The agent reports success — file written, task completed, deployment done — but no one can verify the claim without re-doing the work. Proof consists of the agent's own summary with no independent evidence.
Why it happens: Agents can hallucinate successful outcomes. A file write may fail silently. A deployment may succeed partially. An API call may return a 200 status but with unexpected content. Without external verification steps, the agent's self-report is the only evidence — and self-reports are not evidence.
The constructive check: Require every claimed artifact to have at least one independent verification: file size and checksum read back from disk, a screenshot of the rendered page, a response body from a verification endpoint, or a diff against the expected state. Log the verification method, not just the claim. "Agent says done" plus "checksum matches" is proof. "Agent says done" alone is a story.
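A minimal sketch of the read-back check for file artifacts, using SHA-256 from the standard library. The function name and the shape of the returned record are assumptions; the point is that the values come from re-reading the disk, not from the agent's summary.

```python
import hashlib
import os


def verify_artifact(path, expected_sha256, expected_size=None):
    """Re-read the file from disk and compare it with the claimed values."""
    if not os.path.isfile(path):
        return {"ok": False, "method": "sha256+size", "reason": "file missing"}
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large artifacts do not have to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    size = os.path.getsize(path)
    size_ok = expected_size is None or size == expected_size
    ok = digest.hexdigest() == expected_sha256 and size_ok
    return {"ok": ok, "method": "sha256+size", "sha256": digest.hexdigest(), "size": size}
```

Writing the returned record into the proof log captures both the claim and the verification method in one place.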
8. Cost visibility: the agent is spending money you cannot see
What it looks like: You discover the size of the bill only at month's end. The agent ran expensive model calls for tasks that could have used cheaper ones. Failed runs consumed tokens before erroring. Retry loops burned budget on the same failing operation. Nobody was watching.
Why it happens: Token-based cost is invisible during execution. The agent does not see the cost of each call. Retry logic multiplies expensive failures. Provider dashboards show totals but not per-task or per-agent breakdowns. Without spend caps and per-run cost tracking, the first warning is the invoice.
The constructive check: Set a spend cap before the agent's first production run — not after the first surprise bill. Track model spend, media/provider spend, tool-call count, and failed-run waste as separate line items. Require the agent to log estimated cost per run in its proof output. Review the spend ledger weekly, not monthly. A kill switch should exist before the first ambitious stunt.
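A spend cap with separate line items could be sketched as below. The class names and category labels are hypothetical, costs are estimates supplied by the caller, and amounts are kept in integer cents to avoid floating-point rounding.

```python
class SpendCapExceeded(RuntimeError):
    """Raised when recording a cost would push total spend past the cap."""


class SpendLedger:
    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.items = {}  # category -> total cents, e.g. "model", "failed_runs"

    def total(self):
        return sum(self.items.values())

    def record(self, category, amount_cents):
        """Add a cost to a line item, refusing anything that would exceed the cap."""
        if self.total() + amount_cents > self.cap_cents:
            raise SpendCapExceeded(
                f"{category}: {amount_cents} cents would exceed cap of {self.cap_cents}"
            )
        self.items[category] = self.items.get(category, 0) + amount_cents
```

Calling `record` with the estimated cost before each expensive call turns the cap into a simple kill switch: the exception stops the run instead of the invoice reporting it later.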
9. Account and channel friction: the human infrastructure breaks before the software
What it looks like: The agent needs a bot token that requires portal access you do not have. A channel needs admin approval that is pending. An email account needs two-factor confirmation from a phone that is in a drawer. A social platform flags the account for unusual activity. The blocker is not technical — it is administrative.
Why it happens: Agent deployments depend on human-controlled accounts: platform developer portals, email providers, social media accounts, DNS records, payment processors. Each has its own approval flow, security review, and rate of change. The agent cannot resolve these blockers autonomously, and they often surface only when the workflow is already in motion.
The constructive check: Before deploying a workflow, list every external account and channel it touches. For each one, confirm: who owns the credentials, where the recovery path leads, whether two-factor is set up and accessible, and what the platform's review or flagging triggers are. Keep this registry current. The fastest way to stall an agent is to discover at deployment time that the bot token owner left the company or the recovery email points to a decommissioned address.
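The registry can be as simple as a list of records plus a function that reports which of the four questions are still unanswered. The field names and the `AccountEntry` shape below are illustrative assumptions, and the sample account name in the usage note is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class AccountEntry:  # hypothetical registry record for one external account
    name: str
    credential_owner: str = ""
    recovery_path: str = ""
    two_factor_accessible: bool = False
    review_triggers_known: bool = False


def registry_gaps(entries):
    """Map each account name to the questions that are still unanswered."""
    gaps = {}
    for entry in entries:
        missing = []
        if not entry.credential_owner:
            missing.append("credential owner")
        if not entry.recovery_path:
            missing.append("recovery path")
        if not entry.two_factor_accessible:
            missing.append("two-factor access")
        if not entry.review_triggers_known:
            missing.append("review or flagging triggers")
        if missing:
            gaps[entry.name] = missing
    return gaps
```

For example, an entry such as `AccountEntry("example-bot-portal", credential_owner="ops")` would be reported with three open questions. An empty result from `registry_gaps` is a reasonable deployment gate.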
How to use this taxonomy
When something breaks, resist the urge to blame the model first. Work through the nine failure modes in order:
- Context loss — Did the agent run out of window?
- Memory drift — Is it acting on stale durable facts?
- Provider/API drift — Did an external service change?
- Gateway/channel — Is the agent alive but unreachable?
- Permissions — Can it access what it needs (and only what it needs)?
- Browser/tooling — Can it verify its own visual output?
- Verification — Is there independent evidence the work was done?
- Cost visibility — Is spending visible and capped?
- Account/channel friction — Is the blocker administrative, not technical?
Most of the time, the answer is in the first four. The last five are where chronic problems hide.
What this resource does not cover
This taxonomy focuses on operational failure modes — the boring, recurring, infrastructure-adjacent problems that eat practical agents from the inside. It does not cover:
- Prompt engineering failures — prompt rot, instruction following degradation, jailbreaks. Those are model-level problems with their own literature.
- Ethical and alignment failures — bias, hallucination of harmful content, deceptive behavior. Those require different frameworks.
- Scaling failures — what breaks when you move from one agent to fifty. That is an architecture problem.
- Business model failures — whether the agent's work is actually valuable or monetizable. That is a GTM problem.
Each of those deserves its own resource. This one is for the operational layer: the pipes, wires, and permissions that hold a practical agent together before anyone evaluates whether the output is good.
Companion resources
This taxonomy pairs with several other resources in this series:
- Memory and Skills Hygiene Checklist — deeper treatment of failure mode 2 (memory drift) with a retirement protocol for stale entries.
- Provider Account/Key Preflight Checklist — failure mode 3 (provider/API drift) with a pre-run smoke test template.
- Gateway Token Soup Survival Guide — failure mode 4 (gateway/channel) with a conceptual architecture diagram and safe test message protocol.
- Cron and Scheduled Agent Failure Modes — failure modes 4 and 8 (channel + cost) with a scheduling-specific diagnostic matrix.
- Proof Vault: Minimum Receipt Resource — failure mode 7 (verification) with a reusable proof log template.
- Agent Spend Ledger and Runway Tracker — failure mode 8 (cost visibility) with a weekly runway report format.
- Browser Agent Safety and Permission Checklist — failure modes 5 and 6 (permissions + browser) with a stoplight permission table.
Source and evidence notes
This resource draws on generalized operational experience from building and running practical agents. All examples are sanitized and use placeholder identifiers. No private paths, credentials, internal hostnames, account details, or raw logs are included.
Last updated: 2026-06-27