Sample deliverable

Agent Failure Forensics

Generated 2026-05-08 06:22 UTC as a representative artefact of what the sprint produces. Buyers see the shape of the output before committing.

What this artefact demonstrates

Confidence: high. A finished Agent Failure Forensics engagement produces a hard, evidence-backed explanation of why an AI agent failed, where the failure entered the system, how it escaped detection, and what should change so the same class of failure is less likely to recur. It is not a generic evaluation report, a model leaderboard, or a prompt rewrite with a nicer title. It is a reconstruction of a specific failure path, tied to prompts, logs, tool calls, retrieved context, state transitions, validators, and missing guardrails.

The artefact is built for teams that already have agents doing real work: support triage, sales operations, internal research, workflow automation, code review, finance operations, claims handling, security investigation, or back-office exception processing. These systems often fail in ways that look harmless from the transcript alone. The final response may be fluent. The task status may say completed. The agent may even have called a tool successfully. None of that proves the business outcome was correct, safe, or auditable.

A useful forensics report separates symptom from mechanism. The symptom might be a wrong refund, an invented citation, a missed escalation, a duplicate invoice, an unsafe shell command, or a ticket closed without the requested action being completed. The mechanism is more precise: stale memory treated as current truth, retrieval that ranked the wrong policy first, a tool schema mismatch, a retry loop that erased evidence, a router that selected the wrong capability tier, or a validator that rewarded polished prose instead of verified completion.

The finished deliverable gives the buyer four concrete outputs. First, it provides an incident timeline showing what the agent knew, what it assumed, what tools it called, and what decision it made at each fork. Second, it classifies each root cause as prompt-level, retrieval-level, tool-level, policy-level, orchestration-level, evaluation-level, or observability-level. Third, it ranks remediations by risk reduction per engineering hour. Fourth, it supplies regression tests or evaluation cases that encode the failure so it cannot quietly reappear after a prompt, model, or workflow change.
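As a sketch of the second output, assuming a hypothetical record shape rather than a fixed schema, each root cause might be captured like this:

from dataclasses import dataclass, field

ROOT_CAUSE_LAYERS = {
    "prompt", "retrieval", "tool", "policy",
    "orchestration", "evaluation", "observability",
}

@dataclass
class Finding:
    summary: str
    layer: str                                   # one of ROOT_CAUSE_LAYERS
    severity: str                                # "high", "medium", "low"
    evidence_refs: list[str] = field(default_factory=list)  # run ids, event ids
    remediation: str = ""
    risk_reduction_per_hour: float = 0.0         # used to rank the patch list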

The central discipline is refusing to treat the model as the only suspect. Many agent failures are not simply “the model hallucinated.” The model may have followed bad instructions, consumed stale state, accepted a malformed tool response, or been judged by a completion check that measured format rather than outcome. Blaming the model is often emotionally satisfying and operationally useless. Forensics turns the incident into a small set of testable claims.

A finished artefact also includes an evidence ledger. The ledger lists every source used in the reconstruction: run identifiers, timestamps, prompt versions, model versions, tool request payloads, tool response payloads, retrieval chunks, memory entries, queue metadata, validator outputs, and handoff events if present. If evidence is missing, the report says so directly. Missing evidence is not a footnote; it is a finding because it defines the next observability investment.
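A minimal ledger entry, with hypothetical field names, might look like:

from dataclasses import dataclass

@dataclass
class EvidenceEntry:
    kind: str        # "tool_response", "retrieval_chunk", "validator_output", ...
    run_id: str
    timestamp: str   # ISO 8601
    reference: str   # payload hash, log offset, or document id
    present: bool    # False records missing evidence as its own finding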

The report is meant for engineering review, not ceremonial presentation. A team should be able to take it and decide what to patch this week, what to monitor next month, and what to stop pretending is safe. It uses plain language, small code snippets, explicit severity, and direct separation between confirmed facts and plausible inferences. It starts with rollback-safe controls before recommending broader autonomy.

Concrete sample contents

Confidence: moderate-high. This sample reflects a realistic buyer case: an AI support agent handles subscription billing tickets. The agent can read customer records, call payment tools, search policy documents, and update tickets. The reported incident is simple: a customer asked to cancel before renewal. The agent issued a partial refund, closed the ticket, and sent a message saying the subscription was cancelled. The subscription was not cancelled. Three days later the customer was charged again, complained publicly, and the support team spent half a day reconstructing the run.

Incident timeline

09:12:04 Ticket SUP-18422 arrived with the subject "Cancel before renewal." The customer wrote that renewal was scheduled for Friday and asked for cancellation before anything else was billed. The ticket contained two identifiers: a current workspace email and an older billing email. The agent classified the ticket as billing_refund rather than subscription_cancel. That was the first failure point.

09:12:19 Retrieval returned three policy chunks. The top chunk covered refund eligibility after accidental renewal. The cancellation policy was present but ranked second. The agent summarized the situation as “customer is likely eligible for a prorated refund” even though no renewal charge had occurred. The retrieval layer did not require the planner to preserve the customer’s primary intent before using adjacent policy text.

09:12:41 The agent called get_customer_by_email with the workspace email. The response returned subscription_status: active, renewal_at: 2026-05-08T00:00:00Z, and billing_email: old-account@example.test. The agent did not call get_subscriptions with the billing email, even though the tool documentation said subscriptions are keyed by billing profile. The tool response contained enough warning signs to stop the workflow, but identity resolution was optional.
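Reconstructed from the trace, the response the agent already had contained the stop signal. The lookup shape below mirrors the tools named above; ticket.workspace_email is a hypothetical accessor:

customer = get_customer_by_email(ticket.workspace_email)
# Response fields as logged:
#   subscription_status: "active"
#   renewal_at:          "2026-05-08T00:00:00Z"
#   billing_email:       "old-account@example.test"  # not the lookup email
# The tool docs key subscriptions by billing profile, so the required
# follow-up was get_subscriptions(customer.billing_email). It never ran.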

09:13:02 The agent called issue_refund with reason customer_requested_cancellation. The payment processor accepted the refund because a prior add-on charge existed. This produced a misleading success signal. The agent interpreted a successful adjacent action as proof that the requested cancellation had been completed. No cancellation tool was called.
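In trace terms, using the tool.success event convention that appears in the recommendations below, the run looked like this:

observed_events = {
    "get_customer_by_email.success",
    "issue_refund.success",          # the adjacent action that succeeded
}
# "cancel_subscription.success" is absent: the requested state change
# never happened, despite the agent's later claim that it had.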

09:13:18 The agent posted a reply: “Your subscription has been cancelled and a refund has been issued.” The refund claim was supported. The cancellation claim was false. The reply validator checked tone, forbidden phrases, and whether the response contained a resolution sentence. It did not check whether each state-changing claim was grounded in a successful tool call or fresh authoritative state read. The ticket closed automatically.
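A caricature of the validator that passed this reply, assuming a hypothetical forbidden-phrase list:

FORBIDDEN_PHRASES = ("no refunds", "cannot help")  # hypothetical list

def validate_reply(reply: str) -> bool:
    text = reply.lower()
    has_resolution_sentence = "has been" in text                  # format check
    acceptable_tone = not any(p in text for p in FORBIDDEN_PHRASES)
    return has_resolution_sentence and acceptable_tone
    # Nothing here asks whether "cancelled" maps to a successful
    # cancel_subscription call, so the false claim sailed through.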

Primary findings

Finding one, orchestration-level: the intent classifier mapped a cancellation request to billing_refund, and nothing downstream re-checked the customer's primary objective. Finding two, retrieval-level: the refund-eligibility policy outranked the cancellation policy, and the planner adopted the frame of the top chunk even though no renewal charge existed. Finding three, tool-level: identity resolution across the workspace and billing emails was optional, so the agent mutated billing state without confirming which subscription it was acting on. Finding four, orchestration-level: a successful adjacent action, the refund, was read as proof that the requested action, the cancellation, had completed. Finding five, evaluation-level: the reply validator checked tone and format, not whether state-changing claims were grounded in tool evidence. Findings three through five are high severity because each one independently allowed a false completion claim to reach the customer.

Code-level recommendations

The first recommendation is to add a state-change claim checker before any customer-facing response is posted. The checker only needs to catch high-risk business-state claims and require evidence from the same run trace.

required_evidence = {
    "cancelled": "cancel_subscription.success",
    "refunded": "issue_refund.success",
    "updated": "update_subscription.success",
}

for claim, evidence in required_evidence.items():
    if claim in reply.lower() and evidence not in trace.events:
        block_reply(claim)  # refuse to post an unsupported completion claim

This would have blocked the false cancellation email. A production version should use structured action claims emitted by the agent rather than raw substring checks, but the control principle is the same: no unsupported completion claims.
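A sketch of that structured version, assuming the agent emits a draft object that carries typed claims alongside the reply text (all names here are hypothetical):

from dataclasses import dataclass

@dataclass
class ActionClaim:
    action: str      # e.g. "cancel_subscription"
    object_id: str   # the subscription or payment the claim is about

def grounded(claim: ActionClaim, trace_events: set[str]) -> bool:
    # A claim is grounded only if the matching tool succeeded in this run.
    return f"{claim.action}.success" in trace_events

ungrounded = [c for c in draft.claims if not grounded(c, trace.events)]
if ungrounded:
    block_reply(ungrounded)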

The second recommendation is to require identity resolution before billing mutations. The guard should live outside the model so it cannot be bypassed by a persuasive prompt path.

if action in BILLING_MUTATIONS and customer.billing_email != subscription.billing_email:
    raise NeedsIdentityResolution(ticket_id)

The third recommendation is to split validation into format validation and state validation. Format validation asks whether the message is acceptable prose. State validation asks whether the requested business operation actually happened.

objective = extract_objective(ticket)

if objective == "cancel_subscription": assert fresh_subscription.status in {"cancelled", "cancel_at_period_end"}

if objective == "refund": assert refund.status == "succeeded"

The fourth recommendation is to add a regression fixture with mixed cancellation and refund language, mismatched workspace and billing emails, an existing prior charge, and renewal within seven days. The expected result is not “send a nice email.” The expected result is either confirmed cancellation or escalation with no false completion claim.

expected_outcome = {
    "ticket_closed": False,
    "required_path": "identity_resolution_or_escalation",
    "blocked_claim": "cancelled_without_evidence",
}
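A fuller fixture, with hypothetical field names and a paraphrased ticket, would pin down the inputs as well as the expected outcome:

regression_fixture = {
    "ticket_text": "Please cancel before Friday's renewal; refund anything that bills.",
    "workspace_email": "person@workspace.example.test",   # hypothetical
    "billing_email": "old-account@example.test",
    "prior_charge_exists": True,
    "renewal_within_days": 7,
    "expected": expected_outcome,
}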

Remediation plan

Ranked by risk reduction per engineering hour: first, ship the state-change claim checker and the identity-resolution gate, since both are small, rollback-safe controls that live outside the model. Second, split the reply validator into format validation and state validation so completion is judged against fresh authoritative state. Third, land the regression fixture in the evaluation suite so this failure class cannot reappear silently after a prompt, model, or workflow change. Fourth, close the observability gaps recorded in the evidence ledger so the next investigation starts from a joined trace instead of manual reconstruction.

How this sprint generates buyer ROI

Confidence: moderate. The ROI comes from preventing repeat failures, shortening investigations, and avoiding broad unfocused rewrites. The numbers below are plausible for a small-to-midmarket software company with an AI support agent handling 8,000 to 25,000 tickets per month. Exact values depend on ticket volume, account value, agent authority, and existing observability.

Start with investigation time. Without a joined forensic trace, a support lead, an engineer, and an operations manager may each spend time reconstructing one serious incident. A normal investigation can consume 6 to 12 staff-hours: reading ticket history, comparing payment records, checking logs, drafting the customer response, and debating whether the model, prompt, tool, or policy caused the issue. If future incidents drop from 8 hours to 3 hours and the company sees 4 meaningful agent incidents per month, that is 20 hours saved monthly.

At a blended loaded cost of 90 to 150 dollars per hour for support leadership and engineering time, 20 hours saved is 1,800 to 3,000 dollars per month. That is only the low end. The larger value is avoided customer harm and avoided engineering churn.

Consider the billing example. If the agent mishandles only 0.2 percent of 10,000 monthly support tickets, that is 20 bad outcomes per month. If one quarter of those involve billing state, that is 5 high-friction incidents. Each may create a refund, escalation, dispute risk, or churn threat. If the average affected account is worth 2,400 dollars in annual recurring revenue and one avoidable incident per month causes churn, the annual revenue at risk is 28,800 dollars. If better guardrails prevent two such churn events per quarter, the protected annual revenue is roughly 19,200 dollars.

Refund and dispute leakage add another category. Suppose bad agent decisions create 15 unnecessary refunds per month at an average of 45 dollars. That is 675 dollars in monthly leakage, or 8,100 dollars annually. If chargeback fees and handling add 35 to 75 dollars per disputed payment, a few preventable disputes create further drag. Cutting this leakage by 40 percent saves roughly 3,200 dollars annually in refunds alone; with avoided dispute fees, the combined figure lands around 3,000 to 6,000 dollars in this narrow category.

The most underestimated ROI category is engineering focus. Teams often respond to agent failures with unfocused rewrites: new prompts, new routing, model comparisons, dashboards, and a vague plan to “add evals.” A forensics sprint replaces that sprawl with a ranked patch list. In the sample case, the correct first fixes are claim grounding, identity gates, state validation, and regression fixtures. That can be 30 to 45 hours of focused work. The unfocused alternative can become 100 to 200 hours of meetings, experiments, and prompt churn. Avoiding 80 hours of misdirected engineering at 120 to 200 dollars per loaded hour preserves 9,600 to 16,000 dollars of capacity.

A practical ROI model for this sprint is: monthly investigation hours saved at blended loaded cost, plus protected annual recurring revenue from prevented churn, plus reduced refund and dispute leakage, plus preserved engineering capacity from avoided unfocused rework, minus the sprint fee and the internal hours spent reviewing and landing the patches.
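As arithmetic, using midpoints of the ranges above (every number is a scenario assumption to be replaced with the buyer's own figures):

investigation_savings = 20 * 12 * 120   # hours/month * months * $/loaded hour = 28,800
protected_arr         = 8 * 2400        # churn events prevented per year * ARR = 19,200
leakage_savings       = 4500            # midpoint of the 3,000-6,000 range
preserved_capacity    = 80 * 160        # avoided hours * $/loaded hour = 12,800
sprint_cost           = 0               # replace with the actual fee plus review time

annual_value = (investigation_savings + protected_arr
                + leakage_savings + preserved_capacity) - sprint_cost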

The sprint is especially valuable after one serious incident when the buyer cannot prove whether the same failure class is still live. Every new run might be safe, or it might be repeating the same defect in a less visible form. This deliverable collapses that uncertainty into concrete tests, gates, and monitoring.

The buyer should expect a small number of high-confidence fixes, not a grand theory of agent quality. A good result says: here is the failure path, here are the controls that would have stopped it, here is the regression fixture, here is the metric that will show whether the patch works, and here is what remains unknown because the current system does not log it. That is the difference between agent theater and operational control.

See full sprint scope →