Generated 2026-05-05 09:04 UTC as a representative artefact of what the sprint produces. Buyers see the shape of the output before committing.
Confidence: high. This artefact demonstrates the finished shape of an Agent Failure Forensics engagement: a buyer receives an evidence-backed account of how an autonomous agent failed, why the failure was allowed to continue, which guardrail should have stopped it, and what to change next. The output is not a generic postmortem and not a vague essay about model risk. It is a technical failure record designed to be used by engineering, operations, and automation teams without translation.
The core product is a failure map. It separates symptoms from causes. A symptom might be an agent reopening completed work, retrying a blocked payment, sending a contradictory customer message, burning model budget on doomed retries, or claiming completion when the underlying action failed. A cause might be stale state, missing idempotency, weak task ownership, loose tool permissions, ambiguous prompts, or success reporting based on narrative confidence rather than verified state. The deliverable labels each failure with severity, recurrence probability, blast radius, confidence level, and the control point where prevention or detection should live.
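As an illustration of that labelling, a minimal record sketch; the field names and enum values here are assumptions for illustration, not the engagement's fixed schema:

from dataclasses import dataclass
from enum import Enum

class ControlPoint(Enum):
    # Illustrative control points where prevention or detection could live.
    STATE_RECONCILIATION = "state_reconciliation"
    IDEMPOTENCY_GUARD = "idempotency_guard"
    RETRY_CLASSIFIER = "retry_classifier"
    COMPLETION_VERIFIER = "completion_verifier"

@dataclass
class FailureLabel:
    symptom: str                   # observable behaviour, e.g. "retried a blocked payment"
    cause: str                     # underlying defect, e.g. "missing idempotency key"
    severity: str                  # critical / high / medium
    recurrence_probability: float  # estimated chance the failure repeats, 0.0 to 1.0
    blast_radius: str              # e.g. "one ticket", "billing queue", "all customers"
    confidence: str                # how strongly the evidence supports the label
    control_point: ControlPoint    # where prevention or detection should live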
The artefact also demonstrates trace discipline. Every conclusion is tied to concrete evidence: queue records, tool-call transcripts, state snapshots, exception traces, model-routing logs, browser-session metadata, evaluator output, and final user-facing messages. Where evidence is missing, the report says so. That absence is not treated as a footnote; it is often the most useful finding. Many agent systems do not fail because the model is mysterious. They fail because the runtime cannot reconstruct what happened after the fact.
A finished engagement produces three layers. The first is the executive failure narrative: what happened, what damage it caused or nearly caused, and why the current design made the failure plausible. The second is the technical fault tree: the exact decision points where the agent should have stopped, escalated, reconciled state, or refused to act. The third is the remediation backlog: small, ranked changes with acceptance criteria. Each recommendation is written as finding, evidence, impact, fix, verification, and residual_risk.
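Sketched as a record, with the field names taken directly from that list (the dataclass itself is an illustration):

from dataclasses import dataclass

@dataclass
class Recommendation:
    finding: str        # what is wrong, stated plainly
    evidence: str       # the trace, log, or snapshot that proves it
    impact: str         # damage caused or risked
    fix: str            # the smallest change that removes the fault class
    verification: str   # the test or check that proves the fix landed
    residual_risk: str  # what remains after the fix ships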
The report distinguishes agent failure, platform failure, and process failure. Agent failure means the reasoning loop selected or persisted in a bad action despite available evidence. Platform failure means the scheduler, state store, permissions layer, browser harness, or tool wrapper exposed the agent to misleading signals. Process failure means the surrounding workflow lacked clear acceptance criteria, escalation rules, or rollback paths. This distinction matters because the wrong fix wastes money. Rewriting prompts will not fix a duplicate-mutation path. Replacing a model will not fix a success detector that treats any non-empty output as completion.
The output is deliberately plain-spoken. If the agent ignored a stop condition, the report names the underspecified stop condition. If a wrapper returned success after partial failure, the wrapper is called out. If a prompt asked one loop to classify, execute, and notify in the same breath, the report identifies that contract as unsafe. If the available logs do not prove root cause, the report refuses to invent one. The result is a buyer-ready evidence pack: reproducible incident sequence, fault classification, recommended regression tests, safe rollout sequence, and a keep, quarantine, or remove decision for the risky behaviour.
The following sample is written as if produced for a buyer running a customer-support agent that triages refund requests, updates a billing system, and drafts customer replies. During a two-hour incident window, the agent reopened resolved refunds, sent contradictory internal notes, and consumed excessive model calls. The apparent failure was poor reasoning. The forensic finding is sharper: the agent operated on a stale queue snapshot, lacked an idempotency key for refund actions, and had no hard boundary between drafting a recommendation and executing a billing mutation.
Severity: high. Confidence: high. The agent processed ticket RF-18422 four times after the billing system had already marked the refund complete. The local queue showed status=pending because it was not refreshed after the external billing update. The agent trusted the local row over the authoritative provider response and treated the contradiction as a reason to retry. The first retry was understandable; the next three were a control-plane defect.
Representative trace:
09:41:12 ticket=RF-18422 local_status=pending external_refund_status=succeeded
09:41:18 action=retry_refund reason=pending_refund_unresolved
09:41:21 result=duplicate_refund_blocked provider_code=already_refunded
09:42:03 action=retry_refund reason=previous_attempt_maybe_transient
Recommended fix: introduce a reconciliation gate before every financial mutation. The gate compares local task state with external authoritative state and returns one of three outcomes: safe_to_execute, already_done_mark_local_complete, or conflict_escalate. The agent should not infer that decision from free-form tool output. The regression test is simple: when the provider returns already_refunded, the executor must perform zero additional refund attempts and must move the task out of pending_retry.
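A minimal sketch of that gate, assuming string status values like those in the trace above; the helper name and status vocabulary are hypothetical:

from enum import Enum

class GateOutcome(Enum):
    SAFE_TO_EXECUTE = "safe_to_execute"
    ALREADY_DONE_MARK_LOCAL_COMPLETE = "already_done_mark_local_complete"
    CONFLICT_ESCALATE = "conflict_escalate"

def reconciliation_gate(local_status: str, external_status: str) -> GateOutcome:
    # The executor receives one of three explicit outcomes; it never
    # infers this decision from free-form tool output.
    if external_status == "succeeded":
        # External truth wins: mark the local row complete, mutate nothing.
        return GateOutcome.ALREADY_DONE_MARK_LOCAL_COMPLETE
    if local_status == "pending" and external_status == "none":
        return GateOutcome.SAFE_TO_EXECUTE
    # Any other combination is a contradiction a human should see.
    return GateOutcome.CONFLICT_ESCALATE

# The RF-18422 case from the trace: local says pending, provider says succeeded.
assert reconciliation_gate("pending", "succeeded") is GateOutcome.ALREADY_DONE_MARK_LOCAL_COMPLETE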
Severity: critical. Confidence: moderate-high. The same plan step could draft a customer reply and execute a billing action. That is an unsafe contract. Drafting is reversible; refund execution is not. The trace shows a broad instruction: resolve refund and notify customer. Under that instruction, the agent called the refund tool, received a duplicate-refund response, then drafted a message saying the refund had just been issued. The message was false because the agent conflated two distinct facts: a refund existed, and this run had created it.
Recommended separation: split the loop into a drafting lane and an execution lane with a hard boundary between them. The drafting lane proposes actions and composes customer messages; only the execution lane may call the billing tool, and it accepts nothing but a structured request. A sketch of that boundary follows the schema below.
Concrete schema:
RefundExecutionRequest(ticket_id, customer_id, invoice_id, amount_cents, currency, idempotency_key, trace_id)
RefundExecutionResult(status, provider_refund_id, prior_refund_id, executed_at, authoritative_message)
If status=already_refunded, the customer reply should say the refund had already been processed and include the verified date if available. It must not say that a new refund was issued during the current run.
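A sketch of the result type and that reply rule; the dataclass mirrors the schema line above, and the drafting function is a hypothetical consumer of it:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RefundExecutionResult:
    status: str                 # e.g. "succeeded" or "already_refunded"
    provider_refund_id: Optional[str]
    prior_refund_id: Optional[str]
    executed_at: Optional[str]  # verified date, when the provider supplies one
    authoritative_message: str

def draft_refund_reply(result: RefundExecutionResult) -> str:
    if result.status == "already_refunded":
        # The refund exists, but this run did not create it. Say so.
        when = f" on {result.executed_at}" if result.executed_at else ""
        return f"Our records show this refund was already processed{when}."
    if result.status == "succeeded":
        return "Your refund has been issued."
    # Anything else escalates instead of reassuring the customer.
    raise ValueError(f"no customer reply for status {result.status}")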
Severity: high. Confidence: high. The agent retried after provider responses that were deterministic, not transient. The provider returned already_refunded, invoice_closed, and amount_mismatch. None should trigger automatic retry. The current retry wrapper appears to classify all non-success responses as recoverable until a maximum retry count is reached. That design spends money while increasing operational risk.
Recommended retry table:
timeout: retry with backoff up to two times, then escalate.
rate_limited: retry after provider delay, then escalate if still blocked.
already_refunded: do not retry; reconcile local state to complete.
amount_mismatch: do not retry; escalate with invoice and requested amount.
permission_denied: do not retry; quarantine credentials and alert.
The acceptance criterion is binary: feed the executor a mocked already_refunded response and assert refund_tool.call_count == 1. The final task state should be complete_from_external_truth or needs_reconciliation, never pending_retry.
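A sketch of that table as a deterministic classifier, with the acceptance criterion as an executable assertion; the provider codes come from the findings above, while the enum names are illustrative:

from enum import Enum

class RetryAction(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"  # bounded retries, then escalate
    RETRY_AFTER_DELAY = "retry_after_delay"    # honour the provider delay, then escalate
    RECONCILE_COMPLETE = "reconcile_complete"  # no retry; adopt external truth
    ESCALATE = "escalate"                      # no retry; human review
    QUARANTINE = "quarantine"                  # no retry; credentials suspect

RETRY_TABLE = {
    "timeout": RetryAction.RETRY_WITH_BACKOFF,
    "rate_limited": RetryAction.RETRY_AFTER_DELAY,
    "already_refunded": RetryAction.RECONCILE_COMPLETE,
    "amount_mismatch": RetryAction.ESCALATE,
    "permission_denied": RetryAction.QUARANTINE,
}

def classify_provider_response(code: str) -> RetryAction:
    # Unknown codes escalate; nothing defaults to "recoverable".
    return RETRY_TABLE.get(code, RetryAction.ESCALATE)

# already_refunded must never produce another attempt.
assert classify_provider_response("already_refunded") is RetryAction.RECONCILE_COMPLETE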
Severity: medium-high. Confidence: high. The final summary claimed that all affected refund tickets had been resolved. That claim was not supported. Two tickets had verified provider success. One had a duplicate-refund block that probably meant prior success. One had an amount mismatch. The correct summary was: two refunds verified complete, one requires reconciliation, one blocked by amount mismatch and needs manual review.
Recommended reporting schema: verified_complete_count, externally_complete_unreconciled_count, blocked_count, unknown_count, and last_verified_at. Completion summaries should be computed from structured task rows and provider confirmations, not from generated prose. Add a negative eval where the model receives mixed ticket states and must not produce blanket success language. The expected output must preserve uncertainty and name unresolved cases.
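A sketch of that summary computed from structured task rows rather than prose; the row field names are assumptions:

from collections import Counter
from typing import Optional

def completion_summary(task_rows: list[dict]) -> dict:
    # Counts come from verification states written by the executor,
    # never from the model's narrative.
    counts = Counter(row["verification_state"] for row in task_rows)
    last_verified: Optional[str] = max(
        (row["verified_at"] for row in task_rows if row.get("verified_at")),
        default=None,
    )
    return {
        "verified_complete_count": counts.get("verified_complete", 0),
        "externally_complete_unreconciled_count": counts.get("externally_complete_unreconciled", 0),
        "blocked_count": counts.get("blocked", 0),
        "unknown_count": counts.get("unknown", 0),
        "last_verified_at": last_verified,
    }

Fed the four incident tickets, this would return two verified, one unreconciled, one blocked, and zero unknown, matching the corrected summary above.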
Severity: medium. Confidence: high. The incident was reconstructable only by correlating queue rows, billing logs, model calls, and message drafts by timestamp. There was no single trace_id across the full lifecycle. Recommended telemetry fields are trace_id, source_state_version, authoritative_read_at, mutation_idempotency_key, and decision_basis. The sample remediation backlog is intentionally modest: add idempotency and deterministic retry classification, split draft and execution lanes, add structured completion reporting, replay the four incident tickets as fixtures, and expose conflicting_state_count plus deterministic_retry_attempts on the operations dashboard.
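A sketch of one structured log record carrying those telemetry fields; the values are illustrative, loosely keyed to the RF-18422 trace:

import json

telemetry = {
    "trace_id": "tr-rf18422-0001",         # one id across the full lifecycle (hypothetical value)
    "source_state_version": 4182,          # version of the local row the decision read
    "authoritative_read_at": "09:41:10",   # when external truth was last consulted
    "mutation_idempotency_key": "RF-18422:refund:1",
    "decision_basis": "local_row",         # what the agent actually trusted
}
print(json.dumps(telemetry))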
Confidence: moderate. The ROI comes from reducing repeated investigation, preventing avoidable operational damage, and giving the buyer a remediation backlog small enough to ship. The sprint is not justified by claims that agents become generally smarter. It is justified when it converts recurring ambiguity into known fault classes with tests and control points.
For a realistic mid-market support operation, assume 40,000 monthly tickets, 8 percent involving billing or refunds, and an agent handling first-pass triage for half of those. That produces about 1,600 billing-adjacent agent-handled tickets per month. If 1.5 percent hit conflicting state, ambiguous execution, or retry defects, the buyer faces 24 problematic cases monthly. A single mishandled refund can trigger support rework, finance reconciliation, customer escalation, and engineering investigation.
The direct time savings are straightforward. Without a forensic trace structure, one incident review can consume 12 to 20 staff-hours across support, engineering, operations, and finance. If the same failure class recurs twice per month, the buyer burns 24 to 40 hours just reconstructing what happened. A one-week forensics sprint can reduce that by creating a reusable trace map, decision table, and regression suite. A conservative reduction of 15 hours per month at 90 dollars per blended internal hour is 1,350 dollars monthly, or 16,200 dollars yearly, before customer impact.
The risk reduction is larger. In the sample refund case, the highest-risk defect is a mutation path that permits repeated attempts after deterministic provider responses. Suppose the agent touches 1,600 billing-adjacent tickets monthly and 0.25 percent could lead to an incorrect refund, duplicate adjustment, or unreconciled credit. That is four risky cases per month. If average financial exposure is 250 dollars, direct exposure is 1,000 dollars monthly. If one in four becomes an escalated customer issue requiring retention credit, management handling, or payment-provider dispute work, fully loaded exposure can reach 2,000 to 4,000 dollars monthly.
The sprint reduces that exposure with boring controls: idempotency keys, state reconciliation, deterministic retry stops, and structured success reporting. If those controls eliminate 70 percent of risky cases, the buyer protects roughly 1,400 to 2,800 dollars per month in direct and indirect leakage. That estimate intentionally excludes brand damage and long-tail churn because those numbers are easy to exaggerate. The stronger argument does not need them.
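The exposure arithmetic from the two paragraphs above, written out as a check; all rates are the stated assumptions:

monthly_tickets = 40_000
billing_adjacent = monthly_tickets * 0.08 * 0.50   # 1,600 agent-handled per month

risky_cases = billing_adjacent * 0.0025            # 4 risky cases per month
direct_exposure = risky_cases * 250                # 1,000 dollars per month

# One in four escalates, so fully loaded exposure runs 2,000 to 4,000 dollars.
fully_loaded = (2_000, 4_000)
protected = tuple(0.70 * x for x in fully_loaded)  # (1,400, 2,800) dollars per month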
Engineering ROI also matters. Agent incidents often create noisy backlogs because every failure sounds unique: hallucination one week, tool misuse the next, model-vendor weakness after that. The forensic process compresses incidents into a smaller taxonomy: stale state, missing authority boundary, unsafe retry, unverifiable completion, weak escalation, poor prompt contract, and inadequate telemetry. That taxonomy prevents random fixes. If a buyer avoids one two-week exploratory refactor by shipping a two-day executor patch, preserved engineering capacity can be 40 to 60 hours. At 125 dollars per hour fully loaded, that is 5,000 to 7,500 dollars preserved.
Revenue protection becomes visible when the agent sits near onboarding, payments, renewals, or high-value support queues. Consider a sales-operations agent that mishandles account-enrichment tasks and causes 3 percent of qualified leads to route late. With 500 qualified leads per month, that affects 15 leads. If only two would have converted and average gross margin per closed deal is 1,800 dollars, monthly margin at risk is 3,600 dollars. A forensics sprint that identifies the failure as stale CRM reads rather than model quality can justify itself quickly because the fix is targeted: refresh authority, attach state versions, and block action on stale reads.
A reasonable ROI model for the sample engagement is:
1,350 dollars per month in recovered investigation time.
1,400 to 2,800 dollars per month in avoided refund leakage.
5,000 to 7,500 dollars in preserved engineering capacity, spread across the following quarter.
Revenue protection on the scale of the lead-routing example, treated as upside rather than baseline.
On those assumptions, monthly economic value lands around 4,400 to 8,000 dollars, with upside when the agent is close to revenue or regulated operations. The more important operational value is faster truth. Instead of debating whether the agent is generally unreliable, the buyer gets a precise statement: deterministic provider failures were retried because the wrapper collapsed provider codes into generic failure, and completion reporting trusted generated summaries over structured state. That sentence is worth money because it points directly to the fix.
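One way the stated figures combine into that range; spreading the preserved engineering capacity across a quarter is an assumption of this sketch, not a claim from the engagement:

time_savings = 1_350                  # recovered investigation time per month, dollars
leakage = (1_400, 2_800)              # avoided refund leakage per month
engineering = (5_000 / 3, 7_500 / 3)  # preserved capacity spread across a quarter

low = time_savings + leakage[0] + engineering[0]   # about 4,400
high = time_savings + leakage[1] + engineering[1]  # about 6,650; revenue protection
# (up to 3,600 in the lead-routing example) supplies the remaining upside toward 8,000.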
The finished sprint pays for itself when it converts one serious incident, or several recurring smaller incidents, into durable controls. It does not promise that autonomous agents stop failing. It makes failures legible, bounded, and testable. That is the difference between a team arguing about agent quality and a team shipping targeted controls while preserving useful automation.