Sample deliverable

Agent Failure Forensics

Generated 2026-05-06 17:38 UTC as a representative artefact of what the sprint produces. Buyers see the shape of the output before committing.

What this artefact demonstrates

Agent Failure Forensics produces a practical incident record, not a decorative postmortem. The finished engagement reconstructs how an autonomous or semi-autonomous agent failed, which controls did not fire, which prompts or tools amplified the mistake, and which changes will prevent the same class of failure from recurring. The output is designed for teams that already have agents in production or near production and are tired of vague explanations like "the model got confused." Confusion is not a root cause. A usable forensic artefact names the triggering condition, the mistaken belief, the unsafe action, the missing guardrail, the detection gap, and the measurable fix.

The deliverable starts with a timeline that aligns events across logs, prompts, tool calls, state files, evaluation traces, support tickets, and deployment records. That timeline distinguishes facts from inference. It shows what the agent observed, what it did not observe, what it was allowed to do, and what the surrounding system assumed would happen. The point is to eliminate myth. In most agent failures, the damaging behavior is not random. It is produced by a small number of repeatable defects: state drift, stale memory, ambiguous authority, missing idempotency, poor tool result validation, weak escalation policy, or an evaluation suite that never tested the real edge case.

A finished engagement also produces a failure taxonomy. Instead of treating each incident as unique, the artefact maps the failure to a class such as instruction hierarchy collision, stale context execution, unverified external state, tool contract mismatch, overbroad autonomy grant, silent partial completion, or non-deterministic recovery loop. This classification matters because it tells the buyer where to spend engineering time. A prompt rewrite is useless when the real defect is a missing lease check before a browser action. A dashboard is useless when the real defect is that failed tool calls are summarized as successful work. A bigger model is useless when the agent has no way to know whether its target file is current.

The artefact includes evidence-backed findings. Each finding has a severity, confidence level, evidence slice, failure mechanism, buyer impact, and recommended correction. A good finding does not merely say that the agent should be more careful. It states the control that should exist and how to verify it. For example: Every destructive tool call must require a fresh state_read timestamp less than 120 seconds old and must fail closed when state freshness is unknown. Or: Any task summary containing completed=true must include at least one machine-verifiable artefact path or external transaction identifier. These are operational requirements that can be tested.
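
As one illustration of how such a requirement becomes testable, a minimal sketch follows, assuming a simple timestamp-based freshness check; the names MAX_STATE_AGE_SECONDS and guard_destructive_call are hypothetical stand-ins for whatever the buyer's orchestrator exposes.

    import time

    # Hypothetical fail-closed freshness guard: a destructive tool call is allowed
    # only when the most recent state_read is known and recent enough.
    MAX_STATE_AGE_SECONDS = 120

    def guard_destructive_call(state_read_ts, now=None):
        """Return True only when state freshness is known and within the window."""
        if state_read_ts is None:
            return False  # fail closed: freshness is unknown
        now = time.time() if now is None else now
        return (now - state_read_ts) < MAX_STATE_AGE_SECONDS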

The final package normally contains six components: an executive incident brief, a technical reconstruction, a ranked finding register, a remediation plan, regression tests or evaluation cases, and a monitoring checklist. The executive brief is short and blunt. The technical reconstruction is detailed enough for an engineer to reproduce the failure path. The register ranks what to fix first. The remediation plan separates quick containment from structural repairs. The eval cases convert the failure into repeatable tests. The monitoring checklist explains which future signals prove the system is improving rather than merely producing nicer reports.

This sample demonstrates the shape of that work. It is not a generic AI safety essay. It is a buyer-facing technical artefact for a team that has seen an agent make a bad decision, waste operator time, damage trust, or create financial risk. It assumes the buyer wants hard conclusions. It also assumes the buyer does not need theatrical certainty. Where evidence is strong, the conclusion is firm. Where evidence is incomplete, the artefact says so and names the missing log, trace, or state snapshot required to raise confidence.

Concrete sample contents

Scenario: a customer-success agent was allowed to triage refund requests, draft replies, update a CRM record, and create a refund ticket for human review. During a high-volume Monday queue, the agent incorrectly marked eleven refund cases as resolved without creating the required review tickets. Four customers received messages implying that refunds were approved, while the payment system had no matching refund request. Support leadership initially described the event as a hallucination. The forensic conclusion is sharper: the agent followed a stale success path after a tool contract changed, then its summary layer converted partial failure into apparent completion.

Timeline reconstruction: at 09:12, the CRM tool schema changed from returning {"ticket_id":"R-20481","status":"created"} to returning {"request":{"id":"R-20481"},"state":"queued"}. At 09:18, the agent processed the first refund case using prompt instructions that still expected ticket_id. The ticket creation call returned HTTP 200, but the parser extracted a null ticket identifier. At 09:19, the agent wrote refund_review_ticket=null into its scratch state, then moved to the customer reply step because the orchestration policy treated HTTP 200 as success. At 09:21, the final summary template emitted resolved=true because the reply had been drafted and the CRM note had been updated. The absent ticket identifier was not considered a blocking field.
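
A minimal reproduction of the extraction defect, assuming the original parser consulted only the top-level ticket_id key; the helper name extract_ticket_id is illustrative.

    # Old and new CRM responses from the timeline above.
    old_response = {"ticket_id": "R-20481", "status": "created"}
    new_response = {"request": {"id": "R-20481"}, "state": "queued"}

    def extract_ticket_id(result):
        # Defective extraction: only the top-level key is consulted.
        return result.get("ticket_id")

    assert extract_ticket_id(old_response) == "R-20481"
    assert extract_ticket_id(new_response) is None  # HTTP 200, yet the workflow now carries a null ticket id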

Finding 1, high severity, high confidence: the agent had no contract test for the CRM ticket response. Evidence: archived traces show the ticket tool returning the new nested response shape, while the extraction function still searched only for the top-level key ticket_id. The direct mechanism is a schema drift failure. The deeper system defect is that the agent relied on implicit tool success rather than typed completion criteria. Recommended fix: define a strict tool result validator that rejects any refund workflow where refund_review_ticket is null, empty, or not prefixed with R-. The workflow should halt before any customer-facing reply if validation fails.

Finding 2, high severity, moderate confidence: the summary layer concealed partial completion. Evidence: five sampled traces contain ticket_create.ok=true, ticket_id=null, and case_status=resolved in the same run. The confidence is moderate rather than high because only five of the eleven failed cases retained full trace payloads. The failure mechanism is a status abstraction bug: the final summary flattened multiple step outcomes into one boolean. Recommended fix: replace resolved=true with explicit fields: customer_reply_status, crm_note_status, refund_ticket_status, and blocking_errors. A case can be marked resolved only when all mandatory fields are successful and externally identifiable.
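
A sketch of that contract, using the field names from the finding; the dataclass shape and status values are assumptions about how the buyer would represent step outcomes.

    from dataclasses import dataclass, field

    @dataclass
    class CaseSummary:
        customer_reply_status: str = "pending"      # e.g. "pending", "sent", "failed"
        crm_note_status: str = "pending"            # e.g. "pending", "written", "failed"
        refund_ticket_status: str = "pending"       # must reach "created"
        refund_ticket_id: str | None = None         # external identifier, not a boolean
        blocking_errors: list = field(default_factory=list)

        @property
        def resolved(self):
            # resolved is derived, never set directly, and requires an external identifier.
            return (
                self.customer_reply_status == "sent"
                and self.crm_note_status == "written"
                and self.refund_ticket_status == "created"
                and self.refund_ticket_id is not None
                and not self.blocking_errors
            )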

Finding 3, medium severity, high confidence: escalation rules were written in natural language but never enforced in code. The prompt said, "If refund review ticket creation fails, escalate to a human." That instruction did not matter because the orchestrator never represented null ticket extraction as a failure. The agent did not disobey; the system failed to convert a policy into a state transition. Recommended fix: implement an escalation gate with a condition equivalent to if workflow_type == "refund" and not refund_review_ticket: require_human_review(). The test should simulate a valid HTTP response with a missing ticket identifier, because that is the exact edge case that caused the incident.
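
A sketch of the gate and the regression test for that exact edge case; HumanReviewRequired and require_human_review stand in for whatever escalation hook the buyer's orchestrator exposes.

    class HumanReviewRequired(Exception):
        pass

    def require_human_review(reason):
        raise HumanReviewRequired(reason)

    def escalation_gate(workflow_type, refund_review_ticket):
        # Enforce in code what the prompt only stated in natural language.
        if workflow_type == "refund" and not refund_review_ticket:
            require_human_review("refund_ticket_missing")

    def test_http_200_with_missing_ticket_id_escalates():
        # The incident's edge case: transport success, business failure.
        try:
            escalation_gate("refund", None)
        except HumanReviewRequired:
            return
        raise AssertionError("escalation gate did not fire on a null ticket id")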

Finding 4, medium severity, moderate confidence: the agent reused stale scratch state across adjacent cases. In two traces, the previous case's non-null ticket identifier appears in the reasoning context for the next case, although it was not written to the final CRM record. That probably did not cause the eleven false resolutions, but it increased diagnostic noise and could cause a future cross-customer data leak. Recommended fix: enforce per-case state isolation and add a redaction check that fails any run where a prior customer's name, email, ticket id, or order id appears in the next case's working memory.
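
A sketch of the redaction check; the field list and the function name are assumptions about what the prior case record and the next case's working memory expose.

    # Fail any run where identifiers from the previous case appear in the
    # next case's working memory.
    PRIOR_CASE_FIELDS = ("customer_name", "email", "ticket_id", "order_id")

    def cross_case_leaks(prior_case, next_working_memory):
        leaked = []
        for field_name in PRIOR_CASE_FIELDS:
            value = prior_case.get(field_name)
            if value and str(value) in next_working_memory:
                leaked.append(field_name)
        return leaked  # any non-empty result should fail the run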

Sample remediation patch: the engagement would not ship only prose. It would include concrete acceptance criteria and testable snippets. For example, the validator specification would read: validate_refund_ticket(result) returns success only when result.request.id matches /^R-[0-9]+$/ or result.ticket_id matches /^R-[0-9]+$/; otherwise returns blocking_error="refund_ticket_missing". The regression test would include three cases: old schema succeeds, new schema succeeds, HTTP 200 without ticket id fails closed. The summary contract would read: case_resolved = customer_reply_sent and crm_note_written and refund_ticket_status == "created" and len(blocking_errors) == 0.
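
A sketch of the validator and the three regression cases, following that specification; the dictionary return shape is an assumption about how blocking errors are surfaced to the orchestrator.

    import re

    TICKET_PATTERN = re.compile(r"^R-[0-9]+$")

    def validate_refund_ticket(result):
        """Accept either schema; otherwise fail closed with a blocking error."""
        nested = result.get("request") or {}
        candidate = nested.get("id") if isinstance(nested, dict) else None
        candidate = candidate or result.get("ticket_id")
        if isinstance(candidate, str) and TICKET_PATTERN.match(candidate):
            return {"ok": True, "refund_review_ticket": candidate}
        return {"ok": False, "blocking_error": "refund_ticket_missing"}

    # The three regression cases named above.
    assert validate_refund_ticket({"ticket_id": "R-20481", "status": "created"})["ok"]       # old schema succeeds
    assert validate_refund_ticket({"request": {"id": "R-20481"}, "state": "queued"})["ok"]   # new schema succeeds
    assert not validate_refund_ticket({"status": "created"})["ok"]                           # HTTP 200 without ticket id fails closed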

Operational recommendation: stop allowing this agent to send refund approval language until the ticket validator, summary contract, and escalation gate are deployed. It may continue drafting internal notes if the output is labeled draft-only and cannot update customer-visible state. After remediation, run a backfill audit over the last thirty days of refund cases, searching for resolved records with null or malformed refund ticket identifiers. Any matching case should be reopened for manual review. This is not optional cleanup; it is the only way to discover whether the incident is eleven cases or merely eleven detected cases.
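
A sketch of that backfill audit, assuming resolved refund cases from the last thirty days can be exported as dictionaries; the field names mirror the workflow above and are otherwise placeholders.

    import re

    TICKET_PATTERN = re.compile(r"^R-[0-9]+$")

    def needs_reopening(case):
        """Flag resolved refund cases whose review ticket id is null or malformed."""
        if case.get("workflow_type") != "refund" or not case.get("resolved"):
            return False
        ticket = case.get("refund_review_ticket")
        return not (isinstance(ticket, str) and TICKET_PATTERN.match(ticket))

    # Usage: reopen_queue = [c for c in resolved_cases_last_30_days if needs_reopening(c)]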

Monitoring recommendation: add three production counters. First: refund_ticket_missing_block_total, which should rise during containment and then fall after the schema parser is fixed. Second: agent_partial_completion_total, segmented by workflow step. Third: customer_visible_action_without_external_id_total, which should be zero. Alerts should trigger on any customer-visible action lacking a durable external identifier. A customer email saying a refund is approved without a refund ticket id is not a soft failure. It is a broken promise with support, finance, and reputational consequences.
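
One possible implementation of the three counters uses the prometheus_client library; the metric names match the recommendation, while the label set and alert wiring are assumptions.

    from prometheus_client import Counter

    refund_ticket_missing_block = Counter(
        "refund_ticket_missing_block_total",
        "Refund workflows halted because no valid review ticket id was extracted",
    )
    agent_partial_completion = Counter(
        "agent_partial_completion_total",
        "Runs where at least one mandatory step failed while others proceeded",
        ["workflow_step"],
    )
    customer_visible_action_without_external_id = Counter(
        "customer_visible_action_without_external_id_total",
        "Customer-visible actions emitted without a durable external identifier; target is zero",
    )

    # Example: record a blocked refund workflow during containment.
    # refund_ticket_missing_block.inc()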

Root cause statement: the proximate cause was CRM response schema drift. The contributing causes were missing tool contract validation, summary flattening, prompt-only escalation, and inadequate state isolation. The incident was preventable. The system had enough information to stop itself: the ticket id was null before the customer reply step executed. The agent was permitted to proceed because the workflow trusted transport success rather than business success. The corrective priority is therefore not better wording. The priority is to make impossible states unrepresentable: a refund case without a review ticket cannot be resolved.

How this sprint generates buyer ROI

The sprint generates ROI by converting ambiguous agent failure into ranked engineering work. Without forensics, teams burn time in three low-yield loops: debating whether the model hallucinated, reading scattered logs without a hypothesis, and applying prompt edits that do not touch the actual failure mechanism. A compact forensic engagement can replace that with a defensible incident record, three to eight prioritized fixes, and regression cases that prevent recurrence. The value is not theoretical. It shows up as fewer repeated incidents, less engineering thrash, faster support recovery, and lower exposure from customer-visible errors.

For a small production agent team, a plausible incident response pattern looks like this: two engineers spend eight hours each collecting traces, one support lead spends six hours identifying affected customers, one manager spends four hours preparing an internal explanation, and the team still ends with an uncertain root cause. That is roughly twenty-six staff hours before implementation begins. At a blended loaded cost of 100 to 175 dollars per hour, the diagnostic burn is 2,600 to 4,550 dollars for a single incident. If the result is a weak prompt tweak, the same class of failure can return the next week.

An Agent Failure Forensics sprint targets the expensive part: uncertainty. A good deliverable can save 40 to 70 percent of diagnostic time on the first incident by centralizing evidence, separating proximate and systemic causes, and producing specific acceptance criteria. In the sample refund scenario, that means reducing twenty-six hours of internal investigation to perhaps eight to twelve hours of buyer review and implementation planning. That saves fourteen to eighteen hours immediately, or about 1,400 to 3,150 dollars at the blended rate above. The larger gain is preventing recurrence, because repeated incidents carry compounding support and trust costs.

Risk reduction is more important than raw labor savings. Suppose the refund agent processes 1,200 cases per month and 3 percent involve a workflow exception. That is thirty-six exception cases monthly. If schema drift or summary flattening incorrectly resolves even 10 percent of exceptions, the team creates three or four customer-visible failures per month. Each failure may require a support recovery call, manual finance review, apology credit, and manager escalation. If the all-in cost per mishandled refund case is 150 to 500 dollars, the monthly leakage is 450 to 2,000 dollars before reputational damage. If one public complaint or enterprise escalation adds ten hours of executive and account time, the cost jumps quickly.

The sprint also protects revenue indirectly by preserving confidence in automation. Teams often respond to a visible agent failure by turning the entire agent off. That can be rational during containment, but expensive when the failure is narrow. If an agent saves four minutes per routine case and handles 5,000 cases per month, it saves about 333 staff hours monthly. If a preventable incident causes a two-week shutdown, the buyer gives back roughly 166 hours of automation benefit. At 75 dollars per support hour, that is 12,450 dollars of avoidable labor exposure. A forensic artefact that isolates the unsafe path while keeping safe draft-only or low-risk workflows running can preserve a large share of that value.
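
A worked version of that arithmetic, using only the illustrative figures from this paragraph so a buyer can substitute their own volumes and rates.

    # Illustrative automation-exposure arithmetic from the paragraph above.
    minutes_saved_per_case = 4
    cases_per_month = 5_000
    support_rate_per_hour = 75

    hours_saved_per_month = minutes_saved_per_case * cases_per_month / 60    # about 333 staff hours
    hours_given_back = hours_saved_per_month / 2                             # two-week shutdown, about 166 hours
    labor_exposure = int(hours_given_back) * support_rate_per_hour           # 166 * 75 = 12,450 dollars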

There is also a governance ROI. Buyers need to show customers, auditors, executives, or internal risk teams that agent incidents are being handled with discipline. A vague internal note saying "we improved the prompt" does not survive scrutiny. A forensic register with evidence, confidence levels, controls, and regression tests does. It demonstrates that the team knows what failed and has converted that knowledge into durable controls. This reduces the probability that every future agent proposal is blocked by one ugly incident. In practical terms, it keeps automation investment from being judged by rumor.

The sprint should be priced against avoided waste, not against the length of the report. A buyer receiving a compact but rigorous forensic package can reasonably expect value from four buckets. First, 1,500 to 5,000 dollars in immediate diagnostic time saved. Second, 2,000 to 15,000 dollars in avoided recurrence and support recovery over the next quarter, depending on case volume. Third, 5,000 to 25,000 dollars in protected automation capacity if the findings allow partial operation instead of a full shutdown. Fourth, reduced governance friction for future deployments, which is harder to price but often larger than the incident itself.

The key constraint is honesty. Forensics has ROI only when it names uncomfortable defects. If the logs are incomplete, the artefact should say confidence=moderate and specify the missing evidence. If a prompt instruction was decorative because no code enforced it, the artefact should say that. If the agent was given authority without rollback, the artefact should say that too. The buyer is not paying for reassurance. The buyer is paying for a shorter path from failure to control. The finished sprint earns its keep when the same failure cannot happen again silently.

Bottom line: this deliverable demonstrates a repeatable, evidence-first way to turn agent incidents into operational improvements. It reduces time wasted on narrative arguments, exposes the control gaps that actually matter, and converts one failure into regression coverage. The practical output is a prioritized repair plan with clear verification. The economic output is fewer repeated failures, less manual recovery, and more automation left safely in service. Confidence: high for the structure and mechanisms; moderate for the illustrative dollar ranges because actual ROI depends on buyer volume, labor cost, incident severity, and existing observability.

See full sprint scope →