Free Download · No Sign-up

Production Agent Failure Forensics Checklist

A 3-phase structured checklist for engineers who need to detect, capture, and replay silent production agent failures — before customers report them.

Pain addressed: Production agents fail silently. No replay-fixture monitoring. Teams discover failures from customer reports, can't repro, and ship blind.

Phase A — Detect & Catch

Instrument a failure signal log

Add a structured agent_signal_log at every tool-call boundary: input hash, output hash, tool name, latency, and a success/soft_fail/hard_fail tag. Emit this asynchronously — never block the agent thread.

Detect
Define silent-failure thresholds

Pick 3–5 proxy signals that indicate something went wrong without a crash: zero tool calls in a multi-step turn, latency spike >3σ above baseline, output token count drops below minimum expected, downstream API call skipped. Alert on any threshold breach.

Detect
Log the full turn context on soft failures

On any soft_fail, serialize: user message, conversation history, tool plan, all tool-call args (redact secrets), model temperature/top-p, and final output. Write to a append-only failure artifact store — not stdout.

Detect
Wire a PagerDuty / Slack alert for hard failures

Hard failures (traceback, API timeout, auth error) should page immediately. Use a separate on-call rotation from normal service alerts — agent failures are a distinct incident class.

Detect

Phase B — Replay & Reproduce

Build a replay fixture from each failure artifact

A replay fixture is a self-contained JSON file: {session_id, messages, tool_history, context_env}. Every failure artifact should be directly loadable as a test fixture without manual assembly.

Replay
Run a deterministic replay test in CI

On every failure artifact: replay against the current agent build, assert output matches expected (or assert it fails the same way). This catches regressions that re-introduce fixed bugs.

Replay
Record the diff between failing and fixed runs

After fixing a bug, store the diff between the failing fixture and the fixed output. This creates a regression corpus that grows with every incident — the opposite of "we fixed it and forgot it."

Replay

Phase C — Monitor & Prevent

Add a replay-fixture regression test to every PR

Gate merges on the replay corpus. If a PR causes any historical fixture to produce a different output (even a "better" one), require an engineer to explicitly acknowledge the output change. Avoid silent output drift.

Prevent
Track MTTR from failure signal to fix merged

Measure: time from first silent-failure signal → time a replay fixture was created → time fix was identified → time fix was merged. Target: fixture available within 1 hour of first customer-impacting signal.

Prevent
Run a monthly failure pattern review

Every 30 days: cluster failure artifacts by root cause type (context overflow, tool timeout, model degradation, prompt drift). Identify the top 2 categories and add structured mitigations — not just individual hot-fixes.

Prevent

Reference · Replay Fixture Schema

Minimal viable replay fixture (save as agent_fail_YYYYMMDD_HHMMSS.json)

{
  // ── Identity ──────────────────────────────────────────
  "fixture_id": "agent_fail_20260513_081500",
  "agent_version": "v1.4.2",
  "recorded_at": "2026-05-13T08:15:00Z",
  "session_id": "sess_abc123",

  // ── Turn context ─────────────────────────────────────
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],

  // ── Tool calls that ran ──────────────────────────────
  "tool_history": [
    {
      "tool": "search_db",
      "args_hash": "sha256:abc...",
      "result_hash": "sha256:def...",
      "latency_ms": 2840,
      "status": "soft_fail"   // "ok" | "soft_fail" | "hard_fail"
    }
  ],

  // ── Environment at time of failure ───────────────────
  "context_env": {
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 2048,
    "region": "us-east-1"
  },

  // ── Failure classification ───────────────────────────
  "failure_type": "tool_timeout",
  "failure_signal": "zero_tool_calls_in_5_step_plan",
  "customer_visible": false
}

Reference · Failure Signal Quick-Check

Signal	Threshold	Severity	Immediate Action
Zero tool calls in multi-step turn	>2 consecutive steps with 0 tools	Hard fail	Page on-call, create replay fixture
Tool-call latency spike	>3σ above 7-day baseline	Soft fail	Log artifact, alert in #agent-alerts
Output token count drop	<min_expected_tokens (set per task type)	Soft fail	Log artifact, flag for review
Downstream API call skipped	Any planned tool call absent from history	Hard fail	Page on-call, isolate context
Model error / API 5xx	Any provider error returned to agent	Hard fail	Page on-call, check provider status
Tool result hash mismatch	Same args → different output hash	Soft fail	Log diff, add to regression corpus
All turns green — but CSAT dropped	Customer satisfaction down >10% week-over-week	Soft fail	Retrospective, add new signal rule

Want a working agent failure forensics setup?

The full sprint deliverable includes a Python instrumentation library, a replay fixture runner, a Slack alert integration, and a Notion incident template — all wired and tested.

View the Full Sprint →