Free Download · No Sign-up

Production Agent Failure Forensics Checklist

A 3-phase structured checklist for engineers who need to detect, capture, and replay silent production agent failures — before customers report them.

Pain addressed: Production agents fail silently. No replay-fixture monitoring. Teams discover failures from customer reports, can't repro, and ship blind.

Phase A — Detect & Catch
Phase B — Replay & Reproduce
Phase C — Monitor & Prevent

Reference · Replay Fixture Schema
Minimal viable replay fixture (save as agent_fail_YYYYMMDD_HHMMSS.json)
{
  // ── Identity ──────────────────────────────────────────
  "fixture_id": "agent_fail_20260513_081500",
  "agent_version": "v1.4.2",
  "recorded_at": "2026-05-13T08:15:00Z",
  "session_id": "sess_abc123",

  // ── Turn context ─────────────────────────────────────
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],

  // ── Tool calls that ran ──────────────────────────────
  "tool_history": [
    {
      "tool": "search_db",
      "args_hash": "sha256:abc...",
      "result_hash": "sha256:def...",
      "latency_ms": 2840,
      "status": "soft_fail"   // "ok" | "soft_fail" | "hard_fail"
    }
  ],

  // ── Environment at time of failure ───────────────────
  "context_env": {
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 2048,
    "region": "us-east-1"
  },

  // ── Failure classification ───────────────────────────
  "failure_type": "tool_timeout",
  "failure_signal": "zero_tool_calls_in_5_step_plan",
  "customer_visible": false
}

Reference · Failure Signal Quick-Check
Signal Threshold Severity Immediate Action
Zero tool calls in multi-step turn >2 consecutive steps with 0 tools Hard fail Page on-call, create replay fixture
Tool-call latency spike >3σ above 7-day baseline Soft fail Log artifact, alert in #agent-alerts
Output token count drop <min_expected_tokens (set per task type) Soft fail Log artifact, flag for review
Downstream API call skipped Any planned tool call absent from history Hard fail Page on-call, isolate context
Model error / API 5xx Any provider error returned to agent Hard fail Page on-call, check provider status
Tool result hash mismatch Same args → different output hash Soft fail Log diff, add to regression corpus
All turns green — but CSAT dropped Customer satisfaction down >10% week-over-week Soft fail Retrospective, add new signal rule

Want a working agent failure forensics setup?

The full sprint deliverable includes a Python instrumentation library, a replay fixture runner, a Slack alert integration, and a Notion incident template — all wired and tested.

View the Full Sprint →