A 3-phase structured checklist for engineers who need to detect, capture, and replay silent production agent failures before customers report them.
Pain addressed: Production agents fail silently, and most teams have no replay-fixture monitoring in place. They discover failures from customer reports, cannot reproduce them, and ship fixes blind.
{
  // ── Identity ──────────────────────────────────────────
  "fixture_id": "agent_fail_20260513_081500",
  "agent_version": "v1.4.2",
  "recorded_at": "2026-05-13T08:15:00Z",
  "session_id": "sess_abc123",
  // ── Turn context ─────────────────────────────────────
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  // ── Tool calls that ran ──────────────────────────────
  "tool_history": [
    {
      "tool": "search_db",
      "args_hash": "sha256:abc...",
      "result_hash": "sha256:def...",
      "latency_ms": 2840,
      "status": "soft_fail" // "ok" | "soft_fail" | "hard_fail"
    }
  ],
  // ── Environment at time of failure ───────────────────
  "context_env": {
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 2048,
    "region": "us-east-1"
  },
  // ── Failure classification ───────────────────────────
  "failure_type": "tool_timeout",
  "failure_signal": "zero_tool_calls_in_5_step_plan",
  "customer_visible": false
}
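One way to produce a record shaped like the fixture above is sketched here in Python. It is an illustration, not the sprint's instrumentation library: the helper names `stable_hash`, `record_tool_call`, and `build_fixture` are hypothetical, and deriving `fixture_id` from the UTC timestamp is an assumption inferred from the sample values. Hashing is done over canonical JSON so that the same arguments always yield the same `args_hash`.

```python
import hashlib
import json
from datetime import datetime, timezone

VALID_STATUSES = {"ok", "soft_fail", "hard_fail"}


def stable_hash(value):
    """Hash a JSON-serializable value; key order does not affect the result."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_tool_call(tool, args, result, latency_ms, status="ok"):
    """Build one tool_history entry (hypothetical helper)."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "tool": tool,
        "args_hash": stable_hash(args),
        "result_hash": stable_hash(result),
        "latency_ms": latency_ms,
        "status": status,
    }


def build_fixture(agent_version, session_id, messages, tool_history,
                  context_env, failure_type, failure_signal,
                  customer_visible, recorded_at=None):
    """Assemble a fixture dict with the same keys as the schema above."""
    # Normalize to UTC so the "Z" suffix is accurate.
    # Note: a naive datetime is interpreted as local time by astimezone().
    when = (recorded_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        # Assumption: the id is "agent_fail_" plus the UTC timestamp.
        "fixture_id": "agent_fail_" + when.strftime("%Y%m%d_%H%M%S"),
        "agent_version": agent_version,
        "recorded_at": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session_id": session_id,
        "messages": messages,
        "tool_history": tool_history,
        "context_env": context_env,
        "failure_type": failure_type,
        "failure_signal": failure_signal,
        "customer_visible": customer_visible,
    }


if __name__ == "__main__":
    call = record_tool_call("search_db", {"q": "orders"}, {"rows": []},
                            latency_ms=2840, status="soft_fail")
    fixture = build_fixture(
        agent_version="v1.4.2",
        session_id="sess_abc123",
        messages=[{"role": "user", "content": "..."}],
        tool_history=[call],
        context_env={"model": "gpt-4o", "temperature": 0.7},
        failure_type="tool_timeout",
        failure_signal="zero_tool_calls_in_5_step_plan",
        customer_visible=False,
    )
    print(json.dumps(fixture, indent=2))
```

Storing hashes rather than raw arguments and results keeps fixtures small and avoids persisting customer data, at the cost of needing the raw payloads from another store when a replay has to re-run the tool.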
| Signal | Threshold | Severity | Immediate Action |
|---|---|---|---|
| Zero tool calls in multi-step turn | >2 consecutive steps with 0 tools | Hard fail | Page on-call, create replay fixture |
| Tool-call latency spike | >3σ above 7-day baseline | Soft fail | Log artifact, alert in #agent-alerts |
| Output token count drop | <min_expected_tokens (set per task type) | Soft fail | Log artifact, flag for review |
| Downstream API call skipped | Any planned tool call absent from history | Hard fail | Page on-call, isolate context |
| Model error / API 5xx | Any provider error returned to agent | Hard fail | Page on-call, check provider status |
| Tool result hash mismatch | Same args → different output hash | Soft fail | Log diff, add to regression corpus |
| All turns green, but CSAT dropped | Customer satisfaction down >10% week-over-week | Soft fail | Retrospective, add new signal rule |
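Several of the rules in the table can be expressed as small pure functions. The sketch below is a hedged illustration rather than the sprint's actual detector: all function names and signal labels are hypothetical, the latency baseline is assumed to be a plain list of recent latencies in milliseconds, and the hash check only compares calls within a single `tool_history`, whereas a real regression corpus would probably compare across sessions.

```python
from statistics import mean, stdev


def latency_spike(latency_ms, baseline_ms, sigmas=3.0):
    """True if latency exceeds the baseline mean by more than `sigmas` std devs."""
    if len(baseline_ms) < 2:
        return False  # not enough data to estimate a spread
    return latency_ms > mean(baseline_ms) + sigmas * stdev(baseline_ms)


def zero_tool_streak(tool_calls_per_step, limit=2):
    """True if more than `limit` consecutive steps made zero tool calls."""
    streak = 0
    for count in tool_calls_per_step:
        streak = streak + 1 if count == 0 else 0
        if streak > limit:
            return True
    return False


def skipped_tools(planned_tools, tool_history):
    """Return planned tool names that never appear in the history."""
    ran = {call["tool"] for call in tool_history}
    return [name for name in planned_tools if name not in ran]


def hash_mismatch(tool_history):
    """True if the same tool and args_hash produced different result hashes."""
    seen = {}
    for call in tool_history:
        key = (call["tool"], call["args_hash"])
        if key in seen and seen[key] != call["result_hash"]:
            return True
        seen.setdefault(key, call["result_hash"])
    return False


def evaluate_turn(tool_calls_per_step, planned_tools, tool_history, baseline_ms):
    """Return (signal, severity) pairs using the severities from the table."""
    findings = []
    if zero_tool_streak(tool_calls_per_step):
        findings.append(("zero_tool_calls", "hard_fail"))
    if skipped_tools(planned_tools, tool_history):
        findings.append(("tool_call_skipped", "hard_fail"))
    if any(latency_spike(c["latency_ms"], baseline_ms) for c in tool_history):
        findings.append(("latency_spike", "soft_fail"))
    if hash_mismatch(tool_history):
        findings.append(("result_hash_mismatch", "soft_fail"))
    return findings


if __name__ == "__main__":
    history = [{"tool": "search_db", "args_hash": "sha256:a",
                "result_hash": "sha256:x", "latency_ms": 2840}]
    baseline = [800, 820, 790, 810, 805, 795, 815]
    for signal, severity in evaluate_turn([1, 0, 0, 0], ["search_db", "send_email"],
                                          history, baseline):
        print(severity, signal)
```

Keeping each rule as a separate function makes it straightforward to add a new signal after a retrospective without touching the existing ones. Routing a `hard_fail` to paging and a `soft_fail` to logging is left to the caller.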
The full sprint deliverable includes a Python instrumentation library, a replay fixture runner, a Slack alert integration, and a Notion incident template — all wired and tested.