AI Agent Health Check
25-point diagnostic checklist for AI engineering teams — catch agent failures before production
AI agents fail in predictable ways. This checklist was built from analyzing 200+ production agent failure logs — the same methodology behind the Agent Failure Forensics Sprint ($750) and the Revenue Action Timeout Resolver Sprint ($3,500).
Use this checklist during: pre-deploy reviews, incident retrospectives, sprint planning, and new AI feature kickoffs. Each check takes 2–5 minutes.
What you get
One checklist file (.md) — ready to paste into Notion, Linear, GitHub Issues, or any project tracker.
25 diagnostic checks
- Timeout values match actual P99 latency for each tool call P1
- Context window budget tracked per conversation turn P1
- Tool-call retry logic has exponential backoff and max-attempt cap P1
- Rate-limit headers (X-RateLimit-*) are read and respected P1
- Auth tokens refresh before expiry under sustained-session conditions P1
- Output schema validation fires on every tool-call response P1
- Agent loop detector: max-turns guard prevents infinite loops P1
- Context drift check: system prompt reinforcement fires at 60% context P2
- Tool-call dependency graph has no circular dependencies P2
- Stale-context eviction policy is defined for long conversations P2
- Error classification: retryable vs. terminal errors handled separately P2
- Partial-response recovery: agent can resume from mid-flight failure P2
- Token counting is accurate (not estimated) for context-aware operations P2
- Graceful degradation defined for each third-party tool dependency P2
- Idempotency keys used on all non-idempotent operations P3
- Observability: structured logs on every tool call entry/exit P3
- Cost tracking: per-turn token cost is monitored and alertable P3
- Human-in-loop escalation path defined for P1 failures P3
- Sandbox timeout is longer than the longest expected tool call P3
- Cache invalidation policy defined for cached tool results P3
- Prompt injection defense: untrusted inputs sanitized before system prompt concatenation P1
- State serialization: agent checkpoint survives process restart P2
- Concurrent session limit enforced per user/account P3
- API version pinning: agent pins API versions that affect tool schemas P3
- Rollback plan defined for agent model version upgrades P3
$9
Instant download · One .md file · 25 checks
🔒 Secure payment via PayPal · redirected to download immediately after payment
30-day money-back guarantee · Instant delivery after PayPal · No account required
Questions? Email miloantaeus@gmail.com