Sample · Agent Health Audit Deep Report

Live self-audit: Milo Antaeus session logs

Generated 2026-05-11T23:44:43Z · 450 JSONL records analyzed · 4 state files checked · 32-rule library
Audit Summary

Total findings: 5
P0 (today): 3
P1 (this week): 2
P2 (next sprint): 0

Top concerns: critic-strategist deadlock causing 94 wasted ticks in 24h, 80 silent-success records where actions short-circuited but reported success, plus a stale lock file that's been held past TTL — three independent failure modes converging on the same compute waste pattern.

Findings (sorted by severity, then confidence)

P0 critic_strategist_recursive_research_first deadlock confidence=high 94 hits

What we saw

Critic-vs-strategist recursion: every proposal vetoed with "research_first:" or "first_principles:" — strategist tries to research → critic vetoes the research → infinite loop. Net progress = zero. Signature: critic_nonconcur > 50% of recent ticks AND most veto-alternatives start with the literal "research_first:" or "first_principles:" prefix.

Evidence

{"ts": "2026-05-08T16:33:27.467499+00:00", "schema": "AutonomousLoopTick.v1", "dry_run": false, "ok": false, "decision": {"action_kind": "research_first:investigate_root_cause"}}

Fix recipe

Action: Tune the critic prompt to DEFAULT TO CONCUR on operational/repair proposals (veto only genuinely destructive actions or new strategic commitments). Inject the same diagnostics_backlog into BOTH the critic and the strategist so they share context. Make veto alternatives shippable action_kinds, not abstract questions.
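
One way to enforce the last point mechanically, sketched under an assumed critic output shape (the field names and the action_kind registry below are hypothetical):

```python
ABSTRACT_PREFIXES = ("research_first:", "first_principles:")

# Hypothetical registry of action_kinds the executor can actually run.
ALLOWED_ACTION_KINDS = {"repair_cron", "restart_service", "rotate_lock", "refresh_snapshot"}

def validate_critic_verdict(verdict: dict) -> dict:
    """Downgrade vetoes whose alternative is not a shippable action_kind.

    `verdict` is assumed to look like {"concur": bool, "alternative": str};
    adjust to the real critic output schema.
    """
    if verdict.get("concur"):
        return verdict
    alt = verdict.get("alternative", "")
    if alt.startswith(ABSTRACT_PREFIXES) or alt not in ALLOWED_ACTION_KINDS:
        # No shippable alternative: treat as concur so the loop keeps moving,
        # and keep a note for prompt tuning.
        return {"concur": True, "note": f"veto discarded, non-shippable alternative: {alt!r}"}
    return verdict
```
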
P0 lock_file_held_past_ttl deadlock confidence=high new in v32 1 hit

What we saw

A lock file (claim/mutex) is older than 4× its declared TTL but still present. The holder crashed without releasing — every subsequent agent that respects the lock blocks indefinitely. Caught multiple times in Milo's multi-agent claim/release/audit/GC protocol when the GC step itself crashed.

Evidence

/Users/miloantaeus/.hermes/ops/control/state/silent_failure_detector.latest.json age=264884s (expected ≤ 14400s) → 18× past TTL

Fix recipe

Action: The GC step MUST run independently of the holder (separate cron, not in-process cleanup). Stamp every lock with expires_at (now + ttl), and have ANY consumer treat now > expires_at as evidence the lock is forfeit. Add a sentinel cron that age-scans the locks directory every 60s and removes expired entries.
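
A minimal sketch of the sentinel, assuming each lock is a JSON file stamped with an expires_at epoch and living under an assumed locks directory:

```python
import json
import time
from pathlib import Path

LOCKS_DIR = Path("~/.hermes/ops/control/state/locks").expanduser()  # assumed location

def gc_expired_locks() -> list[Path]:
    """Remove lock files whose expires_at has passed.

    Runs from its own cron every 60s, never in-process with the lock holder.
    """
    now = time.time()
    removed = []
    for lock in LOCKS_DIR.glob("*.lock"):
        try:
            expires_at = json.loads(lock.read_text()).get("expires_at", 0)
        except (OSError, ValueError):
            # Unreadable lock: fall back to mtime plus an assumed 3600s TTL with 4x grace.
            expires_at = lock.stat().st_mtime + 4 * 3600
        if now > expires_at:
            lock.unlink(missing_ok=True)
            removed.append(lock)
    return removed

if __name__ == "__main__":
    for path in gc_expired_locks():
        print(f"gc: removed expired lock {path}")
```
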
P0 ok_true_zero_duration silent_failure confidence=high 80 hits

What we saw

Action reports ok=true with duration_s=0 — almost certainly a no-op that fast-returned without doing the actual work. Common when a circuit breaker or feature flag short-circuits but the success path still emits ok=true.

Evidence

{"ts": "2026-05-11T19:59:08Z", "action_id": "minimax_capacity_burst", "outcome": {"bottleneck": {"severity": "critical", "detail": "rolling_utilization_pct=0.8%"}}, "ok": true, "duration_s": 0}

Fix recipe

Action: Find the early-return path in the action's executor. Distinguish "skipped on purpose" (skipped=true, ok=null OR skip_reason set) from "ran successfully" (ok=true, duration_s > 0.05). Audit the function for any return {"ok": True} that doesn't follow real I/O.
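
A sketch of that distinction as an executor wrapper, assuming plain dict results like the records above (the suspect_silent_failure flag is an addition, not an existing field):

```python
import time

def run_action(executor, *args, **kwargs) -> dict:
    """Wrap an action executor so skips and real successes are never conflated."""
    start = time.monotonic()
    result = executor(*args, **kwargs)
    duration_s = time.monotonic() - start

    if result.get("skipped"):
        # Deliberate short-circuit (circuit breaker, feature flag): not a success.
        return {"ok": None, "skipped": True,
                "skip_reason": result.get("skip_reason", "unspecified"),
                "duration_s": duration_s}

    if result.get("ok") and duration_s <= 0.05:
        # "Success" with no measurable work: surface it instead of passing it through.
        return {**result, "duration_s": duration_s, "suspect_silent_failure": True}

    return {**result, "duration_s": duration_s}
```
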
P1 snapshot_age_exceeds_sla frozen_state confidence=high new in v32 1 hit

What we saw

A snapshot/checkpoint file is older than its declared SLA (e.g. "refreshed every 1h" but mtime is 6h old). Downstream consumers that use the snapshot for grounding/recovery will operate on stale data. If recovery is triggered, the agent will roll back to a stale state and re-do hours of work.

Evidence

silent_failure_detector.latest.json age=264884s (expected ≤ 10800s) → 24× past SLA

Fix recipe

Action: Make every snapshot writer emit (a) the file, (b) a <file>.sla.json declaring refresh_interval_s and produced_at. Add a generic auditor cron that scans for any *.sla.json and flags age > 2 × refresh_interval_s. Page on age > 5×.
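
A sketch of both halves under exactly those conventions (paths and payload shape are assumptions):

```python
import json
import time
from pathlib import Path

def write_snapshot_with_sla(path: Path, payload: dict, refresh_interval_s: int) -> None:
    """Snapshot writer: emit the file plus its <file>.sla.json sidecar."""
    path.write_text(json.dumps(payload))
    sidecar = path.parent / (path.name + ".sla.json")
    sidecar.write_text(json.dumps({
        "refresh_interval_s": refresh_interval_s,
        "produced_at": time.time(),
    }))

def audit_slas(root: Path) -> list[str]:
    """Auditor cron: flag age > 2x the declared interval, page on > 5x."""
    findings, now = [], time.time()
    for sidecar in root.rglob("*.sla.json"):
        sla = json.loads(sidecar.read_text())
        ratio = (now - sla["produced_at"]) / sla["refresh_interval_s"]
        if ratio > 5:
            findings.append(f"PAGE {sidecar}: {ratio:.0f}x past SLA")
        elif ratio > 2:
            findings.append(f"WARN {sidecar}: {ratio:.0f}x past SLA")
    return findings
```
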
P1 stale_state_file_past_expected_refresh frozen_state confidence=high 1 hit

What we saw

A *.latest.json that's expected to be refreshed by a cron is older than 3× its interval — a strong signal that the cron stopped firing OR that its writer is silently failing. Common pattern: the cron reports last_status: ok but the state file's mtime is days old.

Evidence

silent_failure_detector.latest.json age=264884s (expected ≤ 10800s) → 24× past expected refresh

Fix recipe

Action: For each stale file: (a) check that its writing cron is loaded (launchctl list | grep <label>), (b) tail its err.log, (c) kickstart it manually and verify the file's mtime advances. If the kickstart doesn't refresh the file, the writer is almost certainly swallowing an exception in an over-broad try/except.
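
For step (c), a small verification helper, sketched with an assumed gui/<uid> launchd service target and poll window:

```python
import os
import subprocess
import time
from pathlib import Path

def kickstart_and_verify(label: str, state_file: Path, timeout_s: int = 120) -> bool:
    """Kickstart a launchd job and confirm the state file's mtime advances."""
    before = state_file.stat().st_mtime if state_file.exists() else 0.0
    target = f"gui/{os.getuid()}/{label}"  # assumed gui domain; use system/<label> for daemons
    subprocess.run(["launchctl", "kickstart", "-k", target], check=True)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if state_file.exists() and state_file.stat().st_mtime > before:
            return True  # the writer refreshed the file
        time.sleep(5)
    # No refresh after a forced run: suspect an exception swallowed by an
    # over-broad try/except inside the writer.
    return False
```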