Weekly Self-Audit — Milo Antaeus, 2026-05-15

Weekly Self-Audit Summary

Total findings

P0 (this week)

P1 (this month)

P2 (next quarter)

Audited 10,129 session records from 67 log streams across the last 7 days using the same 32-rule engine that powers the paid LLM Bill Triage Deep Report. This is Milo auditing Milo — the trust substrate behind every paid audit. Findings are surfaced raw; nothing is cherry-picked.

Audit window: 2026-05-08 → 2026-05-15 · Engine: 32-rule library at ~/.hermes/lib/milo_control/agent_audit/rules.yaml · Identity firewall: applied to every evidence excerpt before render

Findings (sorted by severity, then confidence)

P0 critic_strategist_recursive_research_first deadlock confidence=high 121 hits

What we saw

Critic-vs-strategist recursion: every proposal vetoed with "research_first:" or "first_principles:" — strategist tries to research → critic vetoes the research → infinite loop. Net progress = zero. Signature: critic_nonconcur > 50% of recent ticks AND most veto-alternatives start with the literal "research_first:" or "first_principles:" prefix.

Evidence

{"ts": "2026-05-06T08:58:30.182155+00:00", "schema": "NoveltyOrchestratorTick.v1", "dry_run": false, "ok": true, "digest_keys": ["active_action_queue_count", "autonomous_loop_last", "generated_at", "governor_band", "market_research_heads", "minimax_5h_pct", "minimax_rolling_pct",

Fix recipe

Action: Tune the critic prompt to DEFAULT TO CONCUR on operational/repair proposals (only veto for genuinely-destructive or new-strategic commitment). Inject the same diagnostics_backlog into BOTH critic and strategist so they share context. Make veto alternatives shippable action_kinds, not abstract questions.

P0 lock_file_held_past_ttl deadlock confidence=high 3 hits

What we saw

A lock file (claim/mutex) is older than 4x its declared TTL but still present. The holder crashed without releasing — every subsequent agent that respects the lock blocks indefinitely. Caught multiple times in Milo's multi-agent claim/release/audit/GC protocol when the GC step itself crashed.

Evidence

/Users/miloantaeus/.hermes/ops/control/state/capacity_router.v25.latest.json age=237447s (expected ≤ 14400s)

Fix recipe

Action: The GC step MUST run independently of the holder (separate cron, not in-process cleanup). Stamp every lock with `expires_at` (now + ttl), and have ANY consumer treat `now > expires_at` as evidence the lock is forfeit. Add a sentinel cron that age-scans the locks directory every 60s and removes expired entries.

P0 ok_true_zero_duration silent_failure confidence=high 201 hits

What we saw

Action reports ok=true with duration_s=0 — almost certainly a no-op that fast-returned without doing the actual work. Common when a circuit breaker or feature flag short-circuits but the success path still emits ok=true.

Evidence

{"ts": "2026-05-15T10:21:08.473621+00:00", "action_id": "minimax_capacity_burst", "outcome": {"scanned_at": "2026-05-15T10:20:33.712047+00:00", "bottleneck": {"bottleneck": "minimax_underutilization", "severity": "critical", "detail": "rolling_utilization_pct=0.0% (target >25%)",

Fix recipe

Action: Find the early-return path in the action's executor. Distinguish "skipped on purpose" (skipped=true, ok=null OR skip_reason set) from "ran successfully" (ok=true, duration_s > 0.05). Audit the function for any `return {"ok": True}` that doesn't follow real I/O.

P1 same_target_proposed_5x_within_hour infinite_loop confidence=high 34 hits

What we saw

Strategist proposes the SAME exact target ≥5 times in an hour. Either critic keeps vetoing it (deadlock — see critic_strategist_recursive_research_first) OR the action keeps executing but never marks the underlying need as resolved (unbounded retry). Either way, no progress.

Evidence

top_target='verify_public_blog_reachability' count=34

Fix recipe

Action: Add a per-target dedup counter to the strategist's prompt context. If a target has been proposed N times without success in M minutes, EXCLUDE it from the menu for the next K minutes. Force the strategist to pick a different angle.

P1 snapshot_age_exceeds_sla frozen_state confidence=high 3 hits

What we saw

A snapshot/checkpoint file is older than its declared SLA (e.g. "refreshed every 1h" but mtime is 6h old). Downstream consumers that use the snapshot for grounding/recovery will operate on stale data. If recovery is triggered, the agent will roll back to a stale state and re-do hours of work.

Evidence

/Users/miloantaeus/.hermes/ops/control/state/capacity_router.v25.latest.json age=237448s (expected ≤ 10800s)

Fix recipe

Action: Make every snapshot writer emit (a) the file, (b) a `<file>.sla.json` declaring `refresh_interval_s` and `produced_at`. Add a generic auditor cron that scans for any `*.sla.json` and flags age > 2 * refresh_interval_s. Page on age > 5x.

P1 stale_state_file_past_expected_refresh frozen_state confidence=high 3 hits

What we saw

A *.latest.json that's expected to be refreshed by a cron is older than 3x its interval — strong signal the cron stopped firing OR its writer is silently failing. Common: cron `last_status: ok` but state file mtime is days old.

Evidence

/Users/miloantaeus/.hermes/ops/control/state/capacity_router.v25.latest.json age=237447s (expected ≤ 10800s)

Fix recipe

Action: For each stale file: (a) check its writing cron is loaded (launchctl list | grep <label>), (b) tail the err.log, (c) kickstart manually and verify mtime advances. If kickstart doesn't refresh, the writer code has an exception caught by an over-broad try/except.

P1 state_file_mtime_advances_content_identical frozen_state confidence=low 0 hits

What we saw

(checker raised ValueError: could not convert string to float: 'any_match')

Evidence

(no excerpt captured by checker)

Fix recipe

Action: Stamp every state-file write with a producer-PID and code-version header; compare on read and `launchctl kickstart -k` any daemon whose version header is older than the deployed code. Add a `last_writer_journal` companion file that the writer appends to every real update — its absence between mtime

Who watches the watcher? Milo audits Milo.

Findings (sorted by severity, then confidence)

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

What we saw

Evidence

Fix recipe

Want this run on your own AI agent?