Top concerns: critic-strategist deadlock causing 94 wasted ticks in 24h, 80 silent-success records where actions short-circuited but reported success, plus a stale lock file that's been held past TTL — three independent failure modes converging on the same compute waste pattern.
Critic-vs-strategist recursion: every proposal vetoed with "research_first:" or "first_principles:" — strategist tries to research → critic vetoes the research → infinite loop. Net progress = zero. Signature: critic_nonconcur > 50% of recent ticks AND most veto-alternatives start with the literal "research_first:" or "first_principles:" prefix.
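The signature above can be checked mechanically. A minimal sketch, assuming each tick record exposes a boolean critic_nonconcur and an optional veto_alternative string (the Tick type and is_deadlocked name are hypothetical), and reading "most" as more than half of the vetoed ticks:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Literal prefixes named in the signature.
VETO_PREFIXES = ("research_first:", "first_principles:")


@dataclass
class Tick:
    """Hypothetical tick record; field names are assumptions."""
    critic_nonconcur: bool                  # True when the critic vetoed the proposal
    veto_alternative: Optional[str] = None  # critic's suggested alternative, if any


def is_deadlocked(ticks: Sequence[Tick], nonconcur_threshold: float = 0.5) -> bool:
    """Return True when both halves of the signature hold over the given ticks."""
    if not ticks:
        return False
    vetoes = [t for t in ticks if t.critic_nonconcur]
    # Condition 1: critic_nonconcur on more than 50% of recent ticks.
    if len(vetoes) / len(ticks) <= nonconcur_threshold:
        return False
    # Condition 2: most veto alternatives start with one of the literal prefixes.
    prefixed = sum(
        1 for t in vetoes if (t.veto_alternative or "").startswith(VETO_PREFIXES)
    )
    return prefixed > len(vetoes) / 2
```

Requiring both conditions keeps an ordinary run of substantive vetoes from being mistaken for the recursion.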
A lock file (claim/mutex) is older than 4× its declared TTL but still present. The holder crashed without releasing — every subsequent agent that respects the lock blocks indefinitely. Caught multiple times in Milo's multi-agent claim/release/audit/GC protocol when the GC step itself crashed.
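The 4× TTL condition can be sketched as a directory scan. This assumes each lock is a JSON file with created_at (Unix seconds) and ttl_s fields and a .lock suffix; those names and the layout are illustrative assumptions, not the protocol's actual format:

```python
import json
import time
from pathlib import Path
from typing import List, Optional

STALE_FACTOR = 4  # from the text: older than 4x the declared TTL


def lock_is_stale(created_at: float, ttl_s: float, now: float,
                  factor: float = STALE_FACTOR) -> bool:
    """True when the lock's age exceeds factor times its declared TTL."""
    return (now - created_at) > factor * ttl_s


def find_stale_locks(locks_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Return lock files in locks_dir whose age is past STALE_FACTOR x TTL."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(locks_dir.glob("*.lock")):
        try:
            meta = json.loads(path.read_text())
            created_at = float(meta["created_at"])
            ttl_s = float(meta["ttl_s"])
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or malformed lock: skipped here, not judged
        if lock_is_stale(created_at, ttl_s, now):
            stale.append(path)
    return stale
```

The scan only reports; whether to remove a stale lock is left to the caller.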
expires_at (now + ttl), and have ANY consumer treat now > expires_at as evidence the lock is forfeit. Add a sentinel cron that age-scans the locks directory every 60s and removes expired entries.
Action reports ok=true with duration_s=0: almost certainly a no-op that fast-returned without doing the actual work. Common when a circuit breaker or feature flag short-circuits but the success path still emits ok=true.
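A filter for the ok=true, duration_s=0 pattern could look like the sketch below. It assumes action records are dicts carrying ok and duration_s keys, and it treats a missing duration as zero; both are assumptions about the record shape:

```python
from typing import Any, Dict, Iterable, List


def is_silent_success(record: Dict[str, Any], min_duration_s: float = 0.0) -> bool:
    """True when a record claims success but took no measurable time."""
    duration_s = float(record.get("duration_s") or 0.0)  # missing/None counts as 0
    return record.get("ok") is True and duration_s <= min_duration_s


def find_silent_successes(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the records matching the silent-success pattern, in input order."""
    return [r for r in records if is_silent_success(r)]
```

Raising min_duration_s slightly above zero would also catch fast-returns that record a few microseconds, at the cost of more false positives.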
return {"ok": True} that doesn't follow real I/O.
A snapshot/checkpoint file is older than its declared SLA (e.g. "refreshed every 1h" but mtime is 6h old). Downstream consumers that use the snapshot for grounding/recovery will operate on stale data. If recovery is triggered, the agent will roll back to a stale state and re-do hours of work.
<file>.sla.json declaring refresh_interval_s and produced_at. Add a generic auditor cron that scans for any *.sla.json and flags age > 2 × refresh_interval_s. Page on age > 5×.
A *.latest.json that's expected to be refreshed by a cron is older than 3× its interval: a strong signal the cron stopped firing OR its writer is silently failing. Common: cron last_status: ok but state file mtime is days old.
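The sidecar audit described above (flag at 2×, page at 5×) might be sketched as follows. It assumes produced_at is stored as Unix epoch seconds and that sidecars sit under a single root directory; the function names and the "invalid" status are hypothetical:

```python
import json
import time
from pathlib import Path
from typing import Dict, Optional

FLAG_FACTOR = 2  # flag when age > 2 x refresh_interval_s
PAGE_FACTOR = 5  # page when age > 5 x refresh_interval_s


def classify_age(age_s: float, refresh_interval_s: float) -> str:
    """Map an artifact's age to 'ok', 'flag', or 'page'."""
    if age_s > PAGE_FACTOR * refresh_interval_s:
        return "page"
    if age_s > FLAG_FACTOR * refresh_interval_s:
        return "flag"
    return "ok"


def audit_sla_sidecars(root: Path, now: Optional[float] = None) -> Dict[str, str]:
    """Scan root recursively for *.sla.json and classify each by age."""
    now = time.time() if now is None else now
    results: Dict[str, str] = {}
    for sidecar in sorted(root.rglob("*.sla.json")):
        try:
            meta = json.loads(sidecar.read_text())
            age_s = now - float(meta["produced_at"])  # assumed epoch seconds
            interval_s = float(meta["refresh_interval_s"])
        except (OSError, ValueError, KeyError, TypeError):
            results[str(sidecar)] = "invalid"  # malformed sidecar is itself a finding
            continue
        results[str(sidecar)] = classify_age(age_s, interval_s)
    return results
```

Reading produced_at from the sidecar rather than the file's mtime avoids being fooled by a writer that touches the file without refreshing its contents.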
launchctl list | grep <label>), (b) tail the err.log, (c) kickstart manually and verify mtime advances. If kickstart doesn't refresh, the writer code has an exception caught by an over-broad try/except.