Your orchestrator completes the full run successfully, yet the market_research_deepening metric reads 0.0. That gap between execution and evaluation is exactly what this sprint diagnoses — and fixes with artefacts you can deploy immediately.
A Propagation Audit Report pinpointing where market_research_deepening is dropped, suppressed, or never emitted, with annotated call-graph excerpts and Pydantic schema diffs.
A replay fixture (conftest.py + test case): a self-contained pytest fixture that deterministically reproduces the 0.0 scoring condition against your codebase, so your team can regression-test any fix without manual orchestration runs.
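To make the deliverable concrete, here is a minimal sketch of what such a fixture could look like. Every name here is illustrative (there is no real `score_market_research_deepening` function or payload shape in this document); the stand-in scorer simply mimics the failure mode where a missing or non-numeric field collapses to 0.0.

```python
# conftest.py -- hypothetical regression fixture (all names illustrative)
import pytest


def score_market_research_deepening(payload: dict) -> float:
    # Stand-in for the real scoring module: a missing or non-numeric
    # field silently collapses to 0.0 -- the condition under audit.
    value = payload.get("market_research_deepening")
    return float(value) if isinstance(value, (int, float)) else 0.0


@pytest.fixture
def orphaned_payload():
    # Worker subagent output with the metric field absent,
    # deterministically reproducing the 0.0 score.
    return {
        "agent": "market_research_worker",
        "findings": ["competitor A raised prices"],
        # "market_research_deepening" intentionally omitted
    }


def test_missing_field_scores_zero(orphaned_payload):
    assert score_market_research_deepening(orphaned_payload) == 0.0
```

Because the fixture builds the payload in-process, the regression test runs in milliseconds and needs no live orchestrator.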
Yes. The audit is architecture-agnostic and focuses on the data-flow contract between your orchestrator (LangGraph, Claude Code skills, or custom multi-agent), the worker subagent payload, and the scoring module. If you're using Pydantic schemas for output validation, the report will explicitly call out any missing or mis-typed market_research_deepening fields in those schemas.
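To illustrate the kind of schema gap the report would flag, here is a hypothetical Pydantic model pair (field names assumed, not taken from any real codebase). By default, Pydantic models ignore unknown input fields, so a field present in the worker's output but absent from the orchestrator's schema is silently stripped before scoring.

```python
from pydantic import BaseModel


class WorkerOutput(BaseModel):
    # What the worker subagent emits.
    findings: list[str]
    market_research_deepening: float


class OrchestratorPayload(BaseModel):
    # What the orchestrator forwards to scoring. Pydantic drops
    # unknown fields by default, so the metric never survives
    # this hop -- the schema diff the audit would surface.
    findings: list[str]
    # market_research_deepening is missing here


worker = WorkerOutput(
    findings=["competitor A raised prices"],
    market_research_deepening=0.8,
)
forwarded = OrchestratorPayload(**worker.model_dump())
# "market_research_deepening" is absent from forwarded.model_dump()
```

A one-line fix (adding the field to `OrchestratorPayload`) resolves this class of bug, which is why the report pairs each finding with the exact schema diff.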
The audit delivers an unambiguous determination. If the 0.0 score is intentional gating (e.g., a safety filter or minimum-confidence threshold), the report documents the exact gate condition, its threshold, and the recommended adjustment range, with a risk note for any threshold change. Either way, you'll have a written record.
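As a sketch of what "intentional gating" means here, assuming a hypothetical minimum-confidence gate (both the function and the threshold value are illustrative): a genuine score can surface downstream as 0.0 simply because confidence fell below the floor.

```python
MIN_CONFIDENCE = 0.6  # hypothetical gate threshold (illustrative)


def gated_score(raw_score: float, confidence: float) -> float:
    # An intentional gate: below the confidence floor the score is
    # zeroed rather than reported, so a real value can read as 0.0
    # downstream even though scoring itself worked correctly.
    return raw_score if confidence >= MIN_CONFIDENCE else 0.0
```

If the audit finds a gate like this, the report states the condition and threshold explicitly rather than treating the 0.0 as a propagation bug.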
You can anonymize the orchestrator graph definition and scoring module independently — the replay fixture accepts a structured input dict that mirrors your data shape without requiring your full production codebase. Milo will share a minimal test harness template first so you can see exactly what data shape is needed.
The sprint delivers the five artefacts as specified within 5 business days. If the diagnosis identifies a deeper structural issue that warrants a second engagement, Milo will flag it explicitly in the Propagation Audit Report with a recommended scope and timeline — so you can decide whether to proceed without ambiguity.