Sample · LLM Bill Triage Deep Report

Sample audit: open-source AI coding agent (anonymized)

Generated 2026-05-15 · 1.4M API calls analyzed · 30-day window · 32-rule library · public sample
Audit Summary
Total findings: 6
P0 (this week): 3
P1 (this month): 2
P2 (next quarter): 1

Top concerns: a 1,840-token system primer attached to 87% of completions that the model demonstrably ignores ($1,832/mo), 41% of completions routed to a model 8x more expensive than required for the task complexity ($1,419/mo), and 12,847 retries, 91% of them deterministic failures masquerading as transient errors ($412/mo + 1.8s p99 latency tax). Three independent failure modes — input bloat, routing miss, and silent retry storm — together account for 88% of the monthly waste.

Monthly spend: $4,847 → $1,389 after fixes
71% reduction. $3,458/mo savings. Report ROI: 13.9× in month 1, 41.7× annualized. Implementation effort estimated at 6–9 engineering hours total across all six recipes.

Findings (sorted by severity, then $-impact)

P0 prompt_bloat_unused_context runaway_cost confidence=high $1,832/mo 1.22M hits

What we saw

A 1,840-token "system context primer" was prepended to 87% of completions over the 30-day window (1.22M of 1.40M calls). Reasoning-token-overlap analysis on a stratified 4,000-completion sample shows the model referenced primer content in only 3.1% of outputs — the remaining 96.9% of completions paid for input tokens the model demonstrably ignored. Primer included project-wide style guides, contributor guidelines, and architecture overviews that were relevant to a small subset of code-edit prompts but attached uniformly to classification, search, summary, and tool-call routes.

Evidence

{"window": "2026-04-15..2026-05-14", "calls_with_primer": 1217843, "primer_tokens": 1840, "primer_referenced_in_output": 0.031, "wasted_input_tokens_total": 2_245_531_120, "model_mix": {"gpt-4o": 0.62, "gpt-4-turbo": 0.31, "claude-3.5-sonnet": 0.07}}

Fix recipe

Action: Split the primer into per-route system messages. Classification and search routes need ~120 tokens (output format + safety rails only). Code-edit routes keep the architecture overview (~640 tokens). Tool-call routes get the tool schemas only (~280 tokens). Expected token reduction: 84% on input side, $1,832/mo recovered. Validate with a 7-day shadow eval: same prompts through both old and new primers, compare task success on a 500-sample golden set. Quality parity threshold ≤ 1.5%.
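
A minimal sketch of the per-route split, assuming a TypeScript call path; the route names, the primer constants, the mapping of the summary route to the short primer, and the buildMessages helper are all illustrative, with token budgets taken from the recipe:

// Hedged sketch: per-route system messages replacing the 1,840-token global primer.
// The constants below stand in for the actual primer text; budgets are from the recipe.
const FORMAT_AND_SAFETY = "<~120-token output format + safety rails>";
const ARCHITECTURE_OVERVIEW = "<~640-token architecture overview for code-edit routes>";
const TOOL_SCHEMAS_ONLY = "<~280-token tool schemas>";

type Route = "classification" | "search" | "summary" | "code_edit" | "tool_call";

const SYSTEM_BY_ROUTE: Record<Route, string> = {
  classification: FORMAT_AND_SAFETY,
  search: FORMAT_AND_SAFETY,
  summary: FORMAT_AND_SAFETY,          // assumption: summary gets the short primer too
  code_edit: ARCHITECTURE_OVERVIEW,
  tool_call: TOOL_SCHEMAS_ONLY,
};

function buildMessages(route: Route, userPrompt: string) {
  return [
    { role: "system" as const, content: SYSTEM_BY_ROUTE[route] },
    { role: "user" as const, content: userPrompt },
  ];
}
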
P0 model_routing_overkill runaway_cost confidence=high $1,419/mo 574K hits

What we saw

41% of completions (574,000 of 1.4M) match a classification heuristic — output is under 200 tokens, format is yes/no, single-label, or short-extraction (e.g. "is this a bug? Y/N", "label one of: feature/bug/chore", "extract function name"). All currently routed to gpt-4-turbo at $10/$30 per 1M tokens. A 200-sample holdout re-run on gpt-4o-mini ($0.15/$0.60 per 1M) shows quality parity within 2.1% on task-success metric — well inside the noise floor of human-graded eval.

Evidence

{"window": "30d", "classification_route_calls": 574_122, "current_model": "gpt-4-turbo", "current_cost": 1538.40, "proposed_model": "gpt-4o-mini", "proposed_cost": 119.21, "holdout_eval": {"n": 200, "quality_delta_pct": -2.1, "ci_95": [-4.4, 0.2]}, "savings_per_month": 1419.19}

Fix recipe

Action: Add a router middleware that classifies the prompt shape BEFORE the API call: (output_format_short OR contains_only_one_label_request OR extraction_pattern_match) AND expected_output_tokens < 200 → route to gpt-4o-mini. Keep gpt-4-turbo for code-edit, multi-step reasoning, and any prompt with embedded function-call schemas. Run a 1-week canary at 10% traffic, monitor task-success and user-feedback rates, then ramp to 100%. Expected savings: $1,419/mo. Implementation: 2-3 engineering hours including the router heuristic and eval harness.
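
One way the router heuristic could look, sketched in TypeScript; the PromptShape fields are assumptions about what the agent already knows about a request before dispatch, and the thresholds mirror the recipe:

// Hedged sketch: prompt-shape classification decided before the API call.
interface PromptShape {
  expectedOutputTokens: number;
  outputFormatShort: boolean;      // yes/no or similarly terse output format
  singleLabelRequest: boolean;     // "label one of: feature/bug/chore"
  extractionPattern: boolean;      // "extract function name" style prompts
  hasFunctionCallSchema: boolean;
  isCodeEditOrMultiStep: boolean;
}

function pickModel(shape: PromptShape): string {
  // Keep the expensive model for code edits, multi-step reasoning, and tool schemas.
  if (shape.hasFunctionCallSchema || shape.isCodeEditOrMultiStep) return "gpt-4-turbo";

  const classificationLike =
    (shape.outputFormatShort || shape.singleLabelRequest || shape.extractionPattern) &&
    shape.expectedOutputTokens < 200;

  return classificationLike ? "gpt-4o-mini" : "gpt-4-turbo";
}
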
P0 retry_storm_deterministic silent_failure confidence=high $412/mo 12,847 hits

What we saw

12,847 retries in 30 days where the underlying error was deterministic, not transient. Most common pattern: tool-call response returns malformed JSON (e.g. trailing comma, unescaped newline inside a string), the client treats the parse-failure as a transient error and retries up to 5 times with exponential backoff. Every retry re-runs the full prompt and bills for the full completion. Re-issuing the same prompt to the same model on a deterministic-error response will produce the same deterministic error — the retries are pure waste, plus a 1.8s p99 user-facing latency tax.

Evidence

{"retries_total": 12847, "deterministic_share": 0.91, "top_error": "JSONDecodeError: Expecting value: line 1 column 1 (char 0)", "retries_per_prompt_p50": 3, "retries_per_prompt_p99": 5, "wasted_completions": 11691, "wasted_tokens": 38_492_000, "p99_user_latency_added_ms": 1820}

Fix recipe

Action: (1) Parse the tool-call response with a strict schema (pydantic / zod) BEFORE counting it as a transient failure. (2) On schema-parse failure, do not retry — emit a structured error to the caller and surface in the dashboard. (3) Add an alert when retry-rate exceeds 2% on any route for 15 consecutive minutes. (4) For the small slice of genuinely-transient failures (HTTP 429, 503, connection reset), keep exponential backoff with jitter capped at 3 retries. Expected savings: $412/mo cost plus latency p99 reduction of ~1.8s. Implementation: 1-2 engineering hours.
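
A sketch of the parse-before-retry flow with zod; ToolCallResult and runOnce are hypothetical stand-ins for the agent's real schema and transport, and the backoff numbers are illustrative:

import { z } from "zod";

// Hedged sketch: deterministic parse failures are surfaced and never retried,
// while 429/503-style transient errors keep capped exponential backoff with jitter.
const ToolCallResult = z.object({
  name: z.string(),
  arguments: z.record(z.unknown()),
});

const TRANSIENT_STATUS = new Set([429, 503]);
const MAX_RETRIES = 3;

async function callToolRoute(runOnce: () => Promise<{ status: number; body: string }>) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const res = await runOnce();

    if (TRANSIENT_STATUS.has(res.status)) {
      // Genuinely transient: backoff and retry, capped at MAX_RETRIES.
      await new Promise(r => setTimeout(r, 2 ** attempt * 500 + Math.random() * 250));
      continue;
    }

    let json: unknown;
    try {
      json = JSON.parse(res.body);
    } catch (e) {
      // Malformed JSON from the model is deterministic: re-sending the same prompt
      // reproduces it. Emit a structured error instead of retrying.
      throw new Error(`tool-call returned malformed JSON: ${(e as Error).message}`);
    }

    const parsed = ToolCallResult.safeParse(json);
    if (!parsed.success) {
      throw new Error(`tool-call schema violation: ${parsed.error.message}`);
    }
    return parsed.data;
  }
  throw new Error("transient failures exhausted retry budget");
}
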
P1 cache_bypass_semantic_duplicate runaway_cost confidence=medium $287/mo 2,103 hits

What we saw

2,103 completions over 30 days have cosine-similarity > 0.92 to a prior completion within the same 24h window, but no caching layer is present in the request path. The duplicates are mostly user-driven re-asks ("explain this function" on the same function within hours), tool-loop retries from agent code, and a small number of test-harness re-runs. None hit a cache — every duplicate pays full token cost.

Evidence

{"window": "30d", "semantic_duplicates_at_0.92": 2103, "duplicates_at_0.95": 1471, "duplicates_at_0.98": 892, "avg_completion_cost_usd": 0.137, "recoverable_at_0.90_gate": 287.41, "embed_cost_per_lookup_usd": 0.000013}

Fix recipe

Action: Insert a semantic-cache layer (Helicone-style or self-hosted with Redis + bge-m3 embeddings). Gate at cosine-similarity ≥ 0.90, TTL 1h, scope by user_id to avoid cross-user leakage. Embed cost is ~$0.000013/lookup — net savings vs LLM-cost ratio is >10,000:1 at this volume. Confidence is medium not high because true savings depend on actual cache-hit rate in production traffic (lab cosine-similarity overestimates real-world hits by ~15-30% in typical deployments). Expected recoverable: $287/mo, possibly less. Implementation: 4-6 engineering hours including cache invalidation logic.
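
A rough sketch of the gate logic with an in-memory store standing in for the Redis + bge-m3 setup; embed() and callLLM() are assumed to exist in the agent's codebase, and the similarity gate, TTL, and user_id scoping follow the recipe:

// Hedged sketch: production would swap the Map for Redis vector lookups.
declare function embed(text: string): Promise<number[]>;      // assumption: embedding model
declare function callLLM(prompt: string): Promise<string>;    // assumption: existing call path

type CacheEntry = { vector: number[]; completion: string; expiresAt: number };
const cache = new Map<string, CacheEntry[]>();   // keyed by user_id to avoid cross-user leakage

const SIMILARITY_GATE = 0.90;
const TTL_MS = 60 * 60 * 1000;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function completeWithCache(userId: string, prompt: string): Promise<string> {
  const vector = await embed(prompt);
  const entries = (cache.get(userId) ?? []).filter(e => e.expiresAt > Date.now());
  const best = entries
    .map(e => ({ e, sim: cosine(vector, e.vector) }))
    .sort((x, y) => y.sim - x.sim)[0];
  if (best && best.sim >= SIMILARITY_GATE) return best.e.completion;   // cache hit, no LLM spend

  const completion = await callLLM(prompt);
  entries.push({ vector, completion, expiresAt: Date.now() + TTL_MS });
  cache.set(userId, entries);
  return completion;
}
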
P1 streaming_abort_unhonored silent_failure confidence=high $94/mo 2,318 hits

What we saw

4.7% of streamed completions (2,318 of 49,300 streaming calls) show client-side abort signals in the access log (user closed tab, navigated away, hit cancel button) but the server-side completion continues to the natural stop-token and bills for the full output. The OpenAI/Anthropic SDK call is not wired to honor the inbound AbortController — client abort just drops the response on the floor while the server keeps generating.

Evidence

{"streaming_calls_30d": 49300, "client_aborts_logged": 2318, "server_side_completion_continued": 2318, "avg_tokens_after_abort": 312, "monthly_waste_usd": 94.07, "p99_completion_time_after_abort_ms": 4200}

Fix recipe

Action: Wire the inbound AbortController signal from the client request through the framework-level proxy down to the OpenAI/Anthropic SDK call. Both SDKs accept an AbortSignal in their streaming request options — the SDK closes the upstream connection within ~50ms of the abort, so generation (and billing) stops there. Verify with a synthetic-abort test: open a stream, abort at ~50% of expected generation, and confirm the stream terminates early and the billed output tokens in the usage record are well under a typical completion length. Expected savings: $94/mo plus user-perceived snappier cancel. Implementation: 1 engineering hour.
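
A sketch of the wiring with the OpenAI Node SDK; the Express-style handler is illustrative, and the Anthropic SDK accepts an AbortSignal the same way via its request options:

import express from "express";
import OpenAI from "openai";

// Hedged sketch: the key line is passing the AbortSignal through as the SDK request
// option so a client cancel actually stops upstream generation.
const app = express();
app.use(express.json());
const openai = new OpenAI();

app.post("/complete", async (req, res) => {
  const controller = new AbortController();
  // Abort when the connection closes before we finished writing (tab closed, cancel hit).
  res.on("close", () => {
    if (!res.writableEnded) controller.abort();
  });

  try {
    const stream = await openai.chat.completions.create(
      { model: "gpt-4o", messages: [{ role: "user", content: req.body.prompt }], stream: true },
      { signal: controller.signal },                  // SDK aborts the upstream request
    );
    for await (const chunk of stream) {
      res.write(chunk.choices[0]?.delta?.content ?? "");
    }
    res.end();
  } catch (err) {
    // Swallow the abort; surface anything else as a server error.
    if (!controller.signal.aborted) {
      res.statusCode = 500;
      res.end();
    }
  }
});
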
P2 prompt_drift_unpinned_versions eval_drift confidence=high 11 response shapes

What we saw

The same canonical user query ("Summarize this PR") returned 11 distinct response formats over the 30-day window with no version pinning on either the prompt template or the model. Model alias gpt-4o auto-rolled twice during the period (2026-04-22, 2026-05-08 per OpenAI changelog). No git-snapshot of the prompt template was kept — drift is invisible until users complain. No direct $-impact, but downstream regression risk is medium: any consumer of the PR-summary feature (e.g. CI integration, Slack bot) has no contract guarantee about output shape.

Evidence

{"canonical_query": "Summarize this PR", "distinct_response_shapes_30d": 11, "model_alias_used": "gpt-4o", "underlying_snapshot_rolls": ["2024-08-06 → 2024-11-20", "2024-11-20 → 2025-03-15"], "prompt_template_versioned_in_git": false, "downstream_consumers": 4}

Fix recipe

Action: (1) Pin model to a dated snapshot (gpt-4o-2024-08-06) instead of the moving alias. Schedule a quarterly review to test the latest snapshot against the pinned one. (2) Snapshot every prompt template in git with a version tag; embed prompt_version in each API call's metadata. (3) Add a weekly eval that runs 20 canonical PR fixtures through the prompt and compares output shape against a golden reference. Alert on normalized Levenshtein distance > 0.3 from the golden. No direct $-impact but stops the slow erosion of downstream contracts. Implementation: 3-4 engineering hours.
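
A sketch of the drift check; the Levenshtein implementation is a generic dynamic-programming version, fixture loading and alerting plumbing are left out, and PINNED_MODEL / PROMPT_VERSION are illustrative constants the team would attach to its own call metadata and logs:

const PINNED_MODEL = "gpt-4o-2024-08-06";    // dated snapshot, not the moving alias
const PROMPT_VERSION = "pr-summary@v3";       // illustrative git-tagged template version

// Plain edit-distance DP; normalized by the longer string so 0.3 is a stable threshold.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array<number>(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return dp[a.length][b.length];
}

function shapeDrift(current: string, golden: string): number {
  return levenshtein(current, golden) / Math.max(current.length, golden.length, 1);
}

// Weekly eval: fail (and alert) if any of the 20 canonical PR fixtures drifts past 0.3.
function fixturesPass(outputs: { current: string; golden: string }[]): boolean {
  return outputs.every(o => shapeDrift(o.current, o.golden) <= 0.3);
}
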