Sample · LLM Bill Triage Deep Report

Sample audit: open-source AI coding agent (anonymized)

Generated 2026-05-15 · 1.4M API calls analyzed · 30-day window · 32-rule library · public sample
Audit Summary
Total findings: 6
P0 (this week): 3
P1 (this month): 2
P2 (next quarter): 1

Top concerns: a 1,840-token system primer attached to 87% of completions that the model demonstrably ignores ($1,832/mo), 41% of completions routed to a model 8x more expensive than required for the task complexity ($1,419/mo), and 12,847 retries, 91% of them deterministic failures masquerading as transient errors ($412/mo + 1.8s p99 latency tax). Three independent failure modes — input bloat, routing miss, and silent retry storm — together account for 88% of the monthly waste.

Monthly spend: $4,847 → $1,389 after fixes
71% reduction. $3,458/mo savings. Report ROI: 13.9× in month 1, 41.7× annualized. Implementation effort estimated at 6–9 engineering hours total across all six recipes.

Findings (sorted by severity, then $-impact)

P0 prompt_bloat_unused_context runaway_cost confidence=high $1,832/mo 1.22M hits

What we saw

A 1,840-token "system context primer" was prepended to 87% of completions over the 30-day window (1.22M of 1.40M calls). Reasoning-token-overlap analysis on a stratified 4,000-completion sample shows the model referenced primer content in only 3.1% of outputs — the remaining 96.9% of completions paid for input tokens the model demonstrably ignored. Primer included project-wide style guides, contributor guidelines, and architecture overviews that were relevant to a small subset of code-edit prompts but attached uniformly to classification, search, summary, and tool-call routes.

Evidence

{"window": "2026-04-15..2026-05-14", "calls_with_primer": 1217843, "primer_tokens": 1840, "primer_referenced_in_output": 0.031, "wasted_input_tokens_total": 2_245_531_120, "model_mix": {"gpt-4o": 0.62, "gpt-4-turbo": 0.31, "claude-3.5-sonnet": 0.07}}

Fix recipe

Action: Split the primer into per-route system messages. Classification and search routes need ~120 tokens (output format + safety rails only). Code-edit routes keep the architecture overview (~640 tokens). Tool-call routes get the tool schemas only (~280 tokens). Expected token reduction: 84% on input side, $1,832/mo recovered. Validate with a 7-day shadow eval: same prompts through both old and new primers, compare task success on a 500-sample golden set. Quality parity threshold ≤ 1.5%.
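
A minimal sketch of the per-route split, assuming a TypeScript call path; the route names, the primer constants, the mapping of the summary route to the short primer, and the buildMessages helper are all illustrative, with token budgets taken from the recipe:

// Hedged sketch: per-route system messages replacing the 1,840-token global primer.
// The constants below stand in for the actual primer text; budgets are from the recipe.
const FORMAT_AND_SAFETY = "<~120-token output format + safety rails>";
const ARCHITECTURE_OVERVIEW = "<~640-token architecture overview for code-edit routes>";
const TOOL_SCHEMAS_ONLY = "<~280-token tool schemas>";

type Route = "classification" | "search" | "summary" | "code_edit" | "tool_call";

const SYSTEM_BY_ROUTE: Record<Route, string> = {
  classification: FORMAT_AND_SAFETY,
  search: FORMAT_AND_SAFETY,
  summary: FORMAT_AND_SAFETY,          // assumption: summary gets the short primer too
  code_edit: ARCHITECTURE_OVERVIEW,
  tool_call: TOOL_SCHEMAS_ONLY,
};

function buildMessages(route: Route, userPrompt: string) {
  return [
    { role: "system" as const, content: SYSTEM_BY_ROUTE[route] },
    { role: "user" as const, content: userPrompt },
  ];
}
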
P0 model_routing_overkill runaway_cost confidence=high $1,419/mo 574K hits

What we saw

41% of completions (574,000 of 1.4M) match a classification heuristic — output is under 200 tokens, format is yes/no, single-label, or short-extraction (e.g. "is this a bug? Y/N", "label one of: feature/bug/chore", "extract function name"). All currently routed to gpt-4-turbo at $10/$30 per 1M tokens. A 200-sample holdout re-run on gpt-4o-mini ($0.15/$0.60 per 1M) shows quality parity within 2.1% on task-success metric — well inside the noise floor of human-graded eval.

Evidence

{"window": "30d", "classification_route_calls": 574_122, "current_model": "gpt-4-turbo", "current_cost": 1538.40, "proposed_model": "gpt-4o-mini", "proposed_cost": 119.21, "holdout_eval": {"n": 200, "quality_delta_pct": -2.1, "ci_95": [-4.4, 0.2]}, "savings_per_month": 1419.19}

Fix recipe

Action: Add a router middleware that classifies the prompt shape BEFORE the API call: (output_format_short OR contains_only_one_label_request OR extraction_pattern_match) AND expected_output_tokens < 200 → route to gpt-4o-mini. Keep gpt-4-turbo for code-edit, multi-step reasoning, and any prompt with embedded function-call schemas. Run a 1-week canary at 10% traffic, monitor task-success and user-feedback rates, then ramp to 100%. Expected savings: $1,419/mo. Implementation: 2-3 engineering hours including the router heuristic and eval harness.
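
One way the router heuristic could look, sketched in TypeScript; the PromptShape fields are assumptions about what the agent already knows about a request before dispatch, and the thresholds mirror the recipe:

// Hedged sketch: prompt-shape classification decided before the API call.
interface PromptShape {
  expectedOutputTokens: number;
  outputFormatShort: boolean;      // yes/no or similarly terse output format
  singleLabelRequest: boolean;     // "label one of: feature/bug/chore"
  extractionPattern: boolean;      // "extract function name" style prompts
  hasFunctionCallSchema: boolean;
  isCodeEditOrMultiStep: boolean;
}

function pickModel(shape: PromptShape): string {
  // Keep the expensive model for code edits, multi-step reasoning, and tool schemas.
  if (shape.hasFunctionCallSchema || shape.isCodeEditOrMultiStep) return "gpt-4-turbo";

  const classificationLike =
    (shape.outputFormatShort || shape.singleLabelRequest || shape.extractionPattern) &&
    shape.expectedOutputTokens < 200;

  return classificationLike ? "gpt-4o-mini" : "gpt-4-turbo";
}
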
P0 retry_storm_deterministic silent_failure confidence=high $412/mo 12,847 hits

What we saw

12,847 retries in 30 days where the underlying error was deterministic, not transient. Most common pattern: tool-call response returns malformed JSON (e.g. trailing comma, unescaped newline inside a string), the client treats the parse-failure as a transient error and retries up to 5 times with exponential backoff. Every retry re-runs the full prompt and bills for the full completion. Re-issuing the same prompt to the same model on a deterministic-error response will produce the same deterministic error — the retries are pure waste, plus a 1.8s p99 user-facing latency tax.

Evidence

{"retries_total": 12847, "deterministic_share": 0.91, "top_error": "JSONDecodeError: Expecting value: line 1 column 1 (char 0)", "retries_per_prompt_p50": 3, "retries_per_prompt_p99": 5, "wasted_completions": 11691, "wasted_tokens": 38_492_000, "p99_user_latency_added_ms": 1820}

Fix recipe

Action: (1) Parse the tool-call response with a strict schema (pydantic / zod) BEFORE counting it as a transient failure. (2) On schema-parse failure, do not retry — emit a structured error to the caller and surface in the dashboard. (3) Add an alert when retry-rate exceeds 2% on any route for 15 consecutive minutes. (4) For the small slice of genuinely-transient failures (HTTP 429, 503, connection reset), keep exponential backoff with jitter capped at 3 retries. Expected savings: $412/mo cost plus latency p99 reduction of ~1.8s. Implementation: 1-2 engineering hours.
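
A sketch of the parse-before-retry flow with zod; ToolCallResult and runOnce are hypothetical stand-ins for the agent's real schema and transport, and the backoff numbers are illustrative:

import { z } from "zod";

// Hedged sketch: deterministic parse failures are surfaced and never retried,
// while 429/503-style transient errors keep capped exponential backoff with jitter.
const ToolCallResult = z.object({
  name: z.string(),
  arguments: z.record(z.unknown()),
});

const TRANSIENT_STATUS = new Set([429, 503]);
const MAX_RETRIES = 3;

async function callToolRoute(runOnce: () => Promise<{ status: number; body: string }>) {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const res = await runOnce();

    if (TRANSIENT_STATUS.has(res.status)) {
      // Genuinely transient: backoff and retry, capped at MAX_RETRIES.
      await new Promise(r => setTimeout(r, 2 ** attempt * 500 + Math.random() * 250));
      continue;
    }

    let json: unknown;
    try {
      json = JSON.parse(res.body);
    } catch (e) {
      // Malformed JSON from the model is deterministic: re-sending the same prompt
      // reproduces it. Emit a structured error instead of retrying.
      throw new Error(`tool-call returned malformed JSON: ${(e as Error).message}`);
    }

    const parsed = ToolCallResult.safeParse(json);
    if (!parsed.success) {
      throw new Error(`tool-call schema violation: ${parsed.error.message}`);
    }
    return parsed.data;
  }
  throw new Error("transient failures exhausted retry budget");
}
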
P1 cache_bypass_semantic_duplicate runaway_cost confidence=medium $287/mo 2,103 hits

What we saw

2,103 completions over 30 days have cosine-similarity > 0.92 to a prior completion within the same 24h window, but no caching layer is present in the request path. The duplicates are mostly user-driven re-asks ("explain this function" on the same function within hours), tool-loop retries from agent code, and a small number of test-harness re-runs. None hit a cache — every duplicate pays full token cost.

Evidence

{"window": "30d", "semantic_duplicates_at_0.92": 2103, "duplicates_at_0.95": 1471, "duplicates_at_0.98": 892, "avg_completion_cost_usd": 0.137, "recoverable_at_0.90_gate": 287.41, "embed_cost_per_lookup_usd": 0.000013}

Fix recipe

Action: Insert a semantic-cache layer (Helicone-style or self-hosted with Redis + bge-m3 embeddings). Gate at cosine-similarity ≥ 0.90, TTL 1h, scope by user_id to avoid cross-user leakage. Embed cost is ~$0.000013/lookup — net savings vs LLM-cost ratio is >10,000:1 at this volume. Confidence is medium not high because true savings depend on actual cache-hit rate in production traffic (lab cosine-similarity overestimates real-world hits by ~15-30% in typical deployments). Expected recoverable: $287/mo, possibly less. Implementation: 4-6 engineering hours including cache invalidation logic.
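
A rough sketch of the gate logic with an in-memory store standing in for the Redis + bge-m3 setup; embed() and callLLM() are assumed to exist in the agent's codebase, and the similarity gate, TTL, and user_id scoping follow the recipe:

// Hedged sketch: production would swap the Map for Redis vector lookups.
declare function embed(text: string): Promise<number[]>;      // assumption: embedding model
declare function callLLM(prompt: string): Promise<string>;    // assumption: existing call path

type CacheEntry = { vector: number[]; completion: string; expiresAt: number };
const cache = new Map<string, CacheEntry[]>();   // keyed by user_id to avoid cross-user leakage

const SIMILARITY_GATE = 0.90;
const TTL_MS = 60 * 60 * 1000;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function completeWithCache(userId: string, prompt: string): Promise<string> {
  const vector = await embed(prompt);
  const entries = (cache.get(userId) ?? []).filter(e => e.expiresAt > Date.now());
  const best = entries
    .map(e => ({ e, sim: cosine(vector, e.vector) }))
    .sort((x, y) => y.sim - x.sim)[0];
  if (best && best.sim >= SIMILARITY_GATE) return best.e.completion;   // cache hit, no LLM spend

  const completion = await callLLM(prompt);
  entries.push({ vector, completion, expiresAt: Date.now() + TTL_MS });
  cache.set(userId, entries);
  return completion;
}
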
P1 streaming_abort_unhonored silent_failure confidence=high $94/mo 2,318 hits

What we saw

4.7% of streamed completions (2,318 of 49,300 streaming calls) show client-side abort signals in the access log (user closed tab, navigated away, hit cancel button) but the server-side completion continues to the natural stop-token and bills for the full output. The OpenAI/Anthropic SDK call is not wired to honor the inbound AbortController — client abort just drops the response on the floor while the server keeps generating.

Evidence

{"streaming_calls_30d": 49300, "client_aborts_logged": 2318, "server_side_completion_continued": 2318, "avg_tokens_after_abort": 312, "monthly_waste_usd": 94.07, "p99_completion_time_after_abort_ms": 4200}

Fix recipe

Action: Wire the inbound AbortController signal from the client request through the framework-level proxy down to the OpenAI/Anthropic SDK call. Both SDKs accept an AbortSignal in their streaming request options — the SDK closes the upstream connection within ~50ms of the abort, so generation (and billing) stops there. Verify with a synthetic-abort test: open a stream, abort at ~50% of expected generation, and confirm the stream terminates early and the billed output tokens in the usage record are well under a typical completion length. Expected savings: $94/mo plus user-perceived snappier cancel. Implementation: 1 engineering hour.
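
A sketch of the wiring with the OpenAI Node SDK; the Express-style handler is illustrative, and the Anthropic SDK accepts an AbortSignal the same way via its request options:

import express from "express";
import OpenAI from "openai";

// Hedged sketch: the key line is passing the AbortSignal through as the SDK request
// option so a client cancel actually stops upstream generation.
const app = express();
app.use(express.json());
const openai = new OpenAI();

app.post("/complete", async (req, res) => {
  const controller = new AbortController();
  // Abort when the connection closes before we finished writing (tab closed, cancel hit).
  res.on("close", () => {
    if (!res.writableEnded) controller.abort();
  });

  try {
    const stream = await openai.chat.completions.create(
      { model: "gpt-4o", messages: [{ role: "user", content: req.body.prompt }], stream: true },
      { signal: controller.signal },                  // SDK aborts the upstream request
    );
    for await (const chunk of stream) {
      res.write(chunk.choices[0]?.delta?.content ?? "");
    }
    res.end();
  } catch (err) {
    // Swallow the abort; surface anything else as a server error.
    if (!controller.signal.aborted) {
      res.statusCode = 500;
      res.end();
    }
  }
});
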
P2 prompt_drift_unpinned_versions eval_drift confidence=high 11 response shapes

What we saw

The same canonical user query ("Summarize this PR") returned 11 distinct response formats over the 30-day window with no version pinning on either the prompt template or the model. Model alias gpt-4o auto-rolled twice during the period (2026-04-22, 2026-05-08 per OpenAI changelog). No git-snapshot of the prompt template was kept — drift is invisible until users complain. No direct $-impact, but downstream regression risk is medium: any consumer of the PR-summary feature (e.g. CI integration, Slack bot) has no contract guarantee about output shape.

Evidence

{"canonical_query": "Summarize this PR", "distinct_response_shapes_30d": 11, "model_alias_used": "gpt-4o", "underlying_snapshot_rolls": ["2024-08-06 → 2024-11-20", "2024-11-20 → 2025-03-15"], "prompt_template_versioned_in_git": false, "downstream_consumers": 4}

Fix recipe

Action: (1) Pin model to a dated snapshot (gpt-4o-2024-08-06) instead of the moving alias. Schedule a quarterly review to test the latest snapshot against the pinned one. (2) Snapshot every prompt template in git with a version tag; embed prompt_version in each API call's metadata. (3) Add a weekly eval that runs 20 canonical PR fixtures through the prompt and compares output shape against a golden reference. Alert on normalized Levenshtein distance > 0.3 from the golden. No direct $-impact but stops the slow erosion of downstream contracts. Implementation: 3-4 engineering hours.
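
A sketch of the drift check; the Levenshtein implementation is a generic dynamic-programming version, fixture loading and alerting plumbing are left out, and PINNED_MODEL / PROMPT_VERSION are illustrative constants the team would attach to its own call metadata and logs:

const PINNED_MODEL = "gpt-4o-2024-08-06";    // dated snapshot, not the moving alias
const PROMPT_VERSION = "pr-summary@v3";       // illustrative git-tagged template version

// Plain edit-distance DP; normalized by the longer string so 0.3 is a stable threshold.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array<number>(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return dp[a.length][b.length];
}

function shapeDrift(current: string, golden: string): number {
  return levenshtein(current, golden) / Math.max(current.length, golden.length, 1);
}

// Weekly eval: fail (and alert) if any of the 20 canonical PR fixtures drifts past 0.3.
function fixturesPass(outputs: { current: string; golden: string }[]): boolean {
  return outputs.every(o => shapeDrift(o.current, o.golden) <= 0.3);
}
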