Sample Deliverable · LLM Bill Triage

Sample LLM Bill Triage Report — Redacted Customer Audit

A redacted preview of the $299 deep-report deliverable. Customer name and exact dollar amounts have been generalized for confidentiality. The structure, sections, and analysis depth match every paid report.
SAMPLE REPORT
This is a redacted sample. Customer name and proprietary identifiers replaced with REDACTED markers. All dollar amounts, prompt IDs, and call counts shown here are anonymized placeholders generated from the real audit format. Your actual report will use your own data, processed in-memory and discarded after delivery.

1. Executive Summary

Customer: REDACTED · 60-person SaaS company
Audit period: 30 days · Apr 14 – May 14, 2026
Providers analyzed: Anthropic (Claude) + OpenAI (GPT) + Google (Gemini) usage exports
Current monthly LLM spend: $14,247 / month
Identified recoverable opportunity: $8,498 / month · 60% of current spend
Fix recipes ready to ship: 5 prioritized recipes · combined eng effort ~9 hours
Triage engine version: 32-rule library · v2026.05

The audit ran the 32-rule cost library against the 30-day usage CSV. Of the $14,247 in monthly spend, $11,150 traced to five concentrated cost drivers with concrete, ship-today fixes, worth $8,498/month in recoverable savings. Each driver below includes a fix recipe with an effort estimate and ROI multiplier.

2. Top 5 Cost Drivers

Ranked by monthly spend. Percentages are of total $14,247 monthly bill.

| Rank | Cost driver | Monthly spend | % of total | Fix |
|---|---|---|---|---|
| 1 | Claude Opus 4.1 used for JSON extraction: schema-extraction calls routed to Opus where Haiku would match quality | $4,820 | 33.8% | Route to Haiku — saves $4,113/mo (85% reduction) |
| 2 | Retry storm on tool-call errors: 3–10× replays per failed tool call with no backoff cap | $2,340 | 16.4% | Exponential backoff, capped at 2 attempts — saves $1,560/mo (67%) |
| 3 | Long system prompts re-sent every request: 4,200-token system prompt on every call, zero prompt-cache hits | $1,890 | 13.3% | Enable prompt caching — saves $1,134/mo (60%) |
| 4 | Embedding re-indexing every hour: full corpus re-embed 24×/day where nightly would suffice | $1,210 | 8.5% | Move to nightly re-index — saves $1,068/mo (88%) |
| 5 | GPT-4o for tasks Gemini Flash handles: classification + short-answer routes where Flash matches accuracy at lower cost | $890 | 6.2% | Route to Gemini 1.5 Flash — saves $623/mo (70%) |
| — | All other drivers: 27 remaining rule matches, individually under $300/mo each | $3,097 | 21.7% | Documented in the appendix for future review |
| TOTAL | | $14,247 | 100% | $8,498/mo recoverable identified |
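As a sanity check on the table arithmetic, each "% of total" cell is simply that driver's spend divided by the total bill. A minimal sketch (driver names shortened for illustration; the dollar figures are the table's):

```python
# Sanity check on the "% of total" column: each cell is driver spend / total bill.
drivers = {
    "opus-json-extraction": 4820,
    "retry-storm": 2340,
    "uncached-system-prompts": 1890,
    "hourly-reindex": 1210,
    "gpt4o-classification": 890,
    "long-tail (27 rules)": 3097,
}
total = sum(drivers.values())  # 14247, matching the stated monthly bill
shares = {name: round(100 * spend / total, 1) for name, spend in drivers.items()}
```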

3. Prompt-Bloat Heatmap

Top 10 prompts ranked by token-count × frequency. Red > 10M tokens/day, yellow 5–10M, green < 5M.

| Prompt ID | Tokens/day | Severity |
|---|---|---|
| prompt-a3f29e | 12.1M | RED |
| prompt-b81d44 | 10.2M | RED |
| prompt-c47a91 | 7.9M | YELLOW |
| prompt-d12fc6 | 7.0M | YELLOW |
| prompt-e905b2 | 5.9M | YELLOW |
| prompt-f3e187 | 4.5M | GREEN |
| prompt-g6c213 | 3.5M | GREEN |
| prompt-h8a504 | 2.7M | GREEN |
| prompt-i2d748 | 1.9M | GREEN |
| prompt-j7b390 | 1.4M | GREEN |

The two red prompts (prompt-a3f29e at 4,247 tokens × 2,847 calls/day and prompt-b81d44 at 3,640 tokens × 2,802 calls/day) account for 60% of all input-token spend. Both are prime candidates for prompt caching — see Fix #3.
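The severity buckets follow directly from the thresholds above, with tokens/day computed as prompt length × daily call volume. A minimal sketch of that classification, using the two red prompts' figures from this report:

```python
def severity(tokens_per_day: float) -> str:
    """Bucket a prompt by daily token volume: RED > 10M, YELLOW 5-10M, GREEN < 5M."""
    if tokens_per_day > 10_000_000:
        return "RED"
    if tokens_per_day >= 5_000_000:
        return "YELLOW"
    return "GREEN"

# tokens/day = system-prompt length x daily call volume (figures from the report)
a3f29e = 4_247 * 2_847   # ~12.1M tokens/day
b81d44 = 3_640 * 2_802   # ~10.2M tokens/day
```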

4. Model-Routing Recommendations

Where a cheaper model matches output quality. Each recommendation is backed by a 50-case hold-out evaluation included in the appendix.

| Current model | Use case | Suggested model | Reasoning |
|---|---|---|---|
| claude-opus-4-1 | JSON schema extraction | claude-haiku-4-5 | Schema-extraction quality matches at 1/15 cost. 50-case eval: 100% schema-valid output, 96% field-accuracy match. |
| gpt-4o | Customer-support intent classification | gemini-1.5-flash | 50-case eval shows identical accuracy at 1/8 cost. Latency is comparable. |
| gpt-4o | Short-answer Q&A | gemini-1.5-flash | Hold-out shows Flash matches GPT-4o on factual short-form responses with no quality regression. |
| claude-sonnet-4 | Long-document summarization | claude-sonnet-4 (keep) | Best quality-per-dollar for this workload. Downgrading to Haiku lost summary fidelity in eval. |
| claude-opus-4-1 | Complex multi-step reasoning | claude-opus-4-1 (keep) | Opus is the right tool for this lane. Downgrading regresses on reasoning depth. |
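One way to wire these recommendations is a static route table mapping each lane to a primary model plus a fallback. This is a hypothetical sketch, not the report's shipped code; the lane names and `pick_models` helper are illustrative:

```python
# Hypothetical route table: lane -> (primary model, fallback model or None).
ROUTES = {
    "json-extraction": ("claude-haiku-4-5", "claude-opus-4-1"),
    "intent-classification": ("gemini-1.5-flash", "gpt-4o"),
    "short-answer-qa": ("gemini-1.5-flash", "gpt-4o"),
    "long-doc-summarization": ("claude-sonnet-4", None),   # keep: best quality-per-dollar
    "multi-step-reasoning": ("claude-opus-4-1", None),     # keep: right tool for the lane
}

def pick_models(lane: str) -> tuple:
    """Return (primary, fallback) for a routing lane."""
    return ROUTES[lane]
```

Keeping the fallback in the table (rather than scattered through call sites) makes later routing changes a one-line config edit.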

5. Fix Recipes

Ordered by monthly savings. ROI is annualized savings ÷ engineering cost (see Methodology). Each recipe includes the concrete code or config change required.

FIX #1 Switch JSON extraction to Haiku
Effort: 1 hour · Savings: $4,113/mo · ROI: 412× · Priority: P0
The schema-extraction route currently calls Claude Opus 4.1 for every parsing task. The 50-case hold-out eval shows Claude Haiku 4.5 returns valid schemas 100% of the time with 96% field-accuracy match. Switch the route, keep an Opus fallback for the < 4% of records where Haiku confidence drops below threshold.
    // extractor.ts — line 47
    - model: "claude-opus-4-1"
    + model: "claude-haiku-4-5"
    + // Fallback to Opus on low-confidence parses
    + if (result.confidence < 0.85) {
    +   return await retry({ model: "claude-opus-4-1", ... });
    + }
FIX #2 Cap tool-call retries with exponential backoff
Effort: 2 hours · Savings: $1,560/mo · ROI: 78× · Priority: P0
Tool-call retries are firing 3–10 times per failure with no backoff. Replace the bare retry loop with exponential backoff capped at 2 attempts. This eliminates the retry-storm pattern that consumed an estimated 4.2M wasted output tokens in the audit period.
    # agent_runner.py — line 84
    - for _ in range(10):
    -     result = call_tool(...)
    -     if result.ok: break
    + from tenacity import retry, stop_after_attempt, wait_exponential
    +
    + @retry(stop=stop_after_attempt(2),
    +        wait=wait_exponential(multiplier=2, min=2, max=32))
    + def call_tool_with_backoff(...):
    +     return call_tool(...)
FIX #3 Enable prompt caching on top 3 prompts
Effort: 3 hours · Savings: $1,134/mo · ROI: 38× · Priority: P1
Three system prompts (prompt-a3f29e, prompt-b81d44, prompt-c47a91) are re-sent verbatim on every call with zero cache hits. Add a cache_control marker on the static system-prompt block to capture the 90% discount on cache reads. Cache TTL of 5 minutes covers the typical call cadence.
    // client.ts — system prompt construction
    messages: [{
      role: "system",
      content: [{
        type: "text",
        text: STATIC_SYSTEM_PROMPT,
    +   cache_control: { type: "ephemeral" }
      }]
    }, ...userMessages]
FIX #4 Move embedding re-index from hourly to nightly
Effort: 1 hour · Savings: $1,068/mo · ROI: 107× · Priority: P1
The corpus re-embeds 24 times per day. The corpus updates roughly daily, so 23 of those runs produce zero net change. Switch to a single nightly cron at 03:00 UTC. Add an event-triggered delta re-index for the rare same-day corpus update if needed.
    # crontab
    - 0 * * * * /opt/embed-pipeline/reindex.sh
    + 0 3 * * * /opt/embed-pipeline/reindex.sh
    +
    + # Optional: trigger delta re-embed on corpus webhook
    + # POST /webhook/corpus-updated → enqueue partial-reindex job
FIX #5 Route classification traffic to Gemini Flash
Effort: 2 hours · Savings: $623/mo · ROI: 31× · Priority: P2
Customer-support classification is currently a GPT-4o call. The 50-case hold-out eval shows Gemini 1.5 Flash matches the GPT-4o output on all 50 cases. Route the classifier through Flash and keep GPT-4o as a fallback for the rare case where the Gemini call errors.
    // classifier.ts — line 22
    - const model = "gpt-4o";
    + const model = process.env.CLASSIFIER_MODEL || "gemini-1.5-flash";
    + // GPT-4o remains the fallback on Gemini API errors
    + try { return await call(model, ...); }
    + catch (e) { return await call("gpt-4o", ...); }

6. Methodology

The audit ran the 32-rule cost library against your usage CSV in-memory. Each rule looked for a specific cost pattern: retry storms, prompt bloat, model mismatch, embedding waste, context-window inflation, reasoning-token overruns, customer-level outliers, cache-miss patterns, and 24 more. Findings were ranked by $/month impact and de-duplicated where two rules flagged the same underlying cause.
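Conceptually, each rule in the library is a predicate-plus-estimator over the parsed usage rows; findings are then ranked by monthly impact and de-duplicated by root cause. A simplified sketch of that loop (the `Finding` shape and field names are illustrative, not the engine's actual API):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str
    root_cause: str        # dedup key: two rules flagging the same cause collapse to one
    monthly_impact: float  # estimated $/month

def triage(rows: list, rules: list) -> list:
    """Run every rule over the usage rows, dedup by root cause, rank by $/month."""
    findings = [f for rule in rules for f in rule(rows)]
    best: dict = {}
    for f in findings:
        cur = best.get(f.root_cause)
        # Keep the highest-impact finding per root cause
        if cur is None or f.monthly_impact > cur.monthly_impact:
            best[f.root_cause] = f
    return sorted(best.values(), key=lambda f: f.monthly_impact, reverse=True)
```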

Fix recipes include ROI estimates assuming the engineering hours noted. ROI = (monthly savings × 12 months) ÷ (eng-hours × $120/hour blended rate). Quality match for routing recommendations is backed by a 50-case hold-out evaluation against a held-back slice of your actual production data. The full eval CSVs are included in the appendix.
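A minimal helper for the ROI formula. Note that the report's stated multipliers (78×, 107×, 38×, 31×) back out to a blended rate of roughly $120/hour, which is the default used here:

```python
def roi_multiplier(monthly_savings: float, eng_hours: float, rate: float = 120.0) -> float:
    """ROI = annualized savings / one-time engineering cost at a blended hourly rate."""
    return (monthly_savings * 12) / (eng_hours * rate)

fix2 = roi_multiplier(1560, 2)   # Fix #2: exactly 78.0
```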

Your usage data was processed in-memory, the report was generated, and the input was discarded immediately. No raw API keys were requested. The only retained records are your PayPal transaction ID and email address, kept solely for the refund window.

Want this run on YOUR usage data?

Upload your last 30 days of OpenAI, Anthropic, or Gemini usage CSV and get the same report format — with your real spend, your real prompts, your real fix recipes — within 24 hours.

Get the $299 Deep Report →
Money-back if total identified savings < $299
Sample data above is generalized from the real audit format. Your report will use your actual data, which is processed in-memory and discarded immediately after delivery.