Sample Deliverable · LLM Bill Triage

Sample LLM Bill Triage Report — Redacted Customer Audit

A redacted preview of the $299 deep-report deliverable. Customer name and exact dollar amounts have been generalized for confidentiality. The structure, sections, and analysis depth match every paid report.
SAMPLE REPORT
This is a redacted sample. Customer name and proprietary identifiers replaced with REDACTED markers. All dollar amounts, prompt IDs, and call counts shown here are anonymized placeholders generated from the real audit format. Your actual report will use your own data, processed in-memory and discarded after delivery.

1. Executive Summary

Customer: REDACTED · 60-person SaaS company
Audit period: 30 days · Apr 14 – May 14, 2026
Providers analyzed: Anthropic (Claude) + OpenAI (GPT) + Google (Gemini) usage exports
Current monthly LLM spend: $14,247 / month
Identified recoverable opportunity: $8,498 / month · 60% of current spend
Fix recipes ready to ship: 5 prioritized recipes · combined eng effort ~9 hours
Triage engine version: 32-rule library · v2026.05

The audit ran the 32-rule cost library against the 30-day usage CSV. Of the $14,247 in monthly spend, $11,150 traced to five concentrated cost drivers with concrete, ship-today fixes, worth $8,498/month in recoverable savings. Each driver below includes a fix recipe with an effort estimate and ROI multiplier.

2. Top 5 Cost Drivers

Ranked by monthly spend. Percentages are of total $14,247 monthly bill.

| Rank | Cost driver | Monthly spend | % of total | Fix |
|---|---|---|---|---|
| 1 | Claude Opus 4.1 used for JSON extraction: schema-extraction calls routed to Opus where Haiku would match quality | $4,820 | 33.8% | Route to Haiku — saves $4,113/mo (85% reduction) |
| 2 | Retry storm on tool-call errors: 3–10× replays per failed tool call with no backoff cap | $2,340 | 16.4% | Exponential backoff, capped at 2 attempts — saves $1,560/mo (67%) |
| 3 | Long system prompts re-sent every request: 4,200-token system prompt on every call, zero prompt-cache hits | $1,890 | 13.3% | Enable prompt caching — saves $1,134/mo (60%) |
| 4 | Embedding re-indexing every hour: full corpus re-embed 24×/day where nightly would suffice | $1,210 | 8.5% | Move to nightly re-index — saves $1,068/mo (88%) |
| 5 | GPT-4o for tasks Gemini Flash handles: classification + short-answer routes where Flash matches accuracy at lower cost | $890 | 6.2% | Route to Gemini 1.5 Flash — saves $623/mo (70%) |
| — | All other drivers: 27 remaining rule matches, individually under $300/mo each | $3,097 | 21.7% | Documented in the appendix for future review |
| TOTAL | | $14,247 | 100% | $8,498/mo recoverable identified |
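As a sanity check on the table arithmetic, each "% of total" cell is simply that driver's spend divided by the total bill. A minimal sketch (driver names shortened for illustration; the dollar figures are the table's):

```python
# Sanity check on the "% of total" column: each cell is driver spend / total bill.
drivers = {
    "opus-json-extraction": 4820,
    "retry-storm": 2340,
    "uncached-system-prompts": 1890,
    "hourly-reindex": 1210,
    "gpt4o-classification": 890,
    "long-tail (27 rules)": 3097,
}
total = sum(drivers.values())  # 14247, matching the stated monthly bill
shares = {name: round(100 * spend / total, 1) for name, spend in drivers.items()}
```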

3. Prompt-Bloat Heatmap

Top 10 prompts ranked by token-count × frequency. Red > 10M tokens/day, yellow 5–10M, green < 5M.

| Prompt ID | Tokens/day | Severity |
|---|---|---|
| prompt-a3f29e | 12.1M | RED |
| prompt-b81d44 | 10.2M | RED |
| prompt-c47a91 | 7.9M | YELLOW |
| prompt-d12fc6 | 7.0M | YELLOW |
| prompt-e905b2 | 5.9M | YELLOW |
| prompt-f3e187 | 4.5M | GREEN |
| prompt-g6c213 | 3.5M | GREEN |
| prompt-h8a504 | 2.7M | GREEN |
| prompt-i2d748 | 1.9M | GREEN |
| prompt-j7b390 | 1.4M | GREEN |

The two red prompts (prompt-a3f29e at 4,247 tokens × 2,847 calls/day and prompt-b81d44 at 3,640 tokens × 2,802 calls/day) account for 60% of all input-token spend. Both are prime candidates for prompt caching — see Fix #3.
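The severity buckets follow directly from the thresholds above, with tokens/day computed as prompt length × daily call volume. A minimal sketch of that classification, using the two red prompts' figures from this report:

```python
def severity(tokens_per_day: float) -> str:
    """Bucket a prompt by daily token volume: RED > 10M, YELLOW 5-10M, GREEN < 5M."""
    if tokens_per_day > 10_000_000:
        return "RED"
    if tokens_per_day >= 5_000_000:
        return "YELLOW"
    return "GREEN"

# tokens/day = system-prompt length x daily call volume (figures from the report)
a3f29e = 4_247 * 2_847   # ~12.1M tokens/day
b81d44 = 3_640 * 2_802   # ~10.2M tokens/day
```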

4. Model-Routing Recommendations

Where a cheaper model matches output quality. Each recommendation is backed by a 50-case hold-out evaluation included in the appendix.

| Current model | Use case | Suggested model | Reasoning |
|---|---|---|---|
| claude-opus-4-1 | JSON schema extraction | claude-haiku-4-5 | Schema-extraction quality matches at 1/15 cost. 50-case eval: 100% schema-valid output, 96% field-accuracy match. |
| gpt-4o | Customer-support intent classification | gemini-1.5-flash | 50-case eval shows identical accuracy at 1/8 cost. Latency is comparable. |
| gpt-4o | Short-answer Q&A | gemini-1.5-flash | Hold-out shows Flash matches GPT-4o on factual short-form responses with no quality regression. |
| claude-sonnet-4 | Long-document summarization | claude-sonnet-4 (keep) | Best quality-per-dollar for this workload. Downgrading to Haiku lost summary fidelity in eval. |
| claude-opus-4-1 | Complex multi-step reasoning | claude-opus-4-1 (keep) | Opus is the right tool for this lane. Downgrading regresses on reasoning depth. |
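One way to wire these recommendations is a static route table mapping each lane to a primary model plus a fallback. This is a hypothetical sketch, not the report's shipped code; the lane names and `pick_models` helper are illustrative:

```python
# Hypothetical route table: lane -> (primary model, fallback model or None).
ROUTES = {
    "json-extraction": ("claude-haiku-4-5", "claude-opus-4-1"),
    "intent-classification": ("gemini-1.5-flash", "gpt-4o"),
    "short-answer-qa": ("gemini-1.5-flash", "gpt-4o"),
    "long-doc-summarization": ("claude-sonnet-4", None),   # keep: best quality-per-dollar
    "multi-step-reasoning": ("claude-opus-4-1", None),     # keep: right tool for the lane
}

def pick_models(lane: str) -> tuple:
    """Return (primary, fallback) for a routing lane."""
    return ROUTES[lane]
```

Keeping the fallback in the table (rather than scattered through call sites) makes later routing changes a one-line config edit.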

5. Fix Recipes

Ordered by monthly savings. ROI is annualized savings ÷ engineering cost (see Methodology). Each recipe includes the concrete code or config change required.

FIX #1 Switch JSON extraction to Haiku
Effort: 1 hour · Savings: $4,113/mo · ROI: 412× · Priority: P0
The schema-extraction route currently calls Claude Opus 4.1 for every parsing task. The 50-case hold-out eval shows Claude Haiku 4.5 returns valid schemas 100% of the time with 96% field-accuracy match. Switch the route, keep an Opus fallback for the < 4% of records where Haiku confidence drops below threshold.
    // extractor.ts — line 47
    - model: "claude-opus-4-1"
    + model: "claude-haiku-4-5"
    + // Fallback to Opus on low-confidence parses
    + if (result.confidence < 0.85) {
    +   return await retry({ model: "claude-opus-4-1", ... });
    + }
FIX #2 Cap tool-call retries with exponential backoff
Effort: 2 hours · Savings: $1,560/mo · ROI: 78× · Priority: P0
Tool-call retries are firing 3–10 times per failure with no backoff. Replace the bare retry loop with exponential backoff capped at 2 attempts. This eliminates the retry-storm pattern that consumed an estimated 4.2M wasted output tokens in the audit period.
    # agent_runner.py — line 84
    - for _ in range(10):
    -     result = call_tool(...)
    -     if result.ok: break
    + from tenacity import retry, stop_after_attempt, wait_exponential
    +
    + @retry(stop=stop_after_attempt(2),
    +        wait=wait_exponential(multiplier=2, min=2, max=32))
    + def call_tool_with_backoff(...):
    +     return call_tool(...)
FIX #3 Enable prompt caching on top 3 prompts
Effort: 3 hours · Savings: $1,134/mo · ROI: 38× · Priority: P1
Three system prompts (prompt-a3f29e, prompt-b81d44, prompt-c47a91) are re-sent verbatim on every call with zero cache hits. Add a cache_control marker on the static system-prompt block to capture the 90% discount on cache reads. Cache TTL of 5 minutes covers the typical call cadence.
    // client.ts — system prompt construction
    messages: [{
      role: "system",
      content: [{
        type: "text",
        text: STATIC_SYSTEM_PROMPT,
    +   cache_control: { type: "ephemeral" }
      }]
    }, ...userMessages]
FIX #4 Move embedding re-index from hourly to nightly
Effort: 1 hour · Savings: $1,068/mo · ROI: 107× · Priority: P1
The corpus re-embeds 24 times per day. The corpus updates roughly daily, so 23 of those runs produce zero net change. Switch to a single nightly cron at 03:00 UTC. Add an event-triggered delta re-index for the rare same-day corpus update if needed.
    # crontab
    - 0 * * * * /opt/embed-pipeline/reindex.sh
    + 0 3 * * * /opt/embed-pipeline/reindex.sh
    +
    + # Optional: trigger delta re-embed on corpus webhook
    + # POST /webhook/corpus-updated → enqueue partial-reindex job
FIX #5 Route classification traffic to Gemini Flash
Effort: 2 hours · Savings: $623/mo · ROI: 31× · Priority: P2
Customer-support classification is currently a GPT-4o call. The 50-case hold-out eval shows Gemini 1.5 Flash matches the GPT-4o output on all 50 cases. Route the classifier through Flash and keep GPT-4o as a fallback for the rare case where the Gemini call errors.
    // classifier.ts — line 22
    - const model = "gpt-4o";
    + const model = process.env.CLASSIFIER_MODEL || "gemini-1.5-flash";
    + // GPT-4o remains the fallback on Gemini API errors
    + try { return await call(model, ...); }
    + catch (e) { return await call("gpt-4o", ...); }

6. Methodology

The audit ran the 32-rule cost library against your usage CSV in-memory. Each rule looked for a specific cost pattern: retry storms, prompt bloat, model mismatch, embedding waste, context-window inflation, reasoning-token overruns, customer-level outliers, cache-miss patterns, and 24 more. Findings were ranked by $/month impact and de-duplicated where two rules flagged the same underlying cause.
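Conceptually, each rule in the library is a predicate-plus-estimator over the parsed usage rows; findings are then ranked by monthly impact and de-duplicated by root cause. A simplified sketch of that loop (the `Finding` shape and field names are illustrative, not the engine's actual API):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str
    root_cause: str        # dedup key: two rules flagging the same cause collapse to one
    monthly_impact: float  # estimated $/month

def triage(rows: list, rules: list) -> list:
    """Run every rule over the usage rows, dedup by root cause, rank by $/month."""
    findings = [f for rule in rules for f in rule(rows)]
    best: dict = {}
    for f in findings:
        cur = best.get(f.root_cause)
        # Keep the highest-impact finding per root cause
        if cur is None or f.monthly_impact > cur.monthly_impact:
            best[f.root_cause] = f
    return sorted(best.values(), key=lambda f: f.monthly_impact, reverse=True)
```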

Fix recipes include ROI estimates assuming the engineering hours noted. ROI = (monthly savings × 12 months) ÷ (eng-hours × $120/hour blended rate). Quality match for routing recommendations is backed by a 50-case hold-out evaluation against a held-back slice of your actual production data. The full eval CSVs are included in the appendix.
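A minimal helper for the ROI formula. Note that the report's stated multipliers (78×, 107×, 38×, 31×) back out to a blended rate of roughly $120/hour, which is the default used here:

```python
def roi_multiplier(monthly_savings: float, eng_hours: float, rate: float = 120.0) -> float:
    """ROI = annualized savings / one-time engineering cost at a blended hourly rate."""
    return (monthly_savings * 12) / (eng_hours * rate)

fix2 = roi_multiplier(1560, 2)   # Fix #2: exactly 78.0
```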

Your usage data was processed in-memory, the report was generated, and the input was discarded immediately. No raw API keys were requested. The only retained records are your PayPal transaction ID and email address, kept solely for the refund window.

Want this run on YOUR usage data?

Upload your last 30 days of OpenAI, Anthropic, or Gemini usage CSV and get the same report format — with your real spend, your real prompts, your real fix recipes — within 24 hours.

Get the $299 Deep Report →
Money-back if total identified savings < $299
Sample data above is generalized from the real audit format. Your report will use your actual data, which is processed in-memory and discarded immediately after delivery.