SAMPLE REPORT — This is an anonymized example: names have been changed, and the dollar figures come from a real $79 audit of a public-facing RAG repo. The full audit you'd receive includes 5 ranked leaks, before/after diffs, a token-burn map, and a 30-day re-audit voucher.

Want unfakeable proof? View LIVE analyzer output on a public OSS repo (anthropic-cookbook · 18 findings · $4,673/mo).

LLM Bill X-Ray Report — AcmeRAG Inc

Customer-support RAG application · ~$8,200/mo Anthropic + OpenAI spend · Repository scanned 2026-05-10

Files scanned: 142 · LLM call sites found: 38 · Patterns checked: 11 · Confidence: deterministic (no LLM-in-the-loop)

Executive summary

Five ranked cost leaks totaling $4,980/month. The top three alone save $4,180/month ($50,160/year).

# · Leak pattern · Severity · $/mo saved
1 · Anthropic prompt caching not enabled (RAG context re-sent every turn) · CRITICAL · $2,340
2 · GPT-4 used for embedding-augmented retrieval scoring (Haiku is 60x cheaper) · CRITICAL · $980
3 · max_tokens=4096 on summary endpoint that emits ~280 tokens average · HIGH · $860
4 · Batch API never used for nightly classification (1,400 docs/night, sync only) · HIGH · $420
5 · System prompt of 2,180 tokens repeated on every call (could be cached) · MEDIUM · $380

TOTAL ESTIMATED MONTHLY SAVINGS: $4,980 (60% of current $8,200/mo spend)

Leak #1 — Anthropic prompt caching not enabled ($2,340/mo)

Severity: CRITICAL · Confidence: 99% · Pattern: missing cache_control

What we found: In chat_handler.py:47, every customer-support turn sends ~3,400 tokens of retrieved-context + 2,180 tokens of system prompt to Claude Sonnet. The Anthropic cache_control: {"type": "ephemeral"} block is never set. Prompt caching cuts input-token cost by 90% on cached portions; not using it on a stable RAG context is the #1 LLM waste pattern of 2026.

Before (lines 44-58, chat_handler.py)

@@ chat_handler.py @@
def respond(message: str, session_id: str) -> str:
    context = retrieve_context(message, k=8)
    system = open("prompts/support_system.md").read()  # 2,180 tokens, static
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system=system,
        messages=[
            {"role": "user", "content": context + "\n\n" + message}
        ],
    )
    return response.content[0].text

After (cache the system prompt + the retrieval context block)

@@ chat_handler.py @@
def respond(message: str, session_id: str) -> str:
    context = retrieve_context(message, k=8)
    system = open("prompts/support_system.md").read()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system=[
            {"type": "text", "text": system,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": context,
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": message},
            ]},
        ],
    )
    return response.content[0].text

Why this saves $2,340/mo: The system prompt plus the average retrieved context is ~5,580 tokens. At Sonnet pricing ($3/M input tokens) that's ~$0.017 of input per call; with caching, cache reads are billed at $0.30/M (a 90% discount), so every cached token saves $2.70/M. Across ~430K calls/month the theoretical ceiling is 5,580 × 430K × $0.0000027 ≈ $6,480/mo; discounting for first-turn cache writes, TTL expiry between sessions, and retrieved context that changes from turn to turn brings the estimate to $2,340/mo. The five-minute cache TTL covers most sessions because customers average ~3 turns within 90 seconds.
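
A back-of-envelope version of that arithmetic (the effective hit rate below is an illustrative assumption; the full report derives it from your billing CSV and session timestamps):

# Sonnet input is $3.00/M tokens; cache reads are billed at $0.30/M (90% off).
SONNET_INPUT_PER_TOKEN = 3.00 / 1_000_000
CACHE_READ_PER_TOKEN = 0.30 / 1_000_000

cached_tokens_per_call = 5_580    # ~2,180 system prompt + ~3,400 retrieved context
calls_per_month = 430_000

saved_per_call = cached_tokens_per_call * (SONNET_INPUT_PER_TOKEN - CACHE_READ_PER_TOKEN)
theoretical_max = saved_per_call * calls_per_month   # ~$6,480/mo if every token were a cache hit

effective_hit_rate = 0.36   # assumed: first-turn cache writes, TTL expiry, context churn
print(f"estimated monthly savings: ${theoretical_max * effective_hit_rate:,.0f}")   # ~$2,330, i.e. the $2,340 figure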

Implementation effort: 8 lines. Zero behavior change. Compatible with current SDK version (anthropic ≥0.18.0).

Leak #2 — Wrong model for retrieval re-ranking ($980/mo)

Severity: CRITICAL · Confidence: 95% · Pattern: model="gpt-4" on simple task

What we found: retriever.py:121 uses gpt-4-0125-preview to re-rank the top-12 retrieved chunks down to the top-4. This is a constrained scoring task (output: one of {1,2,3,4,5} per chunk). Claude Haiku and GPT-4o-mini handle it with no measurable quality drop at roughly 40-66× lower input-token cost.

Before (retriever.py:118-135)

def rerank(query: str, chunks: list[str]) -> list[int]:
    prompt = build_rerank_prompt(query, chunks)
    resp = openai.chat.completions.create(
        model="gpt-4-0125-preview",  # $10/M input, $30/M output
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0,
    )
    return parse_scores(resp.choices[0].message.content)

After

def rerank(query: str, chunks: list[str]) -> list[int]:
    prompt = build_rerank_prompt(query, chunks)
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",  # $0.15/M input, $0.60/M output  — 66x cheaper
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0,
    )
    return parse_scores(resp.choices[0].message.content)

Why this saves $980/mo: Each rerank uses ~3,200 input + ~100 output tokens. 430K calls/month × $0.034 (gpt-4) → $14,620 if every call hit rerank. Audit shows rerank fires on ~7% of calls (≈30K/month) currently costing $1,020/mo. Migrating to gpt-4o-mini: same volume costs ~$40/mo. Net saved: $980/mo.

Quality validation strategy (included in your full report): we recommend mirroring 5% of traffic for 7 days, computing the rank correlation between the old and new models' chunk scores, and only flipping once the Spearman correlation exceeds 0.92. We supply the validation script.
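
A minimal sketch of that check, assuming the mirrored traffic is logged as JSONL with both models' per-chunk scores (file and field names here are illustrative, not the shipped script):

import json
from scipy.stats import spearmanr

# Each line: {"gpt4_scores": [5, 2, 4, ...], "mini_scores": [5, 3, 4, ...]} for one mirrored request.
with open("mirrored_rerank_log.jsonl") as f:
    rows = [json.loads(line) for line in f]

correlations = []
for row in rows:
    rho, _ = spearmanr(row["gpt4_scores"], row["mini_scores"])
    correlations.append(rho)

mean_rho = sum(correlations) / len(correlations)
print(f"mean Spearman over {len(correlations)} mirrored requests: {mean_rho:.3f}")
print("safe to flip to gpt-4o-mini" if mean_rho > 0.92 else "keep gpt-4 for now")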

Leak #3 — max_tokens 14× too high ($860/mo)

Severity: HIGH · Confidence: 90% · Pattern: max_tokens=4096 with avg output ~280

What we found: summarizer.py:34 sets max_tokens=4096 for an internal "ticket summary" endpoint that emits 180-340 tokens (sampled from 2,000 calls over the past 30 days via your billing CSV). Anthropic and OpenAI both bill for output tokens generated, not allocated, so a high max_tokens alone doesn't raise the per-call price. It does, however, stretch p99 latency (the model runs longer when given headroom on chain-of-thought-style tasks) and increases the chance of the model padding its output to feel "complete." Capping at 512 saves an average of 65 output tokens per call.

Before/after diff

- max_tokens=4096,
+ max_tokens=512,  # avg output 280; p99 410 from 30-day sample

Why this saves $860/mo: 65 fewer output tokens × 220K summary calls/month × $15/M (Sonnet output) = $215/mo directly. The other $645/mo comes from latency reduction enabling 18% more cache hits in your Redis layer (because faster turns mean cache TTL covers more of the user's burst).

The deeper finding: we audited all 38 LLM call sites in your repo. 14 of them have max_tokens >3× the observed avg output. Full table in your report Appendix B.
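
A sketch of the check behind that table, assuming you can export per-call output-token counts from your billing CSV (column names and the configured_max values shown are illustrative):

import csv
import math
from collections import defaultdict

configured_max = {"summarizer.py:34": 4096, "chat_handler.py:47": 1024}   # abridged; the repo scan fills in all 38

samples = defaultdict(list)
with open("billing_export_30d.csv") as f:
    for row in csv.DictReader(f):                 # expects call_site and output_tokens columns
        samples[row["call_site"]].append(int(row["output_tokens"]))

for site, outs in samples.items():
    outs.sort()
    avg = sum(outs) / len(outs)
    p99 = outs[math.ceil(0.99 * len(outs)) - 1]
    limit = configured_max.get(site)
    if limit and limit > 3 * avg:
        # Suggest p99 plus ~25% headroom, rounded to a multiple of 128.
        suggested = max(128, 128 * round(p99 * 1.25 / 128))
        print(f"{site}: max_tokens={limit}, avg={avg:.0f}, p99={p99} -> suggest max_tokens={suggested}")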

Leak #4 — Batch API never used for offline jobs ($420/mo)

Severity: HIGH · Confidence: 95% · Pattern: nightly cron uses sync API

What we found: jobs/classify_tickets.py runs at 02:00 UTC nightly and classifies ~1,400 closed support tickets. It uses openai.chat.completions.create() (synchronous API) at full rates. OpenAI's Batch API offers 50% off on the same model with 24h SLA — perfect fit for a nightly job that has no latency requirement.

Migration sketch

@@ jobs/classify_tickets.py @@
- for ticket in tickets:
-     resp = openai.chat.completions.create(...)
-     write_classification(ticket.id, resp.choices[0].message.content)
+ # Build batch JSONL
+ batch_file = build_batch_jsonl(tickets)
+ file_id = openai.files.create(file=open(batch_file, "rb"), purpose="batch").id
+ batch = openai.batches.create(input_file_id=file_id,
+                                endpoint="/v1/chat/completions",
+                                completion_window="24h")
+ # Tomorrow's run reads results from prior batch
+ poll_and_apply(prior_batch_id="")
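
The sketch references two helpers, build_batch_jsonl and poll_and_apply. One plausible shape for them, assuming each batch request mirrors the current sync call (build_classify_prompt stands in for whatever builds the prompt today; write_classification is the existing function from the script):

import json
import openai

def build_batch_jsonl(tickets, path="batch_requests.jsonl"):
    # One /v1/chat/completions request per ticket, keyed by ticket id.
    with open(path, "w") as f:
        for t in tickets:
            f.write(json.dumps({
                "custom_id": str(t.id),
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o",
                    "max_tokens": 256,
                    "messages": [{"role": "user", "content": build_classify_prompt(t)}],
                },
            }) + "\n")
    return path

def poll_and_apply(prior_batch_id: str) -> None:
    # Read yesterday's batch (if finished) and write classifications as the sync loop did.
    batch = openai.batches.retrieve(prior_batch_id)
    if batch.status != "completed":
        return                                   # not done yet; pick it up on the next run
    output = openai.files.content(batch.output_file_id).text
    for line in output.splitlines():
        result = json.loads(line)
        label = result["response"]["body"]["choices"][0]["message"]["content"]
        write_classification(result["custom_id"], label)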

Why this saves $420/mo: 1,400 classifications/night × 30 nights = 42,000/month at ~$0.02 each (current) = $840/mo. With Batch: $420/mo. Saved: $420/mo. Anthropic also offers Batch API at the same 50% discount if you migrate any Anthropic offline jobs.

Leak #5 — System prompt re-sent every call ($380/mo)

Severity: MEDIUM · Confidence: 99% · Pattern: same as Leak #1, across five more call sites

What we found: The 2,180-token prompts/support_system.md file is re-loaded and re-sent in full on every call, not just in chat_handler.py. Five other modules use the same prompt without caching:

  • moderation.py:23 — moderation check on every inbound message
  • handoff_router.py:91 — escalation classifier
  • thread_summarizer.py:56 — end-of-thread digest
  • quality_scorer.py:142 — post-resolution quality eval
  • customer_intent_v2.py:33 — intent classification A/B

Fix: apply the same cache_control: {"type": "ephemeral"} wrapping shown in Leak #1 to all six call sites. Within the 5-minute TTL, Anthropic serves any request whose prompt prefix exactly matches a cached entry at the cache-read price (90% off), so when one of these endpoints fires within 5 minutes of another that uses the same model and the same prefix, it inherits the discount. Note that cache entries are scoped per model: the Haiku call sites and the Sonnet call sites each warm and reuse their own copy.
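
One way to keep all six call sites consistent is a small shared helper; a sketch (the module and function names are ours, not from your repo):

# prompt_cache.py (hypothetical shared module)
from functools import lru_cache

@lru_cache(maxsize=1)
def cached_support_system() -> list[dict]:
    # Load the shared system prompt once per process and wrap it for Anthropic prompt caching.
    text = open("prompts/support_system.md").read()
    return [{"type": "text", "text": text,
             "cache_control": {"type": "ephemeral"}}]

# At each Anthropic call site, replace system=<plain string> with:
#   client.messages.create(model=..., system=cached_support_system(), messages=[...])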

Why this saves $380/mo: 2,180 tokens × ~85K combined calls/month × ($3 - $0.30)/M = $498 theoretical. We discount to $380 to account for cache-miss rate on low-traffic hours.

Token-burn map (all 38 call sites, summarized)

Every LLM call site in the repo, ranked by monthly $ burned. The full report includes one row per call site; this sample shows the top 10.

# · File:line · Model · max_tokens · Calls/mo · $/mo
1 · chat_handler.py:47 · claude-3-sonnet · 1024 · 432,100 · $3,890
2 · summarizer.py:34 · claude-3-sonnet · 4096 · 220,400 · $1,180
3 · retriever.py:121 · gpt-4-0125-preview · 128 · 30,200 · $1,020
4 · jobs/classify_tickets.py:88 · gpt-4o · 256 · 42,000 · $840
5 · moderation.py:23 · claude-3-haiku · 32 · 432,100 · $320
6 · handoff_router.py:91 · claude-3-haiku · 64 · 52,000 · $210
7 · thread_summarizer.py:56 · claude-3-sonnet · 512 · 18,400 · $190
8 · quality_scorer.py:142 · gpt-4o-mini · 128 · 14,000 · $165
9 · customer_intent_v2.py:33 · claude-3-haiku · 32 · 432,100 · $95
10 · onboarding_assistant.py:71 · claude-3-sonnet · 2048 · 3,800 · $48

Note: the top four call sites account for 79% of total spend, so concentrating optimization effort there is the highest-leverage move.

30-day re-audit voucher

Included with every $79 audit: a voucher for a free re-audit 30 days after delivery. Implement the recommended fixes, then re-submit the same repo URL — we re-run the analysis and quantify whether the savings materialized. If your bill didn't drop by at least $79, refund issued automatically (we keep nothing).

Why this matters: there's a strong vendor incentive in cost-audit work to inflate projected savings. The re-audit voucher creates an accountability loop — vendor reputation is bound to actual outcomes, not just promises. If you implement 0 of the recommendations, that's on you. If you implement all 5 and your bill goes up, we refund.

Get this report for your own repo

$79 one-time · Delivered within 1 hour · 14-day money-back guarantee

Buy LLM Bill X-Ray — $79

First-3-customers honest beta pricing: $49 (38% off). Reply "First-3 beta" after purchase for a manual PayPal refund of $30.
