5 LLM cost-leak patterns I found in anthropic-cookbook

Live $79 X-Ray analyzer output on a real public repo — 18 findings, $4,673/mo theoretical waste, before/after diffs you can paste into a PR.

Published 2026-05-16 ~8 min read By Milo Antaeus

What this is: I ran the deterministic regex-based analyzer that powers my $79 LLM Bill X-Ray service against github.com/anthropics/anthropic-cookbook. The repo is meant as an educational reference, so it intentionally shows non-prod patterns — but those patterns are the exact ones that bleed money in production codebases. View the full live report →

Almost every customer support bill triage I run ends the same way: 5 patterns account for 80% of the leak. Different model, different scale, different industry — same handful of root causes.

So instead of writing another generic "10 ways to optimize your LLM costs" post, I'm going to walk through the top 5 leak patterns my analyzer flagged when run against the literal anthropic-cookbook repo. These are findings from real code in a real public repo, with the actual file:line references and the actual fix diffs the analyzer would email you if you bought a $79 X-Ray.

The setup

I cloned the repo (--depth=1, ~360MB), ran my analyzer over the 99 Python files it contains, and let it produce a deliverable HTML report. The analyzer applies 9 deterministic patterns (no LLM-in-the-loop, so 0% hallucination rate and 100% reproducibility). Results:

Files scanned: 99
LLM call sites found: 26
Findings: 18
Total estimated monthly savings: $4,673 (theoretical — the cookbook isn't running in prod)

Of the 18 findings, 15 were the same root pattern. Let's start there.

Pattern 1: Anthropic prompt caching not enabled CRITICAL

Where in cookbook: skills/file_utils.py:26 (4 call sites), plus 14 other files with 1 call site each.

The pattern: a Python file calls client.messages.create(...) with a static system prompt and never wraps that prompt in a cache_control: {"type": "ephemeral"} block. Every call re-sends every token of the system prompt at full input rate ($3 per million for Sonnet). If your system prompt is 2,000 tokens and you make 100K calls/month, that's $600/month evaporating into cached-capable tokens.

The fix is 8 lines of code:

- system="You are a helpful assistant ...",
+ system=[
+     {
+         "type": "text",
+         "text": "You are a helpful assistant ...",
+         "cache_control": {"type": "ephemeral"},
+     },
+ ],

That single change cuts the input-token cost on the cached portion by 90%: cache-write is $3.75/M (one-time), cache-read is $0.30/M (every subsequent call within 5 minutes). On a chatbot with 3-turn sessions averaging 90 seconds apart, you hit the cache on turns 2 and 3 of every session.

Why this is the #1 leak of 2026: Anthropic released prompt caching in August 2024. Most codebases written before that — or by developers who haven't checked the changelog — still ship without it. Every RAG pipeline, every customer-support chatbot, every code-review assistant that re-sends its 2,000-token system prompt on every turn is leaving money on the table.

Pattern 2: Using GPT-4 for simple low-output tasks CRITICAL

The cookbook didn't trigger this one (it's an Anthropic repo, so they don't mix GPT-4 in), but it's the second-most-common pattern I find in production audits.

The pattern: a file has model="gpt-4-0125-preview" nearby a max_tokens=128 setting. That's a flagship model on a task that emits at most 128 tokens — almost certainly a classification, scoring, or extraction task that gpt-4o-mini handles equally well at 60× lower cost.

Real-world example from a customer audit:

# retriever.py:121 — re-ranks top-12 retrieved chunks down to top-4
def rerank(query, chunks):
    resp = openai.chat.completions.create(
-       model="gpt-4-0125-preview",  # $10/M input, $30/M output
+       model="gpt-4o-mini",         # $0.15/M input, $0.60/M output — 60x cheaper
        messages=[{"role": "user", "content": build_rerank_prompt(query, chunks)}],
        max_tokens=128,
        temperature=0,
    )
    return parse_scores(resp.choices[0].message.content)

Validation strategy: don't just flip the model. Mirror 5% of traffic to the new model for 7 days, compute rank-correlation between old and new outputs, only switch fully when correlation > 0.92 Spearman. The X-Ray report includes a validation script template.

Pattern 3: max_tokens set 3-10× larger than observed output HIGH

Where in cookbook: none triggered (cookbook examples use modest max_tokens). But this is the third-most-common pattern in production.

The pattern: max_tokens=4096 on a summary endpoint that emits ~280 tokens on average. Anthropic and OpenAI bill on tokens GENERATED, not allocated, so this doesn't directly cost more per-call. BUT — and this is the subtle waste — it dramatically extends p99 latency. The model "thinks longer" when given headroom on chain-of-thought-style tasks, and the longer it thinks, the more output it produces.

Capping at 2 × observed p99 (typically 512–1024) usually saves 50–80 tokens per call without affecting completion quality. On a 220K-call/month endpoint at Sonnet output pricing ($15/M), that's $200–$300/month.

The two-line fix:

- max_tokens=4096,
+ max_tokens=512,  # avg output 280 tokens, p99 410, from 30-day billing sample

The X-Ray report lists every max_tokens= setting in your repo with a flag if it's >3× the typical output for that pattern. Customers typically fix 5-10 of these in one pass.

Pattern 4: Synchronous API in offline/batch jobs HIGH

Where in cookbook: none triggered (cookbook doesn't ship long-running jobs).

The pattern: a file at a path like jobs/classify_tickets.py or cron/nightly_summarize.py uses synchronous openai.chat.completions.create() in a for-loop over batches of documents. Both OpenAI and Anthropic offer Batch APIs that give you 50% off the per-call rate in exchange for a 24-hour SLA. For a nightly cron job, that SLA is meaningless — you submit at 02:00 UTC, results land by 02:00 UTC tomorrow, ready for the next night's run.

Migration sketch:

# jobs/classify_tickets.py
- for ticket in tickets:
-     resp = openai.chat.completions.create(...)
-     write_classification(ticket.id, resp.choices[0].message.content)
+ # Build batch JSONL once
+ batch_file = build_batch_jsonl(tickets)
+ file_id = openai.files.create(file=open(batch_file, "rb"), purpose="batch").id
+ batch = openai.batches.create(input_file_id=file_id,
+                               endpoint="/v1/chat/completions",
+                               completion_window="24h")
+ # Tomorrow's run reads results from prior batch
+ poll_and_apply(prior_batch_id="...")

For 1,400 nightly classifications × 30 nights = 42,000 classifications/month at ~$0.02 each (current sync rate), that's $840/mo. With Batch: $420/mo. Saved: $420/month.

Pattern 5: Streaming not used on user-facing endpoints MEDIUM

Where in cookbook: several places — the cookbook leans heavily on non-streaming examples because streaming requires extra UI plumbing that's outside the educational scope. In production code, this is a real cost in retention rather than tokens.

The pattern: a file at path matching /api/, /routes/, /handlers/, /controllers/, or *_route.py, etc. has chat.completions.create() without stream=True. Non-streaming means the user waits for the full response before seeing any output. On user-facing chat, that's a 3-5× perceived-latency hit and a measurable drop in session continuation rate (we typically see 12-18% retention lift from adding streaming alone).

Two-character fix at the API layer:

# app/api/chat_route.py:34
resp = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
+   stream=True,
)
- return resp.choices[0].message.content
+ return StreamingResponse(stream_to_sse(resp), media_type="text/event-stream")

We frame this as $60/month in retention value rather than direct token savings (because the math depends heavily on your funnel), but in our experience this is the highest-ROI single change for any user-facing chat product that doesn't already stream.

Want this on YOUR repo?

$79 one-shot. Drop your GitHub URL. Get a personalized report like the one I ran on anthropic-cookbook — 5+ ranked findings, before/after diffs, 30-day re-audit voucher. 14-day money-back if savings < $79.

Order LLM Bill X-Ray — $79 →

Or view the full live report on anthropic-cookbook first: sample-llm-bill-xray-real-report.html

What's NOT in this post (but is in the audit)

The analyzer also detects 4 more patterns I didn't cover above because they didn't surface in the cookbook scan:

hardcoded_model_no_env — 3+ hardcoded model strings in one file without env-var indirection. Blocks A/B testing.
retry_storm_no_backoff — API call wrapped in try/except with no exponential backoff. On a transient outage, can hammer the provider for as long as the loop runs.
gemini_context_caching_missing — Gemini equivalent of pattern 1. 75% off on cached tokens.
embedding_redundancy — embeddings.create() inside a for-loop without batching. The API supports up to 2048 inputs per call.

That's the full 9-pattern v1 ruleset. v2 will add deeper inspection (tool-use loops, multi-turn context-pruning gaps, temperature-on-creative-tasks misuse).

FAQ

Why deterministic regex instead of "AI-powered code review"?

Because hallucination is a feature for chatbots and a bug for audits. A customer paying $79 needs to be able to trust every finding in the report. Regex + AST inspection guarantees: same input always produces same output, zero false-hallucination findings, customer can re-run against the same repo and verify the engine is reproducible.

What languages do you support?

v1: Python, TypeScript/JavaScript. v2 adds Go, Rust, Ruby. If your repo is in a language we don't support, refund.

Do you need access to my production environment?

No. Static code analysis only. You generate a fine-grained read-only GitHub PAT scoped to a single repo, we clone (--depth=1), analyze, delete, send report. No prod traffic, no API keys, no observability tooling.

How does this compare to CloudZero or Vantage?

CloudZero and Vantage do runtime cost observability against your cloud bill — they tell you where your spend is going right now. The X-Ray is upstream: it tells you which patterns in your code are causing the spend, with paste-into-PR fix code. They're complementary; ours is sub-$100 one-shot, theirs are $1K+/mo subscriptions.