Static-analysis cost audit · https://github.com/anthropics/anthropic-cookbook · Generated 2026-05-16 21:09 UTC
18 ranked cost leaks across 99 files. Implementing the top 3 could save approximately $4,673/month, or about $56,075/year. The table below lists the top 5; all 18 findings follow.
| # | Leak | Severity | Est. $/mo saved |
|---|---|---|---|
| 1 | Anthropic prompt caching not enabled (4 call sites) | CRITICAL | $1,200 |
| 2 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 3 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 4 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 5 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
Where: skills/file_utils.py:26
What we found: Found 4 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Args:
response: The response object from client.beta.messages.create()
Returns:
List of file IDs found in the response
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
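For reference, a fuller self-contained sketch of the same pattern (sys_prompt, retrieval_context, and user_msg are placeholder names, not identifiers from file_utils.py). Two caveats: cache writes are billed at a premium over base input (roughly 1.25x for the 5-minute TTL), and a prefix is only cached once it clears the model's minimum cacheable length (1,024 tokens on Sonnet), so savings start on the second call inside the TTL.
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def cached_call(sys_prompt: str, retrieval_context: str, user_msg: str):
    # Static system prompt and stable retrieval context are marked cacheable;
    # only the per-request question is billed at the full input rate after the first call.
    return client.messages.create(
        model="claude-sonnet-4-6",  # keep whatever model the call site already uses
        max_tokens=1024,
        system=[{"type": "text", "text": sys_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": [
            {"type": "text", "text": retrieval_context,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": user_msg},
        ]}],
    )
On the second and later calls, response.usage.cache_read_input_tokens should be non-zero; if it stays at 0, the cached prefix is either changing between calls or below the minimum length.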
Where: tool_use/utils/visualize.py:330
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Usage:
viz = visualize(auto_show=True)
response = client.messages.create(...)
viz.capture(response)
"""
def __init__(self, auto_show: bool = True):
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/demo_helpers.py:76
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
request_params["context_management"] = context_management
response = client.beta.messages.create(**request_params)
assistant_content = []
tool_results = []
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/code_review_demo.py:126
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
while True:
print(f" 🔄 Turn {turn}: Calling Claude API...", end="", flush=True)
response = self.client.beta.messages.create(
model=MODEL,
max_tokens=4096,
system=self._create_system_prompt(),
messages=self.messages,
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
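Because this call sits inside a multi-turn agent loop, the part that repeats every turn is the system prompt (and, as the conversation grows, the earlier turns). A minimal sketch assuming the existing system prompt is a plain string; the helper name and signature are illustrative, not from code_review_demo.py.
def agent_turn(client, model, system_prompt: str, messages: list):
    # Wrapping the string system prompt in a block lets it carry cache_control,
    # so every turn after the first reads it from cache instead of paying full input rate.
    return client.beta.messages.create(
        model=model,
        max_tokens=1024,  # sized down per the max_tokens finding below
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=messages,
    )
If the conversation messages are also stored as content blocks, adding cache_control to the last block of the most recent user message caches the whole conversation prefix across turns as well.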
Where: capabilities/summarization/evaluation/custom_evals/llm_eval.py:57
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Evaluation (JSON format):"""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1000,
temperature=0,
messages=[{"role": "user", "content": prompt}, {"role": "assistant", "content": "<json>"}],
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/provider_retrieval.py:71
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/prompts.py:113
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
"""
try:
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=50,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/eval_end_to_end.py:37
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1500,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
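In these retrieval-evaluation scripts the prompt is assembled as a single string, so nothing is cacheable until the stable part is split out. A sketch under the assumption that the prompt template has a long fixed preamble (instructions, rubric) followed by per-question content; EVAL_INSTRUCTIONS and the function name are illustrative, and caching only pays off if the fixed preamble clears the 1,024-token minimum.
EVAL_INSTRUCTIONS = "..."  # the fixed preamble shared by every eval question

def eval_call(client, retrieved_passages: str, question: str):
    # Fixed preamble is cached; the retrieved passages and question change per call,
    # so they stay in the uncached tail of the prompt.
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": EVAL_INSTRUCTIONS,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": retrieved_passages + "\n\n" + question},
        ]}],
    )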
Where: patterns/agents/util.py:23
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
messages = [{"role": "user", "content": prompt}]
response = client.messages.create(
model=model,
max_tokens=4096,
system=system_prompt,
messages=messages,
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: skills/skill_utils.py:248
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
skills.append({"type": "anthropic", "skill_id": anthropic_skill, "version": "latest"})
response = client.beta.messages.create(
model=model,
max_tokens=4096,
container={"skills": skills},
tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/code_review_demo.py:128
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
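To pick a cap, look at the actual output-length distribution rather than a rule of thumb. A sketch of that check, assuming a per-request usage export with an output_tokens column (the file and column names are assumptions about your billing export, not a documented format):
import pandas as pd

usage = pd.read_csv("billing_export.csv")           # one row per request
p99 = usage["output_tokens"].quantile(0.99)
suggested_cap = int(min(4096, max(256, 2 * p99)))   # 2x observed p99, clamped to sane bounds
print(f"p99 output tokens: {p99:.0f} -> suggested max_tokens: {suggested_cap}")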
Where: patterns/agents/util.py:25
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: skills/skill_utils.py:250
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: capabilities/retrieval_augmented_generation/evaluation/provider_retrieval.py:70
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
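A more complete version of the suggested wrapper. Retrying only the SDK's transient error classes (rate limits, connection drops, 5xx) instead of bare Exception avoids re-sending requests that would fail deterministically, and the decorator gives up after 4 attempts or 60 seconds so an outage cannot burn input tokens indefinitely.
import os
import backoff
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

@backoff.on_exception(
    backoff.expo,
    (anthropic.RateLimitError, anthropic.APIConnectionError, anthropic.InternalServerError),
    max_tries=4,
    max_time=60,
)
def call_with_retry(**params):
    return client.messages.create(**params)

response = call_with_retry(
    model="claude-sonnet-4-6",
    max_tokens=50,
    messages=[{"role": "user", "content": prompt}],  # prompt as built by the surrounding eval code
)
The official SDK client also accepts a max_retries option (Anthropic(max_retries=...)) that retries the same transient errors with backoff, which may be enough on its own.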
Where: capabilities/retrieval_augmented_generation/evaluation/prompts.py:112
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
<relevant_indices>put the numbers of your indices here, seeparted by commas</relevant_indices>
"""
try:
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=50,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
Where: capabilities/retrieval_augmented_generation/evaluation/eval_end_to_end.py:36
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1500,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
Where: capabilities/knowledge_graph/evaluation/eval_extraction.py:75
What we found: max_tokens=2048 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=2048
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: capabilities/retrieval_augmented_generation/evaluation/vectordb.py:37
What we found: Found 4 hardcoded model strings in this file, none routed through env vars. This blocks A/B testing cheaper models, prevents quick rollback when a vendor releases a better-priced equivalent, and forces a code deploy for every routing change. Introduce env vars (MODEL_PRIMARY, MODEL_RERANK, MODEL_BATCH).
Current: model="voyage-2"
Suggested fix: model=os.getenv("MODEL_PRIMARY", "voyage-2")
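A sketch of the suggested routing shim; the env var names come from the finding above, and the defaults simply preserve the currently hardcoded behavior:
import os

# Central routing table: swap models via env vars instead of a code deploy.
MODEL_PRIMARY = os.getenv("MODEL_PRIMARY", "voyage-2")      # default = current hardcoded value
MODEL_RERANK  = os.getenv("MODEL_RERANK", MODEL_PRIMARY)
MODEL_BATCH   = os.getenv("MODEL_BATCH", MODEL_PRIMARY)
Then pass model=MODEL_PRIMARY (or the appropriate constant) wherever the string is currently hardcoded.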
The v1 audit does not include the per-call-site cost table shown in the public sample report; that table requires uploading your billing CSV during intake (coming in v2). The findings above are based on static code analysis only, with estimated $/mo savings calibrated to mid-size SaaS workloads. If you'd like a calibrated cost table, email miloantaeus@gmail.com with your billing CSV for the last 30 days and we'll regenerate the report at no extra charge.
Why this matters: there is a strong vendor incentive to inflate projected savings. The re-audit voucher creates an accountability loop, binding the vendor's reputation to actual outcomes rather than promises. If you implement none of the recommendations, that's on you; if you implement all of them and your bill still goes up, we refund.