If your Claude spend is climbing faster than your traffic, the culprit is almost always the same: cache_control never got turned on. Here's the audit pattern, the 3-line fix, and the math.
The fix is a 3-line change to how you pass the system prompt. The savings on the cached portion are typically 60-90%. Most teams ship it and forget it.
If you're reading this, your Anthropic bill probably looks something like this: usage roughly doubled over the last six months, but the number of users didn't double. The reflex response is to assume Anthropic raised prices. They didn't — not on the SKUs you're using, anyway. What changed is that your codebase grew, your system prompts got longer (RAG, tool definitions, structured-output schemas), and you never adopted the cache_control primitive Anthropic shipped back in August 2024.
This post walks through why that's the dominant cost story we see in 2026 bill audits, exactly which 3 lines to add, what to expect savings-wise, and the half-dozen gotchas to know before you ship.
Prompt caching went GA on Anthropic's API in August 2024. The deal: any prompt block you mark with cache_control: {"type": "ephemeral"} gets cached for 5 minutes on Anthropic's side. The first call writes the cache (priced at ~25% above normal input rate). Every subsequent call within 5 minutes that hits the same cached prefix reads from the cache at 10% of the normal input rate. On Claude Sonnet 4.5, that's $0.30/M cached-read vs $3.00/M uncached input — a 90% discount on the cached portion.
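To make the pricing concrete, here's the break-even arithmetic as a small script, using the Sonnet 4.5 rates quoted above (a sketch; substitute your own model's rates):

```python
# Sonnet 4.5 input pricing per million tokens, as quoted above.
BASE_INPUT = 3.00    # normal (uncached) input
CACHE_WRITE = 3.75   # first call, which writes the cache (~25% premium)
CACHE_READ = 0.30    # later calls within the 5-minute window (90% discount)

def cost_of_prefix(calls: int, cached: bool) -> float:
    """Cost of sending the same 1M-token prefix `calls` times."""
    if not cached:
        return calls * BASE_INPUT
    return CACHE_WRITE + (calls - 1) * CACHE_READ

for n in (1, 2, 5, 20):
    print(n, cost_of_prefix(n, cached=False), cost_of_prefix(n, cached=True))
# 1 call:  $3.00 uncached vs $3.75 cached -- caching costs slightly more
# 2 calls: $6.00 vs $4.05 -- a single cache hit already pays for the write premium
# 20 calls: $60.00 vs $9.45 -- the gap widens fast from there
```

The takeaway: any prefix that gets reused even once inside the 5-minute window is cheaper cached.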
The catch: it's opt-in. Anthropic shipped the primitive, blogged about it, added it to the SDK docs — and then most production codebases never picked it up. The reasons are mundane.
Two years later in 2026, the gap between teams who adopted caching and teams who didn't is enormous — and it's the dominant story when an Anthropic bill "doubles for no reason."
When we run the analyzer that powers the $79 X-Ray, the very first regex it applies to every Python and TypeScript file is this: find every call site that hits anthropic.messages.create or client.messages.create, then check whether the system argument is a plain string or a list of blocks with at least one cache_control entry.
Plain string = leak. List of blocks with cache_control = caching is on. It's binary.
The second check: even if the system is a cached block list, is there a large tools array or document attachment that isn't also cached? Anthropic lets you cache up to 4 distinct cache breakpoints per request — system prompt, tool definitions, retrieved documents, and conversation history can each get their own cache_control marker. We often see teams who cached only the system prompt and missed the 6,000-token tool definitions sitting right next to it.
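If you want to run a rough version of that first check yourself, here's an illustrative sketch (not the actual X-Ray analyzer). It's a coarser, file-level heuristic: it flags any Python or TypeScript file that calls messages.create but never mentions cache_control anywhere.

```python
# Rough audit heuristic: file-level, not per-call-site. A file can still leak
# if only some of its call sites carry cache_control.
import re
from pathlib import Path

CALL_SITE = re.compile(r"\bmessages\.create\s*\(")

def find_uncached_files(repo_root: str = ".") -> list[str]:
    findings = []
    for path in Path(repo_root).rglob("*"):
        if path.suffix not in {".py", ".ts"}:
            continue
        text = path.read_text(errors="ignore")
        calls = len(CALL_SITE.findall(text))
        if calls and "cache_control" not in text:
            findings.append(f"{path}: {calls} uncached call site(s)")
    return findings

if __name__ == "__main__":
    print("\n".join(find_uncached_files()))
```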
Here's the exact diff we paste into customer PRs:
- resp = client.messages.create(
-     model="claude-sonnet-4-5",
-     system="You are a careful assistant. Follow these rules: ...",
-     messages=messages,
- )
+ resp = client.messages.create(
+     model="claude-sonnet-4-5",
+     system=[{
+         "type": "text",
+         "text": "You are a careful assistant. Follow these rules: ...",
+         "cache_control": {"type": "ephemeral"},
+     }],
+     messages=messages,
+ )
That's it. The wire format changes from a string to a single-element list of blocks. Anthropic's SDK accepts both shapes. Functionally nothing changes — the assistant sees the exact same system prompt.
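Once the change ships, you can verify the cache is actually being hit from the usage block on the response. A quick sketch, assuming the client, SYSTEM_PROMPT, and messages from the examples in this post (field names per recent versions of Anthropic's Messages API):

```python
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=messages,
)

# First call in a 5-minute window: cache_creation_input_tokens > 0 (cache write).
# Subsequent calls with the same prefix: cache_read_input_tokens > 0 (cache hits).
print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
```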
If you also have tool definitions or large retrieved documents, add a cache_control marker on the last block you want included in the cached prefix. The cache extends from the start of the request up to and including the marked block.
client.messages.create(
    model="claude-sonnet-4-5",
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    tools=[
        *tool_definitions[:-1],
        {**tool_definitions[-1], "cache_control": {"type": "ephemeral"}},
    ],
    messages=messages,
)
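The fourth breakpoint mentioned earlier, conversation history, follows the same rule: mark the last content block of the final message and the growing prefix gets cached turn over turn. A sketch, assuming messages whose content may still be a plain string:

```python
def mark_history_breakpoint(messages):
    """Put a cache breakpoint on the last block of the final message."""
    last = dict(messages[-1])
    content = last["content"]
    if isinstance(content, str):  # normalize string content to block form
        content = [{"type": "text", "text": content}]
    content = [*content[:-1], {**content[-1], "cache_control": {"type": "ephemeral"}}]
    last["content"] = content
    return [*messages[:-1], last]

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=mark_history_breakpoint(messages),
)
```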
Take a representative production chatbot. Before caching, every request pays the full $3.00/M input rate on the entire prompt: system prompt, tool definitions, and conversation history included. After caching, with roughly 35% of input tokens billed at the $3.75/M cache-write rate and 65% at the $0.30/M cache-read rate, the blended input rate drops to about $1.51/M.
Savings: ~$895/month, or ~50% on this endpoint alone, after a 3-line code change. At higher cache hit rates (long-running RAG sessions, agent loops within the 5-min window), savings climb past 70%. We've seen audits where a single high-volume endpoint accounted for $4K+/mo in recoverable waste from this one pattern.
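The arithmetic behind those numbers, spelled out. The monthly volume is an assumption (roughly 600M input tokens/month, chosen because it reproduces the ~$895 figure above); plug in your own volume and cache split:

```python
INPUT_TOKENS_M = 600                    # assumed monthly input volume, millions of tokens
BASE, WRITE, READ = 3.00, 3.75, 0.30    # $/M: uncached, cache-write, cache-read
WRITE_SHARE, READ_SHARE = 0.35, 0.65    # split from the scenario above

before = INPUT_TOKENS_M * BASE
after = INPUT_TOKENS_M * (WRITE_SHARE * WRITE + READ_SHARE * READ)
print(before, round(after, 2), round(before - after, 2))
# 1800.0 904.5 895.5 -> roughly $895/month saved, ~50% of input spend
```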
Two gotchas worth checking before you ship. First, the cache matches on the exact prefix, so dynamic content in the cached block (datetime.now(), request IDs, or user-specific context bleeding into the system block) busts the cache on every request, and you pay the write premium each time. Second, the anthropic-beta: prompt-caching-2024-07-31 header is no longer needed in 2026 SDKs; if you copy-pasted old example code, remove that header.
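A sketch of the first gotcha and its fix, assuming a chat endpoint that wants the current time in context (user_question is a placeholder):

```python
from datetime import datetime

user_question = "How do I reset my password?"  # placeholder

# Anti-pattern: a timestamp inside the cached block changes the prefix on
# every request, so the cache never hits and every call pays the write premium.
busted_system = [{
    "type": "text",
    "text": f"You are a careful assistant. Current time: {datetime.now()}.",
    "cache_control": {"type": "ephemeral"},
}]

# Better: keep the cached block byte-for-byte stable and pass per-request
# context (timestamps, user IDs, session data) in the user turn instead.
stable_system = [{
    "type": "text",
    "text": "You are a careful assistant. Follow these rules: ...",
    "cache_control": {"type": "ephemeral"},
}]
messages = [{
    "role": "user",
    "content": f"Current time: {datetime.now()}.\n\n{user_question}",
}]
```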
grep -rn "messages.create" --include="*.py" --include="*.ts" | wc -l grep -rn "cache_control" --include="*.py" --include="*.ts" | wc -l
If the first number is significantly larger than the second, you have a caching gap: the difference between the two counts is roughly the number of uncached call sites. We typically see ratios like 12:1 in production codebases, or 20:0, meaning every single call site is uncached.
A more detailed breakdown, one that names every leaky file with the exact file:line and provides paste-into-PR fix diffs, is exactly what the $79 X-Ray produces. View a real sample report run on anthropic-cookbook → (18 findings, $4,673/mo theoretical waste).
$79 one-shot. Drop your GitHub URL. The X-Ray analyzer runs the same 9 deterministic patterns on your codebase, ranks the findings by dollar impact, and ships you an HTML report with paste-into-PR fixes. 14-day money-back guarantee if total surfaced savings < $79.
Order LLM Bill X-Ray — $79 →
Not ready to commit? Try the free Mini-Triage — paste 3 file URLs, get a 1-page diagnosis.
In our audit data, codebases that haven't adopted caching almost always carry at least two other compounding leaks. The one we see most often: a bare try/except retry wrapped around messages.create with no exponential backoff. A transient 529 from Anthropic can multiply your call volume 5-10x for the duration of the outage (a minimal backoff sketch follows below). The X-Ray flags all of these in one pass. The caching fix is usually the headline, but the tail of medium-severity findings often adds another $500-$2,000/mo in recoverable waste.
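For the retry leak, a minimal sketch of exponential backoff with jitter and a hard attempt cap, assuming the anthropic Python SDK (exception names vary across SDK versions; recent versions also expose a max_retries option on the client that covers much of this for you):

```python
import random
import time

import anthropic

client = anthropic.Anthropic()

def create_with_backoff(max_attempts: int = 5, **kwargs):
    """Retry transient failures (429/500/529, connection errors) with capped,
    jittered exponential backoff instead of an unbounded bare retry."""
    for attempt in range(max_attempts):
        try:
            return client.messages.create(**kwargs)
        except (anthropic.APIConnectionError, anthropic.APIStatusError) as err:
            status = getattr(err, "status_code", None)
            retryable = status is None or status in (429, 500, 529)
            if not retryable or attempt == max_attempts - 1:
                raise
            time.sleep(min(2 ** attempt, 30) + random.random())  # 1s, 2s, 4s, ... plus jitter
```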
Caching doesn't change model outputs. It's a wire-level transport optimization: the model sees the exact same tokens it would have seen without caching, and outputs are identical (modulo normal sampling variance at temperature > 0).
Caching works especially well for agent workloads. Agent loops within a 5-minute window hit the cache on every iteration after the first. We've seen agent workloads drop input-token spend by 75-85% after caching the system + tools prefix.
Bedrock supports Anthropic prompt caching in 2026. The cache_control syntax is identical. Vertex AI's Claude offering also supports it; check your provider's most recent SDK version.
The 5-patterns post is a walkthrough of an actual audit run against the public anthropic-cookbook repo — 18 findings with file:line references. This post zooms in on the single pattern (cache_control) that's most likely the dominant story behind a doubled bill.