
Why your Anthropic bill doubled in 2026 (and the 3-line fix)

If your Claude spend is climbing faster than your traffic, the culprit is almost always the same: cache_control never got turned on. Here's the audit pattern, the 3-line fix, and the math.

Published 2026-05-16 · ~7 min read · By Milo Antaeus
TL;DR: In the audits we run for the $79 LLM Bill X-Ray, the single most common cost leak in 2026 Anthropic codebases is prompt caching that was never enabled. The fix is 3 lines around your system prompt. The savings on the cached portion are typically 60-90%. Most teams ship it and forget it.

If you're reading this, your Anthropic bill probably looks something like this: usage roughly doubled over the last six months, but the number of users didn't double. The reflex response is to assume Anthropic raised prices. They didn't — not on the SKUs you're using, anyway. What changed is that your codebase grew, your system prompts got longer (RAG, tool definitions, structured-output schemas), and you never adopted the cache_control primitive Anthropic shipped back in August 2024.

This post walks through why that's the dominant cost story we see in 2026 bill audits, exactly which 3 lines to add, what to expect savings-wise, and the half-dozen gotchas to know before you ship.

Why caching is the #1 leak in 2026

Prompt caching shipped on Anthropic's API in August 2024 (beta that summer, GA by year-end). The deal: any prompt block you mark with cache_control: {"type": "ephemeral"} gets cached for 5 minutes on Anthropic's side. The first call writes the cache (priced at ~25% above the normal input rate). Every subsequent call within 5 minutes that hits the same cached prefix reads from the cache at 10% of the normal input rate. On Claude Sonnet 4.5, that's $0.30/M cached-read vs $3.00/M uncached input — a 90% discount on the cached portion.

The catch: it's opt-in. Anthropic shipped the primitive, blogged about it, added it to the SDK docs — and then most production codebases never picked it up. The reasons are mundane: the plain-string system argument keeps working, old pre-caching example code keeps circulating, and a missing cache_control never throws an error; it just quietly bills full rate.

Two years later in 2026, the gap between teams who adopted caching and teams who didn't is enormous — and it's the dominant story when an Anthropic bill "doubles for no reason."

The audit pattern (what we look for)

When we run the analyzer that powers the $79 X-Ray, the first check it applies to every Python and TypeScript file is simple: find every call site that hits anthropic.messages.create or client.messages.create, then check whether the system argument is a plain string or a list of blocks with at least one cache_control entry.

Plain string = leak. List of blocks with cache_control = caching is on. It's binary.
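
A minimal sketch of that first check in Python (scan_file and the regexes are our illustration, not the actual X-Ray analyzer, which resolves each call site individually):

import re
from pathlib import Path

# Hypothetical simplification of the first check: flag files that call
# messages.create but never mention cache_control anywhere.
CALL_RE = re.compile(r"(?:anthropic|client)\.messages\.create")
CACHE_RE = re.compile(r"cache_control")

def scan_file(path: Path) -> list[int]:
    source = path.read_text(encoding="utf-8", errors="ignore")
    if CACHE_RE.search(source):
        return []  # caching appears somewhere; the real check is per call site
    # report the line number of every uncached call site
    return [source[: m.start()].count("\n") + 1 for m in CALL_RE.finditer(source)]

for path in [*Path(".").rglob("*.py"), *Path(".").rglob("*.ts")]:
    for line_no in scan_file(path):
        print(f"{path}:{line_no}: uncached messages.create call")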

The second check: even if system is a cached block list, is there a large tools array or document attachment that isn't also cached? Anthropic lets you set up to 4 distinct cache breakpoints per request — system prompt, tool definitions, retrieved documents, and conversation history can each get their own cache_control marker. We often see teams who cached only the system prompt and missed the 6,000-token tool definitions sitting right next to it.

The 3-line fix

Here's the exact diff we paste into customer PRs:

- resp = client.messages.create(
-     model="claude-sonnet-4-5",
-     system="You are a careful assistant. Follow these rules: ...",
-     messages=messages,
- )
+ resp = client.messages.create(
+     model="claude-sonnet-4-5",
+     system=[{
+         "type": "text",
+         "text": "You are a careful assistant. Follow these rules: ...",
+         "cache_control": {"type": "ephemeral"},
+     }],
+     messages=messages,
+ )

That's it. The wire format changes from a string to a single-element list of blocks. Anthropic's SDK accepts both shapes. Functionally nothing changes — the assistant sees the exact same system prompt.
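
To verify the change actually landed, watch the usage object on the next few responses; Anthropic reports cache writes and reads as separate token counts:

# First call in a 5-minute window writes the cache; later calls read it.
u = resp.usage
print(f"uncached input: {u.input_tokens}")
print(f"cache writes:   {u.cache_creation_input_tokens}")  # >0 on the first call in a window
print(f"cache reads:    {u.cache_read_input_tokens}")      # >0 on every warm call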

If you also have tool definitions or large retrieved documents, add a cache_control marker on the last block you want included in the cached prefix. The cache extends from the start of the request up to and including the marked block.

# Caching tools too — 4 cache breakpoints max per request
client.messages.create(
    model="claude-sonnet-4-5",
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    tools=[
        *tool_definitions[:-1],
        {**tool_definitions[-1],
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=messages,
)
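
The same pattern extends to conversation history, the third breakpoint most chat products want. Mark the final content block of the last message (user_input here is whatever your app just received); the next call inside the TTL reads the whole transcript up to that marker from cache:

# Third breakpoint: cache the conversation itself by marking the last block.
messages.append({
    "role": "user",
    "content": [{
        "type": "text",
        "text": user_input,
        "cache_control": {"type": "ephemeral"},
    }],
})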

The math (conservative)

Take a representative production chatbot: call it 600M cacheable input tokens a month (system prompt + tool definitions) on claude-sonnet-4-5.

Before caching: 600M × $3.00/M = $1,800/mo on that prefix.

After caching (35% cache-write at $3.75/M, 65% cache-read at $0.30/M): 210M × $3.75/M + 390M × $0.30/M ≈ $905/mo.

Savings: ~$895/month, or ~50% on this endpoint alone, after a 3-line code change. At higher cache hit rates (long-running RAG sessions, agent loops within the 5-min window), savings climb past 70%. We've seen audits where a single high-volume endpoint accounted for $4K+/mo in recoverable waste from this one pattern.
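
Here's the same arithmetic as a script, if you want to plug in your own volume and cache mix (constants are the Sonnet 4.5 rates quoted above):

# Back-of-envelope: plug in your own monthly volume and cache mix.
TOKENS_M = 600            # cacheable prefix tokens per month, in millions
UNCACHED = 3.00           # $/M input, claude-sonnet-4-5
WRITE, READ = 3.75, 0.30  # $/M cache-write (1.25x) and cache-read (0.10x)
write_share = 0.35        # fraction of prefix tokens that miss the cache

before = TOKENS_M * UNCACHED
after = TOKENS_M * (write_share * WRITE + (1 - write_share) * READ)
print(f"before ${before:,.0f}/mo, after ${after:,.0f}/mo, "
      f"saving ${before - after:,.0f}/mo ({1 - after / before:.0%})")
# prints roughly: before $1,800/mo, after $905/mo, saving $895/mo (50%)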

Why "your bill doubled": Most production Claude integrations were written when the team had 5K calls/day. They now run 50K calls/day with the same uncached system prompts. Caching would have absorbed most of that 10x growth invisibly. Without it, every new user re-pays for the same 6,000-token preamble on every turn.

Six gotchas before you ship

  1. Cache key is exact-match on the prefix. Any byte change — including injecting a timestamp into the system prompt — invalidates the cache for that call. Audit your prompt builders for datetime.now(), request IDs, or user-specific context bleeding into the system block; see the sketch after this list.
  2. 5-minute TTL is from last hit, not first write. An active session keeps the cache warm. A session that goes idle for 6 minutes pays full cache-write cost again on resume.
  3. Max 4 cache breakpoints per request. If you have system + tools + 3 cached documents + history, you have to pick. Put the marker on the largest cumulative prefix.
  4. Cache-write costs slightly more than uncached input. 1.25x for ephemeral. If your traffic is bursty and most calls are cache-misses, you can slightly overspend. Break-even is roughly a 22% hit rate (1.25x on misses vs 0.10x on hits); audits we run usually show hit rates of 40%+, comfortably past it.
  5. Pricing tier matters. Haiku, Sonnet, Opus all have caching at the same proportional discount, but the absolute dollar savings scale with the per-token input rate. Opus codebases see the largest absolute lift.
  6. Beta header is gone. Caching is GA — no anthropic-beta: prompt-caching-2024-07-31 header needed in 2026 SDKs. If you copy-pasted old example code, remove that header.
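
To make gotcha #1 concrete, here's the failure mode in miniature; build_system_bad and build_system_good are made-up prompt builders, but this exact bug turns up in a large share of audits:

from datetime import datetime, timezone

# BAD: a fresh timestamp changes the prefix on every call, so the cache
# never hits and you pay the 1.25x write rate on every single request.
def build_system_bad(rules: str) -> str:
    return f"Current time: {datetime.now(timezone.utc).isoformat()}\n\n{rules}"

# GOOD: keep the cached prefix byte-stable; ship anything volatile
# (timestamps, request IDs, per-user context) in the messages instead.
def build_system_good(rules: str) -> str:
    return rules

def time_context_message() -> dict:
    return {"role": "user",
            "content": f"(context) Current UTC time: {datetime.now(timezone.utc).isoformat()}"}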

How to spot this in your own codebase in 60 seconds

Open a terminal in your repo root and run:

grep -rn "messages.create" --include="*.py" --include="*.ts" | wc -l
grep -rn "cache_control" --include="*.py" --include="*.ts" | wc -l

If the first number is significantly larger than the second, you have a caching gap; the difference between the two counts approximates how many call sites are uncached. We typically see ratios like 12:1 or 20:0 in production codebases — 20:0 meaning every single call site is uncached.

If you want a breakdown that names every leaky file with the exact file:line and ships paste-into-PR fix diffs, that's exactly what the $79 X-Ray produces. View a real sample report run on anthropic-cookbook → (18 findings, $4,673/mo theoretical waste).

Don't want to grep through your repo by hand?

$79 one-shot. Drop your GitHub URL. The X-Ray analyzer runs the same 9 deterministic patterns on your codebase, ranks the findings by dollar impact, and ships you an HTML report with paste-into-PR fixes. 14-day money-back guarantee if total surfaced savings < $79.

Order LLM Bill X-Ray — $79 →

Not ready to commit? Try the free Mini-Triage — paste 3 file URLs, get a 1-page diagnosis.

What else we find when caching is missing

In our audit data, codebases that haven't adopted caching almost always carry at least two other compounding leaks on top of it, drawn from the same family of patterns the X-Ray's 9 deterministic checks cover.

The X-Ray flags them all in one pass. The caching fix is usually the headline, but the tail of medium-severity findings often adds another $500-$2,000/mo in recoverable waste.

FAQ

Will caching change the model's behavior?

No. Caching is a billing and latency optimization on Anthropic's side; the model sees the exact same tokens it would have seen without caching. Outputs are identical (modulo normal sampling variance at temperature > 0).

Does caching work for tool-use loops?

Yes — especially well. Agent loops within a 5-minute window hit cache on every iteration after the first. We've seen agent workloads drop input-token spend by 75-85% after caching the system + tools prefix.
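
For the curious, a skeletal version of such a loop, with handle_tool_call standing in for your own dispatcher: the system + tools prefix is byte-identical on every iteration, so only the first call pays the write rate.

# Every iteration re-sends the same system + tools prefix; inside the
# 5-minute TTL, iterations 2..N read it back at the $0.30/M cached rate.
while True:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        tools=cached_tools,  # last tool definition carries cache_control
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": handle_tool_call(block)}   # hypothetical dispatcher
        for block in resp.content if block.type == "tool_use"
    ]})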

What if I'm on Bedrock or Vertex?

Bedrock supports Anthropic prompt caching in 2026. The cache_control syntax is identical. Vertex AI's Claude offering also supports it; check your provider's most recent SDK version.

How is this different from your other blog post?

The 5-patterns post is a walkthrough of an actual audit run against the public anthropic-cookbook repo — 18 findings with file:line references. This post zooms in on the single pattern (cache_control) that's most likely the dominant story behind a doubled bill.
