Static-analysis prompt-caching audit · https://github.com/anthropics/anthropic-cookbook · Generated 2026-05-16 22:03 UTC
1 ranked Anthropic prompt-caching opportunity across 12 Anthropic call site(s) (99 total files scanned). Implementing it could save approximately $0/month ($0/year).
Note: estimates use Sonnet rates ($3/M input, $0.30/M cache-read = 90% off) calibrated to mid-volume workloads (50K-500K calls/month). Verify in your next billing cycle.
| # | Opportunity | Severity | $/mo saved |
|---|---|---|---|
| 1 | Multi-paragraph system prompt duplicated within 1 file | HIGH | $0 |
Where: capabilities/retrieval_augmented_generation/evaluation/prompts.py:29
What we found: the same multi-paragraph system prompt (509 chars) appears 3 times, all within capabilities/retrieval_augmented_generation/evaluation/prompts.py (anchor at line 29). Two problems: (1) any edit to the canonical wording must be made in every copy, (2) each copy is cached independently in Anthropic's prompt cache, so the ~5-minute prefix-cache warmth from one call site does NOT help another. Extract the prompt to a single module-level constant (or a shared prompts/ directory) and import it everywhere; the cache is then shared across all call sites.
# capabilities/retrieval_augmented_generation/evaluation/prompts.py:29
SYSTEM_PROMPT = """
You have been tasked with helping us to answer the following query:
<query>
{input_query}
</query>
You have access to the following documents which are meant to provide context as..."""
# (duplicated elsewhere in the same file)
# prompts/system.py (single source of truth):
SYSTEM_PROMPT = """...the canonical multi-paragraph prompt..."""
# all call sites:
from prompts.system import SYSTEM_PROMPT
client.messages.create(
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    ...
)
Anthropic's prompt cache lets you mark static portions of your prompt (system instructions, retrieval
context, few-shot examples) with cache_control: {"type": "ephemeral"}. The first call writes
the cache (1.25x base input rate); subsequent calls within ~5 minutes read from the cache at 0.1x base
rate (a 90% discount). For Sonnet, that's $0.30/M cached tokens vs $3/M base.
The math gets dramatic at moderate scale: a 4K-token system prompt called 100K times/month costs $1,200 uncached vs ~$150 cached ($120 in reads plus ~$30 in amortized writes). That's roughly $1,050/month saved on a single block, and most production workloads have 3-5 such blocks.
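The arithmetic above can be checked in a few lines of Python. The rates are the Sonnet figures quoted earlier; the 2,000-miss count is this sketch's assumption standing in for however often the 5-minute cache window lapses in practice:

```python
# Back-of-envelope cache savings for the worked example above.
# Assumptions: 4K-token static prompt, 100K calls/month, Sonnet rates.
PROMPT_TOKENS = 4_000
CALLS_PER_MONTH = 100_000
BASE_RATE = 3.00 / 1_000_000         # $/token, Sonnet base input
CACHE_READ_RATE = 0.1 * BASE_RATE    # cache reads: 90% discount
CACHE_WRITE_RATE = 1.25 * BASE_RATE  # cache writes: 1.25x base

total_tokens = PROMPT_TOKENS * CALLS_PER_MONTH
uncached = total_tokens * BASE_RATE

# Assume ~2,000 of the 100K calls land after the 5-minute window
# and pay the write rate; the rest read from the warm cache.
cache_misses = 2_000
cached = (cache_misses * PROMPT_TOKENS * CACHE_WRITE_RATE
          + (CALLS_PER_MONTH - cache_misses) * PROMPT_TOKENS * CACHE_READ_RATE)

print(f"uncached: ${uncached:,.0f}/mo  cached: ${cached:,.0f}/mo  "
      f"saved: ${uncached - cached:,.0f}/mo")
```

With these inputs the script reproduces the $1,200 vs ~$150 figures; vary `cache_misses` to see how sensitive the savings are to cache-window churn.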
Cache scope: the cache key is the entire prefix up to (and including) the last cache_control marker. So order matters: put the most-static content first, then less-static, then the per-call variable content last. Anthropic supports up to 4 cache breakpoints per request.
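The ordering rule can be sketched as a helper that lays out the system blocks most-static-first. The names `tool_defs`, `style_guide`, and `retrieved_docs` are hypothetical stand-ins for most-static, less-static, and per-call content respectively:

```python
# Order system blocks from most-static to least-static so each
# cache_control breakpoint covers the longest stable prefix.
def build_system_blocks(tool_defs: str, style_guide: str,
                        retrieved_docs: str) -> list[dict]:
    return [
        # Most static (e.g. tool definitions): breakpoint 1
        {"type": "text", "text": tool_defs,
         "cache_control": {"type": "ephemeral"}},
        # Less static (e.g. style guide): breakpoint 2
        {"type": "text", "text": style_guide,
         "cache_control": {"type": "ephemeral"}},
        # Per-call variable content last, with NO marker:
        # it changes every request and would bust the cache.
        {"type": "text", "text": retrieved_docs},
    ]

blocks = build_system_blocks("...tools...", "...style...", "...docs...")
```

Because the cache key is the whole prefix up to each marker, an edit to the style guide invalidates breakpoint 2 but leaves breakpoint 1's cached prefix intact.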
Why this matters: Anthropic cache savings only materialize once the code change ships. The re-audit voucher creates an accountability loop: we can't claim an issue resolved unless the v1 ruleset agrees on re-scan. Same deterministic engine, same file paths, same line numbers. No moving goalposts.