Same engine. Two real public repos. 13x asymmetry. Here’s what mature LLM apps still miss about prompt caching — and why “we don’t manufacture findings” matters more than the savings number.
Any vendor selling you a static-analyzer audit can say “our engine is honest, it doesn’t manufacture findings to justify the price.” The proof is asymmetry. Run the same engine against a known-clean repo and a typical production repo. If both come back with similar finding counts, the engine is generating noise. If the clean repo comes back nearly empty and the production repo comes back loud, the engine is doing real work.
Here’s the side-by-side:
| Metric | anthropic-cookbook | litellm | Asymmetry |
|---|---|---|---|
| Files scanned | 99 | 1,200+ | 12x |
| Findings | 1 | 10 | 10x |
| Total $/mo savings | $0 (theoretical) | $1,267 | N/A |
| CRITICAL patterns | 0 | 4 | N/A |
| HIGH patterns | 1 | 3 | 3x |
The lone finding in the anthropic-cookbook is a HIGH-severity system-prompt duplication across files. The cookbook isn’t running anywhere, so the savings number is $0/mo, which is exactly what an honest engine should report. If the engine had “found” a CRITICAL with a 4-figure savings claim, we’d know it was making things up.
litellm is the closest thing the OSS community has to a production-grade LLM-routing library. Tens of thousands of teams use it. Its codebase is mature, well-tested, well-reviewed. And it still has 10 prompt-caching opportunities the static analyzer caught in under one second of scan time.
The top 3 by impact:
cache_control missing on static blocks (4 occurrences)This is the #1 leak we see across customer audits. A 2K-token system prompt called 100K times/month at Sonnet rates is $600/mo uncached vs $75/mo cached — 87% off. litellm has 4 of these. None has the one-line cache_control: {"type": "ephemeral"} wrapper that turns the cost off.
- system="You are a helpful assistant. [...2K tokens of routing rules...]" + system=[{ + "type": "text", + "text": "You are a helpful assistant. [...2K tokens of routing rules...]", + "cache_control": {"type": "ephemeral"} + }]
One line. 87% off the static portion. Per-call savings compounded across 100K+ requests/month = $525/mo per occurrence. Times 4 occurrences = $2,100/mo if every block hits typical volume.
Even when cache_control is added, prompt-caching only matches byte-identical prefixes. If the same multi-paragraph system prompt is copy-pasted into 5 different files with subtle whitespace differences, you get 5 independent caches that each have to warm separately. The fix is mechanical: extract the prompt to a shared module, import everywhere.
# services/agent_a.py SYSTEM = """You are a helpful routing assistant...""" # services/agent_b.py SYSTEM = """You are a helpful routing assistant...""" # whitespace-different copy # prompts/system.py (single source of truth): SYSTEM_PROMPT = """...the canonical multi-paragraph prompt...""" # all call sites: from prompts.system import SYSTEM_PROMPT client.messages.create( system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}], ... )
Few-shot examples of 2,000+ chars passed in user messages without cache_control wrappers. These are exactly the high-token static blocks prompt-caching was designed for. Same fix shape as #1 above.
Three things, in order of importance:
Sonnet pricing: $3/M input tokens, $0.30/M cache-read tokens (90% off after the first warm). A 4K-token static system prompt called 100K times/month:
Most production codebases have 3-5 such blocks. That’s $3,150–$5,250/mo of recurring savings — verifiable in console.anthropic.com the very next billing cycle. The audit pays for itself 50–100x in 30 days.
Drop your GitHub URL, get a personalized report in 1 hour. $39. 30-day money-back if your Anthropic bill doesn’t drop by $39/mo (verifiable in console.anthropic.com).
Buy Anthropic Prompt Library Audit — $39If you remember one thing from this post: ask any vendor selling you a static-analyzer audit to run it on a known-clean reference repo and post the output. If they refuse, or if the output looks suspiciously similar to their pitch deck, the engine is making things up. If the clean output is genuinely empty (or 1 finding) and the production output is loud, the engine is doing real work.
13x asymmetry is the actual contract.