Static-analysis prompt-caching audit · https://github.com/BerriAI/litellm · Generated 2026-05-16 22:29 UTC
10 ranked Anthropic prompt-caching opportunities across 7 Anthropic call site(s) (5240 total files scanned). Implementing the top 3 could save approximately $987/month, or about $11,844/year.
Note: estimates use Sonnet rates ($3/M input, $0.30/M cache-read = 90% off) calibrated to mid-volume workloads (50K-500K calls/month). Verify in your next billing cycle.
Top 5 of the 10 ranked opportunities; the remaining findings appear in the details below.
| # | Opportunity | Severity | $/mo saved |
|---|---|---|---|
| 1 | Static text block (3842 chars) missing cache_control | CRITICAL | $626 |
| 2 | Static text block (1804 chars) missing cache_control | CRITICAL | $321 |
| 3 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
| 4 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
| 5 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
Where: tests/llm_translation/test_anthropic_completion.py:687
What we found: A 3842-character static text content block with no `cache_control: ephemeral` directive. Anthropic's prompt cache gives a 90% discount on cached input tokens ($0.30/M cache-read vs $3/M base for Sonnet). At the call volumes assumed for these estimates (50K-500K calls/month), re-sending a static system prompt or retrieval-context block of this size at the full rate costs hundreds of dollars per month more than necessary. Add `cache_control: {type: 'ephemeral'}` to this specific block to enable 5-minute prefix caching.
{
    "content": [
        {
            "type": "text",
            "text": "[Current URL: https://github.com/ryanhoangt]\n[Focused element bid: 119]\n\n[Action executed successfully.]\n============== BEGIN accessibility tree ==============\nRootWebArea 'ryanhoangt (Ryan H. Tran) · GitHub', focused\n\t[119] generic\n\t\t[120] generic\n\t\t\t[121] generic\n\t\t\t\t[122] link 'Skip to content', clickable\n\t\t\t\t[123] generic\n\t\t\t\t\t[124] generic\n\t\t\t\t[135] generic\n\t\t\t\t\t[137] generic, clickable\n\t\t\t\t[142] banner ''\n\t\t\t\t\t[143] heading 'Nav

# Suggested fix (mark the static block as cacheable):
{
    "type": "text",
    "text": LARGE_STATIC_TEXT,
    "cache_control": {"type": "ephemeral"}
}
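The flagged test drives Anthropic through litellm's OpenAI-compatible interface, where the same fix is a `cache_control` key on the static content block. A minimal sketch, assuming the installed litellm version forwards `cache_control` on content blocks to Anthropic (the model string, placeholder text, and follow-up question below are illustrative, not from the repo):

import litellm

# Placeholder for the real static context (the accessibility-tree dump in the test).
LARGE_STATIC_TEXT = "============== BEGIN accessibility tree ==============\n..."

response = litellm.completion(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": LARGE_STATIC_TEXT,
                    # Static block: mark it so Anthropic caches the prefix.
                    "cache_control": {"type": "ephemeral"},
                },
                # Per-call variable content stays after the cached block.
                {"type": "text", "text": "What changed since the last snapshot?"},
            ],
        }
    ],
)
print(response.usage)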
Where: tests/llm_translation/test_anthropic_completion.py:657
What we found: A 1804-character static text content block with no `cache_control: ephemeral` directive. Anthropic's prompt cache gives a 90% discount on cached input tokens ($0.30/M cache-read vs $3/M base for Sonnet). At the call volumes assumed for these estimates (50K-500K calls/month), re-sending a static system prompt or retrieval-context block of this size at the full rate costs hundreds of dollars per month more than necessary. Add `cache_control: {type: 'ephemeral'}` to this specific block to enable 5-minute prefix caching.
"content": [
{"type": "text", "text": "go to github ryanhoangt by browser"},
{
"type": "text",
"text": '<extra_info>\nThe following information has been included based on a keyword match for "github". It may or may not be relevant to the user\'s request.\n\nYou have access to an environment variable, `GITHUB_TOKEN`, which allows you to interact with\nthe GitHub API.\n\nYou can use `curl` with the `GITHUB_TOKEN` to interact with GitHub\'s API.\nALWAYS use the GitHub API for operations instead of a web browser.\n\nHere are s
{
"type": "text",
"text": LARGE_STATIC_TEXT,
"cache_control": {"type": "ephemeral"}
}
Where: litellm/litellm_core_utils/litellm_logging.py:4704
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
) == kwargs.get("system"):
return messages
messages = [
{"role": "system", "content": kwargs.get("system")}
] + messages
elif isinstance(messages, str):
messages = [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
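Findings 3 through 10 all flag the same pattern, so one sketch covers them. `split_system_prompt` below is a hypothetical helper (not from the repo) showing one way to hoist leading system entries out of OpenAI-style messages before calling the Anthropic SDK:

from typing import Any, Dict, List, Tuple

import anthropic


def split_system_prompt(
    messages: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Hoist leading {"role": "system"} entries into Anthropic-style system blocks."""
    system_blocks: List[Dict[str, Any]] = []
    remaining: List[Dict[str, Any]] = []
    for msg in messages:
        if msg.get("role") == "system" and not remaining:
            content = msg.get("content")
            if isinstance(content, str):
                system_blocks.append({"type": "text", "text": content})
            elif isinstance(content, list):
                system_blocks.extend(content)
        else:
            remaining.append(msg)
    if system_blocks:
        # Mark the final system block so the whole system prefix is cacheable.
        system_blocks[-1] = {**system_blocks[-1], "cache_control": {"type": "ephemeral"}}
    return system_blocks, remaining


openai_style = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
system_blocks, chat_messages = split_system_prompt(openai_style)
response = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=system_blocks,    # top-level system param
    messages=chat_messages,  # user/assistant turns only
)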
Where: litellm/main.py:7882
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
fallback_messages = messages or []
if system and fallback_messages:
fallback_messages = [{"role": "system", "content": system}] + fallback_messages
local_count = litellm.token_counter(
model=model,
messages=fallback_messages,
tools=tools, # type: ignore[arg-type]
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/llm_translation/test_anthropic_completion.py:527
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
res = litellm.completion(
**base_completion_call_args,
messages=[
{
"role": "system",
"content": "response user question with JSON object",
},
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/llm_translation/test_gemini.py:225
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
def test_gemini_context_caching_separate_messages():
messages = [
# System Message
{
"role": "system",
"content": [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/local_testing/test_anthropic_prompt_caching.py:246
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
response = await litellm.acompletion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=[
# System Message
{
"role": "system",
"content": [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/local_testing/test_completion.py:247
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
litellm.set_verbose = True
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are 2twNLGfqk4GMOn3ffp4p."}],
},
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/test_litellm/integrations/test_anthropic_cache_control_hook.py:61
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
response = await litellm.acompletion(
model="bedrock/anthropic.claude-3-5-haiku-20241022-v1:0",
messages=[
{
"role": "system",
"content": [
{
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py:1875
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
# Test empty string content - should not produce any anthropic system message content
messages = [
{"role": "system", "content": ""},
{"role": "user", "content": "Hello"},
]
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Anthropic's prompt cache lets you mark static portions of your prompt (system instructions, retrieval
context, few-shot examples) with cache_control: {"type": "ephemeral"}. The first call writes
the cache (1.25x base input rate); subsequent calls within ~5 minutes read from the cache at 0.1x base
rate (a 90% discount). For Sonnet, that's $0.30/M cached tokens vs $3/M base.
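As a concrete reference, a minimal sketch of a cache-enabled request against the Anthropic SDK directly; the model string, SYSTEM_PROMPT value, and user question are placeholders, not taken from the repo:

import anthropic

SYSTEM_PROMPT = "<large, static instructions>"  # placeholder for the real prompt

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # First call writes the cache (1.25x base); calls within ~5 minutes
            # read it back at 0.1x base.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the latest run."}],
)
# usage reports cache_creation_input_tokens / cache_read_input_tokens per call.
print(response.usage)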
The math gets dramatic at moderate scale: a 4K-token system prompt called 100K times/month costs $1,200 uncached vs $150 cached ($120 reads + ~$30 amortized writes). That's $1,050/month saved on a single block — and most production workloads have 3-5 such blocks.
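A quick back-of-the-envelope check of those figures in Python; the 2,000 cache writes/month is an assumption chosen to reproduce the "~$30 amortized writes" above, so adjust it to your own traffic pattern:

# Assumed figures: Sonnet rates from above.
PROMPT_TOKENS = 4_000
CALLS_PER_MONTH = 100_000
BASE_RATE = 3.00 / 1_000_000         # $/input token
CACHE_READ_RATE = 0.30 / 1_000_000   # $/cached input token (0.1x base)
CACHE_WRITE_RATE = 3.75 / 1_000_000  # $/input token on cache write (1.25x base)
CACHE_WRITES_PER_MONTH = 2_000       # assumed cache misses/expiries per month

uncached = PROMPT_TOKENS * CALLS_PER_MONTH * BASE_RATE
cached = PROMPT_TOKENS * (CALLS_PER_MONTH * CACHE_READ_RATE
                          + CACHE_WRITES_PER_MONTH * CACHE_WRITE_RATE)
print(f"uncached ${uncached:,.0f}  cached ${cached:,.0f}  saved ${uncached - cached:,.0f}")
# -> uncached $1,200  cached $150  saved $1,050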
Cache scope: the cache key is the entire prefix up to (and including) the last cache_control marker. So order matters: put the most-static content first, then less-static, then the per-call variable content last. Anthropic supports up to 4 cache breakpoints per request.
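A sketch of that ordering with placeholder content (none of these names come from the repo); this example uses 3 of the 4 allowed breakpoints:

# Illustrative placeholders; in practice these are your real prompt pieces.
CORE_INSTRUCTIONS = "You are a code-review assistant..."  # changes ~never
FEW_SHOT_EXAMPLES = "<worked examples>"                   # changes rarely
RETRIEVAL_CONTEXT = "<docs snapshot>"                     # changes per session
user_question = "Review this diff."                       # changes every call

system = [
    # Most-static content first so the shared prefix stays identical across calls.
    {"type": "text", "text": CORE_INSTRUCTIONS,
     "cache_control": {"type": "ephemeral"}},    # breakpoint 1
    {"type": "text", "text": FEW_SHOT_EXAMPLES,
     "cache_control": {"type": "ephemeral"}},    # breakpoint 2
]
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": RETRIEVAL_CONTEXT,
         "cache_control": {"type": "ephemeral"}},  # breakpoint 3
        # Per-call variable content goes last and is never marked cacheable.
        {"type": "text", "text": user_question},
    ]},
]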
Why this matters: Anthropic cache savings only materialize once the code change ships. The re-audit voucher creates an accountability loop: we can't claim "issue resolved" unless the same v1 ruleset confirms it on re-scan. Same deterministic engine, same file paths, same line numbers. No moving goalposts.