Static-analysis prompt-caching audit · https://github.com/BerriAI/litellm · Generated 2026-05-16 22:29 UTC
10 ranked Anthropic prompt-caching opportunities across 7 Anthropic call site(s) (5240 total files scanned). Implementing the top 3 could save approximately $987/month, or about $11,844/year.
Note: estimates use Sonnet rates ($3/M input, $0.30/M cache-read = 90% off) calibrated to mid-volume workloads (50K-500K calls/month). Verify in your next billing cycle.
Top 5 of the 10 ranked opportunities; the remaining findings appear in the details below.
| # | Opportunity | Severity | $/mo saved |
|---|---|---|---|
| 1 | Static text block (3842 chars) missing cache_control | CRITICAL | $626 |
| 2 | Static text block (1804 chars) missing cache_control | CRITICAL | $321 |
| 3 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
| 4 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
| 5 | `{role: 'system'}` entry inside messages[] (should use top-level `system` param) | LOW | $40 |
Where: tests/llm_translation/test_anthropic_completion.py:687
What we found: A 3842-character static text content block with no `cache_control: ephemeral` directive. Anthropic's prompt cache gives a 90% discount on cached input tokens ($0.30/M cache-read vs $3/M base for Sonnet). At the call volumes assumed for these estimates (50K-500K calls/month), re-sending a static system prompt or retrieval-context block of this size at the full rate costs hundreds of dollars per month more than necessary. Add `cache_control: {type: 'ephemeral'}` to this specific block to enable 5-minute prefix caching.
{
    "content": [
        {
            "type": "text",
            "text": "[Current URL: https://github.com/ryanhoangt]\n[Focused element bid: 119]\n\n[Action executed successfully.]\n============== BEGIN accessibility tree ==============\nRootWebArea 'ryanhoangt (Ryan H. Tran) · GitHub', focused\n\t[119] generic\n\t\t[120] generic\n\t\t\t[121] generic\n\t\t\t\t[122] link 'Skip to content', clickable\n\t\t\t\t[123] generic\n\t\t\t\t\t[124] generic\n\t\t\t\t[135] generic\n\t\t\t\t\t[137] generic, clickable\n\t\t\t\t[142] banner ''\n\t\t\t\t\t[143] heading 'Nav

# Suggested fix (mark the static block as cacheable):
{
    "type": "text",
    "text": LARGE_STATIC_TEXT,
    "cache_control": {"type": "ephemeral"}
}
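The flagged test drives Anthropic through litellm's OpenAI-compatible interface, where the same fix is a `cache_control` key on the static content block. A minimal sketch, assuming the installed litellm version forwards `cache_control` on content blocks to Anthropic (the model string, placeholder text, and follow-up question below are illustrative, not from the repo):

import litellm

# Placeholder for the real static context (the accessibility-tree dump in the test).
LARGE_STATIC_TEXT = "============== BEGIN accessibility tree ==============\n..."

response = litellm.completion(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": LARGE_STATIC_TEXT,
                    # Static block: mark it so Anthropic caches the prefix.
                    "cache_control": {"type": "ephemeral"},
                },
                # Per-call variable content stays after the cached block.
                {"type": "text", "text": "What changed since the last snapshot?"},
            ],
        }
    ],
)
print(response.usage)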
Where: tests/llm_translation/test_anthropic_completion.py:657
What we found: A 1804-character static text content block with no `cache_control: ephemeral` directive. Anthropic's prompt cache gives a 90% discount on cached input tokens ($0.30/M cache-read vs $3/M base for Sonnet). At the call volumes assumed for these estimates (50K-500K calls/month), re-sending a static system prompt or retrieval-context block of this size at the full rate costs hundreds of dollars per month more than necessary. Add `cache_control: {type: 'ephemeral'}` to this specific block to enable 5-minute prefix caching.
"content": [
{"type": "text", "text": "go to github ryanhoangt by browser"},
{
"type": "text",
"text": '<extra_info>\nThe following information has been included based on a keyword match for "github". It may or may not be relevant to the user\'s request.\n\nYou have access to an environment variable, `GITHUB_TOKEN`, which allows you to interact with\nthe GitHub API.\n\nYou can use `curl` with the `GITHUB_TOKEN` to interact with GitHub\'s API.\nALWAYS use the GitHub API for operations instead of a web browser.\n\nHere are s
{
"type": "text",
"text": LARGE_STATIC_TEXT,
"cache_control": {"type": "ephemeral"}
}
Where: litellm/litellm_core_utils/litellm_logging.py:4704
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
) == kwargs.get("system"):
return messages
messages = [
{"role": "system", "content": kwargs.get("system")}
] + messages
elif isinstance(messages, str):
messages = [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
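Findings 3 through 10 all flag the same pattern, so one sketch covers them. `split_system_prompt` below is a hypothetical helper (not from the repo) showing one way to hoist leading system entries out of OpenAI-style messages before calling the Anthropic SDK:

from typing import Any, Dict, List, Tuple

import anthropic


def split_system_prompt(
    messages: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Hoist leading {"role": "system"} entries into Anthropic-style system blocks."""
    system_blocks: List[Dict[str, Any]] = []
    remaining: List[Dict[str, Any]] = []
    for msg in messages:
        if msg.get("role") == "system" and not remaining:
            content = msg.get("content")
            if isinstance(content, str):
                system_blocks.append({"type": "text", "text": content})
            elif isinstance(content, list):
                system_blocks.extend(content)
        else:
            remaining.append(msg)
    if system_blocks:
        # Mark the final system block so the whole system prefix is cacheable.
        system_blocks[-1] = {**system_blocks[-1], "cache_control": {"type": "ephemeral"}}
    return system_blocks, remaining


openai_style = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
system_blocks, chat_messages = split_system_prompt(openai_style)
response = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=system_blocks,    # top-level system param
    messages=chat_messages,  # user/assistant turns only
)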
Where: litellm/main.py:7882
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
fallback_messages = messages or []
if system and fallback_messages:
fallback_messages = [{"role": "system", "content": system}] + fallback_messages
local_count = litellm.token_counter(
model=model,
messages=fallback_messages,
tools=tools, # type: ignore[arg-type]
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/llm_translation/test_anthropic_completion.py:527
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
res = litellm.completion(
**base_completion_call_args,
messages=[
{
"role": "system",
"content": "response user question with JSON object",
},
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/llm_translation/test_gemini.py:225
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
def test_gemini_context_caching_separate_messages():
messages = [
# System Message
{
"role": "system",
"content": [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/local_testing/test_anthropic_prompt_caching.py:246
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
response = await litellm.acompletion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=[
# System Message
{
"role": "system",
"content": [
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/local_testing/test_completion.py:247
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
litellm.set_verbose = True
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are 2twNLGfqk4GMOn3ffp4p."}],
},
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/test_litellm/integrations/test_anthropic_cache_control_hook.py:61
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
response = await litellm.acompletion(
model="bedrock/anthropic.claude-3-5-haiku-20241022-v1:0",
messages=[
{
"role": "system",
"content": [
{
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Where: tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py:1875
What we found: Anthropic's API expects the system prompt at the top level (`system="..."` or `system=[content_blocks]`), NOT as a `{"role": "system"}` entry in the messages array. OpenAI uses the latter convention; mixing them is a common port mistake. The raw Anthropic Messages API rejects a `system` role outright, so an OpenAI-compatible layer has to remap the entry; when the content ends up inside the conversation rather than in the top-level `system` parameter, the cache-hit rate drops (the system content is re-tokenized as part of the user turn instead of as a stable, cacheable prefix). Move the system content out of messages[] into the top-level `system` parameter.
# Test empty string content - should not produce any anthropic system message content
messages = [
{"role": "system", "content": ""},
{"role": "user", "content": "Hello"},
]
# Anthropic-native shape (cacheable prefix):
client.messages.create(
system=SYSTEM_PROMPT, # top-level
messages=[{"role": "user", # NO system entries here
"content": user_msg}],
...
)
Anthropic's prompt cache lets you mark static portions of your prompt (system instructions, retrieval
context, few-shot examples) with cache_control: {"type": "ephemeral"}. The first call writes
the cache (1.25x base input rate); subsequent calls within ~5 minutes read from the cache at 0.1x base
rate (a 90% discount). For Sonnet, that's $0.30/M cached tokens vs $3/M base.
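As a concrete reference, a minimal sketch of a cache-enabled request against the Anthropic SDK directly; the model string, SYSTEM_PROMPT value, and user question are placeholders, not taken from the repo:

import anthropic

SYSTEM_PROMPT = "<large, static instructions>"  # placeholder for the real prompt

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # First call writes the cache (1.25x base); calls within ~5 minutes
            # read it back at 0.1x base.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the latest run."}],
)
# usage reports cache_creation_input_tokens / cache_read_input_tokens per call.
print(response.usage)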
The math gets dramatic at moderate scale: a 4K-token system prompt called 100K times/month costs $1,200 uncached vs $150 cached ($120 reads + ~$30 amortized writes). That's $1,050/month saved on a single block — and most production workloads have 3-5 such blocks.
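A quick back-of-the-envelope check of those figures in Python; the 2,000 cache writes/month is an assumption chosen to reproduce the "~$30 amortized writes" above, so adjust it to your own traffic pattern:

# Assumed figures: Sonnet rates from above.
PROMPT_TOKENS = 4_000
CALLS_PER_MONTH = 100_000
BASE_RATE = 3.00 / 1_000_000         # $/input token
CACHE_READ_RATE = 0.30 / 1_000_000   # $/cached input token (0.1x base)
CACHE_WRITE_RATE = 3.75 / 1_000_000  # $/input token on cache write (1.25x base)
CACHE_WRITES_PER_MONTH = 2_000       # assumed cache misses/expiries per month

uncached = PROMPT_TOKENS * CALLS_PER_MONTH * BASE_RATE
cached = PROMPT_TOKENS * (CALLS_PER_MONTH * CACHE_READ_RATE
                          + CACHE_WRITES_PER_MONTH * CACHE_WRITE_RATE)
print(f"uncached ${uncached:,.0f}  cached ${cached:,.0f}  saved ${uncached - cached:,.0f}")
# -> uncached $1,200  cached $150  saved $1,050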
Cache scope: the cache key is the entire prefix up to (and including) the last cache_control marker. So order matters: put the most-static content first, then less-static, then the per-call variable content last. Anthropic supports up to 4 cache breakpoints per request.
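A sketch of that ordering with placeholder content (none of these names come from the repo); this example uses 3 of the 4 allowed breakpoints:

# Illustrative placeholders; in practice these are your real prompt pieces.
CORE_INSTRUCTIONS = "You are a code-review assistant..."  # changes ~never
FEW_SHOT_EXAMPLES = "<worked examples>"                   # changes rarely
RETRIEVAL_CONTEXT = "<docs snapshot>"                     # changes per session
user_question = "Review this diff."                       # changes every call

system = [
    # Most-static content first so the shared prefix stays identical across calls.
    {"type": "text", "text": CORE_INSTRUCTIONS,
     "cache_control": {"type": "ephemeral"}},    # breakpoint 1
    {"type": "text", "text": FEW_SHOT_EXAMPLES,
     "cache_control": {"type": "ephemeral"}},    # breakpoint 2
]
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": RETRIEVAL_CONTEXT,
         "cache_control": {"type": "ephemeral"}},  # breakpoint 3
        # Per-call variable content goes last and is never marked cacheable.
        {"type": "text", "text": user_question},
    ]},
]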
Why this matters: Anthropic cache savings only materialize once the code change ships. The re-audit voucher creates an accountability loop: we can't claim "issue resolved" unless the same v1 ruleset confirms it on re-scan. Same deterministic engine, same file paths, same line numbers. No moving goalposts.