Static-analysis cost audit · https://github.com/anthropics/anthropic-cookbook · Generated 2026-05-16 21:09 UTC
18 ranked cost leaks across 99 files. Implementing the top 3 could save approximately $4,673/month, or about $56,075/year. The table below lists the top 5; all 18 findings follow.
| # | Leak | Severity | Est. $/mo saved |
|---|---|---|---|
| 1 | Anthropic prompt caching not enabled (4 call sites) | CRITICAL | $1,200 |
| 2 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 3 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 4 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
| 5 | Anthropic prompt caching not enabled (1 call site) | CRITICAL | $300 |
Where: skills/file_utils.py:26
What we found: Found 4 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Args:
response: The response object from client.beta.messages.create()
Returns:
List of file IDs found in the response
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
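For reference, a fuller self-contained sketch of the same pattern (sys_prompt, retrieval_context, and user_msg are placeholder names, not identifiers from file_utils.py). Two caveats: cache writes are billed at a premium over base input (roughly 1.25x for the 5-minute TTL), and a prefix is only cached once it clears the model's minimum cacheable length (1,024 tokens on Sonnet), so savings start on the second call inside the TTL.
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def cached_call(sys_prompt: str, retrieval_context: str, user_msg: str):
    # Static system prompt and stable retrieval context are marked cacheable;
    # only the per-request question is billed at the full input rate after the first call.
    return client.messages.create(
        model="claude-sonnet-4-6",  # keep whatever model the call site already uses
        max_tokens=1024,
        system=[{"type": "text", "text": sys_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": [
            {"type": "text", "text": retrieval_context,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": user_msg},
        ]}],
    )
On the second and later calls, response.usage.cache_read_input_tokens should be non-zero; if it stays at 0, the cached prefix is either changing between calls or below the minimum length.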
Where: tool_use/utils/visualize.py:330
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Usage:
viz = visualize(auto_show=True)
response = client.messages.create(...)
viz.capture(response)
"""
def __init__(self, auto_show: bool = True):
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/demo_helpers.py:76
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
request_params["context_management"] = context_management
response = client.beta.messages.create(**request_params)
assistant_content = []
tool_results = []
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/code_review_demo.py:126
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
while True:
print(f" 🔄 Turn {turn}: Calling Claude API...", end="", flush=True)
response = self.client.beta.messages.create(
model=MODEL,
max_tokens=4096,
system=self._create_system_prompt(),
messages=self.messages,
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
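Because this call sits inside a multi-turn agent loop, the part that repeats every turn is the system prompt (and, as the conversation grows, the earlier turns). A minimal sketch assuming the existing system prompt is a plain string; the helper name and signature are illustrative, not from code_review_demo.py.
def agent_turn(client, model, system_prompt: str, messages: list):
    # Wrapping the string system prompt in a block lets it carry cache_control,
    # so every turn after the first reads it from cache instead of paying full input rate.
    return client.beta.messages.create(
        model=model,
        max_tokens=1024,  # sized down per the max_tokens finding below
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=messages,
    )
If the conversation messages are also stored as content blocks, adding cache_control to the last block of the most recent user message caches the whole conversation prefix across turns as well.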
Where: capabilities/summarization/evaluation/custom_evals/llm_eval.py:57
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
Evaluation (JSON format):"""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1000,
temperature=0,
messages=[{"role": "user", "content": prompt}, {"role": "assistant", "content": "<json>"}],
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/provider_retrieval.py:71
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/prompts.py:113
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
"""
try:
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=50,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: capabilities/retrieval_augmented_generation/evaluation/eval_end_to_end.py:37
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1500,
messages=[
{"role": "user", "content": prompt},
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
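In these retrieval-evaluation scripts the prompt is assembled as a single string, so nothing is cacheable until the stable part is split out. A sketch under the assumption that the prompt template has a long fixed preamble (instructions, rubric) followed by per-question content; EVAL_INSTRUCTIONS and the function name are illustrative, and caching only pays off if the fixed preamble clears the 1,024-token minimum.
EVAL_INSTRUCTIONS = "..."  # the fixed preamble shared by every eval question

def eval_call(client, retrieved_passages: str, question: str):
    # Fixed preamble is cached; the retrieved passages and question change per call,
    # so they stay in the uncached tail of the prompt.
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": EVAL_INSTRUCTIONS,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": retrieved_passages + "\n\n" + question},
        ]}],
    )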
Where: patterns/agents/util.py:23
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
messages = [{"role": "user", "content": prompt}]
response = client.messages.create(
model=model,
max_tokens=4096,
system=system_prompt,
messages=messages,
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: skills/skill_utils.py:248
What we found: Found 1 Anthropic API call site(s) in this file with no cache_control directive. Static system prompts and repeated retrieval contexts are billed at full input rate on every call. Adding cache_control: ephemeral cuts input-token cost on the cached portion by 90% (cache-read $0.30/M vs $3/M for Sonnet). 5-minute TTL is usually enough for chat sessions.
Context:
skills.append({"type": "anthropic", "skill_id": anthropic_skill, "version": "latest"})
response = client.beta.messages.create(
model=model,
max_tokens=4096,
container={"skills": skills},
tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
Suggested fix:
# Cache system + the stable retrieval context block:
system=[{"type": "text", "text": sys_prompt,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": [
{"type": "text", "text": context,
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": user_msg}
]}]
Where: tool_use/memory_demo/code_review_demo.py:128
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
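To pick a cap, look at the actual output-length distribution rather than a rule of thumb. A sketch of that check, assuming a per-request usage export with an output_tokens column (the file and column names are assumptions about your billing export, not a documented format):
import pandas as pd

usage = pd.read_csv("billing_export.csv")           # one row per request
p99 = usage["output_tokens"].quantile(0.99)
suggested_cap = int(min(4096, max(256, 2 * p99)))   # 2x observed p99, clamped to sane bounds
print(f"p99 output tokens: {p99:.0f} -> suggested max_tokens: {suggested_cap}")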
Where: patterns/agents/util.py:25
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: skills/skill_utils.py:250
What we found: max_tokens=4096 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=4096
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: capabilities/retrieval_augmented_generation/evaluation/provider_retrieval.py:70
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=50,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
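A more complete version of the suggested wrapper. Retrying only the SDK's transient error classes (rate limits, connection drops, 5xx) instead of bare Exception avoids re-sending requests that would fail deterministically, and the decorator gives up after 4 attempts or 60 seconds so an outage cannot burn input tokens indefinitely.
import os
import backoff
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

@backoff.on_exception(
    backoff.expo,
    (anthropic.RateLimitError, anthropic.APIConnectionError, anthropic.InternalServerError),
    max_tries=4,
    max_time=60,
)
def call_with_retry(**params):
    return client.messages.create(**params)

response = call_with_retry(
    model="claude-sonnet-4-6",
    max_tokens=50,
    messages=[{"role": "user", "content": prompt}],  # prompt as built by the surrounding eval code
)
The official SDK client also accepts a max_retries option (Anthropic(max_retries=...)) that retries the same transient errors with backoff, which may be enough on its own.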
Where: capabilities/retrieval_augmented_generation/evaluation/prompts.py:112
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
<relevant_indices>put the numbers of your indices here, seeparted by commas</relevant_indices>
"""
try:
response = client.messages.create(
model="claude-haiku-4-5",
max_tokens=50,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
Where: capabilities/retrieval_augmented_generation/evaluation/eval_end_to_end.py:36
What we found: An LLM API call is wrapped in try/except but no backoff or sleep is detected anywhere in this file. On a transient outage, this loop can hammer the provider for as long as the wrapping loop runs, generating billable input tokens on every failed attempt. Add exponential backoff via `backoff` or `tenacity` library — or at minimum time.sleep(min(2**attempt, 30)).
Context:
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1500,
messages=[
Suggested fix:
import backoff
@backoff.on_exception(backoff.expo, Exception, max_tries=4, max_time=60)
def call_with_retry(...):
return client.messages.create(...)
Where: capabilities/knowledge_graph/evaluation/eval_extraction.py:75
What we found: max_tokens=2048 is unusually large. Anthropic + OpenAI bill on tokens GENERATED, so this doesn't directly cost more per-call. BUT: it increases p99 latency (model 'thinks longer' with headroom on chain-of-thought tasks) and chance of padded output. Check your billing CSV: if avg output for this endpoint is < 500 tokens, cap at 2× observed p99 (typically 512-1024).
Current: max_tokens=2048
Suggested fix: max_tokens=512  # cap at 2x observed p99 from billing CSV sample
Where: capabilities/retrieval_augmented_generation/evaluation/vectordb.py:37
What we found: Found 4 hardcoded model strings in this file, none routed through env vars. This blocks A/B testing cheaper models, prevents quick rollback when a vendor releases a better-priced equivalent, and forces a code deploy for every routing change. Introduce env vars (MODEL_PRIMARY, MODEL_RERANK, MODEL_BATCH).
Current: model="voyage-2"
Suggested fix: model=os.getenv("MODEL_PRIMARY", "voyage-2")
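A sketch of the suggested routing shim; the env var names come from the finding above, and the defaults simply preserve the currently hardcoded behavior:
import os

# Central routing table: swap models via env vars instead of a code deploy.
MODEL_PRIMARY = os.getenv("MODEL_PRIMARY", "voyage-2")      # default = current hardcoded value
MODEL_RERANK  = os.getenv("MODEL_RERANK", MODEL_PRIMARY)
MODEL_BATCH   = os.getenv("MODEL_BATCH", MODEL_PRIMARY)
Then pass model=MODEL_PRIMARY (or the appropriate constant) wherever the string is currently hardcoded.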
The v1 audit does not include the per-call-site cost table shown in the public sample report; that table requires uploading your billing CSV during intake (coming in v2). The findings above are based on static code analysis only, with estimated $/mo savings calibrated to mid-size SaaS workloads. If you'd like a calibrated cost table, email miloantaeus@gmail.com with your billing CSV for the last 30 days and we'll regenerate the report at no extra charge.
Why this matters: there is a strong vendor incentive to inflate projected savings. The re-audit voucher creates an accountability loop, binding the vendor's reputation to actual outcomes rather than promises. If you implement none of the recommendations, that's on you; if you implement all of them and your bill still goes up, we refund.