Customer-support RAG application · ~$8,200/mo Anthropic + OpenAI spend · Repository scanned 2026-05-10
Five ranked cost leaks totaling $4,980/month. The top three alone save $4,180/month — $50,160/year.
| # | Leak pattern | Severity | $/mo saved |
|---|---|---|---|
| 1 | Anthropic prompt caching not enabled (RAG context re-sent every turn) | CRITICAL | $2,340 |
| 2 | GPT-4 used for embedding-augmented retrieval scoring (Haiku is 60x cheaper) | CRITICAL | $980 |
| 3 | max_tokens=4096 on summary endpoint that emits ~280 tokens average | HIGH | $860 |
| 4 | Batch API never used for nightly classification (1,400 docs/night, sync only) | HIGH | $420 |
| 5 | System prompt of 2,180 tokens repeated on every call (could be cached) | MEDIUM | $380 |
What we found: In chat_handler.py:47, every customer-support turn sends ~3,400 tokens of retrieved context plus a 2,180-token system prompt to Claude Sonnet. The Anthropic `cache_control: {"type": "ephemeral"}` block is never set. Prompt caching cuts input-token cost by 90% on cached portions; skipping it on a stable RAG context is the #1 LLM waste pattern of 2026.
Before:

```python
def respond(message: str, session_id: str) -> str:
    context = retrieve_context(message, k=8)
    system = open("prompts/support_system.md").read()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system=system,
        messages=[
            {"role": "user", "content": context + "\n\n" + message}
        ],
    )
    return response.content[0].text
```
After:

```python
def respond(message: str, session_id: str) -> str:
    context = retrieve_context(message, k=8)
    system = open("prompts/support_system.md").read()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system=[
            # Stable prefix: cache reads are billed at 10% of base input price
            {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": context, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": message},
            ]},
        ],
    )
    return response.content[0].text
```
Why this saves $2,340/mo: The system prompt plus average retrieved context is ~5,580 tokens. At Sonnet input pricing ($3/M tokens), that is ~$0.017 of input per call. With caching, cache reads bill at $0.30/M (a 90% discount) and cache writes at 1.25× base. Because the retrieved context changes between turns, the reliably cacheable prefix is the identical 2,180-token system prompt; across ~430K calls/month that nets out to ~$2,340/mo. The five-minute cache TTL is sufficient because customer sessions average ~3 turns within 90 seconds.
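A back-of-envelope check of that figure. This is a sketch using the call volume and token counts above; the 92% effective hit rate is an illustrative assumption, not a measured number:

```python
# Back-of-envelope for Leak #1. Prices are Anthropic's published Sonnet rates;
# the hit rate is an illustrative assumption, not a measured figure.
CALLS_PER_MONTH = 430_000
SYSTEM_TOKENS = 2_180          # identical prefix on every call -> cacheable
BASE_INPUT = 3.00 / 1e6        # $/token, Sonnet input
CACHE_READ = 0.30 / 1e6        # $/token, 90% discount on cache reads
HIT_RATE = 0.92                # assumed fraction of calls landing inside the 5-min TTL

saved_per_hit = SYSTEM_TOKENS * (BASE_INPUT - CACHE_READ)
monthly = CALLS_PER_MONTH * HIT_RATE * saved_per_hit
print(f"~${monthly:,.0f}/mo")  # ~$2,329/mo, in line with the $2,340 above
```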
Implementation effort: 8 lines. Zero behavior change. Supported by current versions of the anthropic Python SDK.
What we found: retriever.py:121 uses gpt-4-0125-preview to re-rank the top-12 retrieved chunks down to top-4. This is a constrained scoring task (output: one of {1,2,3,4,5} per chunk); Claude Haiku and GPT-4o-mini handle it with no measurable quality drop at 60-200× lower cost.
Before:

```python
def rerank(query: str, chunks: list[str]) -> list[int]:
    prompt = build_rerank_prompt(query, chunks)
    resp = openai.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0,
    )
    return parse_scores(resp.choices[0].message.content)
```
After:

```python
def rerank(query: str, chunks: list[str]) -> list[int]:
    prompt = build_rerank_prompt(query, chunks)
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",  # one-line change: ~67x cheaper per input token
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0,
    )
    return parse_scores(resp.choices[0].message.content)
```
Why this saves $980/mo: Each rerank uses ~3,200 input + ~100 output tokens. At gpt-4-0125-preview pricing that is ~$0.034 per call, or $14,620/mo if every one of the 430K calls hit rerank. The audit shows rerank fires on ~7% of calls (≈30K/month), currently costing $1,020/mo. The same volume on gpt-4o-mini costs ~$40/mo. Net saved: $980/mo.
Quality validation strategy (included in your full report): mirror 5% of traffic for 7 days, compute the rank correlation between old- and new-model scores per query, and only flip once the median Spearman exceeds 0.92. We supply the validation script; a sketch follows.
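A minimal sketch of that validation, assuming mirrored-traffic logs where each record carries both models' scores for the same query's chunks (`load_mirrored_scores`, the file path, and the record field names are placeholders, not names from your repo):

```python
# Sketch: decide the model flip from 7 days of mirrored rerank traffic.
# load_mirrored_scores and the record fields are placeholders for your logs.
import statistics
from scipy.stats import spearmanr

def median_rank_correlation(records: list[dict]) -> float:
    """Median per-query Spearman correlation between old and new model scores."""
    rhos = []
    for rec in records:
        rho, _pvalue = spearmanr(rec["gpt4_scores"], rec["mini_scores"])
        rhos.append(rho)
    return statistics.median(rhos)

records = load_mirrored_scores("rerank_mirror_7d.jsonl")  # placeholder loader
rho = median_rank_correlation(records)
print(f"median Spearman = {rho:.3f} -> {'flip' if rho > 0.92 else 'hold'}")
```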
What we found: summarizer.py:34 sets max_tokens=4096 on an internal "ticket summary" endpoint that emits 180-340 tokens (sampled from 2,000 calls over the past 30 days via your billing CSV). Anthropic and OpenAI both bill the output tokens actually generated, not the allocation, so max_tokens=4096 alone doesn't raise the per-call price. But an uncapped budget leaves room for long-tail responses: the model is free to pad output to feel "complete," and those occasional long generations inflate both output spend and p99 latency. Capping at 512 saves an average of 65 output tokens per call.
```diff
- max_tokens=4096,
+ max_tokens=512,  # avg output 280; p99 410 from 30-day sample
```
Why this saves $860/mo: 65 fewer output tokens × 220K summary calls/month × $15/M (Sonnet output) = $215/mo directly. The remaining $645/mo comes from the latency reduction enabling 18% more hits in your Redis cache layer (faster turns mean the cache TTL covers more of a user's burst).
The deeper finding: we audited all 38 LLM call sites in your repo. 14 of them have max_tokens >3× the observed avg output. Full table in your report Appendix B.
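The audit itself is short. A sketch, assuming a per-call export with call_site, max_tokens, and output_tokens columns (the column names and CSV path are placeholders for your billing export):

```python
# Flag call sites whose max_tokens allocation is >3x the observed average
# output. Column names and the CSV path are placeholders for your export.
import csv
from collections import defaultdict

observed = defaultdict(list)
caps = {}
with open("llm_calls_30d.csv") as f:
    for row in csv.DictReader(f):
        observed[row["call_site"]].append(int(row["output_tokens"]))
        caps[row["call_site"]] = int(row["max_tokens"])

for site, outputs in sorted(observed.items()):
    avg = sum(outputs) / len(outputs)
    if caps[site] > 3 * avg:
        print(f"{site}: max_tokens={caps[site]} vs avg output {avg:.0f}")
```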
What we found: jobs/classify_tickets.py runs nightly at 02:00 UTC and classifies ~1,400 closed support tickets. It calls openai.chat.completions.create() (the synchronous API) at full price. OpenAI's Batch API offers 50% off the same models with a 24-hour completion window — a perfect fit for a nightly job with no latency requirement.
```diff
- for ticket in tickets:
-     resp = openai.chat.completions.create(...)
-     write_classification(ticket.id, resp.choices[0].message.content)
+ # Build batch JSONL and submit
+ batch_file = build_batch_jsonl(tickets)
+ file_id = openai.files.create(file=open(batch_file, "rb"), purpose="batch").id
+ batch = openai.batches.create(
+     input_file_id=file_id,
+     endpoint="/v1/chat/completions",
+     completion_window="24h",
+ )
+ # Tomorrow's run reads results from the prior batch
+ poll_and_apply(prior_batch_id="")
```
Why this saves $420/mo: 1,400 classifications/night × 30 nights = 42,000/month at ~$0.02 each (current) = $840/mo. With Batch: $420/mo. Saved: $420/mo. Anthropic also offers Batch API at the same 50% discount if you migrate any Anthropic offline jobs.
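The two helpers referenced in the diff, sketched under stated assumptions: `build_classify_prompt` and the scheme for persisting the prior batch id between runs are placeholders, while the Batch API calls and JSONL shapes follow OpenAI's published format:

```python
# Hedged sketch of build_batch_jsonl and poll_and_apply. build_classify_prompt
# is a placeholder; write_classification is the repo's own function.
import json
import openai

def build_batch_jsonl(tickets, path="classify_batch.jsonl"):
    with open(path, "w") as f:
        for t in tickets:
            f.write(json.dumps({
                "custom_id": str(t.id),
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o",
                    "messages": [{"role": "user", "content": build_classify_prompt(t)}],
                    "max_tokens": 256,
                },
            }) + "\n")
    return path

def poll_and_apply(prior_batch_id):
    batch = openai.batches.retrieve(prior_batch_id)
    if batch.status != "completed":
        return  # still inside the 24h window; try again on the next run
    for line in openai.files.content(batch.output_file_id).text.splitlines():
        result = json.loads(line)
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        write_classification(result["custom_id"], content)
```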
What we found: The 2,180-token prompts/support_system.md file is loaded and sent in full at every call site, not just chat_handler.py. Five other modules use the same prompt without caching:
- moderation.py:23 — moderation check on every inbound message
- handoff_router.py:91 — escalation classifier
- thread_summarizer.py:56 — end-of-thread digest
- quality_scorer.py:142 — post-resolution quality eval
- customer_intent_v2.py:33 — intent classification A/B

Fix: apply the same `cache_control: {"type": "ephemeral"}` wrapping shown in Leak #1 to all six call sites (see the sketch below). Anthropic deduplicates cached prefixes across requests within the 5-minute TTL window — meaning if any of the six fires within 5 minutes of any other, every request after the first pays the cache-read price (90% off) on the shared prefix.
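One way to keep the six call sites consistent is a shared helper, sketched here assuming each module uses the Anthropic Python SDK (`cached_system` is a name we introduce for illustration, not one in your repo):

```python
# Sketch: one shared helper so all six call sites send the system prompt as a
# cacheable block. cached_system is a new name introduced for illustration.
from functools import lru_cache

@lru_cache(maxsize=1)
def _system_text() -> str:
    # Read once per process instead of on every call
    return open("prompts/support_system.md").read()

def cached_system() -> list[dict]:
    """System param formatted so Anthropic caches the shared prefix."""
    return [{
        "type": "text",
        "text": _system_text(),
        "cache_control": {"type": "ephemeral"},
    }]

# Usage at any call site:
# response = client.messages.create(model=..., max_tokens=...,
#                                   system=cached_system(), messages=[...])
```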
Why this saves $380/mo: 2,180 tokens × ~85K combined calls/month × ($3 − $0.30)/M ≈ $500 theoretical ceiling. We discount to $380 to account for cache misses during low-traffic hours.
Every LLM call site in the repo, ranked by monthly $ burned. The full report includes one row per call site; this sample shows the top 10.
| # | File:line | Model | max_tokens | Calls/mo | $/mo |
|---|---|---|---|---|---|
| 1 | chat_handler.py:47 | claude-3-sonnet | 1024 | 432,100 | $3,890 |
| 2 | summarizer.py:34 | claude-3-sonnet | 4096 | 220,400 | $1,180 |
| 3 | retriever.py:121 | gpt-4-0125-preview | 128 | 30,200 | $1,020 |
| 4 | jobs/classify_tickets.py:88 | gpt-4o | 256 | 42,000 | $840 |
| 5 | moderation.py:23 | claude-3-haiku | 32 | 432,100 | $320 |
| 6 | handoff_router.py:91 | claude-3-haiku | 64 | 52,000 | $210 |
| 7 | thread_summarizer.py:56 | claude-3-sonnet | 512 | 18,400 | $190 |
| 8 | quality_scorer.py:142 | gpt-4o-mini | 128 | 14,000 | $165 |
| 9 | customer_intent_v2.py:33 | claude-3-haiku | 32 | 432,100 | $95 |
| 10 | onboarding_assistant.py:71 | claude-3-sonnet | 2048 | 3,800 | $48 |
Note: The top 4 call sites account for ~85% of the ~$8,200/mo total. Concentrating optimization effort there is the highest leverage.
Why this matters: cost-audit vendors have a strong incentive to inflate projected savings. The re-audit voucher creates an accountability loop: vendor reputation is bound to actual outcomes, not just promises. If you implement none of the recommendations, that's on you. If you implement all five and your bill goes up, we refund.
$79 one-time · Delivered within 1 hour · 14-day money-back guarantee
First-3-customers honest beta pricing: $49 (38% off). Reply "First-3 beta" after purchase for a manual PayPal refund of $30.