How to Find LLM API Cost Leaks in Your Codebase

If you've ever been blindsided by an unexpectedly high OpenAI or Anthropic bill, you're not alone. LLM API cost leaks are one of the most common—and most painful—issues I see in production applications. Unlike traditional API costs, LLM inference costs scale with token counts, and small inefficiencies can compound quickly into serious budget overruns.

The good news: cost leaks are almost always findable once you know where to look.

In this guide, I'll walk through the most common sources of LLM API cost leaks and give you concrete steps to find them in your codebase.

Why LLM Costs Are Different from Traditional API Costs

With a typical REST API, you're paying per request. With LLM APIs, you're paying per token—and tokens add up fast. A single prompt with a long system message, a verbose context window, and a chat history of 50 messages can generate thousands of tokens per request.

This means even a small inefficiency—like including unnecessary context in every request—can multiply into hundreds of dollars in extra costs per week.

The Most Common Sources of LLM API Cost Leaks

1. Unbounded Chat History
The most frequent culprit I find. Developers store full conversation histories to maintain context, but forget to limit how far back that history goes. Each API call re-sends the entire history, paying tokens for every message every time.

2. Overly Verbose System Prompts
System prompts set once and forgotten. A 2000-token system prompt sent with every single request—even when most requests don't need that much context—is a silent budget drain.

3. Missing or Ineffective Caching
Not implementing response caching for repeated or similar queries. Or implementing it incorrectly so cache hits never actually occur.

4. Debug Output in Production
Logging full prompts and responses in production code. What was a helpful debugging tool becomes a cost multiplier when those logs are processed or when you're inadvertently re-sending logged content.

5. No Token Budgeting or Rate Limiting
No safeguards on how many tokens a single request or a single user can consume.

How to Find Cost Leaks—Step by Step

Step 1: Audit Your API Call Sites

Start by finding every place you call the LLM API. Use grep or your IDE's search to find calls to openai.ChatCompletion.create, anthropic.messages.create, or similar SDK methods.

For each call site, document:

What system prompt is being used?
How much chat history is being sent?
Is there any caching in place?
Are there any guards on token limits?

Step 2: Log Token Counts in Development

Before optimizing, you need visibility. Add logging to track token counts on every API call during your development and staging environments. This gives you a baseline for what's actually being sent.

def log_token_usage(prompt_tokens, completion_tokens, model):
    cost_estimate = (prompt_tokens * PROMPT_COST_PER_1K[model] +
                     completion_tokens * COMPLETION_COST_PER_1K[model]) / 1000
    logger.info(f"Token usage: {prompt_tokens} prompt + {completion_tokens} completion = ${cost_estimate:.4f}")

Step 3: Identify Your Top Token Consumers

Once you have token logging in place, run your most common user flows and identify which endpoints consume the most tokens. Focus your optimization efforts there first.

Step 4: Check for Cache Misses

If you've implemented caching, add instrumentation to track cache hit rates. A cache hit rate below 20% often means your cache key strategy needs work—or that your caching layer has a bug.

Step 5: Review System Prompts

Search your codebase for system prompt strings. Look for anything over 500 tokens and ask yourself: does every API call actually need all of this?

Quick Wins to Stop the Bleeding

Implement conversation windowing: Only send the last N messages, not the full history.
Trim system prompts: Audit them quarterly. Remove anything that's not strictly necessary.
Add token limits: Cap the maximum tokens allowed in a single request.
Implement semantic caching: Cache similar queries rather than identical ones to improve hit rates.

When to Use a Tool Instead of DIY

If you're running multiple LLM integrations across several services, or if your application has many developers making API calls, manual auditing becomes unsustainable. You need automated visibility.

LLM Bill X-Ray gives you token-level visibility across all your LLM providers in one report. It automatically identifies the highest-cost endpoints, tracks cache hit rates, and alerts you when cost anomalies appear. Rather than piecing together logs from multiple providers, you get a unified view of where your LLM money is going—with before/after code fixes you can apply immediately.

If you're serious about controlling LLM costs at scale, explore the full audit suite.