
RAG Pipeline Cost Calculator

The most detailed breakdown of the 5-calculator suite. Plug in your query volume, query length, retrieved context size, embedding model, and generation model — get a line-item monthly cost across embed + retrieve + generate phases. Browser-only, no signup, 2026-05 rates.

Your monthly RAG workload

[Interactive calculator: pick your models and workload, click Calculate, and get a line-item monthly cost (query embedding, generation input for query + context, generation output, and total) with each line's share of the bill.]

Vector DB storage and read costs are excluded — they're typically a rounding error vs the generation-input line. Prompt caching can cut the generation-input bill 50-90% on workloads where the retrieved context is stable across many queries.
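If the same retrieved context recurs across queries, you can estimate the caching upside with a blended rate. A minimal sketch, not any provider's billing formula: `effective_input_rate` and its parameters are illustrative names, the 0.1x read discount matches Anthropic's published cache-read pricing and 0.5x matches OpenAI's cached-input pricing, and cache-write surcharges are ignored.

```python
def effective_input_rate(list_rate: float, hit_rate: float, read_discount: float) -> float:
    """Blended $/MTok input rate when a fraction of input tokens hit the prompt cache.

    list_rate:     published input price in $/MTok
    hit_rate:      fraction of input tokens served from cache (0.0 to 1.0)
    read_discount: cache-read price as a multiple of the list price
    """
    return list_rate * ((1.0 - hit_rate) + hit_rate * read_discount)

# GPT-4o input at $2.50/MTok, with 80% of input tokens served from cache:
print(effective_input_rate(2.50, 0.80, 0.1))  # 0.70 -> 72% off at a 0.1x read rate
print(effective_input_rate(2.50, 0.80, 0.5))  # 1.50 -> 40% off at a 0.5x read rate
```

The 50-90% range above corresponds to the two endpoints: a full cache hit at a 0.5x read rate saves 50%, and a full cache hit at a 0.1x read rate saves 90%.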

Published rates used (as of 2026-05)

Sources: openai.com/api/pricing, anthropic.com/pricing, ai.google.dev/pricing, docs.voyageai.com/pricing
Model | Phase | Input ($/MTok) | Output ($/MTok)
---|---|---|---
text-embedding-3-small | Embed | $0.02 | n/a
text-embedding-3-large | Embed | $0.13 | n/a
voyage-3 | Embed | $0.18 | n/a
GPT-4o | Generate | $2.50 | $10.00
GPT-4o-mini | Generate | $0.15 | $0.60
Claude Sonnet 4 | Generate | $3.00 | $15.00
Claude Haiku 4.5 | Generate | $1.00 | $5.00
Claude Opus 4.1 | Generate | $15.00 | $75.00
Gemini 2.5 Pro | Generate | $1.25 | $5.00

Get the RAG cost optimization cheat-sheet

One-page line-item breakdown + one-page "7 ways to cut a RAG bill 50% without changing models" — chunk size, retrieval-k, reranker placement, prompt caching, summarized context, hybrid retrieval, query rewriting. PDF sent to your inbox.

When list-price math isn't enough
Get the LLM Bill Triage Deep Report
One-shot $299 audit of your real RAG or agent usage. 30-day cost-driver scan, prompt-bloat heatmap (especially valuable for RAG), model-routing wins, fix recipes. PDF in 24 hours. Money-back if total identified monthly savings is under $299.
Get the deep audit — $299 →
Money-back guarantee · PDF in 24 hours · No API keys required

How the math works

Three lines:

- Query embedding = queries × query tokens × embedding input rate
- Generation input = queries × (query tokens + retrieved context tokens) × generation input rate
- Generation output = queries × output tokens × generation output rate

All rates are in $ per million tokens (MTok).

Example: 100,000 queries/month × 50 query tokens × 3,000 retrieved context tokens × 300 output tokens, embedded with text-embedding-3-small and generated with GPT-4o: $0.10 embedding + $762.50 generation input + $300.00 generation output = $1,062.60/month.

Embedding cost is 0.01% of the bill. The generation-input line (the retrieved context the model has to read) is 72% — and that's where optimization actually moves the needle.
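The same arithmetic as a minimal sketch, assuming the 2026-05 list rates from the table above; the function and field names are illustrative, not part of any SDK:

```python
MTOK = 1_000_000  # published rates are quoted per million tokens

def rag_monthly_cost(queries: int, query_tok: int, context_tok: int, output_tok: int,
                     embed_rate: float, gen_in_rate: float, gen_out_rate: float) -> dict:
    """Line-item monthly cost in dollars, with all rates in $/MTok."""
    embed   = queries * query_tok * embed_rate / MTOK
    gen_in  = queries * (query_tok + context_tok) * gen_in_rate / MTOK
    gen_out = queries * output_tok * gen_out_rate / MTOK
    return {"embed": embed, "gen_input": gen_in, "gen_output": gen_out,
            "total": embed + gen_in + gen_out}

# The worked example: text-embedding-3-small ($0.02) + GPT-4o ($2.50 in / $10.00 out)
print(rag_monthly_cost(100_000, 50, 3_000, 300, 0.02, 2.50, 10.00))
# {'embed': 0.1, 'gen_input': 762.5, 'gen_output': 300.0, 'total': 1062.6}
```

Note how context_tok dominates: every retrieved token is multiplied by both the query volume and the generation input rate, which is why retrieval-k and chunk size are the first knobs to turn.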

Where most RAG bills waste money

Three patterns dominate: over-retrieving (20 chunks where 10 would do; see the retrieval-k question below), re-sending stable retrieved context at full price instead of routing it through prompt caching, and answering every query with a premium model when GPT-4o-mini or Claude Haiku 4.5 handles the routine ones.

Frequently Asked Questions

What is a RAG pipeline and where does the cost come from?

Retrieval-Augmented Generation. For every query: embed → retrieve → stuff context + query into a generation model → answer. Cost comes from three lines: query-embedding tokens, generation input tokens (query + retrieved context), and generation output tokens. Vector DB cost is usually a rounding error — the LLM bill dominates.

How much retrieved context should I budget?

Typical production RAG retrieves 5-20 chunks of 200-500 tokens. Most teams over-retrieve. If you can drop from 20 to 10 chunks with no quality loss, your generation-input bill halves. Test recall at k = 5, 10, 15, 20 on a held-out eval before settling.
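One way to run that sweep, as a minimal sketch: `retrieved` and `relevant` are hypothetical stand-ins for your own eval data, and recall here is measured as hit-rate@k (at least one gold chunk in the top k), a common simplification in RAG evals.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one gold chunk id in the top-k retrieved ids."""
    hits = sum(1 for ids, gold in zip(retrieved, relevant) if gold & set(ids[:k]))
    return hits / len(retrieved)

# Toy eval set: retrieved[i] is the ranked chunk-id list for query i,
# relevant[i] is the set of gold chunk ids for the same query.
retrieved = [["c1", "c7", "c3", "c5"], ["c2", "c9", "c4", "c6"]]
relevant  = [{"c3"}, {"c8"}]

for k in (5, 10, 15, 20):
    print(f"hit-rate@{k}: {hit_rate_at_k(retrieved, relevant, k):.2f}")
```

Pick the smallest k that holds your quality bar; since retrieved context is the dominant cost line, dropping k from 20 to 10 cuts the generation-input bill roughly in half.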

Which embedding model should I pick?

text-embedding-3-small ($0.02/MTok) is the default — cheap and good enough for 80% of production RAG. text-embedding-3-large ($0.13) is worth the 6.5x premium only when retrieval recall is measurably the bottleneck. voyage-3 ($0.18) leads on most public benchmarks — pick it only if recall quality drives revenue.

Why isn't vector database cost in this calculator?

Because for most production RAG, vector DB cost is a rounding error vs LLM generation cost. A typical Pinecone serverless or pgvector cluster runs $50-$500/month even at high QPS, while generation lines on 100K-query/month RAG routinely cross $5K-$20K. Optimize LLM lines first.

Which line of the RAG bill is usually the biggest?

Generation input — the retrieved context the model has to read. For 100K queries/month with 10 chunks of 300 tokens on GPT-4o, the generation-input line is ~$750/month just for the context the model reads. Embedding on the same workload is ~$0.10. That 7,500x cost ratio is real — and it means the embedding line is almost always the wrong place to optimize.

Related free tools

The full AI API cost calculator suite