Worked example: 100,000 queries/month, 50 query tokens, 3,000 retrieved-context tokens, and 300 output tokens per query, using text-embedding-3-small and GPT-4o.

| Line item | Token math | Monthly cost | Share |
|---|---|---|---|
| Query embedding | 100K × 50 tok × $0.02/MTok | $0.10 | ~0.01% |
| Generation input (query + context) | 100K × (50 + 3,000) tok × $2.50/MTok | $762.50 | ~72% |
| Generation output | 100K × 300 tok × $10.00/MTok | $300.00 | ~28% |
| Total | — | $1,062.60 | 100% |
Vector DB storage and read costs are excluded; they're typically a rounding error vs the generation-input line. Prompt caching can cut the generation-input bill by 50-90% on workloads where the retrieved context is stable across many queries.
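Where the provider supports it, the savings are easy to estimate up front. Below is a minimal sketch of the arithmetic, assuming Anthropic-style cache pricing (cache writes at roughly 1.25x the base input rate, cache hits at roughly 0.1x) and a 90% hit rate; the multipliers, hit rate, and function name are illustrative assumptions, so swap in your provider's published numbers.

```python
# Rough estimate of the generation-input line with prompt caching.
# ASSUMPTION: Anthropic-style cache pricing (cache writes ~1.25x the base input
# rate, cache hits ~0.1x) and a 90% hit rate -- adjust for your provider.

def cached_gen_input_cost(queries: int, query_tokens: int, context_tokens: int,
                          input_rate: float, hit_rate: float = 0.90,
                          write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Monthly generation-input cost in USD when the retrieved context is cacheable."""
    context = queries * context_tokens * input_rate / 1_000_000
    cached_context = context * (hit_rate * read_mult + (1 - hit_rate) * write_mult)
    query_text = queries * query_tokens * input_rate / 1_000_000  # the queries themselves aren't cached
    return cached_context + query_text

no_cache = 100_000 * (50 + 3_000) * 3.00 / 1_000_000         # Claude Sonnet 4 input rate, no caching
with_cache = cached_gen_input_cost(100_000, 50, 3_000, 3.00)  # same workload, cached context
print(f"no cache: ${no_cache:,.2f}   with cache: ${with_cache:,.2f}")
```

At those assumed rates the generation-input line drops from $915 to roughly $209 per month, a cut of about 77%, consistent with the 50-90% range above.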
| Model | Phase | Input ($/MTok) | Output ($/MTok) |
|---|---|---|---|
| text-embedding-3-small | Embed | $0.02 | — |
| text-embedding-3-large | Embed | $0.13 | — |
| voyage-3 | Embed | $0.18 | — |
| GPT-4o | Generate | $2.50 | $10.00 |
| GPT-4o-mini | Generate | $0.15 | $0.60 |
| Claude Sonnet 4 | Generate | $3.00 | $15.00 |
| Claude Haiku 4.5 | Generate | $1.00 | $5.00 |
| Claude Opus 4.1 | Generate | $15.00 | $75.00 |
| Gemini 2.5 Pro | Generate | $1.25 | $5.00 |
One-page line-item breakdown + one-page "7 ways to cut a RAG bill 50% without changing models" — chunk size, retrieval-k, reranker placement, prompt caching, summarized context, hybrid retrieval, query rewriting. PDF sent to your inbox.
Three lines:

embed_cost = queries × query_tokens × embed_rate / 1,000,000
gen_input_cost = queries × (query_tokens + context_tokens) × gen_input_rate / 1,000,000
gen_output_cost = queries × output_tokens × gen_output_rate / 1,000,000

Example: 100,000 queries/month, 50 query tokens, 3,000 retrieved context tokens, and 300 output tokens per query, embedded with text-embedding-3-small and generated with GPT-4o: $0.10 for embedding, $762.50 for generation input, and $300.00 for generation output, for a total of $1,062.60 per month.
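The same three lines in runnable form; the rates come from the pricing table above and the constants from the worked example.

```python
# Reproduces the worked example: 100K queries/month, 50 query tokens, 3,000
# retrieved-context tokens, 300 output tokens, text-embedding-3-small + GPT-4o.

QUERIES = 100_000
QUERY_TOK, CONTEXT_TOK, OUTPUT_TOK = 50, 3_000, 300
EMBED_RATE = 0.02                        # $/MTok, text-embedding-3-small
GEN_IN_RATE, GEN_OUT_RATE = 2.50, 10.00  # $/MTok, GPT-4o

embed_cost = QUERIES * QUERY_TOK * EMBED_RATE / 1_000_000
gen_input_cost = QUERIES * (QUERY_TOK + CONTEXT_TOK) * GEN_IN_RATE / 1_000_000
gen_output_cost = QUERIES * OUTPUT_TOK * GEN_OUT_RATE / 1_000_000
total = embed_cost + gen_input_cost + gen_output_cost

for name, cost in [("embedding", embed_cost), ("gen input", gen_input_cost),
                   ("gen output", gen_output_cost), ("total", total)]:
    print(f"{name:>10}  ${cost:>9,.2f}  {cost / total:6.2%}")
# embedding $0.10 (0.01%), gen input $762.50 (71.76%), gen output $300.00 (28.23%)
```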
Embedding cost is 0.01% of the bill. The generation-input line (the retrieved context the model has to read) is 72% — and that's where optimization actually moves the needle.
Retrieval-Augmented Generation. For every query: embed → retrieve → stuff context + query into a generation model → answer. Cost comes from three lines: query-embedding tokens, generation input tokens (query + retrieved context), and generation output tokens. Vector DB cost is usually a rounding error — the LLM bill dominates.
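For concreteness, here is a minimal sketch of that loop using the OpenAI Python SDK, with a two-chunk in-memory corpus and brute-force cosine similarity standing in for a real vector store; the chunk texts and model choices are placeholders, not a recommendation.

```python
# Minimal embed -> retrieve -> generate loop over a tiny in-memory corpus.
# CHUNKS, the model IDs, and the brute-force cosine search are placeholders
# for whatever corpus, models, and vector store you actually run.
import numpy as np
from openai import OpenAI

client = OpenAI()
CHUNKS = [
    "Refunds are processed within 14 days of the return arriving.",
    "Standard shipping takes 3-5 business days within the US.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(CHUNKS)  # index once; only queries are embedded per request

def answer(query: str, k: int = 1) -> str:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(CHUNKS[i] for i in np.argsort(-sims)[:k])  # top-k chunks
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```

Each call to `answer` pays all three lines: one small embedding call for the query, the query plus retrieved context as generation input, and the reply as generation output.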
Typical production RAG retrieves 5-20 chunks of 200-500 tokens. Most teams over-retrieve. If you can drop from 20 to 10 chunks with no quality loss, your generation-input bill halves. Test recall at k = 5, 10, 15, 20 on a held-out eval before settling.
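A sketch of that sweep, assuming a `retrieve(query, k)` function that returns the IDs of the top-k chunks from your retriever and a small labeled eval set mapping each query to the chunk IDs that answer it; both are placeholders here.

```python
# Sweep retrieval-k on a held-out eval set before cutting k in production.
# ASSUMPTIONS: `retrieve(query, k)` returns top-k chunk IDs from your own
# retriever, and EVAL_SET pairs each query with the chunk IDs that answer it.

EVAL_SET = [
    ("how do refunds work", {"chunk_012", "chunk_047"}),
    ("international shipping time", {"chunk_233"}),
]

def recall_at_k(retrieve, k: int) -> float:
    scores = []
    for query, relevant_ids in EVAL_SET:
        retrieved_ids = set(retrieve(query, k=k))
        scores.append(len(retrieved_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)

def sweep(retrieve) -> None:
    for k in (5, 10, 15, 20):
        print(f"k={k:>2}  recall={recall_at_k(retrieve, k):.3f}")

# sweep(my_retriever)  # plug in your own retriever here
```

If recall is flat from k = 10 to k = 20, the extra chunks are pure generation-input spend and can be dropped.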
text-embedding-3-small ($0.02/MTok) is the default — cheap and good enough for 80% of production RAG. text-embedding-3-large ($0.13) is worth the 6.5x premium only when retrieval recall is measurably the bottleneck. voyage-3 ($0.18) leads on most public benchmarks — pick it only if recall quality drives revenue.
Because for most production RAG, vector DB cost is a rounding error vs LLM generation cost. A typical Pinecone serverless or pgvector cluster runs $50-$500/month even at high QPS, while generation lines on 100K-query/month RAG routinely cross $5K-$20K. Optimize LLM lines first.
Generation input — the retrieved context the model has to read. For 100K queries/month with 10 chunks of 300 tokens on GPT-4o, the gen-input bill is ~$750/month JUST for context the model reads. Embedding on the same workload is ~$0.10. The 7,500x cost ratio is real — and the embedding line is almost always the wrong place to optimize.