Worked example: 100,000 queries/month, 50 query tokens, 3,000 retrieved-context tokens, and 300 output tokens per query, using text-embedding-3-small and GPT-4o.

| Line item | Token math | Monthly cost | Share |
|---|---|---|---|
| Query embedding | 100K × 50 tok × $0.02/MTok | $0.10 | ~0.01% |
| Generation input (query + context) | 100K × (50 + 3,000) tok × $2.50/MTok | $762.50 | ~72% |
| Generation output | 100K × 300 tok × $10.00/MTok | $300.00 | ~28% |
| Total | — | $1,062.60 | 100% |
Vector DB storage and read costs are excluded; they're typically a rounding error vs the generation-input line. Prompt caching can cut the generation-input bill by 50-90% on workloads where the retrieved context is stable across many queries.
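Where the provider supports it, the savings are easy to estimate up front. Below is a minimal sketch of the arithmetic, assuming Anthropic-style cache pricing (cache writes at roughly 1.25x the base input rate, cache hits at roughly 0.1x) and a 90% hit rate; the multipliers, hit rate, and function name are illustrative assumptions, so swap in your provider's published numbers.

```python
# Rough estimate of the generation-input line with prompt caching.
# ASSUMPTION: Anthropic-style cache pricing (cache writes ~1.25x the base input
# rate, cache hits ~0.1x) and a 90% hit rate -- adjust for your provider.

def cached_gen_input_cost(queries: int, query_tokens: int, context_tokens: int,
                          input_rate: float, hit_rate: float = 0.90,
                          write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Monthly generation-input cost in USD when the retrieved context is cacheable."""
    context = queries * context_tokens * input_rate / 1_000_000
    cached_context = context * (hit_rate * read_mult + (1 - hit_rate) * write_mult)
    query_text = queries * query_tokens * input_rate / 1_000_000  # the queries themselves aren't cached
    return cached_context + query_text

no_cache = 100_000 * (50 + 3_000) * 3.00 / 1_000_000         # Claude Sonnet 4 input rate, no caching
with_cache = cached_gen_input_cost(100_000, 50, 3_000, 3.00)  # same workload, cached context
print(f"no cache: ${no_cache:,.2f}   with cache: ${with_cache:,.2f}")
```

At those assumed rates the generation-input line drops from $915 to roughly $209 per month, a cut of about 77%, consistent with the 50-90% range above.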
| Model | Phase | Input ($/MTok) | Output ($/MTok) |
|---|---|---|---|
| text-embedding-3-small | Embed | $0.02 | — |
| text-embedding-3-large | Embed | $0.13 | — |
| voyage-3 | Embed | $0.18 | — |
| GPT-4o | Generate | $2.50 | $10.00 |
| GPT-4o-mini | Generate | $0.15 | $0.60 |
| Claude Sonnet 4 | Generate | $3.00 | $15.00 |
| Claude Haiku 4.5 | Generate | $1.00 | $5.00 |
| Claude Opus 4.1 | Generate | $15.00 | $75.00 |
| Gemini 2.5 Pro | Generate | $1.25 | $5.00 |
One-page line-item breakdown + one-page "7 ways to cut a RAG bill 50% without changing models" — chunk size, retrieval-k, reranker placement, prompt caching, summarized context, hybrid retrieval, query rewriting. PDF sent to your inbox.
Three lines:

embed_cost = queries × query_tokens × embed_rate / 1,000,000
gen_input_cost = queries × (query_tokens + context_tokens) × gen_input_rate / 1,000,000
gen_output_cost = queries × output_tokens × gen_output_rate / 1,000,000

Example: 100,000 queries/month, 50 query tokens, 3,000 retrieved context tokens, and 300 output tokens per query, embedded with text-embedding-3-small and generated with GPT-4o: $0.10 for embedding, $762.50 for generation input, and $300.00 for generation output, for a total of $1,062.60 per month.
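The same three lines in runnable form; the rates come from the pricing table above and the constants from the worked example.

```python
# Reproduces the worked example: 100K queries/month, 50 query tokens, 3,000
# retrieved-context tokens, 300 output tokens, text-embedding-3-small + GPT-4o.

QUERIES = 100_000
QUERY_TOK, CONTEXT_TOK, OUTPUT_TOK = 50, 3_000, 300
EMBED_RATE = 0.02                        # $/MTok, text-embedding-3-small
GEN_IN_RATE, GEN_OUT_RATE = 2.50, 10.00  # $/MTok, GPT-4o

embed_cost = QUERIES * QUERY_TOK * EMBED_RATE / 1_000_000
gen_input_cost = QUERIES * (QUERY_TOK + CONTEXT_TOK) * GEN_IN_RATE / 1_000_000
gen_output_cost = QUERIES * OUTPUT_TOK * GEN_OUT_RATE / 1_000_000
total = embed_cost + gen_input_cost + gen_output_cost

for name, cost in [("embedding", embed_cost), ("gen input", gen_input_cost),
                   ("gen output", gen_output_cost), ("total", total)]:
    print(f"{name:>10}  ${cost:>9,.2f}  {cost / total:6.2%}")
# embedding $0.10 (0.01%), gen input $762.50 (71.76%), gen output $300.00 (28.23%)
```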
Embedding cost is 0.01% of the bill. The generation-input line (the retrieved context the model has to read) is 72% — and that's where optimization actually moves the needle.
Retrieval-Augmented Generation. For every query: embed → retrieve → stuff context + query into a generation model → answer. Cost comes from three lines: query-embedding tokens, generation input tokens (query + retrieved context), and generation output tokens. Vector DB cost is usually a rounding error — the LLM bill dominates.
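For concreteness, here is a minimal sketch of that loop using the OpenAI Python SDK, with a two-chunk in-memory corpus and brute-force cosine similarity standing in for a real vector store; the chunk texts and model choices are placeholders, not a recommendation.

```python
# Minimal embed -> retrieve -> generate loop over a tiny in-memory corpus.
# CHUNKS, the model IDs, and the brute-force cosine search are placeholders
# for whatever corpus, models, and vector store you actually run.
import numpy as np
from openai import OpenAI

client = OpenAI()
CHUNKS = [
    "Refunds are processed within 14 days of the return arriving.",
    "Standard shipping takes 3-5 business days within the US.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(CHUNKS)  # index once; only queries are embedded per request

def answer(query: str, k: int = 1) -> str:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(CHUNKS[i] for i in np.argsort(-sims)[:k])  # top-k chunks
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```

Each call to `answer` pays all three lines: one small embedding call for the query, the query plus retrieved context as generation input, and the reply as generation output.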
Typical production RAG retrieves 5-20 chunks of 200-500 tokens. Most teams over-retrieve. If you can drop from 20 to 10 chunks with no quality loss, your generation-input bill halves. Test recall at k = 5, 10, 15, 20 on a held-out eval before settling.
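A sketch of that sweep, assuming a `retrieve(query, k)` function that returns the IDs of the top-k chunks from your retriever and a small labeled eval set mapping each query to the chunk IDs that answer it; both are placeholders here.

```python
# Sweep retrieval-k on a held-out eval set before cutting k in production.
# ASSUMPTIONS: `retrieve(query, k)` returns top-k chunk IDs from your own
# retriever, and EVAL_SET pairs each query with the chunk IDs that answer it.

EVAL_SET = [
    ("how do refunds work", {"chunk_012", "chunk_047"}),
    ("international shipping time", {"chunk_233"}),
]

def recall_at_k(retrieve, k: int) -> float:
    scores = []
    for query, relevant_ids in EVAL_SET:
        retrieved_ids = set(retrieve(query, k=k))
        scores.append(len(retrieved_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)

def sweep(retrieve) -> None:
    for k in (5, 10, 15, 20):
        print(f"k={k:>2}  recall={recall_at_k(retrieve, k):.3f}")

# sweep(my_retriever)  # plug in your own retriever here
```

If recall is flat from k = 10 to k = 20, the extra chunks are pure generation-input spend and can be dropped.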
text-embedding-3-small ($0.02/MTok) is the default — cheap and good enough for 80% of production RAG. text-embedding-3-large ($0.13) is worth the 6.5x premium only when retrieval recall is measurably the bottleneck. voyage-3 ($0.18) leads on most public benchmarks — pick it only if recall quality drives revenue.
Because for most production RAG, vector DB cost is a rounding error vs LLM generation cost. A typical Pinecone serverless or pgvector cluster runs $50-$500/month even at high QPS, while generation lines on 100K-query/month RAG routinely cross $5K-$20K. Optimize LLM lines first.
Generation input — the retrieved context the model has to read. For 100K queries/month with 10 chunks of 300 tokens on GPT-4o, the gen-input bill is ~$750/month JUST for context the model reads. Embedding on the same workload is ~$0.10. The 7,500x cost ratio is real — and the embedding line is almost always the wrong place to optimize.