
OpenAI Batch API vs synchronous: when 50% off is genuinely free money

The Batch API gives you 50% off in exchange for a 24-hour SLA. For nightly classification, doc summarization, and eval runs, that SLA literally costs you nothing. Here's the math, the migration code, and the half-dozen gotchas worth knowing.

Published 2026-05-16 ~8 min read By Milo Antaeus
TL;DR: OpenAI's Batch API is the same models, same outputs, half the price — you trade real-time response for a 24-hour completion window. If a nightly cron is looping synchronously over a queue of documents, every token it sends at sync rates is money you didn't need to spend. We flag this pattern in roughly 40% of the audits we run for the $79 LLM Bill X-Ray. The migration is about 20 lines of code.

Here's the thing about Batch: it's not a different model, not a different quality tier, not a "stripped down" anything. It's the exact same gpt-4o or gpt-4o-mini you're calling synchronously, with one difference — OpenAI processes the request whenever it has spare capacity within 24 hours, and you get the result back as a downloadable file. In exchange for that asynchrony, every input and output token bills at 50% of the synchronous rate.

For some workloads that tradeoff is unacceptable (user-facing chat). For others it's free money you're leaving on the table every single night.

The math (boring but load-bearing)

At the published OpenAI prices (mid-2026), the discount is a flat 50% across SKUs, both input and output. No tiers, no minimums, no commitments.

Worked example. A nightly job classifies 40,000 customer support tickets per day. Each ticket: ~800 input tokens (the message plus a 600-token system prompt) and ~30 output tokens (a JSON category label).

Per-day spend on gpt-4o-mini: 40,000 × 800 ≈ 32M input tokens and 40,000 × 30 = 1.2M output tokens. At mini's list rates at time of writing ($0.15 per 1M input tokens, $0.60 per 1M output; halved on Batch), that's about $5.52/day sync versus $2.76/day on Batch.

Savings on this one cron: ~$83/mo. Modest individually. But teams typically have 4-8 of these workloads (nightly summarization, weekly eval suite, monthly retrospective digests, content moderation backlog, doc embedding refresh), and the aggregate frequently lands in the $500-$2,000/mo range. We've seen one audit where Batch alone accounted for $3,400/mo in recoverable spend.
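
If you want to sanity-check the math against your own volumes, here's a minimal sketch — the rates are hard-coded to the mini list prices used above and the volumes are this example's; swap in your own:

# Rough sync-vs-Batch comparison (illustrative numbers, adjust to your pricing page)
REQUESTS_PER_DAY = 40_000
INPUT_TOKENS, OUTPUT_TOKENS = 800, 30   # per request
IN_RATE, OUT_RATE = 0.15, 0.60          # $ per 1M tokens at sync rates

sync_per_day = (
    REQUESTS_PER_DAY * INPUT_TOKENS / 1e6 * IN_RATE
    + REQUESTS_PER_DAY * OUTPUT_TOKENS / 1e6 * OUT_RATE
)
batch_per_day = sync_per_day / 2        # flat 50% Batch discount

print(f"sync:  ${sync_per_day:.2f}/day")
print(f"batch: ${batch_per_day:.2f}/day")
print(f"saved: ~${(sync_per_day - batch_per_day) * 30:.0f}/mo")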

When is Batch genuinely free money?

The decision rule is uncomplicated: if nobody needs the result back in under 24 hours, use Batch. Five canonical use cases:

  1. Nightly classification / tagging. You ingest a day's worth of events at 02:00 UTC, want labels on them by tomorrow's 02:00 UTC run. Perfect Batch fit. The user never sees the labels in real time anyway.
  2. Document summarization backlogs. Marketing has 4,000 PDFs to summarize for an internal search index. Submit once, get all 4,000 back tomorrow. Sync would either take days or hit rate limits.
  3. Eval suite runs. Your CI pipeline runs 2,000 prompt-response evaluations on every model upgrade candidate. Eval runs are inherently batch-y — you submit the full set and wait. Sync gives you no advantage.
  4. Embedding regeneration on a new model version. 500,000 documents need re-embedding after switching from text-embedding-3-small to text-embedding-3-large. Sync would take ~12 hours of saturated rate-limit pushing. Batch wraps it cleanly (the JSONL shape for embeddings is sketched right after this list).
  5. Periodic retrospectives. Weekly digest of "what changed in our codebase," monthly summary of "top customer pain themes." Always have ~24h slack.
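
For use case 4, each line of the batch file targets the /v1/embeddings endpoint instead of /v1/chat/completions. A minimal sketch of one line's dict, mirroring the JSONL lines in the migration sketch below — the doc-{id} naming and the doc fields are assumptions about your own data model:

# One request per document; the Batch API accepts /v1/embeddings bodies too
{
    "custom_id": f"doc-{doc.id}",
    "method": "POST",
    "url": "/v1/embeddings",
    "body": {"model": "text-embedding-3-large", "input": doc.text},
}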

Conversely, the workloads that are NEVER Batch fits: real-time chat, autocomplete, agents in a tool-use loop where the next step depends on this step's output, anything customer-facing with a synchronous response.

The "wait, why didn't we already do this?" pattern: In audits we run, the most common reason teams haven't migrated nightly jobs to Batch is institutional inertia. The original cron was written when Batch didn't exist (pre-April 2024), the engineer who wrote it left, the new engineer assumes "if it's not broken don't touch it." But the line item is silently 2x what it should be, forever.

The migration sketch

A typical sync nightly job looks like this:

# jobs/classify_tickets_nightly.py — SYNC version
import openai

def classify_one(ticket):
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CLASSIFY_SYSTEM_PROMPT},
            {"role": "user", "content": ticket.body},
        ],
        max_tokens=64,
        temperature=0,
    )
    return resp.choices[0].message.content

def run():
    for ticket in fetch_yesterdays_tickets():
        label = classify_one(ticket)
        save_label(ticket.id, label)

The Batch equivalent splits into two phases — submit today, fetch tomorrow:

# jobs/classify_tickets_nightly.py — BATCH version
import openai, json
from pathlib import Path

def build_batch_jsonl(tickets, out_path):
    with open(out_path, "w") as f:
        for t in tickets:
            f.write(json.dumps({
                "custom_id": f"ticket-{t.id}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": CLASSIFY_SYSTEM_PROMPT},
                        {"role": "user", "content": t.body},
                    ],
                    "max_tokens": 64,
                    "temperature": 0,
                },
            }) + "\n")

def submit_today():
    tickets = fetch_yesterdays_tickets()
    jsonl_path = Path("/tmp/batch_in.jsonl")
    build_batch_jsonl(tickets, jsonl_path)
    upload = openai.files.create(file=open(jsonl_path, "rb"), purpose="batch")
    batch = openai.batches.create(
        input_file_id=upload.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    save_pending_batch(batch.id)  # tomorrow's run reads this

def fetch_prior():
    batch_id = load_pending_batch()
    if not batch_id:
        return
    batch = openai.batches.retrieve(batch_id)
    if batch.status != "completed":
        log.warning(f"Batch {batch_id} not done: {batch.status}")
        return  # try again tomorrow, OpenAI keeps it 7 days
    output = openai.files.content(batch.output_file_id).text
    for line in output.strip().split("\n"):
        result = json.loads(line)
        ticket_id = result["custom_id"].removeprefix("ticket-")
        # Failed lines carry error metadata instead of a normal response body (gotcha #5)
        if result.get("error") or result["response"]["status_code"] != 200:
            log.warning(f"Ticket {ticket_id} failed: {result.get('error')}")
            continue
        label = result["response"]["body"]["choices"][0]["message"]["content"]
        save_label(ticket_id, label)
    clear_pending_batch()

def run():
    fetch_prior()  # yesterday's batch
    submit_today() # today's batch

The cron runs once a day. Each invocation completes yesterday's batch and submits today's. Steady state: results land on a ~24-hour delay. For a daily classification job, that delay is invisible — you're still seeing one day of results per day.
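
To make the module directly cron-invokable, a standard entry-point guard at the bottom of the file does it — the schedule and path in the comment are placeholders for whatever your scheduler actually uses:

# at the bottom of jobs/classify_tickets_nightly.py
#   crontab:  0 2 * * *  cd /srv/app && python jobs/classify_tickets_nightly.py
if __name__ == "__main__":
    run()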

The six gotchas worth knowing

  1. The 24-hour window is a ceiling, not a guarantee of speed. Most batches complete in 1-6 hours empirically, but the SLA is 24h. Don't build a system that assumes <2h. Build for 24h and treat anything faster as a bonus.
  2. Each line in the JSONL is a separate billable request. Batch doesn't share input tokens across lines — the system prompt is billed again on every request. If a shared prompt dominates the cost of a high-volume job, prompt caching on the sync API is the relevant lever (a different optimization — OpenAI's is automatic, Anthropic's uses cache_control). Batch deduplication is a roadmap item, not a current feature.
  3. Max 50,000 requests per batch file, max 200MB. For larger jobs, chunk into multiple batches and submit in parallel — each batch gets its own ID (a chunking sketch follows this list).
  4. You're rate-limited on the number of pending tokens across all your batches. Limits scale with your usage tier. For Tier 4+, you can have tens of millions of tokens in flight. New accounts on Tier 1 are limited to a few million.
  5. Failed lines are returned with error metadata, not retried. Your fetch code needs to handle both response and error shapes per line. A common bug is assuming every line has response.body.choices[0].
  6. Output files live for 7 days then get deleted. Your fetch needs to happen within that window or you re-pay to re-process. The pattern above (fetch on the next cron tick) is the safe default.
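
For gotcha #3, a minimal chunking sketch — the 50,000-request cap is the documented limit; the helper name and chunk bookkeeping are illustrative, and it splits by request count only (a very large payload also needs to stay under the 200MB file cap):

import json
import openai

MAX_REQUESTS_PER_BATCH = 50_000  # current per-file request limit

def submit_in_chunks(requests):
    # requests: dicts already in batch-line shape (custom_id, method, url, body)
    batch_ids = []
    for i in range(0, len(requests), MAX_REQUESTS_PER_BATCH):
        chunk = requests[i:i + MAX_REQUESTS_PER_BATCH]
        path = f"/tmp/batch_in_{i // MAX_REQUESTS_PER_BATCH}.jsonl"
        with open(path, "w") as f:
            f.write("\n".join(json.dumps(r) for r in chunk) + "\n")
        upload = openai.files.create(file=open(path, "rb"), purpose="batch")
        batch = openai.batches.create(
            input_file_id=upload.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        batch_ids.append(batch.id)  # persist every ID; the fetch step loops over all of them
    return batch_ids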

When NOT to use Batch even if technically eligible

Two cases come up repeatedly. First, jobs that tolerate a few hours but not a full day: per gotcha #1 the 24-hour window is a ceiling you have to design for, so if a late batch would blow a same-day deadline, keep the job sync with high concurrency (see the eval question in the FAQ). Second, jobs so small that the savings don't cover the operational cost of the two-phase submit/fetch pattern — a cron that spends a few dollars a month isn't worth restructuring.

The $79 X-Ray flags every sync call site that should be Batch

One of the 9 deterministic patterns the analyzer applies: "file path matches /jobs/, /cron/, /nightly/, /batch/, /pipelines/ AND contains chat.completions.create AND doesn't contain batches.create". We flag every such file with the exact file:line and a paste-into-PR migration diff. 14-day money-back guarantee if total surfaced savings < $79.
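
If you want a rough DIY version of that one check before buying, it greps down to a few lines of Python — a sketch of the heuristic, not the analyzer, with the directory list taken from the pattern quoted above:

# Flag job-like files that call the sync chat endpoint but never touch the Batch API
import pathlib, re

JOB_DIRS = re.compile(r"(^|/)(jobs|cron|nightly|batch|pipelines)/")

for path in pathlib.Path(".").rglob("*.py"):
    rel = str(path).replace("\\", "/")
    text = path.read_text(errors="ignore")
    if JOB_DIRS.search(rel) and "chat.completions.create" in text and "batches.create" not in text:
        print(f"candidate for Batch migration: {rel}")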

Order LLM Bill X-Ray — $79 →

Or try the free Mini-Triage first — paste 3 file URLs, get a 1-page diagnosis.

Cross-provider context

Anthropic also ships a Message Batches API at the same 50% discount and 24-hour SLA. The wire format is different (you submit a list of requests in Anthropic's messages.create shape rather than uploading a JSONL file) but the cost economics are identical. If your codebase mixes OpenAI and Anthropic, audit both stacks — the same nightly-job-not-using-batch pattern applies symmetrically.

Google Vertex/Gemini also has a batch prediction mode with similar discount structure. The 24h-SLA-for-50%-off pattern is now an industry-standard offering rather than an OpenAI-only thing.

FAQ

Will the model output be exactly the same as sync?

Yes, modulo normal sampling variance. Same model, same weights, same tokenizer. At temperature=0 outputs are usually identical, though determinism isn't guaranteed in either mode.

Does Batch support structured outputs / function calling?

Yes. The body in each JSONL line accepts the full set of parameters — response_format, tools, tool_choice, etc. — that you'd pass to the sync endpoint.
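
For example, a batch line with a strict JSON schema attached — the schema here is a stand-in; it's just the regular structured-outputs response_format nested inside the per-line body:

# Same line shape as build_batch_jsonl, with response_format added to the body
{
    "custom_id": f"ticket-{t.id}",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [...],  # as in build_batch_jsonl above
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "ticket_label",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"category": {"type": "string"}},
                    "required": ["category"],
                    "additionalProperties": False,
                },
            },
        },
    },
}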

Does Batch count toward my rate limits?

Separately. Batch has its own per-tier limits (enqueued tokens, requests per batch). Your sync TPM/RPM is unaffected by batches in flight. This is partly why Batch exists — it lets OpenAI smooth out load.

What about latency-sensitive evals that I still want async?

If your eval suite tolerates 1-2 hours but not 24, run them sync with high concurrency. Batch is for jobs that are genuinely OK with overnight.

How is this different from your other blog post?

The Anthropic bill-doubled post covers cache_control as the dominant leak on Claude codebases. This post covers Batch as the dominant leak on OpenAI codebases that have nightly cron jobs. The 5-patterns post covers both plus 3 others in one shorter walkthrough of a real public-repo audit.
