Top concerns: a 1,840-token system primer, demonstrably ignored by the model, attached to 87% of completions ($1,832/mo); 41% of completions routed to a model 8x more expensive than the task complexity requires ($1,419/mo); and 12,847 deterministic retries masquerading as transient errors ($412/mo plus a 1.8s p99 latency tax). Three independent failure modes — input bloat, routing miss, and silent retry storm — together account for 88% of the monthly waste.
A 1,840-token "system context primer" was prepended to 87% of completions over the 30-day window (1.22M of 1.40M calls). Reasoning-token-overlap analysis on a stratified 4,000-completion sample shows the model referenced primer content in only 3.1% of outputs — the remaining 96.9% of completions paid for input tokens the model demonstrably ignored. The primer included project-wide style guides, contributor guidelines, and architecture overviews that were relevant to a small subset of code-edit prompts but were attached uniformly to classification, search, summary, and tool-call routes.
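For intuition, a crude version of the overlap measurement looks like the sketch below: what fraction of primer trigrams reappear in the completion. Whitespace tokenization and the 3-gram window are simplifications for illustration, not the exact method used in the audit.

```typescript
// Hypothetical sketch: fraction of primer trigrams that show up in a completion.
function trigrams(text: string): Set<string> {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + 2 < tokens.length; i++) {
    grams.add(tokens.slice(i, i + 3).join(" "));
  }
  return grams;
}

function primerOverlap(primer: string, completion: string): number {
  const primerGrams = trigrams(primer);
  const completionGrams = trigrams(completion);
  let hits = 0;
  for (const g of primerGrams) if (completionGrams.has(g)) hits++;
  return primerGrams.size ? hits / primerGrams.size : 0;
}
```

A completion that never draws on the primer scores near zero; only primer-dependent routes (e.g. code edits that cite the style guide) score meaningfully above it.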
41% of completions (574,000 of 1.4M) match a classification heuristic — output is under 200 tokens and the format is yes/no, single-label, or short-extraction (e.g. "is this a bug? Y/N", "label one of: feature/bug/chore", "extract function name"). All are currently routed to gpt-4-turbo at $10/$30 per 1M input/output tokens. A 200-sample holdout re-run on gpt-4o-mini ($0.15/$0.60 per 1M) shows quality parity within 2.1% on the task-success metric — well inside the noise floor of the human-graded eval.
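A minimal sketch of such a router check, consistent with the rule formalized in the next paragraph. The request fields, pattern list, and helper names are assumptions for illustration, not the production routing code.

```typescript
// Hypothetical request shape; the real fields would come from the existing service layer.
interface CompletionRequest {
  prompt: string;
  expectedOutputTokens: number;   // estimated from the prompt template
  hasFunctionCallSchema: boolean; // embedded tool/function-call schema present
}

const LABEL_PATTERNS = [
  /\b(yes\/no|y\/n)\b/i,          // yes/no questions
  /label one of:/i,               // single-label classification
  /\bextract (the )?\w+( name)?\b/i, // short-extraction prompts
];

function routeModel(req: CompletionRequest): string {
  const looksLikeClassification = LABEL_PATTERNS.some((re) => re.test(req.prompt));

  // Cheap model only for short, classification-shaped outputs with no tool schema.
  if (looksLikeClassification && req.expectedOutputTokens < 200 && !req.hasFunctionCallSchema) {
    return "gpt-4o-mini";
  }
  // Code edits, multi-step reasoning, and function-call prompts stay on the large model.
  return "gpt-4-turbo";
}
```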
Recommended routing rule: (output_format_short OR contains_only_one_label_request OR extraction_pattern_match) AND expected_output_tokens < 200 → gpt-4o-mini. Keep gpt-4-turbo for code-edit, multi-step reasoning, and any prompt with embedded function-call schemas. Run a 1-week canary at 10% traffic, monitor task-success and user-feedback rates, then ramp to 100%. Expected savings: $1,419/mo. Implementation: 2-3 engineering hours, including the router heuristic and eval harness.

12,847 retries over 30 days had an underlying error that was deterministic, not transient. The most common pattern: a tool-call response returns malformed JSON (e.g. trailing comma, unescaped newline inside a string), the client treats the parse failure as a transient error, and it retries up to 5 times with exponential backoff. Every retry re-runs the full prompt and bills for the full completion. Re-issuing the same prompt to the same model after a deterministic error produces the same deterministic error — the retries are pure waste, plus a 1.8s p99 user-facing latency tax.
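A minimal sketch of a retry wrapper that refuses to retry deterministic failures. The error taxonomy, backoff constants, and retry cap are illustrative assumptions, not the client's existing code.

```typescript
// Errors we know will recur on an identical re-run (e.g. malformed JSON from a tool call).
class DeterministicError extends Error {}

async function callWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // A JSON parse failure is deterministic: the same prompt fails the same way.
      // Surface it (or repair the JSON locally) instead of re-billing the full prompt.
      if (err instanceof SyntaxError || err instanceof DeterministicError) throw err;
      if (attempt >= maxRetries) throw err;
      // Only genuinely transient errors (timeouts, 429s, 5xx) earn a backoff retry.
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 250));
    }
  }
}
```

For the trailing-comma and unescaped-newline cases specifically, a local repair pass before parsing is cheaper than any retry, since the model is never re-invoked.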
2,103 completions over 30 days have cosine-similarity > 0.92 to a prior completion within the same 24h window, but no caching layer is present in the request path. The duplicates are mostly user-driven re-asks ("explain this function" on the same function within hours), tool-loop retries from agent code, and a small number of test-harness re-runs. None hit a cache — every duplicate pays full token cost.
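A minimal sketch of the missing caching layer, assuming an exact-match, 24h-TTL cache keyed on a hash of model plus prompt. The audit measured near-duplicates by embedding similarity; an exact-hash cache is the cheaper first step and already covers verbatim re-asks and test-harness re-runs.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 24 * 60 * 60 * 1000; // match the 24h duplicate window from the analysis
const cache = new Map<string, { value: string; expiresAt: number }>();

function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}

async function cachedCompletion(
  model: string,
  prompt: string,
  complete: (model: string, prompt: string) => Promise<string>, // existing completion call
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // duplicate pays nothing

  const value = await complete(model, prompt);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```

Catching the looser >0.92-similarity duplicates would require a semantic cache in front of this, but the exact-match layer alone removes the verbatim re-asks at near-zero implementation cost.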
4.7% of streamed completions (2,318 of 49,300 streaming calls) show client-side abort signals in the access log (user closed tab, navigated away, hit cancel button) but the server-side completion continues to the natural stop-token and bills for the full output. The OpenAI/Anthropic SDK call is not wired to honor the inbound AbortController — client abort just drops the response on the floor while the server keeps generating.
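A minimal sketch of the missing wiring with the OpenAI Node SDK, assuming the framework already exposes an AbortSignal that fires when the client disconnects; the handler shape and callback are illustrative.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function streamCompletion(
  prompt: string,
  clientSignal: AbortSignal,          // fires when the browser tab closes or the user cancels
  onDelta: (text: string) => void,    // forwards tokens to the client
): Promise<void> {
  const stream = await openai.chat.completions.create(
    { model: "gpt-4-turbo", stream: true, messages: [{ role: "user", content: prompt }] },
    { signal: clientSignal },          // request option: SDK aborts the upstream call when it fires
  );

  try {
    for await (const chunk of stream) {
      onDelta(chunk.choices[0]?.delta?.content ?? "");
    }
  } catch (err) {
    if (clientSignal.aborted) return;  // user cancelled; upstream generation has stopped
    throw err;
  }
}
```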
Propagate the AbortController signal from the client request through the framework-level proxy down to the OpenAI/Anthropic SDK call. Both SDKs accept a signal parameter natively on their streaming endpoints — the SDK closes the upstream connection and billing stops within ~50ms of the abort. Verify with a synthetic-abort test: open a stream, abort at 50% generation, and assert finish_reason: "abort" in the usage record and final billed tokens < 50% of a typical completion length. Expected savings: $94/mo, plus a user-perceived snappier cancel. Implementation: 1 engineering hour.

The same canonical user query ("Summarize this PR") returned 11 distinct response formats over the 30-day window with no version pinning on either the prompt template or the model. The model alias gpt-4o auto-rolled twice during the period (2026-04-22 and 2026-05-08, per the OpenAI changelog). No git snapshot of the prompt template was kept — drift is invisible until users complain. There is no direct $-impact, but downstream regression risk is medium: any consumer of the PR-summary feature (e.g. CI integration, Slack bot) has no contract guarantee about output shape.
(1) Pin to a dated model snapshot (e.g. gpt-4o-2024-08-06) instead of the moving alias, and schedule a quarterly review to test the latest snapshot against the pinned one. (2) Snapshot every prompt template in git with a version tag; embed prompt_version in each API call's metadata. (3) Add a weekly eval that runs 20 canonical PR fixtures through the prompt and compares output shape against a golden reference. Alert on Levenshtein distance > 0.3 from the golden. No direct $-impact, but this stops the slow erosion of downstream contracts. Implementation: 3-4 engineering hours.
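A minimal sketch of the shape-drift check in step (3), interpreting the 0.3 threshold as a length-normalized Levenshtein distance; fixture loading, the prompt runner, and alert plumbing are assumptions left as parameters.

```typescript
// Standard single-row Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0];
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(dp[i] + 1, dp[i - 1] + 1, prev + (a[i - 1] === b[j - 1] ? 0 : 1));
      prev = tmp;
    }
  }
  return dp[a.length];
}

function normalizedDistance(a: string, b: string): number {
  return levenshtein(a, b) / Math.max(a.length, b.length, 1);
}

// Run one canonical PR fixture and flag drift beyond the agreed threshold.
async function checkFixture(
  runPrompt: (fixture: string) => Promise<string>, // pinned prompt + pinned model snapshot
  fixture: string,
  golden: string,
): Promise<void> {
  const output = await runPrompt(fixture);
  const drift = normalizedDistance(output, golden);
  if (drift > 0.3) {
    console.error(`Output-shape drift ${drift.toFixed(2)} exceeds 0.3 for fixture`);
  }
}
```

Running the 20 fixtures weekly against both the pinned snapshot and the latest alias also gives early warning before the quarterly re-pin review.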