The Great Token Heist of ’26

The 45% Token Inflation That Wasn't a Price Change

Your LLM provider charges by the token, but a token is not a portable unit of work, a unit of meaning, or any unit you can compare across vendors. It only has meaning inside one vendor's system and loses it the moment you step outside. The token is a currency: vendors mint it, vendors re-denominate it, and the exchange rate with English (meaning how many tokens it takes to express the same sentence) is theirs to set. So when two vendors quote the same dollars-per-token, they are still quoting in different currencies.

On April 17, 2026, the day after Anthropic shipped Claude Opus 4.7, a Hacker News user named anabranch posted a public leaderboard they had built at tokens.billchambers.me with crowdsourced before-and-after token counts on the same prompts run through Opus 4.6 and Opus 4.7. The thread title, "Opus 4.7 to 4.6 Inflation is ~45%", climbed to 117 points before lunch. anabranch's summary: "It can be more than 2x for small prompts."

The comments filled with developers reporting specific impacts on their bills. hgoel, on the Claude Max 5x plan, hit their five-hour usage limit in under two. tiffanyh exceeded the weekly limit after seven prompts on a single-file HTML/CSS/JS project under 300 lines. KellyCriterion: "killed my weekly limit with just three prompts." manmal put the number on it: "The same prompt costs roughly 30% more now, for input."

None of this was a price change. Opus 4.7 launched at the same $5 per million input tokens and $25 per million output as 4.6.

Independent measurements followed within days. Simon Willison upgraded his Claude Token Counter to compare the two models on the same input and posted: "Opus 4.7 does appear to use 1.46x times the tokens for text", with up to 3× for images, priced the same as Opus 4.6 on a per-token basis. OpenRouter ran the numbers across more than a million real customer requests and published a bucketed breakdown: 32–45% more native tokens for equivalent text (highest on prompts under 2,000 tokens), with a 12–27% increase in net cost after prompt caching absorbed some of the inflation.

What changed was not the price per token. It was the token itself. The same prompt that cost you X tokens on 4.6 now costs roughly 1.4X on 4.7: same input, same answer, same work. The vendor changed the resolution. If that sounds like a camera going from 12 to 18 megapixels, Anthropic would accept that comparison: more resolution, better model, same price per unit. But a megapixel upgrade is a one-time capability you choose. This tokenizer change bills you per pixel on every shot, cannot be switched off, and shipped with no announcement. And more megapixels never guaranteed a better photo, only a bigger file. More tokens do not guarantee a better answer. They guarantee a bigger bill.

The currency got re-denominated

Buried in Anthropic's migration guide is one easy-to-miss sentence: Opus 4.7 uses a new tokenizer that "may use roughly 1x to 1.35x as many tokens" as previous models, varying by content. The price per token did not move; the number of tokens required to express the same English sentence did.
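The arithmetic of a re-denomination fits in a few lines. The sketch below is illustrative only: the function names and the 10,000-token prompt are assumptions, while the $5 per million input price and the 1.35× and 1.46× multipliers are the figures quoted above.

```python
def input_cost_usd(tokens: float, price_per_mtok: float) -> float:
    """Dollar cost of `tokens` tokens at a list price quoted per million tokens."""
    return tokens * price_per_mtok / 1_000_000


def purchasing_power_loss(inflation: float) -> float:
    """Share of text a fixed token budget no longer covers.

    inflation = new token count / old token count for the same text.
    """
    return 1.0 - 1.0 / inflation


OLD_TOKENS = 10_000  # hypothetical prompt size under the old tokenizer
INPUT_PRICE = 5.0    # dollars per million input tokens, unchanged across versions

for factor in (1.0, 1.35, 1.46):
    cost = input_cost_usd(OLD_TOKENS * factor, INPUT_PRICE)
    loss = purchasing_power_loss(factor)
    print(f"{factor:.2f}x tokens -> ${cost:.4f} per request, "
          f"{loss:.0%} less text per dollar")
```

The list price never appears as a variable that changes. Only the token count does, and at 1.35× to 1.46× the same dollar covers roughly 26% to 32% less text.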

The honest defense is that this is an upgrade: a finer tokenizer can mean better resolution and better answers, and you simply pay in more tokens for them. But "better answers" is a measurable claim. If the quality is real, it shows up as a higher success rate and the cost per finished task still holds. If it does not, you are paying more tokens for the same result. The catch is that they shipped the cost with no announcement and no opt-out, while leaving the "better outcomes" half for you to verify on your own bill.

If this happened in a fiat currency, we would call it a redenomination: the kind of operation central banks use when they need to recalibrate without admitting they are recalibrating. Your salary stays "the same" in the new currency, but the new currency buys 30% less coffee. Anthropic, of course, does not have the obligations of a central bank, so there is no announcement, no press conference, and no notice period. The tokenizer is part of the model. It changes when the model changes.

The Hacker News reactions were not just complaints about bills. They were the predictable response to opacity. Shailendra_S, a bootstrapped founder, argued that 45% inflation breaks the unit economics for most indie products and pushes builders toward open models, or back to the more token-efficient Opus 4.5. dakiol announced his team had dropped Claude entirely. Boris Cherny from the Claude Code team showed up in the thread, but to address a separate complaint about over-fixated safety warnings, not the cost question. The cost question is hard to address. The currency really did get smaller, and the vendor can keep calling it "the same price."

"Nobody raised prices. The currency got smaller."

There is no standard token

Anthropic's redenomination is not unusual. It is the rule, dressed up as an exception.

Tokens are made by tokenizers, which are deterministic functions that split text into integer IDs. Each vendor trains its own. OpenAI's tiktoken is open-source: you can download the vocabulary files and count locally before you pay. GPT-3.5 and GPT-4 use cl100k_base, with 100,256 entries. GPT-4o, GPT-5, and the o-series use o200k_base, with 200,019. The transition from one to the other in 2024 was not priced as a change. It was, of course, a change.

The pattern repeats across vendors. Llama 2 to Llama 3 jumped from a 32K to a 128K vocabulary, and Meta says that buys up to 15% fewer tokens for the same text. Mistral's Tekken tokenizer, introduced for NeMo and Pixtral, compresses code about 30% more efficiently than its SentencePiece predecessor, and Korean and Arabic by 2–3×. Gemini's tokenizer, currently a 262,144-token SentencePiece model shared with Gemma 3, has rotated through versions across Gemini 1.5, 2.x, and 3.x. DeepSeek's V3 paper documents a 100K → 128K vocabulary expansion from V2.

The vendor with the most aggressive policy of opacity is Anthropic. The legacy claude-v1-tokenization.json was published for Claude 1 and never again. For Claude 3, 3.5, 4, and 4.7, the only authoritative way to count tokens is to call Anthropic's count_tokens endpoint. The repository javirandor/anthropic-tokenizer is the community workaround, reverse-engineered by asking Claude to repeat text and observing what comes back. It is approximate, and for many teams it is also the only way to estimate Claude cost without a round-trip to Anthropic's servers.

You can quote two vendors at the same dollar-per-million-token. You cannot quote them in the same units.

The exchange rate is not the same for everyone

If tokens are a currency, English speakers got a significantly better deal on it.

Petrov, La Malfa, Torr, and Bibi published "Language Model Tokenizers Introduce Unfairness Between Languages" at NeurIPS 2023. They evaluated 17 tokenizers across 200+ languages on FLORES-200, the standard multilingual parallel corpus. The single most cited number from that paper: the same translated sentence can produce "differences up to 15 times" in token count across languages, and the disparity persists even on tokenizers explicitly designed for multilingual support.

Yennie Jun's parallel study across 52 languages and 2,033 messages quantified the everyday cost. English averaged 7 tokens per message. Spanish, French, and Portuguese came in at similar rates. Hindi and Bengali ran approximately 5× English. Armenian ran 9×. Burmese reached 72 tokens, roughly 10× English for the same sentence. Her summary: "some languages require up to 10 times more tokens."

Newer tokenizers narrow the gap without closing it. Tamil's நீ ("you") was four tokens in cl100k_base and is a single token in o200k_base. Gemini's larger SentencePiece vocabulary handles CJK (Chinese, Japanese, Korean) noticeably better than the GPT-4 generation did. But the underlying training distribution is still English-dominant, and the structural penalty has not been engineered away.

What this means in practice: if your customer base is Hindi-speaking and your model uses an English-trained tokenizer, you are paying roughly four to five times what an equivalent English workload costs you, for content of equivalent meaning. The vendor did not charge you more per token. The exchange rate did the charging.
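As a rough illustration of that exchange rate, the sketch below prices the same message volume in three languages. The per-message token averages are the study figures quoted above (the Hindi entry is derived from the "approximately 5×" ratio rather than measured); the message volume, the price, and the function names are hypothetical.

```python
# Average tokens per message, from the study figures quoted above.
# The Hindi entry is derived from the "approximately 5x English" ratio.
TOKENS_PER_MESSAGE = {"English": 7, "Hindi": 35, "Burmese": 72}


def workload_cost_usd(tokens_per_message: float, messages: int,
                      price_per_mtok: float) -> float:
    """Input cost of `messages` messages at a flat per-million-token price."""
    return tokens_per_message * messages * price_per_mtok / 1_000_000


def exchange_rate(language: str, baseline: str = "English") -> float:
    """Tokens needed in `language` per token needed in `baseline`."""
    return TOKENS_PER_MESSAGE[language] / TOKENS_PER_MESSAGE[baseline]


MESSAGES = 1_000_000  # hypothetical monthly volume
PRICE = 5.0           # hypothetical dollars per million input tokens

for lang in TOKENS_PER_MESSAGE:
    cost = workload_cost_usd(TOKENS_PER_MESSAGE[lang], MESSAGES, PRICE)
    print(f"{lang}: {exchange_rate(lang):.1f}x English, ${cost:.2f}")
```

For a real audit, replace the table with counts taken from your own prompts using each vendor's official counter; published averages only indicate the order of magnitude.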

The fees you cannot see on the menu

Once you accept that the token is the currency and the rate is the vendor's, the rest of the LLM bill becomes simultaneously more legible and stranger.

Reasoning tokens. Newer models think before they answer, and the thinking counts. OpenAI's o-series, DeepSeek's V4 in its thinking mode, Anthropic's extended-thinking models (Opus 4.7 by default now runs its xhigh effort tier in Claude Code), and Gemini's thinking modes all generate internal chain-of-thought that gets billed at the output rate. OpenAI exposes a count of reasoning tokens in output_tokens_details.reasoning_tokens, though the actual chain-of-thought is summarized or hidden. Anthropic in Opus 4.7 omits thinking content from the response by default, but the count is billed whether or not you request a summary. A 10,000-token reasoning budget on GPT-5.5 at $30/M output costs thirty cents before the user sees a word. Most teams running translation or classification through reasoning models are paying many times what they should because nobody turned thinking off.
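A minimal sketch of the thinking bill, assuming you have already pulled the reasoning-token count and the visible output count from each response's usage block (field names differ by vendor, so the functions below take plain integers). The function names and the 20-token classification answer are hypothetical; the 10,000-token budget and $30/M output price are the figures above.

```python
def reasoning_cost_usd(reasoning_tokens: int, output_price_per_mtok: float) -> float:
    """Dollars spent on hidden reasoning, billed at the output rate."""
    return reasoning_tokens * output_price_per_mtok / 1_000_000


def reasoning_share(reasoning_tokens: int, visible_output_tokens: int) -> float:
    """Fraction of billed output tokens that the user never sees."""
    total = reasoning_tokens + visible_output_tokens
    return reasoning_tokens / total if total else 0.0


# The example from the text: a 10,000-token reasoning budget at $30/M output.
cost = reasoning_cost_usd(10_000, 30.0)

# Hypothetical classification call that returns a 20-token label.
share = reasoning_share(10_000, 20)

print(f"reasoning cost: ${cost:.2f}, share of billed output: {share:.1%}")
```

Tracking the share per workload, not per account, is what surfaces the translation and classification jobs that do not need thinking at all.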

Caching, which sometimes is not a discount. OpenAI applies cache discounts automatically, up to 90% off on cached input, with no write premium and no code changes required. Anthropic uses explicit cache_control markers: reads are 10% of base input (the deepest discount on the market), but writes cost 1.25× base on a five-minute TTL or 2× base on a one-hour TTL. Anthropic's own pricing page says the math amortizes after one cache hit on the five-minute tier, or two on the hour-long one. But if your prefix, the stable front of the prompt you mark for caching, gets written and rarely re-read, caching costs you more than not caching. Gemini layers in a per-hour storage fee ($4.50 per million tokens cached per hour on 3.1 Pro), and the cache-read rate also doubles above 200K context, so the long-context cliff applies to caching too.
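The break-even claim can be checked directly. The sketch below works in units of the base input price and uses the multipliers quoted above (1.25× or 2× to write, 0.10× to read). It assumes one write followed by some number of reads within the TTL; the function names are illustrative.

```python
from typing import Optional


def cost_with_cache(reads: int, write_multiplier: float,
                    read_multiplier: float = 0.10) -> float:
    """Prefix cost, in units of base input price: one write plus `reads` hits."""
    return write_multiplier + reads * read_multiplier


def cost_without_cache(reads: int) -> float:
    """Same traffic with caching off: the prefix is sent at full price each time."""
    return 1.0 + reads


def break_even_reads(write_multiplier: float, read_multiplier: float = 0.10,
                     max_reads: int = 1000) -> Optional[int]:
    """Smallest number of cache hits at which caching is strictly cheaper."""
    for reads in range(max_reads + 1):
        if cost_with_cache(reads, write_multiplier, read_multiplier) < cost_without_cache(reads):
            return reads
    return None


print("5-minute TTL break-even hits:", break_even_reads(1.25))
print("1-hour TTL break-even hits:", break_even_reads(2.0))
```

With zero reads the cached path always costs more than the uncached one, which is the case the paragraph warns about: a prefix that is written and then expires unread.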

Context cliffs. Most vendors charge a flat per-token rate across their full context window. A few do not. Gemini 3.1 Pro is $2 in / $12 out per MTok up to 200K, then $4 in / $18 out above. OpenAI's GPT-5.5 jumps to $10 in / $45 out above 272K input tokens. Anthropic's Opus 4.6 and 4.7 are flat across their 1M window, with one caveat: there is a 1.1× surcharge if you request US-only inference routing. A RAG pipeline whose prompt-length distribution flirts with these thresholds can change cheapest-vendor depending on the day's typical input length.
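A tiered rate is easy to model once you decide how the tier applies. The sketch below assumes that when the input exceeds the threshold, the higher rate applies to the whole request, input and output alike; that is an assumption to verify against the vendor's pricing page. The class and function names are hypothetical, and the rates are the Gemini 3.1 Pro figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredRate:
    threshold_tokens: int  # input size above which the long-context tier applies
    base_in: float         # dollars per million input tokens at or below threshold
    base_out: float        # dollars per million output tokens at or below threshold
    long_in: float         # dollars per million input tokens above threshold
    long_out: float        # dollars per million output tokens above threshold


def request_cost_usd(rate: TieredRate, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request; the tier is chosen by input size (assumed whole-request)."""
    over = input_tokens > rate.threshold_tokens
    price_in = rate.long_in if over else rate.base_in
    price_out = rate.long_out if over else rate.base_out
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


GEMINI_3_1_PRO = TieredRate(200_000, 2.0, 12.0, 4.0, 18.0)

for n in (199_000, 200_000, 201_000):
    print(n, f"${request_cost_usd(GEMINI_3_1_PRO, n, 1_000):.4f}")
```

Under that assumption, a single extra input token at the threshold roughly doubles the cost of the request, which is why the shape of your prompt-length distribution matters more than its mean.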

Batch. This is the one place the price card stops misrepresenting your actual cost. Google's batch API is half the standard rate; OpenAI and Anthropic match. If your workload tolerates async latency, batch is the single most impactful switch you can make, more than swapping vendors, and usually more than switching models within a vendor. The number of production pipelines that are eligible for batch and not using it is alarming.

What the price card actually tells you

Pull all of this together and the dollar-per-million-token figure on a vendor's pricing page tells you exactly one thing: the dollar cost of one million units of that vendor's specific currency at this moment for the specific model variant and inference region and effort tier and TTL choice you picked. It does not tell you the cost of doing your work. The translation from "currency" to "work" sits inside the tokenizer, the language mix of your content, your reasoning-token policy, your cache hit pattern, your batch share, and your prompt-length distribution against context cliffs.

The honest unit for cross-vendor comparison is not cost-per-token. It is cost-per-task: what you pay, end-to-end, for a unit of output your users would accept. The formula is:

cost_per_task = (price_per_token × tokens_per_request × (1 + reasoning_overhead) × (1 − cache_hit_rate × cache_discount) × (1 − batch_share × 0.5)) ÷ task_success_rate

Every term is measurable. Token count comes from each vendor's official counter. Reasoning overhead comes from the usage block on each response. Cache hit rate and batch share come from your own logs. Task success rate comes from your eval. The headline $/MTok is one term on the right side, not the answer to the equation.
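The formula translates directly into a function. This is a sketch under the same simplifications as the formula itself: a single blended per-token price, and a batch discount of 50% by default. The parameter names mirror the formula; the example values in the usage lines are hypothetical.

```python
def cost_per_task(price_per_token: float,
                  tokens_per_request: float,
                  reasoning_overhead: float,
                  cache_hit_rate: float,
                  cache_discount: float,
                  batch_share: float,
                  task_success_rate: float,
                  batch_discount: float = 0.5) -> float:
    """Expected dollars per accepted task, following the formula above."""
    if not 0.0 < task_success_rate <= 1.0:
        raise ValueError("task_success_rate must be in (0, 1]")
    per_request = (
        price_per_token
        * tokens_per_request
        * (1.0 + reasoning_overhead)
        * (1.0 - cache_hit_rate * cache_discount)
        * (1.0 - batch_share * batch_discount)
    )
    return per_request / task_success_rate


# Hypothetical comparison: same $5/M list price, different everything else.
vendor_a = cost_per_task(5e-6, 10_000, 0.0, 0.5, 0.9, 0.3, 0.95)
vendor_b = cost_per_task(5e-6, 14_000, 0.5, 0.0, 0.0, 0.0, 0.95)
print(f"A: ${vendor_a:.4f} per task, B: ${vendor_b:.4f} per task")
```

Two vendors with an identical price card can land far apart once the other terms are filled in from real logs, which is the point of measuring each one separately.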

Cost is not the only thing that moves. More tokens means a longer sequence, which brings more latency, more KV-cache memory, and a higher chance of crossing the 200K/272K context cliffs in ways that can change how an agent loop behaves. Tokenizer inflation lands in your latency and memory budgets, not just the bill.

The state of the major currencies, May 2026

Here is what each vendor is actually doing under their published rate card, in the dimensions that matter for real bills:

| Vendor | Can you count tokens locally? | How stable is the tokenizer across versions? | Caching | Long-context cliff | Batch |
| --- | --- | --- | --- | --- | --- |
| OpenAI | Yes: tiktoken is open | Changed once: GPT-4 → GPT-4o (cl100k → o200k) | Automatic, ≤90% off, no write premium | Flat to 1M on GPT-4.1; 272K cliff on GPT-5.5 (≈2× input, 1.5× output) | 50% off |
| Anthropic | No: only via count_tokens API | Changed 4.6 → 4.7: 1.0–1.35× per Anthropic, 1.46× measured by Willison | Explicit cache_control; 0.10× reads, 1.25–2× write premium | Flat to 1M; 1.1× surcharge on US-only inference | 50% off |
| Google Gemini | Partial: Gemma 3 SentencePiece (262,144 vocab) is published | Differs across 1.5 / 2.x / 3.x lines | Explicit + per-hour storage fee; cache reads also double above 200K | 2× input, 1.5× output above 200K on 3.1 Pro | 50% off |
| DeepSeek | Yes: BBPE (byte-level BPE) published | V2 → V3 expanded 100K → 128K; V4 added dual thinking/non-thinking modes | Automatic disk cache, no write premium | Flat | Off-peak discount |
| Meta Llama | Yes: tokenizer.json ships with the model | Llama 2 → 3: 32K → 128,256 | Host-dependent (Bedrock, Together, Groq differ) | Host-dependent | Host-dependent |
| Mistral | Yes | SentencePiece → Tekken (NeMo, Pixtral) | Host-dependent | Host-dependent | Host-dependent |

The vendors split into two groups. Open-tokenizer providers (OpenAI, DeepSeek, Meta, Mistral) let you compute cost client-side before you send the request, which is useful for budgeting, useful for routing, and useful when you are trying to A/B against another model. Closed-tokenizer providers (Anthropic, Gemini in part) require a round-trip to find out what the request will cost. That asymmetry is itself a feature of the pricing, even though it does not appear on any pricing page.

Auditing your own currency exposure

The practical version of this post is a worksheet, not a checklist. Sit with your team and answer:

  1. What does your token actually buy? Take 50–100 real prompts from production logs. Run each through every candidate vendor's official token counter. Compare the counts. A 20% gap on your real content is normal; a 50% gap on non-English content is normal. Now multiply each count by each vendor's $/MTok and look at the result. This is the real per-call cost, not the price card cost.

  2. How much are you paying to think? For every reasoning-capable model you use, pull the reasoning-token count from the usage block. Multiply by output price. If a non-trivial fraction of your bill is thinking tokens on workloads that do not need them, such as translation, classification, extraction, or short Q&A, turn thinking off. This is almost always the largest single optimization available.

  3. Does caching cost you or save you? Pull the actual ratio of cache writes to cache reads from your logs. On Anthropic, if your prefix gets written more often than it gets read, the write premium is making things worse, not better. On Gemini, work out whether the per-hour storage fee on rarely-hit caches is worth the discount on the hits.

  4. Where in your prompt-length distribution do the cliffs sit? Histogram your input lengths against 200K (Gemini) and 272K (GPT-5.5). If the 90th percentile is anywhere near those numbers, you are paying tier-2 rates more often than you realize.

  5. What fraction of your traffic could be batch? If 30% of your work is async-tolerant and you are not batching it, you are leaving ~15% of your total bill on the table.

  6. What's your real success rate per dollar? A cheaper-per-token model that fails your eval 20% more often can be more expensive in production once rejected outputs and retries are counted. Build the cost-per-accepted-task number. Reorder vendor preferences accordingly.
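Items 4 and 5 of the worksheet reduce to a few lines of arithmetic over your logs. The sketch below is a starting point only: the function names and the sample prompt lengths are hypothetical, and the batch estimate assumes async-tolerant requests cost about the same on average as the rest of your traffic.

```python
from typing import Sequence


def share_above_cliff(input_lengths: Sequence[int], threshold: int) -> float:
    """Fraction of requests whose input length exceeds a pricing threshold."""
    if not input_lengths:
        return 0.0
    return sum(1 for n in input_lengths if n > threshold) / len(input_lengths)


def batch_savings_fraction(batch_share: float, batch_discount: float = 0.5) -> float:
    """Share of the total bill saved by moving `batch_share` of spend to batch."""
    return batch_share * batch_discount


# Hypothetical sample of input lengths pulled from production logs.
lengths = [40_000, 120_000, 190_000, 210_000, 260_000, 280_000]

for name, cliff in (("200K", 200_000), ("272K", 272_000)):
    print(f"share above {name}: {share_above_cliff(lengths, cliff):.0%}")

print(f"batch savings at 30% share: {batch_savings_fraction(0.30):.0%}")
```

Run the cliff check against a full day or week of logs, not a sample of typical prompts; the long tail is where the tier-2 charges come from.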

Why this state of affairs persists

It is tempting to read all of the above as vendors being deliberately opaque. Mostly, they are not. The Opus 4.6 → 4.7 tokenizer change was, on the merits, the right engineering trade for downstream quality. Anthropic's cache write premium reflects real storage and warm-cache costs. Gemini's 200K cliff is honestly disclosed. OpenAI's automatic caching is genuinely user-friendly. Every individual decision makes sense on its own.

What is missing is a market norm requiring vendors to publish their tokenizers, announce changes, and standardize the comparison unit. Until that norm exists, and there is no commercial incentive for any vendor to ship it unilaterally, the dollar-per-million-token figure on a pricing page will continue to be the most quoted and least informative number in the industry.

The fix lives on the buyer side. Track cost-per-task, not cost-per-token. Tokenize your real prompts before you migrate. Route by workload, not by vendor. Audit reasoning overhead like you audit any other line item. Treat the price card as one input to a model of your true unit economics, not the model itself.

And when a thread titled "Opus 4.7 to 4.6 Inflation is ~45%" climbs the front page of Hacker News on the day after a model launch, check the tokenizer before you check the price card.


Written by the Altimate team. We build Altimate-Code, an agentic data engineering platform for enterprises. If you're working through this kind of vendor economics for your own stack, we'd like to compare notes.