TurboQuant: How Google Shrinks the LLM KV Cache 6×
Google's TurboQuant squeezes an LLM's KV cache to about 3 bits per value with near-zero quality loss. Here's the rotation-plus-one-bit trick, decoded.

Load a 70B model with a 128K-token context and the weights aren't the scary part. The keys and values the model caches so it doesn't have to re-read that context can balloon to 40 GB on their own, for a single conversation. That cache is the quiet reason long-context inference falls over, and it's exactly what Google's TurboQuant goes after. It squeezes the KV cache down to roughly 3 bits per value, a 6× cut, with quality loss small enough to round to zero. No retraining, no calibration.
What the KV cache is, and why it explodes
A transformer generates one token at a time. To produce the next token, it attends back over every token that came before. Recomputing the keys and values for all of that history on every step would be brutal, so the model caches them. That cache is the "KV cache," and it grows linearly with how much context you're holding.
The size adds up fast because you pay for it per layer and per attention head. Per token, the cache stores this many numbers:
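2 × layers × kv_heads × head_dim
The 2 is for keys and values. Multiply by your context length and the bytes per number, and you get the total. Run the real shape of a big model through it. Taking a Llama-3-70B-style shape as the example (80 layers, 8 KV heads, head dimension 128), that's 2 × 80 × 8 × 128 = 163,840 numbers per token. At 2 bytes each in FP16 and 131,072 tokens of context, the total comes to about 43 billion bytes, or 40 GiB.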
Forty gigabytes, for one sequence, on top of the 140 GB the weights already want. Batch a handful of users and the KV cache, not the model, is what runs you out of VRAM. Shrink it and two things happen at once: longer contexts fit, and you can serve more people per card.
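To try other shapes, the formula above can be wrapped in a small helper. This is a minimal sketch: the function name `kv_cache_bytes` is made up for illustration, and the shape plugged in (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-3-70B-style configuration, so swap in your own model's numbers.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in bytes for a single sequence."""
    # The leading 2 covers keys and values.
    values_per_token = 2 * layers * kv_heads * head_dim
    return values_per_token * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Assumed shape: Llama-3-70B-style, FP16 cache, 128K-token context.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_tokens=128 * 1024)
print(f"{size / GIB:.1f} GiB")  # prints "40.0 GiB"
```

Because the size is linear in `context_tokens`, halving the context halves the cache, and every concurrent sequence pays the full amount again.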
Why kv_heads and not heads?
Modern models use grouped-query attention, so many query heads share a smaller set of key/value heads. Llama-3-70B has 64 query heads but only 8 KV heads. The cache is sized by the KV heads, which is already a big saving before any quantization. TurboQuant stacks on top of it.
Why you can't just round the numbers down
The obvious move is to store the cache in 4-bit or 3-bit integers instead of 16-bit floats. People have tried. It tends to wreck the model in two ways that aren't obvious until you measure them.
First, attention is a giant pile of dot products. Each new token scores itself against every cached key with a dot product, then a softmax turns those scores into attention weights. A quantizer can look great on ordinary reconstruction error and still bias those dot products in a consistent direction. Bias the scores and you bend the softmax, which quietly changes what the model pays attention to. If the dot-product intuition is fuzzy, the same math powers embeddings and semantic search, where closeness is literally a dot product.
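To make the softmax point concrete, here is a small sketch of what a consistent shrink in the scores does to attention weights. The scores and the 0.8 shrink factor are invented for illustration and are not measurements from any real quantizer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


true_scores = [4.0, 2.0, 1.0, 0.5]   # hypothetical query-key dot products
shrink = 0.8                         # hypothetical multiplicative bias from quantization
biased_scores = [shrink * s for s in true_scores]

true_weights = softmax(true_scores)
biased_weights = softmax(biased_scores)

# The strongest key loses attention weight and the weaker keys gain it.
for t, b in zip(true_weights, biased_weights):
    print(f"true={t:.3f}  biased={b:.3f}")
```

A constant added to every score would cancel inside the softmax. A bias that scales with the score does not cancel, and that is the kind that flattens or sharpens the attention distribution.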
Second, outliers. A few coordinates in each vector carry huge values while the rest are small. To cover that one big coordinate, a uniform quantizer has to stretch its range wide, which leaves every other coordinate sharing coarse buckets and wasting precision. Classic vector quantization fights back by storing a separate scale and offset for each small block, in full precision. That overhead adds 1 to 2 bits per value right back, eating the compression you were after. Calibration-based methods take a different route and tune the quantizer to a sample dataset, then degrade on inputs that don't look like the sample. Long contexts are exactly the inputs that don't look like the sample.
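The overhead arithmetic is easy to check. The sketch below assumes each block stores one FP16 scale and one FP16 offset, and the block sizes are illustrative choices rather than the layout of any specific quantization format.

```python
def effective_bits(bits_per_value, block_size, scale_bits=16, offset_bits=16):
    """Bits per value once the per-block constants are amortized over the block."""
    return bits_per_value + (scale_bits + offset_bits) / block_size


# Assumed setup: 4-bit values, one FP16 scale and one FP16 offset per block.
for block_size in (16, 32, 64):
    print(block_size, effective_bits(4, block_size))
# prints:
# 16 6.0
# 32 5.0
# 64 4.5
```

Smaller blocks track outliers better but pay more overhead, which is the trade-off a quantizer with no per-block constants avoids.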
Quick check
A quantizer has low reconstruction error on the KV cache but the model's answers get worse. What's the most likely culprit?
How TurboQuant works
TurboQuant is two stages, and each one targets one of those failure modes directly.
Stage one: rotate, then quantize. Before quantizing, it applies a random rotation to each vector. A rotation built from a Johnson-Lindenstrauss-style transform spreads a vector's energy evenly across all its coordinates, so no single outlier gets to dictate the range. Once the energy is flat, a plain uniform grid is close to the best you can do, and the same fixed number of bits per coordinate is no longer wasteful. The rotation doesn't depend on your data at all, which is the whole point: there's nothing to calibrate.
In the original post's interactive example, a spiky vector starts out about 7× lopsided. After the rotation it's barely above 1×, which means a uniform quantizer fits it well and spends its bits evenly.
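The spreading effect can be reproduced in a few lines. This sketch swaps in a randomized Hadamard transform (random sign flips followed by a normalized Walsh-Hadamard transform), a common fast stand-in for a random rotation. It is not claimed to be the exact transform TurboQuant uses. The peak-to-RMS ratio is also my own choice of lopsidedness measure, so its numbers will not match the figures quoted above.

```python
import math
import random


def fwht(vec):
    """Walsh-Hadamard transform, scaled by 1/sqrt(n) so vector length is preserved."""
    out = list(vec)
    n = len(out)  # must be a power of two
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]


def random_rotate(vec, rng):
    """Random sign flips, then Hadamard: an orthogonal, data-independent transform."""
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return fwht([s * v for s, v in zip(signs, vec)])


def norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def peak_to_rms(vec):
    """Largest magnitude divided by the root-mean-square magnitude."""
    return max(abs(v) for v in vec) / (norm(vec) / math.sqrt(len(vec)))


rng = random.Random(0)
dim = 128
spiky = [rng.gauss(0, 0.1) for _ in range(dim)]  # small background values
spiky[0] = 1.0                                   # one large outlier coordinate
rotated = random_rotate(spiky, rng)

print(f"peak/RMS before: {peak_to_rms(spiky):.2f}")
print(f"peak/RMS after:  {peak_to_rms(rotated):.2f}")
print(f"norm before/after: {norm(spiky):.4f} / {norm(rotated):.4f}")
```

The length of the vector is unchanged, so dot products between vectors rotated the same way are preserved. Only the distribution of energy across coordinates changes.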
Stage two: spend one bit on direction. A quantizer that only minimizes absolute error can still point the reconstructed vector slightly the wrong way, and "the wrong way" is what biases the dot products. So TurboQuant reserves a single extra bit, using a Quantized Johnson-Lindenstrauss correction, to fix the direction rather than shrink the absolute error any further. That one bit is what keeps the dot-product estimates unbiased, which is what keeps attention honest.
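The idea that one sign bit per projection can give an unbiased dot-product estimate can be checked numerically. This is a toy sketch of a sign-based Johnson-Lindenstrauss estimator, not TurboQuant's implementation. The function name, the Gaussian projection, and the large number of projection rows (chosen only so the average visibly converges) are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign_sketch_dot(query, key, num_rows, rng):
    """Estimate dot(query, key) keeping only one sign bit of the key per random projection."""
    key_norm = math.sqrt(dot(key, key))
    total = 0.0
    for _ in range(num_rows):
        row = [rng.gauss(0, 1) for _ in key]
        key_bit = 1.0 if dot(row, key) >= 0 else -1.0  # all that is kept of the key
        total += dot(row, query) * key_bit             # the query stays in full precision
    # sqrt(pi/2) undoes the known shrinkage from taking the sign.
    return math.sqrt(math.pi / 2) * key_norm * total / num_rows


def unit(vec):
    length = math.sqrt(dot(vec, vec))
    return [v / length for v in vec]


rng = random.Random(0)
dim = 32
key = unit([rng.gauss(0, 1) for _ in range(dim)])
query = unit([rng.gauss(0, 1) for _ in range(dim)])

exact = dot(query, key)
estimate = sign_sketch_dot(query, key, num_rows=20000, rng=rng)
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

The estimate is noisy for any single projection, but its expectation equals the true dot product. That is the property the article means by keeping attention honest.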
Because none of this looks at your data, any standard transformer benefits the moment you turn it on. Load a model from a GGUF file and it works. No fine-tuning, no calibration pass, no per-model tuning.
Quick check
TurboQuant is described as 'data-oblivious.' What does that buy you?
The numbers
| Approach | Bits per value | KV cache at 128K | Quality |
|---|---|---|---|
| FP16 (baseline) | 16 | ~40 GB | reference |
| Naive 4-bit | 4 + block overhead | ~12 GB | visible drop |
| TurboQuant | ~3 effective | ~6.7 GB | near-zero drop |
A few headline figures from the work. The KV cache shrinks by at least 6×. Part of that is going from 16 bits to about 3, and part is dropping the full-precision per-block constants that normal quantization has to carry. The 4-bit variant runs attention up to 8× faster than unquantized 32-bit keys on an H100, because there's far less data to move and multiply. On needle-in-a-haystack retrieval at long context it scores perfectly, and on standard benchmarks the drop is small enough that Google reports it as zero. The method is provably close to the information-theoretic limit for this kind of quantization, which is a real claim with proofs behind it, not marketing.
If you want the primary source, it's the paper "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni, headed to ICLR 2026.
What it changes if you ship LLMs
The 40 GB cache from earlier becomes about 6.7 GB. That single fact moves a few real decisions.
You get long context on the hardware you already have. A 128K window that used to spill out of an 80 GB card now leaves room to spare, so the feature you shelved because "the context doesn't fit" might just fit now. Re-run the cache-size formula from earlier with your model's real shape and your target context length before you assume you need a bigger GPU.
You get cheaper serving, not just bigger windows. KV memory is usually what caps your batch size, and batch size is what makes inference cheap per request. Free up 6× of the cache and you can pack far more concurrent users onto one card. That pushes the same direction as the falling token prices I wrote about in the 2026 LLM price war, and it takes a bit of pressure off the power and compute crunch by getting more useful work out of each watt.
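A rough way to see the batch-size effect is to divide a fixed memory budget by the per-sequence cache. Everything concrete here is an assumption for illustration: the 36 GiB budget left over for the cache, the 32K-token context, the Llama-3-70B-style shape, and the use of the article's headline 6× as the compression ratio.

```python
GIB = 1024 ** 3


def max_sequences(cache_budget_bytes, bytes_per_sequence):
    """How many full-length sequences fit in the memory reserved for the KV cache."""
    return cache_budget_bytes // bytes_per_sequence


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per value).
bytes_per_token_fp16 = 2 * 80 * 8 * 128 * 2
context_tokens = 32 * 1024                                # hypothetical serving context
fp16_per_sequence = bytes_per_token_fp16 * context_tokens  # 10 GiB
compressed_per_sequence = fp16_per_sequence // 6           # headline 6x reduction

budget = 36 * GIB  # hypothetical memory left for the cache after weights

print(max_sequences(budget, fp16_per_sequence))        # prints 3
print(max_sequences(budget, compressed_per_sequence))  # prints 21
```

Real servers also reserve memory for activations and fragmentation, so treat the result as an upper bound rather than a capacity plan.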
Adoption is close to free. Because there's no retraining or calibration, this lands in inference stacks as a cache setting rather than a model change. Expect it to show up in llama.cpp and vLLM-style servers as an option you flip on.
Read the fine print before you trust it
Three caveats worth holding onto. This compresses the KV cache, not the model weights, so it helps long-context and high-batch serving, not the base memory the weights need. "Zero loss" is measured on the authors' benchmarks, so run it against your own eval before you ship it. And it produces an unbiased estimate of attention, not exact arithmetic. The whole design is about keeping that estimate honest, but it's still an estimate.
The takeaway
The KV cache was the tax nobody talked about on long context, and it scales with exactly the thing everyone wants more of. TurboQuant cuts it about 6× with a rotation and one carefully spent bit, and it asks nothing of your model in return.
The mental model is worth keeping: spread a vector's energy out so uniform buckets fit it, then spend a single bit to keep the dot products pointing the right way. Rotation handles the outliers, the correction bit handles attention. If you want to go a level deeper on why dot products are the thing worth protecting, embeddings and semantic search is the same idea from the other end.

Written by
Rhythm Bhiwani
Engineer and relentless builder, happiest reverse-engineering hard problems until they click.

