What TurboQuant Actually Does
Google Research announced TurboQuant, a training-free algorithm that uses vector quantization techniques, specifically PolarQuant and Quantized Johnson-Lindenstrauss (QJL), to compress the key-value (KV) cache in large language models down to 3 bits per value without accuracy loss. Google reports that this reduces memory usage by at least 6x compared to standard 32-bit storage.
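To make the core idea concrete, here is a minimal sketch of low-bit scalar quantization of a KV tensor. This is illustrative only: TurboQuant's actual codec combines vector quantization (PolarQuant) with QJL random projections, and a real implementation would pack the 3-bit codes rather than store one per byte.

```python
import numpy as np

def quantize(x, bits=3):
    """Uniformly quantize a float array to `bits` bits per value.
    Illustrative sketch, not TurboQuant's actual algorithm."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # values in [0, levels]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from the integer codes."""
    return codes.astype(np.float32) * scale + lo

# Toy slice of a KV cache: 128 cached tokens, head dimension 64
keys = np.random.randn(128, 64).astype(np.float32)
codes, lo, scale = quantize(keys, bits=3)
recon = dequantize(codes, lo, scale)
# Rounding to the nearest level bounds the error by half a quantization step
assert np.abs(keys - recon).max() <= scale / 2 + 1e-6
```

With 3-bit codes in place of 32-bit floats, the payload shrinks by roughly 10x before accounting for the per-tensor metadata (`lo`, `scale`) and packing overhead.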
Benchmark Results
Tested on open-source models like Gemma, Mistral, and Llama across LongBench (question answering, code generation, summarization), Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, TurboQuant matched or outperformed baselines like KIVI. It achieved perfect scores on needle-in-a-haystack tasks at 6x compression, maintaining 100% retrieval accuracy up to 104k tokens under 4x compression.
Speed Improvements
On NVIDIA H100 GPUs, 4-bit TurboQuant delivered up to an 8x speedup in attention logit computation compared to 32-bit baselines. The algorithm targets the KV cache bottleneck that typically limits LLM inference throughput, especially for long-context workloads.
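The speedup comes from reading fewer bytes of cached keys per attention step. The sketch below shows why logits can be computed directly from integer codes: for a per-row affine quantizer, q · K_hat factors into an integer dot product plus a correction term, so the full-precision keys never need to be materialized. This is a simplified stand-in for whatever fused kernel TurboQuant actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1024                                # head dim, cached tokens
K = rng.standard_normal((n, d)).astype(np.float32)   # cached keys
q = rng.standard_normal(d).astype(np.float32)        # current query

# 4-bit per-row affine quantization of the keys (illustrative codec)
lo = K.min(axis=1, keepdims=True)
scale = (K.max(axis=1, keepdims=True) - lo) / 15
codes = np.round((K - lo) / scale).astype(np.uint8)

# Since K_hat[i] = codes[i] * scale[i] + lo[i], the logits factor as:
#   q . K_hat[i] = scale[i] * (q . codes[i]) + lo[i] * sum(q)
logits_q = (codes @ q) * scale.ravel() + lo.ravel() * q.sum()

# Reference: explicitly dequantize, then take the dot products
K_hat = codes.astype(np.float32) * scale + lo
assert np.allclose(logits_q, K_hat @ q, atol=1e-2)
```

On real hardware the win is twofold: the memory traffic for `codes` is 8x smaller than for 32-bit keys, and the integer dot product maps onto fast low-precision tensor-core paths.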
Why This Matters
The KV cache stores attention keys and values during inference; it grows linearly with both context length and batch size. As enterprises deploy LLMs for production workloads, memory cost often becomes the primary constraint on concurrency and latency. Google claims this compression could enable cheaper AI deployment, higher concurrency, and efficiency at scale, all without retraining or added runtime overhead.
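The linear growth is easy to quantify. The sketch below computes KV cache size for a hypothetical Llama-style configuration (32 layers, 32 heads of dimension 128 at a 32k context; the numbers are illustrative, not taken from the article), comparing a common fp16 baseline against 3-bit storage.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bits):
    """Total KV cache size in bytes.
    Factor of 2 covers both keys and values; size is linear in
    seq_len and batch, which is why long contexts dominate memory."""
    return 2 * layers * heads * head_dim * seq_len * batch * bits / 8

# Hypothetical Llama-style config: 32 layers, 32 heads x 128 dims, 32k tokens
full = kv_cache_bytes(32, 32, 128, 32_768, batch=1, bits=16)  # fp16 baseline
q3 = kv_cache_bytes(32, 32, 128, 32_768, batch=1, bits=3)     # 3-bit codes
print(f"fp16: {full / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB")
# fp16: 16.0 GiB, 3-bit: 3.0 GiB
```

At this scale a single 32k-token sequence already consumes 16 GiB in fp16; serving a batch of eight would exceed an 80 GB H100 on cache alone, which is why compression translates directly into concurrency.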
The research was led by Amir Zandieh and Vahab Mirrokni at Google Research, with plans to present the work at ICLR 2026. While the work remains a research result and is not yet deployed in production systems, it has sparked industry discussion about the implications for memory chip demand and inference efficiency.