
Google's TurboQuant Cuts LLM Memory Use by 6x With No Accuracy Loss

Google Research announces TurboQuant, a memory compression algorithm that reduces LLM KV cache size by 6x with zero accuracy loss and up to 8x speedup on H100 GPUs.

March 26, 2026

What TurboQuant Actually Does

Google Research announced TurboQuant, a training-free algorithm that uses vector quantization techniques—specifically PolarQuant and Quantized Johnson-Lindenstrauss (QJL)—to compress the key-value (KV) cache in large language models down to 3 bits per value without any accuracy loss. This reduces memory usage by at least 6x compared to standard 32-bit storage.
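TurboQuant's actual scheme combines PolarQuant and QJL, whose details are beyond this article. As a rough illustration of why storing 3 bits per value (plus a small amount of per-row metadata) yields at least 6x savings over 32-bit storage, here is a minimal uniform 3-bit quantizer for a toy KV tensor; this is a generic sketch, not Google's algorithm:

```python
import numpy as np

def quantize_3bit(x):
    """Uniform per-row 3-bit quantization (8 levels).
    Illustrative only; TurboQuant uses PolarQuant/QJL, not this scheme."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / 7.0, 1e-12)   # 2**3 - 1 = 7 steps
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy KV tensor: (seq_len, head_dim)
kv = np.random.randn(1024, 128).astype(np.float32)
q, scale, lo = quantize_3bit(kv)
recon = dequantize(q, scale, lo)

fp32_bits = kv.size * 32
# 3 bits per value, plus one fp32 scale and offset per row of metadata
quant_bits = kv.size * 3 + scale.size * 32 * 2
print(f"compression: {fp32_bits / quant_bits:.1f}x")
print(f"max abs error: {np.abs(kv - recon).max():.3f}")
```

Even with the per-row scale/offset overhead, the ratio comfortably clears 6x; the research result is that the accuracy hit visible in this naive scheme's reconstruction error can be eliminated entirely.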

Benchmark Results

Tested on open-source models like Gemma, Mistral, and Llama across LongBench (question answering, code generation, summarization), Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, TurboQuant matched or outperformed baselines like KIVI. It achieved perfect scores on needle-in-a-haystack retrieval at 6x compression, and maintained 100% retrieval accuracy out to 104k tokens at 4x compression.

Speed Improvements

On NVIDIA H100 GPUs, 4-bit TurboQuant delivered up to 8x speedup in attention logit computation compared to 32-bit baselines. The algorithm targets the KV cache bottleneck that typically limits LLM inference throughput, especially for long-context workloads.
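The article does not describe the kernel, but a common reason quantized KV caches accelerate logit computation is that the query-key dot product can run directly against the low-bit integer keys, with dequantization folded into a per-key correction rather than materializing full-precision keys. A hedged sketch using generic uniform 4-bit quantization (not TurboQuant's actual kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, seq_len = 128, 1024
query = rng.standard_normal(head_dim).astype(np.float32)
keys = rng.standard_normal((seq_len, head_dim)).astype(np.float32)

# Per-key 4-bit uniform quantization: k ≈ scale * qk + lo
lo = keys.min(axis=1, keepdims=True)
scale = (keys.max(axis=1, keepdims=True) - lo) / 15.0
qk = np.round((keys - lo) / scale).astype(np.uint8)

# Naive path: dequantize the keys, then take dot products
logits_ref = (qk * scale + lo) @ query

# Fused path: dot product on the quantized codes plus a closed-form
# correction. (A real GPU kernel would use a low-bit integer matmul here;
# numpy casts to float for clarity.)
logits_fast = scale[:, 0] * (qk.astype(np.float32) @ query) + lo[:, 0] * query.sum()

assert np.allclose(logits_ref, logits_fast, atol=1e-2)
```

The algebra is exact: each logit is `sum_j (scale*qk_j + lo) * query_j = scale * (qk @ query) + lo * sum(query)`, so the fused path never touches dequantized 32-bit keys, which is where the memory-bandwidth savings come from.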

Why This Matters

The KV cache stores attention keys and values during inference—a memory structure that grows linearly with context length and batch size. As enterprises deploy LLMs for production workloads, memory costs often become the primary constraint on concurrency and latency. Google claims this compression could enable cheaper AI deployment, higher concurrency, and efficiency at scale without retraining or runtime overhead.
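To make that linear growth concrete, here is back-of-the-envelope KV cache sizing for a hypothetical 8B-class model; the layer, head, and batch figures below are illustrative assumptions, not numbers from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits_per_value):
    # Factor of 2 covers keys and values; size grows linearly
    # in both seq_len and batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bits_per_value / 8

# Hypothetical Llama-style config (assumed for illustration)
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=104_000, batch=8)

fp32 = kv_cache_bytes(**cfg, bits_per_value=32)
q3 = kv_cache_bytes(**cfg, bits_per_value=3)

print(f"fp32 KV cache:  {fp32 / 2**30:.1f} GiB")
print(f"3-bit KV cache: {q3 / 2**30:.1f} GiB ({fp32 / q3:.1f}x smaller)")
```

At long contexts and modest batch sizes the 32-bit cache runs to hundreds of gibibytes, which is why compressing it directly raises the concurrency a fixed GPU fleet can serve.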

The research was led by Amir Zandieh and Vahab Mirrokni at Google Research, with plans to present the work at ICLR 2026. While still a research breakthrough and not yet deployed in production systems, the results have sparked industry discussion about implications for memory chip demand and inference efficiency.

Source: Google Research / TechCrunch