Google TurboQuant Achieves 6x LLM Memory Reduction Without Quality Loss

Google Research's TurboQuant algorithm compresses LLM memory by up to 6x while delivering up to 8x faster attention with zero accuracy loss, targeting the KV cache bottleneck in long-context inference.

March 28, 2026

What TurboQuant Does

TurboQuant targets the key-value (KV) cache, which stores context during LLM inference to avoid recomputation. The algorithm uses a two-stage quantization process that compresses high-dimensional vectors from 16 bits down to just 3 bits per value.
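To see why the KV cache is the bottleneck, a rough back-of-the-envelope estimate helps. The sketch below assumes Llama-3.1-8B's published configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the article's 104k-token context and 3-bit figure; it ignores any per-block metadata the real format would add, so treat the numbers as indicative only:

```python
# Rough KV-cache size estimate. Model dims assume Llama-3.1-8B's
# published config (32 layers, 8 KV heads via GQA, head_dim 128);
# these are assumptions for illustration, not from the article.
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 104_000            # longest context tested in the article

def kv_cache_bytes(bits_per_value):
    # 2x for keys and values; one entry per layer/head/position/dim.
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

fp16 = kv_cache_bytes(16)    # 16-bit baseline cache
q3 = kv_cache_bytes(3)       # 3-bit quantized cache
print(f"fp16: {fp16/1e9:.1f} GB, 3-bit: {q3/1e9:.1f} GB, ratio: {fp16/q3:.1f}x")
# → fp16: 13.6 GB, 3-bit: 2.6 GB, ratio: 5.3x
```

The raw 16-to-3-bit ratio is 16/3 ≈ 5.3x, which lines up with the 5x-6x compression range Google reports once format overheads and savings are netted out.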

The first stage, called PolarQuant, converts vectors to polar coordinates (magnitude and angles), exploiting predictable angular patterns to skip costly normalization. The second stage, QJL (Quantized Johnson-Lindenstrauss), applies a 1-bit transform to the residual error, providing unbiased inner product estimates critical for transformer attention accuracy.
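The two-stage idea can be caricatured in a few lines of NumPy. This is an illustrative sketch, not Google's implementation: it pairs vector components into complex numbers for the polar step (the actual algorithm's treatment of high-dimensional vectors is more involved), and the function names `polar_quant` and `qjl_sign`, the 3-bit angle budget, and the 256-row projection are all choices made here for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_quant(v, angle_bits=3):
    """Stage 1 sketch: keep each pair's magnitude, coarsely quantize its angle."""
    # Treat the vector as d/2 complex numbers: one magnitude + one angle per pair.
    z = v[0::2] + 1j * v[1::2]
    mag = np.abs(z)
    levels = 2 ** angle_bits
    # Snap each angle to one of 2^angle_bits evenly spaced values on the circle.
    idx = np.round((np.angle(z) + np.pi) / (2 * np.pi) * levels) % levels
    ang = idx / levels * 2 * np.pi - np.pi
    zq = mag * np.exp(1j * ang)
    out = np.empty_like(v)
    out[0::2], out[1::2] = zq.real, zq.imag
    return out

def qjl_sign(residual, proj):
    """Stage 2 sketch: 1-bit Johnson-Lindenstrauss transform of the residual."""
    # Randomly project the leftover error, then keep only the signs (1 bit each).
    return np.sign(proj @ residual)

d, m = 128, 256                       # vector dim and projection rows (demo values)
v = rng.standard_normal(d)
vq = polar_quant(v)                   # coarse polar reconstruction
proj = rng.standard_normal((m, d)) / np.sqrt(m)
bits = qjl_sign(v - vq, proj)         # 1-bit code for the residual error
```

The point of the second stage is that sign-of-random-projection codes yield unbiased inner product estimates, so attention scores computed from the compressed cache stay centered on the true values rather than drifting systematically.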

Performance Results

Google tested TurboQuant on models including Llama-3.1-8B-Instruct, Gemma, and Mistral. The results were striking: 5x to 6x KV cache compression with zero accuracy loss. On the Needle-In-A-Haystack benchmark, the compressed models maintained 100% retrieval accuracy up to 104k tokens. Speed improvements reached up to 8x on H100 GPUs for 4-bit attention operations compared to 32-bit baselines.

The compression is data-oblivious, meaning it requires no model training or fine-tuning. This makes it immediately applicable to any existing LLM deployment.

Community Response and Implementation

The open-source community has already begun implementing TurboQuant. A GitHub discussion in llama.cpp outlines an implementation approach using the TurboQuant_mse algorithm, while a dedicated fork (TheTom/llama-cpp-turboquant) extends llama.cpp for TurboQuant support. On Apple Silicon, developers achieved 4.6x KV cache compression with custom Metal kernels for Qwen 32B, reaching 98% of FP16 speeds.

Google will present TurboQuant at ICLR 2026 next month. The breakthrough could significantly lower AI deployment costs by enabling larger models or more users per GPU, particularly for long-context applications.

Source: TechCrunch