daita@system:~$ cat ./polarquant_polar_kv_cache_compression.md

PolarQuant: Quantizing KV Caches in Polar Coordinates

Created: 2026-04-27 | Size: 10500 bytes

TL;DR

In a February 2025 paper, Han, Kacham, Karbasi, Mirrokni, and Zandieh propose PolarQuant, a KV cache compressor that reframes the problem as angle quantization in polar coordinates rather than per-channel Cartesian quantization. A random preconditioning step turns arbitrary KV vectors into Gaussian ones, and a recursive polar transform splits a d-dim vector into a single radius and d-1 angles organized in log2(d) levels. The angles end up independent and tightly concentrated, so each level can be quantized with very few bits and no joint optimization. Result: 4.2x compression at 3.875 bits per coordinate, 48.37 vs 48.63 exact on LongBench, 0.991 vs 0.984 KIVI on Needle-in-a-Haystack at up to 104K tokens.

Not another outlier-suppression trick

Most KV compression work either prunes tokens (SnapKV, PyramidKV, H2O, ScissorHands) or quantizes Cartesian values per channel (KIVI, KVQuant, GEAR). Pruning loses information at long context. Channel-wise quantization needs per-channel scale and zero-point, which costs roughly 1 extra bit per number, and outlier channels keep dragging error budgets up.

PolarQuant is neither. It does not prune tokens, does not chase outliers, and does not quantize per channel. It changes the coordinate system, then quantizes coordinates that were never available in the original frame.

The polar reframing

The recursive transform pairs adjacent coordinates and converts each pair (a, b) to (r, theta) with r = sqrt(a^2 + b^2) and theta = arctan(b/a). Apply that recursively and a d-dim vector becomes one radius plus d-1 angles, organized as a tree of log2(d) levels.

Higher levels carry coarser, more redundant angular information. Lower levels carry the fine angular detail. That structure matters for the bit budget: levels can take different bit widths.
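
For concreteness, here is a minimal NumPy sketch of the recursive transform. The function name, the num_levels argument, and the output layout are illustrative rather than the paper's code (the practical variant stops after L levels, leaving d / 2^L radii), and arctan2 stands in for arctan(b/a) so that signs stay recoverable.

```python
import numpy as np

def polar_transform(x, num_levels):
    """Pair adjacent coordinates level by level: each pair (a, b) becomes
    a radius hypot(a, b) and an angle arctan2(b, a); the radii feed the
    next level. After num_levels levels, a length-d vector is
    d / 2**num_levels radii plus the angles of every level."""
    r = np.asarray(x, dtype=np.float64)
    angle_levels = []
    for _ in range(num_levels):
        a, b = r[0::2], r[1::2]
        angle_levels.append(np.arctan2(b, a))   # angles of this level
        r = np.hypot(a, b)                      # radii passed up the tree
    return r, angle_levels

radii, angles = polar_transform(np.random.randn(128), num_levels=4)
print(radii.shape, [a.size for a in angles])    # (8,) [64, 32, 16, 8]
```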

Random preconditioning is the unlock

Polar coordinates alone do not buy you compression. Real KV vectors are not Gaussian, and their angle distributions are messy. PolarQuant fixes that by multiplying the input by a random preconditioner, in practice a scaled Hadamard transform that runs in O(d log d) instead of O(d^2).
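
A sketch of that preconditioner, assuming the standard sign-flip-plus-Hadamard construction implemented with the iterative fast Walsh-Hadamard transform; randomized_hadamard and its argument names are illustrative, not the paper's code.

```python
import numpy as np

def randomized_hadamard(x, signs):
    """y = (1/sqrt(d)) * H @ (signs * x) via the iterative fast
    Walsh-Hadamard transform: O(d log d), d must be a power of two.
    `signs` is a fixed +-1 vector drawn once from a pinned seed."""
    y = signs * np.asarray(x, dtype=np.float64)
    d = y.size
    h = 1
    while h < d:
        y = y.reshape(-1, 2, h)                     # butterfly blocks of width h
        a, b = y[:, 0, :].copy(), y[:, 1, :].copy()
        y[:, 0, :], y[:, 1, :] = a + b, a - b
        y = y.reshape(-1)
        h *= 2
    return y / np.sqrt(d)                           # orthonormal scaling

rng = np.random.default_rng(0)                      # pin and serialize this seed
signs = rng.choice([-1.0, 1.0], size=128)
y = randomized_hadamard(np.random.randn(128), signs)
```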

By Johnson-Lindenstrauss, the preconditioner preserves inner products. By Lemma 2 of the paper, the preconditioned vector's polar decomposition is separable: the radius and every angle are statistically independent, not merely uncorrelated, and each angle's distribution depends only on its level. Higher levels concentrate around pi/4, so they need fewer bits.

The independence claim matters. Merely uncorrelated angles would still leave joint structure on the table, which would justify expensive vector-quantization codebooks. Full independence means a separate scalar codebook per level is information-theoretically sufficient; no joint optimization recovers anything more. The geometric intuition: at deeper levels each angle is the arctangent of the ratio of two i.i.d. norms (square roots of sums of squares) with equal scale, that ratio concentrates around 1, so the angle concentrates around pi/4.
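
A quick empirical check of that intuition, under the assumption that preconditioned coordinates behave like i.i.d. Gaussians: transform a batch of Gaussian vectors and watch the angle spread shrink level by level (a sketch, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.standard_normal((100_000, 16))      # stand-in for preconditioned vectors
for level in range(1, 5):
    a, b = r[:, 0::2], r[:, 1::2]
    theta = np.arctan2(b, a)
    print(level, round(theta.mean(), 3), round(theta.std(), 3))
    r = np.hypot(a, b)
# Level-1 angles span (-pi, pi]; deeper levels cluster ever tighter around pi/4.
```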

The practical bit budget the authors choose:

| Level | Bits per angle | Notes |
|---|---|---|
| 1 (finest) | 4 | Most angular variance |
| 2 | 2 | Concentrating |
| 3 | 2 | More concentrated |
| 4 (coarsest) | 2 | Tight around pi/4 |
| Radii | 16 (FP16) | One per level-L stub (8 of them for d=128, L=4) |

Average: 3.875 bits per coordinate. No per-channel scale, no per-channel zero-point. Codebooks are constructed by partitioning each level's angle CDF into equal-probability bins, an asymptotically optimal choice that Theorem 1 ties to a guarantee of E[||x - x'||^2] <= epsilon * ||x||^2 using O(log(1/epsilon)) bits per coordinate.
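
The equal-probability construction is easy to sketch: take samples of a level's angle distribution (analytically known after preconditioning, or simply sampled from Gaussians), cut the empirical CDF into 2^bits equal-mass bins, and encode with a binary search. The function names and the midpoint representatives below are illustrative assumptions.

```python
import numpy as np

def equal_mass_codebook(samples, bits):
    """Partition the angle CDF into 2**bits equal-probability bins.
    Returns bin edges (for encoding) and one representative per bin."""
    edges = np.quantile(samples, np.linspace(0.0, 1.0, 2**bits + 1))
    centers = 0.5 * (edges[:-1] + edges[1:])     # bin midpoints suffice for a sketch
    return edges, centers

def encode_angles(theta, edges):
    """Map each angle to its bin index; decoding is centers[idx]."""
    return np.clip(np.searchsorted(edges, theta, side="right") - 1, 0, len(edges) - 2)

# Example: level-2 angles are arctan2 of two pair-norms of i.i.d. Gaussians.
rng = np.random.default_rng(0)
norms = np.linalg.norm(rng.standard_normal((2, 100_000, 2)), axis=-1)
theta2 = np.arctan2(norms[1], norms[0])
edges, centers = equal_mass_codebook(theta2, bits=2)
idx = encode_angles(theta2, edges)
print(np.bincount(idx) / idx.size)               # ~0.25 per bin by construction
```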

Online vs offline codebooks

Because the transformed angle distributions are theoretically known, the codebook can be precomputed offline and shared across prompts. PolarQuant also offers an online variant (PolarQuant-R) that runs k-means per prompt for tighter quantization, at the cost of extra prefill time.

Choose offline for throughput, online for the last few quality points on long contexts.
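
A per-prompt refinement in the spirit of the online variant might look like the sketch below, fitting a scalar k-means codebook to each level's angles observed during prefill; the helper name and the scikit-learn dependency are assumptions, not the paper's recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def online_codebook(prefill_angles, bits):
    """Fit a 1-D k-means codebook (2**bits centroids) on one level's angles
    from this prompt, instead of using the shared offline codebook."""
    km = KMeans(n_clusters=2**bits, n_init=4).fit(prefill_angles.reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())

# Usage: one codebook per level, computed once at the end of prefill.
rng = np.random.default_rng(0)
centers = online_codebook(rng.uniform(0, np.pi / 2, size=50_000), bits=2)
```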

What the numbers show

LongBench, end-to-end generation on Llama 3.1-8B-Instruct:

| Method | Score | Bits/coord |
|---|---|---|
| Exact (FP16) | 48.63 | 16 |
| PolarQuant-R | 48.37 | 3.875 |
| KIVI | 46.70 | ~2 (asymmetric) |
| SnapKV | 44.57 | token-pruned |

Needle-in-a-Haystack with the cache compressed to 0.25x of its original size, contexts from 4K to 104K tokens:

  • PolarQuant: 0.991
  • KIVI: 0.984
  • SnapKV / PyramidKV: degrade at long context

Wall clock on the same model:

| Method | Prefill (s) | Generation (s) |
|---|---|---|
| Exact | 2.934 | 38.374 |
| PolarQuant-R offline | 3.364 | 44.097 |
| KIVI | 3.590 | 49.564 |

So PolarQuant pays roughly 15% generation overhead versus exact, beats KIVI on both quality and speed, and crushes token-pruning baselines on long-context faithfulness. Pruning methods are still faster, but they are also the ones that fall over above 64K tokens.

Streaming quant vs offline quant

Weight quantization gets hours of offline calibration. AWQ, GPTQ, SmoothQuant, QuaRot all run a search pass over a held-out set, find per-channel scales, and bake them into the model. KV cache cannot do that. Tokens arrive in the forward pass and must be encoded immediately, with no second look at the data.

This is the regime PolarQuant is built for. Random preconditioning needs no calibration, the codebook is determined by the (known) post-preconditioning angle distribution, and per-token encoding is a fixed O(d log d) Hadamard plus a CDF lookup. There are no outliers to chase, because the basis change has already spread them out by construction. Years of per-channel calibration research are a consequence of asking weight-quantization questions about a streaming-quantization problem.
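
Tying the pieces together, a streaming per-token encoder is just the fixed pipeline below. It repeats the Hadamard and polar sketches from earlier; encode_token, its arguments, and the FP16 radius storage are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def encode_token(k_vec, signs, edges_per_level):
    """One token's K (or V) vector -> small-bit angle codes + FP16 radii.
    No calibration data, no second look: Hadamard, polar split, CDF lookup."""
    d = k_vec.size
    y = signs * np.asarray(k_vec, dtype=np.float64)
    h = 1
    while h < d:                                   # fast Walsh-Hadamard, O(d log d)
        y = y.reshape(-1, 2, h)
        a, b = y[:, 0, :].copy(), y[:, 1, :].copy()
        y[:, 0, :], y[:, 1, :] = a + b, a - b
        y, h = y.reshape(-1), 2 * h
    r = y / np.sqrt(d)
    codes = []
    for edges in edges_per_level:                  # recursive polar split + lookup
        a, b = r[0::2], r[1::2]
        theta = np.arctan2(b, a)
        codes.append(np.clip(np.searchsorted(edges, theta, side="right") - 1,
                             0, len(edges) - 2).astype(np.uint8))
        r = np.hypot(a, b)
    return codes, r.astype(np.float16)
```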

Where this fits in the inference stack

In the memory-bound decode regime, throughput scales roughly as tokens/s ∝ HBM_budget / cache_per_token, so a 4.2x cache shrink buys 4.2x more cached tokens at the same HBM budget. Two ways to spend it (a back-of-the-envelope sketch follows the list):

  1. Larger batch. 4.2x batch at fixed seq_len → roughly 4x throughput in the memory-bound regime.
  2. Longer context. 4.2x seq_len at batch=1 → million-token windows on hardware that today caps out at ~256K.
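
Here is that back-of-the-envelope math as a sketch, assuming Llama-3.1-8B-style GQA shapes (32 layers, 8 KV heads, head_dim 128) and 40 GiB of HBM left over for the cache; every number is an assumption to adjust for your deployment.

```python
# KV cache per token = layers * kv_heads * head_dim * 2 (K and V) * bits / 8 bytes
layers, kv_heads, head_dim = 32, 8, 128           # assumed Llama-3.1-8B GQA shapes
hbm_for_cache_gib = 40                            # assumed HBM left after weights

def kib_per_token(bits):
    return layers * kv_heads * head_dim * 2 * bits / 8 / 1024

for name, bits in [("FP16", 16), ("PolarQuant", 3.875)]:
    kib = kib_per_token(bits)
    tokens = hbm_for_cache_gib * 1024**2 / kib    # GiB -> KiB, divided by per-token cost
    print(f"{name}: {kib:.0f} KiB/token, ~{tokens:,.0f} cached tokens")
# FP16: 128 KiB/token, ~328K tokens; PolarQuant: 31 KiB/token, ~1.35M tokens.
```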

The choice is a product question, not a research one. Either way, this pairs with our recent post on inference hardware being optimized for the wrong workload: when the cache dominates HBM, a quality-preserving cache shrink beats another doubling of tensor-core throughput. It also complements the kernel-level work in FlashAttention-4, which speeds up attention but does nothing for the cache itself.

Caveats and open questions

  • The bit allocation (4, 2, 2, 2) is hand-picked, not learned. Different model families likely want different schedules.
  • The 15% generation slowdown is real. Most of it lives in the per-token inverse Hadamard plus codebook lookup, both O(d log d) per cache read. At 100K tokens × 32 heads that adds up. PolarQuant is a memory-quality tradeoff, not a latency win. Pair it with a faster attention kernel if you want both.
  • Online k-means costs prefill time. For chat workloads with short prompts and long generations, the offline codebook is the right default.
  • The Hadamard-based preconditioner needs d to be a power of two. Most modern model dimensions are, but check before adopting.
  • Composes with token-pruning. PolarQuant compresses every token kept; SnapKV/PyramidKV decide which tokens to keep. Orthogonal axes. Prune-then-PolarQuant on the survivors is unexplored and probably free quality on top of either alone.
  • The random preconditioner is a sign-flipped Hadamard. The seed must be pinned and serialized alongside the cache, otherwise the receiver decodes garbage. Not hard, just easy to forget.

Closing

The KV cache is the single largest memory consumer at long context, and the quantization community has been beating its head against per-channel outliers for years. PolarQuant says the outliers are an artifact of asking the wrong question in the wrong basis. Move to polar coordinates, randomize once, and the angles arrange themselves into a quantization-friendly shape with formal guarantees attached. That is a much better starting point for the next generation of inference stacks than another round of empirical scale-and-clip.


References

  1. PolarQuant: Quantizing KV Caches with Polar Transformation, Original source (Han, Kacham, Karbasi, Mirrokni, Zandieh, February 2025)
  2. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache, Cartesian quantization baseline
  3. KVQuant: Towards 10 million context length LLM inference with KV cache quantization, Long-context quantization comparison
  4. SnapKV: LLM knows what you are looking for before generation, Token-pruning baseline
  5. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, Token-pruning baseline
  6. LongBench: A bilingual, multitask benchmark for long context understanding, Evaluation benchmark
  7. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead, Earlier JL-based work from the same group
  8. Inference Hardware Is Built for the Wrong Workload, Related post on the memory-bound nature of decode
  9. FlashAttention-4: Attention at Matmul Speed, Related post on attention-kernel optimization

daita@system:~$ _