Lengthy-context inference makes the KV cache one of many fundamental prices of serving LLMs. Throughout autoregressive decoding, the cache grows with context size, batch dimension, and mannequin depth. At excessive batch sizes and lengthy contexts with 100K tokens throughout dozens of concurrent requests the KV cache consumes a big fraction of GPU reminiscence. Compressing it’s a direct option to improve batch dimension and cut back reminiscence visitors.
The plain strategy is quantization. However pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior strategies both collapse in accuracy or require customized serving layouts incompatible with paged KV-cache techniques. Collectively AI’s OSCAR (Offline Spectral Covariance-Conscious Rotation) addresses each issues.
Why INT2 KV Cache Quantization is Arduous
KV activations include channel-wise outliers. A small subset of channels holds extraordinarily massive values. Most channels are well-behaved. Once you apply INT2 quantization which has solely 4 representable ranges and people outliers dominate the dimensions issue. The quantizer wastes most of its vary on uncommon spikes. Regular values get compressed into only one or two efficient ranges. This degrades consideration high quality considerably.
Rotation-based quantization addresses this by making use of a set orthogonal rework, usually a Hadamard rework, to redistribute outlier power throughout all channels. This strategy works moderately nicely at INT4. At INT2, a deeper downside stays: the rotation is data-oblivious. It may well easy activation ranges, but it surely doesn’t know which instructions the eye mechanism really reads. Spreading quantization error uniformly just isn’t the identical as pushing it into low-importance instructions. At INT2, with solely 4 ranges, that distinction determines whether or not the mannequin works in any respect.

What OSCAR Does In a different way
OSCAR’s key remark is that the rotation utilized earlier than quantization ought to be derived from consideration statistics themselves — not from the uncooked distribution of KV activations.
For keys, the downstream error that issues just isn’t the Euclidean reconstruction error of Okay. It’s the error in consideration logits. The analysis crew confirmed this error is: ‖QK⊤ − QK̂⊤‖²F = tr((Okay − Okaŷ)Q⊤Q(Okay − Okaŷ)⊤). The weighting matrix is the question covariance Q⊤Q, not Okay⊤Okay. Instructions the place queries have massive power amplify quantization errors in logits. OSCAR estimates the empirical question covariance CQ = (1/N) Σ qn⊤qn from a calibration set, eigen-decomposes it, and makes use of the eigenvectors UQ as the important thing rotation foundation.
For values, the related error is within the consideration output SV. This is determined by how the eye rating matrix S weights every worth row. The analysis crew defines the score-weighted worth covariance CS = (1/N) V⊤S⊤SV. Instructions that stay massive after aggregation by S are those quantization error propagates via. OSCAR makes use of the eigenvectors US of CS as the worth rotation foundation.
The ultimate composed rotations are:
RK = UQ · HHad · PbrRV = US · HHad · Pbr
Every of the three elements addresses a definite failure mode of per-group low-bit quantization:
- UQ / US aligns channels with attention-importance instructions. This diagonalizes the error-weighting matrix so an important instructions are identifiable.
- HHad (Walsh-Hadamard rework) then equalizes channel significance precisely. Lemma 1 within the analysis paper proves each diagonal entry of
HHad⊤ Λ HHadequalstr(Λ)/d— the peaky eigenspectrum uncovered by UQ is compressed to a uniform worth throughout all channels. - Pbr (permuted bit-reversal) reorders channels in order that for any power-of-two quantization group dimension, every group receives one consultant from every degree of the significance hierarchy.
The analysis crew offers Theorem 1 proving UQ and US are optimum below a frozen-error surrogate goal with diagonal residual assumptions.
The Serving System: Blended-Precision Cache Structure
OSCAR integrates into SGLang’s manufacturing serving stack as an INT2 KV-cache mode with full compatibility with paged consideration.
The KV cache structure makes use of three areas per request:
- Sink tokens (first S0 = 64 tokens): saved in BF16. These operate as consideration sinks.
- Current tokens (final W = 256 tokens earlier than present place): saved in BF16.
- Historical past tokens (every part in between): saved as INT2 after OSCAR rotation and clipping.
At 128K context size, the BF16 sink and up to date home windows characterize solely 0.24% of complete tokens. The ablation (Desk 5 within the analysis paper) exhibits (S=64, R=256) is the accuracy-efficiency knee: smaller home windows noticeably harm accuracy; bigger home windows give negligible extra profit at larger BF16 reminiscence value.


Write and skim paths use fused Triton kernels. On the write path, every token is rotated, clipped to a calibration-derived percentile threshold (typical values: cK = 0.96, cV = 0.92), then quantized with per-token uneven INT2 at a default group dimension of GK = 64 channels per group. On the learn path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes outcomes to the eye kernel — multi functional fused move with out additional reminiscence visitors. The worth rotation RV is absorbed into the mannequin’s projection weights offline, eliminating its on-line compute value.
End result
The analysis crew evaluated OSCAR on 4 mannequin configurations: Qwen3-4B-Pondering-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks embody AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K most technology size.
Accuracy (at 2.28 bits per KV ingredient):
| Mannequin | BF16 Imply | OSCAR Imply | Hole to BF16 |
|---|---|---|---|
| Qwen3-4B-Pondering-2507 | 75.64 | 71.86 | −3.78 |
| Qwen3-8B | 70.84 | 69.42 | −1.42 |
| Qwen3-32B | 74.19 | 74.17 | −0.02 |
| GLM-4.7-FP8 (358B) | 77.89 | 78.16 | +0.27 |
For context on how competing strategies examine: naive INT2 (no rotation) scores 0.00 on each Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 factors on Qwen3-4B-Pondering. Noticed-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B — OSCAR at 2.28 bits reaches 71.86.


The analysis crew additionally in contrast towards channel-wise strategies on AIME25 (Desk 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67±3.33 — above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Be aware that channel-wise strategies require residual buffers or customized web page layouts that don’t match normal paged-attention serving, so this comparability is restricted to the one shared benchmark the place outcomes had been out there.
Lengthy-context robustness (RULER-NIAH):
| Mannequin | Technique | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|
| Qwen3-4B-Pondering | BF16 | 99.7 | 99.3 | 85.3 | 81.0 |
| Qwen3-4B-Pondering | QuaRot-INT2 | 0.0 | 0.0 | 15.6 | 0.0 |
| Qwen3-4B-Pondering | OSCAR | 97.8 | 87.6 | 61.9 | 39.5 |
| Qwen3-8B | BF16 | 98.9 | 97.3 | 79.2 | 78.2 |
| Qwen3-8B | QuaRot-INT2 | 19.0 | 9.8 | 0.0 | 0.0 |
| Qwen3-8B | OSCAR | 93.9 | 86.3 | 61.9 | 45.0 |
On GLM-4.7-FP8, OSCAR matches the BF16 curve via 128K.
Throughput (H100, 100K context, batch dimension 1):
Decode throughput speedup relative to BF16, at growing context lengths:
| Mannequin | 30K | 60K | 100K |
|---|---|---|---|
| Qwen3-4B-Pondering | 1.98× | 2.52× | 3.08× |
| Qwen3-8B | 1.84× | 2.29× | 2.88× |
| GLM-4.7-FP8 | 1.98× | 2.49× | 2.83× |
At batch dimension 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Pondering and seven.83× on GLM-4.7-FP8. The speedup will increase with context size as a result of decoding turns into more and more KV-bandwidth-bound. Decreasing KV reminiscence by 8× straight reduces that bottleneck. The net rotation overhead is absorbed into the decode kernels.
Marktechpost’s Visible Explainer
Key Takeaways
- OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations utilizing attention-aware covariance matrices, not generic Hadamard transforms.
- At 2.28 bits per KV ingredient, OSCAR stays inside 3.78 factors of BF16 accuracy on Qwen3-4B-Pondering whereas naive INT2 collapses to zero.
- KV cache reminiscence drops roughly 8×, decode pace improves as much as 3× at 100K context, and job-level throughput reaches as much as 7.83× at massive batch sizes.
- Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 can be found in RotationZoo — no recalibration wanted.
- OSCAR integrates straight into SGLang with full paged KV-cache and prefix cache compatibility, requiring no modifications to the inference shopper.
Try the Repo on GitHub, Modelscope and Research Paper. Additionally, be happy to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Have to associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so on.? Connect with us

