Tuesday, May 26, 2026
banner
Top Selling Multipurpose WP Theme

Lengthy-context inference makes the KV cache one of many fundamental prices of serving LLMs. Throughout autoregressive decoding, the cache grows with context size, batch dimension, and mannequin depth. At excessive batch sizes and lengthy contexts with 100K tokens throughout dozens of concurrent requests the KV cache consumes a big fraction of GPU reminiscence. Compressing it’s a direct option to improve batch dimension and cut back reminiscence visitors.

The plain strategy is quantization. However pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior strategies both collapse in accuracy or require customized serving layouts incompatible with paged KV-cache techniques. Collectively AI’s OSCAR (Offline Spectral Covariance-Conscious Rotation) addresses each issues.

Why INT2 KV Cache Quantization is Arduous

KV activations include channel-wise outliers. A small subset of channels holds extraordinarily massive values. Most channels are well-behaved. Once you apply INT2 quantization which has solely 4 representable ranges and people outliers dominate the dimensions issue. The quantizer wastes most of its vary on uncommon spikes. Regular values get compressed into only one or two efficient ranges. This degrades consideration high quality considerably.

Rotation-based quantization addresses this by making use of a set orthogonal rework, usually a Hadamard rework, to redistribute outlier power throughout all channels. This strategy works moderately nicely at INT4. At INT2, a deeper downside stays: the rotation is data-oblivious. It may well easy activation ranges, but it surely doesn’t know which instructions the eye mechanism really reads. Spreading quantization error uniformly just isn’t the identical as pushing it into low-importance instructions. At INT2, with solely 4 ranges, that distinction determines whether or not the mannequin works in any respect.

https://arxiv.org/pdf/2605.17757v1

What OSCAR Does In a different way

OSCAR’s key remark is that the rotation utilized earlier than quantization ought to be derived from consideration statistics themselves — not from the uncooked distribution of KV activations.

For keys, the downstream error that issues just isn’t the Euclidean reconstruction error of Okay. It’s the error in consideration logits. The analysis crew confirmed this error is: ‖QK − QK̂‖²F = tr((Okay − Okaŷ)QQ(Okay − Okaŷ)). The weighting matrix is the question covariance QQ, not OkayOkay. Instructions the place queries have massive power amplify quantization errors in logits. OSCAR estimates the empirical question covariance CQ = (1/N) Σ qnqn from a calibration set, eigen-decomposes it, and makes use of the eigenvectors UQ as the important thing rotation foundation.

For values, the related error is within the consideration output SV. This is determined by how the eye rating matrix S weights every worth row. The analysis crew defines the score-weighted worth covariance CS = (1/N) VSSV. Instructions that stay massive after aggregation by S are those quantization error propagates via. OSCAR makes use of the eigenvectors US of CS as the worth rotation foundation.

The ultimate composed rotations are:

RK = UQ · HHad · Pbr
RV = US · HHad · Pbr

Every of the three elements addresses a definite failure mode of per-group low-bit quantization:

  • UQ / US aligns channels with attention-importance instructions. This diagonalizes the error-weighting matrix so an important instructions are identifiable.
  • HHad (Walsh-Hadamard rework) then equalizes channel significance precisely. Lemma 1 within the analysis paper proves each diagonal entry of HHad Λ HHad equals tr(Λ)/d — the peaky eigenspectrum uncovered by UQ is compressed to a uniform worth throughout all channels.
  • Pbr (permuted bit-reversal) reorders channels in order that for any power-of-two quantization group dimension, every group receives one consultant from every degree of the significance hierarchy.

The analysis crew offers Theorem 1 proving UQ and US are optimum below a frozen-error surrogate goal with diagonal residual assumptions.

The Serving System: Blended-Precision Cache Structure

OSCAR integrates into SGLang’s manufacturing serving stack as an INT2 KV-cache mode with full compatibility with paged consideration.

The KV cache structure makes use of three areas per request:

  • Sink tokens (first S0 = 64 tokens): saved in BF16. These operate as consideration sinks.
  • Current tokens (final W = 256 tokens earlier than present place): saved in BF16.
  • Historical past tokens (every part in between): saved as INT2 after OSCAR rotation and clipping.

At 128K context size, the BF16 sink and up to date home windows characterize solely 0.24% of complete tokens. The ablation (Desk 5 within the analysis paper) exhibits (S=64, R=256) is the accuracy-efficiency knee: smaller home windows noticeably harm accuracy; bigger home windows give negligible extra profit at larger BF16 reminiscence value.

https://arxiv.org/pdf/2605.17757

Write and skim paths use fused Triton kernels. On the write path, every token is rotated, clipped to a calibration-derived percentile threshold (typical values: cK = 0.96, cV = 0.92), then quantized with per-token uneven INT2 at a default group dimension of GK = 64 channels per group. On the learn path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes outcomes to the eye kernel — multi functional fused move with out additional reminiscence visitors. The worth rotation RV is absorbed into the mannequin’s projection weights offline, eliminating its on-line compute value.

End result

The analysis crew evaluated OSCAR on 4 mannequin configurations: Qwen3-4B-Pondering-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks embody AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K most technology size.

Accuracy (at 2.28 bits per KV ingredient):

Mannequin BF16 Imply OSCAR Imply Hole to BF16
Qwen3-4B-Pondering-2507 75.64 71.86 −3.78
Qwen3-8B 70.84 69.42 −1.42
Qwen3-32B 74.19 74.17 −0.02
GLM-4.7-FP8 (358B) 77.89 78.16 +0.27

For context on how competing strategies examine: naive INT2 (no rotation) scores 0.00 on each Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 factors on Qwen3-4B-Pondering. Noticed-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B — OSCAR at 2.28 bits reaches 71.86.

https://arxiv.org/pdf/2605.17757

The analysis crew additionally in contrast towards channel-wise strategies on AIME25 (Desk 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67±3.33 — above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Be aware that channel-wise strategies require residual buffers or customized web page layouts that don’t match normal paged-attention serving, so this comparability is restricted to the one shared benchmark the place outcomes had been out there.

Lengthy-context robustness (RULER-NIAH):

Mannequin Technique 16K 32K 64K 128K
Qwen3-4B-Pondering BF16 99.7 99.3 85.3 81.0
Qwen3-4B-Pondering QuaRot-INT2 0.0 0.0 15.6 0.0
Qwen3-4B-Pondering OSCAR 97.8 87.6 61.9 39.5
Qwen3-8B BF16 98.9 97.3 79.2 78.2
Qwen3-8B QuaRot-INT2 19.0 9.8 0.0 0.0
Qwen3-8B OSCAR 93.9 86.3 61.9 45.0

On GLM-4.7-FP8, OSCAR matches the BF16 curve via 128K.

Throughput (H100, 100K context, batch dimension 1):

Decode throughput speedup relative to BF16, at growing context lengths:

Mannequin 30K 60K 100K
Qwen3-4B-Pondering 1.98× 2.52× 3.08×
Qwen3-8B 1.84× 2.29× 2.88×
GLM-4.7-FP8 1.98× 2.49× 2.83×

At batch dimension 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Pondering and seven.83× on GLM-4.7-FP8. The speedup will increase with context size as a result of decoding turns into more and more KV-bandwidth-bound. Decreasing KV reminiscence by 8× straight reduces that bottleneck. The net rotation overhead is absorbed into the decode kernels.

Marktechpost’s Visible Explainer

OSCAR — How-To Information
01 / 08

01

Overview

What’s OSCAR?

OSCAR (Offline Spectral Covariance-Conscious Rotation) is a 2-bit KV cache quantization system from Collectively AI for long-context LLM serving.

As a substitute of making use of a generic Hadamard rotation, OSCAR derives attention-aware rotations from a one-time offline calibration move — aligning quantization noise with instructions that focus is least delicate to.

The outcome: INT2 precision with near-BF16 accuracy and full compatibility with paged KV-cache serving.


KV Reminiscence Discount


Decode Speedup

2.28
Bits Per KV Component

02

Setup

Conditions

Earlier than getting began, ensure you have the next in place:

  • 01
    {Hardware}: NVIDIA H100 GPU (80 GB) really useful. A100 may go for smaller fashions.
  • 02
    SGLang put in: OSCAR is built-in into the SGLang serving framework. Set up the most recent model from supply.
  • 03
    Triton: Customized fused kernels are written in Triton. Triton ships with most up-to-date PyTorch / SGLang installs.
  • 04
    A supported mannequin: Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7. Pre-computed rotations can be found for all of those.
pip set up sglang[all] --upgrade
pip set up triton

03

Step 1

Obtain Pre-Computed Rotations by way of RotationZoo

Collectively AI publishes pre-computed rotation matrices and clip thresholds for supported fashions in RotationZoo on ModelScope. No recalibration wanted.

from modelscope import snapshot_download

# Obtain RotationZoo to your mannequin
rotation_path = snapshot_download(
    'togethercomputer/OSCAR-RotationZoo'
)

The downloaded artifact incorporates per-layer RK, RV rotation matrices and clip thresholds cK, cV for every supported mannequin. These are fastened offline parameters — they don’t seem to be up to date at runtime.

Qwen3-4B / 8B / 32B2.28 BPE

GLM-4.7-FP8 (358B)2.28 BPE

MiniMax-M2.72.28 BPE

Customized (run calibration)any mannequin

04

Step 2 (Non-obligatory)

Run Offline Calibration for a Customized Mannequin

In case your mannequin just isn’t in RotationZoo, run the one-time calibration move. OSCAR dumps Q, Okay, V activations from a small dataset, estimates attention-aware covariance, and writes out rotation matrices and clip thresholds.

python calibrate_oscar.py 
  --model-path /path/to/your-model 
  --calib-data gpqa_diamond 
  --calib-tokens 8192 
  --output-dir ./oscar_rotations/
Calibration just isn’t task-specific. The paper exhibits that outcomes are low-sensitivity to area (MMLU, WikiText, GPQA-Diamond all produce related accuracy). Run it as soon as and reuse throughout all duties.

Typical values produced: cK ≈ 0.96, cV ≈ 0.92 per layer.

05

Step 3

Launch SGLang with INT2 KV Cache Enabled

Cross the rotation path and allow INT2 KV mode when launching the SGLang server.

python -m sglang.launch_server 
  --model-path Qwen/Qwen3-8B 
  --kv-cache-dtype int2 
  --oscar-rotation-path ./oscar_rotations/ 
  --oscar-sink-size 64 
  --oscar-recent-size 256 
  --tp 1 
  --port 30000
Tensor parallelism is supported. For Qwen3-32B use --tp 2 (2×H100). For GLM-4.7-FP8 use --tp 8 (8×H100).

The server exposes a normal OpenAI-compatible API. No client-side modifications are wanted.

06

Step 4

Key Configuration Parameters

Parameter Default What it controls
–oscar-sink-size 64 First N tokens saved in BF16 as consideration sinks
–oscar-recent-size 256 Final N tokens saved in BF16 earlier than present place
cK (clip ratio) 0.96 Percentile clip for rotated key activations
cV (clip ratio) 0.92 Percentile clip for rotated worth activations
Group dimension GK 64 Channels per INT2 quantization group (head dim)

The paper identifies (sink=64, current=256) because the accuracy-efficiency knee. Smaller home windows cut back accuracy noticeably; bigger home windows add BF16 reminiscence overhead with negligible achieve.

07

Step 5

Run Inference and Confirm

As soon as the server is operating, question it with the usual OpenAI shopper:

from openai import OpenAI

shopper = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="none"
)

response = shopper.chat.completions.create(
    mannequin="Qwen/Qwen3-8B",
    messages=[{"role": "user",
               "content": "Your long-context prompt here"}],
    max_tokens=1024
)
print(response.decisions[0].message.content material)

Prefix caching works out of the field. OSCAR preserves the usual paged KV-cache abstraction, so SGLang’s radix cache and prefix reuse operate usually. No application-level modifications are wanted.

08

Outcomes

Accuracy vs BF16 Baseline

Averaged throughout AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500 at 32K technology size.

Qwen3-4B-Pondering

−3.78

Paper: arXiv:2605.17757   RotationZoo: modelscope.cn/fashions/togethercomputer/OSCAR-RotationZoo

Key Takeaways

  • OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations utilizing attention-aware covariance matrices, not generic Hadamard transforms.
  • At 2.28 bits per KV ingredient, OSCAR stays inside 3.78 factors of BF16 accuracy on Qwen3-4B-Pondering whereas naive INT2 collapses to zero.
  • KV cache reminiscence drops roughly 8×, decode pace improves as much as 3× at 100K context, and job-level throughput reaches as much as 7.83× at massive batch sizes.
  • Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 can be found in RotationZoo — no recalibration wanted.
  • OSCAR integrates straight into SGLang with full paged KV-cache and prefix cache compatibility, requiring no modifications to the inference shopper.

Try the Repo on GitHub, Modelscope and Research PaperAdditionally, be happy to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Have to associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so on.? Connect with us


banner
Top Selling Multipurpose WP Theme

Converter

Top Selling Multipurpose WP Theme

Newsletter

Subscribe my Newsletter for new blog posts, tips & new photos. Let's stay updated!

banner
Top Selling Multipurpose WP Theme

Leave a Comment

banner
Top Selling Multipurpose WP Theme

Latest

Best selling

22000,00 $
16000,00 $
6500,00 $
5999,00 $

Top rated

6500,00 $
22000,00 $
900000,00 $

Products

Knowledge Unleashed
Knowledge Unleashed

Welcome to Ivugangingo!

At Ivugangingo, we're passionate about delivering insightful content that empowers and informs our readers across a spectrum of crucial topics. Whether you're delving into the world of insurance, navigating the complexities of cryptocurrency, or seeking wellness tips in health and fitness, we've got you covered.