MiniMax Sparse Attention (MSA): two-branch block-sparse attention trained with a 109B-parameter MoE on a 3T-token budget

by root June 17, 2026

MiniMax has released MSA (MiniMax Sparse Attention), a sparse attention method built directly on Grouped Query Attention (GQA). It targets one bottleneck: the quadratic cost of softmax attention in long contexts. The MiniMax research team tested it inside a 109B-parameter mixture-of-experts model trained on native multimodal data. The team also open-sourced the inference kernel and shipped the production model MiniMax-M3.

What is MSA (MiniMax Sparse Attention)?

MSA (MiniMax Sparse Attention) divides attention into two stages: an index branch and a main branch. The index branch decides which blocks of keys and values each query reads. The main branch then performs exact softmax attention only on those blocks.

Selection is done at block granularity, not token by token. The default block size is B_k = 128 tokens. Each query and GQA group selects k = 16 blocks. This fixes the per-query budget at k·B_k = 2,048 key-value tokens.

The two have different cost structures. Dense GQA attention scales as O(N) per query, i.e. with the whole context. MSA is O(k·B_k), which stays fixed as N grows. The compute gap therefore widens as the context length increases.
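
The budget arithmetic above can be checked with a short sketch. It uses only the defaults quoted in this article (128-token blocks, 16 blocks per query); the function names are illustrative, not part of any MiniMax API. The ratio counts only the key-value tokens read by the main branch. It ignores the index branch, which still scores every visible key token, so it is an upper bound rather than a measured speedup.

```python
# Per-query key-value budget of the main branch, using the defaults quoted above.
BLOCK_SIZE = 128   # B_k: tokens per block
TOP_K = 16         # k: blocks selected per query and GQA group


def sparse_budget(block_size: int = BLOCK_SIZE, top_k: int = TOP_K) -> int:
    """Key-value tokens read by one query in the main branch."""
    return top_k * block_size


def dense_to_sparse_ratio(context_len: int) -> float:
    """How many times more key-value tokens dense attention reads per query."""
    # Short contexts fit entirely inside the budget, so the ratio floors at 1.
    return context_len / min(context_len, sparse_budget())


for n in (8 * 1024, 128 * 1024, 1024 * 1024):
    read = min(n, sparse_budget())
    print(f"N={n:>9,}  dense={n:>9,}  sparse={read:,}  ratio={dense_to_sparse_ratio(n):.0f}x")
```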

Selection is shared within each GQA group, but independent between groups. One key/value head serves several query heads, which share one set of blocks. Different groups can attend to different long-distance regions.

How the two branches work

The index branch adds only two projection matrices to the standard GQA layer. They define one index query head and one shared index key head for each GQA group. The branch scores the visible key tokens and max-pools these scores to the block level.

The Top-k operator then selects the highest-scoring blocks for each query and group. The local block containing the query is always included. This prevents the selector from dropping the immediate neighborhood of the query.
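
A minimal pure-Python sketch of this selection step, for one query of one GQA group, might look as follows. The function name, the dot-product scoring, and the tie-breaking rule are assumptions made for illustration; the released kernel does this on the GPU and its exact scoring details are not given here.

```python
import math


def select_blocks(index_q, index_keys, query_pos, block_size, top_k):
    """Pick block ids for one query of one GQA group (illustrative sketch).

    index_q:    list[float], index query vector of the query token
    index_keys: list[list[float]], shared index key vector of every token
    query_pos:  position of the query token; later keys are not visible
    """
    # 1. Score every causally visible key token (dot product is an assumption).
    visible = index_keys[: query_pos + 1]
    scores = [sum(q * k for q, k in zip(index_q, key)) for key in visible]

    # 2. Max-pool token scores to block level.
    num_blocks = math.ceil(len(visible) / block_size)
    block_scores = [
        max(scores[b * block_size : (b + 1) * block_size]) for b in range(num_blocks)
    ]

    # 3. Always keep the local block, then fill the remaining slots by score.
    local_block = query_pos // block_size
    others = sorted(
        (b for b in range(num_blocks) if b != local_block),
        key=lambda b: block_scores[b],
        reverse=True,
    )
    return sorted([local_block] + others[: top_k - 1])


# Toy example: 8 tokens, blocks of 2 tokens, 2 blocks per query.
toy_keys = [[0.1], [0.2], [5.0], [0.0], [0.3], [0.1], [0.2], [0.4]]
print(select_blocks([1.0], toy_keys, query_pos=7, block_size=2, top_k=2))
```

In the toy run the query sits in block 3, which is kept unconditionally, and the one remaining slot goes to block 1 because it holds the highest-scoring key.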

The main branch gathers the causally visible tokens from the selected blocks. It applies scaled dot-product softmax attention restricted to these tokens. Each query head keeps its own query projection, but shares the group's set of blocks.
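
The main branch can be sketched the same way for a single query head. This is a plain-Python illustration under the assumption that the block ids have already been chosen and include the local block (so at least one token is visible); it is not the released kernel.

```python
import math


def sparse_attention(q, keys, values, query_pos, block_ids, block_size):
    """Exact softmax attention for one query head over the selected blocks only."""
    # Gather the causally visible token positions from the selected blocks.
    positions = [
        t
        for b in block_ids
        for t in range(b * block_size, (b + 1) * block_size)
        if t <= query_pos and t < len(keys)
    ]

    # Scaled dot-product logits, restricted to the gathered tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[t])) for t in positions]

    # Numerically stable softmax over the gathered tokens.
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Weighted sum of the gathered value vectors.
    dim = len(values[0])
    out = [sum(p * values[t][d] for p, t in zip(probs, positions)) for d in range(dim)]
    return out, positions, probs
```

Every query head in a group would call this with its own `q` but the same `block_ids`, which is what sharing the group's set of blocks means in practice.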

The visualization in the report shows the selections made by the learned indexer. The heads concentrate on the local diagonal and the first block. They save the rest of their budget for a few long-distance stripes.

How to train MSA

Because Top-k selection is non-differentiable, the index projections cannot be trained by the language modeling loss. MSA solves this with a KL alignment loss. The loss matches the index branch's distribution to the attention pattern of the main branch. The teacher is the group-mean main-branch distribution over the selected tokens.
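
A small sketch of such an alignment loss for one query and one GQA group is shown below. The direction KL(teacher || student), the softmax over index scores at the selected tokens, and the function names are assumptions for illustration; the article does not spell out the exact formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]


def kl_alignment_loss(main_head_probs, index_logits, eps=1e-12):
    """KL(teacher || student) over the selected tokens (assumed direction).

    main_head_probs: one main-branch attention distribution per query head
                     in the group, each over the same selected tokens
    index_logits:    index-branch scores for those same tokens
    """
    num_heads = len(main_head_probs)
    # Teacher: mean of the main-branch distributions across the group's heads.
    teacher = [
        sum(head[i] for head in main_head_probs) / num_heads
        for i in range(len(index_logits))
    ]
    # Student: the index branch's distribution over the same tokens.
    student = softmax(index_logits)
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher, student)
        if p > 0
    )


heads = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
print(kl_alignment_loss(heads, [2.0, 1.0, 0.0]))  # small: the indexer roughly agrees
print(kl_alignment_loss(heads, [0.0, 1.0, 2.0]))  # larger: the indexer ranks tokens backwards
```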

Three mechanisms stabilize sparse training. Gradient detach applies a stop-gradient to the index branch input. This confines the KL loss to the index projections rather than the backbone. Without it, large KL coefficients would cause gradient spikes and loss divergence.

Indexer warmup runs full attention on both branches during the first iterations. The indexer learns from the KL loss before it controls routing. Forcing the local block reserves one slot for nearby context.

Ablations shaped the final recipe. An earlier variant added an index-branch value head with its own output. With warmup, that value head is no longer needed. The final design removes it for efficiency.

MSA supports two training routes. MSA-PT trains from scratch after warming up the indexer for 40B tokens. MSA-CPT converts a dense GQA checkpoint trained on 2.6T tokens. It then continues for 400B tokens, including a 40B-token warmup.
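
As a quick consistency check, the MSA-CPT figures above add up to the 3T-token budget named in the title. The sketch uses only the numbers quoted in this article; the variable names are illustrative.

```python
# Token accounting for the MSA-CPT route, in billions of tokens.
DENSE_PRETRAIN_B = 2_600   # dense GQA checkpoint: 2.6T tokens
SPARSE_CONTINUE_B = 400    # continued training with MSA: 400B tokens
WARMUP_B = 40              # indexer warmup, counted inside the 400B

total_b = DENSE_PRETRAIN_B + SPARSE_CONTINUE_B
warmup_share = WARMUP_B / SPARSE_CONTINUE_B

print(f"total budget: {total_b / 1000:.1f}T tokens")
print(f"warmup share of the sparse phase: {warmup_share:.0%}")
```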

Kernel co-design

Theoretical sparsity does not translate into speed without matching GPU paths. MSA pairs the algorithm with two kernel ideas.

The first is exp-free Top-k selection. Softmax preserves order, so ranking the raw scores yields identical indices. The kernel skips the max, exp, and sum steps before selection. In a 128K context with k = 16, it ran 5.1x faster than torch.topk. It also outperformed the TileLang radix-select kernel by a factor of 3.7.
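
The order-preservation argument is easy to verify on a toy example. The sketch below is plain Python, not the GPU kernel: it only checks that Top-k over raw scores and Top-k over their softmax pick the same indices. The helper names and the synthetic scores are made up for illustration.

```python
import math


def topk_indices(values, k):
    """Indices of the k largest values, highest first (ties broken by index)."""
    return sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]


def softmax(xs):
    """Numerically stable softmax: the max, exp, and sum steps the kernel skips."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]


# 64 distinct synthetic block scores in a shuffled order.
raw_scores = [((i * 37) % 64) / 8.0 - 4.0 for i in range(64)]

# Softmax is strictly increasing in each score, so the ranking is unchanged.
assert topk_indices(raw_scores, 16) == topk_indices(softmax(raw_scores), 16)
```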

The second is KV-outer sparse attention with query gathering. Iterating over KV blocks raises the arithmetic intensity compared with iterating over queries. The kernel packs ⌈128/G⌉ query positions into one 128×128 score MMA. A two-phase schedule splits the attention and combine steps across CTAs.
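
The packing rule can be worked through numerically. The sketch assumes G is the number of query heads per GQA group, which the article does not define explicitly, and the function name is hypothetical.

```python
import math

TILE_ROWS = 128  # rows of the 128x128 score MMA tile quoted above


def packed_query_positions(group_size: int, tile_rows: int = TILE_ROWS) -> int:
    """Query positions packed into one tile: ceil(tile_rows / G)."""
    return math.ceil(tile_rows / group_size)


for g in (1, 4, 8, 16):
    positions = packed_query_positions(g)
    print(f"G={g:>2}: {positions:>3} query positions x {g} heads = {positions * g} rows")
```

When G divides 128, the packed positions fill the tile exactly; otherwise the ceiling means the last position only partly fits.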

The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships a dense FlashAttention kernel and a sparse Top-k kernel under the MIT license. It supports BF16, FP8, NVFP4, and FP4 precision.

Comparison of MSA with other sparse methods

The research team positions MSA against four natively trained sparse designs.

The table below summarizes the differences described.

Method	Backbone	Selection granularity	Indexer/selection signal
MSA	GQA	Block level (`B_k = 128`), Top-k per GQA group	KL alignment loss
NSA	MQA/MHA	Compression + selected blocks + sliding window	Native (end-to-end) training
InfLLM-V2	Dense↔sparse switchable	Parameter-free block selection + sliding window	No parameters (no trained indexer)
MoBA	GQA	Very large KV blocks (block-averaged keys)	LM gradient only
DSA	MLA (MQA mode)	Token level; a single Top-k is shared by all heads	ReLU lightning indexer

What sets MSA apart is the combination of per-GQA-group Top-k sharing and block-level selection. This keeps KV reads contiguous while giving each group its own retrieval pattern.

Quality holds up. Both sparse models remain broadly competitive with the full attention baseline.

The table below shows representative results at a 3T-token budget.

Benchmark	Full attention	MSA-PT	MSA-CPT
MMLU	67.0	67.2	66.8
GSM8K	76.2	77.7	73.7
HumanEval	61.0	64.0	57.9
RULER-8K	79.8	84.2	77.2
RULER-32K	75.0	77.5	75.7
VideoMME	41.11	45.48	39.65

After long-context extension, MSA-CPT stayed close to full attention on HELMET-128K and RULER-128K. Each query still processes only 2,048 key-value tokens.

Use cases and examples

MSA is meant for workloads the place context size is a binding deployment constraint.

Long-horizon agents: Agents accumulate large transcripts spanning hundreds of reasoning and action steps. Dense attention over that history grows quadratically. MSA keeps a per-query budget of 2,048 tokens, regardless of length.
Repository-wide code reasoning: When a coding agent loads an entire repository, the context can exceed hundreds of thousands of tokens. The indexer routes each query to a few relevant blocks. Unrelated files stay outside the selected set.
Persistent memory: A long-running assistant keeps growing its conversation state. MSA reads a fixed-size slice of the most relevant blocks for each query. As memory grows, decoding cost stays roughly constant.
Long-video understanding: The model is natively multimodal and trained on image and video data. MSA-PT achieved the highest scores of the three runs on several video benchmarks, including VideoMME and TemporalBench. The sparse selection scales to long visual token sequences.

Running the kernel

The fastest path uses the Hugging Face kernels library.

# pip install -U kernels
from kernels import get_kernel

kernel_module = get_kernel("MiniMaxAI/msa", version=0)
sparse_atten_func = kernel_module.sparse_atten_func

sparse_atten_func(...)

The planner, indexer, and attention entry points are also exposed directly in the repository.

import torch
from fmha_sm100 import fmha_sm100, fmha_sm100_plan, sparse_topk_select

page_size, topk = 128, 16

# Dense proxy pass: per-block max score from a cheap Q slice.
proxy_plan = fmha_sm100_plan(
    qo_lens, kv_lens, proxy_q.shape[1],
    num_kv_heads=1, page_size=page_size, output_maxscore=True,
)
_, max_score = fmha_sm100(
    proxy_q, proxy_k_pages, proxy_v_pages, proxy_plan,
    kv_indices=kv_indices, output_o=False, output_maxscore=True,
)

# Block scores -> selected KV block indexes.
kv_block_indexes = sparse_topk_select(
    max_score.contiguous(), topk, num_valid_pages=num_pages,
)

# Sparse attention over the selected blocks.
sparse_plan = fmha_sm100_plan(
    qo_lens, kv_lens, q.shape[1],
    num_kv_heads=k_pages.shape[1], page_size=page_size, kv_block_num=topk,
)
out, _ = fmha_sm100(
    q, k_pages, v_pages, sparse_plan,
    kv_indices=kv_indices, kv_block_indexes=kv_block_indexes,
)

These are the repository's official usage examples. The inputs are paged key-value tensors prepared by the caller. The first run JIT-compiles the indexer, which may take several minutes. Requirements are an SM100 GPU, the CUDA toolkit, and Python 3.10 or later.

Advantages and disadvantages

Strengths

In the reported configuration, per-token attention compute is reduced by a factor of 28.4 at 1M context.
Measured wall-clock speedups reached 14.2x for prefill and 7.6x for decoding at 1M context on H800.
The design adds only two projection matrices to the standard GQA layer.
It supports both training from scratch and conversion from dense checkpoints.
The inference kernel is released under the MIT License.

Weaknesses and open questions

The released kernel targets NVIDIA SM100. Other architectures require separate work.
While some subtasks match full attention, long-context retrieval still shows gaps.
Reported speedups assume specific head configurations and an H800 setup.
The KL loss adds training-time complexity compared with a plain dense layer.
Results come from MiniMax's own evaluation suite, not from third-party reproductions.
