The MoonMath AI staff has launched the bf16 ahead consideration kernel for AMD’s MI300X GPU. Written in HIP, not hand-written meeting. The code is open supply underneath the MIT license. The MoonMath.ai staff studies that it outperforms AMD’s proprietary optimized kernel, AITER v3, in all shapes examined. Naked metallic entry was supplied by HotAisle, an AMD cloud supplier.
The main focus is on fusion softmax(QKᵀ/√d)·V Operation inside all transformers. MI300X is AMD’s CDNA3 datacenter GPU with ISA goal (gfx942). This kernel will solely run on that {hardware}.
TL;DR
- MoonMath.ai is open sourcing a bf16 ahead consideration kernel for AMD MI300X written in HIP somewhat than meeting (MIT).
- Outperforms AMD’s AITER v3 in all shapes and rounding modes (geometric imply 1.18x/1.15x/1.08x, max 1.26x).
- Vital trick: The one-instruction asm wrapper permits the compiler to decide on the opcode when allocating registers.
- A lot of the speedup is in reminiscence placement. That’s, Ok in LDS, V scorching in L1, Q within the register and the accumulator.
- Precise SGLang PR used this to hurry up the unfold of Wan2.1 movies by 1.23x with none high quality loss.
Perceive the kernel
A kernel is a small program that runs immediately on the GPU’s many cores and performs one particular computation (on this case, an consideration computation) as quick because the {hardware} permits. The kernel solely calculates ahead consideration on bf16 for MI300X. It receives enter in BSHD or BHSD structure, however with out transposition. Head dimensions are fastened at 128. Helps arbitrary sequence lengths, together with cross-attention.
Actuality has its limits. There aren’t any causal masks, GQA, and varlen batching. The output is bf16 and runs solely on gfx942 {hardware}.
Numbers are strictly managed. All three rounding modes match AITER’s mode-specific rounding guidelines. All finite outputs fall inside 1 bf16 ULP of AITER. Dealing with NaN and Inf is bitwise similar, and the outcomes are deterministic.
Core trick: one-instruction asm wrapper
This core expertise avoids frequent dilemmas. Compiler intrinsics preserve your code tidy, however permit the compiler to reorder and rename operands. Uncooked inline meeting offers you management, however forces guide administration of registers and addresses.
MoonMath wraps precisely one instruction in a single format. __device__ __forceinline__ operate. Prolonged asm constraints describe operands. The analysis staff selects an opcode. The compiler continues to allocate registers and observe information circulation.
// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
asm unstable("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
: "+v"(c) : "v"(a), "v"(b));
}
of "+v"(c) The constraint ties the enter and output of the accumulator to the identical VGPR. No copy directions are issued. This makes the kernel extra like common HIP. Nonetheless, you use the machine one instruction at a time.
Structure: 8 waves, 2 teams, 2 obstacles
The CDNA3 computational unit has 4 SIMD items. The textbook block has 4 waves. MoonMath as a substitute runs eight waves per block, divided into two teams of 4.
The 2 teams carry out the identical approach Q*Oksoftmax, O += P*V order. They’re offset by section. Whereas one group saturates the matrix cores, the opposite group runs softmax and points load. They’re then changed, so the matrix cores are by no means idle.
There are two s_barrierfor every iteration. One sits in section handoff. One is positioned on the repeat boundary. Waits on every counter deal with the remaining synchronization.
This displays the alternating habits of matmul and Softmax in FlashAttendant-3. FA3 producer-consumer warp splits usually are not copied. In CDNA3, all reminiscence actions are already asynchronous, so there isn’t any want for a devoted producer wave.
The place is the information and why use 16x16x16?
A lot of the speedup is because of reminiscence placement. Ok The stream from HBM to LDS is double buffered and shared by all eight waves. V It stays scorching in L1 and is learn by all PV matmuls. Q And the accumulator resides in a register.
The analysis staff selected a 16×16×16 MFMA as a substitute of a 32×32×8. The throughput is identical for each geometries. Small tiles accumulate to 4 fp32 components per lane versus 16 fp32 components. Q tile.
| choice | alternative | motive |
|---|---|---|
| Variety of waves per block | 8 folks (2 teams of 4 folks every) | Plan your pipeline immediately. Share a replica of 1K |
| MFMA form | 16×16×16 bf16 | Similar throughput, decrease VGPR strain, higher energy effectivity |
| Ok association | LDS, double buffer, 32 KiB | Shared throughout all 8 waves and swapped each iteration |
| Placement of V | L1, resident, prefetch | Reread the whole PV and preserve it deliberately scorching |
| Q + accumulator | VGPR | Learn all iterations and by no means reload |
Two wins after that and the hole narrows. third Q Tiles (3Q) trigger information reuse on each load Ok and V tile. A flash decoding fashion tail KV cut up rescues stranded fractional rounds throughout the MI300X’s 304 CU. These wins are available in cascades. transfer V Third LDS launched to L1 Q The tile will likely be crammed.
benchmark
Assessments have been carried out on MI300X with BF16, head measurement 128. Every form was measured in three rounding modes. RTNE rounds to the closest even quantity. RTNA rounds to the closest worth, ranging from zero. RTZ is truncated in the direction of zero.
| Form (B, H, S, D) | spherical | Ours (ms) | AITER v3 (ms) | vs Aitor | vs. MAX |
|---|---|---|---|---|---|
| (2, 24, 8192, 128) | RTNE | 3.083 | 3.792 | 1.23 instances | 1.37× |
| (2, 24, 16384, 128) | RTNE | 11.670 | 14.691 | 1.26 instances | 1.54 instances |
| (4, 16, 16384, 128) | RTZ | 15.055 | 16.183 | 1.07× | 1.47× |
| (2, 24, 32768, 128) | RTNA | 44.440 | 52.363 | 1.18× | 1.57× |
| (1, 16, 131072, 128) | RTNE | 232.517 | 269.278 | 1.16x | 1.46 instances |
Geomeans throughout the sweep favor MoonMath. In comparison with AITER, the rating is 1.18x (RTNE), 1.15x (RTNA), and 1.08x (RTZ). In comparison with Modular MAX, the geometric imply performs 1.44x to 1.49x, and the per-shape speedup reaches 1.59x.
RTZ is AITER’s distinctive quickest mode and the hardest race. (4, 16, 16384) RTZ form moved from 0.95 to 1.07 instances. A KV cut up within the tail closed the ultimate hole.
interactive explainer
Utilization instance
The kernel is put in utilizing pip and exposes a small API. It is fired on the calling stream, so it is duplicated throughout the bigger pipeline.
import torch
import moonmath_attention as ma
# PyTorch's ROCm construct makes use of the "cuda" gadget string on AMD GPUs
q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, gadget="cuda")
ok = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, gadget="cuda")
v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, gadget="cuda")
out = ma.ahead(q, ok, v, structure="bshd")
out_rtz = ma.ahead(q, ok, v, structure="bshd", round_mode="rtz")
One particular use case is video dissemination. The staff added LiteAttendant assist and submitted a PR to popularize SGLang. Relating to the Wan2.1-T2V-1.3B-Diffuser, they’re from AITER liteattention_rocm. Finish-to-end manufacturing is 1.23x higher on MI300X with no seen high quality loss.
The BSHD structure is immediately appropriate for diffusion tensors. Cross-attention works with any KV size and with out padding.
Vital factors
- The kernel is bf16 ahead stress for MI300X, written in MIT’s HIP.
- Outperforms AITER v3 with geometric imply values of 1.18x/1.15x/1.08x for all shapes and rounding modes.
- A one-instruction asm wrapper gives opcode management when the compiler allocates registers.
- A lot of the acquire was decided by reminiscence placement. Ok for LDS, V scorching for L1, and Q for register.
- In follow, SGLang PR sped up the unfold of Wan2.1 movies by 1.23x with none high quality loss.
Please test technical details. Please be happy to observe us too Twitter Do not forget to hitch us 150k+ML subreddit and subscribe our newsletter. hold on! Are you on telegram? You can now also participate by telegram.
Must companion with us to advertise your GitHub repository, Hug Face Web page, product launch, webinar, and many others.? connect with us

