Flash Attention is a memory-optimized transformer attention mechanism that delivers a roughly 15% end-to-end wall-clock speedup without any approximation.
Transformer models are slow and memory-hungry on long sequences (time and memory complexity are essentially quadratic in sequence length). Flash Attention (paper) addresses this, achieving a 15% end-to-end wall-clock speedup on BERT-large and a 3x speedup on GPT-2.
Considering that training these large models consumes enormous amounts of energy, a 15% efficiency gain from software and hardware co-optimization is a significant improvement.
Below we explain the basic concepts behind Flash Attention and how it is implemented.
Basic Concepts of Compute and Memory
Before we dive deeper, let's review compute and memory.
What’s Compute?
- The time spent computing precise floating-point operations on the GPU (FLOPS)
What’s Reminiscence?
- The time it takes to switch a tensor inside a GPU
Ideally, the GPU would always be busy running matrix multiplications and never be memory-bound. In reality, though, compute has advanced faster than memory, and the GPU often sits idle waiting for data to be loaded. This situation is called being memory-bound. The diagram below illustrates it: matrix multiplication is the compute, and memory stores the data (think of it as a warehouse). The compute needs data to process, and memory bandwidth has to keep up with that demand.
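To make the compute-vs-memory trade-off concrete, here is a rough back-of-envelope check of whether a matmul is compute-bound or memory-bound. The A100 figures are approximate public specs used only for illustration, and `arithmetic_intensity` is a helper name of my own, not from the paper:

```python
# Rough check: is an n x n x n matmul compute-bound or memory-bound?
# A100 figures are approximate public specs, for illustration only.
PEAK_FLOPS = 312e12   # ~312 TFLOPS (FP16 tensor cores)
HBM_BW = 1.5e12       # ~1.5 TB/s HBM bandwidth

def arithmetic_intensity(n):
    """FLOPs per byte moved for an n x n x n matmul in FP16 (2 bytes/element)."""
    flops = 2 * n**3               # multiply-accumulate count
    bytes_moved = 3 * n * n * 2    # read A and B, write C, each once
    return flops / bytes_moved

machine_balance = PEAK_FLOPS / HBM_BW   # ~208 FLOPs per byte

for n in (128, 1024, 8192):
    ai = arithmetic_intensity(n)
    bound = "compute-bound" if ai > machine_balance else "memory-bound"
    print(f"n={n}: {ai:.0f} FLOPs/byte -> {bound}")
```

Small matmuls (and elementwise ops like softmax, whose intensity is O(1)) fall below the machine balance and are memory-bound; large matmuls rise above it and are compute-bound.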
What is the memory hierarchy?
The A100 GPU has 40-80 GB of High Bandwidth Memory (HBM) with 1.5-2.0 TB/sec of bandwidth, and 192 KB of on-chip SRAM on each of its 108 streaming multiprocessors, with bandwidth of roughly 19 TB/sec.
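Using those (approximate) bandwidth numbers, a quick sketch of why where data lives matters so much, here for a hypothetical 4096 x 4096 FP16 attention matrix:

```python
# Back-of-envelope: time to stream the same tensor from HBM vs on-chip SRAM.
# Bandwidth numbers are the approximate A100 figures quoted above.
HBM_BW = 1.5e12    # bytes/sec
SRAM_BW = 19e12    # bytes/sec

tensor_bytes = 4096 * 4096 * 2   # a 4096 x 4096 FP16 matrix, ~32 MB

t_hbm = tensor_bytes / HBM_BW
t_sram = tensor_bytes / SRAM_BW
print(f"HBM:  {t_hbm * 1e6:.1f} us")
print(f"SRAM: {t_sram * 1e6:.1f} us ({t_hbm / t_sram:.1f}x faster)")
```

The roughly 12x bandwidth gap is exactly what Flash Attention exploits by keeping blocks resident in SRAM.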
With the above context in mind, the self-attention architecture is memory-bound.
Looking at the attention computation, it is the softmax operation that causes the memory bottleneck.
- Quantitative evidence: as shown below, operations like softmax, dropout, and masking take most of the time compared to matrix multiplication (matmul).
Why is softmax a memory-bound operation?
The scale it operates at is the biggest bottleneck.
- N -> number of tokens
- d -> number of embedding dimensions
- When queries are multiplied by keys, the attention matrix explodes to N * N, consuming a lot of memory. For reference: d ~ 128, N ~ 128k tokens (Google Gemini: ~1 million tokens).
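To put numbers on that explosion, a tiny calculation of the N x N matrix footprint (assuming FP16, i.e. 2 bytes per score; the helper name is mine):

```python
# Size of the full N x N attention score matrix, assuming FP16 (2 bytes/score).
def attn_matrix_gib(n_tokens, bytes_per_el=2):
    return n_tokens * n_tokens * bytes_per_el / 2**30

for n in (1_024, 128_000, 1_000_000):
    print(f"N={n:>9,}: {attn_matrix_gib(n):>10,.2f} GiB per head")
```

At N ~ 128k tokens the scores alone exceed 30 GiB per head, which is why the matrix cannot simply live in fast memory.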
Below is the algorithm that implements the self-attention mechanism:
As mentioned in the section above, moving information to HBM (writing S to HBM), loading it back from HBM to the GPU to compute the softmax, and then writing the result to HBM again means moving a lot of data; these are memory-bound operations.
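A minimal numpy sketch of that standard algorithm, with comments marking where the intermediate N x N matrices would travel through HBM (the function name and random inputs are mine, for illustration):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard self-attention. Comments mark where a GPU implementation
    would materialize intermediates in HBM. Q, K, V have shape (N, d)."""
    d = Q.shape[1]
    S = Q @ K.T / np.sqrt(d)                  # N x N scores -- written to HBM
    S = S - S.max(axis=1, keepdims=True)      # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)      # row-wise softmax -- via HBM again
    return P @ V                              # final N x d output

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 3)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)   # (4, 3)
```

Every line that touches S or P reads and writes O(N*N) data, which is the traffic Flash Attention eliminates.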
Along with the diagram, the steps below explain how self-attention is computed via matrix multiplication.
Step 1
- This is simplified; in reality, embeddings are generated by adding a positional encoding to each token and feeding the result through a linear layer to produce <key, query, value>. For illustration purposes we chose a dimension of 3 (usually it is in the range 64-128), which is the input for standard transformer architectures.
Step 2
- Key -> Key' (transpose) is computed and multiplied with Query to get QK', an N * N matrix that captures the importance of each token relative to every other token. The diagram below also shows this relationship. Since these are tokens and we need the importance of each token relative to the others, a softmax is applied row-wise, normalizing each row to the range 0-1.
- This step requires a transfer to HBM, which is the most expensive operation, as we discussed. The full Flash Attention paper describes how to optimize this process.
Step 3
- Softmax(QK') * V is computed as the final output matrix, whose dimensions are the same as the input embeddings of keys, queries, and values.
- Reading the rows of the output matrix:
- Row 1 means the embedding of "this" has been modified to incorporate its relationships to the other tokens.
- Row 2 means the embedding of "is" has been modified to incorporate its relationships to the other tokens.
- And similarly for the remaining rows.
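The three steps above can be traced numerically on a toy 4-token, dimension-3 example (the random inputs stand in for the token embeddings of step 1):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 3                               # 4 tokens, dimension 3, as in the example
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Step 2: QK' is N x N -- pairwise importance of each token to every other.
S = Q @ K.T
A = np.exp(S - S.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)      # row-wise softmax; entries in [0, 1]
print(A.sum(axis=1))                      # each row sums to 1

# Step 3: softmax(QK') @ V -- same shape as the input embeddings.
O = A @ V
print(O.shape)                            # (4, 3)
```

Each output row is a weighted mix of the value vectors, i.e. the token's embedding updated with its relationships to every other token.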
The basic idea is shown in the diagram below: blocks of keys, queries, and values are moved from HBM to SRAM, and through some mathematical tricks (explained below) the computation performed there yields the exact answer, not an approximation.
This implementation achieves faster wall-clock times by working on data within a block, without sacrificing accuracy.
The algorithm behind the paper: how is Flash Attention implemented?
This is the most complicated part of the paper. Let's break the problem into sub-parts and dig deeper.
The diagram below shows how a matrix is split into blocks, and how each block is used to compute a partial softmax that is then corrected into the exact softmax.
- Initial input tokens: "This is flash attention"
- Key: 4 (tokens) x 3 (dimensions), Query: 4 (tokens) x 3 (dimensions), Value: 4 (tokens) x 3 (dimensions)
Step 0
- Assume the SRAM has 24 bytes of memory
- The SRAM is divided into 4 blocks: query, key, value, and output matrix
- Query, key, value, and output each get 6 bytes to store their information (24 bytes / 4)
- Since an embedding cannot be split across blocks, and each has dimension 3:
- Query: 6 bytes / 3 (dimensions) = 2 rows. The same holds for values, keys, and outputs
- Therefore, [M/4d] gives the size of each block. In this case the block size is 2, meaning 2 rows can be fetched into SRAM at a time
- In general, the block size is [M/4d] and the number of blocks is [N*4d/M]
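Those sizing formulas are easy to check in code (the function name is mine; M is treated as a raw byte/element budget as in the toy example):

```python
import math

def flash_block_sizes(M, N, d):
    """Block row count and number of blocks under the paper's M/4d sizing.
    M = SRAM budget, N = number of tokens, d = embedding dimension."""
    rows_per_block = max(M // (4 * d), 1)       # [M/4d]
    n_blocks = math.ceil(N / rows_per_block)    # ~ [N*4d/M]
    return rows_per_block, n_blocks

# Toy example from the text: 24-byte SRAM, 4 tokens, dimension 3.
print(flash_block_sizes(M=24, N=4, d=3))   # (2, 2): 2 rows per block, 2 blocks
```

With realistic values (say M ~ 100 KB, d = 128) the same formula gives blocks of a couple hundred rows.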
Steps 1 and 2: to compare the mechanics of Flash Attention on the memory and compute sides, I have added a table below covering steps 1 and 2.
The diagram below helps visualize the block-wise matrix multiplication used in Flash Attention.
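The block-wise computation can be sketched end to end in numpy. This is a simplified illustration of the tiling idea under my own naming, not the paper's CUDA kernel: K/V blocks stream through the outer loop, Q/output blocks through the inner loop, and running (max, sum) statistics keep the softmax exact. Scaling by 1/sqrt(d) is omitted from both versions for clarity:

```python
import numpy as np

def flash_attention_forward(Q, K, V, block=2):
    """Block-wise attention with online softmax rescaling (forward pass only).
    Only `block` rows of each matrix need to be resident at a time."""
    N, d = Q.shape
    O = np.zeros((N, d))          # unnormalized output accumulator
    m = np.full(N, -np.inf)       # running row-wise max
    l = np.zeros(N)               # running softmax denominator

    for j in range(0, N, block):              # outer loop: K/V blocks
        Kj, Vj = K[j:j + block], V[j:j + block]
        for i in range(0, N, block):          # inner loop: Q/output blocks
            s = slice(i, i + block)
            S = Q[s] @ Kj.T                   # partial attention scores
            m_new = np.maximum(m[s], S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m[s] - m_new)      # rescale previously summed stats
            l[s] = scale * l[s] + P.sum(axis=1)
            O[s] = scale[:, None] * O[s] + P @ Vj
            m[s] = m_new
    return O / l[:, None]                     # normalize once at the end

def naive(Q, K, V):
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 3)) for _ in range(3))
print(np.allclose(flash_attention_forward(Q, K, V), naive(Q, K, V)))  # True
```

The final `allclose` check is the whole point: despite never materializing the full N x N matrix, the blocked version matches standard attention exactly, up to floating-point error.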
What are the mathematical aspects of softmax?
One of the most important parts of the paper is how the softmax can still be computed exactly even when the matrix is decomposed into blocks. Below is a mathematical example showing how to combine two different blocks and recompute the softmax.
Intuition
- This is a great property of the exponential that is being leveraged here.
- Each block's softmax is computed separately, but along with the maximum value of the row, the running sum of exponentials is also stored.
- When combining with another block, we need to see how much the local max differs from the global max of the two blocks. Because of the exponential, both the numerator and the denominator are rescaled by e^(current_max - global_max) to account for this.
The logic is quite involved, so I'll explain it with an example below; once you are familiar with the example, the intuition above will be easier to follow.
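The rescaling trick can also be verified directly in a few lines of numpy, merging the (max, sum) statistics of two halves of one attention row (the specific numbers are arbitrary):

```python
import numpy as np

# Merge two partial softmax statistics and recover the exact full softmax.
x1 = np.array([1.0, 3.0])        # first block of one attention row
x2 = np.array([2.0, 5.0])        # second block of the same row

# Per-block statistics: local max and sum of exponentials.
m1, m2 = x1.max(), x2.max()
l1 = np.exp(x1 - m1).sum()
l2 = np.exp(x2 - m2).sum()

# Combine: rescale each block's sum by e^(local_max - global_max).
m = max(m1, m2)
l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2

# The merged statistics reproduce the softmax over the full row exactly.
full = np.concatenate([x1, x2])
merged_softmax = np.exp(full - m) / l
reference = np.exp(full - full.max()) / np.exp(full - full.max()).sum()
print(np.allclose(merged_softmax, reference))   # True
```

Because e^(x - m) = e^(x - m_local) * e^(m_local - m), each block only ever needs its own max and sum, yet the merge is exact.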
Let's look at the complexity analysis to see how things have changed.
Self-Attention
- Computing S = QK' produces an N * N matrix that must be written to HBM and read back from HBM.
- Therefore, HBM accesses are O(N*N + N*N) = O(N*N).
Flash Attention
- Outer loop: the keys and values are loaded from HBM, O(N*d) accesses in total.
- Inner loop: each pass loads blocks of queries and outputs from HBM, and only O(N*d/M) passes are needed.
- Total: O(N*N*d*d/M)
- In practice, d is much smaller than M: d is typically in the range 64-128, while M is on the order of 100 KB, so HBM accesses are greatly reduced.
- We started with the objective of optimizing HBM access, and the complexity analysis confirms that the paper cuts HBM accesses to a d*d/M fraction of standard attention's, without any approximation.
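Plugging in the illustrative numbers from the bullet above shows the size of that saving (treating M ~ 100 KB as an element budget, which is a simplification):

```python
# HBM-access ratio: standard attention O(N^2) vs Flash Attention O(N^2 d^2 / M).
# Illustrative values only: d in 64..128, M ~ 100 KB of SRAM.
def hbm_reduction(d, M):
    return M / (d * d)   # how many times fewer HBM accesses

for d in (64, 128):
    print(f"d={d}: ~{hbm_reduction(d, M=100_000):.0f}x fewer HBM accesses")
```

Even at d = 128 the blocked algorithm touches HBM several times less, and the gap widens as SRAM budgets grow.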
This is a complex paper, but the efficiency improvement is significant. I hope the explanation above gives you some understanding of how Flash Attention optimizes and improves performance. I haven't yet covered block-sparse Flash Attention, or how Flash Attention compares to other optimization techniques such as other forward-pass optimizations; I hope to explain that in a future post.

