To understand the changes made here, we first need to talk about the key-value cache. Inside the transformer, there are three vectors (key, value, query) that are essential for attention to work. At a high level, attention is about passing important information about the previous tokens to the current token so that it can predict the next token. In a single-headed self-attention example, we multiply the query vector of the current token with the key vectors of the previous tokens and normalize the resulting matrix (we call the resulting matrix the attention pattern). We then multiply the value vectors with the attention pattern to get an update for each token. This update is added to the embedding of the current token to give it the context it needs to determine what comes next.
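The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the weight names `W_q`, `W_k`, `W_v`, the scaling by the square root of the key dimension, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """Causal single-head self-attention over token embeddings X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = X.shape[0]
    # Mask out future positions so token i only sees tokens 0..i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    pattern = softmax(scores)   # the attention pattern; each row sums to 1
    update = pattern @ V        # weighted mix of value vectors
    return X + update, pattern  # the update is added to the embeddings

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, pattern = single_head_attention(X, W_q, W_k, W_v)
```

Each row of `pattern` says how much the corresponding token draws from itself and from every earlier token.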
As we create attention patterns for each new token we generate, our queries tend to change, but the keys and values of the earlier tokens remain constant. Therefore, current architectures try to reduce computation time by caching the key and value vectors generated in each round of attention. This cache is called the key-value (KV) cache.
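To make the caching idea concrete, here is a small sketch of token-by-token decoding with a KV cache, checked against recomputing attention over the whole sequence. The `KVCache` class and `decode_step` function are hypothetical names invented for this example, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Hypothetical minimal cache: stores one key and one value row per token."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])

def decode_step(x, W_q, W_k, W_v, cache):
    """Attend from the newest token x to all cached tokens plus itself."""
    q = x @ W_q
    cache.append(x @ W_k, x @ W_v)  # only the new token's key/value are computed
    scores = cache.K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.V

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Incremental decoding: keys and values are computed once per token and reused.
cache = KVCache(d)
cached_outputs = np.stack([decode_step(x, W_q, W_k, W_v, cache) for x in X])

# Reference: recompute causal attention over the full sequence at once.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1), -np.inf, scores)
full_outputs = softmax(scores) @ V
```

Both paths give the same attention outputs, but the cached path never recomputes a key or value. The cost is memory: the cache grows by one row per token, per layer, which is the growth YOCO sets out to reduce.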
Although architectures such as encoder-only and encoder-decoder transformer models have had success, the authors argue that the autoregression described above, and the speed it gives the model, is why decoder-only models are the most commonly used today.
To understand the YOCO architecture, we first need to understand how the layers are set up.
In the first half of the model, we use one type of attention to generate the vectors needed to fill the KV cache. In the second half, the layers rely on that KV cache exclusively for their key and value vectors, and generate the embeddings for the output tokens.
This new architecture requires two types of attention: efficient self-attention and cross-attention, each of which is explained below.
Efficient Self-Attention (ESA) is designed to achieve constant inference memory. In other words, we want the cache complexity to depend on the number of layers in a block, not on the input length. In the paper's formulation, the authors abstract away ESA, but the rest of the self-decoder stays the same.
Let's go through the self-decoder equations step by step: Y^l = ESA(LN(X^l)) + X^l, followed by X^{l+1} = SwiGLU(LN(Y^l)) + Y^l. X^l is the token embedding and Y^l is the intermediate variable used to generate the next token embedding X^{l+1}. In these equations, ESA is Efficient Self-Attention and LN is the layer normalization function, which here is always Root Mean Square Norm (RMSNorm). Finally, there is SwiGLU, which is defined as SwiGLU(X) = (swish(X W_G) ⊙ X W_1) W_2.
Here swish(z) = z * sigmoid(z) is applied to X W_G, where W_G is a trainable parameter. We then take the element-wise product (Hadamard product) between that result and X W_1, and multiply the whole product by W_2. The goal of SwiGLU is to have an activation function that conditionally passes different amounts of information through the layer to the next token.
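A compact sketch of one self-decoder layer may help tie these pieces together. It assumes a pre-norm residual layout (normalize, apply the sub-layer, add the input back) and treats ESA as a pluggable function. The parameter names, the toy sizes, and the `placeholder_esa` stand-in are all assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: divide by the root mean square of the features, then rescale.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swish(z):
    # swish(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu(x, W_g, W_1, W_2):
    # The gate swish(x W_g) scales x W_1 element-wise before the projection W_2.
    return (swish(x @ W_g) * (x @ W_1)) @ W_2

def self_decoder_layer(X, esa, params):
    """One layer: attention sub-layer, then SwiGLU sub-layer, each with a residual."""
    Y = esa(rms_norm(X, params["g_attn"])) + X
    return swiglu(rms_norm(Y, params["g_ffn"]),
                  params["W_g"], params["W_1"], params["W_2"]) + Y

def placeholder_esa(h):
    # Hypothetical stand-in so the example runs; a real ESA mixes information across tokens.
    return h

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 4, 8, 16
X = rng.normal(size=(n_tokens, d_model))
params = {
    "g_attn": np.ones(d_model),
    "g_ffn": np.ones(d_model),
    "W_g": rng.normal(size=(d_model, d_hidden)) * 0.1,
    "W_1": rng.normal(size=(d_model, d_hidden)) * 0.1,
    "W_2": rng.normal(size=(d_hidden, d_model)) * 0.1,
}
X_next = self_decoder_layer(X, placeholder_esa, params)
```

Because ESA is passed in as a function, either of the two candidate implementations could be swapped in without touching the rest of the layer.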
Now that we know how the self-decoder works, let's discuss the two ways the authors considered implementing ESA.
First, they considered something called gated retention. While retention and self-attention are indeed very similar, the authors of the paper “Retentive Network: A Successor to Transformer for Large Language Models” say that the key difference is in the activation function. Retention removes softmax, which allows for a recurrent formulation. They use this recurrent formulation, along with parallelizability, to make it more memory efficient.
Digging into the mathematical details:
We have the usual matrices Q, K, and V, each obtained by multiplying the input by a learnable weight associated with that matrix. We then take the Hadamard product between the weighted matrices and Θ. The purpose of using Θ is to create an exponential decay, while the D matrix is then used to help with causal masking (preventing future tokens from interacting with the current token) and activation.
Gated retention differs from retention in its γ values: here we use the matrix Wγ to make ESA data-driven.
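The sketch below illustrates why dropping softmax allows a recurrent formulation. It is a simplified, single-head version under stated assumptions: the Θ term is omitted, γ is computed as a sigmoid of X Wγ raised to the power 1/τ with an assumed temperature `tau`, and D[n, m] is taken to be the product of the gates between positions m+1 and n (zero for future positions). All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gates(X, W_gamma, tau=16.0):
    # One data-dependent decay value in (0, 1) per token; tau is an assumed temperature.
    return sigmoid(X @ W_gamma)[:, 0] ** (1.0 / tau)

def decay_matrix(gamma):
    # D[n, m] = gamma[m+1] * ... * gamma[n] for n >= m, and 0 for future positions.
    b = np.cumsum(np.log(gamma))
    return np.tril(np.exp(b[:, None] - b[None, :]))

def gated_retention_parallel(Q, K, V, gamma):
    # Parallel form: one masked, decayed score matrix, with no softmax.
    return (Q @ K.T * decay_matrix(gamma)) @ V

def gated_retention_recurrent(Q, K, V, gamma):
    # Recurrent form: a fixed-size state S replaces a cache that grows with length.
    S = np.zeros((K.shape[1], V.shape[1]))
    outputs = []
    for q, k, v, g in zip(Q, K, V, gamma):
        S = g * S + np.outer(k, v)
        outputs.append(q @ S)
    return np.stack(outputs)

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
X = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W_gamma = rng.normal(size=(d, 1)) * 0.1
Q, K, V = X @ W_q, X @ W_k, X @ W_v
gamma = gates(X, W_gamma)
D = decay_matrix(gamma)
parallel = gated_retention_parallel(Q, K, V, gamma)
recurrent = gated_retention_recurrent(Q, K, V, gamma)
```

The two forms produce the same outputs. During inference, the recurrent form only keeps the state `S`, whose size does not depend on how many tokens have been processed.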
Sliding window ESA introduces the idea of limiting the number of tokens that the attention window should pay attention to. In regular self-attention, all previous tokens are attended to in some way (even if their weight is 0). In sliding window ESA, we instead choose a constant C that limits the size of these matrices, which means we can reduce the KV cache to constant complexity at inference time.
Let's look at the math again:
The matrices are scaled by their corresponding weights. Then, we compute the heads similarly to how multi-head attention is computed, where B acts both as a causal mask and as a way to ensure that only the last C tokens are attended to.
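As a rough illustration of the role of B, the following single-head sketch builds an additive mask that is 0 for the last C positions (including the current token) and negative infinity everywhere else. The function names and toy sizes are assumptions for the example, and a multi-head version would repeat this per head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_mask(n, C):
    # B[i, j] = 0 if i - C < j <= i (causal and within the window), else -inf.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = (j <= i) & (j > i - C)
    return np.where(allowed, 0.0, -np.inf)

def sliding_window_attention(X, W_q, W_k, W_v, C):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + sliding_window_mask(X.shape[0], C)
    pattern = softmax(scores)
    return pattern @ V, pattern

rng = np.random.default_rng(0)
n_tokens, d, C = 6, 8, 3
X = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, pattern = sliding_window_attention(X, W_q, W_k, W_v, C)

# Number of tokens each position actually attends to.
attended = (pattern > 0).sum(axis=1)
```

Since no position ever looks further back than C tokens, a decoder only needs to keep the most recent C keys and values, which is what makes the cache size constant with respect to sequence length.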

