
In this article, you'll learn how a transformer turns an input token into a context-aware representation and ultimately produces the probability of the next token.

Topics covered include:

  • How input text is prepared through tokenization, embedding, and positional encoding
  • What multi-head attention and feedforward networks contribute within each layer
  • How the final projection and softmax produce the probability of the next token

Let’s begin the journey.

The Token Journey: What's Really Happening Inside a Transformer
Image by editor

The start of the journey

A large language model (LLM) is based on a transformer architecture, a complex deep neural network that takes a set of token embeddings as input. After a deep process that looks like a parade of many stacked attention and feedforward transformations, the output is a probability distribution indicating which token should be produced next as part of the model's response. But how can this process from input to output be explained for a single token in the input sequence?

In this article, you'll learn what happens inside the transformer model (the architecture behind LLMs) at the token level. In other words, you'll see how an input token, or part of an input text sequence, is turned into generated text output, and the reasons behind the changes and transformations that happen inside the transformer.

The explanation of this process is based on the diagram above, which shows the general transformer architecture and how information flows and evolves through it.

Transformer input: from raw input text to input embeddings

Before going into the depths of the transformer model, some transformations have already been applied to the text input, mainly to ensure that it is represented in a form the internal layers of the transformer can fully understand.

Tokenization

The tokenizer is an algorithmic component that typically works in tandem with the LLM's transformer model. It takes a raw text sequence, such as a user prompt, and breaks it into individual tokens (often subword units or bytes, sometimes even whole words), with each token mapped to an integer identifier (ID).
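
To make this concrete, here is a minimal, purely illustrative Python sketch with a hypothetical toy vocabulary. Real tokenizers (e.g., BPE-based ones) learn subword vocabularies from data rather than splitting on whitespace, so treat this only as a picture of the text-to-ID mapping.

```python
# Toy vocabulary mapping each known word to an integer ID (illustrative only).
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each piece to its integer ID."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```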

Embedding the token

A trained embedding table E has shape |V| × d (vocabulary size by embedding dimension). Looking up the identifiers of a sequence of length n produces an embedding matrix X of shape n × d. That is, each token identifier maps to a d-dimensional embedding vector that forms one row of X. Two embedding vectors are close to each other if they correspond to tokens with similar meanings (e.g., king and emperor). Importantly, at this stage, each token embedding conveys the semantic and lexical information of that single token without incorporating information about the rest of the sequence (at least not yet).
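
Continuing the toy example, here is a minimal NumPy sketch of the lookup step; the embedding table below is random noise standing in for a trained one, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                      # toy vocabulary size |V| and embedding dimension d
E = rng.normal(size=(V, d))      # "trained" embedding table (random stand-in here)

token_ids = [0, 1, 2, 3, 0, 4]   # IDs produced by the tokenizer above
X = E[token_ids]                 # row lookup: one d-dimensional vector per token
print(X.shape)                   # (6, 8), i.e. n x d
```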

Positional encoding

Before fully entering the core of the transformer, information about the position of each token within the sequence must be injected into its embedding vector, i.e., into each row of the embedding matrix X. This is also called positional encoding and is usually done with trigonometric functions such as sine and cosine, although there are also approaches based on learned positional embeddings. The positional encoding vector is added to the token's embedding vector e_t as follows:

\[
x_t^{(0)} = e_t + p_{\text{pos}}
\]

where \(p_{\text{pos}}\) is the positional encoding vector for that token's position.
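
Below is a minimal NumPy sketch of the classic sine/cosine positional encoding, added to a stand-in embedding matrix X; a learned positional embedding scheme would simply replace the function with a second lookup table. The sizes are toy values.

```python
import numpy as np

def sinusoidal_positions(n: int, d: int) -> np.ndarray:
    """Classic sine/cosine positional encodings, shape n x d (d assumed even)."""
    pos = np.arange(n)[:, None]                 # positions 0 .. n-1
    i = np.arange(d // 2)[None, :]              # index of each sin/cos pair
    angles = pos / (10000 ** (2 * i / d))
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    P[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return P

n, d = 6, 8
X = np.random.default_rng(0).normal(size=(n, d))   # token embeddings (stand-in)
X0 = X + sinusoidal_positions(n, d)                # x_t^(0) = e_t + p_pos for each row
```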

Now let's go deeper into the transformer and see what's going on inside.

Deep inside transformers: from input embeddings to output probabilities

Let's explain what happens as each "enriched" single-token embedding vector passes through one transformer layer, and then zoom out to explain what happens across the entire stack of layers.

The formula

\[
h_t^{(0)} = x_t^{(0)}
\]

is used to denote the representation of a token at layer 0 (the first layer). More generally, \(h_t^{(l)}\) denotes the representation of token t at layer l.

Multi-head attention

The first major component inside each replicated transformer layer is multi-head attention. This is probably the most influential component in the entire architecture when it comes to identifying, and incorporating into each token's representation, meaningful information about its role in the overall sequence and its relationship to other tokens in the text, such as syntactic, semantic, or other kinds of linguistic relationships. The multiple heads of this so-called attention mechanism each specialize in simultaneously capturing different linguistic aspects and patterns of the token and the sequence it belongs to.
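
The following is a minimal NumPy sketch of a single attention head over a toy sequence. The projection matrices are random stand-ins for learned weights; a real multi-head block runs several such heads in parallel and concatenates (and re-projects) their outputs.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One attention head over the whole sequence H (shape n x d)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv             # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token attends to the others
    return softmax(scores, axis=-1) @ V          # context-weighted mix of value vectors

rng = np.random.default_rng(0)
n, d, d_head = 6, 8, 4
H = rng.normal(size=(n, d))                      # token representations h_t^(l) (stand-in)
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
print(self_attention(H, Wq, Wk, Wv).shape)       # (6, 4): one head's output per token
```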

The result of passing the token representation h_t^{(l)} (with positional information already injected, remember!) through this multi-head attention within the layer is a context-enriched, or context-aware, token representation. Thanks to residual connections and layer normalization within the transformer layers, the newly generated vector becomes a stable blend of its own previous representation and the multi-head attention output. This helps keep the transformations that are applied iteratively across layers consistent.

Feedforward neural network

Next comes a comparatively simple feedforward neural network (FFN) layer. For example, these can be token-wise multilayer perceptrons (MLPs), whose goal is to further transform and refine the features of the tokens being learned.

The main difference between the attention stage and this stage is that while attention mixes and incorporates context information from across all tokens into each token representation, the FFN step is applied to each token independently, refining the already integrated context patterns and distilling useful "knowledge" from them. These layers are also complemented by residual connections and layer normalization, and this process results in an updated representation at the end of the transformer layer, h_t^{(l+1)}, which becomes the input to the next transformer layer and thus to another multi-head attention block.
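
Here is a minimal NumPy sketch of how one such layer composes the attention output, the token-wise FFN, residual connections, and layer normalization. The weights are random stand-ins, the attention output is taken as given (e.g., from a sketch like the one above), and real implementations also learn scale and shift parameters inside layer normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean / unit variance (no learned scale here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Token-wise MLP with a ReLU hidden layer, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_layer(H, attn_out, ffn_params):
    """Residual + norm around attention, then residual + norm around the FFN."""
    H = layer_norm(H + attn_out)                      # mix in the attention output
    H = layer_norm(H + feed_forward(H, *ffn_params))  # refine each token independently
    return H                                          # h_t^(l+1) for every token

rng = np.random.default_rng(0)
n, d = 6, 8
H = rng.normal(size=(n, d))                           # h_t^(l) for each of the n tokens
attn_out = rng.normal(size=(n, d))                    # stand-in for the multi-head attention output
ffn_params = (rng.normal(size=(d, 4 * d)), np.zeros(4 * d),
              rng.normal(size=(4 * d, d)), np.zeros(d))
print(transformer_layer(H, attn_out, ffn_params).shape)   # (6, 8): ready for the next layer
```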

This entire process is repeated for as many stacked layers as the architecture defines, gradually enriching the tokens with higher-level, abstract, and long-range linguistic information encoded behind the seemingly indecipherable numbers.

Final destination

So what happens at the end? At the top of the stack, after passing through the last replicated transformer layer, we get the final token representation h_{t*}^{(L)} (where t* indicates the position whose next token is being predicted). This is fed into a linear output layer followed by a softmax.

The linear layer produces unnormalized scores called logits, and the softmax converts these logits into probabilities for the next token.

Calculating the logits:

\[
\text{logits}_j = W_{\text{vocab}, j} \cdot h_{t^*}^{(L)} + b_j
\]

Computing normalized probabilities by applying the softmax:

\[
\text{softmax}(\text{logits})_j = \frac{\exp(\text{logits}_j)}{\sum_{k} \exp(\text{logits}_k)}
\]

Using the softmax output as the probability of the next token:

\[
P(\text{token} = j) = \text{softmax}(\text{logits})_j
\]

These probabilities are computed for all possible tokens in the vocabulary, and the next token generated by the LLM is then chosen from them. The most probable token is often selected (greedy decoding), but sampling-based decoding strategies are also common.
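
To make this last step concrete, here is a minimal NumPy sketch of the projection, softmax, and token selection; the final hidden vector and output weights are random stand-ins over a toy vocabulary, shown only to illustrate the shapes and the greedy-versus-sampling choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 6                                  # hidden size and toy vocabulary size
h_final = rng.normal(size=d)                 # h_{t*}^(L): last-layer vector at the predicted position
W_vocab = rng.normal(size=(V, d))            # output projection (random stand-in)
b = np.zeros(V)

logits = W_vocab @ h_final + b               # one unnormalized score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax -> probability of each candidate next token

greedy_next = int(np.argmax(probs))          # pick the most probable token...
sampled_next = int(rng.choice(V, p=probs))   # ...or sample from the distribution
print(greedy_next, sampled_next, probs.round(3))
```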

End of the journey

This article provides a general understanding, with a light level of technical detail, of what happens to the text given to an LLM (the most prominent model based on the transformer architecture) as it moves through the transformer: how this text is processed and transformed within the model at the token level, ultimately producing the model's output, the next word.

We hope you enjoyed this journey with us, and we look forward to taking you on another one in the near future.
