A transformer with nothing hidden

Watch a model guess the next word, one number at a time

This is a complete transformer, the same machinery behind ChatGPT, shrunk until every single number fits on screen. It reads four words and predicts what comes next.

Teal cells are the model's weights and word vectors: you can edit them. Grey cells are computed and update live. Hit Randomize to scramble the weights and watch the prediction lurch: with untrained weights, the guess is meaningless. Turning those weights into good ones is exactly what training does, and that's the one thing this page leaves out.

input: We learnt about the → predicted next word: …

editable: weights & word vectors · computed live · the prediction

Turn words into vectors

Each word in the 6-word dictionary owns a row of 3 numbers, its embedding. We look up the four input words, then add a second vector that encodes position (1st word, 2nd word…), so the model knows the order. The sum, X = word + position, is what the rest of the network actually works on.

embedding_table: one row per dictionary word (6 × 3)

position_table: one row per slot (first 4 of 8 shown)

X = embedding + position: the four input tokens, ready for attention (4 × 3)
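The lookup-and-add step can be sketched in a few lines of NumPy. This is an illustration only: the weights are random stand-ins rather than the numbers on the page, and token_ids is a hypothetical choice of dictionary rows for the four input words.

```python
import numpy as np

rng = np.random.default_rng(0)  # random stand-in weights, not the page's values

vocab_size, d_model, max_positions = 6, 3, 8
embedding_table = rng.normal(size=(vocab_size, d_model))    # one row per dictionary word
position_table = rng.normal(size=(max_positions, d_model))  # one row per slot

# Hypothetical dictionary indices for the four input words.
token_ids = np.array([0, 1, 2, 3])

word_vectors = embedding_table[token_ids]            # row lookup, shape (4, 3)
position_vectors = position_table[:len(token_ids)]   # first four slots, shape (4, 3)
X = word_vectors + position_vectors                  # element-wise sum, shape (4, 3)

print(X.shape)  # (4, 3)
```

Nothing is multiplied here: an embedding is a row lookup, and position is added element by element.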

Project into Query, Key, Value

Attention is a soft lookup. We multiply X by three weight matrices to get three different views of each token: Q (what each token searches with), K (what it can be matched on), and V (what it hands over if chosen). The names are a story laid over the math. Underneath, it's just three linear projections: Q = X·Wq, K = X·Wk, V = X·Wv.

Wq (3 × 3)

Wk (3 × 3)

Wv (3 × 3)

Q (4 × 3)

K (4 × 3)

V (4 × 3)
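As a rough sketch, the three projections are three matrix multiplications of the same X. The values below are random placeholders, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # placeholder values, not the page's weights
X = rng.normal(size=(4, 3))      # stand-in for the four input tokens
Wq = rng.normal(size=(3, 3))
Wk = rng.normal(size=(3, 3))
Wv = rng.normal(size=(3, 3))

Q = X @ Wq   # what each token searches with
K = X @ Wk   # what each token can be matched on
V = X @ Wv   # what each token hands over if chosen

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Each row of Q, K and V depends only on the matching row of X, so no information has moved between tokens yet.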

Score every token against every other

Each query is dot-product-compared with every key, then divided by √d, where d = 3 is the width of each vector, to keep the numbers tame: scores = (Q · Kᵀ) / √d. We then mask the upper triangle, because a word may only look at itself and the words before it, never ahead. Masked cells are set to −∞ so they vanish in the next step.

scaled, masked scores. Row = the token doing the looking, column = the token looked at
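One way to write the scaling and the causal mask in NumPy is shown below. Q and K are random placeholders, and building the mask with np.triu is just one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(0)   # placeholder values, not the page's numbers
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
d = Q.shape[1]                   # vector width, 3 here

scores = (Q @ K.T) / np.sqrt(d)  # (4, 4): row = token looking, column = token looked at

# True strictly above the diagonal, i.e. wherever a token would look ahead.
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

print(np.round(masked_scores, 2))
```

The diagonal stays finite, so every row keeps at least one usable score: each token can always attend to itself.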

Softmax turns each row of scores into attention weights that sum to 1: the fraction of attention each token pays to itself and to each earlier token. Masked cells come out as exactly 0, because exp(−∞) = 0. weights = softmax(row).

attention weights (each row sums to 1)
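A minimal row-wise softmax might look like the following. The score table is made up for illustration, and subtracting the row maximum is a common numerical-stability habit, not something the page states.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax over each row; cells holding -inf come out as exactly 0."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # stability shift, result unchanged
    exp = np.exp(shifted)                                 # exp(-inf) == 0
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up masked scores (upper triangle already set to -inf).
masked_scores = np.array([
    [0.5, -np.inf, -np.inf, -np.inf],
    [0.1, 0.3, -np.inf, -np.inf],
    [-0.2, 0.0, 0.4, -np.inf],
    [0.7, -0.1, 0.2, 0.0],
])

weights = softmax_rows(masked_scores)
print(np.round(weights, 3))
```

The first token has nothing earlier to look at, so its row puts all of its weight on itself.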

Blend the values

Each token builds its new, context-aware vector by taking a weighted average of the Value vectors, using the attention weights from the previous step: attn_out = weights · V. This is the moment information actually moves between positions.

attn_out: context-mixed tokens (4 × 3)
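A small worked example makes the blending concrete. The weights and Value vectors below are hand-picked, hypothetical numbers chosen so the averages are easy to check by eye.

```python
import numpy as np

# Hand-picked attention weights: lower-triangular, each row sums to 1.
weights = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.20, 0.30, 0.50, 0.00],
    [0.25, 0.25, 0.25, 0.25],
])

# Hand-picked Value vectors, one row per token.
V = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])

attn_out = weights @ V  # each output row is a weighted average of the rows of V
print(attn_out)
# [[1.  0.  0. ]
#  [0.5 0.5 0. ]
#  [0.2 0.3 0.5]
#  [0.5 0.5 0.5]]
```

The second row, for instance, is an even mix of the first two Value vectors, because that token splits its attention 50/50 between them.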

Think it over (the feed-forward network)

Now each token is processed on its own by an ordinary little neural network: widen from 3 to 12, apply ReLU (negatives become 0), then narrow back to 3. mlp_out = ReLU(attn_out·W1 + b1) · W2 + b2. This is the classic "neurons and layers" network, and one lives inside every transformer block.

hidden layer after ReLU, h (4 × 12), the 12 "neurons" per token

mlp_out: back down to 3 per token (4 × 3)
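The feed-forward step might be sketched like this, again with random placeholder weights and biases rather than the page's values.

```python
import numpy as np

rng = np.random.default_rng(0)           # placeholder values, not the page's weights
attn_out = rng.normal(size=(4, 3))       # stand-in for the context-mixed tokens

W1 = rng.normal(size=(3, 12))            # widen 3 -> 12
b1 = rng.normal(size=12)
W2 = rng.normal(size=(12, 3))            # narrow 12 -> 3
b2 = rng.normal(size=3)

h = np.maximum(0.0, attn_out @ W1 + b1)  # ReLU: negatives become 0, shape (4, 12)
mlp_out = h @ W2 + b2                    # shape (4, 3)

print(h.shape, mlp_out.shape)  # (4, 12) (4, 3)
```

The same W1, b1, W2 and b2 are applied to every row, so each token is transformed independently of the others.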

Score the whole dictionary

Finally we project each token's 3 numbers up to 6 logits, one raw score per dictionary word, using the unembedding matrix: logits = mlp_out · unembed. Softmax turns each row into a probability distribution over the next word.

unembed (3 → 6)

next-word probabilities, one row per input position (each row sums to 1)
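A sketch of the final projection and softmax follows, with random placeholders standing in for mlp_out and the unembedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)       # placeholder values, not the page's weights
mlp_out = rng.normal(size=(4, 3))    # stand-in for the feed-forward output
unembed = rng.normal(size=(3, 6))    # 3 numbers -> 6 logits

logits = mlp_out @ unembed           # (4, 6): one raw score per dictionary word

# Row-wise softmax; subtracting the row max leaves the result unchanged.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(probs.shape)  # (4, 6)
```

Softmax preserves the ordering within a row, so the word with the largest logit is also the word with the largest probability.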

The prediction

Read across the row for a chosen position to see the model's probability for each possible next word. The tallest bar is its guess.
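Picking the guess amounts to taking the largest entry in one row. The probability table below is invented for illustration, and the guess is reported as a dictionary index because the page's six dictionary words are not listed here.

```python
import numpy as np

# Invented next-word probabilities: one row per input position, one column per dictionary word.
probs = np.array([
    [0.40, 0.10, 0.10, 0.20, 0.10, 0.10],
    [0.10, 0.30, 0.20, 0.20, 0.10, 0.10],
    [0.15, 0.15, 0.10, 0.10, 0.30, 0.20],
    [0.05, 0.10, 0.50, 0.15, 0.10, 0.10],
])

position = 3                             # the last input word
guess = int(np.argmax(probs[position]))  # index of the tallest bar

print(guess, probs[position, guess])  # 2 0.5
```

Only the last row predicts a word the model has not already been shown; the earlier rows are the model's guesses for positions whose next word is already in the input.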

Why is the guess usually nonsense? Because these weights are random. A real model's weights are slowly shaped by training on billions of words until predictions like this become accurate. Everything above is the forward pass, how a prediction is computed. Training, the part that makes the numbers good, is the bigger story this page deliberately leaves out.