This is a complete transformer, the same machinery behind ChatGPT, shrunk until every single number fits on screen. It reads four words and predicts what comes next.
Teal cells are the model's weights and word vectors, and you can edit them. Grey cells are computed from them and update live. Hit Randomize to scramble the weights and watch the prediction lurch: with untrained weights, the guess is meaningless. Turning random weights into good ones is exactly what training does, and that's the one thing this page leaves out.
Each word in the 6-word dictionary owns a row of 3 numbers, its embedding. We look up the four input words, then add a second vector that encodes position (1st word, 2nd word…), so the model knows the order. The sum, X = word + position, is what the rest of the network actually works on.
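As a rough sketch of that lookup-and-add step, here it is in plain Python. The six words and every number below are made-up stand-ins, not the values shown on the page; only the shapes (6 words × 3 numbers, 4 positions × 3 numbers) come from the text.

```python
# Hypothetical dictionary and values, for illustration only.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

# One row of 3 numbers per dictionary word (the word embeddings).
word_emb = [
    [0.1, 0.2, 0.3],    # the
    [0.4, -0.1, 0.0],   # cat
    [-0.3, 0.5, 0.2],   # sat
    [0.0, 0.1, -0.2],   # on
    [0.2, 0.2, 0.2],    # mat
    [-0.1, 0.0, 0.4],   # dog
]

# One row of 3 numbers per input position (1st word, 2nd word, ...).
pos_emb = [
    [0.0, 0.0, 0.1],
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0],
    [0.1, 0.1, 0.0],
]

def embed(words):
    """Return X = word + position, one row of 3 numbers per input word."""
    rows = []
    for position, word in enumerate(words):
        w = word_emb[vocab.index(word)]   # look up the word's row
        p = pos_emb[position]             # look up the position's row
        rows.append([wi + pi for wi, pi in zip(w, p)])
    return rows

X = embed(["the", "cat", "sat", "on"])
```

Because the position vector is added in, the same word placed at two different positions ends up with two different rows in X.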
Attention is a soft lookup. We multiply X by three weight matrices to get three different views of each token: Q (what each token searches with), K (what it can be matched on), and V (what it hands over if chosen). The names are a story laid over the math. Underneath, it's just three linear projections: Q = X·Wq, K = X·Wk, V = X·Wv.
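The three projections can be sketched with a small hand-written matrix multiply. The helper name matmul and all weight values are assumptions for illustration; the page's own numbers will differ.

```python
def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix; both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical X (4 tokens x 3 numbers) and 3x3 weight matrices.
X = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
Wq = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]]
Wk = [[0.2, 0.0, 0.7], [-0.6, 0.4, 0.1], [0.3, 0.3, -0.2]]
Wv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Same input, three different linear views of it.
Q = matmul(X, Wq)
K = matmul(X, Wk)
V = matmul(X, Wv)
```

Each of Q, K and V has the same shape as X: one row of 3 numbers per token.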
Each query is compared with every key by dot product, then divided by √d (here √3, since each vector has 3 numbers) to keep the numbers tame: scores = (Q · Kᵀ) / √d. We then mask the upper triangle: a word may only look at itself and the words before it, never ahead. Masked cells are set to −∞ so they vanish in the next step.
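A minimal sketch of the scaled, masked scores, assuming Q and K are lists of rows with d numbers each. The function name masked_scores and the example values are illustrative, not taken from the page.

```python
import math

def masked_scores(Q, K):
    """Return (Q · Kᵀ) / √d with every cell above the diagonal set to -inf."""
    d = len(Q[0])
    scores = []
    for i, q in enumerate(Q):
        row = []
        for j, k in enumerate(K):
            if j > i:
                # Token i may not look ahead at token j.
                row.append(-math.inf)
            else:
                dot = sum(a * b for a, b in zip(q, k))
                row.append(dot / math.sqrt(d))
        scores.append(row)
    return scores

# Hypothetical queries and keys for 4 tokens.
Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
scores = masked_scores(Q, K)
```

Row i of the result holds token i's scores against tokens 0 through i, followed by −∞ for every later token.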
Softmax turns each row of scores into attention weights that sum to 1: the fraction of attention each token pays to itself and to each earlier token. Masked cells come out as exactly 0. weights = softmax(row).
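One way to write that softmax is shown below. Subtracting the row maximum before exponentiating is a common numerical-stability habit and an assumption here, not something the page states. The example scores are made up.

```python
import math

def softmax(row):
    """Turn a row of scores into non-negative weights that sum to 1."""
    m = max(row)                          # finite, since the diagonal is never masked
    exps = [math.exp(x - m) for x in row]  # exp(-inf) is 0.0, so masked cells vanish
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = -math.inf
# Hypothetical masked scores for 4 tokens.
scores = [
    [0.5, NEG_INF, NEG_INF, NEG_INF],
    [0.2, 0.2, NEG_INF, NEG_INF],
    [1.0, 0.0, -1.0, NEG_INF],
    [0.0, 0.0, 0.0, 0.0],
]
weights = [softmax(row) for row in scores]
```

The first token can only attend to itself, so its row is always 1 followed by zeros, whatever the weights are.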
Each token builds its new, context-aware vector by taking a weighted average of the Value vectors, using the attention weights from the softmax step: attn_out = weights · V. This is the moment information actually moves between positions.
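The weighted average is a single matrix product. A small sketch with invented weights and Value vectors (the helper name matmul is an assumption):

```python
def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix; both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical attention weights: each row sums to 1, zeros above the diagonal.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
# Hypothetical Value vectors, one row per token.
V = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Row i of attn_out blends V[0..i] in the proportions given by weights[i].
attn_out = matmul(weights, V)
```

With these numbers the second token's output is the plain average of the first two Value vectors, because its weights are 0.5 and 0.5.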
Now each token is processed on its own by an ordinary little neural network: widen from 3 to 12, apply ReLU (negatives become 0), then narrow back to 3. mlp_out = ReLU(attn_out·W1 + b1) · W2 + b2. This is the classic "neurons and layers" network, and it lives inside every transformer block.
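The widen, ReLU, narrow pattern might look like this in plain Python. The weights are random stand-ins from a seeded generator, so the outputs mean nothing; the helper names (relu, linear, mlp, rand_matrix) are assumptions, and only the 3 → 12 → 3 shapes come from the text.

```python
import random

def relu(v):
    """Replace every negative entry with 0."""
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    """Compute v·W + b for one row vector v; W is a list of rows."""
    return [sum(x * w for x, w in zip(v, col)) + bias
            for col, bias in zip(zip(*W), b)]

def mlp(rows, W1, b1, W2, b2):
    """Apply ReLU(row·W1 + b1)·W2 + b2 to each token's row independently."""
    return [linear(relu(linear(row, W1, b1)), W2, b2) for row in rows]

rng = random.Random(0)  # fixed seed so the stand-in weights are repeatable

def rand_matrix(n_rows, n_cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
            for _ in range(n_rows)]

W1, b1 = rand_matrix(3, 12), [0.0] * 12   # widen: 3 -> 12
W2, b2 = rand_matrix(12, 3), [0.0] * 3    # narrow: 12 -> 3

# Hypothetical attention output for 4 tokens.
attn_out = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.5, 0.5, 0.5]]
mlp_out = mlp(attn_out, W1, b1, W2, b2)
```

Note that mlp never mixes rows: each token passes through the same weights by itself, which is why attention has to move information between positions first.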
Finally we project each token's 3 numbers up to 6 logits, one raw score per dictionary word, using the unembedding matrix: logits = mlp_out · unembed. Softmax turns each row into a probability distribution over the next word.
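A sketch of the final projection and softmax, again with invented numbers. The 3 × 6 shape of unembed follows from the text (3 numbers in, 6 logits out); the values and helper names are assumptions.

```python
import math

def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix; both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical MLP output (4 tokens x 3) and unembedding matrix (3 x 6).
mlp_out = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
unembed = [
    [0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, -1.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
]

logits = matmul(mlp_out, unembed)          # 4 rows of 6 raw scores
probs = [softmax(row) for row in logits]   # 4 probability distributions
```

Each row of probs is that position's prediction for the word that follows it.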
Reading off the row for a chosen position gives the model's probability for each possible next word. The tallest bar is its guess.
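Picking the tallest bar is just an argmax over one row of probabilities. In this sketch the dictionary words and the probabilities are hypothetical placeholders.

```python
# Hypothetical 6-word dictionary and one row of next-word probabilities.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
row = [0.05, 0.10, 0.15, 0.10, 0.55, 0.05]

# Index of the largest probability, i.e. the tallest bar.
best = max(range(len(vocab)), key=lambda i: row[i])
guess = vocab[best]

# Pair each word with its probability, most likely first.
ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
```

Here guess is "mat", the word holding the 0.55 entry.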