
How do LLMs work?

Posted on 15 September 2025 in AI

This article is the last of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In my last two posts, I've described what LLMs do -- what goes in, what comes out, and how we use that to create things like chatbots -- and covered the maths you need to start understanding what goes on inside them. Now the tough bit: how do we use that maths to do that work? This post will give you at least a rough understanding of what's going on -- and I'll link to more detailed posts throughout if you want to read more.

As in my last posts, though, some caveats before we start: what I'm covering here is what you need to know to understand inference -- that is, what goes on inside an existing AI when you use it to generate text, rather than the training process used to create it. I'll write about training in the future. (This also means that I'm skipping the dropout portions of the code that I've covered previously. I'll bring that back in when I get on to training.)

I'll also be ignoring batching. I'll be talking about giving an LLM a single input sequence and getting the outputs for that sequence. In reality they're sent a whole bunch of input sequences at once, and work out outputs for all of them in parallel. It's actually not all that hard to add that on, but I felt it would muddy the waters a bit to include it here.

Finally, just to set expectations up-front: when I say "how do LLMs work", I'm talking about the structure that they have. They're essentially a series of mathematical operations, and these are meaningful and comprehensible. However, the specific numbers -- the parameters, aka weights -- that are used in these operations are learned through the training process -- essentially, showing an LLM a huge pile of text (think "all of the Internet") and adjusting its weights so that it gets really good at predicting the next token in a sequence.

Why one set of parameters might be better at that job than another is not something we understand in depth, and it's the subject of a highly active research area, AI interpretability. Back in 2024, Anthropic managed to find which parts of their LLM represented the concept of the Golden Gate Bridge, and put a demo version of their Claude chatbot online that had those parts "strengthened", which gave surprisingly funny results. But doing that was really hard. We can't just look at an LLM and say "ah, that's where it thinks about such-and-such" -- people need to do things like ablations, where they remove part of it and see what effect that has on the results (which has its own problems).

But while the meanings of the specific parameters that come out of the training process are hard to work out, there's still something important that we can understand -- the specific set of calculations that we use those parameters in.

People often say of LLMs that they are just large arrays of impenetrable numbers that no-one understands, and there's an element of truth in that. But it would be more accurate to say that each LLM is made up of a set of arrays of numbers -- yes, impenetrable ones -- that are used in operations where, while the specific details might be unclear, the overall process is something we can understand.

Perhaps a metaphor is useful here: with a human brain, we don't know where the concept of "cat" is held, or what goes on when someone thinks about a cat. But we do know the general layout of the brain -- visual processing goes on in one place, audio in another, memories are controlled by this bit, and so on.

LLMs are a specific series of calculations; which calculations to use was worked out by humans thinking about the problem, and we can understand what those calculations are doing. They're not just completely random neural networks that somehow magically do their work, having learned what to do by training.

So, with all that said, let's take a look at what those calculations are, and how they work.

The top level

The definition of an LLM that we came up with in the first post was this:

It receives a sequence of token IDs that represent the text that we want to process, and outputs a sequence of vectors of logits, one for each token in the input sequence. The logits for each token are the prediction for what comes next, based only on the tokens from the start up to and including that specific token.

So, how does it do that?

Vocab space to embeddings

Let's start by thinking about the input sequence -- it's just a series of numbers, each one being the ID of a token. Those IDs don't have any meaning in themselves -- in particular, they're not continuous in any meaningful way. For example, with the GPT-2 tokeniser, "cat" is token ID 9246. Token ID 9247 is " upset" (note the leading space). What would a token ID of 9246.5 mean? Nothing, really.

So the first step in an LLM is to convert all of those token IDs into something meaningful and continuous, where a thing that is "near" the representation for "cat" is likely to be something that actually is cat-adjacent. That is, of course, the embeddings that we met in the last post. The very first step in an LLM is to convert the sequence from being a list of token IDs into a list of token embeddings.

As part of its training, it learns an embedding -- remember, that's just a vector of numbers -- for each possible input token. You can just imagine that as a matrix, with as many rows as there are tokens in the vocabulary, and as many columns as there are dimensions in the embedding space. So to convert from token ID n to the appropriate embedding, it just reads out the nth row. Simple enough!
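If it helps to see that as code, here's a minimal PyTorch sketch -- the embedding matrix is just random numbers standing in for the learned one, and the token IDs (apart from 9246 for "cat") are made up:

```python
import torch

torch.manual_seed(0)

vocab_size, emb_dim = 50257, 768        # the 124M GPT-2 sizes used in the book
E = torch.randn(vocab_size, emb_dim)    # random stand-in for the learned embedding matrix

token_ids = torch.tensor([464, 9246, 1234])   # "cat" is 9246; the other IDs are made up
token_embeddings = E[token_ids]               # just reads out rows 464, 9246 and 1234
print(token_embeddings.shape)                 # torch.Size([3, 768])
```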

But there's an extra wrinkle that I found really useful in knowing what's going on at a slightly deeper level. Back in the discussion of matrices as projections between high-dimensional spaces, I said:

A 50257x768 matrix can be seen as a projection from a 50,257-dimensional space to a 768-dimensional one, and a 768x50257 one would project from a 768-dimensional space to a 50,257-dimensional space.

I chose those numbers because the 124M parameter GPT-2 model we're building in Sebastian Raschka's book has 50,257 tokens, and the embeddings it uses are 768-dimensional. So the matrix that represents all of the embeddings -- let's call it E -- for all of the tokens is 50257x768, just like the first projection matrix above.

Now, remember that when talking about high-dimensional spaces "meaning things", I said:

an obvious minimal case in the normalised vocab space [that is, the one after softmax -- "normalisation" is used differently in this post] is a vector where all of the numbers are zero apart from one of them, which is set to one -- that is, it's saying that the probability of one particular token is 100% and it's definitely not any of the others. This is an example of a one-hot vector (not super-inventive naming) and will become important in the next post.

It turns out that if you convert all of the input token IDs into one-hot vectors like that and "stack" them on top of each other (so that for a sequence of n tokens, you would have an nx50257 matrix) -- let's call that M -- and use the 50257x768 embedding matrix E to project M into 768-dimensional space, you get exactly the matrix of embeddings that you want -- the xth row in the result of the matrix multiplication will be the embedding for the token whose ID was represented by the one-hot vector you fed in, in row x of the matrix M!

In practice, you wouldn't do the conversion that way -- you would simply put together a matrix by running through the sequence of token IDs and appending to your result the appropriate row from the embedding matrix. That's because doing it with matrix multiplications would involve a lot of extra calculations -- for example, a bunch of pointless multiplications by zero (see the post linked above for details). However, I think it's an important intuition -- the operation that we do to convert our sequence of input token IDs has exactly the same results as using the matrix of all embeddings to project that same input represented in vocab space (with one-hot vectors) to embedding space.
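If you want to convince yourself of that equivalence, here's a quick PyTorch check, again with a random stand-in for the embedding matrix E and arbitrary token IDs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 50257, 768
E = torch.randn(vocab_size, emb_dim)       # random stand-in for the embedding matrix E
token_ids = torch.tensor([464, 9246, 13])  # any IDs will do for this check

# The efficient way: read out the appropriate rows directly.
by_lookup = E[token_ids]

# The "conceptual" way: one-hot rows in vocab space (the matrix M), projected by E.
M = F.one_hot(token_ids, num_classes=vocab_size).float()   # n x 50257
by_projection = M @ E                                       # n x 768

print(torch.allclose(by_lookup, by_projection))             # True
```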

So, by this stage, we've got a list of embeddings, one per token, which are inventively called token embeddings.

The next step is to add information about where tokens are positioned in the input stream. Anthropomorphising wildly, if we don't tell it where tokens are in the sequence, the transformer system that we'll use in a little bit doesn't really know much about what order things come in -- it is aware, when working on a token, of which tokens came before it, but not where in the sequence they came. When considering the "cat" in "The fat cat", it knows that "The" and "fat" came earlier, but for all it knows it could be "fat The cat".

We need to give it a clue, and so there is a separate set of embeddings, position embeddings. These are also learned as part of training, and -- for the "absolute" position embeddings that the book and GPT-2 use -- there's one embedding meaning "the first token in the sequence", another meaning "the second token in the sequence", and so on and so forth. They are the same length -- the same dimensionality -- as the token embeddings we've already looked at -- eg. 768 for the 124M parameter GPT-2.

The token and position embeddings need to be combined to produce a single series of embeddings to feed on to the next stage -- and all we do for that is add them together! The resulting vectors are called input embeddings in the book; for "The fat cat sat on the mat", the input embedding for each token is just its token embedding plus the position embedding for its slot in the sequence.
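In code, that combination step really is just an addition -- something like this sketch, where both embedding matrices are random stand-ins for learned ones and the token IDs are made up:

```python
import torch

torch.manual_seed(0)
vocab_size, emb_dim, context_length = 50257, 768, 1024
E = torch.randn(vocab_size, emb_dim)        # token embedding matrix (stand-in)
P = torch.randn(context_length, emb_dim)    # absolute position embeddings (stand-in)

token_ids = torch.tensor([101, 202, 303, 404, 505, 606, 707])  # made-up IDs, 7 tokens
token_embeddings = E[token_ids]                                # 7 x 768
position_embeddings = P[torch.arange(len(token_ids))]          # rows 0..6, also 7 x 768

input_embeddings = token_embeddings + position_embeddings      # element-wise sum, still 7 x 768
```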

With that complete, our input massaging is done, and it's time to feed those input embeddings into the LLM itself.

Sadly not robots in disguise

The core of the LLM is a sequence of transformer blocks. I'll give an overview of how those work in the next section, but their purpose is to annotate the input embeddings with extra information -- that is, they take in the sequence of input embeddings, and add on information. There are multiple layers of them -- 12 in the GPT-2 model -- and each layer works on the output of the previous one, adding on its own notes, kind of like the way that successive scholars added notes upon notes in the Talmud.

The goal of all of this is simple: after all of the transformer layers, we have a sequence of output embeddings, one per input token, that are made up of the input ones plus whatever all of the layers have added to them. We want the output embedding in position n to be an embedding for the most likely next token to come after the part of the input sequence that finishes with token n. Or, more strictly speaking, it should express in embedding space what kinds of tokens might be appropriate for the next one.

Once we have that, we normalise it (which helps stop our numbers from getting crazy-high or crazy-low -- there's a lot of normalisation happening in the transformer layers too, specifically called layer normalisation), and then we need to somehow map the embeddings we have back to logits.

Embedding space to vocab space

And that mapping is where all of that stuff above about embedding matrices and projections finally comes in useful. We have a sequence of n embeddings representing the predicted next tokens for every token in our n-long input sequence, each of which is 768-dimensional in our 124M GPT-2 example, and we want logits -- lists of next-token likelihoods -- which live in a 50,257-dimensional space. So all we need to do is multiply the matrix of all of our final layer-normalised embeddings by an appropriate 768x50257 matrix to project them from embedding space back into vocab space!

What's kind of cool about this is that it's actually possible to use the original embedding matrix itself, E, just transposed to swap around rows and columns. That's called weight tying, because it "ties" the weights in the embedding layer to those in the output layer -- and it's what the original GPT-2 did.

In Raschka's book, however, he points out that it's generally better to just train a different matrix to do that (which makes sense: the "right" matrix to project in one direction is not necessarily the right one for the other direction). But it's neat to know that it can work. In this post I work through a simple example with an embedding space where we simply project from vocab space to embedding space and back again, and show that we do indeed get a decent reconstruction of the original tokens.
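In code, both options are just one matrix multiplication away. A sketch with random stand-ins (the names are mine, not the book's):

```python
import torch

torch.manual_seed(0)
vocab_size, emb_dim, n = 50257, 768, 7
E = torch.randn(vocab_size, emb_dim)         # the token embedding matrix (stand-in)
final_embeddings = torch.randn(n, emb_dim)   # pretend output of the last block + final layer norm

# Weight tying: reuse E, transposed, as the projection back to vocab space.
logits_tied = final_embeddings @ E.T         # n x 50257

# Untied: a separately learned output matrix, the same shape as E transposed.
W_out = torch.randn(emb_dim, vocab_size)
logits_untied = final_embeddings @ W_out     # n x 50257
```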

So, that's what the LLM does -- project the input token IDs from vocab space to embedding space, combine them with position information, do this transformer magic to get the embeddings of the predicted next token for each one, do layer normalisation, and then project back into vocab space to get logits for every token in our input sequence. And once we have logits, then we can get our predicted next token and use it.

Simple! Well, simple-ish ;-)

But what about those transformers?

Zooming in: transformer blocks

The transformer blocks are where the real magic happens. There are a number of layers of them; each layer has exactly the same structure, but they have a bunch of trainable parameters, which are different for each one. Each layer "annotates" the input embeddings, producing a new set of embeddings to feed into the next layer.

The way they do the "annotation" is conceptually kind of like the way the input embeddings are formed by adding the token embeddings to the position embeddings; in general, the idea is that if you add two embeddings together, you get something that meaningfully combines information from both of them.

So, as soon as we come into a transformer block, we stash away a copy of the input. Later on, we'll add it back into the result of some processing -- this is called a shortcut or residual connection. This shortcut helps preserve the original information, making sure the new processing doesn't "overwrite" everything.

Once we've done that, we do another one of those layer normalisations to keep the values within a reasonable range, and then we do the next magical thing, something I'll defer until the next section: we run multi-head attention (MHA) on the normalised input vectors.

For now, just think of that as a process that annotates each token with information from some of the tokens that came before it, to enrich its meaning. For example, at the first layer, where we're receiving the original input vectors -- per-token embeddings plus position vectors -- the input vector for "cat" in "the fat cat sat on the mat" would just represent "the token cat at position 3". The purpose of MHA is to add information to it -- you might imagine that it would do something so that it has some hint at being a specific cat (from the "the", as opposed to "a cat") and a hint of "fat"-ness.

Once we've done that, we add in the copies that we stashed away at the beginning -- that residual connection -- so now our vectors are essentially the original ones "annotated" with the results of the MHA. We've enriched the data so that it's a more meaningful representation of what the input sequence was about.

We stash away a fresh copy of the new vectors so that we can do another residual connection later, and go to the next part of the transformer block -- the feed-forward network. In my mental model, the MHA has enriched all of the input embeddings so that each token has some kind of representation of what it means in the context of the sequence from the start up to that token. That tells it what to think about for each token, and then the feed-forward network is what does the actual thinking. 1

But this "thinking" is actually done by a really simple network: we just run the embeddings again through another layer normalisation, then through a linear layer to project them into a higher-dimensional space (four times the number of dimensions in the GPT-2 example in the book), run the result through an activation function (GELU, see the feed-forward post for details) to add in some non-linearity, and then project it back down to the original dimensionality with another linear layer.
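Here's roughly what that looks like in PyTorch -- a sketch rather than the book's exact code, and the layer norm could equally well live in the transformer block rather than inside this module; the sequence of operations is the same either way:

```python
import torch
import torch.nn as nn

emb_dim = 768

# The feed-forward part as described above: layer norm, expand to 4x the width,
# GELU for non-linearity, then project back down to the original dimensionality.
feed_forward = nn.Sequential(
    nn.LayerNorm(emb_dim),
    nn.Linear(emb_dim, 4 * emb_dim),   # 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim),   # 3072 -> 768
)

x = torch.randn(7, emb_dim)            # seven tokens' worth of embeddings
out = feed_forward(x)                  # same shape as the input: 7 x 768
```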

We then do our second residual connection by adding the second stashed copy that we took after MHA back in, so that our feed-forward layer is annotating the original data again, and that's our result!

So, to reiterate: if we disregard the layer normalisation, each transformer block is adding in information from MHA, then running that combined result through the feed-forward network to think about it, and adding that back in too.
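Putting the whole block together as a sketch (the class name is mine, dropout is omitted as elsewhere in this post, and PyTorch's built-in nn.MultiheadAttention is standing in for the attention mechanism described in the next section, just so the code runs; it's a single unbatched sequence of embeddings going in):

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Structural sketch of one transformer block: two residual connections,
    one around attention and one around the feed-forward network, each
    preceded by a layer normalisation."""

    def __init__(self, emb_dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attention = nn.MultiheadAttention(emb_dim, num_heads)  # stand-in for the next section
        self.norm2 = nn.LayerNorm(emb_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):                     # x: n x emb_dim (a single sequence)
        shortcut = x                          # stash a copy of the input
        normed = self.norm1(x)                # layer normalisation...
        n = x.shape[0]
        causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        attn_out, _ = self.attention(normed, normed, normed, attn_mask=causal_mask)
        x = attn_out + shortcut               # ...then attention, then the first residual connection

        shortcut = x                          # stash a second copy
        x = self.feed_forward(self.norm2(x))  # layer norm, then the feed-forward network
        return x + shortcut                   # second residual connection


block = TransformerBlockSketch(emb_dim=768, num_heads=12)
out = block(torch.randn(7, 768))              # seven tokens in, seven out, same shape
```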

And that leaves us with just one thing unexplained: multi-head attention -- the part that lets tokens "look back" at earlier ones and borrow context.

In more detail: the attention mechanisms

This is the tricky bit! Hold on, it will be a bit of a ride :-) But I think it's really worth digging in quite a bit here. I'll also say up-front that I'm going to use a number of examples in here of what attention might be doing in some hypothetical case. Please do treat these as explanatory examples. Attention works out the inter-relationships between tokens by what amounts to clever pattern-matching, and it's hard to understand how it does that without examples.

But the specific kinds of patterns an LLM learns to identify and use will just be whatever patterns help it the most in its job of being part of a system that accurately predicts the next tokens for sequences 2. These are likely to be weird and alien from a human perspective, and that's why people sometimes say that we don't know what's going on inside LLMs -- that whole interpretability thing. But we can understand that they are doing pattern matching, and how they are doing it -- and that's important!

So let's start by talking about single-head attention.

Single-head attention

What we want to do is "decorate" a token's embedding by adding in information from other tokens, based on how relevant they are to it. So, in "the fat cat sat on the mat", we'll probably need to mix information from "the" and "fat" into "cat".

That's a pretty tall order, and a single attention head can't do anything that complicated -- realising that attention heads are (individually) dumb was the biggest "a-ha!" moment for me so far in learning about this stuff, so I think this deserves the deepest dive in this post.

What a single attention head does is something much simpler. As our human-readable, non-weird-and-alien example, let's imagine a single attention head that is doing something as simple as "for nouns, pick up any instances of 'the' or 'a' that relate to them, and mix in information about whether that means they're specific Xs ('the'), or general Xs ('a')". So, if fed a sequence that includes "the fat cat" it would blend that "the" into the "cat"'s embedding, while if it was fed "a thin cat" it would blend the "a" into "cat". Everything else -- "fat" and "thin" in those examples -- it would completely ignore.

What our imaginary head wants to do is match up nouns with their associated articles (that being the grammatical term for "the", "a", and similar words).

The way we think about this is normally expressed in terms that people have borrowed from database terminology. We would say in this case that "cat" is making a query for articles, and that the other tokens in the sequence aren't really making a query for anything (because in our example, only nouns want to get information from other tokens). And all of the tokens have an associated key, which is what they are -- in this case, articles or non-articles. With that information -- what each token wants in the context of this particular head, and what each token is in that same context -- we can do some pattern matching.

We do that using projections between embedding spaces. The input embeddings we have are quite rich; putting the position embeddings to one side for a moment, in the first layer, before anything has been done to it, "the fat cat" has embeddings for each token that mean something about the specific words -- the embedding for "cat" is very different to the embedding for "lamp". You could in theory reconstruct something like the original sequence from the embeddings, kind of like we do in that last projection to vocab space at the end of the LLM.

So what we do is use simpler embedding spaces. Let's start with the key vectors, the ones that say what something is. Imagine a projection that went from our original "rich" embedding space (our 768-dimensional one) into a much more impoverished (lower-dimensional, task-specific) one, which only represented "article" or "not article". Let's ignore how many dimensions it has for now; we'll just call that number dqk (you can probably guess that the subscript means "query-key"). Project "the fat cat" into that embedding space, and we get something that maps to this sequence:

the → article
fat → not-article
cat → not-article

The matrix we use to do that projection, from the rich input vectors into the new impoverished space, is called the key weights, written Wk, and for our example with 768-dimensional embeddings coming in, it will be a 768xdqk matrix.

Let's take a look at that operation in a bit more detail. When we came into the head, we had a matrix containing the input embeddings in row-major order -- that is, for each token in the input sequence, we had a row, and that row was the input embedding for that token. After projecting it into our new impoverished embedding space with Wk, we have a new matrix, still with one row per token, but with the impoverished embeddings in each row. Let's use A to mean the embedding for article above, and B to mean the one for not-article, and lower-case letters to mean elements in those vectors -- for example, a1 to mean the first element in the vector A. Our new, projected matrix after the multiplication with Wk will look like this:

$$K = \begin{bmatrix} a_1 & a_2 & \cdots & a_d \\ b_1 & b_2 & \cdots & b_d \\ b_1 & b_2 & \cdots & b_d \end{bmatrix}$$

That is, the embedding for article in the first row, and then the one for not-article in the second and the third, to mirror the sequence above.

That's our first step -- we've got the key matrix K, which is the projection of our input embeddings into this impoverished embedding space. It is, of course, sized nxdqk, where n is our number of tokens and dqk is the number of dimensions in the space. So we have a matrix representing what things are, the keys I mentioned above.

Now imagine another projection that maps input embeddings to the same embedding space as the first one, but instead of mapping tokens to what they are, it maps them to what they want -- our queries. Our head wants to associate nouns with articles, so "cat" wants -- in this sense -- articles. You might think that the sequence "the fat cat" becomes this:

the → not-article
fat → not-article
cat → article

...but that would actually be saying that the first two tokens, "the" and "fat", want to know about non-article words, not that they don't want to know about articles. So it would actually be something more like this:

the → (nothing)
fat → (nothing)
cat → article

So we project our input vectors into the same impoverished space as before, but with a different matrix, which we call the query weights, Wq -- also a 768xdqk matrix, but a different one to Wk.

To show the result, let's use the vector A again to mean article -- after all, it's the same impoverished space and so the same vector for the same thing -- and add on a new one, which we'll call C, to represent nothing. We'll get something like this:

$$Q = \begin{bmatrix} c_1 & c_2 & \cdots & c_d \\ c_1 & c_2 & \cdots & c_d \\ a_1 & a_2 & \cdots & a_d \end{bmatrix}$$

...and we've got the query matrix Q, which is also sized nxdqk.

Let's call the shared embedding space that both K and Q are in the query/key space so that I don't have to keep typing "impoverished" (it's starting to look misspelled every time). It's a dqk-dimensional space.

So, to reiterate: the thing that a particular token wants is called the "query", and the thing that a particular token actually is is called the "key". We're representing them as embeddings, both in the same conceptual space -- a specific query/key space that is used by this head only.

Now comes the clever bit. Remember that you can multiply two matrices if the number of columns in the first one matches the number of rows in the second. K and Q are both nxdqk, so we can't multiply them, but if we transpose one of them (swapping rows for columns) then we can. Let's transpose K and do that:

$$O = QK^T$$

Now we have an nxdqk matrix times a dqkxn one, so that's valid. And the result will be nxn. But what will it contain?

By the definition of matrix multiplication, Oi,j -- that is, the element at row i, column j in the output matrix -- is the dot product of row i in the first matrix, taken as a vector, with column j in the second matrix, also considered as a vector. Let's write out the multiplication in full:

$$O = \begin{bmatrix} c_1 & c_2 & \cdots & c_d \\ c_1 & c_2 & \cdots & c_d \\ a_1 & a_2 & \cdots & a_d \end{bmatrix} \begin{bmatrix} a_1 & b_1 & b_1 \\ a_2 & b_2 & b_2 \\ \vdots & \vdots & \vdots \\ a_d & b_d & b_d \end{bmatrix}$$

So, in our result matrix, the item in the first row, first column, is the dot product of the first row in the first matrix -- (c1, c2, ..., cd), which is our vector C -- and the first column in the second -- (a1, a2, ..., ad), which is our vector A. That means it is C·A. Let's write out the full result matrix:

$$O = \begin{bmatrix} C \cdot A & C \cdot B & C \cdot B \\ C \cdot A & C \cdot B & C \cdot B \\ A \cdot A & A \cdot B & A \cdot B \end{bmatrix}$$

Next, the clever bit of the clever bit ;-) Remember from the maths post that the dot product of two vectors is high if they are similar, and low if they are not. So obviously the bottom left, where we have A·A, will be a high number. Slightly less obviously, if it's a well-formed embedding space, A, B and C will be very dissimilar vectors -- so all of the other entries will be small numbers. Let's say that the result we get is actually this:

$$O = \begin{bmatrix} 0.03 & 0.06 & 0.06 \\ 0.03 & 0.06 & 0.06 \\ 147 & 0.01 & 0.01 \end{bmatrix}$$

What we've got is a matrix where each row relates to a token, and is based on its query embedding. Each value in the row is the dot product of that query embedding with the key embedding of one of the tokens in the sequence (including itself). The more similar the row's query is to the column's key, the higher the number. So what we actually have is a row for every token, each row having a column for every token. The numbers in the columns say how much attention this row's token should pay to the column's token, like this:

        the     fat     cat
the     0.03    0.06    0.06
fat     0.03    0.06    0.06
cat     147     0.01    0.01

For this really dumb attention head, which only cares about articles, the only thing it has identified is that "cat" really cares about "the". All of the other numbers are really low.

That's really nifty! By projecting our input embeddings into a space where they can represent what they want in the query matrix Q and what they are in the key matrix K, we can do a single transpose and a matrix multiplication to get, for each token, a sequence of numbers that say how much it cares about each of the other tokens. These are called attention scores -- normally written as O, as I did above.
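As a sketch in PyTorch, with random stand-ins for the learned weights and an arbitrary choice of dqk, the whole thing is just a handful of matrix multiplications:

```python
import torch

torch.manual_seed(0)
n, emb_dim, d_qk = 3, 768, 64        # three tokens; d_qk picked arbitrarily for this sketch

inputs = torch.randn(n, emb_dim)     # stand-in input embeddings for "the fat cat"
W_q = torch.randn(emb_dim, d_qk)     # query weights (learned during training in reality)
W_k = torch.randn(emb_dim, d_qk)     # key weights (likewise learned)

Q = inputs @ W_q                     # n x d_qk: what each token "wants"
K = inputs @ W_k                     # n x d_qk: what each token "is"

attention_scores = Q @ K.T           # n x n: row i, column j = how much token i cares about token j
print(attention_scores.shape)        # torch.Size([3, 3])
```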

Once we have those numbers, there are a couple of extra steps to follow.

Firstly, we divide them all by the square root of the number of dimensions in our embedding space, dqk. That's because the actual number of dimensions can be really large, and so the dot products (which are sums of a bunch of multiplications -- specifically, as many multiplications as there are dimensions) can get huge too. We're just scaling things down a bit to stop things from getting overwhelmed in the next-but-one step.

Next, we clear out all of the numbers in the top right of the matrix -- the ones that say how much a token should care about tokens that come after them in the sequence. This is called a causal mask, and reflects how we understand text ourselves -- after all, we can't look ahead into the future acausally to understand what someone is saying now in the context of something they will say later on. 3 4

Intuitively you might think that clearing them would involve replacing them with zeros, but we actually use minus infinity:

        the     fat     cat
the     0.03    -∞      -∞
fat     0.03    0.06    -∞
cat     147     0.01    0.01

...because finally, we run all of the attention scores for each token -- that is, each row -- through the now-probably-depressingly-familiar softmax function, so that they all add up to 1, giving us what we call attention weights. We used -∞ because that will always map to zero in the output of that function.

So, after that, our example above would be this (I ran it through PyTorch's softmax to be sure):

        the     fat     cat
the     1       0       0
fat     0.49    0.51    0
cat     1       0       0

You might notice something a bit surprising here -- it's actually something that I only realised as I was writing this post. It looks good for "cat", but for "the" and "fat", softmax has messed up those nice "don't pay much attention to anything" rows.

For "the", it looks pretty much harmless, as a token paying attention to itself is not unreasonable, but for "fat" we've got a bit of a mess. The problem here is that softmax's outputs always have to add up to one, so when tokens don't want to pay attention to anything in particular, they wind up paying attention to everything kind of randomly. In practice this isn't the end of the world, and using softmax turns out to be better than not using it, despite that drawback. 5
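Here's the scale-mask-softmax sequence as a PyTorch sketch, starting from the toy scores above (I've commented out the scaling step so that the numbers come out matching the tables):

```python
import torch

attention_scores = torch.tensor([
    [  0.03, 0.06, 0.06],
    [  0.03, 0.06, 0.06],
    [147.00, 0.01, 0.01],
])  # the toy scores from the tables above

# In a real head we'd first scale by the square root of d_qk, something like:
# attention_scores = attention_scores / d_qk ** 0.5

# Causal mask: everything above the diagonal becomes minus infinity...
n = attention_scores.shape[0]
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
masked = attention_scores.masked_fill(mask, float("-inf"))

# ...so that softmax maps exactly those positions to zero.
attention_weights = torch.softmax(masked, dim=-1)
print(attention_weights)   # rows: roughly [1, 0, 0], [0.49, 0.51, 0], [1, 0, 0]
```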

Our final step is to use those weights to create our final vector for this head, known as the context vector. What we want to do for each token, is produce a context vector that represents all of the other tokens, weighted by those attention weights -- so, "cat" would have something representing "the". Remember, that's going to be added on to the original input vector for "cat" as an "annotation" by our residual connection outside the attention head.

Naively, you might think that we could just take the input embeddings and multiply them by the weights -- that is, the context vector we're producing for "cat" would be the input embedding for "the" times one plus the embedding for "fat" times zero plus the one for "cat" times zero, just taking the weights from "cat"'s row at the bottom of that table of weights above. But what we actually want to add on as our "annotation" is not necessarily the same as the input vector -- "the" as an annotation for "cat" is not the same as "the" as a standalone thing (and let's not forget that the input embeddings have position embeddings in them too).

So what we do is project all of the input tokens into a different embedding space, called the value space -- again, a specific space for each head. For each input token, we essentially sum up the value-space representation of every other token in the sequence, weighted by its attention weight -- the softmaxed attention scores. So in our example, it would be the value-space embedding for "the" times its weight of one, plus the value-space embedding for "fat" times zero, plus the value-space embedding for "cat" times zero. Essentially just "the" in value space.

I won't go into the details of how we do this in this post, but it turns out to be just another matrix multiplication!
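For the curious, here's what that single matrix multiplication looks like as a sketch -- W_v and d_v are names I'm using for illustration, with random stand-ins for the learned weights:

```python
import torch

torch.manual_seed(0)
n, emb_dim, d_v = 3, 768, 64

inputs = torch.randn(n, emb_dim)     # stand-in input embeddings for "the fat cat"
W_v = torch.randn(emb_dim, d_v)      # value weights (learned during training in reality)
V = inputs @ W_v                     # n x d_v: each token projected into "value space"

attention_weights = torch.tensor([
    [1.00, 0.00, 0.00],
    [0.49, 0.51, 0.00],
    [1.00, 0.00, 0.00],
])                                   # the toy weights from the table above

# One matrix multiplication does all of the weighted sums at once:
# row i of the result is the sum over j of attention_weights[i, j] * V[j].
context_vectors = attention_weights @ V    # n x d_v
```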

Obviously, all of that is just a toy example -- for example, the proximity of any articles is ignored, so when considering the "cat" in "I gave a treat to the fat cat", it would pay just as much attention to the article "a" as it did to the article "the". The attention head would need to use the fact that the input embeddings have position embeddings mixed in as well in order to prefer "nearby" articles, or something similar.

And anyway, as I said, the actual pattern-matching happening in the heads after training is likely to be weird and alien. But even given that, it will still be pretty basic at an individual head level.

So now we have a way to go from our input embeddings to a set of context vectors that express some kind of useful contextual information that we can add to each one of them.

Multi-head attention

But now, imagine running a bunch of those in parallel. Maybe one associates articles with nouns, another associates adjectives with nouns (so that "cat" is linked with "fat"). 6

That's what multi-head attention does. With some clever use of matrix multiplication, we can run multiple parallel attention heads (12 in the GPT-2 example) on the same input in just a few operations, and get a matrix with all of the results.

We then feed that through a single linear layer to combine them all together, projecting them back to the same dimensionality as we started with, and that's our result! An annotation that the transformer block can add in to its copied input vector in its first residual connection.
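Here's a sketch of that structure, building on the single-head machinery from the previous section. The class names are mine, and -- as noted in the comments -- real implementations fold the per-head work into a few big matrix multiplications rather than looping over heads, but the result is the same:

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """One attention head, as described in the previous section
    (single unbatched sequence, dropout omitted)."""

    def __init__(self, emb_dim, head_dim):
        super().__init__()
        self.W_q = nn.Linear(emb_dim, head_dim, bias=False)  # query weights
        self.W_k = nn.Linear(emb_dim, head_dim, bias=False)  # key weights
        self.W_v = nn.Linear(emb_dim, head_dim, bias=False)  # value weights

    def forward(self, x):                           # x: n x emb_dim
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.T / K.shape[-1] ** 0.5       # scaled attention scores, n x n
        n = x.shape[0]
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))   # causal mask
        weights = torch.softmax(scores, dim=-1)             # attention weights
        return weights @ V                          # context vectors, n x head_dim


class MultiHeadAttentionSketch(nn.Module):
    """Several heads run side by side; their context vectors are concatenated
    and combined by a single linear layer. Real implementations do the
    per-head work with a few big matrix multiplications rather than a Python
    loop, but the result is the same."""

    def __init__(self, emb_dim, num_heads):
        super().__init__()
        assert emb_dim % num_heads == 0
        head_dim = emb_dim // num_heads
        self.heads = nn.ModuleList(
            [SingleHeadAttention(emb_dim, head_dim) for _ in range(num_heads)]
        )
        self.out_proj = nn.Linear(emb_dim, emb_dim)         # combines the heads' outputs

    def forward(self, x):                                   # x: n x emb_dim
        contexts = [head(x) for head in self.heads]         # each n x head_dim
        return self.out_proj(torch.cat(contexts, dim=-1))   # back to n x emb_dim


mha = MultiHeadAttentionSketch(emb_dim=768, num_heads=12)
out = mha(torch.randn(7, 768))     # seven tokens in, an n x 768 "annotation" out
```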

Rolling it all together

And that's pretty much it. Now we have a high-level overview of what happens inside an LLM. We receive our sequence of token IDs, and:

  • Convert them into embeddings, which is conceptually the same as putting them into vocab space as one-hot vectors then projecting that into embedding space (even though in practice we do the conversion in a more efficient way).
  • Run these embeddings through multiple successive transformer blocks, each of which modifies them so that by the end, the embedding at position n represents the predicted next token for the sub-sequence that goes from the start of the sequence to position n.
  • Layer normalisation
  • Project them back from embedding space to vocab space (a sketch of the whole pipeline follows this list).
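Here's that whole pipeline as a sketch, reusing the TransformerBlockSketch class from earlier -- the names and organisation are mine rather than the book's, and the sizes are the 124M GPT-2 ones:

```python
import torch
import torch.nn as nn

class GPTSketch(nn.Module):
    """Structural sketch of the whole model for a single (unbatched) sequence
    of token IDs; TransformerBlockSketch is the block sketched earlier."""

    def __init__(self, vocab_size=50257, emb_dim=768, context_length=1024,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, emb_dim)
        self.position_embedding = nn.Embedding(context_length, emb_dim)
        self.blocks = nn.Sequential(
            *[TransformerBlockSketch(emb_dim, num_heads) for _ in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(emb_dim)
        self.to_vocab = nn.Linear(emb_dim, vocab_size, bias=False)  # embedding space -> vocab space

    def forward(self, token_ids):                 # token_ids: shape (n,)
        positions = torch.arange(len(token_ids), device=token_ids.device)
        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        x = self.blocks(x)                        # the stack of transformer blocks
        x = self.final_norm(x)                    # final layer normalisation
        return self.to_vocab(x)                   # n x vocab_size logits


model = GPTSketch()
logits = model(torch.tensor([464, 9246, 1234]))   # made-up IDs, apart from 9246 for "cat"
print(logits.shape)                               # torch.Size([3, 50257])
```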

Inside the transformer blocks, we:

  • Take a copy of the input sequence of embeddings
  • Layer normalisation
  • Run MHA
  • Add the copy back in so that the version that came out of MHA is something more like an "annotation" of the original
  • Take a second copy of that one
  • Layer normalisation again
  • Run it through a simple neural network
  • Add the results of that back in.

And inside MHA, we're running a number of pattern-matching things, attention heads, in parallel. Each one of them, for each token, looks at the tokens to its left (and itself) to see if there are any that it's "interested in" (from the perspective of that head), and if there are, it adds some information about the interesting tokens to the context vector for that "interested" token.

And we're done! Through all of this projecting into different spaces, multiplying matrices, annotating and pattern-matching, we've got something that -- with the right weights -- can take a sequence of token IDs, pretty much meaningless in themselves, and produce predictions about what could come next -- and with that, we can build our chatbot.

This has been quite a long post, and was fun but challenging to write -- and I suspect will have been challenging (but I hope also fun) to read. So any feedback really would be much appreciated. Were there any bits that were hard to follow? What could have made it clearer? And, importantly, if this is a topic you understand better than I do -- what did I get wrong?

Comments, as always, would be very welcome.

Coming up next: having summarised what I've learnt so far in the book, it's time to start working on training the LLM. I'll be reporting back soon...

Here's a link to the next post in this series.


  1. That's simplified quite a bit, of course -- there's some kind of "thinking" happening in both, and indeed the two work together.

  2. To be pedantic: during training, using gradient descent, an LLM will learn whatever pattern-matching rules it needs to minimise the loss function. There's also reinforcement learning on top of that for modern LLMs, and that's a completely separate can of worms that I'll open in a future post.

  3. As is tradition in this series of blog posts, I should add "except in German" ;-)

  4. Working those numbers out and then throwing them away is, of course, wasteful -- real-world implementations would use specific code that just doesn't work them out in the first place rather than using pure matrix multiplications -- though the book does not, which makes sense as it's trying to teach the principles rather than optimise things.

  5. That's not to say it's without problems. All of this random "extra" attention does pile up, in particular near the start of the sequence, so those tokens can wind up getting more than they should. This is in effect what people mean when talking about "attention sinks" as an issue in LLMs. Back in April of this year, a very early pre-print of a paper got a lot of interest because its authors tried using an alternative to softmax that didn't have the must-sum-to-one property to avoid that, and they had promising early results. Unfortunately, later on they found that when they scaled models up to larger sizes, like 1.8B parameters, even though they got rid of the attention sinks, their results on training and benchmarks were worse -- that is, the attention sinks were gone but the models didn't work as well! That meant that the final version of the paper was much more muted. Still, in science negative results are important.

  6. "Weird and alien" caveat relegated to a footnote as you must be getting bored of them by now.
