I wanted to build on what I'd learned in chapter 6 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". That chapter takes the LLM that we've built, and then turns it into a spam/ham classifier. I wanted to see how easy it would be to take another LLM -- say, one from Hugging Face -- and do the same "decapitation" trick on it: removing the output head and replacing it with a small linear layer that outputs class logits.
Turns out it was really easy! I used
Qwen/Qwen3-0.6B-Base, and you can see the code
here.
The only real difference between our normal PyTorch LLMs and one based on Hugging
Face is that the return value when you call the model is a ModelOutput object, which
wraps more than just the model's raw output. But it has a logits field that gives you
that raw output, and once you read it from there, the rest of the code works largely unchanged.
The only other change I needed to make was to switch the padding token from the hard-coded
50256 that the book's code uses to tokenizer.pad_token_id.
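Putting those two changes together, the core of the trick looks something like this. This is a rough sketch rather than the repo's exact code; the lm_head attribute and hidden_size config field are what I'd expect for this model family, so do check print(model) for yours:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# "Decapitate" the model: swap the output head for a two-class (spam/ham) layer.
model.lm_head = torch.nn.Linear(model.config.hidden_size, 2, bias=False)

# Use the tokenizer's own pad token rather than the book's hard-coded 50256.
pad_id = tokenizer.pad_token_id

# The model returns a ModelOutput; its .logits field holds the raw scores.
inputs = tokenizer("You are a winner! Claim your prize now.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (batch, seq_len, 2)
prediction = logits[:, -1, :].argmax(dim=-1)  # classify on the last token, as in the book
```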
ChatGPT wrote a nice, detailed README for it, so hopefully it's a useful standalone artifact.
I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.
This post is a bit more hands-on. To understand how these RNNs really work, it's
best to write some actual code, so I've implemented a version of Karpathy's
original code using PyTorch's built-in
LSTM
class -- here's the repo. I've tried
to stay as close as possible to the original, but I believe
it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising,
given that he wrote it using Torch, the Lua-based predecessor to PyTorch.)
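To give a feel for what that looks like, here's a minimal sketch of a character-level model built around nn.LSTM -- the general shape of the approach, not the actual code from the repo, and all the names and sizes are just illustrative:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        x = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(x, hidden)  # (batch, seq_len, hidden_dim)
        return self.head(out), hidden       # logits over the next character
```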
In this
post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further,
actually implementing my own RNNs rather than relying on PyTorch's.
All set?
In chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", we finally trained our LLM (having learned essential aspects like cross entropy loss and perplexity along the way). This is amazing -- we've gone from essentially zero to a full pretrained model. But pretrained models aren't all that useful in and of themselves -- we normally do further training to specialise them on a particular task, like being a chatbot.
Chapter 6 explains a -- to me -- slightly surprising thing that we can do with this kind of fine-tuning. We take our LLM and convert it into a classifier that assesses whether or not a given piece of text is spam. That's simple enough that I can cover everything in one post -- so here it is :-)
This post wraps up my notes on chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Understanding cross entropy loss and perplexity were the hard bits for me in this chapter -- the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens.
The shortness of this post almost makes it feel like a damp squib. After writing so much in the last 22 posts, there's really not all that much left to say -- but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces, developed with such care, and with so much to learn, over the preceding 140 pages, with little concrete to show for it -- and suddenly, we have a codebase that we can let rip on a training set, and our model starts talking to us!
I trained my model on the sample dataset that we use in the book, the 20,000 characters of "The Verdict" by Edith Wharton, and then ran it to predict next tokens after "Every effort moves you". I got:
Every effort moves you in," was down surprise a was one of lo "I quote.
Not bad for a model trained in just over ten seconds on such a small amount of data.
The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this:
Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I
That's amazingly cool. Coherent enough that you could believe it's part of the instructions for a game.
Now, I won't go through the remainder of the chapter in detail -- as I said, it's essentially just plugging together the various bits that we've gone through so far, even though the results are brilliant. In this post I'm just going to make a few brief notes on the things that I found interesting.
Being on a sabbatical means having a bit more time on my hands than I'm used to, and I wanted to broaden my horizons a little. I've been learning how current LLMs work by going through Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but how about the history -- where did this design come from? What did people do before Transformers?
Back when it was published in 2015, Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" went viral.
It's easy to see why. While interesting stuff had been coming out of AI labs for some time, for those of us in the broader tech community, it still felt like we were in an AI winter. Karpathy's post showed that things were in fact moving pretty fast -- he showed that he could train recurrent neural networks (RNNs) on text, and get them to generate surprisingly readable results.
For example, he trained one on the complete works of Shakespeare, and got output like this:
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
As he says, you could almost (if not quite) mistake it for a real quote! And this is from a network that had to learn everything from scratch -- no tokenising, just bytes. It went from generating random junk like this:
bo.+\x94G5YFM,}Hx'E{*T]v>>,2pw\nRb/f{a(3n.\xe2K5OGc
...to learning that there was such a thing as words, to learning English words, to learning the rules of layout required for a play.
This was amazing enough that it even hit the mainstream. A meme template you still see everywhere is "I forced a bot to watch 10,000 episodes of $TV_SHOW and here's what it came up with" -- followed by some crazy parody of the TV show in question. (A personal favourite is this one by Keaton Patti for "Queer Eye".)
The source of that meme template was actually a real thing -- a developer called Andy Herd trained an RNN on scripts from "Friends", and generated an almost-coherent but delightfully quirky script fragment. Sadly I can't find it on the Internet any more (if anyone has a copy, please share!) -- Herd is no longer on X/Twitter, and there seems to be no trace of the fragment, just news stories about it. But that was in early 2016, just after Karpathy's blog post. People saw it, thought it was funny, and (slightly ironically) discovered that humans could do better.
So, this was a post that showed techies in general how impressive the results you could get from then-recent AI were, and that had a viral impact on Internet culture. It came out in 2015, two years before "Attention Is All You Need", which introduced the Transformers architecture that powers essentially all mainstream AI these days. (It's certainly worth mentioning that the underlying idea wasn't exactly unknown, though -- near the end of the post, Karpathy explicitly highlights that the "concept of attention is the most interesting recent architectural innovation in neural networks".)
I didn't have time to go through it and try to play with the code when it came out, but now that I'm on sabbatical, it's the perfect time to fix that! I've implemented my own version using PyTorch, and you can clone and run it. Some sample output after training on the Project Gutenberg Complete Works of Shakespeare:
SOLANIO.
Not anything
With her own calling bids me, I look down,
That we attend for letters--are a sovereign,
And so, that love have so as yours; you rogue.
We are hax on me but the way to stop.
[_Stabs John of London. But fearful, Mercutio as the Dromio sleeps
fallen._]
ANTONIO.
Yes, then, it stands, and is the love in thy life.
There's a README.md in the repo with full instructions about how to use it --
I wrote the code myself (with some AI guidance on how to use the APIs), but Claude
was invaluable for taking a look at the codebase and generating much better and
more useful instructions than I would have written on my own :-)
This code is actually "cheating" a bit, because Karpathy's original repo
has a full implementation of several kinds of RNNs (in Lua, which is what the
original Torch framework was based on), while I'm using PyTorch's
built-in LSTM class, which implements a Long Short-Term Memory network -- the specific
kind of RNN used to generate the samples in the post (though not in the code snippets,
which are from "vanilla" RNNs).
Over the next few posts in this series (which I'll interleave with "LLM from scratch" ones), I'll cover:
However, in this first post I want to talk about the original article and highlight how the techniques differ from what I've seen while learning about modern LLMs.
If you're interested (and haven't already zoomed off to start generating your own version of "War and Peace" using that repo), then read on!
I'm continuing through chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers training the LLM. Last time I wrote about cross entropy loss. Before moving on to the next section, I wanted to post about something that the book only covers briefly in a sidebar: perplexity.
Back in May, I thought I had understood it:
Just as I was finishing this off, I found myself thinking that logits were interesting because you could take some measure of how certain the LLM was about the next token from them. For example, if all of the logits were the same number, it would mean that the LLM has absolutely no idea what token might come back -- it's giving an equal chance to all of them. If all of them were zero apart from one, which was a positive number, then it would be 100% sure about what the next one was going to be. If you could represent that in a single number -- let's say, 0 means that it has only one candidate and 1 means that it hasn't even the slightest idea what is most likely -- then it would be an interesting measure of how certain the LLM was about its choice.
Turns out (unsurprisingly) that I'd re-invented something that's been around for a long time. That number is called perplexity, and I imagine that's why the largest AI-enabled web search engine borrowed that name.
I'd misunderstood. From the post on cross entropy, you can see that the measure that I was talking about in May was something more like the simple Shannon entropy of the LLM's output probabilities. That's a useful number, but perplexity is something different.
Its actual calculation is really simple -- you just raise the base of the logarithms
you were using in your cross entropy loss to the power of that loss. So if you were
using the natural logarithm to work out your loss \( L \), perplexity would
be \( e^L \); if you were using the base-2 logarithm, then it would be \( 2^L \), and so on.
PyTorch uses the natural logarithm, so you'd use the matching torch.exp function.
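For example, here's a toy sketch with made-up logits (not the book's code) showing the calculation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # 4 positions, 10-token vocabulary
targets = torch.tensor([1, 5, 0, 7])     # the "correct" next tokens

loss = F.cross_entropy(logits, targets)  # cross entropy, natural logarithm
perplexity = torch.exp(loss)             # e raised to the power of the loss
print(loss.item(), perplexity.item())
```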
Raschka says that perplexity "measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset", and that it "is often considered more interpretable than the raw [cross entropy] loss value because it signifies the effective vocabulary size about which the model is uncertain at each step."
This felt like something I would like to dig into a bit.
Chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how to train the LLM. There are a number of things in there that required a bit of thought, so I'll post about each of them in turn.
The chapter starts off easily, with a few bits of code to generate some sample
text. Because we have a call to torch.manual_seed at the start to make the random
number generator deterministic, you can run the code and get exactly the same results
as appear in the book, which is an excellent sanity check.
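If you haven't come across it before, here's a tiny illustration of what torch.manual_seed does -- the seed value here is arbitrary, not necessarily the one the book uses:

```python
import torch

torch.manual_seed(123)
print(torch.rand(3))    # a fixed sequence of pseudo-random numbers

torch.manual_seed(123)  # reset the seed...
print(torch.rand(3))    # ...and you get exactly the same numbers again
```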
Once that's covered, we get into the core of the first section: how do we write our loss function?
This article is the last of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
In my last two posts, I've described what LLMs do -- what goes in, what comes out, and how we use that to create things like chatbots -- and covered the maths you need to start understanding what goes on inside them. Now the tough bit: how do we use that maths to do that work? This post will give you at least a rough understanding of what's going on -- and I'll link to more detailed posts throughout if you want to read more.
As in my last posts, though, some caveats before we start: what I'm covering here is what you need to know to understand inference -- that is, what goes on inside an existing AI when you use it to generate text, rather than the training process used to create it. I'll write about training in the future. (This also means that I'm skipping the dropout portions of the code that I've covered previously. I'll bring that back in when I get on to training.)
I'll also be ignoring batching. I'll be talking about giving an LLM a single input sequence and getting the outputs for that sequence. In reality they're sent a whole bunch of input sequences at once, and work out outputs for all of them in parallel. It's actually not all that hard to add that on, but I felt it would muddy the waters a bit to include it here.
Finally, just to set expectations up-front: when I say "how do LLMs work", I'm talking about the structure that they have. They're essentially a series of mathematical operations, and these are meaningful and comprehensible. However, the specific numbers -- the parameters, aka weights -- that are used in these operations are learned through the training process: essentially, showing an LLM a huge pile of text (think "all of the Internet") and adjusting its weights until it gets really good at predicting the next token in whatever sequence it sees.
Why one set of parameters might be better at that job than another is not something we understand in depth, and is a highly active research area, AI interpretability. Back in 2024, Anthropic managed to find which parts of their LLM represented the concept of the Golden Gate Bridge, and put a demo version of their Claude chatbot online that had those parts "strengthened", which gave surprisingly funny results. But doing that was really hard. We can't just look at an LLM and say "ah, that's where it thinks about such-and-such" -- people need to do things like ablations, where they remove part of it and see what effect that has on the results (which has its own problems).
But while the meanings of the specific parameters that come out of the training process are hard to work out, there's still something important that we can understand -- the specific set of calculations that we use those parameters in.
People often say of LLMs that they are just large arrays of impenetrable numbers that no-one understands, and there's an element of truth in that. But it would be more accurate to say that each LLM is made up of a set of arrays of numbers -- yes, impenetrable ones -- which are doing things where, while the specific details might be unclear, the overall process is something we can understand.
Perhaps a metaphor is useful here: with a human brain, we don't know where the concept of "cat" is held, or what goes on when someone thinks about a cat. But we do know the general layout of the brain -- visual processing goes on in one place, audio in another, memories are controlled by this bit, and so on.
LLMs are a specific series of calculations; which calculations work well was determined by humans thinking hard about the problem, and we can understand what those calculations are doing. They're not just arbitrary neural networks that somehow magically do their work, having learned what to do by training.
So, with all that said, let's take a look at what those calculations are, and how they work.
My last post, about the maths you need to start understanding LLMs, took off on Hacker News over the weekend.
It's always nice to see lots of people reading and -- I hope! -- enjoying something that you've written. But there's another benefit. If enough people read something, some of them will spot errors or confusing bits -- "given enough eyeballs, all bugs are shallow".
Commenter bad_ash made the excellent point that in the phrasing I originally had, a naive reader might think that activation functions are optional in neural networks in general, which of course isn't the case. What I was trying to say was that we can use one without an activation function for other purposes (and we do in LLMs). I've fixed the wording to (hopefully) make that a bit clearer.
ThankYouGodBless made a thoughtful comment about vector normalisation and cosine similarity, which was a great point in itself, but it also made something clear: although the post linked to an article I wrote back in February that covered the dot product of vectors, it really needed its own section on that. Without understanding what the dot product is, and how it relates to similarity, it's hard to get your head around how attention mechanisms work. I've added a section to the post, but for the convenience of anyone following along over RSS, here's what I said:
The dot product is an operation that works on two vectors of the same length. It simply means that you multiply the corresponding elements, then add up the results of those multiplications:

\[ \vec a \cdot \vec b = a_1 b_1 + a_2 b_2 + \dots + a_n b_n \]

Or, more concretely:

\[ (1, 2, 3) \cdot (4, 5, 6) = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32 \]
This is useful for a number of things, but the most interesting is that the dot product of two vectors of roughly the same length is quite a good measure of how close they are to pointing in the same direction -- that is, it's a measure of similarity. If you want a perfect comparison, you can scale them both so that they have a length of one, and then the dot product is exactly equal to the cosine of the angle between them (which is logically enough called cosine similarity).
But even without that kind of precise normalisation (which requires calculating squares and roots, so it's kind of expensive), so long as the vectors are close in length, it gives us meaningful numbers -- so, for example, it can give us a quick-and-dirty way to see how similar two embeddings are.
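In PyTorch terms, with a couple of made-up vectors (this is just an illustration, not code from any of my repos):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.5, 1.8, 3.2])

dot = torch.dot(a, b)                        # quick-and-dirty similarity measure
cos = torch.dot(a / a.norm(), b / b.norm())  # normalise to unit length first: cosine similarity
# (torch.nn.functional.cosine_similarity(a, b, dim=0) gives the same result.)
print(dot.item(), cos.item())
```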
Unfortunately the proof of why the dot product is a measure of similarity is a bit tricky, but this thread by Tivadar Danka is reasonably accessible if you want to get into the details.
As promised, up next: how do we put all of that together, along with the high-level stuff I described about LLMs in my last post, to understand how an LLM works?
This article is the second of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". You can read the first post in this mini-series here.
Actually coming up with ideas like GPT-based LLMs and doing serious AI research requires serious maths. But the good news is that if you just want to understand how they work, the maths you need is much more modest: if you studied it at high school at any time since the 1960s, you did all of the groundwork then -- vectors, matrices, and so on.
One thing to note -- what I'm covering here is what you need to know to understand inference -- that is, using an existing AI rather than the training process used to create it. That's also not much beyond high-school maths, but I'll be writing about it later on.
So, with that caveat, let's dig in!