I spent some time recently getting a set of models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right.
The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end.
This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-)
Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it?
You could, of course, just dump the code on GitHub and share the weights somewhere. If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on.
That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub.
What would be nice is to share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this:
```python
from transformers import pipeline

pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
print(out[0]["generated_text"])
```
...rather than something daunting like this code, with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config and then a call to `trainer.train()` -- rather than like this, with its >100-line train function.
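For the training side, the sort of thing I mean looks roughly like the sketch below. The repo name and the dataset file are placeholders (not the actual notebook's code), but the overall shape -- load the model and tokeniser, tokenise some text, hand it to `Trainer` -- is the point:

```python
# A minimal sketch of Trainer-based training against a custom model on the Hub.
# "some-hf-user/some-model-name" and "my_corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "some-hf-user/some-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Any text dataset will do; this just loads a local plain-text file.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```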
Here's what I had to do to get it working.
As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank.
It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too.
Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later).
From the post where I trained the models locally, we have:
- gpjt/1xrtx3090m24-fineweb -- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters) from FineWeb.
- gpjt/1xrtx3090m24-fineweb-edu -- the second model, trained on the same number of tokens from FineWeb-Edu.
- gpjt/1xrtx3090m24-fineweb-edu-2x -- the third one, which is the gpjt/1xrtx3090m24-fineweb-edu model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset.

Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them):

- gpjt/8xa100m40 -- trained on an 8x A100, 40 GiB/GPU machine.
- gpjt/8xb200m160 -- trained on an 8x B200, 160 GiB/GPU machine.
- gpjt/8xh100m80-best -- trained on an 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not in the last iteration, so this is the checkpoint with the best loss.
- gpjt/8xh100m80-latest -- the final checkpoint from the train above.
- gpjt/8xa100m80 -- trained on an 8x A100, 80 GiB/GPU machine.

You can see how they compare on my evals at the bottom of this post.
I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that.
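In practice, "usable within the Hugging Face ecosystem" boils down to the custom-model pattern from the Transformers docs: a `PretrainedConfig` subclass, a `PreTrainedModel` subclass that wraps the actual PyTorch model, and registration with the Auto classes. A rough sketch of the shape of it (the class names and defaults here are illustrative placeholders, not the exact code in my repos):

```python
# The general custom-model pattern from the Transformers docs. MyGPTConfig and
# MyGPTForCausalLM are placeholder names, not the classes in my actual repos.
from transformers import PretrainedConfig, PreTrainedModel

class MyGPTConfig(PretrainedConfig):
    model_type = "my-gpt"

    def __init__(self, vocab_size=50257, emb_dim=768, n_layers=12, n_heads=12, **kwargs):
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        super().__init__(**kwargs)

class MyGPTForCausalLM(PreTrainedModel):
    config_class = MyGPTConfig

    def __init__(self, config):
        super().__init__(config)
        # ...build the actual PyTorch modules from the config here,
        # and implement forward() to return logits/loss...

# Registering the classes is what lets AutoModelForCausalLM and pipeline(...)
# find your code when someone loads the repo with trust_remote_code=True.
MyGPTConfig.register_for_auto_class()
MyGPTForCausalLM.register_for_auto_class("AutoModelForCausalLM")
```

With the classes registered, `save_pretrained()` / `push_to_hub()` copy the defining Python files into the repo alongside the weights and record them in the config's `auto_map`, which is what `trust_remote_code=True` uses to load them.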
I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: the test loss on held-out data, and an "IFT score" produced by instruction fine-tuning each model and having a GPT-5.1 LLM-as-a-judge rate its responses.
Here were the results I got, sorted by the loss:
| Model | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 38.53 |
| OpenAI weights: small | 3.500 | 22.98 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 17.09 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.98 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.71 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 13.89 |
| Local FineWeb train | 3.944 | 16.01 |
| Local FineWeb-Edu extended train | 4.135 | 14.55 |
| Local FineWeb-Edu train | 4.167 | 16.86 |
Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern.
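To put a number on "no logical pattern", here's a quick rank-correlation check across my own seven results; the numbers are copied straight from the table above, and it's just `scipy.stats.spearmanr`, nothing fancy:

```python
# Rank correlation between test loss and IFT score for my own trains
# (numbers copied from the table above; the OpenAI weights are excluded).
from scipy.stats import spearmanr

test_loss = [3.674, 3.725, 3.730, 3.771, 3.944, 4.135, 4.167]
ift_score = [17.09, 11.98, 11.71, 13.89, 16.01, 14.55, 16.86]

rho, p_value = spearmanr(test_loss, ift_score)
print(f"Spearman rho: {rho:.2f} (p = {p_value:.2f})")
# If lower loss reliably meant a higher IFT score, rho would be strongly negative.
```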
I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.
In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.
I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that:
In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way?
Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing.
So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"
Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I found that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each with 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. One of those machines can train the model in less than four hours, happens to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation.
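For a sense of scale, the "~Chinchilla-optimal" part comes from the usual rule of thumb of roughly 20 training tokens per parameter:

```python
# Back-of-the-envelope Chinchilla arithmetic for this model size.
params = 163e6           # ~163M parameters
tokens_per_param = 20    # the rough Chinchilla rule of thumb
print(f"{params * tokens_per_param / 1e9:.1f}B tokens")  # roughly 3.3B tokens for one pass
```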
If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware?
The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series, I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time.
But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC.
Additionally, Andrej Karpathy recently announced nanochat, "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called d32, with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer d20 model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price.

What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA llm.c, which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series!
Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer as he is, and he's using a larger machine (8 GPUs, each with more than 3x the VRAM of mine), but he's also including the time to train a tokeniser and to instruction fine-tune in that four hours -- and his smaller model is more than three times larger than mine. So that should all help.
This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project.
But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs.
Here's the full story.
On 22 December 2024, I wrote:
Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting.
More than ten months and 26 blog posts later, I've reached the end of the main body of the book -- there's just the appendices to go. Even allowing for the hedging, my optimism was adorable.
I don't want to put anyone else off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I'd only been reading it and doing the exercises. But on the other hand, writing things up does take a lot of time, much more than the actual learning does. It's worth it for me, but probably isn't for everyone.
So, what next? I've finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my "LLM from scratch" journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series?
This post is on the second half of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I covered the part of the chapter on instruction fine-tuning itself; this time round, we evaluate our model -- particularly interestingly, we try using another, smarter model to judge how good its responses are.
Once again, Raschka's explanation in this section is very clear, and there's not that much that was conceptually new to me, so I don't have that many notes -- in fact, this post is probably the shortest one in my series so far!
This post is on the first part of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers instruction fine-tuning.
In my last post, I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way.
So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get.
Just as with the last chapter, what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.
Chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot.
Back when I first started looking into LLMs, I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models.
The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM:
```
User: Provide a synonym for 'bright'
Bot:
```
...you would instead prepare it with an introductory paragraph, like this:
```
This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'. The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'
Bot:
```
Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could.
How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model?
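If you want to try this kind of thing yourself, the experiment is easy to set up; here's a minimal sketch using the Transformers pipeline (the "gpt2" checkpoint is the stock Hugging Face one, used purely for illustration, and the sampling parameters are arbitrary):

```python
# Trying the "transcript" trick on a base (non-instruction-tuned) model.
# "gpt2" is the stock Hugging Face checkpoint, used here just for illustration.
from transformers import pipeline

preamble = (
    "This is a transcript of a conversation between a helpful bot, 'Bot', "
    "and a human, 'User'. The bot is very intelligent and always answers "
    "the human's questions with a useful reply.\n\n"
)
prompt = preamble + "User: Provide a synonym for 'bright'\nBot:"

pipe = pipeline(task="text-generation", model="gpt2")
out = pipe(prompt, max_new_tokens=20, do_sample=True, temperature=0.8, top_k=40)
print(out[0]["generated_text"])
```

Swapping the model name for a larger base model lets you see how much more mileage the preamble gets you as the model scales up.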
I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.
This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in LSTM class -- here's the repo. I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.)
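The core of that kind of model is pretty small; here's a rough sketch of the general pattern (not the actual code from the repo, and the hyperparameters are arbitrary):

```python
# A rough sketch of a char-rnn-style model built on nn.LSTM -- not the actual
# code from the repo, just the general shape. Hyperparameters are arbitrary.
import torch
from torch import nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # predict the next character

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) of character indices
        emb = self.embed(x)
        out, hidden = self.lstm(emb, hidden)
        return self.head(out), hidden  # logits: (batch, seq_len, vocab_size)

# Quick shape check with random character indices
model = CharLSTM(vocab_size=65)
logits, _ = model(torch.randint(0, 65, (8, 100)))
print(logits.shape)  # torch.Size([8, 100, 65])
```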
In this post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's.
All set?