Fine-tuning LLMs
From April until December 2024, I explored how you go about fine-tuning a 7B base model to handle chat. I started by training a smaller model locally, then worked out how to train in cloud computing environments, including multi-GPU training and training on machines where even a server-grade H100 GPU wasn't big enough to train the model.
Here are the posts in this series:
- Messing around with fine-tuning LLMs (27 April 2024). In the first post in the series, I scope out the task and fine-tune a 0.5B model on my own machine.
- Messing around with fine-tuning LLMs, part 2 -- to the cloud! (28 April 2024). Next, I take a look at cloud GPU providers and pick Lambda Labs. As a sanity check, I replicate my fine-tune of the 0.5B model on a single-GPU instance there.
- Messing around with fine-tuning LLMs, part 3 -- moar GPUs (15 May 2024). I then work out how to train the 0.5B model faster by using multiple GPUs in parallel.
- Messing around with fine-tuning LLMs, part 4 -- training cross-GPU (21 May 2024). The first successful fine-tune of a 7B model -- but I have to offload the optimizer to the CPU. I'll need to find out why.
- Messing around with fine-tuning LLMs, part 5 -- exploring memory usage (5 July 2024). Some initial local experiments into memory usage for the 0.5B model, to get some ideas as to why I had to offload the optimizer.
- Messing around with fine-tuning LLMs, part 6 -- measuring memory usage more systematically (10 July 2024). Measuring memory usage more systematically for the 0.5B model, also locally, to find out how it behaves with different sequence lengths (there's a minimal sketch of this kind of measurement after this list).
- Messing around with fine-tuning LLMs, part 7 -- detailed memory usage across sequence lengths for an 8B model (16 August 2024). Making similar measurements of memory usage at different sequence lengths, this time for an 8B model.
- Messing around with fine-tuning LLMs, part 8 -- detailed memory usage across batch sizes (25 August 2024). Measuring the effect of batch size on memory usage, with a sidetrack looking into Liger Kernel, a new and easy-to-use replacement for the default CUDA kernels used for training that promises (and delivers) better memory usage and performance; there's a rough sketch of how it's enabled after this list.
- Messing around with fine-tuning LLMs, part 9 -- gradient checkpointing (3 September 2024). Investigating how gradient checkpointing works, in the hope that it might allow me to trade off GPU processing for memory usage and get a larger batch size (meaning that each training iteration would be slower, but the overall training run would take less time). Sadly, those hopes were dashed. There's a minimal sketch of switching it on after this list.
- Messing around with fine-tuning LLMs, part 10 -- finally training the model! (22 December 2024). The last in the series -- a deep dive into fine-tuning the 8B parameter LLM on instruction data, exploring memory usage, training strategies, and model deployment to Hugging Face.
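A few sketches for the curious, since the summaries above only name the techniques. First, the memory measurements in parts 5 to 8 essentially boiled down to running a single training step at each sequence length and recording the peak memory that CUDA reports. A minimal sketch of that idea -- the model name, batch size and sequence lengths here are placeholders rather than the exact setup from the posts:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model -- the local experiments used a small 0.5B model.
model_name = "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters())

for seq_len in (256, 512, 1024, 2048):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # One dummy training step at this sequence length.
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device="cuda")
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"seq_len={seq_len}: peak allocated {peak_gib:.2f} GiB")
```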
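The Liger Kernel change from part 8 is essentially a one-line patch applied before the model is loaded. Something like the following, assuming a Llama-architecture model -- the checkpoint name is a placeholder, and it's worth checking the Liger Kernel docs for the patch function matching your architecture:

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Patch the Hugging Face Llama modelling code with Liger's fused kernels
# before the model is instantiated.
apply_liger_kernel_to_llama()

# Placeholder checkpoint -- any Llama-architecture model should pick up the patch.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# ...then train as normal; the patched kernels reduce peak memory usage.
```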
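And the gradient checkpointing experiment from part 9 is also only a couple of lines with Hugging Face Transformers, which is part of why it looked so promising before the numbers came in. A minimal sketch (again, the model name is a placeholder):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Recompute activations during the backward pass instead of storing them all,
# trading extra compute per step for lower memory -- which in principle
# frees up room for a larger batch size.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing
```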