TRL - Transformer Reinforcement Learning
A comprehensive library to post-train foundation models
What's New
OpenEnv Integration: TRL now supports OpenEnv, the open-source framework from Meta for defining, deploying, and interacting with environments in reinforcement learning and agentic workflows.
Explore how to seamlessly integrate TRL with OpenEnv in our dedicated documentation.
Overview
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled up across various hardware setups.
Highlights
- Trainers: Various fine-tuning methods are easily accessible via trainers like SFTTrainer, GRPOTrainer, DPOTrainer, RewardTrainer, and more.
- Efficient and scalable:
  - Leverages Accelerate to scale from a single GPU to multi-node clusters using methods like DDP and DeepSpeed.
  - Full integration with PEFT enables training large models on modest hardware via quantization and LoRA/QLoRA.
  - Integrates Unsloth to accelerate training with optimized kernels.
- Command Line Interface (CLI): A simple interface lets you fine-tune models without writing code.
Installation
Python Package
Install the library using pip:

pip install trl
From source
If you want to use the latest features before an official release, you can install TRL from source:

pip install git+https://github.com/huggingface/trl.git
Repository
If you want to use the examples, you can clone the repository with the following command:

git clone https://github.com/huggingface/trl.git
Quick Start
For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
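Distributed training works through Accelerate's standard launcher; a sketch, assuming your training script is named train.py (the script name is illustrative):

```shell
# Configure Accelerate once, choosing DDP, DeepSpeed, or FSDP interactively
accelerate config
# Then launch the same training script across all available GPUs
accelerate launch train.py
```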
SFTTrainer
Here is a basic example of how to use the SFTTrainer:
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset,
)
trainer.train()
GRPOTrainer
GRPOTrainer implements the Group Relative Policy Optimization (GRPO) algorithm, which is more memory-efficient than PPO and was used to train DeepSeek's R1.
from datasets import load_dataset
from trl import GRPOTrainer
from trl.rewards import accuracy_reward
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
trainer.train()
Note
For reasoning models, use the reasoning_accuracy_reward() function for better results.
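reward_funcs also accepts plain Python callables, which makes it easy to prototype custom rewards. A minimal sketch, assuming completions arrive as plain strings (the function name and the brevity heuristic are purely illustrative, not part of TRL):

```python
def brevity_reward(completions, **kwargs):
    """Return one float score per completion; here, shorter answers score higher."""
    return [1.0 / (1.0 + len(c)) for c in completions]

# Passed like a built-in reward: GRPOTrainer(..., reward_funcs=brevity_reward)
```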
DPOTrainer
DPOTrainer implements the popular Direct Preference Optimization (DPO) algorithm that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the DPOTrainer:
from datasets import load_dataset
from trl import DPOTrainer
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = DPOTrainer(
model="Qwen/Qwen3-0.6B",
train_dataset=dataset,
)
trainer.train()
RewardTrainer
Here is a basic example of how to use the RewardTrainer:
from datasets import load_dataset
from trl import RewardTrainer
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = RewardTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
train_dataset=dataset,
)
trainer.train()
Command Line Interface (CLI)
You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):
SFT:

trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT
DPO:

trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO
Read more about the CLI in the relevant documentation section, or use --help for more details.
Development
If you want to contribute to trl or customize it to your needs, make sure to read the contribution guide and do a dev install:

git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .[dev]
Experimental
A minimal incubation area is available under trl.experimental for unstable / fast-evolving features. Anything there may change or be removed in any release without notice.
Read more in the Experimental docs.
Citation
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouedec, Quentin},
  license = {Apache-2.0},
  url = {https://github.com/huggingface/trl},
  year = {2020}
}
License
This repository's source code is available under the Apache-2.0 License.