Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
Official implementation of the paper "Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning".
Overview
This repository implements a curriculum learning framework for training large language models (LLMs) on reasoning tasks using GRPO (Group Relative Policy Optimization). The framework progressively trains models from easy to hard tasks, improving their reasoning capabilities across multiple domains.
Table of Contents
- Overview
- Table of Contents
- Installation
- Prerequisites
- Setup Environment
- Curriculum Schedules
- 1. Classic
- 2. Balanced
- 3. Cosine
- 4. Gaussian
- Configuration
- Project Structure
- Configuration Structure
- Training
- Default Args
- Custom Args
- VLLM server setup
- Training
- Evaluation
- Citation
- License
- Acknowledgments
Installation
Prerequisites
- Python 3.10+
- CUDA 12.x compatible GPU
- Conda or Mamba package manager
Setup Environment
- Clone the repository:
cd E2H-Reasoning
- Create the conda environment:
Curriculum Schedules
The framework supports four curriculum learning schedules:
1. Classic
Simple linear progression through tasks based on training progress.
2. Balanced
Equal probability for all task difficulty levels throughout training.
3. Cosine
Smooth transition from easy to hard tasks using cosine annealing.
4. Gaussian
Gaussian distribution with a moving center, transitioning from easy to hard tasks.
Configuration Example:
e2h_args:
  curriculum_schedule: gaussian  # Options: classic, balanced, cosine, gaussian
  scheduler_params:
    mu_exp: 0.5
    sigma: 0.5
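As a rough illustration of how the four schedules differ, the sketch below computes sampling probabilities over difficulty levels as a function of training progress. The function name `schedule_probs`, its arguments, and the exact formulas are assumptions made for illustration and are not taken from this repository's trainer. In particular, `mu_exp` from the configuration above is not modeled, and `sigma` is assumed to be a width measured in difficulty levels.

```python
import math


def schedule_probs(schedule, progress, num_levels, sigma=0.5):
    """Sampling probabilities over difficulty levels (index 0 = easiest).

    `progress` is the fraction of training completed, in [0, 1].
    Hypothetical formulas for illustration; not the repository's code.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    levels = range(num_levels)

    if schedule == "classic":
        # One level at a time, advancing linearly with training progress.
        active = min(int(progress * num_levels), num_levels - 1)
        weights = [1.0 if k == active else 0.0 for k in levels]
    elif schedule == "balanced":
        # Every level is equally likely for the whole run.
        weights = [1.0] * num_levels
    elif schedule == "cosine":
        # alpha anneals from 1 (favor easy) to 0 (favor hard) along a cosine.
        alpha = 0.5 * (1.0 + math.cos(math.pi * progress))
        weights = [
            alpha * (num_levels - k) + (1.0 - alpha) * (k + 1) for k in levels
        ]
    elif schedule == "gaussian":
        # Bell curve whose center moves from the easiest to the hardest level.
        center = progress * (num_levels - 1)
        weights = [
            math.exp(-((k - center) ** 2) / (2.0 * sigma**2)) for k in levels
        ]
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")

    total = sum(weights)
    return [w / total for w in weights]
```

Calling `schedule_probs("gaussian", t, 3)` with `t` increasing from 0 to 1 moves most of the probability mass from the easiest level to the hardest, while `balanced` returns the same uniform distribution at every step.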
Configuration
The project uses Hydra for configuration management. Configuration files are located in config/.
Project Structure
curriculum-reasoning/
+-- config/ # Hydra configuration files
| +-- algorithm/ # Algorithm configs (GRPO)
| +-- model/ # Model configs (Qwen, Llama)
| +-- task/ # Task configs (GSM8K, MATH, etc.)
| +-- config.yaml # Base configuration
+-- env/
| +-- environment.yml # Conda environment specification
+-- src/
| +-- datasets.py # Dataset loading and preprocessing
| +-- rewards.py # Reward function implementations
| +-- trainer.py # CurriculumGRPOTrainer
+-- main.py # Main entry point for training/testing
+-- run.sh # SLURM submission script
+-- README.md # This file
Configuration Structure
config/
+-- algorithm/
| +-- grpo.yaml # GRPO training parameters
+-- model/
| +-- qwen1.5b.yaml # Qwen 1.5B model config
| +-- qwen3b.yaml # Qwen 3B model config
| +-- llama3b.yaml # Llama 3B model config
+-- task/
| +-- gsm8k.yaml # GSM8K task config
| +-- math.yaml # MATH task config
| +-- aqua.yaml # AQUA task config
| +-- blocksworld.yaml # Blocksworld task config
| +-- countdown.yaml # Countdown task config
+-- config.yaml # Base configuration
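To make the layout above concrete, the following sketch shows how Hydra-style `group=option` selections such as `model=qwen1.5b` map onto files in this tree. It assumes that `main.py` points Hydra at `config/` and that `model`, `task`, and `algorithm` are declared as config groups in `config.yaml`. The helper `resolve_group_files` is hypothetical and only mimics the naming convention; it is not part of Hydra or of this repository.

```python
from pathlib import PurePosixPath


def resolve_group_files(selections, config_dir="config"):
    """Map `group=option` selections to the YAML file each one would pick.

    Hypothetical helper: mirrors the <config_dir>/<group>/<option>.yaml
    convention and does not read or validate any files.
    """
    files = {}
    for item in selections:
        group, sep, option = item.partition("=")
        if not sep or not group or not option:
            raise ValueError(f"expected group=option, got {item!r}")
        files[group] = PurePosixPath(config_dir) / group / f"{option}.yaml"
    return files
```

For example, `resolve_group_files(["model=qwen1.5b", "task=blocksworld"])` points at `config/model/qwen1.5b.yaml` and `config/task/blocksworld.yaml`.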
Training
Default Args
To run our code with the default arguments, pass only the model, task, and curriculum schedule:
--model=<qwen1.5b,qwen3b,llama3b> \
--task=<aqua,blocksworld,countdown,gsm8k,math> \
--curriculum_schedule=<classic,balanced,cosine,gaussian>
Custom Args
VLLM server setup
If you use a vLLM server, run the following command before training:
trl vllm-serve \
--model Qwen/Qwen2.5-1.5B-Instruct \
--dtype bfloat16 \
--max_model_len 4096 \
--trust_remote_code true \
--log_level warning \
&
Alternatively, vLLM can run in colocate mode; to enable it, change the vLLM settings in config/algorithm/grpo.yaml.
Training
accelerate launch \
--mixed_precision bf16 \
--num_processes 2 \
--dynamo_backend no \
--use_deepspeed \
--zero_stage 3 \
--gradient_accumulation_steps 4 \
--gradient_clipping 1 \
--zero3_init_flag true \
--zero3_save_16bit_model true \
main.py \
mode=train \
model=qwen1.5b \
task=blocksworld \
<ARG Overrides>
Evaluation
accelerate launch \
--mixed_precision bf16 \
--num_machines 1 \
--num_processes 1 \
--dynamo_backend no \
main.py \
mode=test \
model=qwen1.5b \
task=blocksworld \
<ARG Overrides>
Citation
If you use this code in your research, please cite:
title={Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning},
author={Parashar, Shubham and Gui, Shurui and Li, Xiner and Ling, Hongyi and Vemuri, Sushil and Olson, Blake and Li, Eric and Zhang, Yu and Caverlee, James and Kalathil, Dileep and Ji, Shuiwang},
journal={arXiv preprint arXiv:2506.06632},
year={2025}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with TRL (Transformer Reinforcement Learning)
- Uses vLLM for efficient inference
- Configuration management via Hydra
- Training optimization with DeepSpeed
For questions or issues, please open an issue on GitHub or contact the authors.