RePro: Rectifying LLM Thought From Lens of Optimization
Introduction
RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.
While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:
- Optimization Lens: Framing each reasoning step as a gradient update trajectory toward the optimal solution.
- Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
- Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment.
Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.
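To make the dual scoring idea concrete, the sketch below shows one way an intensity/stability score over a per-step surrogate objective could be folded into a process-level reward. This is purely illustrative: the names (dual_score, process_reward, step_losses) and the alpha weighting are our own hypothetical choices, not the actual objective defined in the paper or shipped in this repository.

```python
import numpy as np

def dual_score(step_losses):
    """Score a reasoning trace via its surrogate objective values.

    step_losses[i] is a (hypothetical) surrogate objective value after
    reasoning step i; a well-behaved chain should drive it down, like
    loss under gradient descent. Returns (intensity, stability).
    """
    losses = np.asarray(step_losses, dtype=float)
    if losses.size < 2:
        return 0.0, 0.0
    deltas = np.diff(losses)            # per-step "gradient updates"
    intensity = float(-deltas.mean())   # average progress per step
    stability = float(-deltas.std())    # penalize erratic oscillation
    return intensity, stability

def process_reward(step_losses, alpha=0.5):
    """Blend both scores into a scalar process-level reward that an
    RLVR pipeline could add to the verifiable outcome reward."""
    intensity, stability = dual_score(step_losses)
    return alpha * intensity + (1.0 - alpha) * stability

# A steady descent earns a higher reward than an erratic one:
print(process_reward([2.0, 1.5, 1.1, 1.0]))  # smooth descent
print(process_reward([2.0, 0.5, 1.8, 1.0]))  # oscillating descent
```

Under this toy scoring, both traces make the same net progress, but the oscillating one is penalized for instability, which is the behavior the process-level reward is meant to discourage.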
Dependencies
- Python: 3.10+
- CUDA: 11.8+
- Key Libraries: PyTorch, vLLM, VeRL
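As a quick sanity check that your environment meets these requirements, you can print the relevant versions (a generic snippet, not a script shipped with the repository):

```python
import sys
import torch

print("Python:", sys.version.split()[0])          # expect 3.10+
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)  # expect 11.8+
print("GPU visible:", torch.cuda.is_available())
```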
Installation
- Create a conda environment:

  conda create -n repro python=3.10
  conda activate repro

- Install the package in editable mode:

  pip install -e .
Quick Start
GRPO Training
We provide a demonstration script for launching both single-node and multi-node GRPO training.
Usage Syntax
Navigate to the project root and run the training script, making sure the paths inside scripts/run_multinodes_repro_grpo.sh are configured for your environment.

bash scripts/run_multinodes_repro_grpo.sh \
    <MODEL_PATH> \
    <NUM_NODES> \
    <GPUS_PER_NODE> \
    <TP_SIZE> \
    <VLLM_GPU_UTIL> \
    <RUN_NAME>
| Argument | Description |
|---|---|
| MODEL_PATH | HuggingFace model ID or local path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
| NUM_NODES | Total number of nodes (machines) used for training |
| GPUS_PER_NODE | Number of GPUs available per node |
| TP_SIZE | Tensor parallelism size |
| VLLM_GPU_UTIL | vLLM GPU memory utilization ratio (e.g., 0.7) |
| RUN_NAME | Unique identifier for experiment logging |
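To see how these arguments fit together, the snippet below computes the implied world size and checks one common constraint. This is our own illustration, not logic from the launch script, and it assumes each tensor-parallel replica stays within a single node:

```python
def check_layout(num_nodes: int, gpus_per_node: int, tp_size: int) -> int:
    """Return the total GPU count implied by the script arguments."""
    # Assumption: TP_SIZE must evenly divide GPUS_PER_NODE so that each
    # vLLM replica fits on one node; the actual script may differ.
    assert gpus_per_node % tp_size == 0, "TP_SIZE must divide GPUS_PER_NODE"
    return num_nodes * gpus_per_node

print(check_layout(1, 8, 1))  # single-node example below: 8 GPUs
print(check_layout(2, 8, 1))  # multi-node example below: 16 GPUs
```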
1. Single-Node Example (DeepSeek-R1-Distill-Qwen-1.5B)
To train on a single machine with 8 GPUs:
bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    1 \
    8 \
    1 \
    0.7 \
    repro-deepscale-r-exp
2. Multi-Node Training Example
For multi-node setups, you must configure the distributed environment variables (NODE_RANK and MASTER_ADDR) on each node before execution.
Step 1: Export Variables
# On the master node (Rank 0):
export NODE_RANK=0
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Replace with the actual master IP

# On worker nodes (Rank 1, 2, ...):
export NODE_RANK=1                  # Change based on node index
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Same master IP as above
Step 2: Launch Script (Run on ALL nodes)
bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    2 \
    8 \
    1 \
    0.7 \
    repro-multinode-exp
Citation
If you find this work or code useful in your research, please consider citing:
@article{repro2025,
  title={Rectifying LLM Thought From Lens of Optimization},
  author={Author One and Author Two and Author Three},
  journal={arXiv preprint arXiv:2507.06920},
  year={2025}
}