RePro: Rectifying LLM Thought From Lens of Optimization
Introduction
RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.
While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:
- Optimization Lens: Framing each reasoning step as a gradient update trajectory toward the optimal solution.
- Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
- Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment.
Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.
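To make the dual scoring idea concrete, the sketch below shows one way an intensity/stability score over a per-step surrogate objective could be folded into a process-level reward. This is purely illustrative: the names (dual_score, process_reward, step_losses) and the alpha weighting are our own hypothetical choices, not the actual objective defined in the paper or shipped in this repository.

```python
import numpy as np

def dual_score(step_losses):
    """Score a reasoning trace via its surrogate objective values.

    step_losses[i] is a (hypothetical) surrogate objective value after
    reasoning step i; a well-behaved chain should drive it down, like
    loss under gradient descent. Returns (intensity, stability).
    """
    losses = np.asarray(step_losses, dtype=float)
    if losses.size < 2:
        return 0.0, 0.0
    deltas = np.diff(losses)            # per-step "gradient updates"
    intensity = float(-deltas.mean())   # average progress per step
    stability = float(-deltas.std())    # penalize erratic oscillation
    return intensity, stability

def process_reward(step_losses, alpha=0.5):
    """Blend both scores into a scalar process-level reward that an
    RLVR pipeline could add to the verifiable outcome reward."""
    intensity, stability = dual_score(step_losses)
    return alpha * intensity + (1.0 - alpha) * stability

# A steady descent earns a higher reward than an erratic one:
print(process_reward([2.0, 1.5, 1.1, 1.0]))  # smooth descent
print(process_reward([2.0, 0.5, 1.8, 1.0]))  # oscillating descent
```

Under this toy scoring, both traces make the same net progress, but the oscillating one is penalized for instability, which is the behavior the process-level reward is meant to discourage.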
Dependencies
- Python: 3.10+
- CUDA: 11.8+
- Key Libraries: PyTorch, vLLM, VeRL
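As a quick sanity check that your environment meets these requirements, you can print the relevant versions (a generic snippet, not a script shipped with the repository):

```python
import sys
import torch

print("Python:", sys.version.split()[0])          # expect 3.10+
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)  # expect 11.8+
print("GPU visible:", torch.cuda.is_available())
```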
Installation
- Create a conda environment:

  conda create -n repro python=3.10
  conda activate repro

- Install the package in editable mode:

  pip install -e .
Quick Start
GRPO Training
We provide a demonstration script for launching both single-node and multi-node GRPO training.
Usage Syntax
Navigate to the project root and run the training script, making sure the paths inside scripts/run_multinodes_repro_grpo.sh are configured for your environment.

bash scripts/run_multinodes_repro_grpo.sh \
    <MODEL_PATH> \
    <NUM_NODES> \
    <GPUS_PER_NODE> \
    <TP_SIZE> \
    <VLLM_GPU_UTIL> \
    <RUN_NAME>
| Argument | Description |
|---|---|
| MODEL_PATH | HuggingFace model ID or local path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
| NUM_NODES | Total number of nodes (machines) used for training |
| GPUS_PER_NODE | Number of GPUs available per node |
| TP_SIZE | Tensor parallelism size |
| VLLM_GPU_UTIL | vLLM GPU memory utilization ratio (e.g., 0.7) |
| RUN_NAME | Unique identifier for experiment logging |
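To see how these arguments fit together, the snippet below computes the implied world size and checks one common constraint. This is our own illustration, not logic from the launch script, and it assumes each tensor-parallel replica stays within a single node:

```python
def check_layout(num_nodes: int, gpus_per_node: int, tp_size: int) -> int:
    """Return the total GPU count implied by the script arguments."""
    # Assumption: TP_SIZE must evenly divide GPUS_PER_NODE so that each
    # vLLM replica fits on one node; the actual script may differ.
    assert gpus_per_node % tp_size == 0, "TP_SIZE must divide GPUS_PER_NODE"
    return num_nodes * gpus_per_node

print(check_layout(1, 8, 1))  # single-node example below: 8 GPUs
print(check_layout(2, 8, 1))  # multi-node example below: 16 GPUs
```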
1. Single-Node Example (DeepSeek-R1-Distill-Qwen-1.5B)
To train on a single machine with 8 GPUs:
bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    1 \
    8 \
    1 \
    0.7 \
    repro-deepscale-r-exp
2. Multi-Node Training Example
For multi-node setups, you must configure the distributed environment variables (NODE_RANK and MASTER_ADDR) on each node before execution.
Step 1: Export Variables
# On the master node (Rank 0):
export NODE_RANK=0
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Replace with the actual master IP

# On worker nodes (Rank 1, 2, ...):
export NODE_RANK=1                  # Change based on node index
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Same master IP as above
Step 2: Launch Script (Run on ALL nodes)
bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    2 \
    8 \
    1 \
    0.7 \
    repro-multinode-exp
Citation
If you find this work or code useful in your research, please consider citing:
@article{repro2025,
  title={Rectifying LLM Thought From Lens of Optimization},
  author={Author One and Author Two and Author Three},
  journal={arXiv preprint arXiv:2507.06920},
  year={2025}
}