DeepPrune: Parallel Scaling without Inter-trace Redundancy
Paper: https://arxiv.org/abs/2510.08483
Efficient reasoning at scale by pruning redundant reasoning traces, without sacrificing accuracy.
Overview
Large language models (LLMs) often generate multiple reasoning traces in parallel to improve answer reliability. However, these traces frequently exhibit severe inter-trace redundancy, leading to wasted computation and inflated inference costs.
DeepPrune addresses this by learning to identify and prune semantically redundant traces before full execution, enabling cost-effective parallel reasoning while preserving performance.
More details can be found on our project website.
Results
Dependencies
cd DeepPrune
pip install -r requirements.txt
We use Llama-Factory for model fine-tuning and inference, and we provide the version we used in the Llama-Factory folder, modified to support Focal Loss. If you want to clone LLaMA-Factory yourself, please refer to the GitHub issue for the required changes.
We use Qwen/Qwen3-4B-Instruct-2507 as the backbone LLM for DeepPrune. You can download it from https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507, or use another open-source LLM instead.
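If helpful, the backbone can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; any download method works, and the resulting local path can then be passed to your fine-tuning or inference config.

```python
from huggingface_hub import snapshot_download

# Download the backbone into the local Hugging Face cache.
# Swap in another open-source LLM's repo id if you prefer a different backbone.
local_path = snapshot_download("Qwen/Qwen3-4B-Instruct-2507")
print(local_path)
```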
Dataset
The dataset provided here is incomplete due to its size. Please refer to https://huggingface.co/datasets/THU-KEG/DeepPrune for the full dataset.
To understand how to use the dataset, please refer to DeepPrune_data/README.md.
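To pull the full dataset from the Hub programmatically, a minimal sketch with the datasets library is shown below; the available splits and column names are defined by the released configuration, so consult the dataset card and DeepPrune_data/README.md for the actual schema.

```python
from datasets import load_dataset

# Fetch the full DeepPrune dataset from the Hugging Face Hub.
# Splits and columns are defined by the dataset card, not assumed here.
ds = load_dataset("THU-KEG/DeepPrune")
print(ds)
```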
Preliminaries
To understand the motivation behind DeepPrune, explore the preliminary analysis in:
Preliminaries/Preliminary experiment.ipynb
This notebook includes:
- Distribution of answer agreement: most trace pairs yield the same answer, revealing significant redundancy in parallel reasoning.
- ROC curves for redundancy detection:
  - Sentence-BERT (shallow similarity): AUROC = 0.58, limited discriminative power (a rough reproduction sketch is shown after this list).
  - Qwen3-4B-Instruct (zero-shot LLM comparison): AUROC = 0.66, a moderate improvement, but still suboptimal.
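To make the Sentence-BERT baseline concrete, here is a rough, self-contained sketch (not the notebook's exact code) that scores trace pairs by embedding cosine similarity and computes AUROC against same-answer labels. The model name and the (trace_a, trace_b, label) data layout are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import roc_auc_score

def sbert_auroc(pairs, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """AUROC of cosine similarity used as a redundancy score.

    `pairs` is a list of (trace_a, trace_b, label) tuples, where label = 1
    if both traces reach the same final answer. The layout is illustrative.
    """
    model = SentenceTransformer(model_name)
    scores, labels = [], []
    for trace_a, trace_b, label in pairs:
        emb = model.encode([trace_a, trace_b], convert_to_tensor=True)
        scores.append(util.cos_sim(emb[0], emb[1]).item())
        labels.append(label)
    return roc_auc_score(labels, scores)
```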
To reproduce the zero-shot Qwen3-4B-Instruct results:
- Prepare the evaluation dataset using DeepPrune/Offline/Ablation_Study.ipynb
- Run Preliminaries/zero_shot_exp.py (a simplified sketch of the zero-shot comparison is shown below)
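For intuition, the zero-shot comparison can be approximated as below. This is a hypothetical sketch, not the repo's script: the actual prompt, decoding settings, and scoring in Preliminaries/zero_shot_exp.py differ (for example, the script scores pairs to compute AUROC, while this returns a binary verdict).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

def zero_shot_same_answer(trace_a: str, trace_b: str) -> bool:
    """Ask the backbone LLM, zero-shot, whether two partial reasoning traces
    will reach the same final answer. Prompt wording is illustrative."""
    messages = [{
        "role": "user",
        "content": (
            "Below are two partial reasoning traces for the same problem.\n\n"
            f"Trace A:\n{trace_a}\n\nTrace B:\n{trace_b}\n\n"
            "Will they arrive at the same final answer? Answer 'yes' or 'no' only."
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=4, do_sample=False)
    reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return reply.strip().lower().startswith("yes")
```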
DeepPrune Pipeline
Prerequisites
- Install Llama-Factory.
- Patch required: modify the codebase to support Focal Loss (see the GitHub issue for guidance). A generic Focal Loss sketch is shown below.
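The actual patch lives in the bundled Llama-Factory copy. For reference only, a generic PyTorch sketch of focal loss for a binary same-answer/different-answer label looks like this; alpha and gamma are illustrative defaults rather than the repo's settings, and the repo may apply the loss at the token level instead.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for binary labels (1 = traces agree, 0 = they differ).

    Down-weights easy examples so training focuses on hard, ambiguous pairs.
    """
    probs = torch.sigmoid(logits)
    # p_t: probability the model assigns to the true class.
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```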
1 Prepare Finetuning Dataset
Generate the supervised training data for DeepPrune. This step constructs pairwise trace comparisons labeled by answer equivalence, i.e. whether two reasoning traces for the same question arrive at the same final answer. An illustrative example of one such pair is shown below.
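For illustration only, one such pairwise example might look like the following; the field names are assumptions, and the actual schema is documented in DeepPrune_data/README.md.

```python
# Hypothetical layout of a single pairwise training example.
example = {
    "question": "What is 17 * 24?",
    "trace_a": "17 * 20 = 340 and 17 * 4 = 68, so 340 + 68 = 408.",
    "trace_b": "17 * 24 = 17 * 25 - 17 = 425 - 17 = 408.",
    "label": 1,  # 1: both traces reach the same final answer, 0: they differ
}
```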
2 Offline Training
Train the DeepPrune model using supervised fine-tuning:
- Config: DeepPrune/Offline/Qwen3_full_sft.yaml
- Framework: Llama-Factory
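Assuming the bundled copy exposes the standard LLaMA-Factory CLI, training with that config is typically launched as follows; the exact invocation for the patched version may differ.

```bash
llamafactory-cli train DeepPrune/Offline/Qwen3_full_sft.yaml
```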
After training:
- Generate test data: DeepPrune/Offline/Ablation_Study.ipynb
- Evaluate performance: python DeepPrune/Offline/test_model_performance_parallel.py
- Visualize results: DeepPrune/Offline/check_model_output.ipynb
Expect significant gains over shallow similarity baselines (AUROC > 0.83 in our experiments).
3 Online Pruning
Deploy DeepPrune for real-time trace pruning during inference:
- Establish baselines: run DeepPrune/Online/check_pass_k.ipynb to compute:
  - pass@1: accuracy with a single trace
  - cons@512: consensus accuracy with 512 traces
- Apply DeepPrune: run python DeepPrune/Online/greedy_cluster_threshold.py, which performs greedy clustering of traces using DeepPrune's similarity scores and prunes redundant ones (a simplified sketch of this procedure is shown below).
- Trade-off control: adjust the similarity threshold to balance cost reduction (fewer traces executed) against performance retention (maintained consensus accuracy).
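For intuition, the sketch below shows a hypothetical version of threshold-based greedy clustering followed by a consensus vote. `judge_same_answer` stands in for DeepPrune's learned pairwise judge (not the repo's actual API), the default threshold is illustrative, and weighting the vote by cluster size is one plausible policy rather than necessarily the repo's.

```python
from collections import Counter

def greedy_cluster(traces, judge_same_answer, threshold=0.5):
    """Greedily assign each trace to the first cluster whose representative
    the judge scores as 'same answer' with probability >= threshold.

    judge_same_answer(a, b) is a hypothetical stand-in for DeepPrune's
    learned judge; it should return P(a and b reach the same final answer).
    """
    clusters = []  # each cluster is a list of traces; clusters[i][0] is its representative
    for trace in traces:
        for cluster in clusters:
            if judge_same_answer(cluster[0], trace) >= threshold:
                cluster.append(trace)   # judged redundant: prune instead of finishing it
                break
        else:
            clusters.append([trace])    # novel reasoning path: keep and execute fully
    return clusters

def consensus_answer(representative_answers, cluster_sizes):
    """Majority vote over surviving traces, weighting each representative
    by the number of traces its cluster absorbed."""
    votes = Counter()
    for answer, size in zip(representative_answers, cluster_sizes):
        votes[answer] += size
    return votes.most_common(1)[0][0]
```

In this sketch, lowering the threshold merges more aggressively (more pruning, lower cost), while raising it keeps more traces (higher retained consensus accuracy).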
Acknowledgement
This code repository is developed based on Llama-Factory, vllm, DeepScaleR and DeepConf.
Thanks for their great work!
Citation
If you use DeepPrune in your research, please cite our work:
@article{tu2025deepprune,
title={DeepPrune: Parallel Scaling without Inter-trace Redundancy},
  author={Shangqing Tu and Yaxuan Li and Yushi Bai and Lei Hou and Juanzi Li},
journal={arXiv preprint arXiv:2510.08483},
year={2025}
}