(CVPR 2026) Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu * Beichen Zhang * Yuhang Zang+ * Yuhang Cao * Long Xing
Xiaoyi Dong * Haodong Duan * Dahua Lin * Jiaqi Wang+

+Corresponding authors.

Paper | Homepage | Spatial-SSRL-7B Model | Spatial-SSRL-3B Model | Spatial-SSRL-Qwen3VL-4B Model | Spatial-SSRL-81k Dataset | Daily Paper

News

Overview

We are thrilled to introduce Spatial-SSRL, a novel self-supervised RL paradigm aimed at enhancing the spatial understanding of LVLMs. Optimizing Qwen2.5-VL-7B with Spatial-SSRL yields stronger spatial intelligence across seven spatial understanding benchmarks in both image and video settings.

Spatial-SSRL is a lightweight, tool-free framework that is naturally compatible with the RLVR training paradigm and easy to extend to a multitude of pretext tasks. Five tasks are currently formulated in the framework, requiring only ordinary RGB and RGB-D images. We welcome contributions of effective pretext tasks to further strengthen the capabilities of LVLMs!
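
To make the idea concrete, here is a minimal sketch of how a pretext task with an intrinsic, verifiable answer can pair with a rule-based reward. The rotation-prediction task and the helper names below are illustrative assumptions for exposition only; they are not one of the five tasks shipped with Spatial-SSRL.

import random
import re
from PIL import Image

def make_rotation_sample(img_path: str):
    """Create a pretext sample by rotating an ordinary RGB image by a known angle."""
    angle = random.choice([0, 90, 180, 270])  # intrinsic label, no human annotation needed
    image = Image.open(img_path).convert("RGB").rotate(angle, expand=True)
    question = ("The image has been rotated by 0, 90, 180, or 270 degrees. "
                "How many degrees was it rotated? Put the final answer in \\boxed{}.")
    return image, question, str(angle)

def verifiable_reward(response: str, label: str) -> float:
    """Rule-based RLVR reward: 1.0 if the boxed answer matches the intrinsic label."""
    match = re.search(r"\\boxed\{(.*?)\}", response)
    return 1.0 if match and match.group(1).strip() == label else 0.0

Because the label is fixed by how the sample is constructed, the reward needs no human annotation or external tool, which is what makes such signals compatible with the RLVR paradigm.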

Highlights

  • Highly Scalable: Spatial-SSRL uses ordinary raw RGB and RGB-D images instead of richly-annotated public datasets or manual labels for data curation, making it highly scalable.
  • Cost-effective: Spatial-SSRL avoids human labels and API calls to general LVLMs throughout the entire pipeline, making it cost-effective.
  • Lightweight: Prior approaches to spatial understanding rely heavily on annotations from external tools, which introduce errors into the training data and add cost. In contrast, Spatial-SSRL is completely tool-free and can easily be extended to more self-supervised tasks.
  • Naturally Verifiable: Intrinsic supervisory signals determined by pretext objectives are naturally verifiable, aligning Spatial-SSRL well with the RLVR paradigm.

Results

We train Qwen2.5-VL-3B and Qwen2.5-VL-7B with our Spatial-SSRL paradigm; the experimental results across seven spatial understanding benchmarks are shown below.

Quick Start

To directly experience Spatial-SSRL-7B, you can try it out on Spatial-SSRL Space!

Here we provide a code snippet to run a quick trial of Spatial-SSRL-7B on your own device. Download the model from Spatial-SSRL-7B Model before your trial!

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "internlm/Spatial-SSRL-7B"  # You can change it to your own local path if deployed already
img_path = "examples/eg1.jpg"
question = "Consider the real-world 3D locations of the objects. Which object has a higher location? A. yellow bear kite B. building"
# We recommend using the format prompt to make the inference consistent with training
format_prompt = "\n You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. The final answer MUST BE put in \\boxed{}."

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img_path,
            },
            {"type": "text", "text": question + format_prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text[0])

Here we provide a code snippet to run a quick trial of Spatial-SSRL-Qwen3VL-4B on your own device. Download the model from Spatial-SSRL-Qwen3VL-4B Model before your trial!

from transformers import AutoProcessor, AutoModelForImageTextToText  # transformers==4.57.1
from qwen_vl_utils import process_vision_info  # qwen_vl_utils==0.0.14
import torch

model_path = "internlm/Spatial-SSRL-Qwen3VL-4B"  # You can change it to your own local path if deployed already

# Change the path of the input image
img_path = "examples/eg_qwen3vl.jpg"

# Change your question here
question = "Question: Consider the real-world 3D locations and orientations of the objects. If I stand at the man's position facing where it is facing, is the menu on the left or right of me?\nOptions:\nA. on the left\nB. on the right\n"

question += "Please select the correct answer from the options above. \n"
# We recommend using the format prompt to make the inference consistent with training
format_prompt = "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. The final answer MUST BE put in \\boxed{}."

model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img_path,
            },
            {"type": "text", "text": question + format_prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text[0])

Evaluation

Prepare your environment:

git clone https://github.com/InternLM/Spatial-SSRL.git
conda create -n spatialssrl python==3.10
conda activate spatialssrl
cd Spatial-SSRL/evaluation
pip install -r requirements.txt

# Recommended
pip install flash-attn --no-build-isolation
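
Before launching the evaluation, a quick sanity check (assuming a CUDA-capable machine) confirms that PyTorch sees the GPU and that the optional flash-attn build imports cleanly:

import torch

# Expect a recent PyTorch version and True on a CUDA machine
print(torch.__version__, torch.cuda.is_available())

try:
    import flash_attn  # optional, used via attn_implementation="flash_attention_2"
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; attention falls back to the default implementation")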

Start your evaluation by following the tutorials in Eval.md.

Todo

  • Release the training code.

Cases

Citation

If you find this project useful, please cite:

@article{liu2025spatial,
  title={Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning},
  author={Liu, Yuhong and Zhang, Beichen and Zang, Yuhang and Cao, Yuhang and Xing, Long and Dong, Xiaoyi and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.27606},
  year={2025}
}

License

Usage and License Notices: The data and code are intended and licensed for research use only.

Acknowledgement

We extend our sincere gratitude to VLMEvalKit, a powerful toolkit for evaluating a wide range of LMMs!
