We present SpatialVLA, a spatially enhanced vision-language-action model trained on 1.1 million real robot episodes. The model is equipped with Ego3D Position Encoding and Adaptive Action Grids to explore spatial representations for a generalist robot policy, achieving superior 3D spatial scene understanding, zero-shot in-distribution generalization, and efficient adaptation to new robot setups. It achieves state-of-the-art performance across a diverse range of evaluations and offers significantly faster inference with fewer tokens per action.
The pipeline of SpatialVLA. Given an image observation \(\mathbf{o}_t\) and a natural language task instruction \(\mathbf{L}\), the model processes the observation with Ego3D Position Encoding and auto-regressively predicts spatial action tokens, which are de-tokenized into continuous actions \(\mathbf{A}_t\) for robot control. The model consists of three key components: (1) it first extracts 2D semantic features with the SigLIP vision encoder and injects 3D spatial context through Ego3D Position Encoding; (2) spatial action movements are discretized by unified adaptive action grids and represented as separate spatial action tokens for prediction; (3) the action grids and spatial action tokens can be adjusted and updated for effective adaptation to new robot setups.
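To make the tokenization step concrete, below is a minimal sketch of adaptive action grids in Python, assuming the grids are fit from per-dimension quantiles of the training action distribution; the helper names `fit_adaptive_grids`, `tokenize_action`, and `detokenize_action` are illustrative, not part of the released codebase, and the paper's exact grid construction may differ.

```python
import numpy as np

# Illustrative sketch of adaptive action grids (not the official implementation).
# Each action dimension (e.g., translation, rotation, gripper) gets bin edges fit to
# the training action distribution, so frequent small motions get finer resolution.

def fit_adaptive_grids(actions: np.ndarray, bins_per_dim: int) -> list[np.ndarray]:
    """actions: (N, D) continuous training actions -> D arrays of interior bin edges."""
    quantiles = np.linspace(0.0, 1.0, bins_per_dim + 1)[1:-1]
    return [np.quantile(actions[:, d], quantiles) for d in range(actions.shape[1])]

def tokenize_action(action: np.ndarray, edges: list[np.ndarray]) -> np.ndarray:
    """Map one continuous action (D,) to D discrete spatial action token ids."""
    return np.array([np.digitize(action[d], edges[d]) for d in range(len(edges))])

def detokenize_action(tokens: np.ndarray, edges: list[np.ndarray],
                      train_actions: np.ndarray) -> np.ndarray:
    """Map predicted token ids back to continuous values via per-bin means of the data."""
    action = np.empty(len(edges), dtype=np.float64)
    for d, e in enumerate(edges):
        bin_ids = np.digitize(train_actions[:, d], e)
        members = train_actions[bin_ids == tokens[d], d]
        action[d] = members.mean() if members.size else 0.0
    return action
```

In this sketch, the backbone would predict one spatial action token per dimension at inference time, and `detokenize_action` recovers the continuous command sent to the controller.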
We evaluate SpatialVLA across 7 robot learning scenarios, 16 real-robot tasks, and 48 simulation setups, focusing on three key aspects: zero-shot control, spatial understanding, and adaptability to new setups. We also conduct a thorough ablation study on a mixed Fractal and Bridge dataset to verify our design decisions.
Benefiting from the proposed Ego3D Position Encoding, SpatialVLA exhibits superior performance on spatial prompts and tasks with complex spatial layouts (a minimal sketch of the encoding follows the example prompts below).
"Place the plush toy closest to the white robot arm on the green car."
"Put the green cup on the pink cloth."
We evaluate SpatialVLA across 7 task suites to probe language grounding, semantic understanding, and motion-sensing capabilities under varying backgrounds, poses, and motion distractors. SpatialVLA achieves the highest average success rate, outperforming all generalist manipulation policies.
Task suites (average performance reported across all suites): close microwave, lift red chili pepper, put carrot in plate, put eggplant in basket, put cup on white plate, put cup on pink cloth (×2).
Multi-task finetuning, which involves training on a mixed dataset and testing on multiple tasks (a sketch of the corresponding action-grid adaptation follows the task list).
Finetuning tasks: push handle aside, put banana in basket, put pot on cutting board, close drawer.
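Component (3) of the pipeline notes that the action grids and spatial action tokens can be updated when moving to a new robot. Below is a minimal sketch of that adaptation step under the same quantile-grid assumption, reusing the hypothetical `fit_adaptive_grids` and `tokenize_action` helpers from the earlier sketch; the actual finetuning recipe may differ.

```python
import numpy as np

# Illustrative adaptation of the action grids to a new robot setup: re-fit the grids on the
# new robot's demonstrations so the same set of spatial action tokens covers its action
# ranges, then re-tokenize the finetuning data with the updated grids.
# Reuses the hypothetical fit_adaptive_grids / tokenize_action helpers sketched earlier.

def adapt_grids_to_new_robot(new_actions: np.ndarray, bins_per_dim: int):
    """new_actions: (N, D) continuous actions from the new robot's demonstrations."""
    new_edges = fit_adaptive_grids(new_actions, bins_per_dim)
    new_tokens = np.stack([tokenize_action(a, new_edges) for a in new_actions])
    # The token vocabulary size (bins_per_dim per dimension) is unchanged in this sketch,
    # so the policy's spatial action token embeddings can be kept and refined during finetuning.
    return new_edges, new_tokens
```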
Instruction following, which involves operating on different objects within the same scene.
Instruction-following tasks: put green cube on toy car, put blue cube on toy car, put orange on plate, put croissant on plate.
Compared to existing policies, SpatialVLA shows superior spatial understanding, achieving 73% accuracy on Franka task #1, which involves spatial prompts, and significantly improving manipulation under the complex positional changes of the out-of-distribution WidowX zero-shot tasks #2-4. We therefore suggest integrating 3D information, such as depth or point clouds, into the VLA framework to improve the model's adaptability and robustness to spatial layout variations.
"Place the plush toy closest to the white robot arm on the green car."
Diffusion Policy ❌
Octo ⚠️
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the green cup on the pink cloth."
RT-1-X ❌
RoboVLM ⚠️
Octo ⚠️
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the green cup on the pink cloth."
RT-1-X ❌
RoboVLM ❌
Octo ❌
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the carrot in the plate."
RT-1-X ❌
RoboVLM ⚠️
Octo ❌
OpenVLA ⚠️
SpatialVLA (ours) ✅
@article{spatialvla2025,
title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models},
author={Anonymous Authors},
year={2025}
}