We present SpatialVLA, a spatially enhanced vision-language-action model trained on 1.1 million real robot episodes. The model is equipped with Ego3D Position Encoding and Adaptive Action Grids to explore spatial representations for a generalist robot policy, achieving superior 3D spatial scene understanding, zero-shot in-distribution generalization, and efficient adaptation to new robot setups. It achieves state-of-the-art performance across a diverse range of evaluations and offers significantly faster inference with fewer tokens per action.
The pipeline of SpatialVLA. Given an image observation \(\mathbf{o}_t\) and a natural language task instruction \(\mathbf{L}\), the model processes the observation with Ego3D Position Encoding and auto-regressively predicts spatial action tokens, which are de-tokenized into continuous actions \(\mathbf{A}_t\) for robot control. The model consists of three key components: (1) it first extracts 2D semantic features with the SigLIP vision encoder and injects 3D spatial context through Ego3D Position Encoding; (2) spatial action movements are discretized by unified adaptive action grids and represented as separate spatial action tokens for prediction; (3) the action grids and spatial action tokens can be adjusted and updated for effective adaptation to new robot setups.
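To make the tokenization step concrete, below is a minimal sketch of adaptive action grids in Python, assuming the grids are fit from per-dimension quantiles of the training action distribution; the helper names `fit_adaptive_grids`, `tokenize_action`, and `detokenize_action` are illustrative, not part of the released codebase, and the paper's exact grid construction may differ.

```python
import numpy as np

# Illustrative sketch of adaptive action grids (not the official implementation).
# Each action dimension (e.g., translation, rotation, gripper) gets bin edges fit to
# the training action distribution, so frequent small motions get finer resolution.

def fit_adaptive_grids(actions: np.ndarray, bins_per_dim: int) -> list[np.ndarray]:
    """actions: (N, D) continuous training actions -> D arrays of interior bin edges."""
    quantiles = np.linspace(0.0, 1.0, bins_per_dim + 1)[1:-1]
    return [np.quantile(actions[:, d], quantiles) for d in range(actions.shape[1])]

def tokenize_action(action: np.ndarray, edges: list[np.ndarray]) -> np.ndarray:
    """Map one continuous action (D,) to D discrete spatial action token ids."""
    return np.array([np.digitize(action[d], edges[d]) for d in range(len(edges))])

def detokenize_action(tokens: np.ndarray, edges: list[np.ndarray],
                      train_actions: np.ndarray) -> np.ndarray:
    """Map predicted token ids back to continuous values via per-bin means of the data."""
    action = np.empty(len(edges), dtype=np.float64)
    for d, e in enumerate(edges):
        bin_ids = np.digitize(train_actions[:, d], e)
        members = train_actions[bin_ids == tokens[d], d]
        action[d] = members.mean() if members.size else 0.0
    return action
```

In this sketch, the backbone would predict one spatial action token per dimension at inference time, and `detokenize_action` recovers the continuous command sent to the controller.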
We evaluate SpatialVLA across 7 robot learning scenarios, 16 real-robot tasks, and 48 simulation setups, focusing on three key aspects: zero-shot control, spatial understanding, and adaptability to new setups. We also conduct a thorough ablation study on a mixed Fractal and Bridge dataset to verify our design decisions.
Benefiting from the proposed Ego3D Position Encoding, SpatialVLA exhibits superior performance on spatial prompts and tasks with complex spatial layouts (a minimal sketch of the encoding follows the example prompts below).
"Place the plush toy closest to the white robot arm on the green car."
"Put the green cup on the pink cloth."
We evaluate SpatialVLA across 7 task suites to probe language grounding, semantic understanding, and motion-sensing capabilities under varying backgrounds, poses, and motion distractors. SpatialVLA achieves the highest average success rate, outperforming all generalist manipulation policies.
Task suites (average performance reported across all suites): close microwave, lift red chili pepper, put carrot in plate, put eggplant in basket, put cup on white plate, put cup on pink cloth (×2).
Multi-task finetuning, which involves training on a mixed dataset and testing on multiple tasks (a sketch of the corresponding action-grid adaptation follows the task list).
Finetuning tasks: push handle aside, put banana in basket, put pot on cutting board, close drawer.
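Component (3) of the pipeline notes that the action grids and spatial action tokens can be updated when moving to a new robot. Below is a minimal sketch of that adaptation step under the same quantile-grid assumption, reusing the hypothetical `fit_adaptive_grids` and `tokenize_action` helpers from the earlier sketch; the actual finetuning recipe may differ.

```python
import numpy as np

# Illustrative adaptation of the action grids to a new robot setup: re-fit the grids on the
# new robot's demonstrations so the same set of spatial action tokens covers its action
# ranges, then re-tokenize the finetuning data with the updated grids.
# Reuses the hypothetical fit_adaptive_grids / tokenize_action helpers sketched earlier.

def adapt_grids_to_new_robot(new_actions: np.ndarray, bins_per_dim: int):
    """new_actions: (N, D) continuous actions from the new robot's demonstrations."""
    new_edges = fit_adaptive_grids(new_actions, bins_per_dim)
    new_tokens = np.stack([tokenize_action(a, new_edges) for a in new_actions])
    # The token vocabulary size (bins_per_dim per dimension) is unchanged in this sketch,
    # so the policy's spatial action token embeddings can be kept and refined during finetuning.
    return new_edges, new_tokens
```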
Instruction following, which involves operating on different objects within the same scene.
Instruction-following tasks: put green cube on toy car, put blue cube on toy car, put orange on plate, put croissant on plate.
Compared to existing policies, SpatialVLA shows superior spatial understanding, achieving 73% accuracy on Franka task #1, which involves spatial prompts, and significantly improving manipulation under the complex positional changes of the out-of-distribution WidowX zero-shot tasks #2-4. We therefore suggest integrating 3D information, such as depth or point clouds, into the VLA framework to improve the model's adaptability and robustness to spatial layout variations.
"Place the plush toy closest to the white robot arm on the green car."
Diffusion Policy ❌
Octo ⚠️
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the green cup on the pink cloth."
RT-1-X ❌
RoboVLM ⚠️
Octo ⚠️
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the green cup on the pink cloth."
RT-1-X ❌
RoboVLM ❌
Octo ❌
OpenVLA ⚠️
SpatialVLA (ours) ✅
"Put the carrot in the plate."
RT-1-X ❌
RoboVLM ⚠️
Octo ❌
OpenVLA ⚠️
SpatialVLA (ours) ✅
@article{spatialvla2025,
title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models},
author={Anonymous Authors},
year={2025}
}