SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation

Wenyu Zhang1, Wei En Ng2, Lixin Ma3*, Yuwen Wang2*, Junqi Zhao4*,
Allison Koenecke5, Boyang Li4, Lu Wang1
1Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR),
2National University of Singapore (NUS), 3Tongji University,
4Nanyang Technological University (NTU), 5Cornell University
*Contributed equally to this work; authors are listed in alphabetical order.

Abstract

Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications.

To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding.

Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques, driving the development of vision-language models that align more closely with human spatial cognition.

The SPHERE benchmark is available at this repository.
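For orientation, below is a minimal evaluation sketch. The file layout and field names (`annotations.json`, `image`, `question`, `answer`) are assumptions for illustration only; the released benchmark may organize tasks and annotations differently.

```python
# Hypothetical evaluation loop over one SPHERE task directory.
# Field names and file layout are assumed, not taken from the official release.
import json
from pathlib import Path
from typing import Callable, Iterator, Tuple

def load_task(task_dir: Path) -> Iterator[Tuple[Path, str, str]]:
    """Yield (image_path, question, answer) triples from a task's annotation file."""
    with open(task_dir / "annotations.json") as f:   # assumed filename
        for item in json.load(f):
            yield task_dir / item["image"], item["question"], item["answer"]

def evaluate(answer_fn: Callable[[Path, str], str], task_dir: Path) -> float:
    """Accuracy of a model, given a callable mapping (image path, question) -> answer string."""
    correct = total = 0
    for image_path, question, answer in load_task(task_dir):
        prediction = answer_fn(image_path, question)
        correct += int(prediction.strip().lower() == answer.strip().lower())
        total += 1
    return correct / max(total, 1)
```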


Leaderboard

The leaderboard is sorted by overall performance.

Single-Skill tasks include Position, Counting, Distance, and Size.

Multi-Skill tasks include Position + Counting, Distance + Counting, and Distance + Size.

Reasoning tasks include Object occlusion and Object manipulation.
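The sketch below shows one way to roll per-task accuracies up into the category and overall scores reported in the table. It assumes unweighted averaging over tasks, which may not match the exact weighting used for the leaderboard's Overall column.

```python
# Hedged aggregation sketch: unweighted means per category and overall.
# The actual leaderboard weighting may differ.
CATEGORIES = {
    "Single-Skill": ["Position", "Counting", "Distance", "Size"],
    "Multi-Skill":  ["Position + Counting", "Distance + Counting", "Distance + Size"],
    "Reasoning":    ["Object occlusion", "Object manipulation"],
}

def aggregate(task_scores: dict) -> dict:
    """Average per-task accuracies into category scores and an overall score."""
    summary = {}
    for category, tasks in CATEGORIES.items():
        summary[category] = sum(task_scores[t] for t in tasks) / len(tasks)
    summary["Overall"] = sum(task_scores.values()) / len(task_scores)
    return summary
```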

Model Params Pos. Count. Dist. Size P+C D+C D+S Occl. Manip. Overall
SpatialBot-RGB 3B 55.9 67.7 52.8 73.0 38.5 39.2 38.7 57.6 60.3 53.6
SpatialBot 3B 55.5 72.1 51.9 75.4 39.1 36.1 38.9 56.5 54.6 53.1
Phi-3.5-Vision 4B 61.8 58.7 56.7 76.0 44.4 36.1 47.2 54.4 55.5 54.3
LLaVA-NeXT 7B 53.6 68.7 54.5 70.9 42.0 35.4 36.1 48.7 49.1 50.6
LLaVA-OneVision 7B 60.2 76.1 64.2 84.0 53.3 53.8 52.9 53.0 59.6 61.5
Qwen2-VL 7B 63.2 74.1 53.0 78.6 36.1 33.5 50.8 54.0 55.2 55.0
Qwen2.5-VL 7B 59.3 82.1 58.9 73.6 44.7 31.6 37.9 48.7 54.1 54.1
Janus-Pro 7B 57.1 70.6 59.3 76.5 37.3 33.5 46.8 53.5 54.3 54.0
SpaceMantis 8B 52.7 52.7 57.0 63.1 28.4 43.7 45.6 56.6 54.2 50.4
SpatialRGPT-RGB 8B 59.3 70.1 59.2 74.6 42.6 46.2 40.9 54.5 53.2 55.3
InstructBLIP 8B 45.6 64.7 50.4 56.2 32.0 28.5 38.1 54.8 52.7 47.0
Idefics2 8B 50.1 45.3 49.1 58.5 22.5 27.8 40.1 51.0 53.3 44.2
InternVL2.5 8B 62.2 72.6 64.6 78.3 50.2 43.8 41.8 52.7 53.6 57.3
Qwen-VL 10B 55.9 72.9 59.2 72.0 37.2 29.9 34.7 54.3 54.3 52.0
Llama-3.2-Vision 11B 58.4 53.5 52.9 67.2 29.8 27.7 40.8 60.7 56.1 49.7
Qwen2-VL 72B 62.8 80.6 60.8 85.5 37.3 38.0 69.3 53.9 55.0 59.8
Qwen2.5-VL 72B 63.7 81.1 69.9 89.3 51.5 55.7 66.4 50.3 56.4 64.3
Gemini 2.0 Flash - 68.6 82.1 75.2 86.9 50.3 52.5 49.7 48.0 49.2 61.7
Gemini 2.5 Flash - 74.2 85.6 83.2 86.9 68.6 74.1 68.8 58.5 62.8 73.0
GPT-4o - 71.7 76.6 71.3 89.4 54.4 62.7 58.8 66.0 63.3 67.9
o4-mini - 70.0 78.6 83.2 89.9 65.7 70.9 79.4 62.0 58.3 72.5

BibTeX

@article{zhang2025sphere,
  title={SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation},
  author={Zhang, Wenyu and Ng, Wei En and Ma, Lixin and Wang, Yuwen and Zhao, Junqi and Koenecke, Allison and Li, Boyang and Wang, Lu},
  journal={arXiv},
  year={2025}
}