Current vision-language models may grasp basic spatial cues and simple directions (e.g., left, right, front, back), but they struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications.
To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding.
Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques, driving the development of vision-language models that align more closely with human spatial cognition.
The SPHERE benchmark is available at this repository.
The leaderboard is sorted by overall performance by default. Click a column header to sort by that column.
Single-Skill tasks include Position, Counting, Distance, and Size.
Multi-Skill tasks include Position + Counting, Distance + Counting, and Distance + Size.
Reasoning tasks include Object Occlusion and Object Manipulation.
| Model | Params | Pos. | Count. | Dist. | Size | P+C | D+C | D+S | Occl. | Manip. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SpatialBot-RGB | 3B | 55.9 | 67.7 | 52.8 | 73.0 | 38.5 | 39.2 | 38.7 | 57.6 | 60.3 | 53.6 |
| SpatialBot | 3B | 55.5 | 72.1 | 51.9 | 75.4 | 39.1 | 36.1 | 38.9 | 56.5 | 54.6 | 53.1 |
| Phi-3.5-Vision | 4B | 61.8 | 58.7 | 56.7 | 76.0 | 44.4 | 36.1 | 47.2 | 54.4 | 55.5 | 54.3 |
| LLaVA-NeXT | 7B | 53.6 | 68.7 | 54.5 | 70.9 | 42.0 | 35.4 | 36.1 | 48.7 | 49.1 | 50.6 |
| LLaVA-OneVision | 7B | 60.2 | 76.1 | 64.2 | 84.0 | 53.3 | 53.8 | 52.9 | 53.0 | 59.6 | 61.5 |
| Qwen2-VL | 7B | 63.2 | 74.1 | 53.0 | 78.6 | 36.1 | 33.5 | 50.8 | 54.0 | 55.2 | 55.0 |
| Qwen2.5-VL | 7B | 59.3 | 82.1 | 58.9 | 73.6 | 44.7 | 31.6 | 37.9 | 48.7 | 54.1 | 54.1 |
| Janus-Pro | 7B | 57.1 | 70.6 | 59.3 | 76.5 | 37.3 | 33.5 | 46.8 | 53.5 | 54.3 | 54.0 |
| SpaceMantis | 8B | 52.7 | 52.7 | 57.0 | 63.1 | 28.4 | 43.7 | 45.6 | 56.6 | 54.2 | 50.4 |
| SpatialRGPT-RGB | 8B | 59.3 | 70.1 | 59.2 | 74.6 | 42.6 | 46.2 | 40.9 | 54.5 | 53.2 | 55.3 |
| InstructBLIP | 8B | 45.6 | 64.7 | 50.4 | 56.2 | 32.0 | 28.5 | 38.1 | 54.8 | 52.7 | 47.0 |
| Idefics2 | 8B | 50.1 | 45.3 | 49.1 | 58.5 | 22.5 | 27.8 | 40.1 | 51.0 | 53.3 | 44.2 |
| InternVL2.5 | 8B | 62.2 | 72.6 | 64.6 | 78.3 | 50.2 | 43.8 | 41.8 | 52.7 | 53.6 | 57.3 |
| Qwen-VL | 10B | 55.9 | 72.9 | 59.2 | 72.0 | 37.2 | 29.9 | 34.7 | 54.3 | 54.3 | 52.0 |
| Llama-3.2-Vision | 11B | 58.4 | 53.5 | 52.9 | 67.2 | 29.8 | 27.7 | 40.8 | 60.7 | 56.1 | 49.7 |
| Qwen2-VL | 72B | 62.8 | 80.6 | 60.8 | 85.5 | 37.3 | 38.0 | 69.3 | 53.9 | 55.0 | 59.8 |
| Qwen2.5-VL | 72B | 63.7 | 81.1 | 69.9 | 89.3 | 51.5 | 55.7 | 66.4 | 50.3 | 56.4 | 64.3 |
| Gemini 2.0 Flash | - | 68.6 | 82.1 | 75.2 | 86.9 | 50.3 | 52.5 | 49.7 | 48.0 | 49.2 | 61.7 |
| Gemini 2.5 Flash | - | 74.2 | 85.6 | 83.2 | 86.9 | 68.6 | 74.1 | 68.8 | 58.5 | 62.8 | 73.0 |
| GPT-4o | - | 71.7 | 76.6 | 71.3 | 89.4 | 54.4 | 62.7 | 58.8 | 66.0 | 63.3 | 67.9 |
| o4-mini | - | 70.0 | 78.6 | 83.2 | 89.9 | 65.7 | 70.9 | 79.4 | 62.0 | 58.3 | 72.5 |
@article{zhang2025sphere,
  title={SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation},
  author={Zhang, Wenyu and Ng, Wei En and Ma, Lixin and Wang, Yuwen and Zhao, Jungqi and Koenecke, Allison and Li, Boyang and Wang, Lu},
  journal={arXiv},
  year={2025}
}