Preprint

Mind the Gap: Diagnosing Spatial Reasoning Failures in Multimodal Large Language Models

A psychometrically grounded diagnostic suite that separates reading a scene from mentally simulating it — and reveals where today's MLLMs break.

Ilias M. Stogiannidis1, Steven McDonagh1, Sotirios A. Tsaftaris1
1The University of Edinburgh, Edinburgh, UK
{i.stogiannidis, s.mcdonagh, s.tsaftaris}@ed.ac.uk
32 MLLMs evaluated
7 task splits
2,855 diagnostic items
10 model families

The headline

Strong at seeing, near-chance at simulating

Across seven tasks, the same 32 models that describe scenes well collapse toward chance once a task demands mental transformation, with 3D mental rotation bottoming out at chance. The gap is not gradual; it is a cliff.

Performance of 32 MLLMs across seven spatial reasoning tasks. Each marker is one model, grouped by source type. Tasks are ordered by mean difficulty, easiest (left) to hardest (right). Navigation exhibits the widest spread (77.5 pp), serving as a visual-grounding probe, while MRT Hard clusters all 32 models near chance (~25%).

Abstract

Mind the Gap

Multimodal Large Language Models (MLLMs) perform well on many vision–language tasks, but this can hide fundamental spatial reasoning weaknesses. Existing benchmarks blur the line between static scene description and the dynamic mental simulation needed for rotation, folding, and perspective-taking. To target core spatial skills, we introduce a psychometrically grounded diagnostic suite of seven task splits adapted from cognitive assessments, spanning 2,855 controlled samples over synthetic and real images.

Evaluating 32 state-of-the-art MLLMs reveals a sharp dissociation: models succeed on information extraction yet fall to near-chance on mental simulation. Longer reasoning chains reduce accuracy, models consistently confuse mirror reflections with rotations, and left–right perspective inversions remain unsolved across all 32 models. Crucially, scaling model parameters does not improve mental simulation: 3D mental rotation stays near chance for all architectures and sizes. Our analysis offers interpretable failure diagnostics and concrete directions for architectural and training improvements.

The benchmark

Seven task splits, from cognitive science

Each split is adapted from a gold-standard psychometric assessment, probing object-centric, egocentric, and allocentric reference frames across synthetic and real images.

MRT Hard (500 items)

White polycubes on a blank background, four choices. Identify the rotated original among mirrored distractors.

Adapts the Mental Rotation Test (Shepard & Metzler)

MRT Easy (500 items)

Coloured polycubes on a 3D Cartesian grid with three choices (one of the two mirror distractors removed).

Adapts the Mental Rotation Test

Paper Folding (400 items)

A sheet is folded once or twice and hole-punched; pick the unfolded hole pattern from three options.

Adapts the Paper Folding Test (Ekstrom et al.)

Navigation (400 items)

Trace paths through visual mazes of coloured blocks (route from S to E), count direction changes, and relate key locations.

Maze-Nav from SpatialEval

Orientation (400 items)

Binary questions on object orientation in a camera-centric, eight-class egocentric taxonomy (Left, Front-Left, …).

From EgoOrientBench

Spatial Relations (370 items)

36 relations in natural images, from "right of" and "above" to "attached to", "touch", and "overlapping".

Subset of Spatial-Obj

Perspective Taking (285 items)

Adopt another agent's viewpoint: the allocentric reference-frame transformation humans find easy.

From What's UP

2,855 items

Controlled, balanced, multiple-choice — designed to isolate mental simulation from scene perception.

Results

Leaderboard across all seven splits

Overall accuracy and per-split breakdown for all 32 models, with closed-source/API and open-source models marked separately.

[Leaderboard table. Columns: Model, Overall, Fold., MRT-E, MRT-H, Nav., Ori., Pers., Rel.]

MRT-H chance level ≈ 25% (4 options). Every one of the 32 models lands between 20.0% and 29.4% on MRT-H — a capability ceiling, not a difficulty gradient.
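To put the 20.0% to 29.4% range in context, the sketch below computes a normal-approximation 95% band around chance for a 500-item, four-option split. It assumes every item is an independent guess; the function name `chance_band` is illustrative and not part of the released suite.

```python
import math

def chance_band(n_items: int, n_options: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation band around chance accuracy.

    Assumes each of the n_items answers is an independent uniform guess
    over n_options choices (an assumption, not a claim about the models).
    """
    p = 1.0 / n_options
    half_width = z * math.sqrt(p * (1.0 - p) / n_items)
    return p - half_width, p + half_width

low, high = chance_band(n_items=500, n_options=4)
print(f"{low:.3f} to {high:.3f}")  # prints "0.212 to 0.288"
```

Under these assumptions a single guessing model would score between roughly 21.2% and 28.8% about 95% of the time, so with 32 models a few scores just outside that band are expected.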

What we found

Seven diagnostics of spatial failure

We go beyond aggregate accuracy: distractor selection, cross-model agreement, reasoning-length, and controlled thinking-vs-instruct comparisons expose how models fail.

1 Perception–Simulation Dissociation

Perceptual tasks are moderate — Relations μ=70.0%, Orientation μ=68.6% — but accuracy collapses on simulation: MRT Hard sits at μ=25.0%, exactly chance. The gap is a cliff, not a slope.

2 Navigation probes visual grounding

A single property — whether the model grounds route planning in the maze image — drives the split. Best 95.8%, worst 18.2%: a 77.5 pp spread and the strongest correlation with overall rank (r=0.91).

3 The Reasoning-Length Paradox

Wrong answers carry far longer chains of thought: 849 vs 188 words on average, and roughly 3× longer on Navigation and Folding. When uncertain, models thrash rather than converge.

4 Chirality confusion in rotation

Mirror-reflected distractors capture 50–63% of MRT-Easy errors and 66–72% of MRT-Hard errors. Models perceive 3D structure but project it into a reflection-symmetric representation.
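A distractor-capture rate of this kind can be computed from per-item predictions. The sketch below is one way to do it, assuming a hypothetical record schema in which every option is tagged as `target`, `mirror`, or `other`; neither the schema nor the `other` category is taken from the released data.

```python
from collections import Counter

def error_shares(records):
    """Share of wrong answers that land on each distractor type.

    Each record is a dict with a hypothetical schema:
      pred, gold   -> option letters
      option_types -> mapping from letter to 'target' | 'mirror' | 'other'
    """
    counts = Counter(
        r["option_types"][r["pred"]] for r in records if r["pred"] != r["gold"]
    )
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}

# Toy records, invented purely to exercise the function.
types = {"A": "mirror", "B": "mirror", "C": "target", "D": "other"}
toy = [
    {"pred": "C", "gold": "C", "option_types": types},  # correct, ignored
    {"pred": "A", "gold": "C", "option_types": types},
    {"pred": "B", "gold": "C", "option_types": types},
    {"pred": "D", "gold": "C", "option_types": types},
]
shares = error_shares(toy)
```

Only wrong answers enter the denominator, so the shares describe where errors go rather than how often models err.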

5 Systematic left–right inversion

Asked for another viewpoint, models answer "right" when the truth is "left" 48.1% of the time (and vice-versa 47.6%). They default to the camera frame and never flip the lateral axis.

6 Scaling does not help

Mental rotation is essentially flat with size (MRT-Easy r=0.12, MRT-Hard r=0.38): a 7B model matches a 241B model. The closed-source edge (51.2% vs 43.1%) comes almost entirely from Navigation; on mental rotation there is no significant difference.

7 Folding is a shallow heuristic

Paper-folding accuracy is flat across fold count × hole count (42.3–44.9%) and fold direction. Models apply one shallow rule regardless of complexity instead of simulating each fold in sequence.
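For contrast, simulating each fold in sequence is mechanically simple. The sketch below is a minimal grid model, assuming every fold halves the visible sheet along its centre line and that the folded dimension is even at each step; it is not the generator used for the benchmark items, and all names are illustrative.

```python
def fold_and_unfold(width, height, folds, punches):
    """Fold a grid sheet, punch through the stack, then unfold in reverse.

    folds:   sequence of 'L', 'R', 'T', 'B', naming the half folded over the other.
    punches: cells (x, y) punched in the folded sheet, in full-sheet coordinates.
    Returns the set of holes in the fully unfolded sheet.
    """
    x0, x1, y0, y1 = 0, width, 0, height  # visible region, half-open ranges
    creases = []
    for side in folds:
        if side in ("L", "R"):
            creases.append(("x", x0, x1))
            mid = (x0 + x1) // 2
            x0, x1 = (mid, x1) if side == "L" else (x0, mid)
        elif side in ("T", "B"):
            creases.append(("y", y0, y1))
            mid = (y0 + y1) // 2
            y0, y1 = (mid, y1) if side == "T" else (y0, mid)
        else:
            raise ValueError(f"unknown fold: {side!r}")

    holes = set(punches)
    for x, y in holes:
        if not (x0 <= x < x1 and y0 <= y < y1):
            raise ValueError(f"punch {(x, y)} is outside the folded sheet")

    # Undo folds last-to-first; each unfold mirrors every hole across its crease.
    for axis, lo, hi in reversed(creases):
        if axis == "x":
            holes |= {(lo + hi - 1 - x, y) for x, y in holes}
        else:
            holes |= {(x, lo + hi - 1 - y) for x, y in holes}
    return holes

# 4x4 sheet: fold left over right, then top over bottom, punch one cell.
pattern = fold_and_unfold(4, 4, ["L", "T"], [(2, 3)])
```

Each extra fold doubles the number of holes, which is why accuracy that stays flat across fold count suggests the folds are not being tracked one by one.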

Diagnostics in depth

The evidence behind the gap

The reasoning-length paradox. Across all seven splits, incorrect answers come with substantially longer chains of thought than correct ones, about 3× longer on Folding and Navigation. Models thrash through conflicting hypotheses when uncertain instead of reasoning toward the answer.

Extended reasoning training is not a reliable fix. Per-split accuracy delta (Thinking − Instruct) for the Qwen3-VL family. Gains appear at 30B, vanish at 8B and 235B, and reverse at 32B: neither consistent nor monotonic with scale.

Granular failure analysis. (a–b) MRT distractor rates: mirror distractors dominate, indicating chirality confusion. (c–d) Folding accuracy is flat across fold × hole count, which points to a shallow heuristic rather than sequential simulation. (e) Perspective confusion matrix: left–right confusions account for nearly half of all errors.

Performance does not scale with size. Per-split accuracy vs. log active parameters for 27 open-source models (dense = circles, MoE = triangles). Relations scales moderately (r=0.53); MRT-Hard is essentially flat (r=0.38). 3D mental rotation does not improve with scale.

Failures are shared, not random. When models err, they converge on the same wrong answer far above chance: Perspective 66.6% (vs 33% baseline), Folding 65.5% (vs 50%), Orientation 98.1%. Spatial errors are systematic, stemming from how current architectures represent space.
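One plausible way to measure shared failures is pairwise agreement: among pairs of models that both miss an item, how often do they pick the same wrong option? The exact metric behind the figures above is not stated on this page, so the sketch below is an assumption. Under uniform, independent errors over k options its expected value is 1/(k−1), which matches the quoted 33% and 50% baselines for four- and three-option items.

```python
from itertools import combinations

def pairwise_wrong_agreement(predictions, gold):
    """Fraction of erring model pairs that chose the same wrong option.

    predictions: {model_name: {item_id: option}}  (hypothetical layout)
    gold:        {item_id: correct_option}
    """
    same = total = 0
    for item, answer in gold.items():
        wrong = [
            preds[item]
            for preds in predictions.values()
            if item in preds and preds[item] != answer
        ]
        for a, b in combinations(wrong, 2):
            total += 1
            same += a == b
    return same / total if total else float("nan")

# Toy data, invented purely to exercise the function.
toy_gold = {"q1": "A", "q2": "C"}
toy_preds = {
    "m1": {"q1": "B", "q2": "C"},
    "m2": {"q1": "B", "q2": "A"},
    "m3": {"q1": "D", "q2": "A"},
}
agreement = pairwise_wrong_agreement(toy_preds, toy_gold)
```

Items with fewer than two erring models contribute no pairs, so the measure is driven by the items that many models get wrong.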

Seven recurring error types

Perspective-transformation failure: ~36–50% of perspective errors
3D mental-rotation breakdown: ~95% of MRT errors (weak models)
Reference-frame confusion: ~40% of relation errors
Visual-grounding failure: weakest navigation models
Incomplete reasoning: ~10–15% across splits
Structure misidentification: ~5–10%, mainly MRT
Ambiguity handling: ~15–20% of orientation errors

Qualitative evidence

Inside the failures

The same error modes recur across models: correct perception, correct reasoning about the rule, yet the wrong reference frame or the wrong rotation.

A man and a boy walking, viewed from the front

Perspective left/right reversal — GPT-5.2

Task: Perspective · Pred: B (right) · Gold: D (left)

Q: Where is the boy from the man's perspective?

"The man is on the left side of the image, and the boy is walking beside him on the man's right-hand side (viewer's right)."

Analysis. The model correctly perceives the boy on the viewer's right. But the man walks toward the camera, so his left and right are mirrored relative to the viewer — the boy is on the man's left. The canonical mirror-inversion error: viewer-frame coordinates mapped onto the subject's frame without the required 180° reference-frame rotation.
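The missing reference-frame step can be written out in a few lines. The sketch below works in assumed top-down ground-plane coordinates, where the viewer looks along +y and the viewer's right is +x; the function name and coordinate convention are illustrative only.

```python
def side_from_perspective(offset, facing):
    """Which side a target is on, from the point of view of an agent.

    offset: (dx, dy) from the agent to the target, in top-down coordinates.
    facing: (fx, fy) direction the agent is facing, in the same coordinates.
    """
    # An agent's right-hand direction is its facing vector rotated 90° clockwise.
    right = (facing[1], -facing[0])
    lateral = offset[0] * right[0] + offset[1] * right[1]
    if lateral == 0:
        return "aligned"
    return "right" if lateral > 0 else "left"

boy_offset = (1, 0)  # the boy appears to the viewer's right of the man

viewer_frame = side_from_perspective(boy_offset, facing=(0, 1))   # same heading as the camera
man_frame = side_from_perspective(boy_offset, facing=(0, -1))     # walking toward the camera
```

With the agent facing the camera, the lateral axis flips, so the same offset reads as "right" in the viewer frame and "left" in the man's frame.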

Mental Rotation Test hard item with polycube shapes

Exhaustive analysis, wrong rotation — GPT-5.2

Task: MRT Hard · Pred: B · Gold: C · 1,851 characters of reasoning

Q: Which option is the original shape in a rotated orientation?

"Key fingerprint features preserved under rotation: exactly one cube has 3 neighbours; the branching point is one in from an end … Option B preserves the same connectivity. Therefore, only Option B is a rotated version."

Analysis. The model identifies the right rotation-invariant properties and checks each option — exactly the correct procedure — yet answers wrong. A reasoning–perception decoupling: it articulates what should be invariant under rotation but cannot reliably perceive whether those invariants hold in the 3D rendering.
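A related point is that a connectivity fingerprint such as neighbour counts is preserved by mirroring as well as by rotation, so it cannot by itself rule out a mirrored distractor. The sketch below illustrates this on a hypothetical four-cube chiral shape (not the item shown), comparing a neighbour-count fingerprint with a check over all 24 proper rotations.

```python
from itertools import permutations, product

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def neighbour_fingerprint(cells):
    """Sorted face-adjacent neighbour counts, one per cube."""
    cells = set(cells)
    return sorted(
        sum((x + dx, y + dy, z + dz) in cells for dx, dy, dz in STEPS)
        for x, y, z in cells
    )

def proper_rotations():
    """Yield the 24 signed axis permutations with determinant +1."""
    for perm in permutations(range(3)):
        inversions = sum(perm[i] > perm[j] for i in range(3) for j in range(i + 1, 3))
        parity = -1 if inversions % 2 else 1
        for signs in product((1, -1), repeat=3):
            if parity * signs[0] * signs[1] * signs[2] == 1:
                yield perm, signs

def normalise(cells):
    """Translate so the minimum corner is the origin, then sort."""
    mins = [min(c[i] for c in cells) for i in range(3)]
    return tuple(sorted(tuple(c[i] - mins[i] for i in range(3)) for c in cells))

def canonical(cells):
    """Smallest normalised form over all proper rotations."""
    return min(
        normalise([tuple(s * c[p] for p, s in zip(perm, signs)) for c in cells])
        for perm, signs in proper_rotations()
    )

# Hypothetical chiral shape: three unit steps along mutually perpendicular axes.
shape = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
rotated = [(-y, x, z) for x, y, z in shape]   # 90° turn about the z axis
mirrored = [(-x, y, z) for x, y, z in shape]  # reflection in the yz plane

same_fingerprint = (
    neighbour_fingerprint(shape)
    == neighbour_fingerprint(rotated)
    == neighbour_fingerprint(mirrored)
)
rotation_matches = canonical(shape) == canonical(rotated)
mirror_matches = canonical(shape) == canonical(mirrored)
```

All three shapes share the same fingerprint, yet only the rotated copy matches under rotation, so telling a rotation from a reflection requires handedness information that connectivity does not carry.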

A man patting a dog while holding a phone

Hand-laterality inversion — GPT-5.2

Task: Relations · Pred: B (right) · Gold: A (left)

Q: Which hand is the man using to pat the dog?

"The hand patting the dog is the one not holding the phone, which is his right hand."

Analysis. The exclusion logic is valid, but the laterality label is wrong. Because the man faces the camera, his left hand appears on the viewer's right. The model conflates image-space position with body-space identity — the mirror-inversion error, at the level of body parts.

Cite

BibTeX

@misc{stogiannidis2025mindgapbenchmarkingspatial,
  title         = {Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models},
  author        = {Ilias Stogiannidis and Steven McDonagh and Sotirios A. Tsaftaris},
  year          = {2025},
  eprint        = {2503.19707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.19707},
}

An earlier version appeared as arXiv:2503.19707 — "Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models".