📝 TL;DR: With our framework, VLMs can perform spatial reasoning from arbitrary perspectives.


Abstract

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking—the ability to perceive an environment or situation from an alternative viewpoint—is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.


👀 Problem Definition

Perspectives are important, but remain a missing piece in VLM spatial reasoning.

Egocentric vs. Allocentric. While VLMs perform well when questions are asked from the egocentric (i.e., the camera's) perspective, they struggle when the same questions are posed from an allocentric perspective, showing a strong bias toward egocentric reasoning [1]. Allocentric reasoning is crucial for high-level spatial tasks and serves as a key benchmark for human-level spatial understanding, as also recognized in previous studies on VLM spatial reasoning [1, 2, 3, 4, 5]. In this work, we aim to extend the spatial reasoning capabilities of VLMs to 🌟arbitrary perspectives🌟, thereby bridging the gap between VLMs and human perception and opening up new possibilities for VLM-based applications.


🤔 How Do Humans Shift Perspectives?

We can imagine a scene from another viewpoint by forming an abstract version of it and rotating it in our mind.

Mental Imagery Simulation. Inspired by how humans employ mental imagery [6, 7, 8, 9] to reason across different perspectives (left), we propose a similar process for VLMs: constructing an explicit abstraction of the input scene and using it as a foundation for perspective changes (right).
Based on this idea, we build our framework by answering the following questions:
  1. Given an image of a scene and a question about it, how can we turn the image into an abstract 3D scene?
  2. After transforming the abstract scene to align with a given perspective, how can we feed it back to the VLM?


💡 Abstract Perspective Change

We propose APC, a framework that enables VLMs to perform perspective-aware reasoning,
by explicitly simulating the mental imagery process of humans.

Our proposed framework consists of three stages (a minimal code sketch follows the list):
  1. Scene Abstraction (Sec. 3.1): APC first detects the objects of interest and builds a coarse 3D abstraction of the scene using off-the-shelf vision foundation models. We employ models for object detection [12], segmentation [13], depth estimation [14], and orientation estimation [15].
  2. Perspective Change (Sec. 3.2): Then, a reference perspective is set and the abstraction is transformed into the reference viewer’s egocentric coordinate frame.
  3. Perspective Prompting (Sec. 3.3): Finally, APC passes the transformed scene to the VLM by producing (1) a numerical (textual) prompt or (2) an abstract visual prompt, and poses the question of interest from the reference perspective.
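To make the perspective change in Stage 2 concrete, below is a minimal NumPy sketch, assuming the usual computer-vision camera convention (+x right, +y down, +z forward) and a toy scene dictionary standing in for the abstraction of Stage 1; the data layout and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def egocentric_frame(forward, down=np.array([0.0, 1.0, 0.0])):
    """Rotation whose rows are the reference viewer's right/down/forward axes.

    Assumes a computer-vision camera convention (+x right, +y down, +z forward)
    and a forward direction that is not (near-)vertical.
    """
    f = forward / np.linalg.norm(forward)
    r = np.cross(down, f)          # down x forward = right (right-handed frame)
    r /= np.linalg.norm(r)
    d = np.cross(f, r)             # forward x right = down
    return np.stack([r, d, f])

def to_reference_view(objects, ref_pos, ref_forward):
    """Re-express object positions and orientations in the reference viewer's frame."""
    R = egocentric_frame(ref_forward)
    return {name: (R @ (pos - ref_pos), R @ orient)
            for name, (pos, orient) in objects.items()}

# Toy abstraction: a ball 1 m to the camera's right, 2 m ahead, facing the camera.
scene = {"ball": (np.array([1.0, 0.0, 2.0]), np.array([0.0, 0.0, -1.0]))}
# Reference viewer stands at (0, 0, 2) and faces +x (toward the ball).
ego = to_reference_view(scene, ref_pos=np.array([0.0, 0.0, 2.0]),
                        ref_forward=np.array([1.0, 0.0, 0.0]))
print(ego["ball"][0])  # ~[0, 0, 1]: the ball lies straight ahead of the reference viewer
```

In the full framework, the scene dictionary would instead be populated by the detection, segmentation, depth, and orientation models listed in Stage 1.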

🔍 Perspective Prompts

In APC, the transformed abstract scene is then passed to the VLM in the form of a prompt.

Perspective Prompt Samples. We explore two variations of perspective prompting: numerical (left) and visual (right). The numerical (textual) prompt is generated directly from the 3D coordinate and orientation information. To generate the visual prompt, we first place a colored cube at each object's identified 3D position and then render the scene from the reference viewpoint, which results in an egocentric depiction of the scene. In addition, we construct an abstract question along with an object-color mapping to ground the abstracted view.
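As a rough illustration of the numerical variant, the transformed abstraction can be serialized into text as in the sketch below; the wording and the `ego` dictionary are assumptions for this example, not APC's exact prompt template.

```python
def numerical_prompt(ego_objects):
    """Serialize an egocentric abstraction (reference-viewer frame) into text.

    Coordinates follow the assumed convention: +x right, +y down, +z forward, in meters.
    """
    lines = ["You are the reference viewer. The scene contains the following objects,",
             "given as (x=right, y=down, z=forward) coordinates in meters:"]
    for name, (pos, orient) in ego_objects.items():
        lines.append(f"- {name}: position ({pos[0]:.2f}, {pos[1]:.2f}, {pos[2]:.2f}), "
                     f"facing direction ({orient[0]:.2f}, {orient[1]:.2f}, {orient[2]:.2f})")
    return "\n".join(lines)

# Continuing the toy example: the ball sits 1 m straight ahead of the reference viewer.
ego = {"ball": ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))}
print(numerical_prompt(ego))
```

The visual variant would instead rasterize a colored cube at each of these positions, render the result from the reference viewpoint, and pair it with an abstract question that refers to objects by their assigned colors.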

🖼️ Results

Comparison with Baseline VLMs.

Spatial Reasoning with Perspective Change. Recent VLMs such as Qwen2.5-VL [10] and Cambrian-1 [11] often struggle with spatial reasoning tasks that require a shift to a specific reference viewpoint. In contrast, our APC effectively handles such perspective changes by constructing a scene abstraction and delivering the transformed view through a simple prompting technique.


Probing the Perspective Awareness of VLMs.

Perspective Awareness. Each plot shows accuracy versus the angular offset θ between the camera and the reference viewpoint. While baselines show clear degradation at certain ranges of θ, APC retains robust accuracy across all angles, demonstrating strong perspective-aware reasoning.
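For reference, the offset can be computed as the angle between the two viewing directions, as in the sketch below; whether the benchmark defines θ exactly this way is an assumption here.

```python
import numpy as np

def angular_offset(cam_forward, ref_forward):
    """Angle in degrees between the camera's and the reference viewer's viewing directions."""
    a = cam_forward / np.linalg.norm(cam_forward)
    b = ref_forward / np.linalg.norm(ref_forward)
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

print(angular_offset(np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])))  # ~45.0
```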

References

[1] Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities, Zhang et al., ICLR 2025
[2] 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark, Ma et al., arXiv 2024
[3] Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models, Góral et al., arXiv 2024
[4] The 3D-PC: A Benchmark for Visual Perspective Taking in Humans and Machines, Linsley et al., ICLR 2025
[5] Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces, Yang et al., CVPR 2025
[6] Principles of Mental Imagery, Finke, R. A., The MIT Press, 1989
[7] Mental Imagery, Nanay, B., The Stanford Encyclopedia of Philosophy, 1997
[8] Mental Rotation of Three-Dimensional Objects, Shepard, R. N. and Metzler, J., Science, 1971
[9] Visual Images Preserve Metric Spatial Information: Evidence from Studies of Image Scanning, Kosslyn et al., Journal of Experimental Psychology, 1978
[10] Qwen2.5-VL Technical Report, Qwen Team Alibaba Group, arXiv 2025
[11] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al., NeurIPS 2024
[12] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al., ECCV 2024
[13] Segment Anything, Kirillov et al., ICCV 2023
[14] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, Bochkovskii et al., ICLR 2025
[15] Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models, Wang et al., arXiv 2024

Citation

If you find our work helpful, please cite the following paper.

@article{lee2025perspective,
  title   = {Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation},
  author  = {Lee, Phillip Y. and Je, Jihyeon and Park, Chanho and Uy, Mikaela Angelina and Guibas, Leonidas and Sung, Minhyuk},
  journal = {arXiv preprint arXiv:2504.17207},
  year    = {2025}
}