Look Where It Matters:
Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
¹Central University of Finance and Economics   ²Tsinghua University   ³East China Normal University
*Equal Contribution   Corresponding Author
arXiv | Code

Abstract

With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained, answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. We therefore present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on the Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20–44% in speed.
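To make the 'where to look' stage concrete, the sketch below implements a simplified hierarchical zoom search in Python. It is only an illustration, not the released code: the function name zoom_search, the quadtree-style splitting, and the parameters (branches, min_size, top_k) are assumptions, and the per-patch score is abstracted as an injected score_fn; in ZoomSearch this score combines a patch–text relevance from an external scoring model with a model-evidence signal from the foundation model, as described in the overview below.

```python
from typing import Callable, List, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original pixels


def zoom_search(image: Image.Image,
                question: str,
                score_fn: Callable[[Image.Image, str], float],
                branches: int = 2,
                min_size: int = 512,
                top_k: int = 4) -> List[Tuple[Image.Image, Box]]:
    """Simplified multi-branch zoom search: split each region into four
    children, descend only into the `branches` highest-scoring children,
    and stop zooming once a region is at most `min_size` pixels wide/tall."""
    leaves: List[Tuple[float, Box]] = []
    frontier: List[Box] = [(0, 0, image.width, image.height)]

    while frontier:
        next_frontier: List[Box] = []
        for left, top, right, bottom in frontier:
            w, h = right - left, bottom - top
            if max(w, h) <= min_size:
                # Fine enough to read: keep as a candidate leaf patch.
                box = (left, top, right, bottom)
                leaves.append((score_fn(image.crop(box), question), box))
                continue
            cx, cy = left + w // 2, top + h // 2
            children: List[Box] = [(left, top, cx, cy), (cx, top, right, cy),
                                   (left, cy, cx, bottom), (cx, cy, right, bottom)]
            # Rank the four children by query relevance; zoom into the best ones.
            children.sort(key=lambda b: score_fn(image.crop(b), question),
                          reverse=True)
            next_frontier.extend(children[:branches])
        frontier = next_frontier

    # Return the top-k most query-relevant leaf patches with their positions.
    leaves.sort(key=lambda t: t[0], reverse=True)
    return [(image.crop(box), box) for _, box in leaves[:top_k]]
```

For quick experimentation, a CLIP-style image–text similarity can stand in for score_fn.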

Overview of ZoomSearch

ZoomSearch introduction figure

The top-left part illustrates Adaptive Multi-Branch Zoom Search, which progressively explores the image and focuses on regions that are closely related to the text query. The bottom part shows the scoring mechanism, where each candidate patch is evaluated by a patch–text relevance score from an external scoring model and a model-evidence signal from the foundation model. The top-right part depicts Layout-Aware Patch Reassembly, which reorganizes the selected informative patches into a spatially consistent canvas that preserves their relative and global positions.
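Below is a minimal sketch of the reassembly step under simplifying assumptions: the selected patches are placed into a square grid of uniform cells, and row-major ordering by their original top-left coordinates is used to preserve relative layout. The function reassemble and its parameters are illustrative; the paper's layout-aware scheme may preserve relative and global positions more carefully.

```python
import math
from typing import List, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original pixels


def reassemble(patches: List[Tuple[Image.Image, Box]],
               canvas_size: int = 1024) -> Image.Image:
    """Simplified layout-aware reassembly: paste the selected patches onto a
    small square canvas so that patches that were above/left of each other in
    the full image remain above/left of each other on the canvas."""
    canvas = Image.new("RGB", (canvas_size, canvas_size))
    if not patches:
        return canvas

    # Row-major order by original top-left position approximates the layout.
    ordered = sorted(patches, key=lambda p: (p[1][1], p[1][0]))

    grid = math.ceil(math.sqrt(len(ordered)))  # smallest square grid that fits
    cell = canvas_size // grid

    for i, (crop, _box) in enumerate(ordered):
        row, col = divmod(i, grid)
        # Uniform square cells keep the canvas compact; a real system might
        # preserve aspect ratio or scale cells by patch importance.
        canvas.paste(crop.resize((cell, cell)), (col * cell, row * cell))
    return canvas
```

The resulting compact canvas, together with the question, can then be handed to the unmodified foundation model (e.g. LLaVA-ov), which is what keeps the pipeline training-free and plug-and-play.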

Experimental Results

Performance on LRS-VQA Dataset

| Method | Pub. | Max Res. | Rural/Urban | Count | Reasoning | Status | Category | Shape | Color | Background | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | - | - | 56.49 | 18.49 | 20.40 | 18.69 | 14.81 | 34.17 | 36.08 | 16.11 | 26.91 |
| GPT-4o | - | - | 55.19 | 16.50 | 21.39 | 19.22 | 16.26 | 39.32 | 47.45 | 18.78 | 29.26 |
| LLaVA-v1.5-7b | NeurIPS'23 | 336 | 53.04 | 11.84 | 20.50 | 11.80 | 15.40 | 31.98 | 39.87 | 19.18 | 25.45 |
| LLaVA-v1.6-7b | - | 672 | 52.00 | 13.68 | 19.80 | 17.40 | 15.91 | 29.77 | 41.31 | 17.96 | 25.98 |
| LLaVA-ov-7b | TMLR'25 | 384 | 50.08 | 11.68 | 21.80 | 11.20 | 19.15 | 31.98 | 45.49 | 17.14 | 26.07 |
| Qwen2.5-VL-7b | - | 1000 | 47.12 | 16.18 | 19.50 | 9.20 | 16.31 | 21.81 | 51.76 | 17.96 | 24.98 |
| LLaVA-HR | ICLR'25 | 1536 | 57.11 | 9.67 | 17.60 | 9.10 | 15.20 | 21.02 | 37.91 | 17.96 | 23.30 |
| GeoChat | CVPR'24 | 504 | 61.42 | 11.76 | 16.70 | 6.50 | 8.00 | 21.47 | 17.39 | 11.84 | 19.38 |
| VHM | AAAI'25 | 336 | 56.39 | 12.26 | 18.80 | 13.40 | 17.12 | 31.86 | 46.27 | 12.24 | 26.69 |
| GeoLLaVA-8K | NeurIPS'25 | 8K | 54.68 | 12.83 | 21.32 | 4.41 | 14.81 | 22.41 | 49.52 | 16.18 | 24.52 |
| ImageRAG | GRSM'25 | Dynamic | 58.55 | 13.70 | 21.20 | 10.00 | 21.07 | 33.90 | 45.75 | 19.18 | 27.92 |
| ZoomEye | EMNLP'25 | Dynamic | 48.24 | 14.76 | 22.10 | 10.40 | 23.61 | 28.58 | 45.36 | 21.63 | 26.84 |
| RAP | ICML'25 | Dynamic | 45.45 | 16.78 | 26.10 | 11.00 | 24.32 | 30.96 | 51.90 | 24.08 | 28.82 |
| ZoomSearch (Ours) | - | Dynamic | 62.53 | 17.32 | 28.50 | 15.90 | 24.75 | 37.80 | 50.12 | 26.43 | 32.92 |
| Improvements (vs. LLaVA-ov-7b) | - | - | +25% | +48% | +31% | +42% | +29% | +18% | +10% | +54% | +26% |

Performance on MME-RealWorld-RS Dataset

| Method | Max Res. | Position | Color | Count | Avg. |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | - | 55.43 | 50.92 | 29.64 | 45.33 |
| GPT-4o | - | 33.52 | 29.83 | 18.90 | 27.42 |
| LLaVA-v1.5-7b | 336 | 21.48 | 22.95 | 16.31 | 20.28 |
| LLaVA-v1.6-7b | 672 | 26.49 | 24.06 | 20.47 | 23.70 |
| LLaVA-ov-7b | 384 | 26.81 | 26.14 | 27.57 | 26.83 |
| Qwen2.5-VL-7b | 1000 | 22.12 | 15.54 | 14.93 | 17.55 |
| LLaVA-HR | 1536 | 35.56 | 44.30 | 7.91 | 29.26 |
| GeoChat | 504 | 25.06 | 23.11 | 15.66 | 21.32 |
| VHM | 336 | 35.24 | 20.32 | 16.80 | 24.18 |
| GeoLLaVA-8K | 8K | 34.90 | 27.92 | 22.27 | 28.41 |
| ImageRAG | Dynamic | 63.33 | 60.48 | 32.46 | 52.09 |
| ZoomEye | Dynamic | 43.52 | 60.88 | 30.10 | 44.94 |
| RAP | Dynamic | 57.62 | 64.53 | 40.25 | 54.20 |
| ZoomSearch (Ours) | Dynamic | 67.62 | 66.14 | 39.15 | 57.64 |
| Improvements (vs. LLaVA-ov-7b) | - | +152% | +153% | +42% | +115% |

Case Study

BibTeX

@article{ZoomSearch,
  title={Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search},
  author={Zhou, Yunqi and Jiang, Chengjie and Yuan, Chun and Li, Jing},
  journal={arXiv preprint arXiv:2511.20460},
  year={2025}
}