Abstract
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained, answer-critical details. In this setting, guiding the model to look where it matters before prediction becomes crucial. We therefore present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on the Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. At the same time, it is substantially more efficient at inference, running 20–44% faster than prior search-based methods.
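To make the two stages concrete, the sketch below gives a minimal, self-contained Python illustration of a hierarchical zoom search followed by a layout-preserving reassembly. It is an illustration under simplifying assumptions rather than the released ZoomSearch implementation: regions are split with a fixed 2x2 grid instead of adaptive multi-branch expansion, and a single `score_fn` callable stands in for the full scoring mechanism described in the overview below.

```python
import heapq
from typing import Callable, List, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def zoom_search(image: Image.Image,
                query: str,
                score_fn: Callable[[Image.Image, str], float],
                min_size: int = 512,
                top_k: int = 4,
                budget: int = 32) -> List[Box]:
    """Greedy best-first zoom: repeatedly split the most query-relevant
    region into a 2x2 grid of children until the scoring budget is spent
    or regions reach the minimum patch size."""
    w, h = image.size
    frontier = [(-score_fn(image, query), (0, 0, w, h))]  # max-heap via negation
    leaves: List[Tuple[float, Box]] = []
    scored = 1

    while frontier and scored < budget:
        neg, (l, t, r, b) = heapq.heappop(frontier)
        if min(r - l, b - t) <= min_size:        # fine enough; keep as candidate
            leaves.append((-neg, (l, t, r, b)))
            continue
        cx, cy = (l + r) // 2, (t + b) // 2      # split into four children
        for child in ((l, t, cx, cy), (cx, t, r, cy),
                      (l, cy, cx, b), (cx, cy, r, b)):
            heapq.heappush(frontier, (-score_fn(image.crop(child), query), child))
            scored += 1

    leaves.extend((-neg, box) for neg, box in frontier)  # unexpanded candidates
    leaves.sort(key=lambda item: item[0], reverse=True)
    return [box for _, box in leaves[:top_k]]


def reassemble(image: Image.Image, boxes: List[Box],
               patch_size: int = 336) -> Image.Image:
    """Paste the selected patches onto a compact grid canvas in row-major
    order of their original positions, so coarse spatial layout survives."""
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))
    cols = max(1, round(len(boxes) ** 0.5))
    rows = -(-len(boxes) // cols)                # ceiling division
    canvas = Image.new("RGB", (cols * patch_size, rows * patch_size))
    for i, box in enumerate(boxes):
        patch = image.crop(box).resize((patch_size, patch_size))
        canvas.paste(patch, ((i % cols) * patch_size, (i // cols) * patch_size))
    return canvas
```

In this sketch, only the currently most promising regions are expanded and only a handful of crops survive, so the number of image tokens handed to the frozen foundation model stays bounded regardless of the input resolution.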
Overview of ZoomSearch
The top-left part illustrates Adaptive Multi-Branch Zoom Search, which progressively explores the image and focuses on regions that are closely related to the text query. The bottom part shows the scoring mechanism, where each candidate patch is evaluated by a patch–text relevance score from an external scoring model and a model-evidence signal from the foundation model. The top-right part depicts Layout-Aware Patch Reassembly, which reorganizes the selected informative patches into a spatially consistent canvas that preserves their relative and global positions.
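As a hedged illustration of how the two scoring signals in the figure might be combined, the snippet below blends a CLIP-style patch-text relevance score with a model-evidence term supplied by the frozen foundation model (for example, its probability of answering "Yes" when asked whether the crop shows what the question is about). The choice of CLIP, the prompt, and the weighted-sum fusion with `alpha` are assumptions of this sketch, not details confirmed by the paper; the resulting function can be dropped in as the `score_fn` of the search sketch above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative external scoring model; the paper only specifies "an external
# scoring model", so using CLIP here is an assumption of this sketch.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def patch_text_relevance(patch: Image.Image, query: str) -> float:
    """Patch-text relevance as a CLIP cosine similarity, mapped to [0, 1]."""
    inputs = _processor(text=[query], images=patch,
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        img = _clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = _clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (float(img @ txt.T) + 1.0) / 2.0      # cosine similarity -> [0, 1]


def combined_score(patch: Image.Image, query: str,
                   evidence_fn: callable, alpha: float = 0.5) -> float:
    """Blend the external relevance score with a model-evidence signal from
    the frozen foundation model. `evidence_fn` is a placeholder for whatever
    yes/no confidence the chosen model exposes, expected to lie in [0, 1];
    the weighted sum with `alpha` is an assumption of this sketch."""
    evidence = evidence_fn(patch, query)
    return alpha * patch_text_relevance(patch, query) + (1.0 - alpha) * evidence
```

For example, `score_fn=lambda p, q: combined_score(p, q, evidence_fn=my_yes_probability)` wires this scoring into `zoom_search`, where `my_yes_probability` is a hypothetical helper returning the foundation model's confidence that the crop can answer the question.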
Experimental Results
Performance on LRS-VQA Dataset
| Method | Pub. | Max Res. | Rural/Urban | Count | Reasoning | Status | Category | Shape | Color | Background | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | - | - | 56.49 | 18.49 | 20.40 | 18.69 | 14.81 | 34.17 | 36.08 | 16.11 | 26.91 |
| GPT-4o | - | - | 55.19 | 16.50 | 21.39 | 19.22 | 16.26 | 39.32 | 47.45 | 18.78 | 29.26 |
| LLaVA-v1.5-7b | NeurIPS'23 | 336 | 53.04 | 11.84 | 20.50 | 11.80 | 15.40 | 31.98 | 39.87 | 19.18 | 25.45 |
| LLaVA-v1.6-7b | - | 672 | 52.00 | 13.68 | 19.80 | 17.40 | 15.91 | 29.77 | 41.31 | 17.96 | 25.98 |
| LLaVA-ov-7b | TMLR'25 | 384 | 50.08 | 11.68 | 21.80 | 11.20 | 19.15 | 31.98 | 45.49 | 17.14 | 26.07 |
| Qwen2.5-VL-7b | - | 1000 | 47.12 | 16.18 | 19.50 | 9.20 | 16.31 | 21.81 | 51.76 | 17.96 | 24.98 |
| LLaVA-HR | ICLR'25 | 1536 | 57.11 | 9.67 | 17.60 | 9.10 | 15.20 | 21.02 | 37.91 | 17.96 | 23.30 |
| GeoChat | CVPR'24 | 504 | 61.42 | 11.76 | 16.70 | 6.50 | 8.00 | 21.47 | 17.39 | 11.84 | 19.38 |
| VHM | AAAI'25 | 336 | 56.39 | 12.26 | 18.80 | 13.40 | 17.12 | 31.86 | 46.27 | 12.24 | 26.69 |
| GeoLLaVA-8K | NeurIPS'25 | 8K | 54.68 | 12.83 | 21.32 | 4.41 | 14.81 | 22.41 | 49.52 | 16.18 | 24.52 |
| ImageRAG | GRSM'25 | Dynamic | 58.55 | 13.70 | 21.20 | 10.00 | 21.07 | 33.90 | 45.75 | 19.18 | 27.92 |
| ZoomEye | EMNLP'25 | Dynamic | 48.24 | 14.76 | 22.10 | 10.40 | 23.61 | 28.58 | 45.36 | 21.63 | 26.84 |
| RAP | ICML'25 | Dynamic | 45.45 | 16.78 | 26.10 | 11.00 | 24.32 | 30.96 | 51.90 | 24.08 | 28.82 |
| ZoomSearch (Ours) | - | Dynamic | 62.53 | 17.32 | 28.50 | 15.90 | 24.75 | 37.80 | 50.12 | 26.43 | 32.92 |
| Improvement vs. LLaVA-ov-7b | - | - | +25% | +48% | +31% | +42% | +29% | +18% | +10% | +54% | +26% |
Performance on MME-RealWorld-RS Dataset
| Method | Max Res. | Position | Color | Count | Avg. |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | - | 55.43 | 50.92 | 29.64 | 45.33 |
| GPT-4o | - | 33.52 | 29.83 | 18.90 | 27.42 |
| LLaVA-v1.5-7b | 336 | 21.48 | 22.95 | 16.31 | 20.28 |
| LLaVA-v1.6-7b | 672 | 26.49 | 24.06 | 20.47 | 23.70 |
| LLaVA-ov-7b | 384 | 26.81 | 26.14 | 27.57 | 26.83 |
| Qwen2.5-VL-7b | 1000 | 22.12 | 15.54 | 14.93 | 17.55 |
| LLaVA-HR | 1536 | 35.56 | 44.30 | 7.91 | 29.26 |
| GeoChat | 504 | 25.06 | 23.11 | 15.66 | 21.32 |
| VHM | 336 | 35.24 | 20.32 | 16.80 | 24.18 |
| GeoLLaVA-8K | 8K | 34.90 | 27.92 | 22.27 | 28.41 |
| ImageRAG | Dynamic | 63.33 | 60.48 | 32.46 | 52.09 |
| ZoomEye | Dynamic | 43.52 | 60.88 | 30.10 | 44.94 |
| RAP | Dynamic | 57.62 | 64.53 | 40.25 | 54.20 |
| ZoomSearch (Ours) | Dynamic | 67.62 | 66.14 | 39.15 | 57.64 |
| Improvement vs. LLaVA-ov-7b | - | +152% | +153% | +42% | +115% |
Case Study
Qualitative comparison between our method and other search-based methods on an object counting task.
Qualitative comparison between our method and other search-based methods on an object color recognition task.
More qualitative results of ZoomSearch.
BibTeX
@article{ZoomSearch,
  title={Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search},
  author={Zhou, Yunqi and Jiang, Chengjie and Yuan, Chun and Li, Jing},
  journal={arXiv preprint arXiv:2511.20460},
  year={2025}
}