This paper addresses a key limitation of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These “rectangular shackles” inherently struggle to delineate objects with complex geometries precisely, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which is critical for applications such as urban planning and agricultural monitoring. We also create the CVOGL-Seg dataset specifically to support and evaluate CVOS. To tackle the challenges of CVOS, we introduce Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained on a heterogeneous task. Second, the SAM Prompt Stage (SPS) exploits SAM's zero-shot segmentation capability, guided by the HTTS outputs, to generate precise masks. We extensively evaluate our method on the CVOGL and CVOGL-Seg datasets and demonstrate state-of-the-art performance compared with existing models. Our work shows that CVOS breaks the rectangular shackles and unlocks new potential for fine-grained object geo-localization.
An overview of the proposed TROGeo framework, consisting of (a) Heterogeneous Task Training Stage (HTTS) and (b) SAM Prompt Stage (SPS).
Human perception provides a key insight: when searching for cross-view object localization cues, observers consciously integrate the object and its semantic context (e.g., road layout, neighboring buildings). Inspired by LDM.
Comparison with previous works on the CVOGL dataset. “w/o OST” and “w/ OST” indicate whether the Object Segmentation Task (OST) is learned in TROGeo’s HTTS. Bold indicates the best result.
Comparison with previous works on the CVOGL-Seg dataset. “+ SPS” means that our SAM Prompt Stage (SPS) is added to obtain the segmentation mask, using the model’s original rectangular region output as the prompt.
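The caption above describes feeding a rectangular localization result to SAM as a box prompt. A minimal sketch of that hand-off, assuming the localization stage emits a normalized center-format box `(cx, cy, w, h)` (a hypothetical output format, not specified by the paper) and using the public `segment-anything` predictor API:

```python
import numpy as np

def box_to_sam_prompt(cx, cy, w, h, img_w, img_h):
    """Convert a normalized center-format box (a hypothetical output format
    for the localization stage) into the absolute XYXY array that SAM's
    predictor expects as a box prompt."""
    x0 = (cx - w / 2) * img_w
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return np.array([x0, y0, x1, y1])

# Example: a box centered in a 1024x1024 satellite image.
prompt = box_to_sam_prompt(0.5, 0.5, 0.2, 0.1, img_w=1024, img_h=1024)
# prompt -> array([409.6, 460.8, 614.4, 563.2])

# With segment-anything installed, the prompt would drive zero-shot masking:
#   from segment_anything import sam_model_registry, SamPredictor
#   sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
#   predictor = SamPredictor(sam)
#   predictor.set_image(satellite_rgb)  # HxWx3 uint8 image
#   masks, scores, _ = predictor.predict(box=prompt, multimask_output=False)
```

This is only an illustration of the prompt plumbing; the paper's actual SPS may condition SAM differently.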
@InProceedings{Zhang_2025_ICCV,
author = {Zhang, Qingwang and Zhu, Yingying},
title = {Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {8197-8206}
}