Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

Illustration of the cross-view object geo-localization task and comparison of different solutions.

Abstract

This paper addresses a limitation of existing cross-view object geo-localization schemes: they rely on rectangular proposals to localize irregular objects in satellite imagery. These "rectangular shackles" inherently struggle to delineate objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which are critical for applications such as urban planning and agricultural monitoring. We also construct the CVOGL-Seg dataset specifically to support and evaluate CVOS. To tackle the challenges of CVOS, we introduce Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained by learning a heterogeneous task. Second, the SAM Prompt Stage (SPS) utilizes SAM's zero-shot segmentation capability, guided by HTTS outputs, to generate precise masks. We extensively evaluate our method on the CVOGL and CVOGL-Seg datasets and demonstrate state-of-the-art performance compared to existing models. Our work shows that CVOS breaks the rectangular shackles and unlocks new potential for fine-grained object geo-localization.
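To make the "rectangular shackles" concrete, the following minimal sketch compares an irregular object's pixel mask with its tightest axis-aligned bounding box. The L-shaped toy object, the grid size, and all helper names (`mask_area`, `bounding_box`, `box_to_mask`, `iou`) are illustrative assumptions and are not taken from the paper or the CVOGL-Seg dataset.

```python
import numpy as np


def mask_area(mask: np.ndarray) -> int:
    """Number of foreground pixels in a boolean mask."""
    return int(mask.sum())


def bounding_box(mask: np.ndarray) -> tuple:
    """Tightest axis-aligned box (x0, y0, x1, y1), end-exclusive.

    Assumes the mask contains at least one foreground pixel.
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1


def box_to_mask(box: tuple, shape: tuple) -> np.ndarray:
    """Rasterize a box as a filled boolean mask of the given shape."""
    x0, y0, x1, y1 = box
    out = np.zeros(shape, dtype=bool)
    out[y0:y1, x0:x1] = True
    return out


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


# Hypothetical L-shaped object on a 10x10 satellite patch.
obj = np.zeros((10, 10), dtype=bool)
obj[2:8, 2:4] = True  # vertical arm
obj[6:8, 2:8] = True  # horizontal arm

box = bounding_box(obj)
box_mask = box_to_mask(box, obj.shape)

print("object area (pixels):", mask_area(obj))
print("bounding box:", box, "area:", mask_area(box_mask))
print("IoU between box and true shape:", round(iou(obj, box_mask), 3))
```

In this toy case even the best possible rectangle covers 36 pixels for a 20-pixel object, so area estimates derived from the box are inflated, whereas a pixel-level mask can recover the shape exactly.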

Transformer Object Geo-localization (TROGeo)

Overview

An overview of the proposed TROGeo framework, consisting of (a) Heterogeneous Task Training Stage (HTTS) and (b) SAM Prompt Stage (SPS).

Cross-View Object Perception Module (CVOPM)

Human perception provides a key insight: when searching for cross-view object localization cues, observers consciously integrate the object and its semantic context (e.g., road layout, neighboring buildings). Inspired by LDM.

Results

Rectangular Shackles: Defining Geographic Locations of Objects Using Rectangular Regions.

Comparison with previous works on the CVOGL dataset. “w/o OST” and “w OST” indicate whether the Object Segmentation Task (OST) is learned in TROGeo’s HTTS. Bold indicates the best result.

Breaking Rectangular Shackles: Defining Geographic Locations of Objects Using Segmentation Masks.

Comparison with previous works on the CVOGL-Seg dataset. “+ SPS” means that our SAM Prompt Stage (SPS) is added to obtain the segmentation mask using the original rectangular region output as a prompt.
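The "+ SPS" setting can be sketched as follows: a rectangular region predicted by a box-based method is passed to SAM as a box prompt, and the returned mask is used as the segmentation output. This is a minimal sketch, not the authors' implementation. It assumes a predictor that exposes the `set_image` and `predict(box=..., multimask_output=...)` methods of `SamPredictor` from the `segment_anything` package. The function name `box_prompt_to_mask` and the `StubPredictor` class are hypothetical; the stub simply fills the box and stands in for a real SAM model so that the example runs without a checkpoint.

```python
import numpy as np


def box_prompt_to_mask(predictor, image_rgb: np.ndarray, box_xyxy):
    """Turn a rectangular prediction into a mask via a SAM-style predictor.

    `predictor` is assumed to follow the segment_anything SamPredictor
    interface: set_image(image) and predict(box=..., multimask_output=...).
    """
    box = np.asarray(box_xyxy, dtype=np.float32)
    if box.shape != (4,):
        raise ValueError("box_xyxy must be (x0, y0, x1, y1)")
    predictor.set_image(image_rgb)
    # With multimask_output=False a single mask and score are returned.
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0].astype(bool), float(scores[0])


class StubPredictor:
    """Hypothetical stand-in for SamPredictor: returns the filled box.

    With the real library, a predictor would instead be built from
    segment_anything (sam_model_registry and SamPredictor) plus a
    downloaded checkpoint, which this sketch does not require.
    """

    def set_image(self, image: np.ndarray) -> None:
        self._shape = image.shape[:2]

    def predict(self, box=None, multimask_output=True):
        x0, y0, x1, y1 = box.astype(int)
        masks = np.zeros((1, *self._shape), dtype=bool)
        masks[0, y0:y1, x0:x1] = True
        scores = np.array([1.0], dtype=np.float32)
        low_res = np.zeros((1, 4, 4), dtype=np.float32)
        return masks, scores, low_res


# Usage with a blank 8x8 RGB patch and an assumed box from a prior method.
image = np.zeros((8, 8, 3), dtype=np.uint8)
mask, score = box_prompt_to_mask(StubPredictor(), image, (1, 2, 5, 6))
print(mask.shape, int(mask.sum()), score)
```

Because the stub only fills the box, its mask is still rectangular; with an actual SAM model the same call would return a mask that follows the object boundary inside the prompted region.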

Qualitative Analysis

Ablation Studies

BibTeX

@InProceedings{Zhang_2025_ICCV,
    author    = {Zhang, Qingwang and Zhu, Yingying},
    title     = {Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {8197-8206}
}