In this paper, (a) we introduce LoD-Loc v3 to address two critical challenges in aerial localization over LoD city models: cross-scene generalization and pose ambiguity in dense urban scenes. Our solution is twofold: (b) we construct InsLoD-Loc, a large-scale synthetic dataset covering 40 distinct areas, used to train models that generalize zero-shot to unseen scenes, and (c) we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which provides superior convergence.
Abstract
We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. The prior work, LoD-Loc v2, achieves localization by aligning semantic building silhouettes with low-detail city models, but it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance-level building annotations. Models trained on this dataset exhibit strong zero-shot generalization. Second, we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines by a large margin in both cross-scene and dense urban scenarios.
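To illustrate why instance-level alignment can be less ambiguous than semantic alignment, the following is a minimal toy sketch, not the paper's actual cost function. It assumes two integer-labelled masks (0 for background, positive IDs for buildings): one predicted from the query image and one rendered from the LoD model at a candidate pose. The function names `semantic_iou` and `instance_iou`, and the best-match IoU scoring, are illustrative assumptions.

```python
import numpy as np

def semantic_iou(query, rendered):
    """IoU of building-vs-background masks; instance IDs are ignored."""
    q, r = query > 0, rendered > 0
    union = np.logical_or(q, r).sum()
    return float(np.logical_and(q, r).sum() / union) if union else 0.0

def instance_iou(query, rendered):
    """Mean over query instances of the best IoU with any rendered instance.

    Hypothetical scoring rule chosen for illustration only.
    """
    q_ids = np.unique(query[query > 0])
    r_ids = np.unique(rendered[rendered > 0])
    if q_ids.size == 0 or r_ids.size == 0:
        return 0.0
    scores = []
    for qi in q_ids:
        q = query == qi
        best = 0.0
        for ri in r_ids:
            r = rendered == ri
            inter = np.logical_and(q, r).sum()
            union = np.logical_or(q, r).sum()
            best = max(best, inter / union)
        scores.append(best)
    return float(np.mean(scores))

# Toy scene: two adjacent buildings that merge into one semantic blob.
query = np.zeros((4, 8), dtype=int)
query[:, 0:4] = 1   # building A
query[:, 4:8] = 2   # building B

good_pose = query.copy()             # rendering at the correct pose
bad_pose = np.zeros((4, 8), dtype=int)
bad_pose[:, 0:6] = 1                 # wrong pose: boundary between buildings shifted
bad_pose[:, 6:8] = 2

print("semantic:", semantic_iou(query, good_pose), semantic_iou(query, bad_pose))
print("instance:", instance_iou(query, good_pose), instance_iou(query, bad_pose))
```

In this toy case the semantic score cannot distinguish the two poses because both renderings cover the same overall building area, while the instance score drops for the wrong pose because the boundary between the buildings no longer lines up.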
Demo Video
Please watch the video for a detailed explanation of our pipeline and qualitative results.
Overview of InsLoD-Loc
The left panel illustrates the geographic distribution of the 40 flight areas across Europe and Asia. The right panel showcases representative samples from the dataset, each displaying (from left to right): a photorealistic RGB query image and its corresponding pixel-accurate instance label, where each color represents a unique building.
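A color-coded instance label like the one described above can be converted into an integer ID map before training or evaluation. The sketch below is one possible way to do this with NumPy. It assumes the label is an H×W×3 `uint8` array and that background pixels are black; neither the on-disk format nor the background color of InsLoD-Loc is stated here, and `colors_to_instance_ids` is a hypothetical helper.

```python
import numpy as np

def colors_to_instance_ids(label_rgb, background=(0, 0, 0)):
    """Map each distinct color to an integer ID (background -> 0).

    Assumes `label_rgb` has shape (H, W, 3); the background color is an assumption.
    """
    h, w, _ = label_rgb.shape
    flat = label_rgb.reshape(-1, 3)
    # Unique colors (sorted lexicographically) and, per pixel, the index of its color.
    colors, inverse = np.unique(flat, axis=0, return_inverse=True)
    is_bg = np.all(colors == np.asarray(background, dtype=colors.dtype), axis=1)
    remap = np.zeros(len(colors), dtype=np.int32)
    remap[~is_bg] = np.arange(1, int((~is_bg).sum()) + 1)
    return remap[inverse.reshape(h, w)]

# Tiny example: two red pixels (one building) and one green pixel (another building).
label = np.zeros((2, 3, 3), dtype=np.uint8)
label[0, 0] = (255, 0, 0)
label[0, 1] = (255, 0, 0)
label[1, 2] = (0, 255, 0)

ids = colors_to_instance_ids(label)
print(ids)
```

The resulting array keeps one ID per building, so per-instance masks can be obtained with a simple comparison such as `ids == k`.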
Visualization Results
The visualized results demonstrate that our method, based on instance silhouette alignment, can effectively solve the ambiguity problem in dense urban scenes. The columns from left to right show: query image, prior pose, LoD-Loc v2 result, LoD-Loc v3 result, and the ground truth. All query images are from the Tokyo-LoDv3 dataset.
BibTeX
If you find this work useful for your research, please cite our paper:
@article{peng2026lodlocv3,
  title={LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment},
  author={Peng, Shuaibang and Zhu, Juelin and Li, Xia and Yan, Shen and Yang, Kun and Zhang, Maojun and Liu, Yu},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}