Cross‐Modal Urban Sensing: Evaluating Sound–Vision Alignment Across Street‐Level and Aerial Imagery
Published online on March 26, 2026
Abstract
Transactions in GIS, Volume 30, Issue 2, April 2026.

Environmental soundscapes carry rich ecological and social information, yet remain underutilized in geographic analysis. This study introduces a unified cross-modal evaluation framework to examine how urban sounds align with visual representations from street-level and aerial perspectives, and how different visual representation strategies influence this alignment. We integrate geo-referenced sound recordings from London, New York, and Tokyo with corresponding street-view and remote-sensing imagery, applying embedding-based methods (AST for audio; CLIP and RemoteCLIP for imagery) and segmentation-based methods (CLIPSeg and SegEarth-OV). Results show that embedding-based models capture stronger semantic alignment between sound and imagery, particularly for street-level views, while segmentation-based models, especially from aerial imagery, more effectively reveal interpretable ecological patterns when mapped to Biophony, Geophony, and Anthrophony (BGA) categories. These findings highlight the complementary strengths of embeddings for fine-grained semantic matching and segmentation for ecological interpretation, offering a reproducible evaluation framework for incorporating sound into multimodal urban sensing.
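The embedding-based alignment the abstract describes can be sketched as cosine similarity between modality embeddings. The vectors and helper below are hypothetical stand-ins: in the actual study the audio embeddings would come from a pretrained AST model and the image embeddings from CLIP or RemoteCLIP, whereas here random vectors merely illustrate how a well-aligned street-view embedding scores higher than an unrelated aerial one.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-d embeddings standing in for AST (audio) and
# CLIP/RemoteCLIP (image) model outputs.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=512)                     # stand-in AST audio embedding
street_emb = audio_emb + 0.1 * rng.normal(size=512)  # closely aligned street-view embedding
aerial_emb = rng.normal(size=512)                    # unrelated aerial embedding

street_score = cosine_similarity(audio_emb, street_emb)
aerial_score = cosine_similarity(audio_emb, aerial_emb)
print(street_score > aerial_score)  # the aligned pair scores higher
```

In practice, such similarity scores would be computed over batches of geo-referenced sound/image pairs and aggregated per city and per viewpoint (street-level vs. aerial) to quantify alignment.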