GAEA: A Geolocation Aware Conversational Assistant [WACV 2026🔥]

Ron Campos^*, Ashmal Vayani^*, Parth Parag Kulkarni^*, Rohit Gupta, Aizan Zafar, Aritra Dutta, Mubarak Shah

* Equally contributing first authors
Center for Research in Computer Vision, University of Central Florida

Inspired by the challenges in training conversational Large Multimodal Models (LMMs) for geolocalization and the lack of comprehensive datasets in this domain, we introduce GAEA. This open-source conversational model uniquely combines global-scale geolocalization with rich, interactive discussions about locations, landmarks, and services. To support its training, we curate GAEA-1.4M, a diverse dataset of 1.4M samples, integrating images, metadata, and knowledge-driven captions. We also propose GAEA-Bench, a benchmark designed to assess the conversational and geolocalization capabilities of LMMs.

We compare the performance of various LMMs on the geographically-grounded visual-question-answering task, included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon (Left). Qualitative SVQA comparison showing GAEA’s ability to provide accurate, location-specific answers where other LMMs fail (Right).

Abstract

Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, these issues remain unaddressed: LMMs still struggle with specialized downstream tasks, such as geolocalization, that go beyond general tasks. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs that evaluate conversational capabilities across diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%.

GAEA is the first open-source conversational model that combines conversational capabilities with global-scale geolocalization.

Main contributions:

GAEA-1.4M: A Diverse Training Dataset. We propose GAEA-1.4M, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
GAEA-Bench: Evaluating Conversational Geolocalization. To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
GAEA: An Interactive Geolocalization Chatbot. We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
Benchmarking Against State-of-the-Art LMMs. We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.

GAEA-1.4M Dataset Overview

Figure: Data Collection and Annotation Pipeline. (Left) GAEA-1.4M includes geographically diverse visual samples from various data sources, such as MP-16, GLDv2, and CityGuessr68k. (Middle) We also incorporate OpenStreetMap (OSM) metadata and auxiliary context for each image, ranging from climate zones to geographical clues about the country. (Right) Using open-source LLMs and GPT-4o, we generate four diverse question-answer pairs across geolocation, reasoning, and conversational subsets.

GAEA-Bench Curation Pipeline

Figure: Overview of GAEA-Bench. GAEA-Bench is designed to evaluate the conversational abilities of various LMMs across different question types, including MCQs, T/F, and both short and long VQAs. We carefully selected a subset of 3.5k samples from MP-16 and generated the corresponding OSM metadata, which we use to produce QA pairs with GPT-4o. GAEA-Bench aims to fill the gap in conversational benchmarks by incorporating geolocalization capabilities.

Figure: Evaluation pipeline for conversational benchmarking on GAEA-Bench, highlighting the question types we introduce. Each question type is evaluated against its own set of criteria, using GPT-4o as a judge. For instance, SVQA is evaluated on Accuracy and Correctness, and LVQA is evaluated on Consistency, Fluency, and Relevancy.
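The per-type criteria described in the caption above can be sketched as a small lookup plus a prompt builder. This is only an illustration, not the prompt used in the paper: the function name `build_judge_prompt`, the prompt wording, and the 0 to 10 scale are hypothetical, and listing "Accuracy" for MCQ and TF is an assumption based on how those question types are reported elsewhere on this page.

```python
# Criteria per question type. SVQA and LVQA follow the figure caption;
# "Accuracy" for MCQ and TF is an assumption made for this sketch.
CRITERIA = {
    "MCQ": ("Accuracy",),
    "TF": ("Accuracy",),
    "SVQA": ("Accuracy", "Correctness"),
    "LVQA": ("Consistency", "Fluency", "Relevancy"),
}


def build_judge_prompt(question_type, question, reference, response):
    """Assemble a grading prompt for an LLM judge (hypothetical wording)."""
    criteria = CRITERIA[question_type]  # raises KeyError for unknown types
    lines = [
        f"You are grading a {question_type} answer.",
        f"Question: {question}",
        f"Reference answer: {reference}",
        f"Model response: {response}",
        # The 0-10 scale is an assumption, not taken from the paper.
        "Score the response from 0 to 10 on each criterion:",
    ]
    lines += [f"- {name}" for name in criteria]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_judge_prompt(
        "SVQA",
        "Which cafe is closest to this statue?",
        "Cafe Amazon",
        "The nearest cafe is Cafe Amazon.",
    ))
```

The returned string would then be sent to whichever judge model is in use; the call to the judge itself is deliberately left out of the sketch.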

Figure: Classification and distance-threshold accuracy computation pipeline. It evaluates geolocalization performance in two ways simultaneously: city- and country-level classification, which compares model predictions with ground-truth annotations derived by reverse-geocoding GPS coordinates, and accuracy at different distance thresholds, which is computed by geocoding the model's predictions.
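The distance-threshold half of this pipeline can be illustrated with a minimal sketch, assuming predictions have already been geocoded to (latitude, longitude) pairs. The great-circle (haversine) distance and the threshold set of 1, 25, 200, 750, and 2,500 km are assumptions based on common practice in geolocalization evaluation; this page itself only mentions the 25 km and 2,500 km thresholds. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def threshold_accuracy(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold of the truth.

    preds and truths are equal-length sequences of (lat, lon) pairs.
    """
    if len(preds) != len(truths) or not truths:
        raise ValueError("need equal-length, non-empty inputs")
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds_km}


if __name__ == "__main__":
    predictions = [(13.7465, 100.4927), (48.8566, 2.3522)]
    ground_truth = [(13.7466, 100.4930), (51.5074, -0.1278)]
    print(threshold_accuracy(predictions, ground_truth))
```

Each threshold is cumulative, so accuracy can only stay the same or increase as the threshold grows.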

Data Statistics

Statistic	Value
Total images	822,951
Total cities / countries	41,481 / 234
Total questions	1,580,531
Total geo-localization questions	822,951
Total explanatory captions	384,947
Total open-ended questions	267,668
Total multiple-choice questions	48,673
Total true/false questions	56,292
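
As a quick consistency check on the table above, the five per-type counts can be summed and compared with the reported total. The dictionary keys are shorthand labels chosen for this sketch; the numbers are copied from the table.

```python
# Per-type question counts, copied from the statistics table.
question_counts = {
    "geo-localization": 822_951,
    "explanatory captions": 384_947,
    "open-ended": 267_668,
    "multiple-choice": 48_673,
    "true/false": 56_292,
}
reported_total = 1_580_531  # "Total questions" row

computed_total = sum(question_counts.values())

# Share of each question type as a percentage of all questions.
shares = {
    name: round(100 * count / computed_total, 1)
    for name, count in question_counts.items()
}

if __name__ == "__main__":
    print(computed_total == reported_total)
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
```

The per-type rows add up exactly to the reported total, so the five categories partition the question set.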

Qualitative Example of GAEA-1.4M

Figure: Examples of the four question types in our dataset: SVQA, MCQ, TF, and LVQA. Each type targets a distinct reasoning skill grounded in geographical, visual, or contextual understanding. Our dataset has three categories, including Geolocalization, Reasoning (LVQA), and Conversational (SVQA, MCQ, TF) QAs, as shown in the figure.

Benchmarking and Evaluations

GAEA is the first model explicitly trained on 1.6 million instructions, incorporating reasoning-based question-answer pairs to provide transparent geolocation predictions, unlike traditional black-box models. We benchmark GAEA against state-of-the-art LMMs and geo-localization models, evaluating performance on diverse question types and standard benchmarks while introducing new datasets for city and country classification.

Evaluation on GAEA-Bench

(1) GAEA achieves the highest average accuracy (66.06%) across decision-making and short-form VQA questions, surpassing GPT-4o by 8.28% and outperforming the best open-source model by 25.69%. However, both open-source and proprietary models struggle with short-form questions, with GPT-4o's accuracy dropping significantly from long to short VQAs.

Figure: We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and scores higher than the proprietary models on decision-making questions (MCQs and TFs). We provide the relative performance change for each model compared to GAEA. We use GPT-4o as a judge for evaluation, and it has been documented that LLMs as judges prefer their own long-form outputs; hence, the scores for these models are likely overestimated.

Standard Geolocalization Evaluation

(2) GAEA performs competitively against specialized geo-localization models, achieving the second-best performance on IM2GPS3k, surpassing GaGA by 2.5% at 25 km and 3.66% at the country level, while also outperforming GeoCLIP across all thresholds. On IM2GPS, it surpasses GaGA at 25 km and 2,500 km, and on GSW-15K, it outperforms GeoCLIP and GeoDecoder in city-level geolocation.

Figure: We benchmark the performance of various specialized models on standard geolocation datasets. GAEA demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k.

Classification Accuracy Results

(3) GAEA outperforms recent LMMs, including LLaVA-OneVision, InternVL, and GLM-4V-9B, in city- and country-level classification across three new datasets, demonstrating its extensive geographical coverage and superior geolocation capabilities.

Figure: Classification accuracy for both city and country labels, where GAEA establishes itself as a strong baseline, surpassing several recent LMMs in performance.
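
City- and country-level classification accuracy of this kind can be sketched as a normalized label comparison. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, and the normalization (case folding, accent stripping, whitespace collapsing) is an assumed way of making a free-text prediction comparable to a reverse-geocoded label.

```python
import unicodedata


def normalize(label):
    """Case-fold, strip accents, and collapse whitespace in a place name."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(no_accents.casefold().split())


def classification_accuracy(predictions, ground_truth):
    """Fraction of predicted labels that match the ground-truth labels."""
    if len(predictions) != len(ground_truth) or not ground_truth:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(
        normalize(p) == normalize(g)
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    predicted_cities = ["Bangkok", "são  paulo", "Rome"]
    true_cities = ["bangkok", "Sao Paulo", "Milan"]
    print(classification_accuracy(predicted_cities, true_cities))
```

The same function applies unchanged to country labels. Exact matching after normalization will still miss aliases (for example, a city's local versus English name), which a real evaluation would need to handle separately.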

Qualitative examples of various LMMs on GAEA-Bench

We further present the question types in GAEA-Bench and show how various LMMs respond to conversational questions that require geolocalization capabilities. Notably, GAEA comprehends the geographic location of the image and responds with the correct output.

Conclusion

We introduced GAEA, the first interactive conversational model with specialized geolocation capabilities, explicitly trained on a large-scale conversational dataset, GAEA-1.4M. We meticulously designed the dataset to enhance GAEA’s reasoning, conversational abilities, and geolocation accuracy. We curated geolocalizable images from MP-16, GLDv2, and CityGuessr68k, enriching them with auxiliary context and metadata, such as geographic clues and climate zones. In addition to a high-quality instruction set, we present GAEA-Bench, a comprehensive benchmark that evaluates LMMs across multiple question types, including MCQs, True/False, and short and long VQAs. Our results show that GAEA outperforms recent LMMs on GAEA-Bench, demonstrating strong geolocation and conversational capabilities by leveraging OpenStreetMap (OSM) data. These findings establish GAEA as a strong baseline for future research in geolocalization.

For additional details about GAEA-Bench evaluation and experimental results of GAEA, please refer to our main paper. Thank you!

BibTeX

@misc{campos2025gaeageolocationawareconversational,
      title={GAEA: A Geolocation Aware Conversational Model}, 
      author={Ron Campos and Ashmal Vayani and Parth Parag Kulkarni and Rohit Gupta and Aritra Dutta and Mubarak Shah},
      year={2025},
      eprint={2503.16423},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.16423}, 
}