Are Large Language Models Geospatially Knowledgeable?

Despite the impressive performance of Large Language Models (LLMs) on various natural language processing tasks, little is known about their comprehension of geographic data and their related ability to facilitate informed geospatial decision-making. This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within such pretrained LLMs. With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities and to determine locations of cities based on prompting. Our results confirm that it takes not only larger, but also more sophisticated LLMs to synthesize geospatial knowledge from textual information. As such, this research contributes to understanding the potential and limitations of LLMs in dealing with geospatial information.


INTRODUCTION
The recent proliferation of pretrained large language models (LLMs) like GPT-3 [2] and their impressive performance on several downstream tasks has led the natural language processing (NLP) community to consider the implicit knowledge these models may contain in their parameters. Authors have shown that LLMs can function, to an extent, as knowledge bases [17], since they store various types of knowledge, such as common sense, relational, and linguistic aspects, in their parameters [3,16,21]. This paper explores whether and to what extent geospatial knowledge is encoded in LLMs and whether such models have geospatial awareness. Finally, we examine the models' geospatial reasoning potential. Geospatial knowledge includes the factual understanding of geographic data such as location, distance, and area. Geospatial awareness is concerned with the ability to perceive and comprehend geographical information. Finally, geospatial reasoning is the use of geospatial knowledge and awareness for informed decision making.
Most recent and successful LLMs are built using transformer architectures [26] specifically designed for sequence tasks such as language modeling. Although transformers consist of an encoder-decoder structure, LLMs usually only use the decoder part since they focus on autoregressive generation. This autoregressive nature means that LLMs generate coherent and contextually relevant text. Specifically, they predict the next word (token) in a sequence based on the previous words, thereby capturing the dependencies and contextual nuances of the language. The autoregressive strategy calculates a probability distribution over the entire vocabulary and, given a token, it samples from this distribution to generate the next token. We specifically study the LLaMA [24], Open Pre-Trained Transformer (OPT) [31], and Alpaca [22] models.
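The autoregressive loop described above can be illustrated with a toy next-token model. The transition probabilities below are purely hypothetical stand-ins for the distribution a real transformer decoder would compute over its entire vocabulary; only the sampling loop itself mirrors how LLMs generate text.

```python
import numpy as np

# A toy next-token model: the probability of each next token given the current
# one. (Hypothetical values; a real LLM computes such a distribution over its
# entire vocabulary with a transformer decoder.)
NEXT = {
    "<s>": {"The": 1.0},
    "The": {"city": 0.6, "town": 0.4},
    "city": {"of": 1.0},
    "town": {"of": 1.0},
    "of": {"Paris": 0.6, "London": 0.4},
    "Paris": {"</s>": 1.0},
    "London": {"</s>": 1.0},
}

def generate(rng, max_len=10):
    # Autoregressive loop: each step samples the next token from a
    # distribution conditioned on what has been generated so far.
    tokens, cur = [], "<s>"
    for _ in range(max_len):
        dist = NEXT[cur]
        cur = str(rng.choice(list(dist), p=list(dist.values())))
        if cur == "</s>":
            break
        tokens.append(cur)
    return " ".join(tokens)

rng = np.random.default_rng(0)
sentence = generate(rng)  # e.g. "The city of Paris"
```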
To evaluate LLMs with respect to their geospatial knowledge, awareness, and reasoning capabilities, we conducted the following experiments. First, we probe the LLMs for actual geo-coordinates of cities. This should provide us with an idea about their concrete geospatial knowledge. To assess their geospatial awareness, we evaluate whether geospatial prepositions such as "near" translate into smaller distances when used in sentences to generate nearby cities, as opposed to a control scenario which simply uses the conjunction "and". Last, to gauge the geospatial reasoning potential of LLMs, we perform a multidimensional scaling (MDS) [1] experiment, in which we compare the predicted layout of cities using real distances to a distance measure derived from LLMs.
Our findings reveal that LLMs are becoming more adept at handling and comprehending geospatial data, as evidenced by their encoded geospatial knowledge and subsequent geospatial awareness while generating texts. Our results also show the possibility of using LLMs in geospatial reasoning tasks.

RELATED WORK
There has been comparatively little research when it comes to using natural language processing (NLP) techniques for geospatial data.While NLP has made great progress in a number of areas, its use and usefulness in "processing" geographical data have been underutilized.
Previous research in this area has focused on geographical information encoded in word embeddings [4,9]. These studies primarily investigated the degree of isomorphism between word embeddings and the geographical concepts existing in the real world. Our study differs from these studies in the sense that we do not rely on word embeddings alone, but rather on a whole language model, which facilitates a thorough investigation of LLMs for geospatial data.
Derungs and Purves [5] explored the spatial relationship for "near" using the Microsoft N-grams [27] to investigate the geospatial awareness of a statistical n-gram language model. They used expressions of "A near **", where A stands for different locations and ** refers to the autocomplete suggestions generated by the Microsoft n-gram language model. They showed what is encoded as "near" in this n-gram model for different scales of locations and generated nearness maps for various locations. In contrast, our work focuses on the currently popular and much better-performing neural language models.
In closely related work, Liétard et al. [11] assessed the geospatial knowledge of LLMs with tasks like geo-coordinate prediction and neighboring country prediction. Our research expands on this work by assessing the models' geospatial awareness and the utilization of LLMs for geospatial reasoning tasks. While their work used smaller models and required some training, our work uses state-of-the-art LLMs and requires no training, as it leverages the LLMs' zero-shot inference capabilities. By incorporating all these aspects, we seek to provide a more comprehensive assessment of the geospatial capabilities of LLMs.

METHODOLOGY
Our methodology involves three different tasks to assess different aspects of the geospatial capabilities of LLMs.
For the first task, evaluating the geospatial knowledge encoded within LLMs, the objective is to correctly predict the locations and coordinates of cities. The second task, assessing geospatial awareness, analyzes the expressions generated by LLMs when leveraging geospatial prepositions vs. generic expressions, e.g., "near" vs. "and", by comparing their resulting respective distances, i.e., are cities generated with "near" actually closer than those generated with "and"? Lastly, to assess the LLMs' usefulness for geospatial reasoning, we devise a problem where the goal is to predict the locations of cities based on the relative distances between cities. We generate two "constellations", one which uses the actual distances, compared to another one that uses LLM-derived distances.
For all the above tasks, we make use of autoregressive language models. We use the foundational language models OPT [31] and LLaMA [24] along with the instruction-tuned model Alpaca [22] in our experimentation. Instruction-tuned models are language models that have been fine-tuned to produce outputs that are better preferred by humans. The Alpaca model is based on LLaMA and has been fine-tuned using self-instruct [28]. We limit our investigation to the above open-sourced models rather than closed models like GPT-3 and ChatGPT to ensure transparency and reproducibility. Closed models are proprietary and, as such, limit our ability to examine the training dataset and understand the training process in detail, making it difficult to draw scientifically sound conclusions or conduct comprehensive analyses. By using open models, we prioritize openness and enable a more thorough investigation of the models' behavior and underlying training data, aligning with our research objectives.
We use HuggingFace's Transformers [30] to implement the LLMs, which we run on an NVIDIA A100 GPU (80GB). The total estimated GPU run time is around 200 hours. We use Stanza [19] to postprocess the generated sentences in our contextualizing geospatial prepositions task and use Scikit-learn [15] to implement multidimensional scaling (MDS), which is used in the spatial reasoning task in Section 6 to determine the locations of cities. Map-based visualizations were done in Python using the Folium [18] library with the Stamen Design map style under the CC BY 3.0 license. The map data is from OpenStreetMap under the ODbL license.

MEASURING GEOSPATIAL KNOWLEDGE
This first experiment simply probes LLMs to determine the coordinates (latitude and longitude) of cities. This task serves as an indicator of the extent to which LLMs encode geospatial knowledge.

Experimental Setup
Prompting, introduced by Brown et al. [2], refers to appending a few sample input-output pairs along with a textual prompt to a pretrained LLM, which is then expected to provide a relevant completion of this input based on the sample inputs and outputs. This approach is sometimes referred to as in-context learning. LLMs can handle a wide range of NLP tasks through prompting, eliminating the need for costly model parameter updates through fine-tuning.
In our experiments, we use prompts such as the following: The geo-coordinates of Peoria are 40.69 and -89.58.
The geo-coordinates of Oldham are 53.55 and -2.11.
The geo-coordinates of Plzen are 49.75 and 13.36.
The geo-coordinates of Kathmandu are ...
The first three sentences represent a prompt template that illustrates the type of generation that we expect from the model. Our goal is for the LLM to learn from these first three sentences and to complete the last sentence in the best possible way.
The above example is referred to as 3-shot inference since we provided three examples prior to the actual sentence completion task. Providing no example is referred to as zero-shot inference. We use both 3-shot and zero-shot inference for the location prediction of cities. The cities that we use as examples while prompting are selected randomly from a dataset of 3,527 cities (see below). We experiment with different prompt templates, as listed below, to assess their effect on the generated results:

Template 1:
The geo-coordinates of <city> are ...

Template 2:
The latitude and longitude of <city> are ...

For the case of the instruction-following Alpaca model, we provide the following input.
Below is an instruction that describes a task, paired with an input that provides further context.Write a response that appropriately completes the request.

### Instruction:
Provide the geo-coordinates of the city given below.
In the above instruction template, we can simply request the latitude and longitude instead of the geo-coordinates, to match the second template that we used for prompting the other models (simply replacing "geo-coordinates" in the instruction with "latitude and longitude").

Dataset: We use the MaxMind database for a global list of cities that have a population greater than 100k along with their geographic coordinates. This results in a list of 3,527 cities whose coordinates we predict using LLMs.

Experimental Details: We use the 6.7B(illion)- and 13B-parameter variants of the OPT model and the 7B- and 13B-parameter variants of the LLaMA model. For the Alpaca model, we instruction-tune the 7B LLaMA model. We use beam search with five beams as our decoding strategy. Beam search is commonly used in language generation tasks to help determine the output sequence that is most likely to occur by keeping a list of several candidate sequences at each step. It investigates many options while taking contextual dependencies into consideration, enabling the generation of coherent and contextually relevant text.
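The prompting and decoding setup above can be sketched as follows. The prompt-building helper reproduces the 3-shot template from the paper; `predict_coordinates` is a hedged illustration of running it through HuggingFace Transformers with five-beam search (the model name and token budget are assumptions, and actually calling it requires downloading the weights).

```python
def build_3shot_prompt(examples, query_city):
    # examples: list of (city, latitude, longitude) in-context demonstrations,
    # following the template "The geo-coordinates of <city> are <lat> and <lon>."
    lines = [f"The geo-coordinates of {c} are {lat} and {lon}." for c, lat, lon in examples]
    lines.append(f"The geo-coordinates of {query_city} are")
    return "\n".join(lines)

def predict_coordinates(prompt, model_name="facebook/opt-6.7b", max_new_tokens=20):
    # Completes the prompt with beam search (five beams), mirroring the decoding
    # strategy described above. Requires the `transformers` package and the model
    # weights, so this function is shown for illustration only.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, num_beams=5, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

prompt = build_3shot_prompt(
    [("Peoria", 40.69, -89.58), ("Oldham", 53.55, -2.11), ("Plzen", 49.75, 13.36)],
    "Kathmandu",
)
```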

Results and Discussion
The results of our coordinate prediction task are shown in Table 1. The first three rows correspond to the results reported in Liétard et al. [11], while all subsequent rows reflect our work. The "Prompt template" column indicates which of the two prompt templates was used. The "Prediction rate" column refers to the percentage of successful predictions made by the LLM, i.e., the instances where the model actually generated coordinates as the result. We note that, for some models, the generated text does not match the expected output because of the open-ended nature of text generation by autoregressive LLMs. A failed generation result looks as follows.
The latitude and longitude of Lobito are: Lobito is located in Angola.
The results reported in previous work by Liétard et al. [11] imply that LLMs have limited encoded geospatial knowledge. In their work, they compare LLMs with Word2Vec [14], a neural-based word representation in vector space, and report that Word2Vec performed better than LLMs. Another interesting finding by Liétard et al. [11] was that BERT [6], a bi-directional LLM, performed better than the (larger and later-released) autoregressive GPT-2 [20].

MEASURING GEOSPATIAL AWARENESS
Geospatial awareness refers to the perception of space and the use of spatial information during everyday activities. This idea also applies to generative language models, i.e., the degree to which LLMs capture geospatial information and how this is evident when generating text. To assess the geospatial awareness of LLMs, we utilize geospatial prepositions, i.e., prepositions that describe spatial relationships between objects or places in a geographical setting.

Experimental Setup
We want the LLM to generate sentences such as "<City-A> is near <City-B>", where "<City-A> is near" is passed as context.
Assuming that the model has geospatial awareness, <City-B> should be geographically close to <City-A>.
In our experiments, we contextualize the LLM input with a geospatial preposition and evaluate the output the LLM generates, prompting the model as follows: Albany is near ... In the above prompt, Albany is a city and near is a geospatial preposition.
We analyze whether the generation of <City-B> given the context of "<City-A> is near" is affected by the presence of the preposition "near" or not. In addition to "near", we also use the prepositional phrases "close to" and "far from".
We contrast the results of the above experiments with a control experiment where the geospatial preposition is replaced with the conjunction "and", using the following prompt: Albany and ... We prompt the model in both zero-shot and three-shot settings. Additionally, we also append the state of the city in all our inputs, as below: Albany, New York is near ...

Dataset:
We curate a list of 93 cities in the contiguous United States to perform an in-depth analysis of the geospatial awareness of LLMs. The list balances city size with a coverage of most of the contiguous states.

Experimental Details: Based on the results of the coordinate prediction task, we only use the 13B variant of the LLaMA model for this task. We use ten different prompts per city, where each prompt is created by randomly selecting a city and its closest city in our list. We generate fifty samples for each prompt. We use top-k sampling-based decoding with k set to 100 and a temperature of 0.9. Top-k sampling limits the token pool while decoding to the k most likely options at each step, while the temperature controls the randomness during token selection. A higher temperature value increases randomness, while a lower temperature value reduces it.
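The top-k sampling with temperature described above can be sketched in a few lines. The logits are hypothetical vocabulary scores; the function reproduces the filtering and rescaling that a `generate(do_sample=True, top_k=100, temperature=0.9)` call performs internally.

```python
import numpy as np

def top_k_sample(logits, k, temperature, rng):
    # Restrict the candidate pool to the k highest-scoring tokens, rescale
    # the remaining logits by the temperature, and sample one token index.
    logits = np.asarray(logits, dtype=float)
    keep = np.argsort(logits)[-k:]          # indices of the k most likely tokens
    z = logits[keep] / temperature
    z -= z.max()                            # numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    return int(keep[rng.choice(len(keep), p=probs)])

rng = np.random.default_rng(0)
logits = [0.1, 2.5, 1.0, -0.3, 0.7]         # hypothetical vocabulary scores
token_id = top_k_sample(logits, k=3, temperature=0.9, rng=rng)
```

At a very low temperature the distribution concentrates on the highest-scoring token, while higher temperatures spread probability mass more evenly across the k candidates.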

Results and Discussion
Figure 1 displays a box plot of the statistics of the actual distances between the generated places and the original city in our experiment. The categories shown are based on the presence or absence of the state name in the name of the original city. The visualization makes it evident that the use of geospatial prepositions in the sentences has an impact on the generated cities. The sentences contextualized with geospatial prepositions that indicate close proximity, such as "near" and "close to", yielded cities that are physically closer to the original city. Conversely, when the context was "far from", a geospatial preposition indicating distant location, the generated cities tend to be farther away from the original city. Compared to our control experiment (with the non-geospatial word "and"), the observed differences in the distances of the predicted cities provide compelling evidence of the geospatial awareness of LLMs. Figure 2 provides a specific example showing that the inclusion or exclusion of the state name in the city names influences the generated cities. The generated cities are occasionally further away from the source cities when state names are not included in the prompt. We believe that this discrepancy is due to the limitation of LLMs in resolving the exact location of a city when the state information is missing: the lack of state information may lead LLMs to confuse cities with the same name (disambiguation). This situation is demonstrated in Figure 2, which shows the heatmap of places generated for Albany, New York. When the state (NY) is included, the generated cities are almost always located closer to the (correct) Albany in New York State. In contrast, when the state name is absent from the prompt, the ambiguity between Albany in New York and Albany in California results in two respective hotspots.
Figure 3 provides additional examples for Albany, New York (3i), Fort Worth, Texas (3ii), Havre, Montana (3iii), and Fresno, California (3iv). In almost all cases, the hotspot is centered on the original city for geospatial prepositions indicating close proximity ("near", "close to"), while the hotspots are located further away for prepositions indicating distant places ("far from"). In contrast, there are no clear hotspots for the control setting ("and"), with predictions spread out across the map, providing further confirmation of the geospatial awareness displayed by LLMs.

The Effect of the Training Data
Elazar et al. [7] explored the effect of training data on LLMs' predictions. They argue that the factual knowledge extracted might be due to co-occurrence patterns in the dataset that the model was trained on, rather than to a model's hypothesized actual understanding. They suggested that understanding a model's predictions must be accompanied by a study of the training dataset and its effects.
To assess whether the geospatial awareness LLMs exhibit in our experiments is due to co-occurrence patterns in the training dataset, we use the CC100 dataset, which was used for model training, to obtain the occurrence counts of each city and the co-occurrence counts of all city pairs in our experimentation.
CC100 is a corpus of monolingual data for 100+ languages, including English, constructed using the methods proposed by Wenzek et al. [29] by processing Commoncrawl snapshots. Given that the LLaMA model was trained directly on Commoncrawl snapshots, we believe that CC100 is a good approximation of the training data.
We conduct an analysis similar to Elazar et al. [7] by quantifying the number of times (generation counts) each city in our list is generated under the different preposition prompts. We then calculate the Spearman rank correlation coefficient to examine the association between these generation counts and the occurrence of these cities as well as the co-occurrence frequency of these city pairs in the CC100 dataset. Additionally, we also consider the correlation of these generations with actual distances, as well as the correlations between distance and co-occurrence, to further analyze their relationship. All computed rank correlation coefficients are shown in Table 2. Table 2 reveals several interesting findings. First, there is, as expected, an inverse correlation between distance and generation (row i) for the geospatial prepositions "near" and "close to", which is significantly stronger than the correlation obtained for our control word "and". This further solidifies the argument that LLMs possess geospatial awareness, as they generate physically closer places when prompted with geospatial prepositions denoting close proximity. We did not find any notable correlations for the "far from" setting, which can be attributed to the ambiguity associated with this preposition. There is no defined threshold for places to be considered far from each other, in contrast to the close proximity setting.
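Spearman's rank correlation used above is the Pearson correlation computed on ranks. A minimal sketch (ignoring ties, which SciPy's `spearmanr` handles via average ranks) with hypothetical counts for five city pairs:

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman's rank correlation: the Pearson correlation of the ranks.
    # (Assumes no ties; SciPy's spearmanr handles ties via average ranks.)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical values for five city pairs: how often the second city was
# generated under a "near" prompt, and the actual distance in kilometers.
generation_counts = [42, 31, 18, 7, 3]
distances_km = [120, 250, 600, 1400, 2300]
rho = spearman_rho(generation_counts, distances_km)  # perfectly inverted ranks give -1.0
```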

Table 2: Spearman rank correlation coefficient between different variables for generations of different prepositions
Further, as expected, we see that the correlation between co-occurrence and generations (row ii) for "far from" is low. However, we observe a positive correlation for the other three prepositions. We should note that the correlation is slightly stronger for our control word "and", compared to geospatial prepositions denoting close proximity. This shows that, in comparison to prepositions denoting close proximity, co-occurrence between cities has a greater influence on the generations associated with "and". The positive correlation between co-occurrence and generations for "near" and "close to" can be explained by Tobler's first law of geography [23]: "Everything is related to everything else, but near things are more related than distant things". This aligns with the positive correlation observed between distance and co-occurrence. Therefore, we can confirm that the co-occurrence between cities does not impact the generations for geospatial prepositions any more than what is expected based on Tobler's first law of geography. Since the effect is more prominent for our control word "and", we can conclude that the LLMs do possess genuine geospatial awareness.
To further solidify this argument, we also study the correlation between the generation counts and the counts of both the prompt and the generated cities (rows iii and iv). The little to no correlation between the count of the prompt city and the generations implies that there is a minimal impact of the count of the prompt city on the generations. We do observe a positive correlation between generations and the count of the generated cities, which is expected when sampling from any well-trained LLM. However, the correlation is considerably more prominent for the control word "and" compared to the other geospatial prepositions, which suggests that the frequency of the generated cities in the pre-training dataset plays a more prominent role for the control word "and" compared to the geospatial prepositions. This observation further confirms the claim that LLMs are genuinely geospatially aware.
In conclusion, our results provide compelling evidence that LLMs are indeed geospatially aware. Our analysis of the pre-training dataset reinforces this claim by demonstrating that the observed geospatial awareness in LLMs is not merely a result of the patterns seen in the pre-training dataset.

LLMS AND GEOSPATIAL REASONING
Geospatial reasoning refers to the process of understanding and analyzing geospatial information to draw conclusions and make decisions. In order to assess the usefulness of LLMs for this task, we devise an experiment to predict the locations of cities using dissimilarity measures, such as distances between the cities.

Experimental Setup
We use dissimilarity measures to establish a 2-dimensional geometric representation of cities. We accomplish this through the application of multidimensional scaling (MDS) [1]. Specifically, we begin with a list of cities with known locations and with a test city whose location and coordinates we want to predict. Knowing the distance between all cities (including the test city), we then use a least-squares estimation of transformation parameters between two point patterns [25] to get the transformation matrix that maps the 2-dimensional geometric space coordinates generated by MDS to actual geo-coordinates, using the cities for which the geo-coordinates are known. Finally, we use this transformation matrix to determine the geo-coordinates of the test city. We provide the detailed pseudocode for this geo-coordinate prediction from a dissimilarity measure task in Algorithm 1.
We use actual distances as a benchmark for dissimilarity measures and the co-occurrence counts between each city pair as our baseline measure to establish a comparative reference point.
Co-occurrence is a measure of similarity, so to convert it into a dissimilarity measure, we consider the reciprocal of co-occurrence values.We adopt a similar approach for the other similarity measure we study, namely, the generation frequency, to convert it into a dissimilarity measure.
By utilizing a dissimilarity measure between cities to predict their geo-location, our designed task illustrates a practical application of geospatial reasoning. We extract diverse measures of dissimilarity from the LLM and conduct a comparative analysis against our predefined benchmark and baseline. The dissimilarity measures include the following:

• Predicted Distance: We predict the distances between each city pair in a zero-shot setting from the LLM, prompted in a manner similar to the geo-coordinate prediction task (§4). To predict the distances between cities, we prompt the LLM as follows.
The distance in kilometers between Albany, New York and Dallas, Texas is ...
Dataset: We use the curated list of 93 cities in the contiguous United States presented in Section 5. Each city in our dataset is considered a test city for which we want to predict its coordinates, and we use the remaining cities to sample cities with known locations. Based on the results of Section 5, we include the state names in the prompts.

Experimental Details: Similar to our contextualizing geospatial prepositions task, we use the 13B variant of the LLaMA model. We use beam search with 5 beams as our decoding strategy.

Results and Discussion
We present the result of our location prediction task in Table 3. We report two mean error distances. The first one is based on predicting the geo-coordinates of a city using all the remaining cities in our list from the contiguous US. The second one is calculated by dividing the cities into nine different US census bureau-designated divisions and limiting our experiment to each division, both for the known and the unknown cities.
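Mean error distances of this kind can be computed from great-circle distances between predicted and actual coordinates. A minimal sketch assuming the haversine formula (the paper does not specify the exact distance computation):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometers.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def mean_error_km(predicted, actual):
    # Mean great-circle error between predicted and actual city coordinates,
    # given as parallel lists of (lat, lon) pairs.
    errors = [haversine_km(p[0], p[1], a[0], a[1]) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)
```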
To establish a benchmark, we compute the minimum attainable error, which is obtained by using an "oracle", i.e., by using the actual distances for all cities. An additional baseline is a "random" one, where we report the average error from ten different random predictions for the test city locations. The latter should function as the maximum "reasonable" error. Our results indicate that limiting the task to a smaller geographical region by using the divisions instead of considering the whole contiguous US leads to better prediction accuracies in general. This can be attributed to the inherent ease of prediction when using cities that are in closer proximity. Furthermore, the errors associated with co-occurrence counts (row i in Table 3) and "and" generation counts (row ii) resemble the errors obtained from random distances, which is expected since both of these counts do not convey a similarity or dissimilarity based on proximity between two cities. The same observation holds, and is even more prominent, for the "far from" generation counts (row v). On the other hand, the "near" and "close to" generation counts (rows iii and iv) have much lower errors compared to the co-occurrence counts and "and"-based generations. This further strengthens the argument that LLMs possess geospatial awareness and reasoning capability. However, due to the sparsity of the generation counts, we do not obtain values that closely align with the predicted distances. Last, directly asking the LLM to predict distances (row vi) yields results that are much closer to the actual distances and far better than the random baseline.
It is important to note that the goal of this task was not to assess the ability of LLMs to predict the geo-coordinates of cities. Instead, our objective is to evaluate LLMs' geospatial reasoning capabilities. One potential use case of our task can be to predict the relative orientation of a city instead of its exact location. We do believe that LLMs have potential for such a use case.
Figure 4 shows the actual locations along with the predicted locations using both actual distances and predicted distances for four cities. On the maps, the green, blue, and red markers represent the actual locations, the predicted locations based on actual distances, and the predicted locations based on predicted distances, respectively. In the case of "Albany, New York" and "Havre, Montana", the predicted locations, whether based on actual or predicted distances, deviate only slightly from the actual values. In the case of "Dallas, Texas", the actual locations align closely with the predicted locations based on actual distances, but differ from the ones predicted using predicted distances. However, for "Indianapolis, Indiana", the predicted locations based on predicted distances align well with the actual values, while those based on the actual distances deviate. These inconsistencies would only be marginally noticeable in the context of city scales if our main objective were the orientation of the city rather than its precise location. While we acknowledge that the predicted locations based on predicted distances differ from those based on actual distances, they are still reasonably close. A high Pearson correlation coefficient of 0.92 between actual distances and predicted distances further showcases the potential of LLMs for such tasks.
In conclusion, our results demonstrate the potential use of LLMs for geospatial reasoning tasks. While it is important to note that the values produced by LLMs may not precisely match the actual values, they still show a remarkable level of similarity and exhibit a high correlation. Thus, LLMs have great potential for supporting humans in geospatial reasoning and analysis tasks with targeted fine-tuning tailored to a certain use case.

CONCLUSIONS AND FUTURE WORK
This work demonstrates notable improvements in LLMs' ability to handle geospatial data, due not only to the increasing size of models, but also facilitated by novel techniques such as instruction tuning. We show that LLMs encode geospatial knowledge, which can be leveraged for tasks that are simple, such as obtaining coordinates and locations for cities by probing those models, or more complex, such as a quantitative understanding of spatial prepositions. All this information can be extracted and utilized using the proper "querying" techniques such as prompting. We demonstrate that LLMs show potential for geospatial reasoning tasks, but further enhancements are needed to meet the desired accuracy and performance levels. Overall, LLMs have come a long way and now exhibit geospatial awareness when generating text.
Future research will focus on examining the practical applicability of LLMs in real-world applications involving geospatial data, as well as utilizing even larger models, and doing so for languages other than English.

Figure 1 :
Figure 1: Distances of cities predicted when contextualized with different prepositions.

Figure 2 :
Figure 2: Heatmaps of the places generated for (a) Albany, New York, and (b) Albany, when contextualized with "near". When state information is provided, the model can effectively disambiguate between the different similarly-named cities.

Figure 4 :
Figure 4: Original and predicted locations based on actual distances and predicted distances for (a) Albany, New York, (b) Havre, Montana, (c) Dallas, Texas, and (d) Indianapolis, Indiana. Green: Actual, Blue: Predicted based on actual distance, and Red: Predicted based on predicted distance.

Table 1 :
Mean error distances in km for coordinate prediction of cities

Table 3 :
Mean error distances in kilometers for geocoordinate predictions of cities from dissimilarity measure using MDS