Location reference recognition from texts: A survey and comparison

A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to the process of recognizing location references in texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and a core step of geoparsing, is lacking. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing them into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references across the world. The results of this evaluation can inform future methodological developments for location reference recognition and can guide the selection of proper approaches based on application needs.


INTRODUCTION
"Location matters, and not just for real estate" [177]. With the rapid development of the Global Navigation Satellite System (GNSS), sensor-rich smart devices (e.g., with inertial sensors, Wi-Fi modules, and cameras), and ubiquitous communication infrastructure (e.g., cellular and 4G networks and Wi-Fi access points), our capability of obtaining location information about moving objects and events in both indoor and outdoor spaces has improved dramatically [159]. This has increased our ability to better understand geospatial processes and to support decision making in contexts ranging from business and entertainment to crisis management [177]. Apart from sensor equipment, natural language texts (e.g., social media posts, web pages, and news stories) are another important source that contains much geospatial information in the form of location references. These location references embedded in texts can take the form of simple place names (or toponyms), or of location descriptions that contain both place names and additional spatial modifiers (e.g., direction, distance, and spatial relationship) [175]. Geoparsing refers to the process of recognizing location references in texts and identifying their geospatial representations. It is an ongoing research problem that has been studied over the past two decades [9,15,80,90,162], and it consists of two steps: (1) toponym recognition, also called location reference recognition, and (2) toponym resolution, also called geocoding, which disambiguates toponyms and identifies their geographic coordinates. Figure 1 illustrates the workflow of geoparsing.

Fig. 1. The general workflow of geoparsing and its two steps.
Geoparsing has traditionally been used for location extraction from formal texts, such as web pages, news, scientific articles, travel blogs, and historical archives [15,177]. However, the drastically increased importance of social media data (SMD) in various domains such as social science, policy making, and humanitarian relief [18,36,75,170] has motivated efforts to extend geoparsing to informal texts [177]. According to Statista, the number of worldwide social network users will reach 4.4 billion by 2025. On average, 500 million tweets and 4.75 billion Facebook posts are sent each day. Formal texts normally do not have location-related metadata, while informal texts, such as tweets, can be geotagged, i.e., a Twitter user can select a location and attach that location to the posted message. However, geotagged tweets are rare: according to Cheng et al. [31], Morstatter et al. [134], and Kumar et al. [100], only 0.42%, 3.17%, and 7.90% of the total number of tweets contain geotags, respectively. In addition, Twitter removed its precise geotagging feature in June 2019, showing only a rough location, e.g., the bounding box of a tagged place rather than a pair of latitude and longitude coordinates. This change could lead to a further decrease in the number of geotagged tweets [85].
The geotagged locations of tweets are not always the same as the locations described in their tweet content either [110].
In a nutshell, it is often necessary to extract location references from unstructured texts. Notably, informal texts, such as tweets, are short, have few or no formatting or grammatical requirements, and can contain uncommon abbreviations, slang, and misspellings, which pose additional challenges for geoparsing [178].
While quite a few studies on geoparsing exist [67,145], we identify two gaps in the literature that motivate this review paper. First, the many possible applications of geoparsing are scattered across individual papers [1,15,57,62] or are only partially reviewed [66,82], and a systematic and more comprehensive summary of these applications is lacking. Consequently, it is difficult for researchers who are new to geoparsing to get a quick overview of these many possible applications. Second, existing review papers on geoparsing, such as [67,124,133,179], focused on the entire workflow of geoparsing (i.e., both of the two steps) rather than on location reference recognition alone (i.e., the first step only). While providing more comprehensive coverage of the topic of geoparsing, existing efforts reviewed only some approaches for the step of location reference recognition. In recent years, many new approaches for location reference recognition have been developed, such as Flair NER [4], NeuroTPR [180], nLORE [51], and GazPNE2 [81]. Given the high importance of location reference recognition in geoparsing (i.e., only those references that are correctly recognized can be geo-located), a review that specifically focuses on the possible and recent approaches for location reference recognition is necessary.
This work aims at filling the two research gaps discussed above. First, we summarize seven typical application domains of geoparsing: geographic information retrieval (GIR) [56,145], disaster management [110,161], disease surveillance [63,158,171], traffic management [76,114,128], spatial humanities [62,153,171], tourism management [25,34,35], and crime management [17,40,176]. Second, we review existing approaches for location reference recognition by categorizing them into four groups: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Noticing that many existing approaches have not been cross-compared on the same datasets, we also conduct experiments to compare and evaluate the 27 reviewed approaches on 26 public datasets. We examine multiple characteristics of the existing approaches, including their performance on formal and informal texts, their performance on different types of locations (e.g., admin units and traffic ways), and their computational efficiency.
The remainder of this paper is structured as follows: In Section 2, we summarize seven typical application domains of geoparsing. In Section 3, we review existing approaches for location reference recognition. We evaluate existing approaches on the same public datasets in Section 4. Finally, we conclude the paper in Section 5 and discuss some potential future directions.

SEVEN APPLICATION DOMAINS OF GEOPARSING
Geoparsing has many possible applications. In this section, we summarize seven typical application domains of geoparsing that are most discussed in the literature. Figure 2 provides an illustration of these domains. GIR: One of the primary applications of geoparsing is geographic information retrieval. Historically, documents have been indexed by subject, author, title, and document type. However, a diverse and large group of information system users (e.g., readers, natural resource managers, scientists, historians, journalists, and tourists) desire geographically-oriented access to document collections, such as by retrieving interesting content about specific geographic locations [24,56,107,129,144,172,189]. For instance, resources in digital libraries can be indexed by the locations contained in the descriptive metadata records associated with the resources, thereby improving users' experience in searching for the resources they need [56]. People look for web pages containing useful information about everyday tasks, such as local merchants, services, and news [24]. The public can consume up-to-date information related to COVID-19 (e.g., disease prevention, disease transmission, and death reports) on Twitter by location [129].
Disaster management: News stories and SMD contain a large volume of historical and real-time disaster information.
Location-enabled SMD can be very helpful for timely mapping of situational information in the aftermath of disasters, such as rescue requests [163,196], resource (e.g., food, clothing, water, medical treatment, and shelter) needs and availability [21,48], and facility status (e.g., building collapse, road closure, broken pipes, and power outage) [22,50,120,156]. With a crisis map, first responders can track the unfolding situation, identify stricken locations that require prioritized intervention [19], and realize optimized real-time resource allocation [163]; government agencies can conduct damage assessment of the disasters in a faster manner [190]; and the public can search for the locations where they can obtain needed resources. By extracting spatiotemporal, environmental, and other information about disaster events from news stories, flood-prone areas can be identified [192], the responsibility of atmospheric phenomena for floods can be understood [20], the spatial and temporal distributions of natural disasters over a long period can be analyzed [113], and the evolution of disasters (e.g., the phases of preparedness, impact, response, and recovery) can be tracked [86,181,182].
Disease surveillance: Scientific articles, historical archives, news reports, and social media contain detailed information about disease events, such as where a disease was first reported and how it spread spatiotemporally. Mining geographic locations and other related information about disease events can help track diseases [32,63,135,139,158,171], perform early warning and quick response [95], and understand the mechanisms underlying the emergence of diseases [12,91]. For example, geoparsing historical archives (e.g., the annual US Patent Office Reports 1840-1850 and the Registrar General's Reports) can help track the spread of the potato disease 'late blight' in the 19th-century United States [171] and understand the relationship between cholera-related disease and place names during Victorian times [135]. Scientific articles were geoparsed to analyze the demographic, environmental, and biological correlates of the occurrence of emerging infectious diseases at a global scale [12,91]. Social media can also reflect the movement of the public and their feelings during pandemics through geotags or locations mentioned in texts. Location-enabled tweets were applied to analyze the mental health status of the public after the occurrence of COVID-19 [79,195], to track and visualize the spread and diffusion of COVID-19 [16], and to reveal human mobility patterns [87,89].
Traffic management: Twitter users report near-real-time information about traffic events (e.g., crashes and congestion). Detecting traffic events, their precise locations, and other related information from tweets [3,13,60,70,160,167] is important for an effective transportation management system. The detected traffic events can also support urban policy making [38], for example by helping drivers avoid risk zones and choose the fastest and safest routes [10], helping the transportation management sector reduce fatalities and restore traffic flow as quickly as possible [10], predicting future traffic jams [11], and improving road safety by recognizing high-risk areas [128]. In this way, Twitter users acting as social sensors can complement existing physical transport infrastructure (e.g., video cameras and loop detectors) in a cost-effective manner, which is especially important in developing countries where resources are limited.
Spatial humanities: 'Spatial turn' describes a general movement, observed since the end of the 1990s, emphasizing the reinsertion of place and space in the humanities [183]. Digitizing and geoparsing large historical textual collections, such as books, reports, and novels, creates new ways for research in the humanities (e.g., archaeology, history, and literature) [47,61,62,68,77,130,135,171]: to understand the historical geographies of nineteenth-century Britain and its relationships with the wider world [61], to identify the significance of specific commodities in relation to particular places and times [77], to analyze the correspondence between eighteenth-century aesthetic theory and the use of the terms beautiful, picturesque, sublime, and majestic in contemporaneous and later accounts of the Lakes region [47], and to reveal the spatial structure of a narrative in fictional novels [130].
Tourism management: According to a 2016 prediction by Statista, there were going to be 32 million active bloggers in the US alone by 2020. Among all active blogs, travel is rated as one of the top five topics shared by bloggers.
Travel blogs contain a wealth of information about visited places, organized as bloggers' experiences and insights as well as their perceptions of these places [74]. These narratives reflect bloggers' behavior and interaction with places as well as the relationships among the places. Geoparsing travel blogs is helpful for understanding places [73], such as finding their features and related activities, and can help describe a place with tourism attributes to support tour planning [73,74,97,194]. Applications include helping travelers choose preferred places and visit them in an appropriate order at a proper time, and supporting wayfinding given the spatial relations of places [74].
Crime management: Many countries do not make crime data available to their citizens [17] or provide only coarse-grained details, such as the total number of thefts in a district or a province. According to the Crime Information Need Survey [17], around 78.3% of respondents in Indonesia agreed that crime information should be available to the public.
The needed information includes crime type, perpetrator, victim, time, and, very importantly, location. Meanwhile, information related to crime is often scattered across news and social media. Mining and gathering crime-related information from these text-based sources can be useful for informing the public and may even help predict and prevent some crimes [14,39,40,149,155,165]. In particular, geoparsing can help extract the location information of crimes, which can help residents choose places to live and help travelers avoid certain unsafe places [17].
Different applications have distinct requirements for location reference recognition approaches. For instance, informal texts (e.g., tweets) are the main source for emergency response, from which the real-time locations of sub-events caused by disasters can be extracted, while scientific articles are the main source for analyzing the mechanisms underlying the emergence of diseases, from which the spatiotemporal evolution of diseases across the globe can be extracted. GIR needs only coarse-grained geospatial information, such as a city, while traffic management requires the fine-grained locations (e.g., a street) of traffic events; geoparsing historical documents that contain billions of words requires a fast processing workflow. Therefore, to guide the selection of proper approaches for location reference recognition based on application needs, it is necessary to examine the characteristics of existing approaches, which we do in Section 4.

A SURVEY OF EXISTING APPROACHES
In this section, we review existing approaches for location reference recognition. In subsection 3.1, we review individual approaches by categorizing them into four groups, and in subsection 3.2, we review existing comparative studies and differentiate our current review from the existing studies.

Approaches for location reference recognition
In the existing literature, Leidner and Lieberman [104], Monteiro et al. [133], and Purves et al. [145] identified three types of approaches for location reference recognition: rule-based, gazetteer matching-based, and statistical learning-based. However, many studies, such as [57,80,105,123], used a combination of different approaches to compensate for each other's shortcomings. Therefore, in this review, we add a fourth type, hybrid approaches, which combine two or all three of these types, and we use these four types to organize our review on location reference recognition. We show this classification schema in Figure 3. Several studies used only rules to extract location references. For instance, Giridhar et al. [60] used road-traffic-related tweets to detect and locate point events, such as car accidents. Specifically, a set of regular expressions (REs) was defined according to the composition of nouns, determiners, adjectives, cardinal numbers, conjunctions, and possessive endings.

Furthermore, to decrease false positives, grammar-based rules were implemented based on spatial prepositions, such as in, at, between, and near. Zou et al. [196] analyzed rescue requests on Twitter during Hurricane Harvey. The authors first manually annotated the tweets that were asking for help. Then, they used a rule-based method to recognize location references in the tweets; more specifically, they assumed that the formal description of an address in the United States follows regular, recognizable patterns. Although many studies classify rule-based approaches as one category [6,104,133], purely rule-based approaches are rare. All the rule-based approaches discussed in [133] are, in fact, hybrid approaches. This is likely because approaches that rely on linguistic patterns alone are ineffective [162]. It remains a challenge to define rules in a complete and robust manner that can account for all possible occurrences of location references in texts, especially in microblogs with dramatic variation in writing styles and weak grammar [151]. However, a set of simple rules can be used to enhance gazetteer matching and statistical learning-based approaches, as introduced in the following subsections.
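As a toy illustration of such rules, the sketch below (our own minimal pattern, not the rule set of any cited system) treats a capitalized phrase following a spatial preposition as a location candidate:

```python
import re

# Hedged sketch: a single hand-written rule, far from a complete rule set.
# A capitalized phrase after a spatial preposition becomes a candidate.
SPATIAL_PREPS = r"(?:in|at|near|between|on)"
PATTERN = re.compile(
    rf"\b{SPATIAL_PREPS}\s+((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
)

def extract_candidates(text):
    """Return capitalized phrases that follow a spatial preposition."""
    return PATTERN.findall(text)

print(extract_candidates("Flooding reported near Buffalo Bayou in Houston"))
# → ['Buffalo Bayou', 'Houston']
```

Real systems combine many such patterns with POS tags and stop lists to curb the false positives that a lone pattern inevitably produces.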

Gazetteer matching-based approaches.
A gazetteer is a dictionary of place names associated with geospatial information (e.g., place types and geographic coordinates) and some additional information such as population size, administrative level, and alternative names. Gazetteers play important roles in location reference recognition in many
studies. GeoNames is the most widely used gazetteer, and OpenStreetMap (OSM), in a broad sense, can be considered a gazetteer as well. There are 12,255,028 and 23,876,956 places in GeoNames and OSM, respectively. Figure 4 illustrates the point density map of the places in OSM and GeoNames. In gazetteer matching-based approaches, the n-grams of a text are first matched against a gazetteer, and the matches are then filtered or disambiguated with a couple of heuristics. Gazetteer matching-based approaches still face two main challenges. The first is that many location references appearing in texts are missing from gazetteers for various reasons, such as name variation (e.g., 'South rd' for 'South road' and 'Frankfurt airport' for 'Frankfurt international airport') and data incompleteness (e.g., the absence of 'Hidden Valley Church of Christ' from a gazetteer) [57]. Second, gazetteer matching-based approaches often run into ambiguity issues. For instance, the names 'Washington', 'MO', 'South Wind', and '1 ft' all exist in gazetteers, but can also refer to other types of entities. These are called geo/non-geo ambiguities, while geo/geo ambiguities refer to the situation in which different spatial locations share the same name, such as Manchester, NH, USA versus Manchester, UK. For simplicity, we use ambiguities and ambiguous to refer to geo/non-geo ambiguities by default, and the full name geo/geo ambiguities for the second situation. The main focus of gazetteer-based approaches is often to overcome these two challenges by using heuristics to perform disambiguation (to increase precision) and by including place name variants to expand the used gazetteer (to increase recall). Many studies used gazetteer matching-based approaches to recognize location references in texts [3,6,12,15,33,41,52,126,128,140,142,162,168,169,189]. One of the earliest geoparsing approaches was proposed by Woodruff and Plaunt [189] to support georeferenced document indexing and
retrieval. A gazetteer containing around 120,000 places in California was first built from the US Geological Survey's Geographic Names Information System (USGS 1985) and the land use data from the US Geological Survey's Geographic Information Retrieval and Analysis System (GIRAS).
The place names in a document were identified by matching the text's n-grams containing non-stop words against the gazetteer. If a token had no matches in the gazetteer, it was depluralized (e.g., 'valleys' to 'valley') and rematched with the gazetteer. Identified place names were then geocoded to determine the geographic scope of the document.
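The matching-plus-depluralization loop can be sketched as follows; the tiny gazetteer and sentence are illustrative, not Woodruff and Plaunt's actual data:

```python
# Toy gazetteer; real ones hold millions of entries.
GAZETTEER = {"death valley", "san francisco", "valley"}

def match_ngrams(tokens, gazetteer, max_n=3):
    """Greedily match the longest n-grams against the gazetteer,
    falling back to a depluralized form ('valleys' -> 'valley')."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + n]).lower()
            if gram in gazetteer:
                found.append(gram)
                i += n
                break
            if gram.endswith("s") and gram[:-1] in gazetteer:
                found.append(gram[:-1])
                i += n
                break
        else:
            i += 1
    return found

print(match_ngrams("the valleys near San Francisco".split(), GAZETTEER))
# → ['valley', 'san francisco']
```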
Amitay et al. [15] developed a system, named Web-a-Where, for recognizing and geocoding continents, countries, states, and cities as well as their abbreviations in web pages. A gazetteer was created by collecting about 75,000 place names across the world from different data sources: USGS, World-gazetteer.com, UNSD, and ISO 3166-1. The system first extracted candidate place names in a given page by matching against the gazetteer. Then, four heuristics
were sequentially used to disambiguate and geocode the candidate place names, such as the vicinity of two candidate places (e.g., Chicago, IL) and the population of places. The approach was evaluated on 600 web pages containing over 7000 place names. Clough [33] proposed geoparsing web pages. In the recognition step, candidate place names are first identified by matching against gazetteers and are then filtered using stop words and context cues; for example, person names are filtered with a simple heuristic pattern of a title followed by a candidate name (e.g., Mr. Sheffield), where the candidate place name also appears in a dictionary of person names. The gazetteers used include the Ordnance Survey 1:50,000 Scale Gazetteer for the UK (OS), the Seamless Administrative Boundaries of Europe dataset (SABE), and the Getty Thesaurus of Geographic Names (TGN). Pouliquen et al. [142] proposed geoparsing multilingual texts. Candidate place names are first identified by matching against a multilingual gazetteer and are then disambiguated through a dictionary of person names (e.g., 'George Bush' and 'Tony Blair') and stop words (e.g., 'And', 'Du', 'Auch') in a multilingual context.
The multilingual gazetteer is created from three sources: the Global Discovery database of place names (Global Discovery 2006), the multilingual KNAB database (KNAB 2006), and a European Commission internal document. The approach was tested on 162 newspaper stories in five languages (i.e., German, English, Spanish, French, and Italian) from the Europe Media Monitor.
Gazetteer matching-based approaches were also used to extract locations from tweets. For instance, Paradesi [140] proposed TwitterTagger for geoparsing tweets. It matched the noun phrases of a tweet text against the entries of a gazetteer (the USGS database) and then disambiguated the matched entries with two heuristics. The first was to check whether spatial indicators (e.g., 'in' and 'near') were used before a noun phrase. The second was to check whether other users used a spatial indicator before the same noun phrase in their tweets. Geo/geo ambiguities were removed by calculating the distance between the location of a place name in a tweet and the location of the user who posted the tweet, as well as the locations of other users who mentioned the same place name. The approach was evaluated on 2000 annotated tweets. Middleton et al. [126] proposed a multilingual geoparser for tweets, named geoparsepy. To overcome place name variation issues, a set of heuristics was applied to expand OSM place names. To deal with abbreviations, a multilingual corpus of street and building types from OSM was used to compute obvious variants of common location types (e.g., 'Southampton Uni' for 'Southampton University'). To overcome the ambiguity issue, uni-gram location names that are non-nouns, such as 'ok' and 'us', which can refer to locations or other types of entities depending on their POS tag, are filtered using a multilingual WordNet corpus lookup. Location phrases are then filtered using a multilingual stop word corpus. To remove geo/geo ambiguities, a confidence score is calculated for each matched entry in the gazetteer based on several evidential features, such as the other location references in the tweet, the admin level, and the geotag of the tweet if available. de Bruijn et al.
[41] proposed a geoparsing algorithm (named TAGGS) that employs both metadata and the contextual spatial information of groups of tweets referencing the same location with regard to a specific disaster type. It matches the uni- and bi-grams of a tweet text against the GeoNames gazetteer. The found candidates are then filtered with several heuristics, such as discarding candidates among the 1000 most common words. Some studies [19,138] leveraged gazetteer matching-based semantic annotators to realize their own geoparsing tools. For instance, Nizzoli et al. [138] employed TagMe [52] to identify the entities of an input text and to link them to the corresponding entities in a knowledge graph (i.e., DBpedia). Then, they traversed the knowledge graph to expand the available information for the geoparsing task. Finally, they exploited all available information to learn a regression model that selects the best entity in the knowledge graph for annotated places in the text.
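Abbreviation-expansion heuristics of the kind used by geoparsepy (e.g., deriving 'Southampton Uni' from 'Southampton University') can be sketched as below; the variant table is our own illustration, not the OSM-derived corpus the tool actually uses:

```python
# Illustrative abbreviation table for common location-type words.
TYPE_VARIANTS = {"university": ["uni", "univ"], "road": ["rd"], "street": ["st"]}

def expand_variants(name):
    """Return the lowercased name plus variants with each
    location-type word replaced by its known abbreviations."""
    tokens = name.lower().split()
    variants = {" ".join(tokens)}
    for i, tok in enumerate(tokens):
        for abbr in TYPE_VARIANTS.get(tok, []):
            variants.add(" ".join(tokens[:i] + [abbr] + tokens[i + 1:]))
    return variants

print(sorted(expand_variants("Southampton University")))
# → ['southampton uni', 'southampton univ', 'southampton university']
```

Adding all such variants to the gazetteer index lets shortened mentions match without any fuzzy comparison at query time.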
Some studies, such as [3,6,22,128,167,191,192], focused only on local events whose geographical scope is known, such as floods or traffic accidents that happened in a certain city. They would therefore normally use a local gazetteer that contains only the places in a certain region, which can dramatically mitigate both geo/non-geo and geo/geo ambiguities. Although the proposed geoparsing approaches are not globally applicable, they are effective in dealing with local events. For instance, Al-Olimat et al. [6] proposed a Location Name Extraction tool (LNEx), which used n-gram statistics and location-related dictionaries to handle abbreviations and to automatically filter and augment the place names in the OpenStreetMap gazetteer (handling name contractions and auxiliary contents). Ahmed et al.
[3] used tweets to monitor traffic congestion in real time. Specifically, tweets related to traffic congestion are first detected using supervised and unsupervised machine learning techniques, and the road names in the tweets are then extracted by matching against a list of road names in the city of Chennai. The Jaro-Winkler metric is used to calculate the similarity between the n-grams in tweets and the road names in the list to overcome the challenge of place name variants. 1551 congestion-related tweets were used to detect congestion situations in Chennai in two periods lasting 7 months in total. Milusheva et al. [128] used traffic-related tweets to derive the locations of road traffic crashes in Nairobi, Kenya, for the purpose of road safety improvements. Specifically, they applied a machine learning model to capture the occurrence of a crash and developed a gazetteer matching-based geoparsing algorithm to identify its location. A gazetteer of landmarks (e.g., roads, schools, and bus stops) for the five counties that constitute the Nairobi metro area was created from OpenStreetMap, GeoNames, and Google Places. The location of a crash is then determined by matching the n-grams of the tweet against the entries in the gazetteer.
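Similarity-based matching of this kind can be sketched as follows. Since Jaro-Winkler is not in the Python standard library, this sketch substitutes difflib's ratio to illustrate the same idea, and the road names are illustrative stand-ins:

```python
from difflib import SequenceMatcher

# Illustrative road-name list standing in for a city-level gazetteer.
ROAD_NAMES = ["Anna Salai", "Mount Road", "Poonamallee High Road"]

def best_match(mention, names, threshold=0.8):
    """Return the listed name most similar to the mention,
    or None if no similarity reaches the threshold."""
    best, score = None, 0.0
    for name in names:
        s = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if s > score:
            best, score = name, s
    return best if score >= threshold else None

print(best_match("Mount Rd", ROAD_NAMES))
# → Mount Road
```

The threshold trades precision against recall: lowering it catches noisier variants but admits more false matches.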
It is simple to implement gazetteer matching-based approaches and they can easily adapt to a multilingual context.
Moreover, they are effective in certain applications, such as those whose geographic scope is limited to a small region (e.g., a city) or those that require only coarse-grained locations, such as country names. However, it remains a challenge to build a general and globally applicable approach for location reference recognition using gazetteer matching and simple heuristics alone, since name variants and geo/non-geo ambiguities are ubiquitous in natural language texts. To overcome this challenge, many studies combined gazetteer matching with rules and/or statistical learning so that the techniques compensate for each other's shortcomings, as introduced in the following subsections.

Statistical learning-based approaches.
Statistical learning-based approaches fall into two groups: those that reuse or retrain general-purpose NER models, and those that train dedicated place name extraction (PNE) models. In the following, we discuss these two groups respectively.
Learning-based NER: Location reference recognition can be considered a subtask of NER, which has been extensively studied. Therefore, many studies [58,70,72,93,110,120,173] used existing statistical NER models, or retrained them, to extract location references from texts. For instance, Lingad et al. [110] used OpenNLP and TwitterNLP to extract location references such as cities and countries from tweet content and user profiles; TwitterNLP and OpenNLP were also retrained and evaluated using 10-fold cross-validation in their study. Mao et al. [120] proposed mapping near-real-time power outages from tweets using a retrained NeuroNER model [44]. Suat-Rojas et al. [167] utilized a retrained spaCy NER model to detect and analyze traffic accidents from Spanish tweets in a city in Colombia.
Recently, many deep learning-based NER models have also been proposed. For example, Limsopatham and Collier [109] proposed recognizing named entities in tweets by enabling a BiLSTM to automatically learn orthographic features using both character embeddings and word embeddings. Akbik et al. [5] proposed Flair, an NLP tool that uses contextual string embeddings for sequence labeling tasks, such as part-of-speech (POS) tagging and NER. Qi et al. [147] proposed Stanza, a multilingual NLP toolkit with a neural NER component.
Learning-based PNE: Apart from using or retraining existing NER models, many studies trained their own models for location reference recognition using machine learning [137,154,164] and deep learning models [9,27,30,99,116,122,174,191]. For instance, Nissim et al. [137] trained the Curran and Clark (C&C) maximum entropy tagger [37] to recognize location references in Scottish historical documents, using the built-in C&C features, including morphological and orthographic features, information about the word itself, POS tags, named entity tag history (with a window size of 2), and contextual features. The model was evaluated on 648 Scottish historical documents, containing 10,868 sentences and 5682 places. Kumar and Singh [99] implemented a multi-channel convolutional neural network (CNN) architecture to extract location references from tweets. The model was evaluated on 5107 earthquake-related tweets with 6690 place names using 10-fold cross-validation. Xu et al. [191] proposed DLocRL, a deep learning pipeline for fine-grained location recognition and linking in tweets. Specifically, they first used a BiLSTM-CRF to train a Point of Interest (POI) recognizer. Then, given an input pair 〈POI, Profile〉, a linking module was trained to judge whether the location profile corresponds to the POI, where the profile is an entry in a POI dictionary. The approach was evaluated on the Singaporean national Twitter dataset first used in [105], containing 3611 tweets and 1542 POIs.
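Sequence taggers such as the BiLSTM-CRF models above emit one BIO label per token; a minimal decoder that turns such output into location references (the tokens and tags below are invented for illustration) looks like this:

```python
def decode_bio(tokens, tags, target="LOC"):
    """Collect token spans labeled B-<target>/I-<target> into strings."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == f"B-{target}":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == f"I-{target}" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Flooding", "in", "New", "Orleans", "and", "Baton", "Rouge"]
tags = ["O", "O", "B-LOC", "I-LOC", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, tags))
# → ['New Orleans', 'Baton Rouge']
```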
Cadorel et al. [27] proposed extracting a property's location and neighborhood from French housing advertisements by recognizing place names and retrieving relationships between them. Specifically, a BiLSTM-CRF network with a concatenation of several text representations (CamemBERT [121], Flair, and Word2Vec [127]) was used to extract place names.
To mitigate the effort of manually annotating a large training dataset, semi-supervised approaches have been developed. For instance, Wang et al. used a semi-supervised method to train a toponym recognizer with limited labeled data; the approach was evaluated on three Chinese NLP datasets (i.e., WeiboNER, Boson, and MSRA). Khanal and Caragea [96] used a multi-task learning setting to augment the learning of fine-grained location identification. The three tasks, all related to crisis events, are key-phrase identification, eyewitness-account classification, and humanitarian category classification. The learning is conducted on one of three popular transformer-based models: BERT [45], ALBERT [103], and RoBERTa [115]. Several public datasets for the three tasks were utilized in the multi-task learning. The proposed approach was evaluated on two disaster-related Twitter datasets that were used in Middleton et al. [126], which contain 1,907 and 1,762 tweets, respectively.
Given abundant annotated data, statistical learning-based approaches can automatically recognize location references based on contextual cues and the intrinsic features of location references, without requiring additional expert knowledge or gazetteers. However, a large number of labeled training sentences are often not available, making these approaches difficult to use in many situations [69]. Furthermore, deep learning-based models normally take much more time to recognize place names from texts than rule-based and gazetteer matching-based approaches.
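Regardless of the specific model, sequence-labeling NERs output token-level tags that must be decoded into location spans. A minimal sketch of this decoding step, assuming simple BIO tags over pre-tokenized text (the function and tag names are illustrative, not from any specific tool above):

```python
def bio_to_spans(tokens, tags, target="LOC"):
    """Collect contiguous B-/I- tagged tokens of the target entity
    type into surface-form location spans."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{target}":
            if current:                       # close a previous span
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{target}" and current:
            current.append(token)             # continue the open span
        else:
            if current:
                spans.append(" ".join(current))
                current = []
    if current:                               # span reaching the end
        spans.append(" ".join(current))
    return spans
```

For instance, `bio_to_spans(["Flood", "in", "New", "Orleans"], ["O", "O", "B-LOC", "I-LOC"])` yields `["New Orleans"]`.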

Hybrid approaches.
Every single technique has its own drawbacks. Thus, researchers have proposed fusing different techniques to get the best of all [23,49,78,105,118,186,193]. Hybrid approaches can be further divided into four types based on how they combine the previous three approaches: fusing rules and gazetteers; fusing rules and statistical learning; fusing gazetteers and statistical learning; and fusing rules, gazetteers, and statistical learning.
Fusing rule and gazetteer: Many studies [118,122,131,132,143,182,186] fused rules and gazetteers to overcome each other's shortcomings. Manually defined rules are fragile, and the detected location references can thus be further verified against gazetteers. Inversely, rules can mitigate the two challenges faced by gazetteer matching: they can help remove the ambiguities of location references detected by gazetteer matching, and they can recognize references that are not included in gazetteers. For instance, Pouliquen et al. [143] proposed identifying cities and countries from newspapers in multiple languages. Location references are recognized by matching texts' n-grams written in upper case against a multi-language gazetteer, named the Global Discovery gazetteer. The matches are then filtered by stop words and person names. To recognize morphological variants of places, regular expressions are used to list all possible suffixes and suffix combinations of location references. By doing so, some places unseen in gazetteers can be recognized, such as 'Lontoolaisen', because it consists of 'Lontoo', which is in the gazetteer, and the suffix 'laisen'. To remove geo/geo ambiguities, several heuristics are utilized, such as discarding small places in the gazetteer, leveraging the importance of places, and determining the country of an article. The approach was evaluated on 28 texts with 1,650 places in 8 languages (e.g., English, Spanish, and Russian). To help understand the origins, mutations, and geospatial transmission patterns of viruses, such as influenza, rabies, and Ebola, Weissenbacher et al.
[186] presented a geoparsing system for research articles related to phylogeography. GeoNames was first searched to detect location references in articles, and then a black-list (e.g., 'How', 'Although', 'Gene', and 'Body') and a set of rules were created to remove noisy entities found in GeoNames. Malmasi and Dras [118] proposed detecting location references in tweets. A POS rule-based tree splitting method is first used to extract noun phrases, and the n-grams of the noun phrases are then matched with the entries of GeoNames. Dutt et al. [49] presented SAVITR, a system that geoparses and visualizes tweets during emergencies. They used a POS tagger to find proper nouns, and then used regular expressions to mitigate the ambiguity of proper nouns based on the prefix and suffix words (e.g., 'road', 'south', and 'city') of place names. Last, the phrases extracted by the above methods are verified and geocoded using a gazetteer (i.e., GeoNames or OSM) for India. Martínez and Pascual [122] presented LORE, a knowledge-based model that captures location references from English and Spanish tweets. First, bi-grams and uni-grams in the tweets are matched with entries in the GeoNames gazetteer and then filtered by heuristics. Second, linguistic patterns involving location-indicative words (e.g., 'city' and 'street'), location markers (e.g., 'north' and '10km'), and POS tags are derived to recognize location expressions, such as '25 miles NW of London City'. They derived the linguistic patterns from 500 English tweets and 100 Spanish tweets, and then used 900 English tweets and 500 Spanish tweets to test LORE.
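The core of such rule-and-gazetteer pipelines is n-gram matching against a gazetteer followed by rule-based filtering. A simplified sketch, assuming a small in-memory gazetteer and an illustrative blacklist of noisy entries in the style of [186] (real systems match against large gazetteers such as GeoNames and apply far richer rules):

```python
def match_gazetteer(tokens, gazetteer, blacklist, max_n=3):
    """Greedy, longest-first n-gram matching against a gazetteer,
    filtered by a blacklist of known false positives."""
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer and phrase not in blacklist:
                matches.append(phrase)
                i += n                 # consume the matched tokens
                break
        else:                          # no n-gram matched at position i
            i += 1
    return matches
```

Matching longest n-grams first prevents 'New York' from being reported as just 'York', and the blacklist drops gazetteer entries (such as 'Gene' above) that are almost always ordinary words in text.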
Fusing rule and statistical learning: Statistical learning models might not generalize well due to limited training samples, and manually defined rules can be added to boost the performance of the trained models, e.g., by correcting evident errors. For instance, Acheson and Purves [2] proposed geoparsing scientific articles in PDF form by first recognizing candidate location references using Stanford NER and then filtering the candidates using rules, such as including candidates containing 'University' or 'Institute' and rejecting candidates containing 'Inc' or 'GmbH'. These rules were derived from observations on the training sets. The Google Geocoding API was then used to obtain the spatial representation of the detected location references. The approach was evaluated on two article corpora in the domains of orchards and cancer, containing 150 and 200 articles, respectively. Das and Purves [38] proposed detecting traffic events (e.g., traffic accidents and congestion) in India by using tweets. First, tweets are classified as traffic relevant or not using a supervised model. A hybrid method is then used to recognize location references in traffic-relevant tweets. Finally, the location references are geocoded using the OSM Nominatim API. Specifically, the approach combines the location references detected by Stanford NER, a retrained OpenNLP, and a rule-based system involving spatial indicators (e.g., 'in', 'at', and 'near'), POS tags, and 85 words of place categories (e.g., 'hospital', 'road', and 'clinic').
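The kind of post-filtering described for Acheson and Purves [2] can be sketched as follows; the cue words are the illustrative examples mentioned above, not the authors' full rule set, and the `low_confidence` parameter is an assumption added for illustration:

```python
ACCEPT_CUES = {"University", "Institute"}   # cue words mentioned above
REJECT_CUES = {"Inc", "Inc.", "GmbH"}

def filter_candidates(candidates, low_confidence=()):
    """Post-filter NER candidates with simple lexical rules: drop
    candidates containing a reject cue (likely company names), and
    rescue low-confidence candidates only when they contain an
    accept cue (campus names often anchor a usable location)."""
    kept = []
    for cand in candidates:
        words = set(cand.split())
        if words & REJECT_CUES:
            continue
        if cand in low_confidence and not words & ACCEPT_CUES:
            continue
        kept.append(cand)
    return kept
```

Rules like these are cheap to apply after any statistical recognizer and, as the text notes, are typically derived from error patterns observed on the training set.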
Fusing gazetteer and statistical learning: In this type of hybrid approach, gazetteers are used in two main ways: (1) to combine the detection results of statistical learning models with gazetteer matching; (2) to use gazetteer matching results (e.g., whether an n-gram is in a gazetteer or not) as input features for statistical learning models. Examples of the first way are [56,71,78,105]. For instance, to improve users' experience in searching for needed resources in digital libraries, Freire et al. [56] proposed geoparsing descriptive metadata records associated with digital resources.
Initial location references are recognized by matching tokens of records with candidate entries in GeoNames. A Random Forest classifier was then trained to disambiguate and link the initial location references to the final entry. Li and Sun [105] proposed recognizing POIs in tweets. Candidate POIs in tweets were first extracted by matching against a POI inventory, which was constructed from check-in data in Foursquare. A trained time-aware POI tagger based on CRF was then utilized to remove the ambiguity of the candidates based on the contextual cues in the text. Hoang and Mothe [78] combined the detection results of multiple publicly available tools, such as Ritter's tool [151], the GATE NLP framework [23], and Stanford NER, and then filtered the results using DBpedia. Different configurations of the NER tools and DBpedia were tested on Ritter's dataset [151] and the MSM2013 dataset [28]. Examples of the second way include [51,88,117,141,185]. For instance, Inkpen et al. [88] trained three CRF models for recognizing city, province/state, and country mentions based on manually defined features, including gazetteer features (whether a phrase is in GeoNames or not). The models are intended not only to detect location references in tweets but also to categorize them into the three types. The models were evaluated by using 10-fold cross-validation on 6,000 tweets, containing 1,270 countries, 772 states or provinces, and 2,327 cities. To support viral phylogeographic studies, Weissenbacher et al. [185] proposed recognizing location references in research articles pertaining to virus-related GenBank records by using a CRF model.
The CRF model uses lexical (i.e., POS tags), semantic, and gazetteer features. The proposed approach was evaluated on the same dataset as Weissenbacher et al. [186]. Fernández-Martínez and Periñán-Pascual [51] proposed nLORE, a BiLSTM-CRF architecture for location reference recognition, exploiting both the linguistic and gazetteer features of LORE [123]. The model was trained on 7,000 tweets and then tested on 1,063 tweets.
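In the second way, gazetteer membership is typically injected as a per-token feature alongside lexical cues. An illustrative feature function, assuming a token window of one on each side (the feature names are assumptions for illustration, not the exact feature sets of [88] or [185]):

```python
def token_features(tokens, i, gazetteer):
    """Build a feature dict for token i, combining lexical cues with
    gazetteer-membership flags, as would be fed to a CRF tagger."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "prev.word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # gazetteer features: is the token, or the bigram it starts,
        # an entry in the gazetteer?
        "in.gazetteer": word in gazetteer,
        "bigram.in.gazetteer": i < len(tokens) - 1
            and f"{word} {tokens[i + 1]}" in gazetteer,
    }
```

The gazetteer flags let the model treat 'Reading' in 'near Reading' differently from the verb 'reading', while the contextual features still allow it to recognize names missing from the gazetteer.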
Fusing rule, gazetteer, and statistical learning: Some studies combined all three techniques for location reference recognition [48,59,80,81,106,117]. For instance, Gelernter and Zhang [59] proposed understanding five important aspects of need-tweets and availability-tweets during disasters, including what resource (e.g., water, food, shelter, and medicines) and what quantity is needed/available, the geographical location of the need/availability, and who needs/is providing the resource. With regard to geoparsing, the authors improved their previously proposed system, Savitr [49], by combining the location references detected by spaCy and a rule system, and then filtering the location references through a gazetteer. More recently, we proposed two place name extractors for tweets. The first extractor is named GazPNE [80], a neural classifier trained on place names from OpenStreetMap in the regions of the US and India and on non-place names synthesized by rules. Because GazPNE still suffers from ambiguity issues due to its limited use of context information, we developed a second and more robust approach, GazPNE2 [81]. It utilizes two pretrained transformer models, BERT and BERTweet [136], to disambiguate the detected location references, and achieved an improved F1 score of 0.8 on 19 public Twitter datasets.

Comparative studies
In addition to individual studies that focused on developing new methods, researchers have also conducted experiments to compare existing methods on the same datasets. Liu et al. [112] created a medium-scale corpus of locative expressions from multiple social media sources, including the TellUsWhere corpus [187], two sets of micro-blog posts from Twitter, comments from YouTube, forum and blog posts from tier one of the ICWSM-2011 Spinn3r dataset, Wikipedia, and documents from the British National Corpus [26]. They then compared the performance of several location reference recognition models over these seven corpora: Locative Expression Recogniser (LER) [111], retrained Stanford NER, pretrained Stanford NER, GeoLocator [57], UnLockText, and Twitter NLP. The results show that the pretrained Stanford NER achieves the best overall performance. Gritta et al. [67] evaluated the performance of five geoparsers (GeoTxt, Edinburgh Geoparser [68], Yahoo! PlaceSpotter, CLAVIN, and TopoCluster [42]) on two datasets: Local-Global Lexicon (LGL) [108] and WikToR, which was programmatically created by the authors. For location reference recognition, GeoTxt uses Stanford NER, Edinburgh Geoparser uses LT-TTT2, TopoCluster uses Stanford NER, and CLAVIN uses Apache OpenNLP. The evaluation results showed that Stanford NER performs the best in location reference recognition, while Edinburgh Geoparser and CLAVIN perform the best in geocoding. Wang and Hu [179] developed an extensible and unified platform for evaluating geoparsers, named EUPEG, which enables direct comparison of nine geoparsers on eight public corpora: LGL, GeoVirus [64], TR-News [93], GeoWebNews [65], WikToR [67], GeoCorpora [177], Hu2014 [83], and Ju2016 [92]. The compared geoparsers include GeoTxt, Edinburgh Geoparser, TopoCluster, CLAVIN, Yahoo!
PlaceSpotter, CamCoder [64], which uses spaCy NER for location reference recognition, DBpedia Spotlight [125], and two systems that use Stanford NER and spaCy NER for location reference recognition, respectively, with population-based heuristics for disambiguation in the geocoding step. Won et al. [188] evaluated the performance of five NERs, as well as voting systems that combine them, in extracting place names from two collections of historical correspondence, the Mary Hamilton Papers and the Samuel Hartlib collection. The NERs include NER-Tagger [101], Stanford NER, spaCy, Edinburgh Geoparser, and Polyglot-NER [7]. The results showed that although the individual performance of each NER system was corpus dependent, an ensemble combination can achieve consistent precision and recall, outperforming the individual NER systems. At the International Workshop on Semantic Evaluation 2019, a task for toponym resolution in scientific articles was launched; the evaluation results were presented in [184]. Several systems were evaluated on a corpus of 150 full PubMed articles (105 articles for training and 45 for testing), containing in total 8,360 toponyms. In the subtask of toponym recognition, all systems except one adopted deep recurrent neural networks. The highest F1 score was achieved by the system proposed by a team from Alibaba Group, which adopted a BiLSTM-CRF architecture trained on OntoNotes 5.0, CoNLL13, and weakly labeled training corpora. All systems relied on handcrafted features for toponym resolution, including the lexical context of the toponyms and their importance (e.g., population), and a gradient boosting algorithm performed the best.
There are two major differences between this study and the aforementioned comparative studies. First, the existing comparative studies focused on the entire workflow of geoparsing, while we focus on a narrower topic, i.e., location reference recognition, and provide a deeper review and comparison of methods on this topic. Second, our comparative experiments (presented in the following section) are more comprehensive: we used more datasets (26 datasets, containing 39,736 places across the world) and compared 27 different approaches. In the following, we present the results from the comparative experiments.

COMPARISON OF EXISTING APPROACHES

Methods
To inform future methodological developments for location reference recognition and help guide the selection of proper approaches based on application needs, we examine numerous characteristics of existing approaches for location reference recognition. We use or implement the 27 most widely used approaches, including both general NERs and location-specific approaches. Note that we do not include several approaches, such as LNEx [6], GazPNE [80], and SAVITR [49], since they can only be applied to a local region rather than the entire globe, while the place names in our test datasets are spread across the world. The examined approaches are described below:
• Stanza [147]: It is a Python NLP package and includes an NER tool, which was built on BiLSTM and CRF.
We kept the entities of LOC, FAC, and GPE as locations.
• OpenNLP (1.9.4) [125]: The Apache OpenNLP library is an open-source, machine learning-based toolkit for processing natural language text. We kept the entities of Location detected by OpenNLP as locations.
• DBpedia Spotlight [125]: It is a tool for recognizing and linking entities based on a knowledge base, DBpedia. We treated the place mentions detected by this tool as locations in the evaluation.
• NER-Tagger [101]: It is an NER tool targeted at tweets, built on BiLSTM and CRF. We used the pretrained model and implementation to detect locations that were tagged with B-LOC and I-LOC.
• Polyglot (16.07.04) [7]: It is a natural language pipeline that includes a multi-language NER tool. The entities tagged with I-LOC by this tool were regarded as locations.
• NeuroNER [44]: It is a BiLSTM-CRF based NER system developed at MIT. We used the pretrained model and implementation to tag entities, and the locations were those with the tags of LOC, FAC, and GPE.
• CogComp (4.0) [150]: It is an NER tagger developed by the University of Illinois. The entities tagged with LOC were taken as the locations identified by this tool.
• OSU Twitter NLP [151]: It is an NER tool that particularly targets Twitter texts. The entities tagged with GEO-LOCATION and FACILITY by this tool were treated as locations.
• TwitIE-Gate (9.0.1) [23]: It is a Twitter-specific NER tool, providing an executable pipeline on GATE (General Architecture for Text Engineering), an open-source software toolkit. The entities tagged with Location by this tool were treated as locations.
• TNER [174]: It is an all-round Python library for Transformer-based named entity recognition. Its recognized locations included the entities tagged with LOC, GPE, and FAC.
• Flair NER [4]: Flair is an NLP framework designed to facilitate the training and distribution of sequence labeling and text classification models. Flair NER is the standard 4-class NER model trained on CoNLL-03. We used the trained model directly and identified locations through the LOC tag.
• Flair NER (Ont): It is the Flair NER model trained on OntoNotes, named Flair NER (Ont) for short in this review. We used the trained model directly and included entities tagged with LOC, GPE, and FAC as locations.
• BERT-base-NER: It is a fine-tuned BERT model that is ready to use for named entity recognition. We used the trained NER model directly. The locations were derived from the entities tagged with B-LOC and I-LOC.
• GazPNE2 [81]: It fuses global gazetteers and two pretrained transformer models. The latest version utilizes Stanza to accelerate GazPNE2 and to detect hard examples.
• CLIFF (2.6.1) [46]: It integrates the results of Stanford NER and a modified CLAVIN (Cartographic Location and Vicinity Indexer) geoparser. In the evaluation, we used its implementation and kept the detected place mentions as locations.
• LORE [123]: It is a rule-based location extractor for tweets. We used its implementation to extract locations.
• nLORE [122]: It is a deep learning model, an advanced version of LORE. We used the trained model provided by the authors to extract locations.
• Edinburgh Geoparser (1.2) [68]: It is a geoparsing tool developed by the University of Edinburgh, which combines rules and gazetteers to extract place names directly from text.
• BaseSemEval12 [117]: It is a baseline system for SemEval-2019 Task 12 (i.e., Toponym Resolution in Scientific Papers) that uses a 2-layer feedforward neural network. Its output place mentions were taken as the detected locations.
• NeuroTPR [180]: It is a neural toponym recognition tool built on recurrent neural networks. We used the trained model and implementation to detect location mentions in texts.
• Geoparserpy (2.1.4) [126]: It is a representative gazetteer-based geoparser. We used the implementation of Geoparserpy and deployed the required OpenStreetMap gazetteer to extract place names.
• SPENS [188]: This approach combines the results of five different systems in a voting mechanism: Stanford NER, Polyglot NER, Edinburgh Geoparser, NER-Tagger, and spaCy; it is thus named SPENS for short. We reimplemented the approach by using the code and APIs of the five modules.
• Ritter+Stanford NER+DBpedia [78]: It uses DBpedia to filter the merged detections of Ritter's tool (also named OSU Twitter NLP) and Stanford NER. We name this approach RSD for short and reimplemented it by using the code and APIs of the three modules.
• Ritter+GATE+DBpedia [78]: It uses DBpedia to filter the merged detections of Ritter's tool and GATE. We name this approach RGD for short and reimplemented it by using the code and APIs of the three modules.
• Ritter+Stanford NER [78]: It merges the detections of Ritter's tool and Stanford NER. We name this approach RS for short and reimplemented it by using the code and APIs of the two modules.
All methods were configured taking into account the corresponding research results to ensure that the optimal parameter settings were chosen. For example, we consider not only Location and GPE but also Facility detected by Stanza as locations, since this achieves the best F1 score on the total datasets.

Test data
We collected in total 26 commonly used datasets and use them as test data. The datasets include 3 formal datasets (i.e., news) and 23 informal datasets (i.e., tweets), containing 39,736 place names in total, as shown in Table 4. They can be categorized into two groups based on the purpose of the datasets: Location Extraction (LE) and NER. The former only annotates Location, while the latter annotates not only Location but also other types, such as Person, Organization, and Facility. Note that we do not use some available geoparsing datasets that were used to evaluate geoparsing approaches in [179], such as WikToR [67], since we found that they miss quite a few toponyms. For example, in WikToR, a text or article corresponds to a Wikipedia page entitled with a toponym whose coordinates are specified. The text explains the toponym, and only this toponym is automatically annotated, ignoring the other toponyms in the text. The dataset can thus be used to evaluate toponym resolution approaches but not toponym recognition approaches. The description of the used datasets is as follows:
• LaFlood2016, HouFlood2015, CheFlood2015: They are three flood-related datasets, which were created by Al-Olimat et al. [6] and can be obtained by filling out a dataset registration form (https://docs.google.com/forms/d/e/1FAIpQLScf6-DNwkgJXPS5e28Mj18hIW3Ap_Ym7Kna-SO7oSmiC72qGw/viewform). The locations in the three datasets were annotated as one of three types: inLOC, outLOC, and ambLOC, denoting locations inside the area of interest (e.g., 'Houston'), outside the area, and ambiguous locations (e.g., 'my house'), respectively. We only evaluate the tools on the inLOC and outLOC locations, ignoring the ambLOC locations. Louisiana, Houston, Texas, and Chennai, as well as their abbreviations, such as 'La', 'Hou', and 'Tx', appear frequently in the datasets. Moreover, many location mentions are in hashtags, such as '#laflood', '#txwx', and '#ChennaiRain'.
• Harvey2017 (https://github.com/geoai-lab/NeuroTPR/tree/master/Data/TestData/HarveyTweet2017): The dataset is related to 2017 Hurricane Harvey and was created by Wang et al. [180]. The dataset contains many fine-grained locations, such as '398 Garden Oaks Blvd' and '26206 longenbaugh rd'. No places appear in hashtags since they have been removed from the dataset.
• NzEq2013, NyHurcn2012 (https://revealproject.eu/geoparse-benchmark-open-dataset/): The two Twitter datasets correspond to the New Zealand earthquake in 2013 and the New York hurricane in 2012, respectively. They were created by Middleton et al. [126]. We found several missing place names (e.g., 'Christchurch') which, however, appear frequently in the two datasets. To mitigate this issue, we manually created two missing place name lists for the two datasets, respectively: [('new', 'zealand'), ('nz'), ('uk'), ('christchurch'), ('chch'), ('lyttleton'), ('southland'), ('wellington'), ('south', 'island')] and [('new', 'york'), ('nyc'), ('new', 'york', 'city'), ('ny')]. We define that the detection of an entity which is not annotated in the dataset but is in the corresponding missing list is a true positive. Moreover, sub place names exist in the dataset NzEq2013. For example, in the text 'Christchurch hospital is now back in operation', both 'Christchurch hospital' and 'Christchurch' were annotated as Location. To tackle this issue, we removed sub place names from the dataset.
• Martinez_I, Martinez_II, Martinez_III: The three Twitter datasets correspond to multiple crises and emergency events (e.g., earthquake, flood, car accident, bombing, shooting, terrorist attack, and incident) that happened across the world. They were initially utilized in [51,123]. One feature of the datasets is that many fine-grained locations, such as '13219 S penrose Ave' and 'Exit 34', as well as complex location expressions, such as '50 miles SW of Liverpool' and '25mins away from Northumbria Street', were annotated.
• GeoCorpora: It was created by Wallgrün et al. [177]. In the dataset, location references in tweets were not only annotated but also linked to the toponyms of GeoNames. Therefore, it can also be used to evaluate geocoding approaches. The dataset corresponds to multiple noteworthy events (e.g., earthquake, Ebola, fire, flood, protest, and rebel activity) that happened across the world in 2014 and 2015. The majority of the annotated places are admin units, such as continents, countries, states, and cities.
• BTC-A, BTC-B, BTC-E, BTC-F, BTC-G, BTC-H: The Broad Twitter Corpus (BTC) was created by Derczynski et al. [43]. The datasets were sampled across different regions, temporal periods, and types of Twitter users. Apart from Location, Organization and Person were also annotated. Several annotated place names are in mentions (e.g., '@HoustonFlood'); however, mentions are normally ignored by existing location extractors. Thus, we removed the place names in mentions from the six datasets.
• NEEL2016: It is the gold dataset of the 2016 Named Entity rEcognition and Linking (NEEL) Challenge. The dataset includes event-annotated tweets covering multiple noteworthy events from 2011 to 2013, such as the death of Amy Winehouse, the London riots, the Oslo bombing, and the Westgate shopping mall shootout. Entities of different types, such as Location, Person, Organization, Event, and Product, were not only annotated but also linked to entries in DBpedia. We used its training set, which contains 2,135 tweets and 602 places.
• LGL: The Local-Global Lexicon (LGL) corpus was created by Lieberman et al. [108]. Toponyms were manually annotated and geocoded from 588 human-annotated news articles published by 78 local newspapers.
• GeoVirus: GeoVirus was created by Gritta et al. [64] for the evaluation of geoparsing approaches on news related to disease outbreaks and epidemics, such as Ebola, bird flu, and swine flu. Toponyms were manually annotated and geocoded. Only admin units were annotated in the dataset; buildings, POIs, streets, and rivers were ignored.
• TR-News: TR-News was created by Kamalloo and Rafiei [93]. Toponyms were manually annotated and geocoded from 118 news articles from various news sources.
We adopted the standard comparison metrics: precision, recall, and F1 score. In the case of overlapping or partial matches, we penalize a tool by adding 1/2 FP (False Positive) and 1/2 FN (False Negative) (e.g., if the tool marks 'The Houston' instead of 'Houston'), following Al-Olimat et al. [6].
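The removal of sub place names described for NzEq2013 above can be sketched as a longest-mention filter; the substring test used here is a simplification that ignores token boundaries (a careful implementation would compare annotation spans):

```python
def drop_sub_names(annotations):
    """Remove annotated place names that are contained in a longer
    annotated name from the same text, keeping the longest mention."""
    kept = []
    # visit longer names first so shorter containing checks see them
    for name in sorted(set(annotations), key=len, reverse=True):
        if not any(name in longer for longer in kept):
            kept.append(name)
    return kept
```

Applied to the example in the text, `['Christchurch hospital', 'Christchurch']` reduces to `['Christchurch hospital']`.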
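The half-credit penalty for partial matches can be sketched as follows, assuming gold and predicted mentions are compared as strings (the real evaluation compares annotation offsets, and the overlap test here is a simplification):

```python
def score(gold, predicted):
    """Precision/recall/F1 where an overlapping but inexact prediction
    (e.g., 'The Houston' for gold 'Houston') counts as 1/2 FP + 1/2 FN,
    following the evaluation protocol described above."""
    tp = fp = fn = 0.0
    matched = set()                       # gold mentions accounted for
    for pred in predicted:
        if pred in gold:
            tp += 1
            matched.add(pred)
        elif any(pred in g or g in pred for g in gold):
            fp += 0.5                     # partial match: half credit
            fn += 0.5
            matched.update(g for g in gold if pred in g or g in pred)
        else:
            fp += 1                       # spurious detection
    fn += sum(1 for g in gold if g not in matched)   # missed mentions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For gold `['Houston', 'Chennai']` and predictions `['The Houston', 'Chennai']`, the partial match contributes half an FP and half an FN, giving precision, recall, and F1 of 2/3 each.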

Results of location reference recognition
We ran the 27 tools on all the test datasets; their precision, recall, and F1 scores are reported in Figure 6. The three metrics are obtained by calculating the sum of FP, FN, and TP over all the datasets rather than by averaging the per-dataset metrics, due to the imbalanced place name counts (from 119 to 5,122) of the different datasets. The raw results of the tools on each dataset are available online.
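Summing FP, FN, and TP across datasets before computing the metrics amounts to micro-averaging, which can be sketched as:

```python
def micro_average(per_dataset_counts):
    """Micro-averaged precision/recall/F1: sum TP/FP/FN over all
    datasets first, then compute the metrics, so that large datasets
    weigh more than small ones."""
    tp = sum(c["tp"] for c in per_dataset_counts)
    fp = sum(c["fp"] for c in per_dataset_counts)
    fn = sum(c["fn"] for c in per_dataset_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Macro-averaging (taking the mean of per-dataset F1 scores) would instead let a 119-place dataset count as much as a 5,122-place one, which is why micro-averaging is the better fit here.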

Error analysis
To gain insight into the mistakes these tools made, we carried out an error analysis. Specifically, we investigated the performance of the tools on formal and informal texts. We also examined their performance on location references in different categories and forms.
Place category: We divide the location references in the datasets into four categories: admin units (e.g., country, state, town, and suburb), traffic ways (e.g., street, road, highway, and bridge), natural features (e.g., river, creek, beach, and hill), and POIs (e.g., park, church, school, and library). We choose four datasets (i.e., Harvey2017, GeoCorpora, LGL, and TR-News) to conduct this experiment because the category of the places in these four datasets can be derived. The places in Harvey2017 were categorized into ten types [84]. We treat the types of house number addresses, street names, highways, exits of highways, and intersections of roads as traffic ways; the type of natural features as natural features; the types of other human-made features and local organizations as POIs; and the types of admin units and multiple areas as admin units. In the other three datasets, place names were linked to the entries of GeoNames. We treat the places whose GeoNames feature codes are A (e.g., country, state, and region) and P (e.g., city and village) as admin units; the places whose feature codes are R (e.g., road and railroad) as traffic ways; the places whose feature codes are H (e.g., stream and lake), T (e.g., mountain, hill, and rock), U (e.g., undersea and valley), or V (e.g., forest and grove) as natural features; and the places whose feature codes are L (e.g., park and port) and S (e.g., sport, building, and farm) as POIs. There are in total 9,790 admin units, 773 traffic ways, 263 natural features, and 754 POIs in the four datasets.
We define the detection rate as the proportion of correctly detected places among the total places of a certain category.
Only an exact match is regarded as a correct detection. Figure 9 shows the detection rate of the tools on the four categories. We can observe that many tools show superior performance in recognizing coarse-grained places, with 13 of 27 recognizing over 60% of the admin units. However, most of them are incapable of recognizing fine-grained places.
Place form: We also examine three special forms of place names: names containing numbers, abbreviations, and names in hashtags (e.g., '#HoustonFlood' and '#Chennai'). Place names with numbers mostly refer to fine-grained locations, such as highways, roads, and home addresses. We define an abbreviation as a place name that consists of only one single word whose character length does not exceed 3. In total, 1,621, 3,697, and 6,560 place names are in the three forms (number, abbreviation, and hashtag), respectively. Figure 10 shows the detection rate of the tools on place names of the three forms.
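The three forms can be identified with simple string tests; a sketch following the definitions above (hashtag prefix, presence of digits, and single words of at most three characters), where the function name and the 'regular' fallback label are illustrative:

```python
def classify_form(name):
    """Classify a place-name mention into one of the three special
    forms studied here, or 'regular' if none applies."""
    if name.startswith("#"):
        return "hashtag"            # e.g., '#ChennaiRain'
    if any(ch.isdigit() for ch in name):
        return "number"             # e.g., '398 Garden Oaks Blvd'
    if " " not in name and len(name) <= 3:
        return "abbreviation"       # e.g., 'Tx', 'Hou', 'NY'
    return "regular"
```

Hashtags are checked first so that a hashtag containing digits (e.g., '#laflood2016') is counted once, in the hashtag form.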

We can observe that it is a challenge to recognize place names with numbers and abbreviated forms, with only 4 of 27 tools achieving a reasonable detection rate on names containing numbers. It is also a challenge to recognize place names in hashtags, with only 5 of 27 achieving a detection rate of over 0.3. However, it is encouraging that GazPNE2 and LORE can recognize over 70% of the places in hashtags.

Computational efficiency
In this section, we further investigate the computational efficiency (i.e., speed) of the different approaches. In many applications, the texts that need to be geoparsed are of huge volumes, such as major historical books and reports (e.g., the Old Bailey Online) that each comprise many millions or even billions of words [62], and millions of tweets related to a crisis event [146]. This requires a rapid geoparsing procedure, and speed is thus a critical indicator.
We ran each approach on the total datasets and recorded the time consumed by each approach. We do not count the time consumed during the training phase if an approach needs to be trained, since training can be conducted offline and is a one-time process. Note that we do not include Edinburgh Geoparser and DBpedia Spotlight in the comparison since they are online services, and the processing time consumed on their servers cannot be measured. While most approaches ran on a MacBook Pro laptop with an Intel Core i7 (2.2 GHz, 6-core) and 16 GB of RAM, three approaches (i.e., OSU Twitter NLP, LORE, and nLORE) ran on a Lenovo laptop with an Intel Core i5 (2.5 GHz, 4-core) and 3.8 GB of RAM since they require a Linux or Windows environment. Figure 11 illustrates the speed of the approaches.
We can observe that the speed of different approaches varies drastically. It takes from 6 minutes to 33 hours for these approaches to process the total datasets, which contain 1,092,093 words. It is unexpected that OSU Twitter NLP takes nearly 9.6 hours. Consequently, RSD, RGD, and RS, which build on OSU Twitter NLP, also take nearly 10 hours. The other approaches that take over 5 hours are all deep learning-based. spaCy, Cliff, LORE, Polyglot, and OpenNLP are 20 times quicker than these approaches. However, the deep learning-based approaches achieve a much higher F1 score than the other approaches. Therefore, there exists a trade-off between correctness and computational efficiency.
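The measurement protocol above can be sketched in a few lines: only inference time is measured, with any training assumed to have happened offline. Here `recognize` stands for a hypothetical text-to-spans callable wrapping one of the compared tools:

```python
import time

def time_recognizer(recognize, texts):
    """Return the wall-clock seconds spent running `recognize` over all
    texts. Training time is deliberately excluded: the model is assumed
    to be loaded and trained before this function is called, mirroring
    the evaluation protocol described in the text."""
    start = time.perf_counter()
    for text in texts:
        recognize(text)
    return time.perf_counter() - start
```

Dividing the total word count (here, 1,092,093 words) by the returned duration yields a words-per-second figure that is comparable across tools run on the same machine.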

CONCLUSIONS AND OUTLOOK
In this paper, we first summarized seven typical applications of geoparsing, and then presented a survey of existing approaches for location reference recognition. We grouped these existing approaches into four categories: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. In addition, we carried out a comparative study to systematically compare 27 existing approaches on 26 datasets. We evaluated these approaches from multiple perspectives, including their overall location recognition accuracy, their performance on formal (e.g., news) and informal (e.g., tweets) texts, and their capabilities to detect location references in different categories and forms. Finally, we also compared the computational efficiency of these existing approaches.
From the results, we can conclude that: (1) deep learning is so far the most promising technique for location reference recognition; (2) fusing existing approaches or tools in a voting mechanism can overcome each other's shortcomings and is more robust than any single approach; (3) the performance of different approaches varies with the type of text and the attributes of location references, and their computational efficiency also varies drastically. Users should choose the proper approach according to their specific application demands. For example, for traffic management and disaster management, which rely on informal texts and require fine-grained locations, GazPNE2 is a good option; for crime management and tourism management, which rely on formal texts and require both coarse-grained and fine-grained locations, Flair NER (Ont) is a good option; for spatial humanities, disease surveillance, and GIR, which need to process a large number of formal texts and require only coarse-grained locations, Stanford NER, CogComp, and spaCy are good options.
Several research directions can be further explored in the future.
• Location reference recognition: There is still room to improve the performance of approaches on informal texts (e.g., tweets), since most of the existing approaches did not perform well on them. Inspired by the success of the two voting systems RS and SPENS, one promising direction might be to select several of the 27 approaches and combine them in a voting system to achieve more satisfying results.
We can also configure the combined approaches of the voting system to satisfy the requirements of different applications, such as a high recall on POIs in tweets.
• Location reference geocoding: While there exist many approaches for geocoding location references, they focused mainly on formal texts. Some studies proposed geoparsing tweets, but usually limited the application area to a known geographic region (e.g., a city where a flood occurs) [3,6]. In those cases, simply searching a local gazetteer is often sufficient for geocoding. Only a few studies [94,138] geoparsed tweets at a global scale. This task has two main challenges: geo/geo ambiguities caused by the limited context of short tweets, and unseen place names caused by place name variants and the informal features of tweets, which might contain abbreviations, slang, and misspellings. Three main ways might be explored to overcome these challenges: (1) to leverage deep learning [29,55,98] and clustering techniques [41] that can group tweets of the same topic to expand the context of tweets; (2) to combine multiple SOTA geocoders in a voting mechanism; (3) to generate abundant and high-quality training examples by using geotagged tweets. The geotag of a tweet can drastically reduce the geo/geo ambiguities of the place names in the tweet text, enabling the generation of high-quality training examples.
• Datasets for geoparsing research: Existing datasets for geoparsing research are often in the form of formal text such as news, and only a few Twitter datasets (e.g., GeoCorpora and NEEL2016) are available for geoparsing research. Although there exist many other Twitter datasets for general NER research, these datasets typically do not contain geographic coordinates for the labeled entities and therefore cannot support research involving the entire workflow of geoparsing. More datasets of informal texts with labeled location references and their geographic coordinates are needed. Furthermore, most of the location references in the existing datasets are admin units, such as countries and cities. Finer-grained location references, such as traffic ways and POIs, are much rarer. However, they are important in many applications, such as determining the precise locations where rescue is needed during disasters. A large Twitter dataset that contains many fine-grained locations across the world would be very helpful for advancing methods for recognizing and geocoding fine-grained location references in texts.
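The voting idea discussed above can be sketched as follows, assuming each tool's output is reduced to a set of character-offset spans (an illustrative representation and threshold, not the actual mechanism of RS or SPENS):

```python
from collections import Counter

def vote(span_lists, min_votes=2):
    """Majority voting over the outputs of several recognizers.

    Each element of `span_lists` is one tool's list of recognized spans,
    e.g., (start, end) character offsets. A span is kept only if at least
    `min_votes` tools agree on it exactly, so idiosyncratic false
    positives of any single tool are filtered out."""
    counts = Counter(span for spans in span_lists for span in set(spans))
    return sorted(span for span, c in counts.items() if c >= min_votes)
```

Raising `min_votes` trades recall for precision, which is one way such a system could be configured for different application demands (e.g., high recall on POIs in tweets).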

Fig. 3. Classification of existing approaches for location reference recognition.

Fig. 4. Point density maps of the places contained in OpenStreetMap and GeoNames.

3.1.3 Statistical learning-based approaches. Statistical learning-based approaches are built on annotated training corpora containing texts associated with the expected location references. The annotated corpora are used to train a model via manually defined features, such as infrequent strings, length, capitalization, and contextual features, and/or features automatically learned by deep learning algorithms. The trained model is then applied to unlabeled texts, and the same features are computed to decide on the association of texts and location references. The basic architecture of statistical learning-based approaches is illustrated in Figure 5. Statistical learning-based approaches generally use either traditional machine learning models, such as Random Forest (RF) [56], or deep learning models, such as Long Short-Term Memory (LSTM) [180]. Statistical learning-based approaches can be further divided into two groups: learning-based named entity recognition (NER) and learning-based place name extraction (PNE).
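A sketch of the kind of manually defined features mentioned above (capitalization, length, and context), as one might feed to a traditional sequence tagger such as a CRF or Random Forest. The feature names are our own illustrative choices:

```python
def token_features(tokens, i):
    """Hand-crafted features for token i of a tokenized sentence.
    These mirror the feature families named in the text: orthographic
    (capitalization), length, and contextual (neighboring tokens)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "length": len(tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
```

A trained model then maps each feature dictionary to a tag such as O, B-LOC, or I-LOC; deep learning approaches replace these hand-crafted dictionaries with learned embeddings.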

Fig. 5. Basic architecture of statistical learning-based approaches. O denotes non-type. B-LOC and I-LOC denote the beginning and inner part of a location reference, respectively.
Qi et al. [147] presented a deep learning-based NLP toolkit, named Stanza, which adopted a contextualized string representation-based tagger. Recently, the fully-connected self-attention architecture (a.k.a. Transformer) has attracted much attention due to its parallelism and its advantage in modeling long-range contexts. For instance, Ushio and Camacho-Collados [174] presented a Python library for NER model fine-tuning, named T-NER. It facilitates the training and testing of Transformer-based NER models. Nine public NER datasets from different domains are compiled as part of the T-NER library, such as the CoNLL 2003, OntoNotes 5.0, and WNUT 2017 datasets.
Wang et al. [180] proposed to generate labeled training data from Wikipedia articles to train a BiLSTM model called NeuroTPR. Their model contains several layers to account for the linguistic irregularities in Twitter texts, such as character embeddings to capture the morphological features of words and contextual embeddings to capture the semantics of tokens in tweets. The approach was evaluated on 1000 tweets related to 2017 Hurricane Harvey in Texas and Louisiana. Qiu et al. [148] proposed ChineseTR, a weakly supervised Chinese toponym recognizer. It first generates training examples based on word collections and associated word frequencies from various texts. Based on the training examples, a BiLSTM-CRF network built on BERT word embeddings was explored to recognize toponyms.
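The weakly supervised data-generation idea behind NeuroTPR and ChineseTR can be illustrated with a much-simplified sketch. The template format and function name are our own; the real systems derive examples from Wikipedia articles or word-frequency statistics rather than hand-written templates:

```python
def make_bio_example(template, place):
    """Generate one BIO-labeled training example by filling a template
    sentence with a gazetteer entry. `template` contains a '{}' slot
    for the place name; all other tokens receive the non-type tag O,
    and the place tokens receive B-LOC / I-LOC (see Fig. 5)."""
    tokens, labels = [], []
    for part in template.split():
        if part == "{}":
            place_tokens = place.split()
            tokens.extend(place_tokens)
            labels.extend(["B-LOC"] + ["I-LOC"] * (len(place_tokens) - 1))
        else:
            tokens.append(part)
            labels.append("O")
    return tokens, labels
```

Generating many such (tokens, labels) pairs yields the labeled corpus on which a BiLSTM(-CRF) tagger can then be trained without manual annotation.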

Fig. 6. Precision, recall, and F1 score of the tested tools on all the datasets with 39,736 location references.

Fig. 11. Time consumption of the approaches running on the total test datasets.
3.1.1 Rule-based approaches. Location references in texts often have certain lexical, syntactic, and orthographic features.
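As a minimal illustration of such a rule, the following toy pattern matches a capitalized word sequence followed by a street or highway indicator word. The pattern and indicator list are our own; real rule-based systems combine many such lexical and orthographic patterns:

```python
import re

# Toy rule: one or more capitalized words followed by a street/highway
# indicator, exploiting the orthographic (capitalization) and lexical
# (indicator word) cues described in the text.
STREET_RULE = re.compile(r"\b(?:[A-Z][a-z]+\s)+(?:Street|Road|Avenue|Highway)\b")

def find_streets(text):
    """Return all substrings of `text` matching the toy street rule."""
    return STREET_RULE.findall(text)
```

Rules like this are precise on the patterns they encode but brittle on variants they do not (e.g., lowercase names or abbreviations such as 'St.'), which is why rule-based systems typically maintain large rule sets.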

Table 2. Common heuristics used for disambiguation in gazetteer matching-based approaches.
3  Filter places by population: Keep only places with a population over 1000
4  Limit the spatial range of gazetteers: Use the gazetteers in the area of Florence
5  Filter place names of common (stop) words: 'today', 'long', 'that building', and 'the street'
6  Use spatial indicators in texts: 'in', 'near', and 'at' that appear before a place
7  Use orthographic cues: Capitalization of words, such as 'Houston' and 'Germany'
8  Filter candidates by POS tags: Keep only noun phrases in texts
9  Use a dictionary of other entity types: Person names, such as 'Washington Irving' and 'Houston Alexander'
10 Use other related place names: 'Chennai, IN' and 'stay safe Houston, flood in Houston is serious'

Different sections of a document (e.g., abstract, introduction, body, or table) are concatenated into the input vector of the deep learning model. Dutt et al. [48] proposed a cross-lingual location reference recognition approach for tweets, combining the results of a named location parser based on gazetteer matching, a rule-based building parser, a rule-based street parser, and a trained CRF-based named entity parser. The rules of the street and building parsers are created based on POS tags and indicator words, such as adjective plus noun and street and building indicators (e.g., 'street' and 'highway' in English and 'calle' and 'carreterra' in Spanish). 4488 Spanish crisis-related tweets, with 3182 tweets as the training set and the rest as the test set, are used to evaluate the Spanish extractor. The Spanish dataset is then translated into English with the Google translator and used to evaluate the English extractor.
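Several of the Table 2 heuristics (the population threshold, the stop-word filter, and the capitalization cue) can be sketched together in one lookup routine. The gazetteer structure and function signature are our own assumptions for illustration:

```python
def match_gazetteer(tokens, gazetteer, stop_words, min_population=1000):
    """Gazetteer lookup applying three heuristics from Table 2:
    - orthographic cue: only capitalized tokens are candidates;
    - stop-word filter: common words are never matched;
    - population threshold: entries below `min_population` are dropped.
    `gazetteer` maps lowercased place names to their populations."""
    matches = []
    for tok in tokens:
        name = tok.lower()
        if name in stop_words or not tok[:1].isupper():
            continue
        if gazetteer.get(name, 0) >= min_population:
            matches.append(tok)
    return matches
```

The remaining heuristics (spatial indicators, POS filtering, person-name dictionaries, and related place names) would slot in as further candidate filters in the same loop.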

Table 3. Main features of the tools evaluated in this study.
The places of our test datasets are spread across the globe. Table 3 summarizes the features of the compared approaches. The number after the name of a tool represents its version. The second column denotes the category of the approach with regard to its underlying principle. NERs can recognize not only locations but also other entity types, as shown in the third column. 3 classes, 4 classes, 10 classes, and 18 classes denote {LOC, PER, ORG}, {LOC, PER, ORG, MISC}, {PERSON, GEO-LOCATION, COMPANY, PRODUCT, FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, OTHER}, and {LOC, PERSON, ORG, FAC, GPE, CARDINAL, DATE, EVENT, LANGUAGE, LAW, MONEY, NORP, ORDINAL, PERCENT, PRODUCT, QUANTITY, TIME, WORK_OF_ART}, respectively. 28 classes includes the entities of the 18 classes plus the entities of {CELL TYPE, CELL LINE, CHEMICAL, CORPORATION, DISEASE, DNA, GROUP, PROTEIN, RNA, OTHER}. Note that DBpedia Spotlight can recognize quite a few detailed classes, such as Place, People, Event, and Color. The fourth column refers to the type of datasets on which the approach was developed. The fifth column refers to the development language of the approach; N/A means the code of the approach is unavailable and we reimplemented the approach ourselves. The last column refers to the time when a tool of a certain version was proposed or updated.
• Stanford NER (4.3.1) [53]: It is a Java implementation of a NER based on CRF, which was developed and maintained by the Stanford Natural Language Processing Group. We kept the entities of LOC (location) detected by Stanford NER as locations.
• spaCy (3.2.1): It is a general NLP tool. We used its pretrained model (en_core_web_lg) and kept the entities of LOC, FAC (facility), and GPE (geopolitical entity) detected by spaCy as locations.
• Stanza (1.2) [147]: It is a general NLP toolkit.
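The normalization step described above, keeping only certain entity classes as "locations" for each tool, can be sketched as follows. The dictionary keys are our own, and only the two label mappings quoted in the text are shown:

```python
# Entity labels counted as locations per tool family, following the text:
# Stanford NER contributes LOC; spaCy contributes LOC, FAC, and GPE.
LOCATION_LABELS = {
    "stanford_ner": {"LOC"},
    "spacy": {"LOC", "FAC", "GPE"},
}

def keep_locations(entities, tool):
    """Filter (text, label) pairs down to the labels that count as
    locations for the given tool, making tool outputs comparable."""
    return [text for text, label in entities if label in LOCATION_LABELS[tool]]
```

Without such a mapping, a fair comparison across NER tools with different label inventories (3, 4, 10, 18, or 28 classes) would not be possible.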

Table 4. Summary of 26 datasets. There are in total 39,736 places.
Only 2 of 27, 6 of 27, and 4 of 27 tools can recognize over 60% of the traffic ways, natural features, and POIs, respectively. These three categories refer to much more precise geographic scopes than admin units and are thus valuable in many key applications, such as emergency rescue and traffic event detection. It is worth mentioning that GazPNE2 can recognize over 70% of the places in all four categories.
Conversely, recognizing abbreviations is easier, with over half (16 of 27) of the tools recognizing over 30% of the abbreviations. However, we can also observe that no tool can achieve a detection rate over 0.6 on these two forms.