Learning Geolocation by Accurately Matching Customer Addresses via Graph based Active Learning

We propose a novel adaptation of graph-based active learning for customer address resolution or de-duplication, with the aim of determining whether two addresses represent the same physical building or not. For delivery systems, improving address resolution positively impacts multiple downstream systems such as geocoding, route planning and delivery time estimation, leading to an efficient and reliable delivery experience, both for customers as well as delivery agents. Our proposed approach jointly leverages address text, past delivery information and concepts from graph theory to retrieve informative and diverse record pairs to label. We empirically show the effectiveness of our approach on manually curated datasets of addresses from India (IN) and the United Arab Emirates (UAE), achieving an absolute improvement in recall on average across IN and UAE while preserving precision over the existing production system. We also introduce delivery point (DP) geocode learning for cold-start addresses as a downstream application of address resolution. In addition to offline evaluation, we also performed online A/B experiments, which show that when the production model is augmented with active-learnt record pairs, delivery precision improved and delivery defects reduced on average across shipments from IN and UAE.


INTRODUCTION
Entity matching (EM), also known as entity resolution (ER), aims at identifying and linking different representations of the same real-world entities across databases. EM is a challenging task for real-world applications, particularly when entities are highly unstructured [33] and of low quality, for example, when there is a lack of completeness and consistency in their descriptions. Further, real-world EM tasks [19,29] have limited access to labeled data and require substantial labeling effort to learn accurate EM models. For delivery systems, customer addresses play an important role in delivery planning, as the address is the primary source of information provided by customers regarding their location. Customers provide their addresses in free-text fields at their own discretion, which may or may not follow a fixed pattern. Address writing styles and patterns are idiosyncratic in the same way as handwriting or signatures. This leads to a lot of variation in similar addresses and their components (unit, building, road, locality). Customers often use references to neighbourhoods, landmarks or points-of-interest (POI). For example, it is common for customers in India (IN) to provide colloquial addresses that use landmarks and other POI to denote the place, for example, ABG Bank, Opp. Network Stone, Mahapuri. Other customers may provide more structured addresses that intend to indicate the same place but also conform to local postal standards, for example, Plot No. 438 Taj Towers, ABG Bank, Mahapuri.
In the aforementioned examples, both addresses refer to the same place in the Mahapuri neighbourhood. The first example mentions the Network Stone building as a landmark, uses Opp. for opposite, and mentions ABG Bank as the place. In the second example, the number and name of the building Taj Towers are used to denote the same ABG Bank as the place. Further, a neighborhood provided by a customer can also be known by other vernacular names or be a part of a larger neighborhood. For example, Khalifa City B and Shakhbout refer to the same sub-locality within the larger Khalifa neighbourhood of Abu Dhabi city in UAE. Customers use these synonyms interchangeably, making customer addresses even more challenging to comprehend.
Address resolution aims to de-duplicate a query address against a set of candidate addresses in the database via pair-wise address matching. Creating a representative training set for pair-wise matching is challenging for customer addresses for multiple reasons: (1) The data distribution is skewed towards negative pairs, i.e. no-match. (2) Based on an analysis carried out by our data curation team, the average handle time (AHT) for an annotator to label a customer address pair is high; four times higher on average when compared to other EM tasks. (3) The component values in addresses are vernacular, redundant, noisy or missing, thus leading to unstructured-data problems. (4) No appropriate pre-labeled data exists to bootstrap classifiers, nor rules to automatically label training data through weak supervision. In some real-world classification tasks, it is possible to create ground-truth labels automatically at large scale via rules and heuristics using distant supervision [32] or data programming [41], while for a large number of real-world applications, such as resolution of unstructured addresses, manual labeling of data cannot be avoided. Labeling a large volume of pairs for a variety of scenarios in EM does not scale, hence prior studies have adopted active learning (AL). AL aims to focus the labeling effort on the most informative records that will maximize the performance of the model, thus reducing annotation cost. But due to the variety of challenges posed by customer addresses, the existing AL techniques are insufficient and cannot be utilized in their current form.
Our proposed approach jointly leverages address text, GPS points from past deliveries, and concepts from graph theory, namely graph partitioning, graph cuts and transitivity, along with active learning to sample informative and diverse record pairs to minimize the cost of annotation. Our empirical evaluation shows significant improvements in pair-wise matching and delivery point (DP) geocoding metrics compared to the existing production system and other state-of-the-art baselines. Further, it should be noted that the structure of addresses is quite different for IN and UAE, hence the improvements across both geographies confirm the wide applicability and generic nature of our approach. In summary, our main contributions are: (1) We propose a novel adaptation of graph-based active learning to tackle a real-world problem of customer address resolution, particularly pair-wise matching. (2) We jointly leverage address text and geospatial properties of addresses along with concepts from graph theory to retrieve informative record pairs. Our query strategy utilizes disagreement and geospatial diversity to select record pairs to label in a data-efficient manner. (3) We deployed our approach to production and show its impact on DP geocoding, a fundamental business problem that enables delivering packages in a cost-effective manner.

RELATED WORK
The widely used uncertainty-based methods leverage prediction scores to select difficult examples for annotation [11,22], whereas diversity-based sampling exploits heterogeneity in the feature space [2,5]. Hybrid approaches that combine uncertainty and diversity [13,42,46] tackle the limitation of acquiring redundant and easy samples from the earlier discussed sampling techniques, but some studies [30,31,38] found them to be less effective for EM tasks. Due to the lack of reusable EM models, crowdsourcing [21] is leveraged. But in real-world applications where data is domain-specific or confidential, crowdsourcing becomes challenging and incurs high costs. Many earlier works have also used AL with a variety of classifiers and explainable ER rules [39,40]. Some studies have effectively used graph-based techniques [3,37] for various real-world applications. Different signals of the graph structure, such as the cluster representation and density, are used for refining the uncertainty-based sampling strategy [3]. Another work on multi-source resolution [37] used graph algorithms as well. But due to the variety of challenges posed by customer addresses, the existing techniques cannot be utilized in their current form for our problem space. Later, DTAL [19] tackled EM in low-resource settings and proposed learning a deep neural network via active learning with uncertainty sampling along with partitioning. None of the aforementioned studies consider transformer-based techniques for EM.

METHODOLOGY
Problem Statement. In our problem domain, a match represents an address pair belonging to the same physical building, whereas no-match represents an address pair referring to different buildings. We consider a pool-based setting [17,19] for active learning where unlabeled record pairs P are generated via blocking over the customer address database A. At each iteration of AL, we select a batch of K instances and add them to the labeled corpus L, removing them from P. We perform AL for T iterations. Our task is to effectively sample K record pairs from P to be labeled by human annotators, such that after re-training on the acquired labeled pairs L, the performance on an unseen test set is maximized. The details of graph theory concepts are discussed in Appendix A.1. Next, we discuss our approach as shown in Figure 1.

Blocking. Comparing every address in the database to every other address (Cartesian product) is not scalable. We use ElasticSearch [15], with fastText embeddings [6,16,36], to index [28,44] the addresses and then filter those addresses that are an obvious no-match (addresses that belong to a different district, state or postal code). We retrieve the top-k candidates for every customer address to get a pool of unlabeled candidate record pairs P and apply the trained classifier to determine pair-wise matching. The details of the classification model are discussed in Appendix A.4.

Graph Construction. An undirected and weighted graph with no self-loops and vertices V and edges E is a pair G = (V, E), with E ⊆ {{u, v} | u, v ∈ V ∧ u ≠ v}. Each address is represented via a node, and the edge between two nodes is determined based on the prediction of a trained model M. Given P retrieved via blocking and its corresponding predictions Ŷ(P) inferred via M, we construct an address graph G. We add an edge for every matching record pair, while we skip the edge for every non-matching pair. The weight assigned to an edge is equal to the predicted probability score learnt by M.
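The graph-construction step above can be sketched as follows. This is a minimal illustration using networkx, assuming a pairwise matcher that outputs a match probability per candidate pair; the addresses, scores and threshold are made up for the example.

```python
# Sketch of address-graph construction from pairwise match predictions.
# Edges exist only for predicted matches; the edge weight is the score.
import networkx as nx

def build_address_graph(pairs, probs, threshold=0.5):
    """pairs: list of (addr_u, addr_v); probs: match probability per pair."""
    G = nx.Graph()
    for (u, v), p in zip(pairs, probs):
        G.add_node(u)
        G.add_node(v)
        if p >= threshold:            # skip predicted no-match pairs
            G.add_edge(u, v, weight=p)
    return G

# toy candidate pairs with illustrative model scores
pairs = [("12 Taj Towers, Mahapuri", "Plot 438 Taj Towers, Mahapuri"),
         ("12 Taj Towers, Mahapuri", "Khalifa City B, Abu Dhabi")]
G = build_address_graph(pairs, probs=[0.91, 0.12])
```

Only the first pair produces an edge, so the graph has three nodes and one weighted edge.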
We leverage the transitivity of an address graph to discover false negatives from base model predictions. However, given that the edges of the graph are derived from the predictions of M, and M is not always accurate, a wrongly predicted match edge can lead to a series of false positive record pairs. (Algorithm 1 lists the full procedure.)

Graph Partitioning and Graph Cuts. We use graph partitioning and graph cuts to find and remove likely false positive edges from the graph and obtain smaller connected components (CC), so that the set of nodes within the same CC represents addresses from the same physical building. After constructing an address graph G, we apply a single pass of the Louvain algorithm [4] to separate the nodes into multiple mutually exclusive graph partitions. Louvain was preferred as it does not require us to input the number of partitions or their sizes before execution. Further, a single pass of Louvain is a linear operation in terms of the number of edges of the graph, thus allowing it to scale across millions of edges. We then determine the CC of the partitioned graph. For each component, we use graph cuts to prune weak links and isolated components (Appendix A.1.2). We leverage minimum cut [9] and bridges [1] as graph cut techniques to prune the likely false positive edges from the graph. For a given CC, we iterate over all the node pairs and retrieve those node pairs s-t where the geospatial proximity is greater than a pre-defined threshold N (Appendix A.1.3 and A.1.4). This ensures that the pair of extracted nodes are likely to belong to two different physical buildings. We compute the s-t min-cut for all such node pairs and then remove the min-cut edges from the graph to get a graph G′.
Further, we identify and remove bridge edges from the graph to finally get a pruned graph G″. In order to avoid creating too many small components, we only remove the bridge edges connecting nodes that have at least three neighbours each (Algorithm 1, lines 5-11).

Query Strategy. To learn a graph label Ŷ(G), we first compute all the CC in G″. For all node pairs belonging to the same CC, we assign a match label; otherwise a no-match label is assigned. The address pairs whose graph label Ŷ(G) differs (disagreement) from the model prediction Ŷ(P) hint towards nuanced matching patterns that are not yet learnt by the model. This can occur under the following two scenarios. First, the record pair has been predicted as no-match by M but, due to graph transitivity, the graph-inferred label is match. Second, the record pair has been predicted as a match by M but the corresponding edge was removed via graph partitioning and graph cuts. From the record pairs in disagreement, we select an equal number of likely false negatives and likely false positives to prevent skewness in the data distribution. To ensure geospatial diversity across pairs in disagreement, we sample across a grid based on the Military Grid Reference System (MGRS) [35] to capture samples across different regions. In each iteration, we select K informative pairs to be labeled by the human annotators. We augment the labeled data to the initial manually curated training data. The augmented train set is used to re-train M and is evaluated on the same unseen test data (Algorithm 1, lines 12-18).
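A simplified sketch of the disagreement step of the query strategy, assuming the pruned graph and the model's binary labels are already available. The MGRS-based diversity sampling is omitted, and the function and variable names are ours, not the production implementation.

```python
# Disagreement-based selection: graph labels come from connected
# components of the pruned graph; model labels come from the classifier.
import networkx as nx

def select_disagreements(G_pruned, pairs, model_labels, k):
    # map each node to its connected-component id
    comp_id = {}
    for i, comp in enumerate(nx.connected_components(G_pruned)):
        for node in comp:
            comp_id[node] = i
    likely_fn, likely_fp = [], []
    for (u, v), pred in zip(pairs, model_labels):
        # distinct defaults so two absent nodes never compare equal
        graph_label = int(comp_id.get(u, -1) == comp_id.get(v, -2))
        if graph_label == 1 and pred == 0:
            likely_fn.append((u, v))   # transitivity implies a match
        elif graph_label == 0 and pred == 1:
            likely_fp.append((u, v))   # edge pruned by partitioning/cuts
    # balanced sample of the two disagreement types
    half = k // 2
    return likely_fn[:half] + likely_fp[:half]

G_pruned = nx.Graph([("a", "b"), ("b", "c")])      # toy pruned graph
pairs = [("a", "c"), ("a", "b"), ("a", "d")]
model_labels = [0, 1, 1]                           # M's binary predictions
queries = select_disagreements(G_pruned, pairs, model_labels, k=2)
```

Here ("a", "c") is a likely false negative (same component, predicted no-match) and ("a", "d") a likely false positive.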

EXPERIMENTAL EVALUATION
Curated Ground Truth (CGT). We did stratified sampling of addresses to cover all the address writing styles and abbreviations across the country. The selection also ensured coverage of medium and low address-volume districts, thus accounting for the varied density of addresses, i.e. a probable urban vs. rural split. We generate close to 15 unique address pairs each for IN and UAE, which are then manually labeled by the annotation team.

Evaluation Settings. In blocking, we use a fastText model trained on customer addresses. For each address, we pick the top-k similar records, with k = 200, and generated tens of millions of unlabeled pairs P as the output of the blocking step across each geography. The number of AL iterations T is set to 10 and the number of records K to sample per query is 500. On average, the graph constructed for each geography had close to 3 nodes and around 15 edges.

Evaluation Metrics. We split the CGT data 70-10-20 for training, validation and testing. All the models in Table 1 were evaluated on the same CGT test set. To align with the downstream application, a high-precision (at least 95% precision on the match class) pairwise-matching model is required. We evaluate the matching model across two metrics, namely: (1) overall pair-wise accuracy (Accuracy), and (2) recall at 95% precision (R@95P). Table 1 reports the performance of all the models across these metrics on the CGT test dataset. The R@95P numbers correspond to the match class to align the performance of the model with the downstream application.

Learning Model Baselines. The details of the Production Model are discussed in Section 3. Ditto [23] is a BERT-based model fine-tuned on the CGT train set. Table 1 reports the performance of these models on the same CGT test set for IN and UAE.

Active Learning Baselines. Random Sampling retrieves random pairs from the unlabeled pool P, whereas Confidence Sampling selects uncertain [45] record pairs. ER Rules proposed the use of tree-based models with rules [20,27,38].
We used our domain knowledge and historic delivery data to design the rules. ALMSER [37] used graph algorithms for multi-source matching. Deep Transfer AL (DTAL) [19] used transfer learning in low-resource settings and proposed learning a deep neural network via active learning with uncertainty sampling along with partitioning. Adaptive Retrieval with COSINEBERT [34] was used to greedily maximize the number of positive record pairs. Further, we leveraged geospatial properties and GPS scan information with GEOBERT [24] and utilized adaptive retrieval [34] to generate a stronger baseline. We fine-tuned [25,26] the pre-trained COSINEBERT [34] and GEOBERT [24] models on the CGT train set. DIAL [17] used index-by-committee with RoBERTa models. We fine-tuned pre-trained RoBERTa as described in DIAL [17] on the CGT train set.
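The R@95P metric from the Evaluation Metrics paragraph can be computed by sweeping score thresholds from high to low and keeping the best recall at which match-class precision stays at or above the floor. A minimal sketch with toy labels and scores follows.

```python
def recall_at_precision(y_true, scores, min_precision=0.95):
    """Best recall over all thresholds where precision >= min_precision."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    total_pos = sum(y_true)
    best_recall = 0.0
    for i in order:                      # sweep thresholds from high to low
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        if tp / (tp + fp) >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall

# toy example: four pairs, three true matches, a 75% precision floor
r = recall_at_precision([1, 1, 0, 1], [0.9, 0.8, 0.7, 0.6], min_precision=0.75)
```

With the toy inputs, the lowest threshold still yields exactly 75% precision, so all three matches are recovered and r is 1.0.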

Results and Ablation Study
We start with the CGT train set as the initial labeled set for all AL baselines. The record pairs retrieved by the AL baselines are labeled by the data annotation team. We retrain after augmenting the active-learnt pairs to the CGT train set and evaluate the performance on the same CGT test set. The Production model and Ditto baselines are trained on the CGT train set only. Table 1 shows that our approach outperformed all baselines across IN and UAE by a significant margin, thus confirming its wide applicability and generic nature. In comparison to the production model, accuracy improved by 6.62% and R@95P by 9.3% on average in absolute terms. In comparison to the second best performing approach, an average improvement of 4.84% is observed in R@95P across IN and UAE. The ablation study results in the last three rows of Table 1 highlight the importance of graph partitioning and graph cuts. Also, the improvement with our AL approach across metrics using Ditto as the classification model (Our Approach (Ditto)) shows its effectiveness for deep learning models as well. As a separate experiment, for UAE, we compared the top AL strategies to identify the approach that takes the smallest possible subset of the CGT train set to reach the R@95P performance observed using 100% of the train set (Production Model). We start with an identical 10% of the CGT train set, and sampled 10% of the data in each iteration from the remaining pool using different AL strategies. Figure 2 shows that to achieve any given R@95P, our approach outperforms all baselines in terms of the percentage of labels requested. Our approach required 28.7% less training data in comparison to the next best method to reach the same performance as the production model. A similar trend was observed for IN as well. The details of the qualitative analysis of our approach are discussed in Appendix A.2.

REAL WORLD APPLICATION
DP Geocoding converts free-form address text to a geocode (a latitude-longitude pair). Having a sufficient number of deliveries to an address allows us to learn reasonable-quality geocodes by aggregating the past delivery locations [14]. In this work, we limit the scope to learning DP geocodes for cold-start addresses, a particularly challenging task because of the lack of historical geocode data. To learn a DP, we match the new address against existing addresses in the database (reference set) for which geocode information is available. We then aggregate the geocodes of matched addresses to learn a single DP geocode using KDE [43]. The key metrics used are: (1) Delivery Precision is the percentage of total shipments for which the actual delivery happened within a threshold distance Z from the planned delivery location. (2) Delivery Defects is the percentage of total shipments for which the actual delivery happened outside of the threshold distance Y from the planned delivery location; hence, the lower the value, the better the metric. For business reasons, we cannot reveal Z and Y. The details of the offline evaluation are discussed in Appendix A.3.

Online A/B Experiment. After observing significant improvements during offline evaluation, we launched an online A/B experiment in the third quarter of 2022 on live traffic for the IN geography. We dialled up the model in a phased manner: 10%, 50%, then 100% of delivery stations. We observed statistically significant improvements during one week of dial-up in each phase. During the A/B test period, our active-learnt model learned DP geocodes for a few hundred thousand shipments, where we observed a 7.84% improvement in delivery precision and a 12.32% reduction in delivery defects. Following the success in IN, we will launch an online experiment in UAE.

CONCLUSION
We propose a novel adaptation of graph-based active learning for customer address resolution. In comparison to the existing production system, experiments on a manually curated dataset show that our approach is highly effective. After observing significant improvements during offline evaluation for DP geocoding, we successfully deployed our approach and performed online A/B experiments, which show that when the production model is augmented with active-learnt record pairs, the DP geocoding metrics improve significantly. These improvements lead to better delivery planning, a significant decrease in operating costs, and improved customer satisfaction.

A APPENDIX

A.1 Background Details
This section introduces the necessary background on concepts from graph theory and geospatial properties of customer addresses. An undirected and weighted graph with no self-loops is a pair G = (V, E), where V is a set of vertices or nodes and E is a set of edges between the nodes. Each distinct text from the customer address database is represented via a node, and the edge between two nodes is determined based on a match or no-match prediction of a trained machine learning model. We use n = |V| to denote the number of nodes and m = |E| to denote the number of edges.
A.1.1 Graph Partitioning. Graph partitioning [7] refers to a class of problems that deals with reducing a graph to multiple smaller graphs by partitioning its set of nodes into mutually exclusive groups. The nodes that are much more linked to nodes within their group compared to nodes in the other groups are said to form communities. We use the Louvain algorithm [4], a graphical method to partition a graph based on network structure and edge relationships. Louvain is an unsupervised algorithm and consists of two important phases, modularity optimization and community aggregation [4]. These steps are executed until there are no more changes in the network and the maximum modularity of the graph is achieved. In comparison to other graph partitioning techniques, Louvain was preferred as it does not require us to input the number of communities or the partition sizes before execution. Further, a single pass of Louvain is a linear operation in terms of the number of edges of the graph, thus allowing it to scale across millions of edges in a graph.
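A minimal Louvain example with networkx follows; the graph is illustrative, and `louvain_communities` discovers the number of communities on its own, which is the property exploited above.

```python
# Two dense triangles joined by one weak link; Louvain should separate
# them into two communities without being told how many to find.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # dense group 1
                  ("x", "y"), ("y", "z"), ("x", "z"),   # dense group 2
                  ("c", "x")])                          # weak inter-group link
parts = nx.community.louvain_communities(G, seed=42)
```

The result is a list of node sets, one per discovered community.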
A.1.2 Graph Cuts. These techniques have been successfully applied to a number of real-world applications, for example, designing flow networks, computer vision, graphics and image processing problems [9]. We leverage minimum cut and bridges as graph cut techniques for the task of EM across customer addresses. For a given pair of nodes, source s and target t, the s-t min-cut of a weighted graph is defined as the minimum sum of weights of the edges that, when removed from the graph, make s and t disconnected. A bridge in an undirected graph is an edge that, when removed from the graph, increases the number of connected components. In other words, if we remove an edge which is a bridge, the graph will no longer remain connected.
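The two pruning operations can be sketched with networkx as below. For simplicity this uses the unweighted `minimum_edge_cut` in place of the weighted s-t min-cut, and the degree guard mirrors the bridge-removal rule from the methodology; function and variable names are ours, not the production implementation.

```python
# Illustrative pruning of likely false-positive edges via s-t cuts and
# bridge removal, assuming geospatially distant node pairs are given.
import networkx as nx

def prune_graph(G, distant_pairs, min_degree=3):
    H = G.copy()
    for s, t in distant_pairs:          # pairs far apart on the ground
        if H.has_node(s) and H.has_node(t) and nx.has_path(H, s, t):
            # remove the smallest edge set disconnecting s from t
            H.remove_edges_from(nx.minimum_edge_cut(H, s, t))
    # remove bridges only between well-connected nodes, to avoid
    # shattering the graph into many tiny components
    for u, v in list(nx.bridges(H)):
        if H.degree(u) >= min_degree and H.degree(v) >= min_degree:
            H.remove_edge(u, v)
    return H

# two triangles (two "buildings") wrongly joined by a single edge
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
H = prune_graph(G, distant_pairs=[(1, 5)])
```

Cutting between the distant pair (1, 5) removes the spurious joining edge, splitting the graph into two components.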
A.1.3 Geospatial Properties. For each node in the graph G, two main modalities of information are available, namely the free-form address text and multiple GPS points based on successful past deliveries. GPS points are sometimes noisy, as they depend on driver compliance. We use the GPS points associated with each address to learn a single delivery point (DP) to direct future deliveries. A brute-force approach would be to compute the centroid of GPS points from past deliveries. Unfortunately, this can direct delivery associates to the middle of the street or to a different building. Centroids and medoids are prone to outliers, hence proving inaccurate in estimating delivery points [14]. We use density-based methods to accurately approximate a single delivery point from historical deliveries for each address via Kernel Density Estimation (KDE) [43]. KDE-maximized delivery points are further used to determine the geospatial proximity between a pair of graph nodes using the haversine distance [10]. The haversine distance, also called the great-circle distance, is the shortest distance between two points on the surface of a sphere (Earth), with each point denoted by its latitude and longitude.
A.1.4 Geospatial Proximity. We use Gaussian-based Kernel Density Estimation (KDE) [43] with a 25-meter bandwidth to approximate a single delivery point from historical deliveries for each address. KDE-maximized delivery points are then used to determine the geospatial proximity between a pair of graph nodes using the haversine distance [10].
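A self-contained sketch of the KDE-maximized delivery point: among an address's historical GPS fixes, pick the fix with the highest Gaussian kernel density under the 25 m bandwidth, with haversine distances computed directly. The coordinates below are made up, and restricting the maximizer to observed fixes is a simplification of the production estimator.

```python
# Density-based delivery-point estimation with a Gaussian kernel over
# haversine distances; robust to a single outlying GPS fix.
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(p, q):
    """Great-circle distance in meters between (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def kde_delivery_point(points, bandwidth_m=25.0):
    """Return the historical fix with maximum Gaussian kernel density."""
    def density(p):
        return sum(math.exp(-0.5 * (haversine_m(p, q) / bandwidth_m) ** 2)
                   for q in points)
    return max(points, key=density)

# three fixes at one building plus one far-away outlier (toy coordinates)
pts = [(12.9716, 77.5946), (12.97162, 77.59461),
       (12.97161, 77.59459), (12.98, 77.60)]
dp = kde_delivery_point(pts)
```

Unlike a centroid, the density maximizer is pulled to the tight cluster, not dragged toward the outlier.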

A.2 Qualitative Analysis
We compare the quality of predictions generated by the model with and without our proposed approach. To assess the quality of pair-wise matching, we analysed address pairs from some neighbourhoods. We replace personally identifiable information with dummy values (e.g., 1234, 12345, 1238) and retain the public components of addresses so as not to reveal actual customers. The outcomes of the base model and our approach are shown in Figure 3. It is evident from these examples that our approach handles false positives and false negatives effectively. Further, we analysed the DP geocodes predicted by the base model and our approach against the actual delivery location. The quality of predictions is highlighted through the following real-world scenario. Downtown, Diablo Dum Mall St., 1234, Downtown Sky View is a newly created address. Figure 4 shows that the base (Production) model incorrectly matches the new address against multiple addresses from the adjoining streets (yellow points), hence learning an inaccurate DP (black point) and resulting in a delivery defect when compared to the actual delivery location (green point). With our approach, the model accurately identified addresses from the same building (orange points) to learn an accurate DP (violet point) within the threshold Z of the actual delivery location.

A.3 Offline Evaluation
The deliveries that happened from April 2020 to April 2022 across all delivery stations were considered for creating the reference set. Deliveries across the first three weeks of May 2022 were used as the test set. During the test period, a few hundred thousand deliveries were made to cold-start addresses, against which we evaluated our approach. We re-trained the production model by augmenting the initial CGT train set with record pairs sampled via our approach and observed the impact on DP geocoding. On average across IN and UAE, we observed a 6.74% improvement in delivery precision and an 11.39% reduction in delivery defects compared to the system in production.

A.4 Classification Model
Following Comber et al. [12], we first parse the address text using a BiLSTM-CRF [18] into address fields (unit, building, road, locality, etc.). Further, we engineer features, such as the cosine similarity and fuzzy match score of record pairs for all the parsed address fields, to perform pair-wise matching (binary classification) using the XGBoost [8] classifier. This classifier, M, serves real-time traffic in our production system and was trained on manually curated ground truth data.
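The field-level feature engineering can be sketched as follows; `difflib` stands in for the production fuzzy matcher, the field names follow the parse schema above, and the resulting features would feed the XGBoost classifier. The helper names are illustrative.

```python
# Pairwise features over parsed address fields: token-level cosine
# similarity plus a character-level fuzzy score per field.
import math
from collections import Counter
from difflib import SequenceMatcher

def cosine_sim(a, b):
    """Cosine similarity between whitespace-token count vectors."""
    u, v = Counter(a.split()), Counter(b.split())
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def pair_features(rec_a, rec_b,
                  fields=("unit", "building", "road", "locality")):
    feats = {}
    for f in fields:
        a, b = rec_a.get(f, ""), rec_b.get(f, "")
        feats[f + "_cosine"] = cosine_sim(a, b)
        feats[f + "_fuzzy"] = SequenceMatcher(None, a, b).ratio()
    return feats

a = {"building": "Taj Towers", "locality": "Mahapuri"}
b = {"building": "Taj Tower", "locality": "Mahapuri"}
feats = pair_features(a, b)
```

The feature dictionary can then be vectorized in a fixed field order for the classifier.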