Hierarchical Cut Labelling - Scaling Up Distance Queries on Road Networks

Answering the shortest-path distance between two arbitrary locations is a fundamental problem in road networks. Labelling-based solutions are the current state of the art for achieving fast response times, and can generally be categorised into hub-based labellings, highway-based labellings, and tree decomposition labellings. Hub-based and highway-based labellings exploit hierarchical structures of road networks with the aim of reducing labelling size to improve query efficiency. However, these solutions still result in large search spaces on distance labels at query time, particularly when road networks are large. Tree decomposition labellings leverage a hierarchy of vertices to reduce search spaces over distance labels at query time, but such a hierarchy is generated using tree decomposition techniques, which may yield very large labelling sizes and slow querying. In this paper, we propose a novel solution, hierarchical cut 2-hop labelling (HC2L), to address the drawbacks of the existing works. Our solution combines the benefits of hierarchical structures from both perspectives: reducing the size of a distance labelling at preprocessing time and further reducing the search space on a distance labelling at query time. At its core, we propose a new hierarchy, the balanced tree hierarchy, which enables a fast, efficient data structure to reduce the size of a distance labelling and to select a very small subset of labels to compute the shortest-path distance at query time. To speed up the construction process of HC2L, we further propose a parallel variant of our method, namely HC2L^p. We have evaluated our solution on 10 large real-world road networks through extensive experiments. The results show that our method is 1.5-4 times faster in terms of query processing while being comparable in terms of labelling construction time and achieving up to 60% smaller labelling size compared to the state-of-the-art approaches.


INTRODUCTION
A distance query in road networks is to find the length of a shortest path between any two given locations. This is a fundamental building block with numerous real-world applications, such as GPS navigation [22], route planning [18], traffic monitoring [27], and point of interest (POI) recommendation [33]. Nowadays, computational resources such as memory and storage have become readily available, e.g., road networks such as USA and EUR with tens of millions of vertices can be easily accommodated by a single server. However, many of these applications have low latency requirements, arising from the need to compute thousands to millions of distance queries per second as part of a more complex problem, which itself needs to be solved frequently, e.g. matching taxi drivers to passengers, optimizing delivery routes with multiple pick-up and drop-off points that can change dynamically, or recommending the k-nearest POIs to customers.
For example, ride-hailing companies such as Uber are often required to compute millions of shortest-path distances between cars and customers each second [24,34] (e.g., between the locations of 1k cars and 10k customers) in order to process customers' requests, e.g., finding the nearest cars to customers. In such cases, even if answering one distance query takes only 1 microsecond, processing 10 million distance queries would still take 10 seconds. Thus, it is very important to improve the performance of computing shortest-path distances for these applications to ensure good service quality for their customers.
Related Work. We briefly review the existing work on answering shortest-path distance queries in road networks.
Search-based approaches. A traditional approach to answering a distance query is to use Dijkstra's algorithm [32], which can compute the length of a shortest path from a source vertex to a destination vertex in $O(|E| + |V| \log |V|)$ time. This is, however, very slow in practice. In particular, for pairs of vertices that are far apart from each other, the search space is large. To improve search efficiency, a bidirectional scheme can be used to run two Dijkstra searches: one from the source vertex and the other from the destination vertex [29]. However, road networks are structurally characterised by high diameters and low node degrees, making search-based approaches such as Dijkstra's algorithm highly inefficient. In general, search-based approaches fail to achieve the response times required by many real-world applications that operate on increasingly large road networks.
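As a point of reference for the search-based baseline described above, the following is a minimal sketch of a point-to-point Dijkstra search. It is illustrative only: the adjacency format and the function name `dijkstra_distance` are assumptions, and the binary heap used here gives $O((|V| + |E|) \log |V|)$ time rather than the Fibonacci-heap bound quoted in the text.

```python
import heapq

def dijkstra_distance(adj, source, target):
    """Return the shortest-path distance from source to target, or None.

    adj maps each vertex to a list of (neighbour, weight) pairs with
    positive weights (a hypothetical input format).
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry, a shorter path was already found
        if u == target:
            return d  # the target is settled, so d is final
        settled.add(u)
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # target not reachable from source

example = {
    "a": [("b", 2), ("c", 5)],
    "b": [("a", 2), ("c", 1)],
    "c": [("a", 5), ("b", 1)],
    "d": [],
}
```

Every query repeats the search from scratch, which is why the number of settled vertices grows quickly for distant pairs on high-diameter graphs.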
To accelerate search by directing the search space towards useful vertices, rather than conducting unrestrained searches, a number of search-based approaches leverage indices. For example, ALT [22] pre-computes a partial distance index to accelerate the A* search. HiTi [26] constructs a hierarchical structure using graph decomposition techniques to accelerate query performance. Highway Hierarchies [30] and Contraction Hierarchies (CH) [21] have been shown to prune the search space massively for efficient querying. Moreover, algorithms such as Transit Node Routing (TNR) [7], TRANSIT [9] and Arterial Hierarchy (AH) [37] integrate coordinate information and divide a road network into multiple hierarchical grids. Transit Node Routing and TRANSIT then precompute an index which stores distances between vertices with respect to grids. Arterial Hierarchy is an improved version of the CH algorithm. Despite considerable progress, these algorithms still require exploring a large search space for distance queries when vertices are far apart from each other in a road network.
Labelling-based approaches. To address the limitations of search-based approaches, labelling-based approaches for road networks have been studied with great success [1,2,4,5,14,25,28,35]. Instead of performing searches on a road network at query time, these approaches precompute distance labels that fully capture distance information between pairs of vertices, and then perform searches on distance labels at query time to compute distances. Labelling-based approaches can answer distance queries significantly faster than search-based approaches, at the cost of requiring additional space for storing precomputed labels. As such, it is still an open problem how to design algorithms that offer both fast query times and small memory costs.
The current state-of-the-art labelling-based approaches exploit hierarchical structures of road networks. Most of these approaches satisfy the 2-hop cover property [15], which requires at least one common vertex in the distance labels $L(s)$ and $L(t)$ to be on a shortest path between any query pair $(s, t)$, and generally fall into three categories: hub-based labellings, highway-based labellings, and tree decomposition labellings. Approaches for hub-based labellings [1,2] generate distance labels for vertices that contain the distances to hub vertices following a vertex ordering computed by conducting CH searches on a road network. This results in a hierarchical structure among distance labels which is closely related to the labelling size. Approaches for highway-based labellings [4] decompose a road network into disjoint shortest paths and then construct distance labels for vertices that contain the distances to a subset of decomposed shortest paths. Just as a vertex ordering is important to hub-based labellings, the order of shortest paths is important to highway-based labellings for achieving small labelling sizes. Approaches for tree decomposition labellings [14,28] first find a hierarchy over vertices in a road network using tree decomposition techniques [12]. Then a distance query $(s, t)$ is answered by searching such a tree-decomposition hierarchy to identify a subset of common vertices in the distance labels $L(s)$ and $L(t)$, and to compute the distance between $s$ and $t$.
Present Work. From the above analysis, we observe that labelling-based approaches leverage hierarchies over vertices in a road network for distance queries in two ways. One way is to reduce the size of distance labellings. In hub-based labellings and highway-based labellings, a distance query $(s, t)$ needs to search through the entire labels $L(s)$ and $L(t)$ when computing a shortest-path distance. Therefore, they aim to find a "good ordering" of vertices which can produce distance labels of small sizes, and reduce the search space on $L(s)$ and $L(t)$ in the process. The other way is to assign a hierarchy over vertices and, while querying, use such a hierarchy to reduce the search space to a small subset of entries in a distance labelling. In tree-decomposition labellings, such a hierarchy is obtained by applying existing tree decomposition techniques [12].
Motivated by the above observations, in this work, we aim to design a solution that combines the benefits of using vertex hierarchies from both perspectives: leveraging a vertex hierarchy to reduce the size of a distance labelling and to further reduce the search space on such a distance labelling at query time. This leads to the following questions:
• What are the desirable characteristics of such a hierarchy?
• How can a hierarchy be designed to reduce the sizes of distance labellings and find "good hops" to accelerate search on distance labels simultaneously?
• What are the hierarchical properties of distance labellings under this framework?
Contributions. In this paper, we propose a new labelling-based method which addresses the above questions. Our contributions are summarised as follows:
• Upon analysing existing approaches that exploit hierarchical structures of road networks for efficient distance querying, we introduce a new hierarchy called balanced tree hierarchy which enables us to reduce the size of a distance labelling significantly. Further, our balanced tree hierarchy allows us to answer a distance query by processing a small subset of labels that are necessary to compute the distance. Compared to the state-of-the-art approaches, our method achieves much faster query times.
• We develop an efficient algorithm to construct our proposed balanced tree hierarchy. Our algorithm recursively partitions a road network while preserving two conditions: balanced partitions and small cuts. Specifically, it first finds a balanced and small vertex cut and then creates a minimum number of shortcuts to preserve the distance between border vertices. This results in a tree-structured hierarchy among vertices, supported by an efficient data structure for finding hub vertices of any two vertices in a road network.
• We propose a distance labelling called hierarchical cut 2-hop labelling (HC2L) which is hierarchical in terms of a vertex quasi-order defined by a balanced tree hierarchy. The labels of any two given vertices must contain a common hub vertex that can be found via their lowest common ancestor in the balanced tree hierarchy. We analyse upper and lower bounds on the labelling size of HC2L and devise a novel pruning strategy, called tail pruning, that allows our proposed algorithm to produce an HC2L with smaller labelling sizes without compromising query efficiency.
• We conduct extensive experiments to evaluate the performance of the proposed method on 10 large real-world road networks, including the whole road network of the USA. The experimental results demonstrate that our method is 1.5-4 times faster than the state-of-the-art approaches in terms of query processing while requiring comparable labelling construction time and up to 60% smaller labelling size.

PRELIMINARIES
Let $G = (V, E)$ be a road network where $V$ is a set of vertices and $E$ is a set of edges. Each edge $(u, v) \in E$ is associated with a positive weight $w(u, v) \in \mathbb{R}$. Given a set of vertices $S \subseteq V$, $G[S] = (S, \{(v, v') \in E \mid v, v' \in S\})$ is the induced subgraph of $G$ formed from $S$. A path is a sequence of vertices $p = (v_1, v_2, \ldots, v_k)$ where $(v_i, v_{i+1}) \in E$ for each $1 \le i < k$. The weight of a path $p$ is defined as $w(p) = \sum_{i=1}^{k-1} w(v_i, v_{i+1})$. For two arbitrary vertices $s$ and $t$, a shortest path $p$ between $s$ and $t$ is a path starting at $s$ and ending at $t$ such that $w(p)$ is minimised. The distance between $s$ and $t$ in $G$, denoted as $d_G(s, t)$, is the weight of any shortest path between $s$ and $t$. We use $N(v)$ to denote the set of direct neighbors of a vertex $v \in V$, i.e. $N(v) = \{u \in V \mid (u, v) \in E\}$, and $V(G)$ and $E(G)$ to refer to the sets of vertices and edges in $G$, respectively. Each vertex $v \in V$ is associated with a label $L(v)$. The set of labels $L = \{L(v) \mid v \in V\}$ is called a distance labelling over $G$. A vertex cut $V_C \subseteq V$ on $G$ is a subset of vertices whose removal from $G$ splits $G$ into multiple connected components. Vertices in $V_C$ are called cut vertices.
The distance query problem on road networks is defined below.
Definition 2.1 (Problem Definition). Given a road network $G = (V, E)$, the distance query problem on $G$ is to compute the distance $d_G(s, t)$ between any two arbitrary vertices $s, t \in V$.
In this work, we study labelling-based techniques to efficiently answer distance queries on road networks.

EXISTING SOLUTIONS
The fastest currently known solutions for the shortest-path distance problem on a road network compute a distance labelling $L$. Such a distance labelling usually satisfies the 2-hop cover property [15], which requires that the labels of any two vertices contain at least one common vertex on their shortest paths; it is hence called a 2-hop labelling. In the literature, there are three popular labelling techniques that exploit hierarchical structures of road networks for computing 2-hop labellings: 1) hub-based labellings [1,2], 2) highway-based labellings [4], and 3) tree-decomposition labellings [14,28]. We now discuss them in detail.

Hub-Based Labellings
Abraham et al. [1] proposed the first hub-based labelling algorithm, called Hub-based Labeling (HL), to construct labels by storing the distances from each vertex to hub vertices (i.e., vertices on shortest paths). The label of each vertex $v$ is thus a set of distance entries $\{(u_1, \delta_{vu_1}), \ldots, (u_k, \delta_{vu_k})\}$ where $\{u_1, \ldots, u_k\} \subseteq V$ and $\delta_{vu_i} = d_G(v, u_i)$. Their work was motivated by the observation that vertices visited by searches of hierarchical and reach-based algorithms [23,31] form a 2-hop labelling. It turns out that HL produces 2-hop labellings that are much smaller than the worst-case bounds [20]. Hub-based labellings are not necessarily hierarchical. Later, Abraham et al. [2] studied hierarchical hub labellings with respect to a hierarchy defined by the relationship "vertex $u$ is in the label of vertex $v$". For any hierarchical hub labelling $L$, there exists a canonical labelling, which is the smallest hierarchical hub labelling with respect to all total vertex orderings that are consistent with the hierarchy of $L$.
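To make the query cost of hub-based labellings concrete, the sketch below answers a query by merging two labels and taking the minimum over common hubs. It assumes, for illustration only, that each label is a list of (hub id, distance) pairs sorted by hub id; the function name `hub_query` and the sample labels are hypothetical.

```python
def hub_query(label_s, label_t):
    """Return min over common hubs h of d(s, h) + d(h, t), or inf if none.

    Both labels are lists of (hub_id, distance) pairs sorted by hub_id.
    """
    i = j = 0
    best = float("inf")
    while i < len(label_s) and j < len(label_t):
        hub_s, dist_s = label_s[i]
        hub_t, dist_t = label_t[j]
        if hub_s == hub_t:
            best = min(best, dist_s + dist_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best

label_s = [(1, 4), (3, 2), (7, 6)]
label_t = [(3, 5), (5, 1), (7, 0)]
```

The merge touches every entry of both labels in the worst case, which is why the label size directly bounds query time in this family of methods.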

Tree-Decomposition Labellings
Recently, Ouyang et al. [28] proposed a tree-decomposition labelling method, called Hierarchical 2-Hop Index (H2H), to exploit tree decomposition structures in road networks. The general idea is to not only construct a 2-hop labelling but also generate a hierarchy among all vertices in a road network using tree decomposition techniques [11]. Based on such a hierarchy, only a subset of the vertices in labels is visited when answering distance queries. In a later work [14], pruning techniques were proposed to further reduce the number of vertices in labels being visited at query time.

Discussion
Existing hub-based and highway-based labelling solutions search over all the distance entries in the labels $L(s)$ and $L(t)$ for a query pair $(s, t)$. Thus, in order to accelerate querying, they exploit a vertex ordering to reduce the size of their 2-hop labellings so that they can process smaller labels at query time. It is known that a vertex ordering is crucial for label construction. However, finding a 2-hop labelling of minimum size is NP-hard, and finding a vertex ordering that minimizes labelling size is difficult [5,8,15]. Thus, hub-based and highway-based labelling solutions still suffer from scalability issues, resulting in slow querying when a road network is large.
Existing tree-decomposition labelling solutions assume that a tree decomposition of a road network is given [14,28]. However, obtaining a tree decomposition with minimal tree width is a very difficult problem. Even determining whether a graph $G$ has tree width at most a given value is known to be NP-complete [6]. Sub-optimal algorithms exist [12], which can compute a tree decomposition with a time complexity of $O(|V| \cdot (w^2 + \log |V|))$ but may yield very large tree width $w$ and height $h$. Larger tree width and height make the construction of 2-hop labellings slower and produce larger labelling sizes ($|V| \cdot h$), thus resulting in slow querying. In addition, existing tree-decomposition labelling solutions use Range Minimum Query (RMQ) based algorithms [10] to compute the lowest common ancestor (LCA) of any two given vertices at query time. Although these algorithms can compute the LCA in a tree in $O(1)$ time, there are hidden constant factors and additional computational requirements to achieve this. For instance, they need to precompute a data structure that stores information about the LCA of all pairs of vertices in a tree. This data structure incurs significant computational overheads, as shown in our experiments.
In Table 3(b), H2H stores distances to all ancestors of each vertex along with the positions of the ancestors appearing in their associated nodes in $T_G$. This results in a large labelling size because vertices may appear in multiple nodes in $T_G$. Further, computing the LCA in constant time requires extra space. For instance, this extra space overhead is 4.64 GB for the whole USA (USA) dataset and 3.69 GB for the Western Europe (EUR) dataset, as shown in Table 3.

OUR SOLUTION
In this section, we present a novel labelling-based solution, referred to as the Hierarchical Cut 2-Hop Labelling (HC2L) framework, for shortest-path distance queries on road networks. Let $G$ be a road network. Our solution is to build an efficient distance scheme over $G$ with three components: a vertex hierarchy $T_G$ on $G$, a distance labelling $L$ on $G$ satisfying the 2-hop cover property, and a query function $Q$ which, given any two vertices $s, t \in V(G)$, computes $d_G(s, t)$ based on their labels in $L$ and their hierarchical positions in $T_G$.

Hierarchy Construction
To design an efficient distance scheme, the choice of hierarchy is crucial. We observe that a binary tree structure may serve as an efficient indexing scheme for road networks. This is because vertices can be indexed in such a way that the LCA of any two indexed vertices contains at least one hub vertex on their shortest paths, enabling us to find such hub vertices efficiently. It is also desirable to keep the height of such a binary tree small, i.e., to keep the tree as balanced as possible. Below, we define the balanced tree hierarchy.
Definition 4.1 (Balanced Tree Hierarchy). Let $\beta$ be a balancing parameter with $0 < \beta \le 0.5$. A balanced tree hierarchy is a binary tree $T_G = (\mathcal{N}, \mathcal{E}, \ell)$, where $\mathcal{N}$ is a set of tree nodes, $\mathcal{E}$ is a set of tree edges, and $\ell : V(G) \to \mathcal{N}$ is a total surjective function. Further, $T_G$ must satisfy the following conditions:
(1) For any internal tree node $N_i \in \mathcal{N}$, its subtrees are balanced: $|\mathrm{Left}(N_i)|, |\mathrm{Right}(N_i)| \le (1 - \beta) \cdot |\mathrm{Subtree}(N_i)|$, where $\mathrm{Right}(\cdot)$ and $\mathrm{Left}(\cdot)$ refer to the vertices mapped into the right and left subtrees of a tree node, respectively, and $\mathrm{Subtree}(\cdot)$ to the vertices mapped into the whole subtree.
(2) For any two vertices $s, t \in V(G)$, the lowest common ancestor of their tree nodes, $LCA(s, t)$, contains at least one vertex on a shortest path between $s$ and $t$.
Thus, for each tree node at the $i$-th level of $T_G$, its subtree has at most $|V(G)| \cdot (1 - \beta)^i$ vertices. This leads to the following lemma.
In the following, we introduce our algorithm for constructing a balanced tree hierarchy $T_G$. At its core, the algorithm recursively computes a balanced cut to partition a road network into smaller components, collectively named hierarchical balanced cuts. Note that this is closely related to the minimum balanced vertex separator problem, which is known to be NP-hard [19].
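Returning to the subtree-size bound stated above, the height of the hierarchy can be bounded with one line of algebra. The derivation below is a sketch under one assumption that the surviving text does not state explicitly: a tree node at level $h$ exists only if its subtree holds at least one vertex. The symbol $h$ is introduced here for illustration.

```latex
\[
  |V(G)| \cdot (1-\beta)^{h} \;\ge\; 1
  \quad\Longrightarrow\quad
  h \;\le\; \log_{1/(1-\beta)} |V(G)|
  \;=\; \frac{\ln |V(G)|}{-\ln(1-\beta)} .
\]
% Example: for beta = 1/3 the bound reads h <= log_{3/2} |V(G)|,
% so any graph with |V(G)| <= (3/2)^{58} (about 1.63e10) has h <= 58.
```

A larger $\beta$ forces more even splits and therefore a shallower tree, at the price of fewer admissible cuts to choose from.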

Hierarchical Balanced Cuts.
Let $G$ be a road network. The algorithm recursively bisects $G$ in two steps: (1) Balanced partitioning: partition an input graph $G' \subseteq G$ into two initial partitions connected via another partition referred to as a cut region; (2) Minimal vertex cuts: find a minimal vertex cut within the cut region. Each iteration of the algorithm splits $G'$ into two smaller but balanced components that contain the initial partitions, while a minimal vertex cut within the cut region ensures that the vertices in such a cut are central in $G'$, i.e., many shortest paths between vertices pass through them. This is illustrated in Figure 5(a). In the following, we discuss these steps in detail.
Balanced Partitioning. Algorithm 1 provides the pseudo-code for this approach. We start with two vertices $v_A$ and $v_B$ that are as far apart as possible (Lines 11-12), and assign to each vertex $v$ a partition weight (Line 13). Partition weights split vertices into equivalence classes.
Then, two initial partitions $P'_A$ and $P'_B$ are created by picking vertices with the lowest and highest partition weights, respectively, until the desired balancing condition (e.g. $\beta = 0.3$) is satisfied (Lines 14-15). The remaining vertices form a region called the cut region.
However, there is a complication that needs to be addressed. Suppose that we have $v_A = 2$ and $v_B = 3$ in the partition shown in Figure 5(a). The vertex 7 then constitutes a bottleneck: all vertices whose shortest paths to 2 and 3 pass through it fall into the same equivalence class, because they have the same partition weight as 7. As a result, nodes from this equivalence class are added to $P'_A$ and $P'_B$ arbitrarily, leading to large cuts. To address this issue, we detect and temporarily remove the bottleneck from the graph and repeat the process of finding a rough partition on the remaining graph (Lines 16-21). The bottleneck itself is then also added to the cut region (Line 22). As this bottleneck removal can disconnect the graph, we first compute the connected components of $G$ (Line 3). If the largest component $G_{\max}$ contains no more than $(1 - \beta) \cdot |V|$ nodes, the empty cut is already balanced (Lines 9-10).
Otherwise, we find our initial partition within $G_{\max}$ (Lines 5-7).
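The rough partitioning step can be sketched as follows. This is not Algorithm 1 itself: the partition-weight formula did not survive in the text, so the sketch assumes a weight of the form d(v_A, v) - d(v_B, v), which is consistent with the bottleneck discussion above (all vertices behind a bottleneck get the same weight) but remains an assumption. The far-apart vertices are found with a double sweep from an arbitrary start, the initial partitions simply take a $\beta$ fraction of vertices from each end of the ordering, and bottleneck removal is omitted. All names are hypothetical.

```python
import heapq

def sssp(adj, src):
    """Single-source shortest-path distances on a connected weighted graph."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def rough_partition(adj, start, beta=0.3):
    """Return (v_a, v_b, part_a, part_b, cut_region) for a connected graph."""
    d0 = sssp(adj, start)
    v_a = max(d0, key=lambda v: (d0[v], v))   # farthest from the start
    da = sssp(adj, v_a)
    v_b = max(da, key=lambda v: (da[v], v))   # farthest from v_a
    db = sssp(adj, v_b)
    # Assumed partition weight: negative near v_a, positive near v_b.
    weight = {v: da[v] - db[v] for v in adj}
    order = sorted(adj, key=lambda v: (weight[v], v))
    k = int(beta * len(order) + 1e-9)         # vertices per initial partition
    part_a = set(order[:k])
    part_b = set(order[len(order) - k:])
    cut_region = set(order[k:len(order) - k])
    return v_a, v_b, part_a, part_b, cut_region

# A path graph 0 - 1 - ... - 9 with unit weights.
path = {i: [(j, 1) for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
result = rough_partition(path, start=0, beta=0.3)
```

On the path graph the two ends are picked as the far-apart pair, three vertices go to each initial partition, and the four middle vertices form the cut region in which a vertex cut is then searched.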
It is worth noting that the choice of initial vertices only determines the initial partition, and the real optimization happens during the vertex cut step. Picking distant vertices is mainly important to ensure that the initial partitions are well separated, thus leaving enough room for a vertex cut to avoid dense clusters. We ran additional experiments (not reported) where we made multiple random choices for the arbitrary node in Line 11, then computed a partition for each choice, and picked the one with the smallest cut. This resulted in only minor reductions in labelling size (less than 5% for the graphs examined), and we concluded that further optimization of our initial vertex choices (beyond being far apart) would be unlikely to justify the resulting increase in construction time.
Minimal Vertex Cuts. Algorithm 2 shows the pseudo-code for this approach. To find a minimal vertex cut within a cut region, we cast this as a minimal $(s, t)$-vertex-cut problem by contracting the vertices in the initial partitions $P'_A$ and $P'_B$ into single vertices $s$ and $t$, respectively (Lines 3-11). Here $s$ is adjacent to a vertex $v$ in the cut region iff any vertex in $P'_A$ is, and similarly for $t$. This allows us to reduce the problem to a maximum flow problem using the well-known graph transformation technique [13] and solve it using a variant of Dinitz's algorithm [17] (Line 12).
The flow graph for the example graph in Figure 1(a), resulting from the transformation in [13], is shown in Figure 4(b). When extracting a minimal vertex cut from the maximal flow, we have two options: pick for each flow path the node closest to $s$ (amongst those that cannot be reached from $s$ in the residual graph), or the node closest to $t$ (amongst those that cannot reach $t$ in the residual graph). We evaluate both options and pick the more balanced one. For the flow graph in Figure 4(b), this means we pick {16, 5, 12} over {15, 13, 12}. Once a vertex cut $V_C$ has been found and removed from the graph, we assign connected components to either $P_A$ or $P_B$ while maximizing balance (Lines 14-15). However, if $s$ and $t$ end up with an edge between them, due to an edge between $a \in P'_A$ and $b \in P'_B$, there exists no vertex cut in the cut region. This case occurs mostly at the lower levels of a tree hierarchy. To maintain balance guarantees, we move $a$ and $b$ to the cut region, and connect them to $s$ and $t$, respectively. This ensures that one of them becomes a cut vertex and the other remains in its original partition.
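The reduction from a minimal $(s, t)$-vertex cut to maximum flow can be sketched as below. Two deliberate simplifications are made, and both are assumptions of this sketch rather than the method of Algorithm 2: augmenting paths are found by plain breadth-first search (Edmonds-Karp) instead of a Dinitz variant, and only the source-side cut is extracted rather than evaluating both options and picking the more balanced one. Each vertex is split into an "in" and an "out" copy joined by a unit-capacity arc, so that saturated arcs correspond to cut vertices.

```python
from collections import defaultdict, deque

def min_vertex_cut(edges, s, t):
    """Return a minimum set of vertices whose removal separates s from t.

    edges is an iterable of undirected (u, v) pairs; s and t must not be
    adjacent, otherwise no vertex cut exists.
    """
    nodes = {x for e in edges for x in e}
    big = len(nodes) + 1                      # stands in for infinite capacity
    cap = defaultdict(lambda: defaultdict(int))

    def add_arc(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0                        # make sure the residual arc exists

    for v in nodes:                           # split every vertex
        add_arc((v, "in"), (v, "out"), big if v in (s, t) else 1)
    for u, v in edges:
        if {u, v} == {s, t}:
            raise ValueError("s and t are adjacent: no vertex cut exists")
        add_arc((u, "out"), (v, "in"), big)
        add_arc((v, "out"), (u, "in"), big)

    src, dst = (s, "in"), (t, "out")

    def reachable():
        """BFS over arcs with positive residual capacity, recording parents."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:                               # augment until no path remains
        parent = reachable()
        if dst not in parent:
            break
        path, v = [], dst
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck

    seen = reachable()
    # A vertex is cut if its "in" copy is reachable but its "out" copy is not.
    return {v for v in nodes if (v, "in") in seen and (v, "out") not in seen}

funnel = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
two_paths = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
```

In the setting of the text, `s` and `t` would be the contracted initial partitions and the remaining vertices those of the cut region.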

Distance Preservation.
There is one issue with applying Algorithm 2. Although it can partition a graph $G$ into two balanced components $P_A$ and $P_B$ via a vertex cut $V_C$, the subgraphs of $G$ induced by $P_A$ and $P_B$ are not necessarily distance-preserving.

Definition 4.5 (Distance-Preserving Property). Let $G[P]$ denote the subgraph of $G$ induced by the vertices in a partition $P$. We say that $P$ is distance-preserving iff the following condition is satisfied: $d_{G[P]}(u, v) = d_G(u, v)$ for any two vertices $u, v \in P$. The example below illustrates the distance-preserving property.
To characterise how a violation of the distance-preserving property by a partition relates to cut vertices, we introduce the notion of a border vertex.
Definition 4.7 (Border Vertex). Let $(P_A, V_C, P_B)$ be a balanced cut on a graph $G$ and $P \in \{P_A, P_B\}$. Then a vertex $v \in P$ is a border vertex w.r.t. $P$ iff there exists an edge $(v, u) \in E(G)$ such that $u \in V_C$.
If $P$ is not distance-preserving, according to Lemma 4.8, it suffices to add shortcuts between border vertices in $P$, leading to a distance-preserving subgraph as defined below. However, not all of these shortcuts are actually needed. The following lemma characterises redundant shortcuts:
Lemma 4.11. Let $b_1$ and $b_2$ be border vertices of $P$. A shortcut between $b_1$ and $b_2$ is redundant iff one of the following conditions is satisfied: (1) …
Algorithm 3 shows the pseudo-code for adding shortcuts. We first compute $d_C(b, b')$ in $G$, that is, the minimal length of paths between two border vertices $b$ and $b'$ which pass through at least one cut vertex in $V_C$ (Line 7). The distance $d_G(b, b')$ between border vertices $b$ and $b'$ is the minimum of the lengths $d_C(b, b')$ and $d_{G[P]}(b, b')$ (Line 8). After obtaining the distances between all pairs of border vertices in $G[P]$ and $G$, we can check Conditions (i) and (ii) in Lemma 4.11 to eliminate redundant shortcuts.
Lemma 4.12. The time complexity of Algorithm 3 for adding shortcuts between border vertices is …

Labelling Construction
Now we present our algorithm for constructing a distance labelling, namely Hierarchical Cut 2-Hop Labelling (HC2L). This distance labelling is said to be hierarchical because it is constructed upon a vertex quasi-order defined by a balanced tree hierarchy $T_G$ over $G$.
Definition 4.13 (Vertex Quasi-Order). A balanced tree hierarchy $T_G$ defines a vertex quasi-order $\preceq$ on $V(G)$ such that $v_i \preceq v_j$ iff $\ell(v_i)$ is an ancestor of $\ell(v_j)$ in $T_G$, including $\ell(v_j)$ itself.
Condition (1) states that $L$ is hierarchical in terms of the vertex quasi-order $\preceq$ defined by $T_G$. Condition (2) stipulates that $L$ is a 2-hop labelling for which the distance between $s$ and $t$ can be computed from the distance entries corresponding to $LCA(s, t)$ in their labels alone.

Upper and Lower bounds.
Assume that we are given a balanced tree hierarchy $T_G = (\mathcal{N}, \mathcal{E}, \ell)$. We start with a naive labelling algorithm to construct labels. Let $L(v) = \emptyset$ for each vertex $v \in V(G)$. Then for each tree node $N_i \in \mathcal{N}$, we conduct Dijkstra's search from every $r \in N_i$ to compute the shortest-path distance $d_G(r, v)$ and add it to $L(v)$ for all $v \in \mathrm{Subtree}(N_i)$.
Indeed, the naive labelling approach provides the upper bound of the labelling size for hierarchical cut 2-hop labelling. It suffers from the following drawbacks: 1) it can produce very large labelling sizes for large road networks such as the whole USA and Central Europe road networks, and 2) queries may perform unnecessary computations. The question arises: can we prune distance entries to reduce labelling size and accelerate queries? To answer this question, we first exploit a new labelling property, called the cut cover property.
Definition 4.16 (Cut Cover). Let $V_C \subseteq V(G)$ be a vertex cut on $G$. Then, for any vertex $v \in V(G) \setminus V_C$ and any cut vertex $r \in V_C$, $(r, \delta_{vr}) \in L(v)$ iff there is no other cut vertex $r' \in V_C$ satisfying: $d_G(v, r) = d_G(v, r') + d_G(r', r)$.
Assume that the distances between any two cut vertices in any vertex cut of $T_G$ are known, denoted as $D_C$. The cut cover property ensures that the label of every non-cut vertex contains the distance information to cut vertices, either directly or indirectly through another cut vertex. Thus, the cut cover property relates to the lower bound of the labelling size for hierarchical cut labelling.
Lemma 4.17. Let $L^*$ be a hierarchical cut 2-hop labelling satisfying the cut cover property. Then the intersection of all hierarchical cut 2-hop labellings of $T_G$ is $L^*$, i.e., $r \in L^*(v)$ iff $r \in L(v)$ for all hierarchical cut 2-hop labellings $L$ of $T_G$.
If a hierarchical cut 2-hop labelling $L$ satisfies the cut cover property, then $L$ is minimal, i.e., if any distance entry is removed from $\{L(v)\}_{v \in V \setminus V_C}$, then there must exist two vertices $s, t \in V(G)$ whose distance $d_G(s, t)$ cannot be computed from $\{L(v)\}_{v \in V \setminus V_C}$ and $D_C$. Note that not all hierarchical cut 2-hop labellings can satisfy the cut cover property, because it does not guarantee a 2-hop labelling. Below we introduce a new labelling method, called tail pruned labelling, which guarantees that: 1) the resultant labelling is still 2-hop, and 2) the labelling size is small and within the upper and lower bounds, thereby enabling efficient querying.
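The cut cover condition of Definition 4.16 can be checked mechanically once the relevant distances are known. The sketch below keeps, for one non-cut vertex, exactly those entries whose cut vertex is not reached through another cut vertex. The input format, the function name `cut_cover_entries` and the sample distances are assumptions made for illustration; exact equality is used, so integer weights are assumed.

```python
def cut_cover_entries(dist_to_cut, cut_dist):
    """Return the entries {r: d(v, r)} that the cut cover property keeps.

    dist_to_cut maps each cut vertex r to d(v, r) for a fixed vertex v;
    cut_dist maps ordered pairs (r1, r2) of distinct cut vertices to d(r1, r2).
    """
    kept = {}
    for r, d_vr in dist_to_cut.items():
        covered = any(
            d_vr == d_vq + cut_dist[(q, r)]   # a shortest path to r runs via q
            for q, d_vq in dist_to_cut.items()
            if q != r
        )
        if not covered:
            kept[r] = d_vr
    return kept

# Hypothetical cut {a, b, c} seen from some vertex v.
dist_to_cut = {"a": 2, "b": 5, "c": 4}
cut_dist = {
    ("a", "b"): 3, ("b", "a"): 3,
    ("a", "c"): 3, ("c", "a"): 3,
    ("b", "c"): 2, ("c", "b"): 2,
}
```

Here the entry for `b` is dropped because 5 = 2 + 3 routes through `a`; its distance can still be recovered from the entry for `a` together with the known distances among cut vertices.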

Tail Pruned Labelling.
Given a balanced cut $(P_A, V_C, P_B)$ on a graph $G = (V, E)$, we rank each cut vertex $r \in V_C$ based on the following equation: … Here, the condition on $(r, r', v)$ denotes that $r'$ lies on a shortest path between $r$ and $v$, i.e., $d_G(r, v) = d_G(r, r') + d_G(r', v)$. Ties are broken arbitrarily to obtain a total ordering. Intuitively, vertices in a vertex cut are assigned a rank in terms of how frequently they are hit by other cut vertices on their shortest paths to other vertices. We then prune as follows. Condition (1) ensures that a 2-hop labelling remains after pruning, and Condition (2) allows us to omit vertex identifiers in labels.
Algorithm 5 shows the pseudo-code for the tail pruned labelling approach. A modified version of Dijkstra's algorithm is shown in Algorithm 4, which computes the distances from a given (cut) vertex $r$ and tracks whether a shortest path from $r$ to a vertex $v \in V$ passes through another (cut) vertex in a given set $C$. In Algorithm 5, we compute the ranks of cut vertices as per Equation 6 (Lines 2-5). Afterwards, Algorithm 4 is invoked with $C$ containing only cut vertices of lower ranks, allowing Condition (1) of Definition 4.18 to be checked (Line 7). Finally, the list of distance values is tail-pruned in accordance with Definition 4.18 (Lines 8-10).
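The final pruning step can be pictured as trimming a suffix of a distance array. The sketch below assumes that each array lists distances to the cut vertices of one cut in rank order, and that a parallel list of flags records whether the corresponding shortest path passes through another cut vertex of the kind tracked by Algorithm 4. Only a trailing run of flagged entries is dropped, so the position of each remaining value still identifies its cut vertex. The function name and the flags in the example are hypothetical; the exact pruning conditions of Definition 4.18 did not survive in the text.

```python
def tail_prune(distances, covered):
    """Drop the longest suffix of entries whose flag in `covered` is True."""
    keep = len(distances)
    while keep > 0 and covered[keep - 1]:
        keep -= 1
    return distances[:keep]

# Flagged entries in the tail are removed ...
pruned_tail = tail_prune([2, 2, 3], [False, False, True])
# ... but a flagged entry followed by an unflagged one must stay,
# because removing it would shift the positions of later entries.
kept_middle = tail_prune([3, 1, 1], [True, False, False])
```

This positional encoding is what allows labels to store distance values only, without vertex identifiers.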
We may construct a hierarchical cut 2-hop labelling by applying the pruned landmark labelling (PLL) method [5], which conducts a pruned breadth-first search (BFS) from each cut vertex. However, PLL does not suit our data structure design for labelling, because the way PLL reduces distance entries still requires a full scan of the distance arrays (discussed in the next paragraph) in the labels of the vertices in a distance query. As a result, the query performance deteriorates in comparison with our tail pruned labelling method.
We implement an efficient data structure to leverage the advantages of our balanced tree hierarchy $T_G$. Since $T_G$ is a binary tree, we represent each tree node using a bitstring of length equal to its level (its distance to the root) in $T_G$. When $\beta = 1/3$ and $G$ contains no more than $(2/3)^{-58} \approx 16.3$ billion vertices, bitstrings (including their 6-bit length) can be stored as 64-bit integers. Furthermore, unlike other approaches, we only store distance values in labels, which reduces storage requirements by half (for vertex identifiers and distance values of equal size). Thus, the label of each vertex is a list of distance arrays, each corresponding to a vertex cut. This enables us to search only a reduced set of hub vertices from exactly one vertex cut in the labels of any query pair.
Before constructing labels, we contract the graph by repeatedly removing degree-one vertices. Most shortest paths from a degree-one vertex to other vertices pass through its closest vertex in the contracted graph, called its root. For each degree-one vertex we store its distance to its root, as well as a reference to the root. This allows us to compute the distance between two contracted vertices by following the references and adding the stored distances to the distance between their roots.
However, this approach only works when the roots of the two vertices lie on all (shortest) paths between them, which may not be the case when they have the same root. To deal with this case efficiently, we observe that contracted vertices with the same root form a tree (with their common root designated as its root), and store for each contracted vertex its tree parent. This allows us to follow the paths from the two vertices to their lowest common ancestor in the tree using reference information alone, and to compute their distance as the sum of their distances to this ancestor. A similar approach is taken by PHL in [4]. The difference is that PHL only prunes vertices that have degree one in the original graph. This ensures that all paths between different vertices must pass through their roots, but reduces contraction effectiveness (from ∼30% to ∼20% for the graphs we experimented on).
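The same-root case can be illustrated with a short Python sketch. It assumes two hypothetical dictionaries, `parent` (tree parent of each contracted vertex, with the root absent) and `dist_to_root` (stored distance to the root, 0 for the root itself); these names and the ancestor-set lookup are illustrative choices, not the paper's implementation.

```python
def tree_distance(u, v, parent, dist_to_root):
    """Distance between two vertices that hang off the same root.

    Walks u up to the root, then walks v up until it meets that path;
    the meeting point is the lowest common ancestor a. In a tree,
    d(u, v) = d(u, a) + d(v, a)
            = dist_to_root[u] + dist_to_root[v] - 2 * dist_to_root[a].
    """
    ancestors = set()
    x = u
    while x is not None:
        ancestors.add(x)
        x = parent.get(x)       # None once the root has been reached
    a = v
    while a not in ancestors:
        a = parent[a]
    return dist_to_root[u] + dist_to_root[v] - 2 * dist_to_root[a]

# Hypothetical contracted tree: r -2- a, a -3- b, a -4- c, r -1- d
parent = {"a": "r", "b": "a", "c": "a", "d": "r"}
dist_to_root = {"r": 0, "a": 2, "b": 5, "c": 6, "d": 1}
```

Here `b` and `c` meet at `a` without ever reaching the root, which is exactly the situation in which simply adding the two root distances would overestimate the result.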

Query Processing
Now we describe how our query function efficiently answers distance queries on a road network G, given a balanced tree hierarchy and a hierarchical cut 2-hop labelling.
Given a query pair (s, t) with s, t ∈ V(G), we process the query in two steps: (1) we compute the lowest common ancestor LCA(s, t) of the tree nodes ℓ(s) and ℓ(t) in the balanced tree hierarchy (Line 2); (2) we compute the distance between s and t using Equation 7, based on the information in LCA(s, t) and the labels of s and t.
As distances to cut vertices are organized by level, we only need to find the level of LCA(s, t), i.e., the length of its identifying bitstring. This can be computed as the number of leading zeros of the XOR of the bitstrings of s and t; both operations are natively supported by most CPUs and thus extremely fast.
For example, the distances between 14 and the cut vertices {12, 5, 16} are [2, 2, 3], which are found as the first distance array in the label of 14, with the last distance value 3 removed by tail-pruning. Similarly, distances between 15 and {12, 5, 16} are found to be [3, 1, 1]. We then compute d_G(14, 6) as min(2+3, 2+1), ignoring the last distance value 1 in [3, 1, 1] as its counterpart was pruned.
Lemma 4.21. Given a balanced tree hierarchy and a query pair (s, t), LCA(s, t) can be obtained in O(1) time.
Lemma 4.22. Given a distance query (s, t), the distance d_G(s, t) can be computed using Equation 7 in time linear in the size of the largest cut in the balanced tree hierarchy.
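The two query steps can be sketched in Python as follows. The sketch assumes a hypothetical 64-bit layout (path bits left-aligned in the upper 58 bits, length in the lower 6 bits) and represents a label as a list of distance arrays indexed by level; the helper names `encode`, `lca_level`, and `query_distance` are illustrative and not taken from the paper.

```python
PATH_BITS = 58
LEN_BITS = 6

def encode(bits: str) -> int:
    """Assumed layout: left-aligned path in the high 58 bits, length in the low 6."""
    path = (int(bits, 2) if bits else 0) << (PATH_BITS - len(bits))
    return (path << LEN_BITS) | len(bits)

def lca_level(code_s: int, code_t: int) -> int:
    """Level of the LCA = length of the common prefix of the two bitstrings."""
    len_s = code_s & 63
    len_t = code_t & 63
    diff = (code_s ^ code_t) >> LEN_BITS          # XOR of the path bits only
    leading_zeros = PATH_BITS - diff.bit_length() # count within the 58-bit field
    return min(leading_zeros, len_s, len_t)

def query_distance(label_s, label_t, level):
    """Minimum over hubs of the LCA cut; zip stops at the shorter (tail-pruned) array."""
    pairs = zip(label_s[level], label_t[level])
    return min((a + b for a, b in pairs), default=float("inf"))

# Distance arrays from the worked example in the text, placed at level 0:
label_a = [[2, 2]]        # third entry removed by tail-pruning
label_b = [[3, 1, 1]]
result = query_distance(label_a, label_b, 0)   # min(2+3, 2+1)
```

In a compiled implementation the leading-zero count would typically map to a single CPU instruction; `int.bit_length` plays that role here.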

Parallelization
To speed up construction, we use multi-threading to parallelise certain tasks. Whenever we partition a (sufficiently large) subgraph, we create a new thread in order to process the two partitions in parallel. Within each thread, we further parallelise the distance computations for labels, shortcuts, and pruning, i.e., we perform a Dijkstra search from each cut (or border) node. These searches can easily be carried out in parallel, with workloads being practically identical. Together, these tasks account for the majority of the labelling construction process. We must note, however, that effective parallelization is limited to a handful of physical threads, as a significant part of the work (>10%) is not, or not fully, parallelised, mainly due to early cut computations.
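The independent per-cut-vertex searches can be sketched with Python's standard thread pool. This only illustrates the structure of the parallel step: the function names and the `workers` parameter are assumptions, and because of CPython's global interpreter lock this sketch would not show the speed-up that a native multi-threaded implementation obtains.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def dijkstra(graph, source):
    """Plain Dijkstra on {u: [(v, weight), ...]} with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def distances_from_cut(graph, cut, workers=4):
    """Run one search per cut vertex; the searches share no mutable state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: dijkstra(graph, c), cut))
    return dict(zip(cut, results))

# Hypothetical path graph a -1- b -2- c with cut vertices a and c.
path = {"a": [("b", 1)], "b": [("a", 1), ("c", 2)], "c": [("b", 2)]}
table = distances_from_cut(path, ["a", "c"])
```

Since each search only reads the shared graph and writes to its own dictionary, no locking is needed in this arrangement.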

EXPERIMENTS
Hardware and Platform. All the experiments are performed on a Linux server with an Intel Xeon W-2175 CPU at 2.50GHz, 28 cores, and 512GB of main memory. All the algorithms were implemented in C++11 and compiled using g++ 9.4.0 with the -O3 option.
Datasets. We use 10 undirected real road networks; nine of them are from the US and publicly available at the webpage of the 9th DIMACS Implementation Challenge [16], and one is from Western Europe, managed by PTV AG [3]. Table 1 summarizes these datasets, of which the largest is the whole road network of the USA. Further, we consider two versions of each dataset, which differ in the unit of edge weights, i.e., distances and travel times.
Baseline Methods. We compared our proposed algorithm HC2L with four state-of-the-art algorithms for shortest-path distance queries in road networks: 1) Hub Labelling (HL) [2], 2) Pruned Highway Labelling (PHL) [4], 3) Hierarchical 2-Hop Labelling (H2H) [28], and 4) Projected Vertex Separator Based 2-Hop Labelling (P2H) [14].
The code for H2H, PHL and HL is publicly available and implemented in C++. Unfortunately, we could not obtain the implementation of P2H, so we report only the subset of experimental results presented in [14]. The authors of [14] implemented P2H in C++ and conducted their experiments on a machine with a Quad Intel(R) Xeon(R) Platinum 8160 24-core @ 2.10GHz CPU and 768 GB RAM, running CentOS Linux 7. We use the parameter settings suggested by the authors of these methods, unless otherwise stated. We select a balance partition threshold of 0.2 for HC2L. Furthermore, we set the number of threads equal to the available 28 cores.
Benchmark Generation. To evaluate the query performance, we randomly sampled 1,000,000 pairs of vertices from all pairs of vertices in each road network, i.e., V × V. We also evaluate the query performance of the algorithms by varying the distance between the source and target vertices in a query, similar to [28,29]. Specifically, for each road network, we generate 10 sets of queries Q_1, Q_2, ..., Q_10 as follows: we set l_min to be 1000 meters, and set l_max to be the maximum distance of any pair of vertices in the map. Let x = (l_max / l_min)^(1/10). For each 1 ≤ i ≤ 10, we generate 10,000 queries to form each set Q_i, in which the distance between the source and target vertices of each query falls in the range (l_min · x^(i-1), l_min · x^i]. For each algorithm, we report the average query processing time.
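The geometric distance ranges of the ten query sets can be computed with a few lines of Python. This is a sketch under the assumption that set i covers the half-open range from l_min · x^(i-1) (exclusive) to l_min · x^i (inclusive); the function names and the example value of `l_max` are hypothetical and not taken from any dataset.

```python
def distance_buckets(l_min, l_max, k=10):
    """Return k (low, high] ranges whose bounds grow by the factor x."""
    x = (l_max / l_min) ** (1.0 / k)
    return [(l_min * x ** (i - 1), l_min * x ** i) for i in range(1, k + 1)]

def bucket_index(distance, l_min, l_max, k=10):
    """1-based index of the query set a distance falls into, or None."""
    for i, (low, high) in enumerate(distance_buckets(l_min, l_max, k), start=1):
        if low < distance <= high:
            return i
    return None

# Made-up maximum distance chosen so that x is (numerically close to) 2.
buckets = distance_buckets(1000, 1_024_000)
```

With these made-up values each range is twice as wide as the previous one, so the short-distance sets cover far narrower intervals than the long-distance sets.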

Performance Evaluation
We compare the performance of our proposed algorithm with the baseline methods in terms of query time, labelling size, and construction time. The experimental results are presented in Table 2, Table 4, and Figure 6.
5.1.1 Querying Time. In Table 2, we report for each method and dataset the average query time over 1 million random distance queries, where edge weights are distances. We confirm that HC2L is the fastest on all the datasets. In most cases, HC2L is 1.5-2.5 times faster than HL, 2-3 times faster than H2H, and 2-4 times faster than PHL. This is because we process a significantly smaller number of distance entries in the labels of a query pair when computing their shortest-path distance. Table 4 reports the results when we consider travel times, rather than distances, as edge weights for each dataset. We notice that the average query times of almost all the methods are lower than their corresponding results in Table 2.
The reason for this speedup lies in the reduced labelling sizes, which will be further discussed in Section 5.1.2. Evidently, HC2L is still the fastest on all the datasets. Table 3 shows the average number of hubs for which the sum of distances is computed when evaluating Eq. (1) or Eq. (7). Compared to H2H, HC2L produces minimal cuts which are very small in practice, as shown in Table 5, and thus the search space of HC2L for distance queries is significantly reduced. The query time of PHL and HL depends on the label sizes of a query pair. We can see from Table 3 that the average numbers of hubs of PHL and HL are much larger than that of HC2L, which makes them significantly slower than HC2L. Furthermore, PHL is generally slower than HL as it employs highways instead of vertices as hubs, making distance computation more complex than simply adding two distance values.
Varying Distance Querying. In Figure 6, we report results for distance query sets containing query pairs with varying distances to test the performance of all the algorithms. We can clearly see that HC2L significantly outperforms all the baseline methods in every query set. In particular, HC2L performs well for both short and long distance query sets. Regardless of distance, only the cut vertices of the LCA of the query vertices need to be considered as hubs. Vertex cuts tend to be smaller at lower levels of the hierarchy, thus making local queries generally faster. Exceptions to this exist, though: the NY dataset has a top-level cut of size 5, making distant queries very fast as well. Additionally, queries involving top-level cuts enjoy better caching, due to the memory layout of labels. H2H and HL show similar behaviour. In contrast, PHL has rather poor query performance for local queries; this happens because it optimizes its index for high-speed highways, which are less likely to intersect with shortest paths between local vertex pairs. Optimizing indexes for real-world workloads is an interesting problem that often requires specific algorithmic designs, as discussed in [36].
5.1.2 Labelling Size. Table 2 shows that the labelling sizes produced by our proposed method HC2L are significantly smaller than those of the baseline methods when indexing for distance queries. In most cases, the labelling size of HC2L is about 2-4 times smaller than H2H, 2-3 times smaller than PHL and 1-2 times smaller than HL. The only exception is the EUR dataset, for which HL produces a smaller labelling. Despite this, HC2L still beats HL in terms of query time on EUR. This is because HC2L only uses a fraction of labels during query answering, although the gap does become smaller.
We have also analysed, in Table 3, the extra labelling-size overhead that H2H and HC2L incur in order to find the LCA in constant time. It shows that the overheads incurred by H2H are about 20 times greater; this happens because the tree hierarchies of H2H are neither binary nor balanced, thus requiring a different indexing approach. We found that, without tail pruning, index sizes grow by 10-15%, but construction time is reduced by around 20%. Table 4 shows labelling sizes for datasets using travel times as edge weights. Here HC2L produces slightly smaller labelling sizes than the corresponding labelling sizes in Table 2 and H2H is roughly the same, but for PHL and HL labelling sizes are reduced significantly when changing edge weights from distances to travel times. The reason for this is that PHL and HL can exploit better orderings using travel times as edge weights, which leads to better pruning. Further, we notice that HL pruning is more effective than the tail pruning approach employed by HC2L. This is because HL pruning affects the whole vertex ordering, while the tail pruning by HC2L only affects the vertex ordering within each cut; thus, one can expect HL to perform better in the cases where a large percentage of labels can be pruned.

Construction Time.
In Table 2 we compare the construction time of our method HC2L with the baseline methods when indexing for distance. We observe that the single-threaded implementation of HC2L is slower than H2H, faster than PHL and comparable to HL, but the parallel variant of our method, denoted by HC2L^p, significantly outperforms the baseline methods on all the datasets. The construction time of our algorithms includes both the time for obtaining a balanced tree hierarchy and the time for constructing a hierarchical cut labelling. When considering travel times as edge weights, as shown in Table 4, PHL and HL are significantly faster in construction than when considering distances as edge weights in Table 2; nonetheless, these methods are still slower than our parallel implementation. This reduction in construction time is consistent with the decrease in labelling size due to better pruning, as discussed earlier.

Analysis of the Proposed Approaches
We evaluate how the balance threshold, cut size and tree height affect the performance of our method. Figure 7 shows the average query times and cut sizes under varying balance thresholds on all the datasets. We can observe that whenever the cut size increases or decreases under a particular threshold, the query time under that threshold follows a similar trend. The average query times decrease or remain the same on the majority of datasets when we increase the threshold from 0.15 to 0.20, and appear to increase for thresholds over 0.20. This also aligns well with the cut sizes, and using the larger balance threshold 0.20 leads to more balanced trees.
In Table 5, we also report the height and maximum cut size/width of our method HC2L with a balance threshold of 0.20 against the baseline methods H2H and P2H. It is clear that HC2L produces much smaller cuts and has significantly smaller heights for all datasets compared to H2H and P2H. This is because our method is based on novel techniques to find balanced partitionings and small cuts, which result in a balanced binary tree with a very small height. In contrast, H2H and P2H use heuristic techniques to decompose a road network into a tree, which may result in very large heights and widths in practice, as shown in Table 5.

Extension to Directed Graphs
Our approach can be extended to directed graphs by storing distances for both directions in each label. We can compute them by performing searches in both directions when adding shortcuts as well as during label construction. To compute vertex cuts, we can treat all edges as undirected to ensure that the cuts separate paths in both directions. However, road networks tend to be almost undirected (with some famous exceptions such as Stockholm), in which case the two distances stored within each label will frequently be identical; this can be exploited for optimizations.

Remarks
Below we comment on some possible extensions and open problems related to the method proposed in this work:
• Although our parallel implementation reduces construction time to less than one hour for all graphs, further increasing the number of cores has a limited impact on performance. This is because most cores would remain poorly utilized. In particular, better utilization requires the vertex cut computation to be parallelizable, which seems tricky to do and is outside the scope of this paper.
• In dynamic settings, e.g., due to roads being temporarily closed or suffering from slower travel speeds, we need to update our labellings, preferably without recomputing them from scratch. Our balanced tree hierarchy construction does not depend on edge weights, except for shortcuts. This should enable us to preserve a balanced tree hierarchy (with some adjustments for shortcuts) and limit updates only to distance values. This is in contrast to existing approaches such as HL or PHL, which rely on edge weights for node or highway ordering.
• Existing labelling-based methods for distance queries on road networks still require labellings of sizes larger than the original graph. Thus, how to reduce the labelling sizes needed for distance queries on road networks without compromising query performance remains an open problem.

CONCLUSION
In this paper, we analyse the drawbacks of the current state-of-the-art labelling-based solutions in order to exploit hierarchical structures of road networks for efficiently answering distance queries. We propose a novel solution called hierarchical cut 2-hop labelling (HC2L) to overcome the drawbacks of existing solutions. Our proposed solution uses a novel balanced tree hierarchy to find a partial vertex order, which helps in significantly reducing labelling size and in selecting a small subset of labels for a distance query pair. Accordingly, query processing is accelerated, regardless of the distance between the node pairs queried. As demonstrated experimentally, index size and construction time are competitive as well. For future work, we plan to investigate dynamic updates to our index structure, and efficient algorithms for the balanced minimal S-T cut problem.

Figure 3: Tree decomposition labelling: (a) a tree decomposition of the example road network G shown in Figure 1(a), (b) H2H-Index.

Figure 3(a) illustrates a tree decomposition of the road network shown in Figure 1(a), while Figure 3(b) shows the H2H index. The label of each vertex stores a position array and a distance array. Consider two vertices 7 and 13. The label of vertex 7 stores a position array [1, 5, 7], which represents the positions (depths) of the ancestors {14, 9, 16} inside the node associated with vertex 7 in the tree decomposition shown in Figure 3(a), and a distance array [2, 3, 3, 2, 1, 1, 0], which represents the distances from 7 to all the ancestors of vertex 7 in the tree decomposition.

Figure 5: An illustration of: (a) Hierarchical balanced cut, (b) Balanced tree hierarchy, and (c) Hierarchical cut 2-hop labelling, for the road network illustrated in Figure 1(a).

Figure 6: Distance query performance under varying distances for all the datasets with distances as edge weights.

Table 1: Summary of datasets.

Table 2: Comparison of query times, labelling sizes and construction times with distances as edge weights between our method, i.e., HC2L, and the baseline methods. HC2L^p represents HC2L with parallelism.

Table 3: Comparing LCA storage requirements and average hub size by HC2L and the baseline algorithms for all the datasets with distances as edge weights.

Table 4: Comparison of query times, labelling sizes and construction times with travel times as edge weights between our method and the baseline methods.

Table 5: Comparing tree height and maximum cut size on all the datasets with distances as edge weights.