Generalized Matching Distance: Tumor Phylogeny Comparison Beyond the Infinite Sites Assumption

As the field of tumor phylogenomics matures, numerous methods have been developed to infer tumor phylogenies from many types of sequencing data. The tumor phylogenies being inferred have transitioned from abiding strictly by the Infinite Sites Assumption (ISA), which says that mutations are gained once and never lost, to more relaxed and more biologically accurate models such as the Camin-Sokal or k-Dollo models, which allow mutations to be gained or lost multiple times, respectively. Although the tumor phylogenies being inferred have become more attuned to the underlying biology of cancer, methods for comparing, or computing distances between, these phylogenies have not yet caught up. To address this discrepancy, we propose the Generalized Matching Distance (GMD) Problem, which allows ISA distance measures to be applied to non-ISA phylogenies after a particular type of transformation. We provide a simple but effective exact solution to the GMD Problem that is efficient enough for many tumor phylogenies. We also provide a heuristic approach to solving the GMD Problem for instances where our exact solution is not appropriate. In our simulated experiments, we show that by using our approach to solve the GMD Problem we can effectively use ISA tumor distance measures to compare phylogenies with parallel mutations (those that are gained multiple times). Additionally, we show that our heuristic approach works well on a subset of phylogenies under the Camin-Sokal and k-Dollo models. Finally, we apply our method for solving the GMD to three tumor phylogenies generated from a colorectal cancer patient. The data for our experiments and the code for using GMD are available at: https://bitbucket.org/oesperlab/gmd/src/master/


INTRODUCTION
The clonal theory of cancer [21] describes how tumors grow as the result of an evolutionary process. As descendants of the original founder cell acquire new somatic mutations, the resulting cell populations may proliferate even more quickly, leading to a tumor that is a heterogeneous collection of different cell populations. There has been a strong interest in the computational cancer research community in designing computational methods that are able to reconstruct the evolutionary history of an individual's cancer as a particular type of rooted tree [3]. The vertices in this tree represent distinct populations of cells, with unique complements of somatic mutations, that exist or existed at some point during the tumor's evolution. The edges in the tree represent ancestral relationships between those cell populations. Being able to accurately identify such a tree has potential implications both for improving a general understanding of how tumors evolve and for determining how a particular patient might be best treated [2,20,22].
In recent years many new methods have been developed to infer the tree encoding a tumor's evolution from various types of sequencing data. See [11,25,28] for good reviews of such approaches and the challenges they face. As the number of methods for inferring a tumor's evolutionary history from DNA sequencing data has grown, there has been recent interest in how to appropriately compare, or compute a distance between, two such trees. In addition to being an interesting problem in its own right, there are certain applications that would benefit immensely from such distance measures. In particular, such distance measures are essential for benchmarking the performance of novel phylogenetic inference methods on simulated data, as their output needs to be compared to known ground truth trees [6].
Following the trends that appeared in the tumor phylogenetic tree inference space, the first distance measures developed for tumor evolutionary trees assumed the input trees adhered to the infinite sites assumption (ISA), which states that any mutation is only gained once and never lost [16]. Early such distance measures include parent-child and ancestor-descendant distances [9,10], which capture information on the number of parent-child (ancestor-descendant) relationships in one tree but not the other. Since then, more distance measures designed specifically to capture features important in the development of cancer have been developed, but these still assume the input trees adhere to the ISA. This includes CASet and DISC [6], which both aim to capture aspects of how mutations are inherited in subsequent populations when computing the distance between two trees. This also includes Bourque distances [12], which generalize the Robinson-Foulds distance [23] from traditional phylogenetics to be applicable to tumor evolution trees. Finally, another related distance measure that assumes ISA trees is MLTED/MLTD [13,14], which, in contrast to other distances, considers two trees that could represent the same mutational history, but at different levels of resolution, as identical.
This work is licensed under a Creative Commons Attribution International 4.0 License.
In response to recent studies that suggest that the ISA is not appropriate in many cancers [18], the number of methods that can infer tumor evolution trees that don't adhere to the ISA has been increasing. These methods often utilize a different model of tumor evolution such as the $k$-Dollo model [7], which allows each mutation to be lost up to $k$ times, or the Camin-Sokal model [4], which allows each mutation to be gained multiple times. Some recent inference methods of this kind include SiCloneFit [27], SCARLET [24], MEDICC2 [15], the recent preprint FiMO [1], and many others. Correspondingly, there is growing interest in the design of distance measures that can handle input trees that don't adhere to the ISA. However, to our knowledge, the only existing tumor tree distance measure with this capability is MP3 [5], which generalizes the classical phylogenetic concept of using rooted triplets for determining similarity and applies it to tumor phylogenetics. This method is also designed to handle multiply occurring (parallel) mutations and losses of mutations. However, given the number of existing distance measures that capture different features of tumor evolution but rely on the ISA, it would be advantageous to have a way to use these measures on non-ISA trees, rather than wait for new distance methods to be designed.
In this paper, we address the need for a diverse set of tumor distance measures that don't assume the ISA. We introduce a framework that enables all non-ISA tumor phylogenetic trees to be converted into ISA trees without loss of information. Existing ISA-dependent distance measures can then be applied to these transformed trees. Specifically, we propose the Generalized Matching Distance (GMD) Problem and describe both an exact algorithm and a heuristic approach to solving it. Our approaches are applicable to any input trees that allow for multiple gains and/or losses of mutations. On simulated data we demonstrate the effectiveness of our exact approach for solving the GMD when applied to trees with parallel mutations (those that are gained multiple times). We also demonstrate the effectiveness of our heuristic approach when input trees contain either parallel mutations or losses of mutations. Finally, on both simulated data and a real colorectal cancer data set [19] we demonstrate the ability of our approaches to effectively measure the distance between trees while maintaining important properties of the original ISA-dependent distance measures.

Tumor Phylogenies
We consider a tumor that contains $m$ mutations. We won't distinguish what types of genomic alterations these may be (SNV, CNA, etc.). We model the presence or absence of a mutation as a binary character where 1 represents the presence of the mutation and 0 indicates its absence. Thus any cell in the tumor may be described using a binary mutation vector $b \in \{0, 1\}^m$ whose $i$-th entry, $b(i)$, indicates the state of mutation $i$ in the cell. A clone is a collection of cells with identical mutation vectors. We can now describe the history of a tumor as a tumor phylogeny $T$ where vertices represent clones that either currently exist or previously existed at some point during the tumor's evolution and directed edges represent the direct ancestral relationships between those clones. We note that inherently all edges are directed away from the root.
Definition 2.1. A tumor phylogeny $T$ is a rooted tree with the following conditions: (1) Each vertex $v$ is labeled with a binary mutation vector $b_v \in \{0, 1\}^m$ indicating the mutations present in that clone. (2) Tumors evolve from a healthy cell (without mutations), so the root $r$ is labeled with the vector $b_r = [0, \ldots, 0]$.
We define an edge $(u, v)$ in $T$ as a gain edge for mutation $i$ if $b_u(i) = 0$ and $b_v(i) = 1$. Similarly, we define an edge $(u, v)$ in $T$ as a loss edge for mutation $i$ if $b_u(i) = 1$ and $b_v(i) = 0$.
Note that with these definitions a single edge $(u, v)$ may be a gain edge for mutation $i$ and also a loss edge for mutation $j$. A tumor phylogeny $T$ can be given one of two different categorizations depending on the number of distinct mutations that are gained or lost on the edges in $T$. Specifically, $T$ is called a mutation phylogeny if for any edge $e$ in $T$ there exists exactly one mutation $i$ such that $e$ is either a gain or loss edge for $i$. Otherwise, there must exist some edge $e$ in $T$ where more than one mutation is either gained or lost. In that case, we refer to the tumor phylogeny $T$ as a clonal phylogeny. Intuitively, a clonal phylogeny arises when some mutations cannot be ordered and are instead clustered together. Finally, we note that we may use the terms phylogeny and tree interchangeably.
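The gain-edge and loss-edge definitions, and the distinction between mutation and clonal phylogenies, can be illustrated with a minimal sketch. The representation below (a `parent` dictionary plus a `vectors` dictionary of 0/1 tuples) and the function names are assumptions made for illustration only; they are not taken from the GMD code base.

```python
def edge_events(parent, vectors):
    """Map each edge (u, v) to (gained, lost) lists of mutation indices."""
    events = {}
    for v, u in parent.items():
        if u is None:  # the root has no incoming edge
            continue
        pairs = list(zip(vectors[u], vectors[v]))
        gained = [i for i, (a, b) in enumerate(pairs) if (a, b) == (0, 1)]
        lost = [i for i, (a, b) in enumerate(pairs) if (a, b) == (1, 0)]
        events[(u, v)] = (gained, lost)
    return events


def is_mutation_phylogeny(parent, vectors):
    """True if every edge gains or loses exactly one mutation."""
    return all(len(g) + len(l) == 1
               for g, l in edge_events(parent, vectors).values())


# Hypothetical toy tree r -> a -> b -> c over three mutations (0, 1, 2).
parent = {"r": None, "a": "r", "b": "a", "c": "b"}
vectors = {"r": (0, 0, 0), "a": (1, 0, 0), "b": (1, 1, 0), "c": (0, 1, 1)}

events = edge_events(parent, vectors)
clonal = not is_mutation_phylogeny(parent, vectors)
```

In this toy tree the edge (b, c) is a gain edge for mutation 2 and a loss edge for mutation 0 at the same time, so the tree is a clonal phylogeny rather than a mutation phylogeny.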

Models of Tumor Evolution
The mutational history of a tumor is generally not as permissive as our definition of a tumor phylogeny. Instead, we often need to apply a model of evolution that further constrains how mutations are gained or lost. We now identify two types of tumor phylogenies that adhere to two different existing models of evolution.
The $k$-Dollo model allows each mutation to be gained exactly once but lost up to $k$ times [7]. Formally, a $k$-Dollo phylogeny is a tumor phylogeny in which each mutation has exactly one gain edge and at most $k$ loss edges.
We note that a 0-Dollo phylogeny represents a special case called the Infinite Sites Assumption (ISA) [16], where mutations are gained but never lost. The ISA model has been used extensively in the field of tumor evolution as it provides helpful constraints for inferring tumor phylogenies (e.g., [8]). In recent years there has been a growing interest in relaxing the ISA [18], and in particular the $k$-Dollo model has been shown to be a useful alternative.
The Camin-Sokal model allows mutations to be gained any number of times, but never lost [4]. This model has also been shown to be a useful alternative to the more restrictive ISA [16]. Let $k$-Camin-Sokal denote the restriction of the Camin-Sokal model where each mutation can be gained at most $k$ times. Formally, a $k$-Camin-Sokal phylogeny is a tumor phylogeny in which each mutation has at most $k$ gain edges and no loss edges.
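As a rough illustration of the two models, the sketch below counts gain and loss edges per mutation and checks the stated constraints. The tree representation and the function names (`count_events`, `is_k_dollo`, `is_k_camin_sokal`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_events(parent, vectors):
    """Count gain edges and loss edges for each mutation index."""
    gains, losses = Counter(), Counter()
    for v, u in parent.items():
        if u is None:  # skip the root
            continue
        for i, (a, b) in enumerate(zip(vectors[u], vectors[v])):
            if a == 0 and b == 1:
                gains[i] += 1
            elif a == 1 and b == 0:
                losses[i] += 1
    return gains, losses


def is_k_dollo(parent, vectors, k):
    """Each mutation gained exactly once and lost at most k times."""
    gains, losses = count_events(parent, vectors)
    m = len(next(iter(vectors.values())))
    return all(gains[i] == 1 and losses[i] <= k for i in range(m))


def is_k_camin_sokal(parent, vectors, k):
    """Each mutation gained at most k times and never lost."""
    gains, losses = count_events(parent, vectors)
    m = len(next(iter(vectors.values())))
    return all(gains[i] <= k and losses[i] == 0 for i in range(m))


# Toy tree A: mutation 0 is gained on (r, a) and lost on (b, c).
parent_a = {"r": None, "a": "r", "b": "a", "c": "b"}
vectors_a = {"r": (0, 0), "a": (1, 0), "b": (1, 1), "c": (0, 1)}

# Toy tree B: mutation 0 is gained twice, on (r, a) and on (b, c).
parent_b = {"r": None, "a": "r", "b": "r", "c": "b"}
vectors_b = {"r": (0, 0), "a": (1, 0), "b": (0, 1), "c": (1, 1)}
```

Under these checks, tree A is 1-Dollo but not 0-Dollo (so it violates the ISA), and tree B is 2-Camin-Sokal but not 1-Camin-Sokal.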

Distance Measures on Tumor Phylogenies
A distance measure on tumor phylogenies is a function that takes in two tumor phylogenies $T_1$ and $T_2$ and returns a non-negative real number that indicates how dissimilar they are. The larger the value, the more dissimilar the trees are to each other; the closer the value is to 0, the more similar they are. Note that a distance measure does not need to be a distance metric (i.e., satisfy the triangle inequality, symmetry, etc.), although some distance measures such as CASet and DISC are distance metrics in certain contexts. Most existing distance measures that are designed for tumor phylogenies assume that the input phylogenies adhere to the ISA. In this section, we will build a framework that allows distance measures that assume the ISA to be applied to any tumor phylogeny. We will first show how distance measures that assume the ISA can be applied to 1-Dollo phylogenies. Then, we will generalize that approach.

Distance Measures applied to 1-Dollo Phylogenies. First we will describe a transformation process to turn a 1-Dollo phylogeny into a 0-Dollo phylogeny, which is the same as a tumor phylogeny adhering to the ISA (sometimes also called a perfect phylogeny). Consider a 1-Dollo phylogeny $T$ with $m$ mutations, $\ell$ of which have a single loss. We will show how to convert $T$ into a 0-Dollo phylogeny $T'$ on $m + \ell$ mutations. The only real difference between $T$ and $T'$ is how the mutation vectors that label the vertices in $T'$ are constructed; the phylogenies themselves have the same topology. So, we describe only how to construct the mutation vectors for $T'$. The first $m$ indices in each mutation vector $b'_v$ in $T'$ correspond to the gains of the $m$ characters in $T$. The last $\ell$ indices correspond to losses of these characters (if present), effectively representing each loss state as a new character. We can construct such a phylogeny $T'$ using the following three steps.
(1) For each vertex $v$, directly copy over all entries from $b_v$ to the first $m$ indices in $b'_v$ and set $b'_v(i) = 0$ for the remaining $\ell$ indices. This means that all mutation gains are encoded in the same way as in the original phylogenies.
(2) Iterate through all loss edges in $T$. For the $j$-th loss edge considered, $(u, v)$ in $T$, where the loss edge corresponds to character $i$, set $b'_v(i) = 1$ and $b'_v(m + j) = 1$. So, instead of encoding a loss as a change of a single mutation from present to absent, we encode it as the gain of a new 'loss' mutation while keeping the original mutation present as well.
(3) For any vertex $w$ such that $v$ is an ancestor of $w$ in $T$, also make the following updates: $b'_w(i) = 1$ and $b'_w(m + j) = 1$. This procedure allows the original mutation and the newly created 'loss' mutation to be inherited by descendant populations.
With this transformation, we can now describe a process for applying distance measures designed for ISA phylogenies to 1-Dollo phylogenies. Given two 1-Dollo phylogenies $T_1$ and $T_2$, convert them into 0-Dollo phylogenies $T'_1$ and $T'_2$ using the transformation described above. Now, any distance measure that assumes the ISA can be applied directly to $T'_1$ and $T'_2$, as these encode all the information from the original phylogenies.
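The three-step transformation can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the tree is given as a `parent` dictionary and a `vectors` dictionary of 0/1 tuples, loss edges are visited in sorted vertex order, and indices are 0-based, so the $j$-th loss edge uses column `m + j` with `j` starting at 0.

```python
def one_dollo_to_isa(parent, vectors):
    """Re-encode a 1-Dollo tree so that each loss becomes a new gained character."""
    m = len(next(iter(vectors.values())))

    # Child lists, needed to push updates down to descendants (step 3).
    children = {}
    for v, u in parent.items():
        if u is not None:
            children.setdefault(u, []).append(v)

    # Loss edges (u, v, i): mutation i is present in u and absent in v.
    loss_edges = [
        (parent[v], v, i)
        for v in sorted(parent)
        if parent[v] is not None
        for i in range(m)
        if vectors[parent[v]][i] == 1 and vectors[v][i] == 0
    ]

    # Step 1: copy the original entries and append one zero column per loss.
    new = {v: list(vec) + [0] * len(loss_edges) for v, vec in vectors.items()}

    # Steps 2 and 3: below each loss edge, keep the original mutation set to 1
    # and switch on the new 'loss' character for v and all of its descendants.
    for j, (_, v, i) in enumerate(loss_edges):
        stack = [v]
        while stack:
            w = stack.pop()
            new[w][i] = 1
            new[w][m + j] = 1
            stack.extend(children.get(w, []))

    return {v: tuple(vec) for v, vec in new.items()}


# Toy 1-Dollo tree: mutation 0 is gained on (r, a) and lost on (b, c).
parent = {"r": None, "a": "r", "b": "a", "c": "b", "d": "c"}
vectors = {"r": (0, 0), "a": (1, 0), "b": (1, 1), "c": (0, 1), "d": (0, 1)}
isa_vectors = one_dollo_to_isa(parent, vectors)
```

In the output every column is gained once and never lost, and the topology is unchanged, so an ISA distance measure can be applied to the re-encoded tree.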

Distance Measures applied to Generalized Tumor Phylogenies. We now describe how the transformation approach described above for 1-Dollo phylogenies can be generalized to any tumor phylogeny, and especially to $k$-Dollo phylogenies or $k$-Camin-Sokal phylogenies. While for 1-Dollo phylogenies there was exactly one gain and at most one loss for each mutation, we may now have multiple gains and losses for each mutation. To apply the same transformation, we need to match losses and gains of the same mutations between tumor phylogenies $T_1$ and $T_2$. Mathematically, we may view such a pairing as a matching of a bipartite graph $G(T_1, T_2)$, whose vertices and edges encode allowed matches between gains and losses of mutations. To formalize this, we start by defining the matching graph $G(T_1, T_2)$ obtained from $T_1$ and $T_2$.
Definition 2.4. The matching graph of two tumor phylogenies $T_1$ and $T_2$ is a bipartite graph $G(T_1, T_2) = (U \cup V, E)$ whose vertices $U$ ($V$) correspond to gains and losses of mutations in $T_1$ ($T_2$), and whose edge set $E$ is composed of edges $(u, v)$ such that $u$ and $v$ correspond to either two gains or two losses of the same mutation, one in each tree.
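A minimal sketch of building such a matching graph is given below. The event encoding `(kind, mutation, edge)`, the tree representation, and the function names are assumptions made for this illustration only.

```python
def tree_events(parent, vectors):
    """List every gain and loss in a tree as (kind, mutation index, edge)."""
    out = []
    for v in sorted(parent):
        u = parent[v]
        if u is None:  # skip the root
            continue
        for i, (a, b) in enumerate(zip(vectors[u], vectors[v])):
            if a == 0 and b == 1:
                out.append(("gain", i, (u, v)))
            elif a == 1 and b == 0:
                out.append(("loss", i, (u, v)))
    return out


def matching_graph(tree1, tree2):
    """Return (U, V, E): events of each tree and the allowed pairings."""
    U = tree_events(*tree1)
    V = tree_events(*tree2)
    # An edge joins two gains or two losses of the same mutation.
    E = [(x, y) for x in U for y in V if x[:2] == y[:2]]
    return U, V, E


# Toy trees over a single mutation (index 0).
# T1 gains it once and loses it on two branches; T2 gains and loses it once.
t1 = ({"r": None, "a": "r", "b": "a", "c": "a"},
      {"r": (0,), "a": (1,), "b": (0,), "c": (0,)})
t2 = ({"r": None, "x": "r", "y": "x"},
      {"r": (0,), "x": (1,), "y": (0,)})

U, V, E = matching_graph(t1, t2)
```

Here the gain in `t1` can only be paired with the gain in `t2`, while either loss in `t1` can be paired with the single loss in `t2`, giving three edges in total.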
Recall that a matching $M$ in a bipartite graph is a subset of edges such that no two edges in $M$ are incident to the same vertex. Intuitively, a matching of the matching graph $G(T_1, T_2)$ of tumor phylogenies $T_1$ and $T_2$ describes how to match the gains and losses of the same mutation between the two phylogenies. Given a matching $M = \{(u_1, v_1), \ldots$ [...] In $b'_1$ and $b'_2$ mutation gains are largely filled out in the same way as in the 0-Dollo case. That is, a 1 entry at index $i$ indicates the gain of mutation $i$, and this entry persists in the mutation vector of all descendant vertices. The one difference is that a new mutation vector index exists for each novel gain of a single mutation (e.g., SNV/CNA). Losses are encoded by introducing a new mutation index in the mutation vector instead of having the original mutation revert from a 1 to a 0. Figure 1 shows a complete example for two 3-Dollo phylogenies including their matching graph and the resulting 0-Dollo phylogenies for the specified matching.
[Figure 1 legend: matched losses; unmatched losses.]
We can now apply any distance measure dist designed for 0-Dollo (ISA) phylogenies to $T'_1$ and $T'_2$. Thus, a matching $M$ of $G(T_1, T_2)$ [...]; and (4) Return the matching $M^*$ that produces the smallest such distance. Note that while the number of matchings that need to be checked has a factorial growth rate, in practice we expect the number of matchings to often remain relatively small.
To enumerate all maximum cardinality matchings we first observe that all connected components in $G(T_1, T_2)$ consist of vertices labeled entirely by gains or losses of a single mutation. Therefore, if we can describe how to enumerate all maximum cardinality matchings for a single connected component, then that approach can be generalized to all maximum cardinality matchings across the whole graph by combining matchings for all connected components. Furthermore, each such connected component is always a complete bipartite subgraph of $G(T_1, T_2)$ (containing all possible edges between the two sets of vertices). Consider the connected sub-graph [...] 1:1, 1:2), which contains all 6 possible edges between these vertices. A maximum cardinality matching in this component has size 2 and there are 6 such matchings. Algorithms such as [26] exist for enumerating maximum cardinality matchings in bipartite graphs. Furthermore, the fact that each component is a complete bipartite graph, with vertex sets $U_C$ and $V_C$ where we take $|U_C| \le |V_C|$ without loss of generality, makes enumerating these matchings an easy two-step process: (1) Choose all sets of $|U_C|$ vertices from the set of vertices $V_C$; (2) Consider all possible ways of connecting these vertices to the vertices in $U_C$.
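The two-step enumeration for a complete bipartite component, and its combination across components, can be sketched with the standard library. The function names and the vertex labels are hypothetical; the sketch assumes each component is given as a pair of vertex lists, one per tree.

```python
from itertools import combinations, permutations, product


def component_matchings(left, right):
    """All maximum cardinality matchings of a complete bipartite component."""
    if len(left) > len(right):
        # Enumerate from the smaller side, then restore (left, right) order.
        return [[(l, r) for r, l in m]
                for m in component_matchings(right, left)]
    out = []
    # Step 1: choose which vertices on the larger side are used.
    for subset in combinations(right, len(left)):
        # Step 2: try every way of connecting them to the smaller side.
        for perm in permutations(subset):
            out.append(list(zip(left, perm)))
    return out


def all_matchings(components):
    """Combine one matching per component into matchings of the whole graph."""
    per_component = [component_matchings(l, r) for l, r in components]
    return [[pair for part in combo for pair in part]
            for combo in product(*per_component)]


# A component with 2 vertices on one side and 3 on the other.
example = component_matchings(["u1", "u2"], ["v1", "v2", "v3"])
whole = all_matchings([(["g1"], ["h1"]), (["u1", "u2"], ["v1", "v2", "v3"])])
```

For the 2-by-3 component this yields 6 matchings of size 2, in line with the counts stated in the text for a component with 6 edges.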
In cases where the matching graph $G(T_1, T_2)$ has relatively few maximum cardinality matchings (when there are few mutations with relatively small numbers of multiple gains or losses), it can be computationally feasible to simply check all such matchings. We will refer to this as the enumerative approach to solving the GMD. Alternatively, we can take a heuristic approach to quickly pick a good maximum cardinality matching, which we describe in the following section.
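To gauge when the enumerative approach is feasible, it helps to count the matchings first. For a complete bipartite component with $a \le b$ vertices on its two sides there are $b!/(b-a)!$ maximum cardinality matchings, and the counts multiply across components. The sketch below computes this count and shows the enumerative search as a generic loop; `transform` and `dist` are hypothetical callables standing in for the 0-Dollo transformation and the chosen ISA distance measure.

```python
from math import perm, prod


def num_max_matchings(component_sizes):
    """Number of maximum cardinality matchings given (|U_C|, |V_C|) per component."""
    return prod(perm(max(a, b), min(a, b)) for a, b in component_sizes)


def enumerative_gmd(matchings, transform, dist):
    """Return (smallest distance, matching) over all candidate matchings."""
    best = None
    for matching in matchings:
        t1_isa, t2_isa = transform(matching)  # re-encode both trees
        d = dist(t1_isa, t2_isa)
        if best is None or d < best[0]:
            best = (d, matching)
    return best


# Toy stand-ins: three candidate matchings with made-up distances.
toy_distances = {"M1": 0.4, "M2": 0.1, "M3": 0.3}
result = enumerative_gmd(
    ["M1", "M2", "M3"],
    transform=lambda m: (m, None),
    dist=lambda a, b: toy_distances[a],
)
```

A single mutation with 2 events in one tree and 3 in the other gives 6 matchings, but adding a second mutation with 3 events in each tree already raises the count to 36, which is why the count is worth checking before enumerating.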

2.3.4 Minimum-weight matchings: a heuristic approach. In the GMD problem, the matching $M$ on the matching graph $G(T_1, T_2)$ is the component that identifies how to map gains and losses in $T_1$ to those in $T_2$. The optimal such mapping will depend on the distance measure dist being used. In place of an enumerative, exact solution that may be computationally expensive, we propose a heuristic approach. Specifically, we propose the following procedure: (1) Assign weights $w(u, v)$ to edges $(u, v)$ in $G(T_1, T_2)$ based on features in both $T_1$ and $T_2$; (2) Find a minimum-weight, maximum cardinality matching using the Hungarian algorithm [17]; (3) Use this matching to perform the transformation to 0-Dollo phylogenies, rather than exhaustively checking all possible matchings. The details of the exact distance measure dist used may also be helpful for picking a useful weighting for the edges in the matching graph. In particular, we explore the following three weighting schemes.
depth: the weight $w(u, v)$ of an edge $(u, v)$ in $G(T_1, T_2)$ is set equal to the absolute value of the difference between the depth of $u$ in $T_1$ and the depth of $v$ in $T_2$.
parent: the weight $w(u, v)$ of an edge $(u, v)$ in $G(T_1, T_2)$ is set equal to 0 if $u$ and $v$ share the same parent mutation(s), and 1 otherwise.
lineage: the weight $w(u, v)$ of an edge $(u, v)$ in $G(T_1, T_2)$ is set equal to the cardinality of the symmetric difference between the lineage sets of $u$ and $v$. We define the lineage set of a mutation as the union of all of its ancestor mutations and all of its descendant mutations. Figure 2 provides a visual for understanding these weighting schemes.
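The three weighting schemes can be sketched as small functions over per-event features. In this illustration each event is a dictionary with hypothetical keys `depth`, `parents`, and `lineage`, whose values are assumed to have been extracted from the trees beforehand. To keep the example dependency-free, the minimum-weight matching is found by brute force over a single complete bipartite component; this is a stand-in for the Hungarian algorithm named in the text, not an implementation of it.

```python
from itertools import permutations


def depth_weight(u, v):
    """Absolute difference between the depths of the two events."""
    return abs(u["depth"] - v["depth"])


def parent_weight(u, v):
    """0 if both events share the same parent mutation(s), 1 otherwise."""
    return 0 if u["parents"] == v["parents"] else 1


def lineage_weight(u, v):
    """Size of the symmetric difference of the two lineage sets."""
    return len(u["lineage"] ^ v["lineage"])


def min_weight_matching(left, right, weight):
    """Brute-force minimum-weight maximum cardinality matching.

    Returns (cost, pairs) where pairs holds (left index, right index).
    """
    swap = len(left) > len(right)
    small, large = (right, left) if swap else (left, right)
    best = None
    for chosen in permutations(range(len(large)), len(small)):
        pairs = [(j, i) if swap else (i, j) for i, j in enumerate(chosen)]
        cost = sum(weight(left[a], right[b]) for a, b in pairs)
        if best is None or cost < best[0]:
            best = (cost, pairs)
    return best


# Made-up features for two losses of one mutation in each tree.
t1_losses = [
    {"depth": 2, "parents": {"B"}, "lineage": {"B", "C"}},
    {"depth": 4, "parents": {"D"}, "lineage": {"B", "D"}},
]
t2_losses = [
    {"depth": 4, "parents": {"D"}, "lineage": {"B", "D"}},
    {"depth": 1, "parents": {"B"}, "lineage": {"B"}},
]

by_depth = min_weight_matching(t1_losses, t2_losses, depth_weight)
by_parent = min_weight_matching(t1_losses, t2_losses, parent_weight)
by_lineage = min_weight_matching(t1_losses, t2_losses, lineage_weight)
```

In this toy case all three schemes select the same pairing, matching the first loss in the first tree with the second loss in the second tree. For larger components, a polynomial-time assignment solver would replace the brute-force search.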

RESULTS
We apply and analyze our proposed approaches on both simulated and real data.

Results on Simulated Data
On simulated data sets we evaluate several aspects of both the enumerative and heuristic approaches to solving the GMD. Specifically, we evaluate: (1) whether brute-force GMD enables ISA distance measures to appropriately penalize differences between phylogenies with parallel mutations; and (2) how well the proposed heuristic approach for the GMD works when applied with different edge weighting schemes and distance measures. This analysis includes both parallel mutations and mutation losses.
3.1.1 Data Simulation. Our general simulation procedure for each of the experiments described below was to use a recursive approach to enumerate all trees in a specified space (e.g., $k$-Dollo trees with 8 mutations and 3 losses of one of those mutations) and then to randomly sample a subset of these trees to use in each experiment. We note that such trees are inherently biologically feasible as they do not allow a mutation to be lost before it is gained. The specific details for simulating data for each experiment are outlined in the corresponding sections below.
3.1.2 Examining the Effects of Parallel Mutations. Similar to Ciccolella et al. [5], we wanted to investigate the effect on the distance evaluation between clonal phylogenies as the number of parallel mutations increases. We would expect that a phylogeny with more occurrences of a mutation, when compared with a phylogeny with fewer occurrences of that same mutation, should be deemed further away than two phylogenies with more similar numbers of mutation occurrences. To set up an experiment to support this investigation, we first created 4 data sets containing all possible mutation phylogenies with 8 mutations and 1, 2, 3, or 4 gains of mutation 'A'. Note, these are a specific subset of $k$-Camin-Sokal phylogenies. We then created a base set of trees (called Group 1) by randomly sampling 50 trees from the data set containing 1 gain of mutation 'A'. We also created 4 test sets (Groups 2-5) by randomly sampling 50 trees from each of the sets with 1-4 gains of mutation 'A'. We converted each group of trees from mutation phylogenies to clonal phylogenies by randomly selecting 2 pairs of connected nodes in each mutation phylogeny to collapse, simulating mutations whose order cannot be ascertained. We do this collapsing since real data is much more likely to be a clonal phylogeny, with gains or losses of different mutations grouped on single vertices, than the idealized mutation phylogeny where each new gain or loss appears on its own vertex. We then conducted pairwise comparisons between the clonal phylogenies in Group 1 and those in Groups 2-5. Figure 3 shows the average distance between each test set and the base set for various distance measures.
All methods that included the enumerative GMD transformation plus an ISA-dependent distance measure (CASet [6], DISC [6] and MLTED [14]) show the desired property of monotonically increasing distances as the number of parallel occurrences of mutation 'A' increased. The MP3 method [5], which was designed to handle parallel mutations, also shows the same monotonically increasing property. While neither CASet nor DISC was designed to handle parallel mutations, neither program throws an error when run with such data (without the GMD transformation). Specifically, the implementations of these methods use set data structures rather than multisets for mutations, and thus only one mutation gain is considered whenever parallel mutations are present within the phylogeny. This explains the relatively flat slope for these results. However, we intentionally include these results here to better demonstrate the exact impact of our approach. We also note that when the phylogenies being compared do not have identical sets of mutations, both CASet and DISC have the option to use either the union or the intersection of those sets. We only include here results using the union option because the intersection option effectively removes all signal from the parallel mutations, as only a single gain of that mutation can be included in the intersection set. As a result, CASet and DISC when applied without GMD behave almost identically for both union and intersection. MLTED, without GMD, returns an error on phylogenies with parallel mutations.
When Ciccolella et al. [5] proposed MP3 (and performed an experiment similar to this one), they suggested that it is better to have a steeper curve when comparing phylogenies with differing numbers of occurrences of a mutation. Following this, CASet (union) + GMD, DISC (union) + GMD, and MLTED + GMD all have a steeper curve than MP3 and therefore penalize differences across mutation sets more than MP3. Among these distance measures, MLTED + GMD has the steepest curve with a total change in distance of 0.15.

Figure 3: Effect of parallel mutations on distances between clonal trees (x-axis: occurrences of 'A'; y-axis: distance to base set). The results of an experiment to evaluate how various distance measures with and without enumerative GMD vary in their evaluation of clonal phylogenies. Specifically, the evaluated phylogenies are a subset of $k$-Camin-Sokal phylogenies where the only mutation duplicated is mutation 'A' and $k$ is 1, 2, 3, and 4. We evaluate distances using MP3, CASet (union) + GMD, CASet (union), DISC (union) + GMD, MLTED + GMD, and DISC (union).

Performance of Heuristic Approach.
We also analyzed the performance of our proposed heuristic approach in approximating solutions to the GMD Problem. Specifically, we explored which weighting schemes (depth, lineage, parent) paired best with different ISA distance measures (parent-child (PC) [10], ancestor-descendant (AD) [10], MLTED [14], CASet [6], DISC [6]). We first created a data set containing all possible mutation phylogenies with 5 mutations and 2 losses of mutation 'A'. Note, these are a specific subset of $k$-Dollo phylogenies. We then randomly sampled 100 mutation phylogenies from the data set and ran pairwise comparisons between these 100 mutation phylogenies using all combinations of weighting schemes (for our heuristic approach) and different ISA distance measures. We then compared these results to those of the enumerative approach for solving the GMD problem for all distance measures. This allows us to capture what fraction of trials optimally solve the GMD problem for each combination of weighting scheme and ISA distance measure. We also performed this same experiment after using the collapsing approach described in the previous section to turn all mutation phylogenies into clonal phylogenies. Figure 4a shows the complete results from this experiment. We note we also performed this same experiment with phylogenies with 8 mutations and 2 losses of a single mutation and found the results to be similar.
Figure 4: The results of experiments to evaluate various factors that could affect the performance of our heuristic approach at approximating solutions to the GMD Problem. We paired various ISA distance measures (CASet, DISC, PC, AD, and MLTED) with our proposed weighting schemes (depth, lineage, parent). a) Fraction of trials where the heuristic approach achieved the optimal solution on phylogenies with 5 mutations and 2 losses of mutation 'A'. Results are shown for both mutation phylogenies and clonal phylogenies that resulted from collapsing nodes in the mutation phylogenies. Random indicates the probability of selecting an optimal matching randomly from all possible matchings. b) Fraction of trials where the heuristic approach achieved the optimal solution for a data set with 8 mutations and 2 gains of mutation 'A'. Results are shown for both mutation phylogenies and clonal phylogenies obtained by collapsing nodes in the mutation phylogenies.
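The evaluation quantities used in this experiment can be written down compactly. The sketch below is an assumption about how such a tally might be computed, not the authors' evaluation script: `fraction_optimal` compares per-trial heuristic distances with the enumerative optimum, and `random_baseline` gives the chance that a uniformly chosen matching is optimal for one trial. The tolerance `tol` is a hypothetical parameter for floating-point comparison.

```python
from math import isclose


def fraction_optimal(heuristic, optimal, tol=1e-9):
    """Fraction of trials where the heuristic distance equals the optimum."""
    hits = sum(isclose(h, o, abs_tol=tol) for h, o in zip(heuristic, optimal))
    return hits / len(optimal)


def random_baseline(distances, tol=1e-9):
    """Probability that a uniformly random matching attains the minimum."""
    best = min(distances)
    hits = sum(isclose(d, best, abs_tol=tol) for d in distances)
    return hits / len(distances)


# Made-up numbers: four trials, and one trial with four candidate matchings.
frac = fraction_optimal([0.1, 0.3, 0.2, 0.5], [0.1, 0.2, 0.2, 0.5])
base = random_baseline([0.2, 0.1, 0.1, 0.4])
```

With these made-up values the heuristic is optimal in three of four trials, and two of the four candidate matchings tie for the minimum distance.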
Across all distance measures and heuristic weighting schemes there is a decrease in performance when applied to clonal phylogenies in comparison to the corresponding mutation phylogenies. This could be because the clonal phylogenies were shorter, providing less information to compute the weighting schemes, especially in the case of the  weighting scheme. For mutation phylogenies, the heuristic approach is almost always optimal when using PC distance with the parent weighting scheme. Specifically, for 5 mutations and 2 losses of 'A', it finds the optimal solution 99.32% of the time. AD distance with the parent weighting scheme and MLTED distance with the parent weighting scheme also perform well, identifying an optimal solution in 94.30% and 92.40% of trials, respectively. While intuitively simple, the parent weighting scheme performed the best across all the distance measures for both clonal and mutation phylogenies, except for MLTED, for which lineage achieved better performance on clonal phylogenies. All pairings of ISA distance measures and weighting schemes performed better than randomly selecting any maximum cardinality matching.
We also performed a similar analysis for phylogenies with 5, 6, 7 and 8 mutations and 2 gains of mutation 'A' (a subset of $k$-Camin-Sokal phylogenies). In this case we used only the distance measures that performed the best in the previous experiment, parent-child and ancestor-descendant. We saw little difference in the results for the different numbers of unique mutations in each tree, so Figure 4b shows only the results for 8 mutations with 2 gains of mutation 'A'. Despite the fact that the number of unique mutations was increased and that the phylogenies in this experiment contained multiple gains rather than multiple losses, many of the patterns we saw in the previous experiment persist. PC distance and the parent weighting scheme still perform the best, with the combination of these obtaining an optimal solution in 98.10% of trials on mutation phylogenies and 79.89% of trials on clonal phylogenies. Similar to the previous experiment, we also see a performance decrease between mutation phylogenies and clonal phylogenies.

Results on Real Data
We also compare three phylogenies that were inferred for a colorectal cancer patient, CRC2, from Leung et al. [19]. The original data set consists of targeted single-cell sequencing of 182 cells from a primary colon tumor and a liver metastasis, with a 1000-cancer gene panel used as the target region for sequencing. The phylogenies were inferred by three different methods in three separate papers: SCARLET [24], SiCloneFit [27], and FiMO [1] (a preprint). The phylogenies and computed distances for CASet (union) + GMD, DISC (union) + GMD, MLTED + GMD and MP3 (as these were the only distance measures to accurately penalize increasing differences in mutation occurrences in the simulated experiments) are shown in Figure 5. We ran the enumerative variant of GMD as the number of gains/losses was small, and it finished virtually instantaneously. Specifically, we ran the code on a Dell PowerEdge R540 server with 28 cores and 384 GB of RAM. Note, the SCARLET tree is the only one containing back mutations (losses), whereas the SiCloneFit and FiMO phylogenies contain parallel mutations. Furthermore, the SCARLET tree explicitly orders many mutations that the other methods simply group together. Thus, we may reasonably expect the SCARLET tree to be more dissimilar to the other two phylogenies for most distances. The results of CASet (union) + GMD, DISC (union) + GMD, MLTED + GMD and MP3 agree with that assumption. We note that the MP3 analysis is more extreme in both the similarity of the SiCloneFit and FiMO phylogenies and the dissimilarity of the SCARLET tree to the others. This resembles the situation in Figure 3, with MP3 yielding a larger dissimilarity in its evaluation than the other distance measures. On the other hand, while MLTED + GMD still evaluates the SiCloneFit and FiMO trees as most similar, it does not identify the SCARLET tree to be as dissimilar from them as the other methods do. This is consistent with the intended behavior of MLTED, which is different from that of the other distance measures. Specifically, MLTED was designed to evaluate phylogenies at different resolutions that could represent the same underlying tumor evolutionary history as similar. Thus, the expanded nature of many mutations in the SCARLET tree should contribute less to the total distance to the other two phylogenies when using the MLTED distance, which is exactly what we see. Our results here therefore suggest that the GMD approach maintains the desired properties of the distance measures that it is used with.

CONCLUSION AND FUTURE WORK
There are many existing distance measures to compare tumor phylogenies that abide by the Infinite Sites Assumption (ISA). However, the field of tumor phylogenomics is gradually transitioning to models of tumor evolution beyond the ISA, such as the k-Dollo and Camin-Sokal models, in order to better represent the realities of tumor evolution. In order to leverage already existing ISA tumor distance measures to evaluate tumor phylogenies inferred under these more relaxed models, we propose the Generalized Matching Distance (GMD) Problem. We provide an enumerative approach to solving the GMD (which is often very practical to use), and also propose a heuristic that utilizes various weighting schemes (depth, lineage, parent) to identify a single matching that is likely to produce a good result. We have shown that without GMD, some existing ISA distance measures are incapable of correctly penalizing differences in occurrences of parallel mutations; in some cases, these distance measures are even unusable without first using GMD. We also showed that the heuristic approach we proposed performs well in some restricted cases of tumor phylogenies; parent-child distance combined with the parent weighting scheme even generally produces an optimal solution to the GMD in the case of small mutation phylogenies. Finally, we applied our GMD approach along with several ISA distance measures to a real colorectal cancer dataset. We found that our approach allowed the distance measures to retain their originally designed properties, such as the ability of the MLTED distance measure to consider phylogenies at different resolutions as similar if they could represent the same evolutionary history.
Yet, there is much future work that could be done. For one, more weighting schemes can be developed in addition to the three that were provided in this paper (depth, lineage, parent). Specifically, while parent maps quite well to parent-child distance and some intuition guides the pairing of lineage and ancestor-descendant distance, we have not extensively explored weights that would work especially well for MLTED, which relies on edit distance rather than sets of mutations. In addition, experiments on larger phylogenies containing greater complexity, such as the gain and loss of multiple different mutations, could provide more comprehensive information on the optimality of the heuristic approach. Finally, more extensive comparison between all the distance measures and their GMD extensions could help researchers in the field better determine which distance measures to use for their use case.
be the subsets of unmatched vertices (mutations) of $G(T_1, T_2)$, where one subset contains the unmatched mutations from $T_1$ and the other the unmatched mutations from $T_2$. From the matching $M$ and these two unmatched subsets, we obtain the corresponding 0-Dollo (ISA) phylogenies $T'_1$ and $T'_2$ as follows. Similar to the 1-Dollo case, the topologies of the transformed phylogenies are identical, so we need only describe how the mutation vectors for $T'_1$ and $T'_2$ are created. Each vertex in $T'_1$ and $T'_2$ is labeled by a mutation vector, $\mathbf{b}'_1$ and $\mathbf{b}'_2$ respectively, whose size is $|M|$ plus the sizes of the two unmatched subsets. The first $|M|$ indices correspond to matched gains/losses of mutations between $T_1$ and $T_2$, followed by the indices corresponding to unmatched gains/losses of mutations in $T_1$ and then the indices corresponding to unmatched gains/losses in $T_2$.
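The index layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `index_layout`, the representation of a matching as a list of event pairs, and the event labels are all hypothetical and are not taken from the GMD code base.

```python
def index_layout(matching, unmatched_1, unmatched_2):
    """Assign a vector index to every gain/loss event of both phylogenies.

    matching    -- list of (event_in_T1, event_in_T2) pairs (hypothetical labels)
    unmatched_1 -- events of T1 that are not covered by the matching
    unmatched_2 -- events of T2 that are not covered by the matching
    Returns (index map for T1, index map for T2, vector size).
    """
    idx1, idx2 = {}, {}
    # Matched events share one index, taken from the first |M| positions.
    for i, (e1, e2) in enumerate(matching):
        idx1[e1] = i
        idx2[e2] = i
    # Unmatched events of T1 come next.
    offset = len(matching)
    for j, e1 in enumerate(unmatched_1):
        idx1[e1] = offset + j
    # Unmatched events of T2 take the final positions.
    offset += len(unmatched_1)
    for j, e2 in enumerate(unmatched_2):
        idx2[e2] = offset + j
    size = offset + len(unmatched_2)
    return idx1, idx2, size


# Toy usage: one matched pair, one unmatched event in T1, two in T2.
idx1, idx2, size = index_layout([("a1", "x1")], ["a2"], ["x2", "x3"])
print(idx1, idx2, size)
```

Because matched events share an index, a matched gain or loss looks like the same mutation in both transformed phylogenies, while every unmatched event becomes a mutation that is private to one of them.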

Figure 1: Phylogenies $T_1$ and $T_2$ (left) are both 3-Dollo phylogenies. Loss edges are indicated by a dashed line and gain edges by a solid line. The corresponding matching graph $G(T_1, T_2)$ is also shown (center), with node colors that match the respective gain/loss edges in the original phylogenies. The matching indicated by the dark purple edges in $G$ can be used to transform $T_1$ and $T_2$ into $T'_1$ and $T'_2$, two 0-Dollo phylogenies (right), by introducing a new index into their corresponding mutation vectors $\mathbf{b}'_1$ and $\mathbf{b}'_2$ for every time a mutation is lost in either phylogeny. Each loss is now encoded as a gain of the mutation at the newly added index.

Figure 2: A toy example comparing two phylogenies to demonstrate the weighting schemes used as part of the heuristic approach. (Left) Two mutation phylogenies $T_1$ and $T_2$, with $T_2$ containing parallel mutations of A. (Right) A small portion of the matching graph, specifically the portion of the graph that matches mutation A, is shown. The parent weighting is shown in blue, depth in green, and lineage in yellow. Brief descriptions of the calculation of the weighting schemes are attached to the matching graph.

Figure 4: The results of experiments to evaluate various factors that could affect the performance of our heuristic approach at approximating solutions to the GMD Problem. We paired various ISA distance measures (CASet, DISC, PC, AD, and MLTED) with our proposed weighting schemes (depth, lineage, parent). a) Fraction of trials where the heuristic approach achieved the optimal solution on phylogenies with 5 mutations and 2 losses of mutation 'A'. Results are shown for both mutation phylogenies and clonal phylogenies that resulted from collapsing nodes in the mutation phylogenies. Random indicates the probability of selecting an optimal matching randomly from all possible matchings. b) Fraction of trials where the heuristic approach achieved the optimal solution for a data set with 8 mutations and 2 gains of mutation 'A'. Results are shown for both mutation phylogenies and clonal phylogenies obtained by collapsing nodes in the mutation phylogenies.

Figure 5: Pairwise comparisons between three phylogenies representing a patient with colorectal cancer. (Left) $T_1$ is the tree inferred by SCARLET, containing losses (indicated in red). $T_2$ is the tree inferred by SiCloneFit, containing parallel mutations (indicated in purple). $T_3$ was inferred by FiMO, also containing parallel mutations. (Right) Pairwise distance comparisons between the phylogenies on the left using CASet (union) + GMD, DISC (union) + GMD, MLTED + GMD, and MP3.
..., 0].
(3) Any two vertices connected by an edge must have binary mutation vectors that differ in at least one place. That is, if $(u, v)$ is an edge in $T$, then there must exist some mutation index $i$ such that $b_u(i) \neq b_v(i)$.
(4) All children of any vertex must have unique mutation vectors. That is, if vertices $u$ and $v$ are siblings, then $\mathbf{b}_u \neq \mathbf{b}_v$.
(5) All mutations appear at least once in $T$. That is, for every mutation index $i$ there exists some vertex $v$ where $b_v(i) = 1$.
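Conditions (3) through (5) are easy to check mechanically. The following is a minimal sketch, assuming a tree stored as a dictionary from each vertex to its list of children and mutation vectors stored as tuples of 0/1 values; the function name `check_conditions` and this representation are illustrative choices, not part of the original method.

```python
def check_conditions(children, vectors):
    """Check conditions (3), (4) and (5) for a tree with binary mutation vectors.

    children -- dict mapping each vertex to a list of its child vertices
    vectors  -- dict mapping each vertex to a tuple of 0/1 entries
    Returns a tuple of three booleans, one per condition.
    """
    # (3) The vectors at the two ends of every edge differ in at least one place.
    cond3 = all(
        vectors[u] != vectors[v]
        for u, kids in children.items()
        for v in kids
    )
    # (4) Siblings carry pairwise distinct mutation vectors.
    cond4 = all(
        len({vectors[v] for v in kids}) == len(kids)
        for kids in children.values()
    )
    # (5) Every mutation index is set to 1 at some vertex.
    length = len(next(iter(vectors.values())))
    cond5 = all(
        any(vec[i] == 1 for vec in vectors.values())
        for i in range(length)
    )
    return cond3, cond4, cond5


# Toy usage: a root with two children, each gaining a different mutation.
children = {"r": ["a", "b"], "a": [], "b": []}
vectors = {"r": (0, 0), "a": (1, 0), "b": (0, 1)}
print(check_conditions(children, vectors))
```

In the toy tree all three conditions hold; giving both children the vector `(1, 0)` would violate (4), since the siblings become identical, and (5), since the second mutation would never appear.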
$T_1, T_2)$, the matching graph for the original phylogenies, induces a distance of $\mathrm{dist}(T'_1, T'_2)$ on the transformed phylogenies. This leads to the following optimization problem.
Problem 2.1. Generalized Matching Distance (GMD) Problem: Given tumor phylogenies $T_1$ and $T_2$ and a distance measure dist under the ISA, find a maximum cardinality matching $M$ of the matching graph $G(T_1, T_2)$ such that the resulting 0-Dollo phylogenies
An Exact Algorithm to Solve the GMD. The structure of the matching graph $G(T_1, T_2)$ makes finding the exact solution to the GMD fairly straightforward, although potentially computationally costly. We propose the following method:
(1) Create the matching graph $G(T_1, T_2)$;
(2) Enumerate all maximum cardinality matchings $M$ in the graph;
(3) For each matching, compute the transformed phylogenies $T'_1$ and $T'_2$ and compute $\mathrm{dist}(T'_1, T'_2)$
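The enumerative procedure can be sketched as follows. This is a brute-force illustration rather than the released implementation, and it rests on several assumptions: the matching graph is given as a list of edges between hypothetical event labels, `transform` and `dist` are caller-supplied stand-ins for the phylogeny transformation and the ISA distance measure, and the best matching is taken to be the one with the smallest induced distance.

```python
from itertools import combinations


def maximum_matchings(edges):
    """Enumerate all maximum cardinality matchings of a small graph by brute force."""
    edges = list(edges)
    # Try the largest edge subsets first; the first size that yields any
    # valid matching is the maximum cardinality.
    for size in range(len(edges), 0, -1):
        found = []
        for subset in combinations(edges, size):
            endpoints = [v for edge in subset for v in edge]
            # A subset is a matching if no vertex is used by two edges.
            if len(endpoints) == len(set(endpoints)):
                found.append(subset)
        if found:
            return found
    return [()]  # a graph without edges has only the empty matching


def solve_gmd_exact(edges, transform, dist):
    """Return (distance, matching) for the matching with the smallest distance.

    transform(matching) is assumed to return the two transformed phylogenies,
    and dist(t1, t2) is assumed to be an ISA distance measure.
    """
    best = None
    for matching in maximum_matchings(edges):
        t1_prime, t2_prime = transform(matching)
        d = dist(t1_prime, t2_prime)
        if best is None or d < best[0]:
            best = (d, matching)
    return best


# Toy usage: mutation A occurs twice in each phylogeny, so the matching
# graph restricted to A is a complete bipartite graph on 2 + 2 events.
edges = [("T1:A1", "T2:A1"), ("T1:A1", "T2:A2"),
         ("T1:A2", "T2:A1"), ("T1:A2", "T2:A2")]
# Stand-ins only: a made-up per-edge cost replaces a real transformation and distance.
cost = dict(zip(edges, [1.0, 0.0, 0.5, 1.0]))
best = solve_gmd_exact(edges,
                       transform=lambda m: (m, None),
                       dist=lambda m, _unused: sum(cost[e] for e in m))
print(best)
```

The subset enumeration grows exponentially with the number of edges, so this sketch is only sensible for the small numbers of repeated gains and losses for which an enumerative approach is practical.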