Phylogenomics using Compression Distances: Incorporating Rate Heterogeneity and Amino Acid Properties

Efforts to reconstruct the tree of life have a long history, but the field has changed fundamentally in the genomic era. Phylogenomics examines evolutionary relationships using very large datasets, so a major problem in the field is the development of unbiased computational methods for tree inference. Sources of bias include sequence alignment errors, discordance among gene trees, and long branch attraction. Distances based on data compression can address sequence alignment errors and analyses of distances may be robust to a major source of discordance among gene trees (incomplete lineage sorting). However, compression distances appear to be susceptible to long branch attraction. This study tested the hypothesis that compression distances can be modified to be more resistant to long branch attraction and found that correcting compression distances for multiple substitutions improved their behavior. Calculating distances after grouping amino acids based on their physicochemical properties incorporated more biological information. The modified compression distances used in this study also made it possible to estimate tree support using a method that closely resembles the bootstrap, the most popular support metric in phylogenomics.


INTRODUCTION
Understanding the historical relationships among living organisms is an major goal in the fields of comparative genomics and phylogenetics.In the current "phylogenomic era" the primary source of tree estimation error is systematic rather than stochastic in nature.Adding data ameliorates stochastic error but it exacerbates systematic error [20].Standard methods of phylogenetic estimation (e.g., maximum likelihood [ML] with commonly-used models of sequence evolution) will yield incorrect estimates of topology if those sources of error are strong enough.Long-branch attraction (LBA) is the best characterized source of systematic error [12], but changes in state frequencies (i.e., shifts in nucleotide or amino acid frequencies) is also important [20].However, discordance among gene trees has come to the forefront as a source of systematic error in phylogenomics [11].This discordance reflects processes like incomplete lineage sorting (ILS) that result in genuine conflict among true gene trees [11,29].
The simplest analytical approach in phylogenomics is ML analysis of concatenated multiple sequence alignments for many different loci.However, this "ML concatenation" approach can yield an incorrect topology when there is ILS [22,35].Summary coalescent analyses, which estimate individual gene trees (typically using ML) and combine those trees [29], are the most commonly used solution to the ILS problem.Summary coalescent analyses have many advantages but errors in the estimated gene trees can be problematic for summary coalescent methods [26,40].Perhaps surprisingly, analyses of concatenated data using certain distance methods were recently shown to be statistically consistent (i.e., to converge on the true tree in the limit of infinite data) [1,8].
Genetic distance calculations typically use pairs of aligned sequences, but data compression methods can be used to alignmentfree calculate distances [7,23,24].Compression distances are based on the idea that compressing the concatenation of two very similar files results in a smaller file than compressing the concatenation of two very different files.In phylogenetics, the normalized compression distance (NCD) would be calculated using equation 1 with files of sequences represented as one-letter codes (i.e., nucleotides ∈ {A, C, G, T} or the 20 amino acids): In equation 1, C(x) and C(y) are compressed file sizes for sequences from taxa x and y and C(x,y) is the compressed file size for their concatenation.NCD exhibited good performance in a simulation study that assumed ILS [45].However, it has seldom been used in empirical studies; most NCD trees have used mammalian mitogenomic data (Fig. 1).Unfortunately, those trees strongly suggest the NCD is susceptible to LBA.Compression distances must be less sensitive to LBA to be useful phylogenomic tools.The goal of this study is to determine whether it is possible to modify the NCD to incorporate biological information and if doing so makes it less sensitive to sources of systematic bias, like LBA.

Monotremata (platypus)
Primates (human, great apes) A (True tree) Figure 1: Published compression distance trees for mammals appear to be affected by LBA.(A) Consensus mammal tree based on many phylogenomic studies (reviewed by Murphy et al. [30]; this tree can be viewed as true.The rodent branch is extended by a dashed line to emphasize the high rate of molecular divergence in muroid rodents (e.g., rat and mouse); the high rate creates the potential for LBA that disrupts the "indicator clade" (Euarchontoglires).(B) Li et al. [23] compression tree rooted rooted between monotremes (e.g., platypus) and other mammals.(C) Cilibrasi and Vitányi [7] compression tree, rooted to a fish outgroup (not shown).
Note that the only rodents in the compression distance trees [7,23] were the rat and mouse.

METHODS
NCD is an approximation to an idealized but uncomputable distance based on Kolmogorov complexity [25].The quality of this approximation depends on the properties of the data compression program.Comparing NCD(x,y) to NCD(x,x) and NCD(y,y), as shown in equation 2, should correct for compressor imperfections (to some degree).
Correcting NCD for the behavior of the compressor does not yield additive distances and non-additivity is the likely basis for LBA in distance analyses.This issue can be addressed using a correction analogous to the Poisson and Γ distances for aligned data [31], as shown in equations 3 and 4.
Both equations correct the NCD c for multiple substitutions, but equation 4 also accommodates variation in substitution rates across sites using an adjustable parameter () related to that variation.This is desirable because among-sites rate variation is pervasive in proteins [10].However, different types of substitutions also accumulate at different rates in proteins (see Braun [4] for details).This can be addressed by using alternative amino acid alphabets (Table 1) might address the second issue.Four of the alphabets group amino acids based on physicochemical properties and the fifth groups captures the GC content of codons; the GC alphabet was used because amino acid frequencies are correlated with genomic base composition [37,39].NCD values were calculated using gzip with best compression (gzip -9).Concatenated sequence files were "tiled"; each line in file x,y is a protein from taxon x immediately followed by its taxon y ortholog (proteins were reiterated twice in files x,x and y,y).This strategy requires ortholog identification but it does not require multiple sequence alignment.Support values analogous to the bootstrap, the support metric using in most phylogenetic studies [16]), were calculated by randomly sampling orthologs (100 orthologs for each replicate).Least squares trees were estimated from the distances using PAUP* 4.0b166 [44] and consensus trees were generated using SumTrees.pyfrom DendroPy [43].Programs and links to additional information (including the data used for these analyses) are available from https://github.com/ebraun68/compdisttest.
The mammalian dataset comprised 1453 orthologs from the Or-thoMaM v. 10c [38]; 19 taxa that resemble the taxon sample in Fig. 1 of Li et al. [23] were analyzed.The avian dataset comprised 2590 aligned orthologs from Jarvis et al. [19]; 21 taxa were selected for analysis; TAPER [46] was used to identify homology errors in the Jarvis data, which have been noted earlier [41].Analyses were limited to loci with all selected taxa and no TAPER masking.ML concatenation analyses used IQ-TREE, [28] and support was calculated using the ultrafast bootstrap [15].Patristic distances for the ML trees (the sum of branch lengths connecting each pair of taxa) were calculated using the T-REX server [3].

RESULTS AND DISCUSSION 3.1 Transformed compression distances are less susceptible to systematic bias
The concatenated ML tree (Fig. 2) was congruent with recent genomescale mammalian phylogenies [13,30]; those phylogenies were generated using a variety of methods, including some that explicitly incorporate discordance among gene trees due to ILS.The only node with <100% support was Euungulata (Perissodactyla [horse and rhinoceros] + Cetartiodactyla [cow and whales]); that clade is poorly supported in other phylogenomic studies.The raw NCD c tree broke up the LBA "indicator clade" (Euarchontogilres, see Fig. 3A), suggesting that raw NCD c trees are susceptible to LBA under conditions where standard ML is robust.NCD c branch lengths had proportionally longer terminal branches than the ML tree and they exhibited less heterogeneity in root to tip distances.IQ-TREE topology obtained using the best-fitting model (Q.bird+F+I+G4 [27]) identified using the -m TEST option [21].Support values reflect the ultrafast bootstrap [15]; unlabeled nodes had complete support.
Using NCD Γ (equation 4) to calculate distances appeared to ameliorate LBA (Fig. 3B).Relative branch length differences between ML tree and the NCD Γ tree were evident, suggesting that the two methods extract different information from the data.The most striking branch length difference involves the wallaby.Provocatively, NCD Γ trees with  <0.7 had, at most, weak support for marsupial monophyly.NCD Γ with =0.7 is likely reasonable since the ML concatenation estimate of  for variable sites in these sequences was 0.7531.Moreover, support for Euarchontoglires was >50% for all alphabets when  was between 0.3 and 0.8 (see github).There is no ideal way to estimate ; the  parameter for aligned Γ distances is non-identifiable given pairwise comparisons [42].ML estimates of the  parameter are themselves imperfect because ILS can lead to sites that appear artificially fast (see Fig. 5 in Houde et al. [17]).Estimating rate heterogeneity parameters is fertile area for further research if compression distances are to become a mature tool in phylogenomics.

Different alphabets capture distinct evolutionary information
Pairwise NCD c values calculated using different alphabets (Table 1) were quite similar, but they were not identical (Fig. 4).Nevertheless, there were some general patterns; uncorrected NCD c values for the two-and three-state alphabets showed consistently higher slopes than the standard 20-state alphabet whereas the uncorrected values for the four-and six-state alphabets had shallower slopes.Support for some nodes also differed based on the alphabet.NCD Γ trees had relatively high support for Euarchontoglires when the value of  was between 0.3 and 0.7 (Fig. 5A).However, the behavior of the Hanada and Dayhoff alphabets differed from the other alphabets (Table 1); both of those alphabets provided higher support for Euarchotoglires for >0.7 and for =0.2.Support for marsupial monophyly in NCD Γ trees exhibited a pattern that was essentially the opposite of the pattern for Euarchontoglires.As described above, NCD Γ trees <0.7 had greatly reduced support for Marsupialia (Fig. 5B).However, the different patterns of support for Euarchontoglire and Marsupialia also extended to the alphabets; calculating NCD Γ using the Hanada and Dayhoff alphabets always resulted in lower support marsupial monophyly, regardless of the  value.For additional details, see the github page for this project.

Compression distances exhibit similar behaviors with mammal and bird proteins
This overall pattern of increase for NCD c observed for avian proteins (see github) was essentially identical to that observed for mammalian proteins (Fig. 4).This is not especially surprising given that models of protein sequence evolution estimated from aligned data are very similar for birds and mammals [32].However, avian coding regions are known to exhibit a high degree of variation in GC-content [5,6,18,34].Moreover, this variation is correlated, at least to some degree, with evolutionary rate [5,18].This prompted us to compare uncorrected NCD c values for the standard amino acid alphabet and the GC alphabet to patristic distances for the ML tree of birds.The concatenated ML tree for the avian dataset exhibited substantial branch length heterogeneity (Fig. 6A) .The deepest divergence   within Neoaves (the clade comprising most extant birds) was between Strisores (hummingbirds, nightjars, and allies [36]) and all other Neoaves in this tree.For this taxon sample, the correct root of Neoaves is likely to be between Mirandornithes (flamingos and grebes [36]) and other Neoaves [5,6,18].This conflict was not completely unexpected; several studies [6,34] have noted that analyses of coding data often place Strisores sister to all other Neoaves (e.g., the coding exon trees in Jarvis et al. [18] and the primary tree in Prum et al. [33]).Perhaps surprisingly, given the presumed role of GC-content variation across taxa in biased estimation of the bird tree [5,6,18,34], we observed very similar patterns of increase for NCD c values calculated using the standard and GC alphabets relative to patristic distances (Fig. 6B).Although the two alphabets clearly extract different information from the data, the most noticeable difference between the alphabets was the higher values for the GC alphabet, which was expected based on the mammal data (Fig. 4).A recent study showed that the patterns of base composition change across the bird tree are more complex than expected for simple shifts in GC content [2]; it might be more difficult to detect those complex changes using the relatively simple GC alphabet.
Estimates of the bird tree based on transformed NCD c values had limited support regardless of the value of .However, we note the ML concatenation tree for birds exhibits several differences from the likely true species tree (Fig. 6A).In addition to the unexpected (A) Concatenated ML tree for birds used to calculate patristic distances.This topology was obtained using IQ-TREE topology with the best-fitting model (Q.bird+F+I+G4 [27]) identified using the -m TEST option [21].Red arrows indicate rearrangements relative to the likely true tree; moving the relevant clades to the positions indicated using the arrows would yield the best available estimate of the true tree (see Braun [5] for review).Avian clade names are from Sangster et al. [36].Unlabeled nodes had complete support.(B) Uncorrected NCD c values the standard and GC alphabets calculated for birds and compared to patristic distances.position of Strisores, the seriema is also misplaced relative to expectation (and that expectation is based on extensive data; see Braun [5] for review).Thus, neither ML concatenation nor analyses of compression distances were able to provide accurate estimates of phylogeny for birds.The part of the bird tree captured in this taxon sample is very challenging, with clear branch length differences, base compositional variation, and pervasive ILS (and potentially other sources of bias) [2,5,6,17,18,34].Thus, it is unsurprising that modified NCD trees for birds were inaccurate.

CONCLUSIONS
This study provided empirical evidence that phylogenetic analyses using unmodified compression distances (i.e., the NCD) are susceptible to LBA.However" it also revealed that there are straightforward ways to transform the NCD to correct this problem; specifically, the NCD Γ can be used to incorporate the potential for multiple that correct can this problem.This study also corroborated the hypothesis that calculating compression distances using alternative alphabets can reveal distinct signals in the data; using different alphabets affected support (Fig. 5).Although the transformation to incorporate among-sites substitution rate variation (i.e., the NCD Γ ) and the use of alternative alphabets both represent ways to incorporate biological information into compression distances, the Γ transformation is likely to be more important based on these analyses.Finally, the approach used in this study made it possible to generate support values similar to the bootstrap, which is a major benefit relative to other methods to calculate support that can be used with compression distance analyses (e.g., Li et al. [23]).This study did not show that analyses of compression distances (or analyses of any distances) were more robust to ILS than concatenated ML.However, overcoming LBA is a prerequisite for any useful phylogenomic method and that was the focus of this study.The demonstration that compression distances can be improved by adding biological information represents an important step in the development of those distances as useful phylogenomic tools.

Figure 3 :
Figure 3: Comparison of uncorrected and corrected NCD c trees for mammals.(A) Least squares tree for uncorrected ("raw") NCD c values.Support was calculated using 100 random samples of the full ortholog set (see Methods); unlabeled nodes had complete support.Red asterisks indicate conflicts with the ML tree.(B) Least squares tree for corrected NCD c values with =0.7.Support and conflicts with the ML tree are indicated as in part A. Distances calculations for both trees used the standard amino acid alphabet.

Figure 4 :
Figure 4: Different amino acid alphabets capture distinct information about evolutionary divergence.(A) Uncorrected NCD c values for each alphabet plotted against NCD c for the standard alphabet.The dashed green line has a slope of one.

BFigure 5 :
Figure 5: Support for selected clades in phylogenetic analyses of NCD c values.(A) Support for Euarchontoglires (ro-dents+primates). (B) Support for marsupial monophyly.Both graphs show different transformations of the NCD c as sets of columns.Column colors indicate the alphabet.Each set of columns indicates the transformation used to correct the NCD c ; Uncorr indicates no transformation.

Figure 6 :
Figure6: NCD c values for the standard amino acid alphabet and the GC alphabet exhibit similar patterns of increase in birds.(A) Concatenated ML tree for birds used to calculate patristic distances.This topology was obtained using IQ-TREE topology with the best-fitting model (Q.bird+F+I+G4[27]) identified using the -m TEST option[21].Red arrows indicate rearrangements relative to the likely true tree; moving the relevant clades to the positions indicated using the arrows would yield the best available estimate of the true tree (see Braun[5] for review).Avian clade names are from Sangster et al.[36].Unlabeled nodes had complete support.(B) Uncorrected NCD c values the standard and GC alphabets calculated for birds and compared to patristic distances.