A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction

Protein language models based on the transformer architecture are increasingly shown to learn rich representations from protein sequences that improve performance on a variety of downstream protein prediction tasks. These tasks encompass a wide range of predictions, including prediction of secondary structure, subcellular localization, evolutionary relationships within protein families, as well as superfamily and family membership. There is recent evidence that such models also implicitly learn structural information. In this paper we put this to the test on a hallmark problem in computational biology, remote homology prediction. We employ a rigorous setting, where, by lowering sequence identity, we clarify whether the problem of remote homology prediction has been solved. Among various interesting findings, we report that current state-of-the-art, large models are still underperforming in the "twilight zone" of very low sequence identity.


INTRODUCTION
An explosion in the number of known protein sequences is allowing researchers to leverage the Transformer [29] architecture and build Protein Language Models (PLMs) [4,11,13]. PLMs are highly appealing due to their ability to learn task-agnostic representations of proteins. In particular, they provide an alternative framework to link protein sequence to function without relying on sequence alignments and similarity. Sequence representations learned via PLMs have been shown useful for various prediction tasks, including predicting secondary structure [11], subcellular localization [11,26], evolutionary relationships within protein families [14], and superfamily [15] and family [20] membership.
Observations from recent studies indicate that PLMs, though trained exclusively on sequence data, learn structural information; work in [24] suggests that sequence-only PLMs indeed learn structural aspects. Scaling up to 15 billion parameters in ESM-2 (and training over 65 million unique sequences) yields representations that, harnessed through an equivariant NN, additionally predict tertiary structure (though not at AlphaFold2 accuracy) [17]. These reports are not entirely surprising; PLMs capture the well-understood selective pressures that have been exerted on protein sequences throughout millennia of evolution. These pressures originate from the functional requirements of proteins, which in turn determine their structure by affecting the evolution of their underlying sequences. This ability to encode structure is perhaps also a major aspect of the utility of PLMs in downstream prediction tasks related to protein function, including superfamily prediction, function co-localization, Gene Ontology categorization [16,18], and more.
We caution, however, that such performance, though seemingly impressive, may be somewhat exaggerated for various reasons. First, care has to be taken when constructing training datasets to remove sequence redundancy and to avoid data leakage, where proteins in the test dataset may have high sequence identity with proteins in the training dataset. Second, structure and function are well preserved above 30% sequence identity [25]. Proteins with similar structure and function are also present below this level of identity but cannot be detected from sequence similarity alone [25]. It remains unclear how PLMs perform in this zone (which some authors refer to as the "twilight zone" [25]).
One challenging, hallmark problem in computational molecular biology, remote homolog detection, is a suitable stress test for how much a PLM has learned from sequence information alone, and whether indeed it can detect remote homologs in the twilight zone. It is worth noting that (protein) remote homology detection refers to the identification of proteins that are similar in structure but share low sequence identity; this is a working definition. The term remote homology was originally introduced to refer to proteins that share a superfamily but not a family. For the purpose of computational studies, this working definition lends itself to a graded problem, where one lowers the sequence identity between proteins in the "test" dataset and the query/target protein, and determines whether proteins similar to the query can be detected. This is the setting for this paper, and it is in this setting, over decreasing levels of sequence identity, that we evaluate pre-trained, transformer-based PLMs (trained exclusively on protein sequences) of various sizes for their ability to detect remote homologs.
Remote homology prediction is a particularly appropriate problem to determine whether a PLM pre-trained exclusively over protein sequences has also encoded/learned structure information. As one lowers sequence identity, it becomes increasingly difficult to identify homologous proteins based on sequence; remote homologs are those that retain their function (and structure) similarity at low levels of sequence identity. So, if a PLM allows identification of homologs at very low levels of sequence identity, then it has additionally encoded structure in its learned representations.
In this paper, we select powerful, representative, state-of-the-art transformer-based PLMs (trained exclusively over protein sequence data) and evaluate whether representations learned by them aid in remote homology detection/prediction. We employ a rigorous setting, where, by lowering sequence identity, we clarify whether the problem of remote homology prediction has been "solved." Indeed, in contrast to existing pre-prints and other reported findings that enthusiastically declare the problem solved (see Section 2), we show through a careful evaluation that these reports are highly exaggerated. The problem, particularly as one reaches the truly challenging setting of 30% or lower sequence identity, remains challenging for all current, SOTA PLMs, including large ones such as ESM-2. This is one of the major findings of this paper.
An additional contribution of this paper is the presentation of metrics to objectively determine whether the distance between PLM-learned representations of proteins correlates with the distance between corresponding sequences. This becomes particularly important after removing from consideration easy, high-sequence-identity pairs. This analysis and others clarify and allow us to better understand the success and failure cases of PLMs for remote homology prediction. For instance, we identify which protein domains are most and least amenable to remote-homology prediction based on PLM representation similarity, and we provide several visualizations to aid our understanding of whether useful structural information is easily obtainable (or not) from PLM-learned representations of proteins.
The rest of this paper is organized as follows. We first relate some definitions, preliminaries, and necessary details about existing PLMs in Section 2. Section 3 relates our analysis setting and the metrics utilized. Section 4 reports our findings, and Section 5 concludes the paper.

RELATED WORK AND BACKGROUND

Protein Classification and Homology
Currently, the most commonly used definition of remote homology in computational studies is based on the hierarchical classification system for proteins provided in the SCOP2 [2,3] and SCOPe [5,12] databases [6]. These databases divide protein sequences into "domains" in levels of classes, folds, superfamilies, families, protein regions, and protein types. Generally, the criteria for family membership are related to sequence-level similarities; SCOP's documentation indicates that all sequences sharing sequence identity above 30% are grouped in the same family.
However, this appears to be a simplification of the actual criteria, as the analysis in [6] is based on similarity-based sequence clustering rather than all-to-all alignment and comparison of all protein sequence pairs in the database. Using this system, proteins belonging to the same superfamily are referred to as superfamily-level homologs [27]. Proteins in the same superfamily but in different families are considered remote homologs at the superfamily level [6,22,27].

Protein Language Models
Several iterations of PLMs have been developed since the advent of the transformer architecture. In particular, in this paper we employ three publicly available, pre-trained, SOTA PLMs to obtain representations for our analysis: (1) ESM-1 is the Evolutionary-Scale Modeling PLM [23]. ESM has been trained on 250 million protein sequences (a total of 86 billion amino acids) on masked-language-modelling tasks. While there are several lighter-weight ESM-1 variants, we utilize the ESM-1b variant with 33 layers and 650M parameters.
(2) ESM-2 is a more recent update to the ESM-1 architecture and was trained with variations spanning from 8M to 15B parameters [17]. For consistency, we used the 33-layer, 650M-parameter version. (3) ProtTrans T5 [10] is another, more recent PLM with self-supervised training, based on the original T5 model [21] for natural language processing. Specifically, ProtTrans-T5 is a 3B-parameter encoder-decoder model, and it was trained on a denoising task where 15% of the amino acids in the input were randomly masked.
All three of these models employ masked-token prediction as their training objective.

Classic Definition: Remote Homology
In this study, we utilize the Structural Classification of Proteins SCOP2 [2,3] database (latest update: 29 June 2022), containing 5,936 families and 2,816 superfamilies. SCOP2 defines family as a group of closely related proteins with clear evidence for their evolutionary origin and superfamily as a group that brings together more distantly related protein domains. The similarity among proteins in a superfamily is frequently limited to common structural features that, along with a conserved architecture of active or binding sites or similar modes of oligomerization, suggest a probable common evolutionary ancestry. Following the definition from [6,22,27], we first define that a pair of proteins, p_i and p_j, are remote homologs if they belong to the same superfamily but different families, as follows:

RH(i, j) = 1 if SF_i = SF_j and F_i ≠ F_j, and 0 otherwise, (Eq. 1)

where SF_i and F_i denote the superfamily and family label annotation of the i-th protein.

Hardened Definition: Remote Homology
We harden the above definition to accommodate a sequence identity threshold and focus on the truly hard cases; that is, no pair of remote homologs will share sequence identity above a predefined threshold. This threshold ensures that the pair falls into the "twilight zone" [25] in terms of sequence identity. The sequence identity is computed as a pairwise global alignment score. The extended definition is as follows:

RH_th(i, j) = 1 if SF_i = SF_j, F_i ≠ F_j, and seqid(s_i, s_j) ≤ th, and 0 otherwise, (Eq. 2)

where seqid(s_i, s_j) is the pairwise global-alignment sequence identity between the sequences of proteins i and j, and th is the threshold. We report experiments and results considering both of the above equations, which allows us to truly gauge the performance of various PLMs as the problem becomes harder (lower sequence identity). We observe in our study that there is a higher number of sequence pairs in different families with above 30% sequence identity than sequence pairs belonging to the same family in the SCOP2 database (figures not shown). This reinforces our hypothesis that extra filtering may be required to identify the non-trivial remote homologs, without high sequence-level similarity, with which to test PLMs. Fig. 1 shows the pairwise sequence identity distribution in the SCOP2 [2,3] database.
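The two definitions reduce to simple predicates over SCOP2 labels. A minimal sketch in Python, where the superfamily/family labels and the precomputed pairwise identity are passed in directly (the function and argument names are ours, for illustration):

```python
def is_remote_homolog(sf_i, sf_j, fam_i, fam_j):
    """Eq. 1: remote homologs share a superfamily but not a family."""
    return sf_i == sf_j and fam_i != fam_j

def is_hard_remote_homolog(sf_i, sf_j, fam_i, fam_j, seq_identity, th=0.30):
    """Eq. 2: additionally require pairwise sequence identity <= th,
    restricting pairs to the "twilight zone"."""
    return is_remote_homolog(sf_i, sf_j, fam_i, fam_j) and seq_identity <= th
```

Applying the hardened predicate with th=0.30 is what removes the trivial, high-identity pairs from the evaluation.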

Learned Amino-acid Level and Protein-level Representations
For each protein i, 1 ≤ i ≤ N, in the SCOP2 database, defined by its sequence of n_i amino acids, we obtain a corresponding representation Z_i ∈ R^{n_i × d} from a PLM transformer; in this representation, each amino acid of a protein is mapped into R^d. Given a learned Z_i ∈ R^{n_i × d}, we obtain the protein-level representation z_i ∈ R^{1 × d} by taking the average of the learned amino-acid-level features over the sequence length, as in:

z_i = (1/n_i) Σ_{k=1}^{n_i} Z_i[k, :]
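The mean-pooling step above is a one-liner in practice. A minimal plain-Python sketch (a per-residue embedding matrix represented as a list of n_i rows of length d; real pipelines would do this with a tensor `.mean(dim=0)`):

```python
def mean_pool(residue_embeddings):
    """Collapse an (n_i x d) per-residue embedding matrix into a single
    protein-level vector in R^d by averaging over the sequence length."""
    n = len(residue_embeddings)
    d = len(residue_embeddings[0])
    return [sum(row[k] for row in residue_embeddings) / n for k in range(d)]
```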

Comparing PLM-learned Representations of Proteins
Following the methodology of Rives et al. [22], we adopt cosine similarity between a pair of protein representations as our similarity metric. Specifically, for each pair of sequences in SCOP2, we compute the representation similarity as follows:

sim(i, j) = (z_i · z_j) / (||z_i|| ||z_j||)
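The similarity computation can be sketched directly from the formula (pure-Python for clarity; vectorized libraries would be used at SCOP2 scale):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two protein-level embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```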

Comparing Sequences of Proteins
To enable our analysis of the embedding representations of remote homologs in PLMs, we compute pairwise sequence alignments and identity scores for each of the N(N−1)/2 pairs of sequences in the SCOP2 database. To compute these, we used Biopython's [7] pairwise alignment tool with default parameters.
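The paper uses Biopython's aligner; as a stdlib-only illustration of what that computation involves, here is a minimal Needleman-Wunsch global aligner that reports identity as the fraction of identical positions over the alignment length (the scoring parameters below are illustrative, not Biopython's defaults):

```python
def global_identity(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment of two sequences; returns the
    fraction of aligned positions that are identical."""
    n, m = len(s1), len(s2)
    # Fill the DP score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back, counting identical positions and total alignment length.
    i, j, ident, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ident += s1[i - 1] == s2[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return ident / length
```

In practice, Biopython's `Bio.Align.PairwiseAligner` (with `mode = "global"`) performs this alignment far more efficiently and with its own default scoring.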

From Representation Similarity to Prediction of Remote Homology
We employ several metrics and forms of analysis to evaluate whether structural commonalities between pairs of sequences are reflected in their embeddings.
3.6.1 Query-based Analysis. Using each sequence's PLM-learned embedding as a query (q), we exclude all other sequences from the same family as the query (F_q) from the corpus of sequences (C) that will be queried. In our case, C refers to the set of all N sequences in SCOP2.
We then exclude from C all sequences sharing a sequence identity above a given threshold th with the query sequence. The remaining query-sequence pairs are denoted as {(q, s) | q ∈ C, s ∈ C_q}, where C_q is the set of candidate sequences remaining for query q after both exclusions. For evaluating the performance, we consider the ground truth to be true (i.e., the sequences are true homologs) if a sequence in the test dataset is from the same superfamily as the query and false otherwise, in accordance with Equation 2. We report three metrics: (1) AUROC, the area under the receiver operating characteristic curve obtained when ranking candidates by embedding similarity; (2) AUPRC, the area under the precision-recall curve, which is more informative under the heavy class imbalance of this task.
(3) Hit-10 [19] is the percentage of queries for which a true homolog was in the top-10 sequences with the most similar embeddings.
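The Hit-10 metric can be sketched in a few lines. A minimal illustration (our own naming; each query contributes a list of (similarity, is-true-homolog) pairs over its filtered candidate corpus C_q):

```python
def hit_at_k(query_results, k=10):
    """Fraction of queries whose k most-similar candidates contain at
    least one true homolog."""
    hits = 0
    for candidates in query_results:
        top = sorted(candidates, key=lambda t: t[0], reverse=True)[:k]
        hits += any(is_homolog for _, is_homolog in top)
    return hits / len(query_results)
```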
3.6.2 Clustering Analysis. We perform k-means clustering on the embeddings of sequences from the most-successfully-predicted and least-successfully-predicted superfamilies from each PLM based on the AUC (see above). We evaluate the quality of the resulting clusters and their agreement with the ground truth (i.e., whether sequences from the same superfamily are likely to be clustered together).
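A minimal illustration of this analysis: a plain-Python k-means (deterministic first-k initialization, for illustration only; real runs would use a library implementation with k-means++ or random restarts) plus a purity score measuring agreement with superfamily labels:

```python
from collections import Counter

def kmeans(points, k, iters=100):
    # Deterministic init with the first k points (illustrative shortcut).
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        changed = False
        # Assign each point to its nearest centroid (squared Euclidean).
        for i, p in enumerate(points):
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            if j != assign[i]:
                assign[i] = j
                changed = True
        # Recompute centroids as cluster means.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return assign

def purity(assign, labels):
    """Fraction of points whose cluster's majority label matches their own."""
    total = 0
    for c in set(assign):
        counts = Counter(l for a, l in zip(assign, labels) if a == c)
        total += counts.most_common(1)[0][1]
    return total / len(labels)
```

On "soft" superfamilies the expectation is a purity near 1.0; on "hard" ones, clusters mix superfamilies and purity drops.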

RESULTS & ANALYSIS

Experimental Setup
Our experimental setup is designed with the goal of accessibility and reproducibility.
4.1.1 Data Preprocessing. We opt to perform our analysis using all sequences in SCOP2 with minimal preprocessing or filtering. One exception is the removal of sequences where multiple spans were indicated within the same sequence, due to the ambiguity this creates when assessing the domains of the sequence and subsequences. We remove 506 such sequences from the total of 36,900 sequences provided in the SCOP2 database. Consequently, we have 2,260,440 remote-homolog pairs at the superfamily level. Note that we analyze significantly more remote homologs (24 times as many) than Rives et al. [22], who report performance on 92,944 pairs of remote homologs from SCOPe due to heavy filtration.
4.1.2 Sequence-Identity Thresholding. We compute the performance metrics using all protein sequences as individual queries. The thresholds we choose vary from 10% to 100% sequence identity in 5% increments. To compute AUROC, AUPRC, and Hit-10, we do not perform any sub-sampling or averaging of the protein sequences; instead, we calculate all query-vs-ground-truth pairs and compute the metrics once over all samples, for each value of the sequence-identity threshold. This has the advantage of providing robust and reliable metrics, but this strategy also weights our results in favor of the larger superfamilies when compared with the strategy of sampling a single query from each superfamily. So, to provide a more fine-grained analysis at the superfamily level, we also report the same metrics for individual superfamilies from "hard" and "soft" domains, that is, difficult-to-predict and easy-to-predict superfamilies for each PLM.
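Since the metrics are computed once over all query-candidate pairs per threshold, AUROC can be obtained directly from its rank-statistic (Mann-Whitney) interpretation: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. A quadratic-time sketch (the paper's actual metric pipeline is not specified; this is only a reference implementation of the definition):

```python
def auroc(scores, labels):
    """AUROC as P(score of random positive > score of random negative),
    counting ties as 0.5 (Mann-Whitney U / (n_pos * n_neg))."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))
```

For the millions of pairs in SCOP2, a sort-based O(n log n) implementation (e.g., scikit-learn's `roc_auc_score`) would be used instead.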

Performance in the Twilight Zone
Figures 2, 3, and 4 show AUROC, AUPRC, and Hit-10, respectively, for all three PLMs at varying levels of the sequence identity threshold. These metrics are also reported in numerical form in Tables 1, 2, and 3. For reference, "random" is added as a random baseline model; in it, the distribution of ground truths is unchanged, but random numbers are used for embedding similarities. In Table 1, the DeLong variance is reported below the AUROC scores. These results appear to confirm that PLMs still struggle to identify remote homologs in the "twilight zone" [1,25] from the sequence alone. We observe AUROC dropping sharply when the sequence identity threshold is lowered below 40%, indicating that above this threshold the problem is much easier. We also note much lower performance at remote homology than Rives et al. [24], even with no filtering (see "AUROC (Eq. 1)" in Table 1, or th=100% in Figure 2). Because the dataset used in their study of remote-homology prediction is not publicly available, we can only speculate as to the lower performance observed here. It may be due to the differences in the filtration applied to the dataset mentioned in Section 4.1.1, or to differences in methodology for computing embeddings or calculating metrics that go beyond the details listed in their paper. Because remote homolog pairs are exceedingly rare compared with the number of possible sequence pairs in SCOP2, there is a significant class imbalance in the ground truth when calculating the AUROC scores. Thus, DeLong variances are also provided to give a measure of the reliability of the AUROC scores, especially as the threshold is lowered and the number of positive ground-truth examples becomes even smaller. In addition, we observe a similar trend of decreasing performance in the AUPRC scores, indicating that these results are not simply an artifact of the worsening class imbalance as the threshold is lowered. The random baselines shown in Figures 2, 3, and 4 also confirm that the changing ground-truth distribution for different values of the threshold is not to blame for the decrease in performance.
The Hit-10 scores show a similar trend regarding model performance in the "twilight zone," but with the difference that ESM-1b now outperforms ProtTransT5 and ESM-2 on this metric. Because this metric is calculated at the query level and then averaged over all queries, this may indicate that there are some classes of query where ESM-1b can identify the remote homologs at least to some degree, but the other two models completely fail to assign a high "top-10" rank to the true homologs.

Protein Domain Analysis for Remote Homology Prediction
In addition to calculating AUROC across all queries in the SCOP2 database, we also calculate the same metrics separately for query sequences coming from each superfamily in SCOP2. To identify the "hard" and "soft" (i.e., difficult-to-predict and easy-to-predict) superfamilies for each PLM, we start with the 150 superfamilies with the highest number of remote homologs in SCOP2, and identify the 10 superfamilies with the highest AUC and the 10 with the lowest AUC when attempting to predict homologs based on PLM embeddings, using queries from that superfamily. Notably, the better-performing superfamilies tended to have fewer included sequences on average, indicating that these may be superfamilies with more refined and restrictive definitions than the larger superfamilies shown in Tables 4 and 5. Another explanation is that the inflated AUROC scores may be caused by the increased class imbalance for queries from the smaller superfamilies. However, the PRC column indicates that the bottom-10 superfamilies generally tended to also have lower PRC. To a lesser degree, this also holds for the Hit-10 scores, despite the fact that even many of the top-10 superfamilies had a Hit-10 score of zero.
Table 4 shows the superfamilies with the highest AUROC when using embeddings from the ProtTransT5 model (the best-performing PLM, judging by its AUROC in Table 1) to predict remote homologs, and Table 5 shows the superfamilies where the AUC was lowest. Similar tables, giving the hard and soft domains for the other two PLMs, are provided in the supplement. Note that the AUC scores used here use a sequence-identity filtering threshold of 30%.

Visual Analysis of Hard and Soft Sets
To visualize how well the "hard" and "soft" domains are separated in the representational space of the PLMs, we perform a T-SNE [28] dimensionality reduction to view the embeddings from these superfamilies in a two-dimensional plot. The T-SNE transformation is fit using all sequences in the SCOP2 database. Note that superfamily-based filtering is only applied later, when producing the visualizations. In Figures 5, 6, and 7, we report the top-5 "soft" domains in the top panel and the bottom-5 "hard" domains in the bottom panel for ESM-1b, ESM-2, and ProtTransT5, respectively. Subjectively, in all cases this appears to show cleaner and more defined clusters for the "soft" domains than for the "hard" ones. This indicates some level of agreement between the distances between sequences in our T-SNE projection and the cosine similarities between pairs of sequences that we used to identify remote homologs in the high-dimensional protein embedding space.

Distribution of Pairwise Embedding Similarity
To better understand the significance of a given similarity level between two sequences in the representational space of the PLMs, we visualize the distribution of embedding similarities across all sequence pairs in SCOP2 for all three PLMs in Figure 8. All three PLMs show unimodal distributions of embedding similarities. However, the distributions for both ESM models are skewed heavily toward higher similarities between embeddings. Interestingly, the distribution of ProtTransT5 embedding similarities is almost identical to the distribution of pairwise sequence identities in SCOP2 shown in Figure 1. However, while this may seem to indicate that the sequence information is retained by the model, it may actually be a coincidence: Figure 9 shows little correlation between pairwise sequence identity and pairwise embedding similarity in the ProtTransT5 model.

Table 4: Superfamilies with the highest AUROC (@th=0.3) in ProtTransT5, selected from a list of 150 superfamilies with the most remote homologs in the SCOP database. Note that the maximum possible number of sequence pairs that can be remote homologs is actually higher than the number of sequences in the superfamily, because the number of pairs grows as |S|(|S|−1)/2. Also note that the sequence counts reported in the "Num Seqs" and "Num RHs" columns are prior to applying the threshold.

Clustering Analysis
To better quantify how well-defined and separated the superfamilies from the "hard" and "soft" domains are in the representational space of PLMs, we provide a clustering analysis. Table 6 shows the performance of k-means clustering on the "hard" and "soft" domains using embeddings from each of the PLMs. Note that for each PLM, we use its own "hard" and "soft" domains based on the AUROC of that PLM's embeddings at predicting the remote homologs in those domains. Predictably, this unsupervised clustering is more successful at differentiating the "soft" superfamilies from each other than it is at differentiating the "hard" superfamilies. These results bolster the results achieved using pairwise cosine similarity, indicating that the distinction between "hard" and "soft" domains holds even at the cluster level (rather than just the pairwise level).
Table 6: Cluster separability and accuracy metrics for k-means applied to embeddings for all sequences coming from the top-10 "hard" and "soft" domains (i.e., superfamilies) for each model, where the superfamily label is taken as the ground truth.

CONCLUSION
Through our rigorous experiments, where we carefully controlled the difficulty of the setting for remote homology prediction, we have gained valuable insights into the current state of PLMs in identifying remote homology and capturing structural features of protein sequences. Our main set of results largely conflicts with the analogous analyses performed by other research groups investigating their own state-of-the-art PLMs. In summary, remote homology prediction remains difficult for PLMs where it matters; that is, as sequence identity gets lower. By conducting analyses in the challenging "twilight zone" and excluding numerous trivial samples from the dataset used to evaluate remote homology prediction metrics, we have shed light on the behavior of PLMs under difficult conditions. We have examined specific superfamilies where PLMs effectively capture remote homologs as well as cases where they exhibit poor performance, offering valuable insights for improving future PLMs and even facilitating the development of novel protein-modeling approaches beyond the traditional PLM paradigm.
In addition, our thorough analysis includes visualizations of various aspects of PLM representations that provide further understanding of their successes and failures.These visualizations complement our main conclusion and offer useful insights into the factors contributing to PLMs' performance.
We also uncovered important details regarding the distribution of protein domains and pairwise sequence identities in the SCOP database that supplement its original documentation and provide missing information regarding the presence of many sequences officially categorized as being in different families that in reality share high sequence identity.
In future work, we plan to leverage these findings to inform our exploration of different training regimes and model architectures. Rather than relying on sequence-level similarity, we aim to focus on performance in the "twilight zone" using a new benchmark dataset. Furthermore, we aspire to incorporate more biological knowledge to explain the successes and failures of existing PLMs through further analysis.
We believe that our work will be valuable to researchers dedicated to advancing protein structure models.The datasets, code, and analyses presented here are available at: github.com/amoldwin/plm-remote-homolog-analysis.

Figure 1 :
Figure 1: Histogram showing the sequence identity distribution of sequence pairs from the SCOP2 database.

Figure 2 :
Figure 2: AUROC and DeLong variance of embedding similarity as a predictor of homology, for embeddings from all three PLMs, as the filtering sequence identity threshold is decreased from 100% to 10%. A threshold of 100% indicates no filtering beyond removal of sequences in the same family as the query, following Eq. 1. PLM embeddings of each sequence in SCOP2 are used as queries.

Figure 3 :
Figure 3: AUPRC of embedding similarity as a predictor of homology, as filtering threshold is decreased from 100% to 10%, from all three PLMs.

Figure 4 :
Figure 4: HIT-10 of embedding similarity as a predictor of remote-homology, as filtering threshold is decreased from 100% to 10% for all three PLMs.

Figure 7 :
Figure 7: T-SNE plots of ProtTransT5 embeddings for sequences from the superfamilies with the (top panel) top-5 AUROC and (bottom panel) bottom-5 AUROC, as shown in Tables 4 and 5.

Figure 8 :
Figure 8: Histogram showing pairwise embedding similarities (cosine) for each model, using all pairwise comparisons between sequences from the SCOP2 database.

Figure 9 :
Figure 9: Scatterplot of embedding similarity vs. sequence identity for 1000 randomly-sampled pairs of sequences in the SCOP2 database. Embeddings shown are from ProtTransT5.

Table 1 :
AUROC Comparison. DeLong variances are shown below the AUROC score.