AcrTransAct: Pre-trained Protein Transformer Models for the Detection of Type I Anti-CRISPR Activities

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins serve as a formidable defense mechanism for bacteria against foreign DNA; on the other hand, some bacteriophages (phages) and other mobile genetic elements have evolved anti-CRISPR (Acr) proteins to counteract CRISPR-Cas systems and ensure their own survival. Because Acr proteins provide phages with a fitness advantage relative to the bacteria that they infect, accurately identifying Acr proteins that inhibit CRISPR-Cas systems has the potential to significantly and positively impact our ability to harness phages to fight antimicrobial resistance. However, Acr identification is, at present, laborious and involves costly experimental procedures. Existing computational tools for protein-protein interaction (PPI) prediction are not designed to predict complex inhibition, which can be the collective result of multiple PPIs. In this study, we developed a transformer-based deep neural network, AcrTransAct, to predict the probability of Acr-mediated CRISPR-Cas inhibition. Our model comprises two main components: (1) a feature extraction module that incorporates a pre-trained Evolutionary Scale Modeling (ESM) protein transformer and the NetSurfP-3.0 secondary structure prediction system; (2) a classification module that consists of either a convolutional or recurrent neural network. We created an inhibition dataset compiled from two Acr databases, AcrHub [30] and Anti-CRISPRdb [5], and several published works [9, 12, 18, 20]. The AcrTransAct model is trained and tested on this dataset. We achieved an accuracy of 95% and an F1 score of 0.95 in predicting the inhibition of I-C, I-E, and I-F CRISPR-Cas systems by Acrs in our dataset. A web application of AcrTransAct (https://acrtransact.usask.ca), built with the best-performing models from this study, predicts the probability that a putative Acr protein inhibits multiple CRISPR-Cas systems.
Our code and data are available here: https://github.com/USask-BINFO/AcrTransAct.


INTRODUCTION
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins provide a powerful and efficient mechanism for bacterial immunity against foreign genetic material, such as DNA from bacteriophages (phages), which are viruses that infect and destroy bacteria. CRISPR-Cas systems consist of two main components: Cas (CRISPR-associated) proteins and CRISPR-arrays [3, 11]. CRISPR-arrays are comprised of short, repeated, palindromic sequences that are separated by unique spacer sequences. Spacer sequences are acquired by bacteria through a process called adaptation, during which pieces of invading phage DNA are extracted and integrated into the CRISPR-array [8]. In a process called biogenesis, a precursor crRNA (pre-crRNA) transcript is produced, along with a transcript(s) encoding one or more Cas proteins. The pre-crRNA transcript is then cleaved into smaller units called CRISPR (cr)RNAs, onto which the translated Cas protein(s) assemble. The resulting ribonucleoprotein complexes, also known as surveillance complexes, are armed with crRNAs that are complementary to previously encountered phage DNA. Finally, during a process called interference, invading phage DNA is targeted, recognized, and cleaved, neutralizing the threat. CRISPR-Cas systems are categorized into two classes depending on whether their surveillance complex contains multiple Cas proteins (Class 1) or a single Cas protein (Class 2) [15]. Each class is subcategorized into multiple types and subtypes. Since CRISPR-Cas systems can precisely target DNA, they have been harnessed for biotechnology and medical applications.
The Red Queen hypothesis [26] proposes that organisms must continually evolve new mechanisms of resistance to parasites to avoid extinction; therefore, some phages and other types of mobile genetic elements (MGEs) that infect bacteria have evolved anti-CRISPR (Acr) proteins that inhibit CRISPR-Cas systems, enabling their own survival. Many Acrs appear to inhibit a single CRISPR-Cas subtype encoded by a particular bacterium. An example is AcrIF1, which inactivates the type I-F CRISPR-Cas system of Pseudomonas aeruginosa [18]. This suggests that these Acrs, like AcrIF1, may have evolved to specifically target the CRISPR-Cas system of a certain bacterial species, despite likely encountering other CRISPR-Cas subtypes. For example, AcrIF1 is likely exposed to multiple systems, as Pseudomonas species encode I-C and I-E CRISPR-Cas systems in addition to I-F [25]. In contrast, other Acrs, such as AcrIF2, inactivate divergent CRISPR-Cas systems, such as the type I-F and type I-C CRISPR-Cas systems of P. aeruginosa, suggesting that they may be under additional evolutionary pressures [2, 4, 7, 12].
Since Acrs can inhibit or modulate CRISPR-Cas activity, it is vital to understand when and how the inhibition works. This is especially important as we begin to explore the use of phages as natural alternatives for fighting antimicrobial resistance [17, 29]. Unfortunately, the in vivo detection of Acr inhibitory activity can be time-consuming and costly. In silico prediction of Acr inhibition can serve as an important method to accelerate the discovery of functional Acr proteins. However, predicting the inhibition of a certain CRISPR-Cas system by an Acr is a nontrivial task. Existing protein-protein interaction (PPI) prediction software, such as D-SCRIPT [22] and TAGPPI [23], cannot predict Acr-mediated CRISPR-Cas inhibition due to the complex interactions of Acrs with multiple Cas proteins, especially for Class 1 CRISPR-Cas complexes, which are comprised of several Cas proteins assembled around a crRNA molecule [16]. Most PPI prediction tools focus on interactions between two proteins. However, in CRISPR-Cas inhibition by Acrs, one protein (the Acr) interacts with a complex of proteins (the CRISPR-Cas system) that collectively contribute to the inhibition. This means that an Acr can interact with multiple Cas proteins and disrupt the CRISPR-Cas system by interfering with their functions [31]. Therefore, an effective predictive model for inhibition needs to consider multiple proteins to assess the likelihood of inhibition.
There are several challenges in predicting Acr-CRISPR-Cas inhibition. First, there are multiple mechanisms by which an Acr can inhibit a CRISPR-Cas system, such as preventing Cas protein recruitment, blocking crRNA maturation, or altering or blocking the function of the ribonucleoprotein complex responsible for interference activity [31]. Second, the outcome of an Acr-CRISPR-Cas interaction can be variable, depending on factors such as the relative stoichiometry of the Acr to CRISPR-Cas components [19]. This variability can make it difficult to predict the effect of an Acr on a CRISPR-Cas system. There have been advancements in classifying Acr proteins using deep learning approaches. For example, AcrNET [13] categorizes input sequences as either Acr or non-Acr and can identify the class of Acrs to which the Acr belongs; however, it cannot perform an inhibition prediction between an Acr and a specific CRISPR-Cas system. To the best of our knowledge, no computational methods have been developed for such inhibition prediction. To combat these challenges and address this void, we present a transformer-based deep neural network (DNN) called AcrTransAct to computationally predict the inhibition of CRISPR-Cas systems by Acr proteins. Our model is trained on a novel inhibition dataset that we developed for the prediction task. Our best model achieved an accuracy of 95%, an F1 score of 0.95, and an AUC score of 0.96 in predicting the inhibition of I-C, I-E, and I-F CRISPR-Cas systems by Acrs. We also built several of our top-performing models into the AcrTransAct web application at https://acrtransact.usask.ca, which allows users to submit a putative Acr protein sequence and receive a comprehensive report on the probability of the Acr inhibiting multiple CRISPR-Cas systems. Our work provides a valuable tool for predicting interactions between Acrs and CRISPR-Cas systems and will support Acr activity experiments performed in the laboratory by enabling the selection
of the most likely to be functional Acr from a large number of homologous candidate proteins. Furthermore, we provide insights into the capabilities of transformer networks in biological sequence analysis tasks, especially in the context of PPIs. Transformer networks are types of DNNs that utilize a self-attention mechanism [27], by which a context vector is created for each token in the input sequence using the surrounding tokens. A token can be a word, a DNA k-mer, or an amino acid. Attention scores between each pair of tokens are calculated using matrices called the query, key, and value. These matrices are learned during the training process and are used to project the input sequence into the query, key, and value spaces, respectively. The attention scores are then calculated by taking the dot product of the query and key vectors, which measures the similarity between each pair of tokens, followed by applying a softmax function to obtain the attention weights. Finally, the values are weighted by the attention scores and summed to obtain the context vector. Each layer in a transformer network can have multiple heads, where each head learns a different context vector. By creating these context vectors, transformers can capture long-range dependencies between pairs of tokens. The calculation of the context vectors is done in parallel, which allows transformers to be trained efficiently. Token embeddings can be learned during training, and then the positional encoding module in the transformer architecture (Figure 1) adds relative position information to each embedding vector based on the length of the sequence and the position of the token in it. Both the encoder and the decoder parts of the transformer include multi-head attention and feed-forward modules. These modules can be stacked several times, as shown in Figure 1, to create models for comprehensive tasks. Auto-encoding models (stacked encoders) make predictions based on all tokens before and after the masked token, while auto-regressive models (stacked decoders) predict each token using only the ones before it. Both auto-encoding and auto-regressive models are trained using masked language modeling, in which the models predict masked tokens in an input sequence.
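The single-head attention computation described above can be sketched in a few lines of NumPy (a toy illustration with arbitrary dimensions, not the actual ESM-2 implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Wq/Wk/Wv: learned projections of shape (d_model, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token similarities
    weights = softmax(scores, axis=-1)       # attention weights per token
    return weights @ V                       # one context vector per token

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
ctx = self_attention(X, Wq, Wk, Wv)
print(ctx.shape)  # (5, 4)
```

In a multi-head layer, several such heads run in parallel on the same input and their context vectors are concatenated.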

BACKGROUND
The trained transformer models can be used as an information extraction tool for an input sequence. The input can be natural language sentences, amino acid sequences (for protein models), or k-mers of DNA sequences (for DNA models). The input is mapped to a higher-dimensional feature space using the embedding layer and then passed to the feature extraction module, which can be a stack of encoders or decoders. Using the features from the extraction module, token-level tasks such as protein secondary structure prediction or sequence-level tasks such as protein family classification are possible. The extracted features are useful representations of the input sequence, capturing important information about the relationships and dependencies between the different elements of the sequence.
Transformers from the ProtTrans [6] and Evolutionary Scale Modeling (ESM) [14] projects include both auto-regressive and auto-encoding models that are trained on large protein sequence databases such as UniRef [24] (more than 4 million sequences), which include amino acid sequences with their functional annotations. ESM-2 models are the most recent versions of the ESM models. They have achieved state-of-the-art performance on a variety of complex protein-related tasks [14], such as residue-residue contact prediction and prediction of a protein sequence from its backbone atom coordinates. These models are trained on the UniRef50 [24] dataset using masked token prediction. The UniRef dataset is a clustered set of sequences from the UniProt [1] Knowledgebase and selected UniParc records. UniRef50 is built by clustering UniRef90 seed sequences that have at least 50% sequence identity and 80% overlap with the longest sequence in the cluster. This dataset provides a significantly reduced sequence space while still maintaining high sequence diversity. The largest ESM-2 model has 48 layers and 15 billion parameters, and the smallest has 6 layers with 8 million parameters.

THE ACR-CRISPR-CAS INHIBITION DATASET
Our Acr-mediated CRISPR-Cas inhibition dataset includes experimentally verified positive/negative inhibition activity, consisting of 227 Acr and CRISPR-Cas system pairs: 132 pairs of positive (functional) inhibition and 95 pairs of negative (non-functional) inhibition. The data is sourced from the AcrHub [30] and Anti-CRISPRdb [5] databases and several published works [9, 12, 18, 20]. Each row in the dataset, representing one Acr and CRISPR-Cas system pair, includes the Acr sequence, the Cas protein sequences, the identity of the organism encoding the Acr protein, the type of CRISPR-Cas system used for experimentation (i.e., type I-C, I-E, or I-F), the bacterial species/strain encoding the CRISPR-Cas system, and a label that indicates whether the Acr inhibits the CRISPR-Cas system (label 1) or not (label 0). We use the Cas proteins in the effector module [16], in addition to Cas3, for each CRISPR-Cas system in our work. These Cas proteins have been shown to interact with Acrs [16, 30]. Currently, the dataset only includes type I CRISPR-Cas systems and the Acrs that have been tested against them, because most of the records in the sources we used belong to type I systems, especially subtypes I-F and I-E. Several research papers offer both positive and negative test results on these systems, which is essential for training our model. Note that for subtypes I-B and I-D, we could not identify the bacterial strains used in the experiments, so we did not include these sub-types. Figure 2 illustrates the distribution of samples for each sub-type of type I CRISPR-Cas system in our dataset, along with the ratio of negative to positive samples. The CRISPR-Cas systems in our dataset are from the following bacterial species/strains, with the type of the system labeled in parentheses:
• Pseudomonas aeruginosa PA14 (type I-F), PaLML1 (type …)
• Serratia species American Type Culture Collection (ATCC) 39006 (types I-F and I-E)
The dataset is divided into two sets, one for training and validation
and the other for testing, with the test set containing 20% of the data (46 samples). For the purpose of hyper-parameter search, 5-fold cross-validation is conducted on the train-validation set to identify the optimal parameters based on cross-validation results. Subsequently, the model with these parameters is evaluated on the test set, which consists of samples that were not used during the hyper-parameter search.
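The split described above can be sketched as follows (a simplified NumPy illustration; the function name and the absence of stratification are our assumptions, not the authors' exact procedure):

```python
import numpy as np

def split_and_folds(n_samples, test_frac=0.2, k=5, seed=42):
    """Shuffle indices, hold out a test set, and build k cross-validation
    folds over the remaining train/validation indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_frac))   # 20% of 227 -> 46 samples
    test_idx, trainval_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(trainval_idx, k)
    cv = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        cv.append((train, val))
    return trainval_idx, test_idx, cv

trainval, test, cv = split_and_folds(227)
print(len(test), len(trainval), len(cv))  # 46 181 5
```

Each of the five (train, val) pairs is used once for hyper-parameter evaluation; the held-out test indices never enter the search.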

METHODOLOGY
AcrTransAct is a transformer-based DNN that uses two inputs: an Acr protein sequence and the list of Cas proteins from a CRISPR-Cas complex. The network has two major modules, as depicted in Figure 3: (1) a feature extraction module consisting of two parts, a pre-trained sequence feature extractor based on the ESM-2 protein model and a structural feature predictor utilizing the NetSurfP-3.0 system [10]; and (2) a classification module that utilizes Convolution (CONV) or Long Short-Term Memory (LSTM) layers in combination with two Fully Connected (FC) layers to predict the likelihood of inhibition between the two inputs.
The ESM-2 sequence feature extractor that is pre-trained on the UniRef50 protein database is used to compute latent features from the input sequences.The structural feature predictor uses NetSurfP-3.0 models to map the input sequences into different types of secondary structure features.Feature vectors are computed individually and independently for each Cas protein and the Acr protein.
For each protein sequence of length L, the latent feature vector is of size L × d, where d is the size of the vector that the ESM-2 network generates for each residue. Larger versions of the ESM-2 model have a larger d, as shown in the Feature Dimension column in Table 1. Previous works [13, 21] have shown that structural features (SF) can be informative for PPI prediction. Therefore, in our study, we use NetSurfP-3.0 to map the input sequences into four classes of structural features: 3-state secondary structure (Q3), accessible surface area (ASA), relative solvent accessibility (RSA), and residue disorder. The secondary structure elements (helices, sheets, or coils) of a protein sequence are represented by Q3. These structural elements play a crucial role in determining protein function and interaction with other molecules. The surface accessibility of amino acid residues is defined by ASA and RSA, which can be indicative of interaction potential. Residue disorder can be useful in PPI, as it allows for flexibility in binding sites, accommodating a wider range of interacting partners and influencing binding specificity and affinity. After the structure elements are predicted, we convert each Q3 state to a one-hot encoded vector of size 1 × 3. The one-hot vector, along with ASA, RSA, and disorder, creates a vector of size 6 for each residue. For a protein sequence of length L, the structural feature vector is of size L × 6. The computed latent and structural features are then concatenated and fed into the classification module for prediction.
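The per-residue feature construction can be illustrated with a short NumPy sketch (the helper names are hypothetical and d = 320 is just an example embedding size; real features would come from ESM-2 and NetSurfP-3.0 rather than the random stand-ins used here):

```python
import numpy as np

Q3_STATES = {"H": 0, "E": 1, "C": 2}  # helix, sheet, coil

def structural_features(q3, asa, rsa, disorder):
    """Per-residue structural vector: 3-dim one-hot Q3 state
    plus ASA, RSA, and disorder scores -> (L, 6)."""
    L = len(q3)
    onehot = np.zeros((L, 3))
    onehot[np.arange(L), [Q3_STATES[s] for s in q3]] = 1.0
    extras = np.stack([asa, rsa, disorder], axis=1)   # (L, 3)
    return np.concatenate([onehot, extras], axis=1)   # (L, 6)

rng = np.random.default_rng(1)
L, d = 7, 320                                # d: hypothetical ESM-2 feature dim
esm = rng.normal(size=(L, d))                # stand-in for ESM-2 latent features
sf = structural_features("HHECCCH",
                         rng.random(L), rng.random(L), rng.random(L))
combined = np.concatenate([esm, sf], axis=1)  # (L, d + 6) per-residue features
print(combined.shape)  # (7, 326)
```

The combined (L, d + 6) matrix is what the classification module would consume for one protein.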
The classification module is a binary classification network that receives the extracted features from the feature extraction module as input and outputs the probability of inhibition. We designed the classification module with two different types of networks, using either 1-dimensional CONV or LSTM layers, followed by two FC layers. Sigmoid is used as the activation function of the last layer in the network to generate a probability in the range [0, 1]. We use weight decay, dropout, and batch normalization as methods of regularization.
Furthermore, we utilize a weighted binary cross-entropy (WBCE) loss to address the label imbalance present in our data. The weighted loss assigns a higher weight to the minority class, allowing the model to pay more attention to it during training. This improves the model's performance by mitigating the bias towards the majority class and enhancing its ability to accurately predict the minority class.
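A minimal NumPy sketch of the weighted BCE idea follows (the inverse-frequency weighting shown here is one common choice and an assumption on our part, not necessarily the paper's exact weights):

```python
import numpy as np

def weighted_bce(y_true, y_prob, w_pos, w_neg, eps=1e-7):
    """Weighted binary cross-entropy: each class's log-loss term
    is scaled by its class weight before averaging."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    loss = -(w_pos * y_true * np.log(y_prob)
             + w_neg * (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()

# Example weights, inverse to class frequency (132 positive, 95 negative pairs)
n_pos, n_neg = 132, 95
w_pos = (n_pos + n_neg) / (2 * n_pos)   # < 1: majority class down-weighted
w_neg = (n_pos + n_neg) / (2 * n_neg)   # > 1: minority class up-weighted

y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.6, 0.8])
print(weighted_bce(y, p, w_pos, w_neg))
```

With these weights, errors on the rarer negative class cost more than equally sized errors on the positive class.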
To comprehensively evaluate the performance of AcrTransAct in predicting Acr-CRISPR-Cas inhibition, the F1 score, accuracy, Area Under the Curve (AUC), precision, and recall are used in our work. The PyTorch framework was used to develop the models. The models are trained on a single NVIDIA RTX 3080 Ti for a maximum of 250 epochs each, with a batch size of 32. We tuned hyper-parameters such as the dropout rate, the number of CONV output channels, and the LSTM hidden size using the Weights and Biases (WandB) library. To prevent over-fitting, we used early stopping, dropout, and weight decay. Moreover, we implemented a learning rate reduction strategy, wherein the model's learning rate was decreased by a factor of 0.9 if the validation loss did not improve for 10 consecutive epochs.
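The learning-rate reduction strategy can be sketched as a small scheduler class (a simplified stand-in for the kind of reduce-on-plateau scheduler PyTorch provides, assuming factor 0.9 and patience 10 as described):

```python
class ReduceLROnPlateau:
    """Multiply the learning rate by `factor` when the validation loss
    has not improved for `patience` consecutive epochs."""

    def __init__(self, lr, factor=0.9, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:            # improvement: reset the counter
            self.best, self.bad_epochs = val_loss, 0
        else:                               # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor      # decay LR and restart the count
                self.bad_epochs = 0
        return self.lr

sched = ReduceLROnPlateau(lr=1e-3)
for loss in [0.5] + [0.6] * 10:  # one improvement, then 10 stagnant epochs
    lr = sched.step(loss)
print(round(lr, 6))  # 0.0009 — reduced once by the 0.9 factor
```

Combined with early stopping, this lets training settle into smaller steps before halting entirely.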

RESULTS
As described in the model design, we experiment with CONV and LSTM layers in our network. We try four different types of features as input to the classification module to compare how different features contribute to inhibition classification: 1) only the protein sequences, using one-hot encoding to map each amino acid to a vector of size 1 × 21 (20 positions for the known amino acids and 1 for unknown or missing); 2) only structural features (SF); 3) only the extracted ESM-2 latent features; and 4) the ESM-2 latent features combined with SF. Since, to the best of our knowledge, no other computational methods have been developed to predict protein inhibition, we cannot demonstrate a performance comparison between our model and other similar methods for benchmarking. However, we employ the first two types of features as baselines to showcase the efficacy of the selected latent input features (ESM-2 features) and of our approach of combining these features with SF in enhancing inhibition prediction. With the last type of feature, where structural features are used together with extracted ESM-2 latent features, we assess whether the inclusion of structural features adds information that ESM-2 might not have captured.
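The one-hot baseline encoding can be sketched as follows (the helper name is ours; a minimal illustration of the 1 × 21 per-residue scheme):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; index 20 = unknown

def one_hot_protein(seq):
    """Map a protein sequence to an (L, 21) one-hot matrix:
    20 positions for known amino acids, 1 for unknown/missing."""
    idx = [AMINO_ACIDS.find(aa) for aa in seq.upper()]
    idx = [i if i >= 0 else 20 for i in idx]    # unknowns go to the last slot
    onehot = np.zeros((len(seq), 21))
    onehot[np.arange(len(seq)), idx] = 1.0
    return onehot

x = one_hot_protein("MKVX")  # 'X' falls into the unknown bucket
print(x.shape)  # (4, 21)
```

Unlike the ESM-2 features, this representation carries no learned context, which is why it serves as a baseline.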
Table 1 shows the test F1 score, accuracy, AUC, precision, and recall values for the CNN and LSTM models with the four different types of input features. The results indicate a clear superiority in the classification performance of both LSTM and CNN models when leveraging extracted latent ESM-2 features, compared to using only protein sequences or structural features. The CNN network shows consistent test results across different versions of ESM-2 features, with larger ESM models yielding improved results when combined with SF. The best performance for the CNN model was observed when using a combination of ESM150m and structural features. However, our LSTM models show a decline in performance as the feature size of the ESM model increases, indicating a tendency towards over-fitting. The LSTM model demonstrates its best performance when utilizing the features from the smaller ESM8m model and SF.
The hyper-parameter search process determines the number of kernels for the CNN and the hidden size of our LSTM. Our largest CNN model comprises 372.2K parameters, while the largest LSTM model has approximately 200K more parameters but a 10% lower F1 score. The CNN model remains significantly lighter than the LSTM model even when their performances are comparable. For example, when both models are fed the ESM35m features, the CNN model achieves the same performance as the LSTM model while utilizing roughly half the number of parameters. Furthermore, the integration of SF with ESM features results in a slight performance improvement, elevating the accuracy and F1 score of our runner-up model from 0.93 to 0.95.
To gain further insights into our best-performing models, we conducted an in-depth analysis of each model on the different CRISPR-Cas system sub-types within the test set. The detailed results can be found in Table 2. It is essential to acknowledge that an accuracy of 1 for sub-types I-C and I-F does not necessarily indicate flawless performance of our models on these sub-types; this high accuracy is attained due to the limited size of our test set. Moreover, it is evident that type I-E poses the greatest prediction challenge for our models.

DISCUSSION AND CONCLUSION
The accurate in silico identification of Acrs that inhibit CRISPR-Cas systems will enhance our ability to use phages to combat antimicrobial resistance. In this study, we developed a deep learning model, AcrTransAct, to computationally predict the probability of CRISPR-Cas inhibition by any given Acr candidate using a dataset of 227 samples. We explore four different feature combinations for training our network: solely using structural features, using amino acids with one-hot encoding, utilizing only ESM-2 features, and combining ESM-2 features with structural features.
In our investigation, we performed a thorough comparison between the CNN and LSTM models as classifiers. Both models demonstrated remarkably similar performance, with the CNN model showing marginally better results while having fewer parameters. Moreover, both models achieved their peak performance when trained with a combination of ESM-2 extracted features and structural features.
Previous studies [28] have shown that pre-trained protein transformer models, such as those trained on UniRef50 with functional annotations, can implicitly capture structural features. Adding structural features to the ESM-2 latent features yields only a slight improvement for CNN models, and for LSTM models it causes a decline in performance when using features from larger ESM models. This could be due to increased feature redundancy or potential inaccuracies in NetSurfP-3.0 predictions, which may introduce unwanted information.
Moreover, despite the class imbalance in our dataset, the average accuracy and F1 score are nearly equal for all models, indicating the absence of significant bias towards any specific class. Overall, the results demonstrate the effectiveness of our method, even with a small dataset. The utilization of extracted features and the careful selection of model architectures enable accurate predictions. These findings may have promising implications for various biotechnological research and therapeutic development applications.
Our work can be expanded to include other types of CRISPR-Cas systems. This can be accomplished by incorporating additional positive and negative experiments specifically related to these types. Also, it is possible to fine-tune the ESM-2 model on our data and update the weights of the pre-trained ESM models. Additionally, we can explore the application of semi-supervised learning techniques to train the network using unlabeled data. This will allow us to leverage the current trained network to make predictions about the inhibitory capabilities of Acrs that have not been experimentally validated. Subsequently, the model can be trained further using these predictions and evaluated on a set of validated experiments.

Figure 1 :
Figure 1: The figure illustrates the construction of a language model using stacked encoder or decoder transformer modules for sequence-level or token-level tasks.

Figure 2 :
Figure 2: A visual representation of CRISPR-Cas system distribution in our dataset, categorized by bacterial species. Positive experiments (inhibition = 1) show successful Acr inhibition, while negative experiments (inhibition = 0) represent unsuccessful inhibition.

Figure 3 :
Figure 3: The architecture of the AcrTransAct model. The sequence features are extracted in the first stage and passed to the second stage, where the structural and ESM features are separately processed by CONV or LSTM layers and later concatenated and fed to FC layers. The classification network generates the probability of the input Acr sequence inhibiting the input CRISPR-Cas system.

Table 1 :
Comparison of performance for CNN and LSTM models using different input feature types.

Table 2 :
Accuracy of our top 3 models on each of the CRISPR-Cas sub-types in the test set.