Single-Cell Multimodal Prediction via Transformers

The recent development of multimodal single-cell technologies has made it possible to acquire multiple omics measurements from individual cells, thereby enabling a deeper understanding of cellular states and dynamics. Nevertheless, the proliferation of multimodal single-cell data also introduces tremendous challenges in modeling the complex interactions among different modalities. Recently advanced methods focus on constructing static interaction graphs and applying graph neural networks (GNNs) to learn from multimodal data. However, such static graphs can be suboptimal as they do not take advantage of downstream task information; meanwhile, GNNs also have inherent limitations when GNN layers are deeply stacked. To tackle these issues, in this work we investigate how to leverage transformers for multimodal single-cell data in an end-to-end manner while exploiting downstream task information. In particular, we propose the scMoFormer framework, which can readily incorporate external domain knowledge and model the interactions both within and across modalities. Extensive experiments demonstrate that scMoFormer achieves superior performance on various benchmark datasets. Remarkably, scMoFormer won a Kaggle silver medal with a rank of 24/1221 (Top 2%) without ensembling in a NeurIPS 2022 competition. Our implementation is publicly available at GitHub.


INTRODUCTION
Advancements in multimodal single-cell technologies provide the capability to simultaneously profile multiple data types in the same cell, including chromatin accessibility [3,6], DNA methylation [14], and nucleosome occupancy [28]. These technologies offer an exciting opportunity to characterize cell identity and state at an unprecedented resolution, enabling a better understanding of gene regulatory networks in multicellular organisms and tissues [43]. Despite the rapid accumulation of multimodal single-cell data, the analysis of such data still faces numerous challenges. First, single-cell measurements often exhibit high sparsity and noise levels, making it difficult to draw meaningful insights from the data [13]. Furthermore, samples are often measured under different conditions, including batches, times, locations, or instruments, which leads to systematic variations in the measured values and further complicates the interpretation of single-cell data. These imperfections can result in biased estimates of cell-cell interactions and pose tremendous challenges for computational models to exploit such interactions.
To effectively capture the intricate interactions within cells and genes, current research focuses on constructing static graphs based on heuristic criteria and then employing graph neural networks (GNNs) [17,26,35] to extract information from the built graph. However, the quality of the built graphs is contingent upon the selected heuristic similarity measure, and the construction does not incorporate downstream task information. An alternative way of building the interaction graph is to construct it from domain knowledge, e.g., by utilizing publicly accessible databases [37]. However, similar to k-NN graphs, such a graph construction process does not leverage downstream task information, and the knowledge base may not include all relevant genes/proteins. Furthermore, GNNs have inherent limitations that hinder their success in applications: the over-smoothing [20] and over-squashing [1] issues, where GNNs produce poor results when GNN layers are deeply stacked. Hence, one question naturally arises: can we have a better approach to construct interaction graphs (among cells, genes, and proteins) that utilizes downstream information while avoiding the aforementioned issues?
In light of the recent advances of transformers [9,18,23,24] in capturing pairwise relations among objects, we seek to utilize transformers for learning the interaction graph of cells, genes, and proteins in an end-to-end manner. Transformers are well-suited to address the limitations of static graphs: they learn the interactions between objects through the self-attention mechanism, where all objects attend to each other with learnable attention scores indicating their interaction strength. Thus, the attention matrix provides an advanced approach to characterizing the interactions between objects in a data-driven way and has demonstrated success in reducing unwanted variance and noise across batches [40]. However, traditional transformers do not account for available graph structure and are therefore unable to leverage prior information present in graph data, such as biological knowledge graphs. In this context, graph transformers [10,29,42] offer a solution by combining the strengths of GNNs and transformers to make use of graph data. These approaches allow for the incorporation of prior insights from structural information learned by GNNs, while still allowing data-specific interactions to be learned through the attention mechanism. On top of that, graph transformers also alleviate the over-smoothing and over-squashing problems of GNNs by enabling individual objects to attend to unconnected objects [5,11,20,30]. Therefore, it is of great importance to investigate the potential of (graph) transformers in single-cell analysis.
In this work, we aim to design a transformer framework for multimodal single-cell data. In essence, to utilize the strengths of (graph) transformers, we are faced with two non-trivial challenges. First, since multimodal data contains diverse information from various sources, e.g., genes, proteins, and cells, it can be difficult for a single transformer to capture all aspects. Second, traditional transformers suffer from a quadratic computational complexity w.r.t. the number of objects, which poses a challenge for single-cell analysis where the number of cells can be large. To address the first challenge, we introduce the Single-Cell Multimodal Transformer (scMoFormer), which employs multiple transformers to model the multimodal data, allowing each transformer to deal with a specific data modality. The core of scMoFormer is the cross-modality aggregation component, which builds a bridge between these transformers and aggregates the necessary information from the individual ones. For the second challenge, scMoFormer applies linearized transformers [7] to the cells, which greatly reduces the computational complexity. To the best of our knowledge, we are the first to employ transformers to advance the analysis of multimodal single-cell data. Our proposed framework achieves promising results on the benchmark datasets, providing a very strong baseline for follow-up research. Our contributions can be summarized as follows:
• We study the problem of multimodal single-cell data analysis and propose a transformer framework, scMoFormer, to capture the intricate relations within and between modalities.

RELATED WORK
In this section, we review works related to our proposed framework, including GNNs, transformers, and other deep learning methods in single-cell analysis.

Deep Learning on Multimodal Integration
There is a growing number of deep learning-based methods for multimodal single-cell analysis in the community. For instance, scMDC [22] is an end-to-end autoencoder-based model with one encoder and two decoders. The encoder takes the concatenation of two modalities as input and then reconstructs the two modalities separately via two individual decoders. DCCA [44] learns a coordinated but distinct representation for each omics data by mutually supervising each other on the basis of semantic similarity across embeddings, and then reconstructs back to the original dimension as output via a decoder for each omics data. Cross-modal Autoencoders [41] utilize multiple autoencoders to map different modalities onto the same latent space, and incorporate prior knowledge through the use of adversarial loss and paired anchor loss in the training process. BABEL [39] consists of two neural-network-based encoders and two decoders for translation between gene expression and chromatin accessibility. Both Cross-modal Autoencoders and BABEL focus on multimodal translation by adding interoperability constraints to train multiple encoders and decoders. Another approach, scMM [27], captures nonlinear latent structures with variational autoencoders. It exploits a mixture-of-experts framework with a deep generative model and attains end-to-end learning by modeling the raw counts of each modality. While these models have made significant advancements in multimodal integration, most of them are based on autoencoders and tend to overlook the underlying biological interactions of molecules and cells.

GNNs and Transformers in Single-Cell
To capture the biological interactions of molecules and cells, an increasing number of GNN and transformer frameworks have been published in the field of single-cell analysis. One benefit of transformers applied to single-cell data is capturing long-range dependencies in a global view. Another benefit is interpreting biological phenomena via the attention mechanism in transformers. From the GNNs' perspective, graphs are a natural way to represent many kinds of single-cell data, such as gene-to-gene graphs, cell-to-cell graphs, and cell-to-gene graphs. Another benefit of GNNs is the ease of adding prior knowledge into graphs, such as pathways between genes or overlaps between genes and peaks. For example, scGNN [36] models cell-cell interaction by incorporating a GNN with multi-modal autoencoders. Specifically, scGNN builds a cell graph by capturing cell-type-specific regulatory signals and utilizes a Left-Truncated Mixture Gaussian model for scRNA-seq data. GLUE [4] pre-trains modality-specific variational autoencoders to get cell embeddings and then encodes a knowledge-based graph with GNNs. The next step involves performing an adversarial multimodal alignment of the cells through an iterative optimization process. In addition, ScMoGNN [37] models the cell similarity and feature similarity by building a cell-feature graph and extracts information from the data with a graph encoder. ScMoGNN takes advantage of gene pathway data as prior knowledge to enhance the graph and denoise the data. Moreover, scBERT [40] follows the pre-training and fine-tuning paradigm of bidirectional encoder representations from transformers (BERT) for cell annotation of scRNA-seq data. The process of annotation involves extracting high-level patterns of cell types from the reference dataset. Different from these approaches, which focus on single-modality data, we are the first to introduce transformers and GNNs to single-cell multimodal prediction.

PROBLEM STATEMENT
Before we state the problem, we first introduce the notations used in the following sections. For clarity and simplicity, we use the subscripts "g", "p", and "c" for gene, protein, and cell, respectively.
For instance, we use h_g, h_p, h_c to denote the embeddings of genes, proteins, and cells, respectively.
In this work, we follow the problem setting of the NeurIPS 2022 competition, i.e., Multimodal Single-Cell Integration Across Time, Individuals, and Batches. The goal of this competition is to predict a paired modality from a given modality and to infer how DNA, RNA, and protein measurements co-vary in single cells. Specifically, we focus on using gene expression (RNA) to predict surface protein levels. We denote X_g and X_p as the measurement counts of genes and proteins, respectively. With X_g, we try to learn a mapping function that best describes the relationship between the two modalities. We denote L(·, ·) as the objective function that measures the dissimilarity between the predicted and the true protein levels. Formally, we describe our target as an optimization problem: given X_g and the objective function L(·, ·), we aim to find a mapping function f*_θ (parameterized by θ) that minimizes the objective loss:

f*_θ = argmin_θ L(f_θ(X_g), X_p).

In the subsequent sections, we formulate the mapping f*_θ using transformers and GNNs and employ Root Mean Square Error, Mean Absolute Error, and the Pearson correlation coefficient as evaluation metrics for our predictions.

OUR MULTIMODAL TRANSFORMER FRAMEWORK
We now introduce the proposed multimodal framework scMoFormer. As shown in Figure 1, it consists of multimodal graph construction, a multimodal transformer, and a prediction layer. In brief, we first construct a heterogeneous graph that contains cell, gene, and protein nodes together with their interactions. We then utilize multiple (graph) transformers to extract rich cell representations and predict each cell's surface protein abundance levels.

Multimodal Graph Construction
In this subsection, we introduce how we construct the graph for multimodal single-cell data. Specifically, we construct a heterogeneous graph containing three different types of nodes. It has four subgraphs as shown in Figure 1, i.e., a protein-protein graph, a gene-gene graph, a gene-protein graph, and a cell-gene graph. Next, we detail how to construct these subgraphs.
Protein-Protein Graph. For the protein-protein graph, we refer to the STRING [33] database. STRING provides a comprehensive resource of protein-protein interactions and the functional relationships between different proteins. It contains seven different channels covering varied sources, including genomic context, experimental results, and text-mining efforts. We use the combined confidence score of all channels as edge weights to comprehensively enhance the graph. The proteins in STRING are labeled with Ensembl [8] protein IDs (ENSP), and the mapping from ENSP to the preferred protein name is provided as an additional information resource. Note that one protein may regularly have multiple aliases. To match the protein display names with ENSP, we utilize GeneCards [31] to find all possible aliases of our target proteins. In a nutshell, the mapping from STRING nodes to our target proteins is given by ENSP → possible aliases of proteins → the display names of proteins within the dataset.
Gene-Gene Graph. To maintain consistency and eliminate potential noise from multiple sources of prior information, we also utilize the STRING database to construct the gene-gene graph and combine all seven channels of the STRING database to enhance the graph as much as possible. Note that although the gene and protein nodes are biologically equivalent in the sense of prior information, we process them separately to more specifically handle data-specific information related to the target labels. The genes are labeled with Ensembl gene IDs (ENSG), and additional matching efforts are needed to form the gene-gene graph. We utilize the MyGene.info [38] gene query service to align the STRING protein nodes, encoded by ENSP, with the input gene nodes, encoded by ENSG.
Gene-Protein Graph. Now we have two separate graphs among proteins and genes, respectively, and we would like to form a general frame by adding the connections between genes and proteins. Following the central dogma of molecular biology, information flows from RNA to proteins via translation. This encoding relationship is recorded within the gene and protein nomenclature. Specifically, the gene names in existing single-cell multimodal datasets contain two parts: the ENSG and the symbol or name of the corresponding protein. That is, if a gene and a protein share the same symbol, the target protein is encoded by the gene. Utilizing this biological information, we link the proteins to genes by matching their symbols.
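As an illustration of the symbol-matching step described above, the following sketch links genes to proteins by the shared symbol in the gene name. The "ENSG..._SYMBOL" naming convention follows the description in the text; the helper function itself is hypothetical, not part of the released implementation.

```python
# Hypothetical sketch: link proteins to genes by matching shared symbols.
# Gene names are assumed to follow the "ENSG..._SYMBOL" convention.

def build_gene_protein_edges(gene_names, protein_names):
    """Return (gene_idx, protein_idx) pairs where the gene's symbol
    matches the protein's display name."""
    # Map each protein symbol to its index.
    protein_idx = {name: j for j, name in enumerate(protein_names)}
    edges = []
    for i, gene in enumerate(gene_names):
        # Split "ENSG00000204287_CD74" into the Ensembl ID and the symbol.
        parts = gene.split("_", 1)
        if len(parts) == 2 and parts[1] in protein_idx:
            edges.append((i, protein_idx[parts[1]]))
    return edges

edges = build_gene_protein_edges(
    ["ENSG00000204287_CD74", "ENSG00000010610_CD4", "ENSG00000123456_FOO"],
    ["CD74", "CD4"],
)
```

Genes whose symbol has no matching protein (here the hypothetical "FOO") simply contribute no edge.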
Cell-Gene Graph. The constructed gene-protein graph is fully based on prior knowledge, and we integrate data-based information into the multimodal heterogeneous graph by involving cell nodes.
The gene expression counts of multimodal data imply the data-specific relationships between genes and cells. Naturally, a cell and a gene are connected if the gene is expressed within that cell. Note that the number of genes detected within each cell varies significantly, which indicates that the raw counts of RNA abundance show substantial heterogeneity among cells. In addition, the raw data includes extremely large counts, making it impractical to use count data directly as the edge weights between cells and genes. Thus, we normalize each cell by the total counts over all genes and take the logarithm. The processed data is then fed into the multimodal graph to describe the cell-gene links.
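The preprocessing described above (per-cell library-size normalization followed by a log transform) can be sketched as follows; the `target_sum` scaling constant is our assumption (a common default in single-cell pipelines), not specified in the text.

```python
import numpy as np

# Minimal sketch: normalize each cell by its total counts over all genes,
# then take log(1 + x). target_sum is an assumed rescaling constant.
def normalize_counts(X, target_sum=1e4):
    """X: (n_cells, n_genes) raw count matrix -> log-normalized edge weights."""
    lib = X.sum(axis=1, keepdims=True)   # total counts per cell (library size)
    lib[lib == 0] = 1.0                  # guard against empty cells
    return np.log1p(X / lib * target_sum)

X = np.array([[10.0, 0.0, 90.0],
              [1.0, 1.0, 2.0]])
W = normalize_counts(X)                  # weights for the cell-gene links
```

Zero counts stay exactly zero after `log1p`, so absent gene-cell links remain absent.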
Remark. It is worth noting that we do not construct a cell-cell graph. Instead, we use the transformer to learn cell-cell relationships via the attention mechanism. This is in contrast with some previous studies that utilize a static cell-cell similarity graph generated from the input features, which might be prone to inaccurate cell relationships due to the noisy nature of single-cell data. On the other hand, we do not explicitly establish connections between the cells and the target protein nodes, as this might cause the model to easily overfit. More discussions are included in Appendix 5.7.1.

Heterogeneous Graph Construction.
Formal Graph Definition.We now formally define the multimodal heterogeneous graph.
Let A be the adjacency matrix of the multimodal heterogeneous graph with V = {V_p, V_g, V_c} as the node set, where n_p, n_g, n_c are the numbers of proteins, genes, and cells, respectively. Denote E ⊆ V × V as the edge set; we write the multimodal heterogeneous graph as G = (V, E). Let n = n_p + n_g + n_c be the total number of nodes; the adjacency matrix A can be written as follows:

A = [ A_p    Sᵀ      0
      S      A_g     A_RNAᵀ
      0      A_RNA   0      ],

where A_p ∈ R^{n_p×n_p} is the protein-protein interaction graph; A_g ∈ R^{n_g×n_g} denotes the gene-gene graph mapped from STRING via MyGene.info [21]; S ∈ R^{n_g×n_p} is the encoding relationship between genes and proteins; and A_RNA ∈ R^{n_c×n_g} represents the gene expression.
Node Feature Initialization. With the graph structure, we next discuss how to initialize the node features. Since the RNA counts X_g show drastic sparsity along with high dimensionality, they are not practical to use directly as cell features. Hence, we first denoise the data and reduce the dimension by Singular Value Decomposition (SVD). To alleviate the effects of cell-to-cell heterogeneity and extremely large counts, we conduct library size normalization and centered log-transformation. The preprocessed data X̃_g is then passed through the SVD algorithm. Cell features are initialized with the reduced features h⁰_c ∈ R^{n_c×d_0}. We then initialize gene features as the weighted sum of the reduced features, h⁰_g = X̃_gᵀ · h⁰_c ∈ R^{n_g×d_0}, with the normalized counts X̃_g as the weights. In the studied problem, proteins are the target modality for prediction; thus their features are initialized randomly based on their indices.
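A minimal sketch of assembling the block adjacency matrix and initializing node features under the block layout defined above; the random matrices stand in for the real STRING and expression data, and the (protein, gene, cell) block ordering is our assumption.

```python
import numpy as np

# Toy sizes: n_p proteins, n_g genes, n_c cells, d0 reduced feature dimension.
n_p, n_g, n_c, d0 = 3, 5, 4, 2
rng = np.random.default_rng(0)
A_p = rng.random((n_p, n_p))      # protein-protein block (stand-in for STRING)
A_g = rng.random((n_g, n_g))      # gene-gene block (stand-in for STRING)
S = rng.random((n_g, n_p))        # gene-protein encoding links
A_rna = rng.random((n_c, n_g))    # normalized gene expression (cell-gene)

# Assemble the symmetric block adjacency in (protein, gene, cell) order.
n = n_p + n_g + n_c
A = np.zeros((n, n))
A[:n_p, :n_p] = A_p
A[n_p:n_p + n_g, n_p:n_p + n_g] = A_g
A[n_p:n_p + n_g, :n_p] = S                 # gene rows, protein cols
A[:n_p, n_p:n_p + n_g] = S.T               # protein rows, gene cols
A[n_p + n_g:, n_p:n_p + n_g] = A_rna       # cell rows, gene cols
A[n_p:n_p + n_g, n_p + n_g:] = A_rna.T     # gene rows, cell cols

# Cell features: truncated SVD of the preprocessed expression matrix.
U, s, Vt = np.linalg.svd(A_rna, full_matrices=False)
h_c = U[:, :d0] * s[:d0]                   # (n_c, d0) reduced cell features
# Gene features: expression-weighted sum of the reduced cell features.
h_g = A_rna.T @ h_c                        # (n_g, d0)
```

Protein features are left out of the sketch since, per the text, they are initialized randomly from their indices.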

Multimodal Transformers
To effectively handle the heterogeneity in the heterogeneous graph, we introduce how scMoFormer employs multiple transformers to process each individual modality and utilizes a cross-modality aggregation mechanism to combine the information.

Cell Transformer. The Transformer [34] has made significant achievements in the field of Natural Language Processing (NLP) in recent years. The attention mechanism can capture high-order and non-Euclidean connections between nodes, which is desirable for exploring the cell-cell relationships within single-cell multimodal data. We denote the queries, keys, and input cell embeddings as Q, K, h_c ∈ R^{n_c×d} with input dimension d. The scaled dot-product attention can be formulated as Attn(h_c) = softmax(QKᵀ/√d) h_c, where softmax(QKᵀ/√d) is the attention matrix of cell-cell interactions. Despite being effective in various NLP tasks, the original transformer has limitations in scalability due to its relatively high space complexity of O(n_c² + n_c d) and time complexity of O(n_c² d). This issue considerably limits the application of transformers in single-cell multimodal analysis. Since multimodal data typically includes tens of thousands of cells, it is not practical to apply the original transformer directly to cells.
To address the scalability issue, we employ generalized kernelizable attention [7] as a computationally efficient approximation of traditional attention. The attention blocks are kernelized in the form A(i, j) = K(q_iᵀ, k_jᵀ), where q_i stands for the i-th row of the query Q and k_j denotes the j-th row of the key K. The kernel K : R^d × R^d → R_+ is specified as K(x, y) = E[φ(x)ᵀ φ(y)], where φ : R^d → R^r_+ denotes the feature mapping. Let Q′, K′ ∈ R^{n_c×r} be the approximate query and key with rows given as φ(q_iᵀ)ᵀ and φ(k_iᵀ)ᵀ, respectively; the kernel approximation of cell attention is formulated as

Attn̂(h_c) = D̂⁻¹ ( Q′ ((K′)ᵀ h_c) ),   D̂ = diag( Q′ ((K′)ᵀ 1_{n_c}) ).

With a kernel of dimension r, the space complexity and time complexity are reduced to O(n_c r + n_c d + rd) and O(n_c r d), respectively. The generalized kernelizable attention boosts our cell transformer to linear space and time complexity, while still delivering results comparable to regular transformers [7]. The attention mechanism in the model captures the intricate relationships between cells, resulting in an improved representation of the individual cells.
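A simplified sketch of kernelizable (linear) attention. For illustration we use an elementwise exponential feature map rather than the random-feature map of [7]; the linear-time factorization φ(Q)((φ(K))ᵀ h_c) with the diagonal normalizer is the same.

```python
import numpy as np

# Linear attention sketch: instead of materializing the n x n attention
# matrix, aggregate phi(K)^T V once (d x d) and normalize per row.
def linear_attention(Q, K, V):
    """Approximate attention in O(n) memory: phi(Q)(phi(K)^T V) / normalizer."""
    Qp = np.exp(Q - Q.max())        # positive features phi(q_i); global shift
    Kp = np.exp(K - K.max())        # cancels in the ratio below
    KV = Kp.T @ V                   # (d, d_v): aggregated key-value summary
    Z = Qp @ Kp.sum(axis=0)         # (n,): per-row normalizers
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(1)
n_cells, d = 6, 4
Q, K, V = rng.normal(size=(3, n_cells, d))
out = linear_attention(Q, K, V)
```

As a sanity check, the implied attention weights sum to one per row, so attending to a constant value matrix returns that constant.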

Gene Transformer and Protein Transformer.
To leverage biological insights, we include STRING [33] as an addition to provide local information on genes and proteins. Although STRING provides solid prior networks, there may still exist data-related concerns when it is applied to sequencing data. For instance, the NeurIPS 2022 competition dataset (https://www.kaggle.com/competitions/open-problems-multimodal/) contains 22,050 genes, but only 13,101 of these genes have interactions recorded in the STRING database. This issue also occurs for proteins, as the NeurIPS 2022 competition dataset collects 140 proteins and only 120 of them are found within the STRING [33] networks. This means that information about interactions among the remaining molecules is not available unless additional global information is included. To address these concerns, we utilize graph transformers to encode both the local and global information about genes and proteins. Specifically, following GraphGPS [29], we adapt graph transformers for the gene-gene and protein-protein graphs separately, i.e., a gene transformer and a protein transformer. Since the gene transformer and protein transformer have the same architecture, we will use the gene transformer to illustrate the details.
A gene transformer layer consists of two parallel components: a message-passing GNN block and a global attention block. The GNN block extracts gene interaction information from the prior local network, while the attention block learns global data-specific relationships by allowing each gene to attend to all other genes. The results from the two blocks are summed together and then processed by fully connected layers to update the gene embeddings. Recall that the adjacency matrix of the gene-gene graph is denoted as A_g ∈ R^{n_g×n_g}. Let h^ℓ_g ∈ R^{n_g×d_ℓ} with dimension d_ℓ be the gene embedding of the ℓ-th layer; the update functions are as follows:

h^{ℓ+1}_g = MLP^ℓ( GNN^ℓ(h^ℓ_g, A_g) + Attn^ℓ(h^ℓ_g) ),

where GNN^ℓ and Attn^ℓ represent the GNN block and the global attention mechanism, and MLP^ℓ is a 2-layer MLP block. The GNN block brings in prior biological insights, while the attention block resolves the expressivity bottlenecks caused by over-smoothing [20] and over-squashing [1].
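The two-branch layer above can be sketched structurally as follows; the weight matrices are untrained placeholders, the GNN branch is a simple mean aggregation, and the MLP is one layer with ReLU, so this is a shape-level illustration rather than the exact block used in the paper.

```python
import numpy as np

# Structural sketch of a GraphGPS-style layer: a local message-passing
# branch and a global attention branch run in parallel, their outputs
# are summed, then an MLP updates the embeddings.
def gene_transformer_layer(h, A_g, W_gnn, W_mlp):
    # Local branch: one step of mean aggregation over the prior graph A_g.
    deg = A_g.sum(axis=1, keepdims=True) + 1e-8
    local = (A_g @ h) / deg @ W_gnn
    # Global branch: full self-attention over all genes.
    scores = h @ h.T / np.sqrt(h.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    global_ = attn @ h
    # Sum the branches, then update (here: one ReLU layer as a stand-in MLP).
    return np.maximum(local + global_, 0.0) @ W_mlp

rng = np.random.default_rng(2)
n_g, d = 5, 4
h = rng.normal(size=(n_g, d))
A_g = (rng.random((n_g, n_g)) > 0.5).astype(float)
out = gene_transformer_layer(h, A_g, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```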
Positional encoding (PE) is applied to the gene transformer layer to provide another solution to the expressivity bottlenecks in gene-gene graphs. Message-passing GNNs update gene node embeddings by aggregating local neighborhood representations given by gene knowledge graphs. By incorporating additional positional information, PE helps to differentiate nodes that have the same local surroundings but distinct positions. Accompanied by prior gene-gene interaction networks, PE distinguishes genes by their absolute positions within the STRING network. This is important from a biological perspective as each gene functions differently. Here, we implement Laplacian PE [10] and random walk PE [12]. The Laplacian PE captures the spectral information of the graph Laplacian by its eigenvectors. Denote the graph Laplacian of the input gene graph as L_g; the matrix factorization of L_g is formulated as

L_g = I − D_g^{−1/2} A_g D_g^{−1/2} = U_g Λ_g U_gᵀ,

where D_g is the degree matrix of the gene graph, and Λ_g and U_g correspond to the eigenvalues and eigenvectors, respectively. The gene node Laplacian PE of dimension k is defined as the k smallest non-trivial eigenvectors. While the Laplacian PE embeds the positional information from the graph Laplacian, the random walk PE tends to grasp the positional information given by the graph clusters. The random walk PE of dimension k is defined with k steps of the random walk,

PE_i = [ (RW)_{ii}, (RW²)_{ii}, …, (RWᵏ)_{ii} ],

where RW = A_g D_g⁻¹ is the random walk operator. The term (RWᵗ)_{ii} represents the landing probability of gene node i returning to itself after t steps. The processed PE is then combined with the gene features through a fully connected layer.
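A minimal sketch of the k-step random-walk PE as defined above, recording each node's return probability (RWᵗ)_ii for t = 1..k.

```python
import numpy as np

# Random-walk PE: for each node, record its self-return probability
# after t = 1..k steps of the walk RW = A D^{-1}.
def random_walk_pe(A, k):
    deg = A.sum(axis=0)
    deg[deg == 0] = 1.0            # avoid division by zero for isolated nodes
    RW = A / deg                   # column-normalized random walk operator
    pe, P = [], np.eye(A.shape[0])
    for _ in range(k):
        P = RW @ P                 # P = RW^t
        pe.append(np.diag(P))      # landing probability back to itself
    return np.stack(pe, axis=1)    # (n_nodes, k)

# Tiny star graph: node 0 connected to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
pe = random_walk_pe(A, k=3)
```

On this star graph the center node has return probability 0 after one step and 1 after two, which the PE captures directly.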

4.2.3 Cross-Modality Aggregation. Transformers are constructed separately for each modality. To build a bridge among those transformers, we implement message-passing GNNs along the links that connect nodes of distinct types. The information from the cell transformer denoises the prior knowledge by adding data-specific details into the gene and protein embeddings. Meanwhile, information from the gene transformer and protein transformer brings biological insights into the cell transformer to predict the target proteins. In particular, we take advantage of GraphSAGE [15] to transfer the information. Denote the i-th destination node as v_i and the j-th source node as u_j. The information from the source nodes to the destination nodes is updated by:

h^ℓ_{N(v_i)} = Aggregate( { h^ℓ_{u_j}, ∀ u_j ∈ N(v_i) } ),
h^{ℓ+1}_{v_i} = Normalize( Update( h^ℓ_{v_i}, h^ℓ_{N(v_i)} ) ).

For node v_i, the message-passing GNN aggregates information from its neighbors through an aggregator function (Aggregate). The neighborhood information h^ℓ_{N(v_i)} is then combined with the embedding h^ℓ_{v_i} and processed by an updating procedure. The newly generated embeddings are normalized before the next iteration. The GNN modules facilitate communication between the transformers, enabling the transformers to leverage various forms of information during the training process.
Now we summarize the workflow of scMoFormer. As shown in Figure 1, we apply transformers within each type of node and utilize message-passing GNNs to form the bridges between transformers. Formally, for the ℓ-th layer, denote the cell transformer as Trans^ℓ_c, and the gene and protein graph transformers as GT^ℓ_g and GT^ℓ_p, respectively. Let MPG^ℓ_{g→p} and MPG^ℓ_{p→g} be the message-passing GNN modules between genes and proteins, and MPG^ℓ_{g→c} and MPG^ℓ_{c→g} those for the links between genes and cells. We process the cell embeddings with an MLP as cell readouts, where FC^ℓ represents the ℓ-th fully connected layer. The node embeddings are updated by combining each modality's transformer output with the incoming cross-modality messages, where S, A_p, A_g, and A_RNA are the adjacency blocks defined in Section 4.1.
Prediction Layer. Let the number of layers be L; the predictions are
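The GraphSAGE-style aggregation step can be sketched on a bipartite cell-gene block as follows; the mean aggregator, the concatenation-based update, and the placeholder weight W are illustrative assumptions consistent with the update rules above.

```python
import numpy as np

# Cross-modality message passing sketch: each destination node aggregates
# its neighbors' embeddings (mean), concatenates them with its own
# embedding, applies a linear update, and is L2-normalized.
def sage_aggregate(h_dst, h_src, A, W):
    """A: (n_dst, n_src) bipartite adjacency, e.g. the cell-gene links."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    h_nbr = (A @ h_src) / deg                          # mean over source neighbors
    h_new = np.concatenate([h_dst, h_nbr], axis=1) @ W # combine self + neighborhood
    norm = np.linalg.norm(h_new, axis=1, keepdims=True)
    return h_new / np.maximum(norm, 1e-8)              # normalize before next layer

rng = np.random.default_rng(3)
n_c, n_g, d = 4, 6, 3
h_c = rng.normal(size=(n_c, d))                        # cell embeddings
h_g = rng.normal(size=(n_g, d))                        # gene embeddings
A_cg = (rng.random((n_c, n_g)) > 0.5).astype(float)    # cell-gene links
h_c_new = sage_aggregate(h_c, h_g, A_cg, rng.normal(size=(2 * d, d)))
```

The same function applied with the transposed adjacency would send messages in the other direction (cells to genes), mirroring the paired MPG modules in the text.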
given by one extra fully connected layer FC^{L+1}:

X̂_p = FC^{L+1}(h^L_c),

where X̂_p ∈ R^{n_c×n_p} is the final prediction. To optimize the framework, we adopt a Mean Square Error (MSE) loss to measure the difference between the predictions and the ground-truth values:

L = (1 / (n_c · n_p)) ‖X̂_p − X_p‖²_F.

EXPERIMENT
In this section, we present the experimental results of scMoFormer against baselines on benchmark datasets. In particular, we aim to answer the following questions:
• RQ1: How does scMoFormer perform compared against baselines based on various evaluation metrics?
• RQ2: Given various choices of PEs, how do they affect the performance of scMoFormer?
• RQ3: What is the best way to fuse information across modalities?
• RQ4: Can scMoFormer handle other modality prediction tasks?
• RQ5: Can scMoFormer process large datasets?
• RQ6: How does each model component impact the performance of scMoFormer?
Before presenting our experimental results and observations, we first introduce the experimental settings.

Experimental Settings
5.1.1 Datasets. We follow the settings of the NeurIPS multimodal single-cell integration competitions of 2021 [25] and 2022 and collect the joint measurements of gene expression and surface protein levels from the competitions. Both datasets contain the raw counts, which represent the number of reads per gene per cell, as well as the normalized counts. For the NeurIPS 2021 competition, we pick the data corresponding to the task of protein abundance prediction from gene expression and refer to it as "GEX2ADT". The processed data is centered and log-transformed for denoising purposes. For the 2022 competition, which we refer to as "CITE", the objective is to utilize CITE-seq [32] data measured on days 2, 3, and 4 to predict the protein levels on day 7 for different individuals. It is worth mentioning that the protein-level testing data was not available during the completion of this work. Therefore, we simulate the competition scenario by treating the training data from day 4 as our testing set. The processed RNA data is centered and log-transformed, while the normalized protein levels are denoised and scaled by background [19]. We summarize the dataset details in Table 8 in Appendix A.

Baselines.
We evaluate the performance of scMoFormer against state-of-the-art multimodal prediction models on the task of using gene expression to predict surface protein levels. The selected baselines are as follows:
• Cross-modal Autoencoders [41], short for CMAE, incorporated multiple autoencoders to integrate multimodal data and utilized domain knowledge by adding a discriminative loss to the training process to align shared markers or clusters.
• BABEL [39] proposed a general framework for multimodal translation with modality-specific encoders and decoders. Note that BABEL initially focused on RNA and ATAC-seq [2] data. In this evaluation, we repurpose BABEL to the RNA-to-protein setting.
• scMM [27] modeled the multimodal data in a generative setting. We note that the input of scMM is restricted to raw counts, and the predictions are scaled as centered log-transformed data.
• ScMoGNN [37] involved domain knowledge, such as biological pathways, to enhance the GNNs. The original ScMoGNN followed a transductive setting. In this work, we implement an inductive setting of ScMoGNN for a fair comparison with the baselines.

Parameter Setting.
To benchmark the performance of the baselines and scMoFormer, we uniformly employ inductive settings on both datasets. On CITE, we use the data measured on day 4 for testing and randomly split the data prior to day 4 into 80%/20% for training and validation. On GEX2ADT, we randomly pick 15% of the training data for validation and evaluate the predictions on the testing set. For BABEL, the hidden dimension is tuned over {16, 32, 64, 128}. For CMAE, the weights of the adversarial loss and reconstruction loss are chosen from {0.1, 1, 2.5, 5, 10}. For scMM, the hidden dimensions are tuned over {16, 32, 64, 128}. For ScMoGNN, the weight decay parameter of the optimizer is tuned over {5 × 10⁻⁶, 1 × 10⁻⁵, 5 × 10⁻⁵, 1 × 10⁻⁴}.

Evaluation of Predictions
We evaluate the final protein-level prediction performance using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Meanwhile, because multimodal data usually suffers from batch effects and unbalanced measuring depth, the scale of the counts in each cell may vary significantly, which substantially affects the RMSE and MAE metrics. Therefore, we also include the Pearson correlation coefficient (Corr), which is cell-wise normalized by the mean and variance of the input, as a robust and scale-free metric to evaluate the predictions. A lower RMSE or MAE score indicates a geometrically closer estimation of the protein levels, while a higher Corr score suggests a statistically more similar match to the actual values. We report the mean and standard deviation of each metric across five different runs, and the results are illustrated in Table 1. The best performance is highlighted in bold.
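The three metrics can be sketched directly from their definitions; the per-cell mean/variance normalization inside the correlation follows the description above.

```python
import numpy as np

# RMSE and MAE over the full (n_cells, n_proteins) matrix, plus the
# cell-wise Pearson correlation (each cell standardized by its own
# mean and standard deviation before comparison).
def rmse(pred, true):
    return np.sqrt(np.mean((pred - true) ** 2))

def mae(pred, true):
    return np.mean(np.abs(pred - true))

def cellwise_corr(pred, true):
    p = (pred - pred.mean(axis=1, keepdims=True)) / pred.std(axis=1, keepdims=True)
    t = (true - true.mean(axis=1, keepdims=True)) / true.std(axis=1, keepdims=True)
    return np.mean(np.sum(p * t, axis=1) / pred.shape[1])

true = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
pred = np.array([[1.0, 2.0, 3.0], [3.0, 5.0, 7.0]])
```

Note that the second cell's prediction is shifted by a constant, which penalizes RMSE and MAE but leaves the scale-free correlation at 1, illustrating why Corr is reported alongside the error metrics.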
To answer the first question, we note that scMoFormer consistently outperforms all other baselines according to all three metrics on both datasets, indicating that scMoFormer successfully captures the quantitative characteristics of target protein levels given the input gene expression measurements. In particular, on CITE, scMoFormer achieved a significantly lower RMSE than the second-best model, ScMoGNN, by 0.04. More importantly, scMoFormer achieved a significant improvement in terms of the Pearson correlation metric over all other baselines, with a noticeably lower performance variation across runs.
We further analyze the performance of different models on the proteins that are least well captured. Specifically, we compute the RMSE for each protein separately and identify the ten proteins with the highest average RMSE across all models. As shown in Figure 2, scMoFormer and ScMoGNN achieve relatively stable results and are consistently better than BABEL and CMAE.

Table 2: The prediction RMSEs with different positional encodings (mean ± std over five runs).

                 CITE                GEX2ADT
Laplacian PE     1.63161 ± 0.01082   0.42025 ± 0.00243
Random Walk PE   1.63014 ± 0.01129   0.41987 ± 0.00234
w/o PE           1.62720 ± 0.00731   0.42202 ± 0.00399

For ease of notation, we use the abbreviation PE to refer to positional encodings on the knowledge graph. To benchmark the impact of the two types of PE on the two datasets and answer the second question, we compare the performance of scMoFormer with each PE against the scenario without any PE. The mean and standard deviation of the RMSE scores over five runs are shown in Table 2.
According to the results, the influence of PE varies between datasets. scMoFormer reaches the best RMSE score on CITE without PE, while both types of PE improve the performance on GEX2ADT. Notice that in Table 8, the RNA zero rate of CITE is significantly lower than that of GEX2ADT, providing the model with greater access to data-specific information. If the data contain sufficient information, the neighborhood information from the GNNs alone is adequate, and the extra prior knowledge from the PEs is unnecessary. This is further supported by the observation that the random walk PE performs better than the Laplacian PE on both datasets. The Laplacian PE models global information using the spectral information of the graph Laplacian, while the random walk PE encodes local information via the landing probabilities of a k-step random walk. When the prior knowledge is noisy for the downstream task, the local information alone is enough for prediction, and the global structure becomes redundant.
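The two PEs can be sketched on a dense adjacency matrix as follows. This is an illustrative implementation of the standard Laplacian and random-walk encodings; whether it matches the exact variants used in the paper is an assumption.

```python
import numpy as np

def random_walk_pe(A, k=4):
    """k-step random-walk PE: for each node i, the return probabilities
    [P^1]_ii, ..., [P^k]_ii with transition matrix P = D^{-1} A.
    Captures local structure around each node."""
    P = A / A.sum(axis=1, keepdims=True)
    pe, M = [], np.eye(len(A))
    for _ in range(k):
        M = M @ P
        pe.append(np.diag(M))
    return np.stack(pe, axis=1)          # shape (n_nodes, k)

def laplacian_pe(A, dim=2):
    """Laplacian PE: eigenvectors of the symmetric normalized Laplacian
    L = I - D^{-1/2} A D^{-1/2} for the smallest non-trivial
    eigenvalues. Captures global spectral structure."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L)          # ascending eigenvalues
    return vecs[:, 1:dim + 1]            # skip the trivial eigenvector
```

On a triangle graph, for instance, every node has return probability 0 after one step and 1/2 after two steps, which the random-walk PE records directly.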

Comparison of Different Fusion Strategies
Our framework is based on hybrid fusion. Layer-wise speaking, we refer to our fusion strategy as concurrent fusion, since node representations simultaneously go through GNNs and transformers. In the case of protein nodes, information from protein nodes and gene nodes is fused after being processed by the protein transformer and the gene-protein (gene to protein) GNNs, respectively. In the following experiment, we compare layer-wise concurrent fusion, GNN-first fusion, and mixed fusion. Again taking protein nodes as an example, we implement GNN-first fusion by first summing the protein embeddings and the gene-protein GNN outputs, and then passing the result through the protein transformer. For mixed fusion, we use concurrent fusion on protein and gene nodes while applying GNN-first fusion on cell nodes. We summarize the prediction RMSEs and standard deviations across five runs in Table 3.
Table 3: The prediction RMSEs for different fusion strategies.

          Concurrent          GNN-first           Mixed
CITE      1.62720 ± 0.00731   1.63054 ± 0.01018   1.63092 ± 0.00631
GEX2ADT   0.41987 ± 0.00234   0.42861 ± 0.00094   0.42948 ± 0.00158

As presented in the table, the original concurrent fusion strategy exhibits superior performance compared to the other two strategies. The concurrent setting allows each modality to exploit intra-modal information before combining with other modalities, which improves performance since each modality holds varying importance for the downstream task. However, the improvement is not significant on CITE. This observation is consistent with the findings of the other experiments: since the RNA zero rate of CITE is considerably lower than that of GEX2ADT, the importance of data-specific information (cell nodes) in CITE outweighs that of the other modalities, resulting in a relatively smaller performance boost.
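The difference between the two fusion orders can be illustrated schematically. Here `gnn` and `transformer` are hypothetical stand-ins (a linear map and a nonlinear update), chosen only to show that the two orders generally produce different outputs; they are not the actual modules.

```python
import numpy as np

def gnn(x):
    # Stand-in for cross-modal GNN aggregation (e.g., gene -> protein).
    return 0.5 * x

def transformer(x):
    # Stand-in for a nonlinear intra-modal transformer update.
    return x ** 2

def concurrent_fusion(x):
    # Both paths see the same input; their outputs are fused afterwards.
    return gnn(x) + transformer(x)

def gnn_first_fusion(x):
    # The GNN output is merged with the input first, then the
    # transformer runs on the fused representation.
    return transformer(x + gnn(x))
```

Because the transformer in the GNN-first variant only ever sees the already-fused representation, the intra-modal path never processes the raw modality on its own, which matches the explanation for concurrent fusion's advantage above.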

Handling Other Single-Cell Multimodal Prediction Tasks
In this work, we focus on the specific task GEX2ADT (gene expression to protein levels) as a showcase. However, our framework is versatile and can be applied to other modality prediction tasks, namely the other three tasks in the NeurIPS 2021 competition [25]: ADT2GEX (protein levels to gene expression), GEX2ATAC (gene expression to chromatin accessibility), and ATAC2GEX (chromatin accessibility to gene expression). ADT2GEX: In our framework, we construct a multimodal heterogeneous graph consisting of gene, protein, and cell nodes. For the GEX2ADT task, we removed the cell-protein edges to eliminate information leakage. Conversely, for the ADT2GEX task, we incorporate the protein measurements into the cell-protein edges while removing the cell-gene edges. Cell embeddings are initialized with the reduced protein levels, and protein embeddings are initialized with a weighted sum of the cell embeddings, where the weights are the normalized protein levels. GEX2ATAC and ATAC2GEX: For these two tasks, we remove the protein nodes from the multimodal graph. To initialize the cell embeddings for the GEX2ATAC task, we use the reduced gene expression values, while the gene embeddings are computed as a weighted sum of cell embeddings using the normalized gene measurements as weights. For the ATAC2GEX task, we mask the cell-gene edges in the testing set, initialize the cell embeddings with the reduced ATAC measurements, and initialize the gene embeddings randomly.
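The weighted-sum initialization of protein embeddings can be sketched as follows, assuming hypothetical dense arrays for the cell embeddings and the protein level matrix.

```python
import numpy as np

def init_protein_embeddings(cell_emb, protein_levels):
    """Sketch: initialize each protein embedding as a weighted sum of
    cell embeddings, with weights given by the column-normalized
    protein levels.
    cell_emb:       (n_cells, d) reduced per-cell features
    protein_levels: (n_cells, n_proteins) measured abundances
    returns:        (n_proteins, d) initial protein embeddings"""
    w = protein_levels / protein_levels.sum(axis=0, keepdims=True)
    return w.T @ cell_emb
```

Each protein thus starts as a convex combination of the cells that express it, so highly expressing cells dominate that protein's initial representation.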
We compare the performance of scMoFormer with ScMoGNN on the four tasks in Table 4. Note that scMoFormer outperforms ScMoGNN in the tasks involving the protein modality, while the opposite holds for the remaining tasks. These results suggest that incorporating prior information about protein nodes enhances the performance of scMoFormer when the protein modality is present. The RMSE scores across five runs are summarized in Table 4.

Training Efficiency of scMoFormer
Since ScMoGNN is the best-performing baseline, we compare the running time and total GPU memory of scMoFormer and ScMoGNN across five runs on one Quadro RTX 8000 GPU. The results are summarized in Table 5. We observe that, at the cost of higher GPU memory consumption, scMoFormer requires a significantly shorter running time than ScMoGNN. It is worth noting that scMoFormer is configured with a relatively large batch size of 8000 cells per batch and a hidden dimension of 512, in order to achieve better performance and better utilization of computing resources. One can reduce the required GPU memory of scMoFormer by scaling these settings down accordingly.
Can scMoFormer process large datasets? In practical applications, it is capable of processing large datasets. We address this question from the perspectives of both the data and the model:
• Data aspect: For technological and biological reasons, the numbers of RNA nodes and protein nodes are similar across datasets. Therefore, a larger dataset means more cells, which can be handled by mini-batching cells.
• Model aspect: In our multimodal transformer module, we employ linearized transformers [7] with linear space and time complexity, which can be conveniently adapted to large datasets. Concerning the GNNs, we incorporate GraphSAGE [15], whose space and time complexity scale with the number of edges and the hidden dimension. We can control the number of edges by mini-batching cells and make appropriate adjustments based on the available computational resources.
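The linear-complexity attention can be sketched with a kernelized formulation; whether this matches the exact variant of [7] is an assumption, but it illustrates why the n × n attention matrix never needs to be materialized.

```python
import numpy as np

def phi(x):
    # Positive feature map (elu(x) + 1), a common choice for
    # kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: compute phi(K)^T V (a d x d_v matrix) once,
    then multiply by phi(Q) per row, avoiding the n x n matrix."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v), independent of n
    Z = Qf @ Kf.sum(axis=0)        # (n,) per-row normalizer
    return (Qf @ KV) / Z[:, None]

def quadratic_check(Q, K, V):
    """Mathematically equivalent O(n^2) form, used for verification:
    rows of phi(Q) phi(K)^T normalized to sum to one, applied to V."""
    A = phi(Q) @ phi(K).T
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V
```

The two functions compute identical outputs; the linear form simply reassociates the matrix product so memory and time grow with the number of cells n rather than n².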

Ablation Study
Table 1 demonstrates that models incorporating domain knowledge perform better on modality prediction than those that do not. BABEL, ScMoGNN, and scMoFormer are the three models that make use of domain knowledge, and they show improved performance compared to the other two models. Among these three, ScMoGNN, which is based on GNNs, performs better than BABEL, while scMoFormer, with its combination of transformers and GNNs, outperforms all other models. Given that scMoFormer includes a multimodal heterogeneous graph and three transformers, this raises the questions: Why no cell-cell graph and cell-protein graph? Which transformer has the biggest impact on performance? How much do the transformers contribute to the improvement in performance?
5.7.1 Exclusion of the Cell-Cell Graph and Cell-Protein Graph. In the Remark of Section 4.1.1, we explained our decision to exclude the cell-cell and cell-protein graphs. Additionally, our empirical analysis shows a decrease in performance on both datasets when these links are incorporated into the heterogeneous graph. Compared to the original setting without either graph, the performance drops in both the w/ cell-protein graph and w/ cell-cell graph settings, with the former showing a more significant decline. The prediction RMSEs and standard deviations over five runs are summarized in Table 6.
Why the cell-cell graph did not help: In the above experiment, we constructed a k-NN cell-cell graph by measuring the similarity in gene expression between cells. However, since gene expression measurements often suffer from noise and sparsity, this static cell-cell graph may introduce biased information into the downstream task. Instead, we utilize a cell transformer module to learn dynamic cell-cell interactions via the multi-head attention mechanism, which leads to improved performance.
Why the cell-protein graph did not help: For the cell-protein links, we incorporated the target surface protein levels into the edge weights, similar to how we built the cell-gene graph. However, these links convey information about the prediction targets of the training set, causing the model to overfit easily. Hence, removing the cell-protein links eliminates the information leakage.
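The static k-NN cell-cell graph used in this ablation can be sketched as follows, assuming cosine similarity of (dense) gene expression profiles as the similarity measure.

```python
import numpy as np

def knn_cell_graph(X, k=2):
    """Sketch of a static k-NN cell-cell graph: connect each cell to
    its k most similar cells by cosine similarity of gene expression.
    X: (n_cells, n_genes) expression matrix.
    Returns a binary adjacency matrix (n_cells, n_cells)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                      # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)       # exclude self-loops
    nbrs = np.argsort(-S, axis=1)[:, :k]
    A = np.zeros_like(S)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1.0
    return A
```

Because these edges are fixed before training, any noise or dropout in the expression matrix is frozen into the graph, which is the weakness the learned cell-transformer attention avoids.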

5.7.2 Influence of Every Transformer. The proposed multimodal transformer module consists of three transformers, namely the cell transformer, the gene transformer, and the protein transformer. As our predictions are based on the cell readout, each of the three transformers is expected to have a different level of impact on performance. To quantify the impact of a single transformer, we conduct an experiment in which we remove the other two transformers and measure the prediction RMSE scores. The results, including the scores of the three partial models and the model with no transformers, are summarized in Figure 3. The gene transformer and the protein transformer perform better than the cell transformer on CITE, while the opposite holds on GEX2ADT. This can be explained by the difference in RNA zero rate between the two datasets, as shown in Table 8. For GEX2ADT, the high RNA zero rate means less information, making the cell transformer crucial for increasing performance by drawing more information from the data. On the other hand, CITE has a lower zero rate and thus provides more information, allowing the gene transformer and the protein transformer to enhance the model by adding external biological knowledge.

5.7.3 How to Utilize Prior Knowledge. To answer the third question, we compare scMoFormer with two GNN-based models in Table 7. The model "GNN-prior" refers to GNNs built on the same graph as in Section 4.1, while the "GNN" model is constructed using only the cell-gene graph without incorporating any prior information. The results show that incorporating prior knowledge into the graph yields a slight performance boost on both datasets. However, when the multimodal transformers are included, the improvement is much more pronounced. This highlights both the usefulness of prior knowledge and the importance of transformers for effectively incorporating this information into the model.

CONCLUSION
Recent advances in multimodal single-cell technology have enabled the simultaneous profiling of the transcriptome alongside other cellular modalities, leading to an increase in the availability of multimodal single-cell data. In this paper, we present scMoFormer, a multimodal transformer model for predicting single-cell surface protein abundance from gene expression measurements. We combine the data with prior biological interaction knowledge from the STRING database into a richly connected heterogeneous graph and leverage transformer architectures to learn an accurate mapping between gene expression and surface protein abundance. Remarkably, scMoFormer achieves superior and more stable performance than other baselines on both the 2021 and 2022 NeurIPS single-cell datasets.
Future Work. Our framework of multimodal transformers with a cross-modality heterogeneous graph goes far beyond the specific downstream task of modality prediction, and there is much potential to be further explored. Our graph contains three types of nodes. While the cell embeddings are used for prediction, the protein embeddings and gene embeddings may be further interpreted for other tasks. The similarities between proteins may reveal data-specific protein-protein relationships, while the attention matrix of the gene transformer may help identify marker genes of each cell type. Additionally, we may achieve gene interaction prediction using the attention mechanism. To extend the transformers further, a potential next step is implementing cross-attention across modalities. Ideally, all three types of nodes, namely genes, proteins, and cells, would be jointly modeled by a large transformer with modality-specific regulations.

A EXPERIMENTAL DETAILS
We summarize the dataset details in Table 8.

B MORE EXPERIMENTAL RESULTS
Performance on the CBMC CITE-seq Dataset. We conduct additional experiments on the CBMC CITE-seq dataset obtained from Seurat [16], which contains 20501 RNAs and 10 proteins. For comparison, we evaluate ScMoGNN alongside scMoFormer, since the former is the best-performing baseline. The results of five runs are presented in Table 9. Our findings show that scMoFormer consistently outperforms ScMoGNN across all three metrics, which aligns with our previous observations on the CITE and GEX2ADT datasets.

Figure 1: An illustration of scMoFormer. In this framework, three important components are included: graph construction, multimodal transformer, and prediction layer.

Figure 3: RMSE↓ results of keeping only one Transformer.
This research is supported by the National Institutes of Health (NIH) under grant number UO1 DE029255 and R01 DE026728, the National Science Foundation (NSF) under grant numbers CNS 2246050, IIS1845081, IIS2212032, IIS2212144, IOS2107215, DUE 2234015, DRL 2025244 and IOS2035472, the Army Research Office (ARO) under grant number W911NF-21-1-0198, the Home Depot, Cisco Systems Inc, Amazon Faculty Award, Johnson & Johnson, JP Morgan Faculty Award and SNAP.

• The proposed scMoFormer is versatile and can flexibly incorporate domain knowledge regarding genes and proteins.
• The proposed scMoFormer achieves superior performance on various benchmark datasets. Remarkably, we are one of the top winners in a NeurIPS 2022 competition.

Table 4: Results on other Single-Cell Multimodal Tasks.

Table 6: The prediction RMSEs on different graph variants.

Table 8: Dataset details. The experimental results on CBMC are shown in Appendix B.