An Extensive Overview of Feature Representation Techniques for Molecule Classification

The application of Machine Learning (ML) algorithms in biomedical engineering, and specifically on molecular data, has gained much attention in recent years. Accurate predictions on molecular data are directly linked with many open problems, like drug discovery, disease prediction and treatment optimization. However, finding the most appropriate method for transforming molecules into feature-ready inputs for an ML algorithm is a challenging task. Despite the numerous featurizers, i.e., algorithms that transform molecules into features, there is a lack of comprehensive analysis comparing their impact on model accuracy and efficiency for downstream tasks. In this study, we evaluate ten (10) featurizers and five (5) ML models for classification tasks. In addition, we explore the differences between the two main categories of featurizing techniques, i.e., graph and linear form representations. Our results show that the selection of an appropriate featurizer is model and application specific. We demonstrate that a combination of a linear molecular representation and a conventional ML algorithm can achieve superior predictive performance compared to more complex and sophisticated graph-based representations.


INTRODUCTION
Machine Learning (ML) has experienced significant success in various fields, including recommender systems [24], networking [25], social care applications [21] and biomedical engineering [12]. Recent advances in the field of chemo-informatics have raised interest in extracting, processing and extrapolating meaningful data from chemical structures [34]. ML techniques are widely used to accelerate scientific research in fields such as drug discovery [32] and disease prediction [31]. All these procedures are time-consuming and financially demanding, requiring intensive cycles of medical analysis in the physical laboratory. The introduction of ML can significantly reduce the associated costs while providing reliable predictions.
A common task in chemo-informatics is molecular property prediction [34]. Given that molecules are the foundational units in this field, it is crucial that ML algorithms can process molecular data. However, incorporating molecules as inputs into ML models is not straightforward. Many different representation algorithms, i.e., "featurizers", have emerged to transform molecules into forms compatible with ML models. One of the most popular transformation methods is the Simplified Molecular-Input Line-Entry System (SMILES) [33], which encodes chemical compounds as strings of ASCII characters. Although several studies [9, 13] have shown that feeding raw SMILES strings into an ML model may lead to adequate results, many featurizers transform SMILES into more ML-friendly representations, mainly vectors or matrices, to increase the performance and efficiency of the algorithms.
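To make the SMILES notation concrete, the following minimal sketch (not part of our pipeline) parses a SMILES string with the open-source RDKit library; the molecule (aspirin) and the printed property are purely illustrative.

```python
# Minimal sketch: parsing a SMILES string with the open-source RDKit library.
# The molecule (aspirin) and the printed property are illustrative only.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, encoded as ASCII characters
mol = Chem.MolFromSmiles(smiles)   # returns None if the SMILES is invalid

print(mol.GetNumAtoms())           # number of heavy atoms (13 for aspirin)
```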
For instance, transformative approaches such as the MACCS Fingerprint [5] and ConvMol [6] have improved the predictive performance of ML algorithms.
Despite the recent research efforts in transforming molecules into meaningful representations, there is a lack of consensus on the most effective approach for specific downstream tasks. To address this gap, in this work we compare ten (10) featurizers and five (5) ML models to determine the combinations that maximize performance, both in terms of accuracy and efficiency, in classification tasks. In addition, we examine whether state-of-the-art ML techniques, i.e., Graph Neural Networks [28], outperform traditional approaches like Support Vector Machines (SVM) [3], Random Forests (RF) [10] and Gradient Boosting models [8].
The rest of this work is structured as follows. Section 2 surveys the related work on featurization methods and similar survey studies. Section 3 provides an overview of the employed featurizers and ML algorithms. Section 4 presents the considered learning settings and the experimental results. Finally, Section 5 concludes our work.

PRELIMINARIES AND RELATED WORK
In this section, we provide an overview of featurizers, i.e., methods that convert molecular structures into ML-compatible formats. In addition, we review similar efforts that assess the impact of these transformations on the predictive accuracy of ML models in downstream tasks.

Featurizers
Featurizers are algorithms that transform molecules into a form suitable for ML algorithms. The most commonly used representation is the SMILES notation [33], a linear notation that captures various chemical characteristics like atoms, bonds and rings. However, SMILES strings have limited compatibility with ML models and must be further processed into numerical values. For this reason, the research community has developed specialized featurizers that convert molecules, typically from SMILES, into more advanced forms. Fig. 1 shows a high-level overview of featurization methods. Featurizers can be categorized into [30]: (1) Expert-based representations: These methods are built using domain knowledge. This class of featurizers is further divided into two categories: (a) Molecular Fingerprints: These are fragment-based descriptors that specify the presence or absence of specific structural characteristics in a binary vector [30]. Molecular fingerprints were initially designed for isomer identification and were later used for rapid substructure searching and molecular similarity assessment [7]. These fingerprints can be divided further based on whether they use structural keys or hashed methods. (b) Molecular Descriptors: These are vectors or arrays that describe molecules through a set of physicochemical properties. They are typically categorized further by dimension into 1-, 2- or 3-dimensional descriptors [22]. (2) Learnable Representations: These are generated directly from molecular structures without incorporating expert input. The primary algorithms in this category include: (a) Molecular Graphs: These methods represent molecules using feature vectors for node and edge properties, along with an adjacency matrix, converting them into graph structures. The feature vectors usually include high-level chemical attributes of each atom, such as its element, charge and hybridization state [36]; more advanced models like [14] build on these graph representations. (b) Direct Learning from SMILES: This approach involves using deep learning algorithms like Recurrent Neural Networks (RNNs) or 1D Convolutional Neural Networks (CNNs) to learn directly from the SMILES notations of molecules. These notations are tokenized or one-hot encoded, a typical process for text classification. A notable example in this category is the TextCNN algorithm [16].
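As a concrete illustration of the two fingerprint families above (structural keys versus hashed methods), the sketch below computes both with RDKit; the molecule and parameter values are illustrative assumptions, not settings used later in our experiments.

```python
# Minimal sketch: a structural-key fingerprint (MACCS) versus a hashed
# circular fingerprint (ECFP-like), both computed with RDKit.
from rdkit import Chem
from rdkit.Chem import MACCSkeys, AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, purely as an example

# Structural keys: a fixed dictionary of 166 predefined substructure bits.
maccs = MACCSkeys.GenMACCSKeys(mol)    # 167-bit vector (bit 0 is unused)

# Hashed method: radius-2 atom environments hashed into 2048 bits.
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

print(maccs.GetNumOnBits(), ecfp.GetNumOnBits())
```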

Featurizers Overview Studies
Due to the plethora of available transformation techniques, numerous studies have attempted to compare featurizers and assess their overall performance across various tasks. Zhu et al. [36] demonstrated that different featurization techniques capture chemical information in distinct ways. In addition, they introduced a pre-training framework, named MOCO, that adaptively distills information from different featurization methods. The authors evaluated the impact of different featurizers across eight (8) datasets and showed an average improvement of 1.1% with MOCO, highlighting the importance of using multiple featurization techniques.
Stepišnik et al. [30] examined several molecular representations, ranging from fingerprints and descriptors to neural network-based encodings. They evaluated these approaches on eleven (11) benchmark datasets for classifying properties like mutagenicity and solubility. The findings suggest that no single featurization technique consistently outperforms the others, while MACCS fingerprints were particularly effective for certain tasks.
Lowdon et al. [18] focused on the importance of ML model selection for molecular-based training. They compared different models for predicting molecular binding to Molecularly Imprinted Polymers (MIPs). Their results demonstrate that multitask regression and the Graph Convolution Network (GCN) provided the most reliable predictions.
Sivaraman et al. [29] developed MOLAN, a comprehensive pipeline for molecular analysis that combines featurization, clustering, and high-performance regression models. This workflow integrates several advanced techniques, including a semi-supervised Variational AutoEncoder (VAE) to facilitate molecular design.
Elton et al. [7] explored ML techniques for predicting the properties of energetic molecules. Using a diverse dataset, they found kernel ridge regression to be the most effective model.
In [1], a comparison between expert-based and learnable representations showed similar downstream predictive performance between the two categories. Lastly, in [4], the authors compared descriptor-based and graph-based models for molecular property prediction. The results indicated that descriptor-based models, on average, outperformed graph-based models in both prediction accuracy and computational efficiency. For regression tasks, SVMs were found to be the best-performing algorithms, while RF and XGBoost excelled in classification tasks.
To the best of our knowledge, our overview surpasses typical surveys in both scope and depth. It covers a significantly larger number of algorithms and, to an unprecedented extent, includes a comparative analysis between linear featurizers and graph-based ones.

SYSTEM MODEL
In this section, we introduce the featurizers and the ML models employed for the downstream classification tasks. Overall, we evaluate the performance of ten (10) featurizers and five (5) machine learning models.

Featurizers
Linear Featurizers. (1) MACCSKeysFingerprint [5]: This method falls under the category of structural keys. It encodes the structure of the molecule into a binary bit string, where each bit corresponds to a specific structural feature. For instance, if the molecule contains an aromatic ring, the bit associated with this feature is set to 1. In total, there are 166 MACCS keys, each corresponding to a unique molecular substructure. (2) Circular Fingerprint [27]: These fingerprints are generated by considering the circular environment around each atom up to a specified radius or diameter. The most common example of circular fingerprints is Extended-Connectivity Fingerprints (ECFPs), which are created using a variant of an algorithm that identifies molecular isomorphism. (3) Mol2Vec Fingerprint [11]: This method utilizes ML-based featurization. More precisely, it uses Mol2Vec, which is based on the Word2Vec unsupervised learning algorithm. Mol2Vec generates vector representations of molecular substructures in a manner similar to how Word2Vec learns word embeddings in NLP tasks. (4) RDKIT Descriptors [17]: This method calculates a list of chemical descriptors, such as molecular weight, number of valence electrons, and maximum and minimum partial charge. (5) Mordred Descriptors [20]: Mordred calculates more than 1800 biochemical descriptors for each molecule. Despite the large number of descriptors, the featurization process remains highly efficient. (6) Coulomb Matrix Eigenvalues [19]: This method produces a matrix that represents the electronic structure of a molecule. For a molecule with N atoms, the Coulomb method produces an N × N matrix, where each element represents the intensity of the electrostatic interaction between two atoms within the molecule. (7) BP Symmetry Function [2]: The main idea of this featurizer lies in preserving the rotational and permutational symmetry of the molecular system. A series of radial and angular symmetry functions with various distance and angle cutoffs are used to describe the local environment of an atom in a molecule. (8) OneHot Featurizer: This is the simplest featurizer examined in this study. It converts the SMILES notation into a one-hot encoding using a pre-defined dictionary of characters.
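The sketch below shows how three of these linear featurizers can be invoked through DeepChem; the molecules and parameter values are illustrative assumptions, not our experimental settings.

```python
# Minimal sketch: three linear featurizers applied through DeepChem.
import deepchem as dc

smiles = ["CCO", "c1ccccc1"]  # illustrative molecules (ethanol, benzene)

maccs = dc.feat.MACCSKeysFingerprint().featurize(smiles)          # (2, 167) binary
ecfp = dc.feat.CircularFingerprint(size=2048).featurize(smiles)   # (2, 2048) hashed
desc = dc.feat.RDKitDescriptors().featurize(smiles)               # descriptor vectors

print(maccs.shape, ecfp.shape, desc.shape)
```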
Graph Featurizers. (1) ConvMol Featurizer [6]: This method serves as the fundamental input for all graph-based models. It generates a feature vector for each atom of a molecule, containing information about atom type, hybridization and valence structure. It also returns a list of each atom's neighbors, which ML models use to perform graph convolution operations. (2) Weave Featurizer [15]: This method captures both the local chemical environment of each atom and the atoms' interconnectivity. While atom feature vectors are calculated similarly to the ConvMol approach, Weave provides a richer edge representation by incorporating bond properties and graph distance metrics.
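Analogously, a minimal sketch of the two graph featurizers through DeepChem, again with illustrative molecules:

```python
# Minimal sketch: the two graph featurizers, applied through DeepChem.
import deepchem as dc

smiles = ["CCO", "c1ccccc1"]  # illustrative molecules

conv = dc.feat.ConvMolFeaturizer().featurize(smiles)   # ConvMol objects
weave = dc.feat.WeaveFeaturizer().featurize(smiles)    # WeaveMol objects

# Each ConvMol exposes a per-atom feature matrix plus neighbor lists.
print(conv[0].get_atom_features().shape)
```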

ML models
• SVM [3] is a popular and widely used ML algorithm. It solves learning tasks by defining a decision plane that separates data points of different classes while maximizing the margin between them.
• RandomForest [10] is an ensemble algorithm that combines the predictions of multiple decision trees. The results of all individual decision trees are aggregated to produce the output of the forest.
• Gradient Boosting Classifier [8] is another ensemble method, consisting of decision trees connected sequentially. Each new tree is generated by greedily minimizing the loss function.
• Graph Convolution Network (GCN) [6] is one of the most popular Graph Neural Network (GNN) architectures. GNNs are designed to identify graph patterns and use them for downstream tasks, such as graph or node classification, link prediction and graph clustering. The node features, edge features (if present) and the adjacency matrix describing the graph's structure are fed as input into the GNN model. Then, an iterative process (message passing) takes place, in which every node of the graph aggregates the feature vectors of its neighbors to compute its new embedding for the next iteration, as shown in the sketch after this list.
• Weave Graph Convolution Network [15] is another type of graph-based model. The major difference from GCN is that Weave does not restrict the graph convolution to neighboring nodes; instead, it uses information from all nodes in the graph. Despite being significantly more complex, this advanced convolution is effective at transmitting information (messages) between distant atoms.
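To make the message-passing step concrete, the following NumPy sketch implements one GCN-style layer with symmetric normalization; the toy graph, feature sizes and random weights are illustrative assumptions, not the DeepChem implementation.

```python
# Minimal sketch of one GCN-style message-passing step in plain NumPy.
# A: adjacency matrix, H: node feature matrix, W: learnable weight matrix
# (random here, purely for illustration).
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)   # symmetric normalization
    # Each node aggregates its (normalized) neighborhood, then projects.
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)  # ReLU

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # toy 3-atom graph
H = np.random.rand(3, 4)   # 4 input features per atom
W = np.random.rand(4, 8)   # projects embeddings to 8 dimensions
print(gcn_layer(A, H, W).shape)   # (3, 8)
```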

EXPERIMENTS
In this section, we outline the considered datasets and the metrics used for ML model evaluation.In addition, we describe the experimental setup and finally present the results.

Dataset
We employ MoleculeNet [35] for our experiments, a collection of datasets in the domain of molecular property prediction. We focus on two datasets which differ both in data size and task complexity. Dataset statistics are presented in Table 1.
(1) Clintox: This dataset comprises 1491 drug compounds that have undergone clinical trials and presents two distinct classification tasks. Specifically, the goals are to predict the presence or absence of toxicity in clinical trials and whether a drug is approved or rejected by the Food and Drug Administration (FDA). 1372 of the dataset's molecules are characterized as non-toxic and the remaining 112 as toxic, indicating a high class imbalance. (2) SIDER (Side Effect Resource): This dataset comprises 1427 FDA-approved drugs and includes 27 distinct binary classification tasks. Adverse drug reactions are categorized into 27 binary labels, where each label indicates the presence or absence of adverse effects in a human organ system.

Evaluation Metrics
For predictive performance evaluation, we use the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), as recommended in the original paper [35] based on previous works and the datasets' characteristics. AUC-ROC measures the ability of a model to distinguish between the positive and negative classes. Besides predictive performance, we are also interested in computational efficiency; we measure the training time of each machine learning model as our efficiency metric.
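For reference, AUC-ROC can be computed directly with scikit-learn; the labels and scores below are illustrative only.

```python
# Minimal sketch: AUC-ROC with scikit-learn on illustrative data.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]             # ground-truth class labels
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted positive-class probabilities

print(roc_auc_score(y_true, y_score))  # 0.75
```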

Experimental Setting
We use the DeepChem [26] Python library to implement the featurization techniques. For ML model training, we use Scikit-learn [23] for the SVM, RandomForest and Gradient Boosting classifiers. For graph-based models, i.e., GCN and Weave, we use DeepChem's implementations.
For dataset pre-processing, missing or invalid data are replaced with zeros. The datasets are partitioned into 80% for training, 10% for validation and 10% for testing. To avoid the risk of biased results due to specific random seeds, we conduct our experiments using 20 different seeds and train each model from scratch for each seed. We then report the mean and standard deviation across these runs as the final results.
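A condensed sketch of this pipeline for one featurizer-model pair is shown below; the hyperparameters and the restriction to a single task are illustrative simplifications, not our exact experimental settings.

```python
# Minimal sketch of the evaluation pipeline for one featurizer-model pair.
# Hyperparameters and the single-task restriction are illustrative only.
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Load Clintox with a fingerprint featurizer and a random 80/10/10 split.
tasks, (train, valid, test), _ = dc.molnet.load_clintox(
    featurizer="ECFP", splitter="random")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train.X, train.y[:, 0])            # first task only, for brevity

y_score = clf.predict_proba(test.X)[:, 1]  # positive-class probabilities
print(roc_auc_score(test.y[:, 0], y_score))
```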

Results
Clintox Dataset. Fig. 2 presents the experimental results obtained on the Clintox dataset. We compared the effectiveness of ten distinct featurizers when used in conjunction with three classical machine learning models. The performance scores of the two graph-based machine learning models are also displayed in the same chart as dashed lines.
The key observations are: (1) The MACCS Fingerprint featurizer consistently results in high AUC-ROC scores, ranking first with SVM and Gradient Boosting and second with Random Forest. (2) Other fingerprint-based featurizers also deliver satisfactory results. (3) Descriptor-based methods lead to lower AUC-ROC scores than the other featurizers. (4) The Gradient Boosting Classifier tends to outperform the other models across most featurizers, except when paired with the BP Symmetry Function. (5) Classical ML methods often match or exceed the predictive performance of the more sophisticated graph-based approaches.
We also assessed the computational efficiency of featurizer-model combinations by measuring their training times, which are illustrated in Fig. 3.
SIDER Dataset. The results on the SIDER dataset support the findings from the analysis on Clintox. Fig. 4 and Fig. 5 present the predictive performance and the training times, respectively.
The key conclusions from the results are as follows: (1) Graph-based models perform significantly worse than expert-based representations, particularly fingerprint representations. (2) The AUC-ROC scores of the conventional machine learning algorithms are similar when fingerprint representations are used. (3) Fingerprint-based featurizers tend to outperform descriptor featurizers.
In terms of efficiency, the Mordred descriptors paired with Gradient Boosting showed exceptionally high training times. This observation suggests that this combination may not be advisable in scenarios where computational efficiency is a priority.

CONCLUSION
In this study, we explored the impact of featurizers on the performance of several machine learning models for molecular property prediction. The findings indicate that expert-based featurization techniques, i.e., fingerprints and descriptors, often outperform more sophisticated graph-based models. Additionally, we observed that the MACCS fingerprint consistently yields superior results across different machine learning models.
In the future, it would be beneficial to conduct further experiments using additional featurizers, learning models and datasets. This would provide a more comprehensive understanding of the relationship between featurization and machine learning performance in molecular property prediction.
