Are we making much progress? Revisiting chemical reaction yield prediction from an imbalanced regression perspective

The yield of a chemical reaction quantifies the percentage of the target product formed in relation to the reactants consumed during the chemical reaction. Accurate yield prediction can guide chemists toward selecting high-yield reactions during synthesis planning, offering valuable insights before dedicating time and resources to wet lab experiments. While recent advancements in yield prediction have led to overall performance improvement across the entire yield range, an open challenge remains in enhancing predictions for high-yield reactions, which are of greater concern to chemists. In this paper, we argue that the performance gap in high-yield predictions results from the imbalanced distribution of real-world data skewed towards low-yield reactions, often due to unreacted starting materials and inherent ambiguities in the reaction processes. Despite this data imbalance, existing yield prediction methods continue to treat different yield ranges equally, assuming a balanced training distribution. Through extensive experiments on three real-world yield prediction datasets, we emphasize the urgent need to reframe reaction yield prediction as an imbalanced regression problem. Finally, we demonstrate that incorporating simple cost-sensitive re-weighting methods can significantly enhance the performance of yield prediction models on underrepresented high-yield regions.


INTRODUCTION
Recent advancements in machine learning have introduced a paradigm shift in the field of computational chemistry [5]. These breakthroughs have led to a diverse array of machine learning models that now play critical roles in assisting chemists across a broad spectrum of tasks, including but not limited to retrosynthesis, product prediction, and drug discovery. Within this multifaceted landscape, the prediction of reaction yields [1, 12, 13, 15, 17] emerges as an issue of paramount importance in the domain of synthesis planning, where complex molecules are synthesized through a sequence of reaction steps. Based on an empirical categorization, yields above 67% are classified as high yields and those below 33% as low yields [17]. In this context, the occurrence of a low-yield reaction within such a sequence can drastically impact the feasibility and overall efficiency of the synthesis process. As a result, chemists often prioritize the accurate prediction of high-yield reactions.
While the introduction of numerous yield prediction models has indeed showcased improved performance across the entire yield range, effectively enhancing performance for high-yield reactions remains an open problem [9]. In real-world scenarios, yield data often exhibits a highly skewed distribution, with high yield values being much rarer than lower ones, despite their greater importance to chemists in synthesis planning. For example, the Buchwald-Hartwig dataset [1], a widely used benchmark for yield prediction, is significantly skewed toward the low-yield regions, with certain high-yield ranges containing only a few data points. In this paper, we argue that the increased difficulty in predicting high-yield reactions stems from the limited availability of data samples in this range, often due to unreacted starting materials and inherent ambiguities in the reaction processes. Despite the presence of such data imbalance, existing yield prediction methods continue to treat different yield ranges equally under the false assumption of a balanced data distribution.
To gain a deeper insight into the field's actual progress, we conduct extensive experiments to benchmark six state-of-the-art yield prediction methods on three real-world datasets. Surprisingly, the results become less impressive than claimed when we take data imbalance into account. We discover that the overall good performance across the entire yield spectrum primarily results from enhancing performance in areas with sufficient data, typically the low-yield range, while overlooking the significant performance gap in underrepresented high-yield regions. This finding has motivated us to revisit reaction yield prediction and reformulate it as an imbalanced regression problem, a well-established topic in machine learning.
Unlike imbalanced classification, reaction yield prediction involves regression rather than classification, and there has been limited exploration of addressing data imbalance in the regression context [11]. Most prior research on imbalanced regression has directly adapted the SMOTE algorithm [3] to regression settings [2, 14]. However, the continuous nature of target labels in regression tasks makes these adaptations less practical. A more intuitive solution is to apply cost-sensitive re-weighting strategies [4] that can be seamlessly combined with various regression models. We demonstrate that incorporating simple imbalanced regression techniques such as Focal loss [7] or label distribution smoothing [16] can significantly enhance the performance of yield prediction models on underrepresented high-yield regions without sacrificing much overall performance. We believe these findings have the potential to redirect future research directions in reaction yield prediction, benefiting both the chemistry and machine learning communities. In summary, the contributions of this paper include:
• We are the first to reformulate reaction yield prediction as an imbalanced regression problem.
• We conduct comprehensive experiments on three real-world yield prediction datasets to uncover and understand the limitations of existing models when predicting high-yield reactions.
• We demonstrate that incorporating cost-sensitive re-weighting methods into existing yield prediction models can lead to significant performance improvements on high-yield reactions.

THE EXAMINATION OF EXISTING YIELD PREDICTION METHODS
In this section, we begin by introducing the definitions of reaction yield prediction and imbalanced regression. We then proceed to evaluate six yield prediction methods on three real-world datasets, all viewed through the lens of imbalanced regression. The left section of Figure 2 visualizes the division of these regions on three real-world benchmark yield prediction datasets.

Evaluation Settings for Yield Prediction
2.2.1 Datasets. We use three real-world benchmark datasets for predicting reaction yields, sourced from either high-throughput experimentation (HTE) or electronic laboratory notebooks (ELN). Following prior research on imbalanced regression [4, 8, 16], we categorize the bins within the target yield space into three disjoint subsets based on their respective counts of reactions: many-shot (bins with more than N_upper reactions), medium-shot (bins with N_lower to N_upper reactions), and few-shot (bins with fewer than N_lower reactions) regions. For all three datasets, we set the bin size to 1.
• B-H [1]: It comprises 3,955 Buchwald-Hartwig reactions from HTE, with the number of reactions per bin varying between 1 and 412. Here, N_lower is set to 25 and N_upper to 50.
• S-M [10]: It consists of 5,760 Suzuki-Miyaura reactions from HTE, with the number of reactions per bin ranging from 1 to 209. Here, N_lower is set to 20 and N_upper to 65.
• AZ [12]: It includes 750 Buchwald-Hartwig reactions from ELN at AstraZeneca, with the number of reactions per bin ranging from 0 to 145. Here, N_lower is set to 3 and N_upper to 5.
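The region assignment described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes yields on a 0-100 scale with unit-width bins, using the B-H thresholds (N_lower = 25, N_upper = 50) as defaults.

```python
import numpy as np

def split_regions(yields, bin_size=1.0, n_lower=25, n_upper=50):
    """Assign each yield bin to a many-/medium-/few-shot region
    based on its count of training reactions."""
    bins = np.floor(np.asarray(yields) / bin_size).astype(int)
    counts = np.bincount(bins, minlength=int(100 / bin_size) + 1)
    regions = {}
    for b, c in enumerate(counts):
        if c > n_upper:
            regions[b] = "many-shot"
        elif c >= n_lower:
            regions[b] = "medium-shot"
        else:
            regions[b] = "few-shot"
    return regions
```

Empty bins fall into the few-shot region by construction, mirroring the skew toward sparsely populated high-yield bins.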

2.2.2 Yield prediction methods.
• Machine learning methods: Random Forest (RF), XGBoost, Support Vector Machines (SVM)
• Deep learning methods: Multi-layer Perceptron (MLP), YieldGNN [12], Yield-BERT [13]

2.2.3 Evaluation pipeline. We report yield prediction results on the many-shot, medium-shot, and few-shot regions as well as on the entire yield space (i.e., the all region). In all three datasets, 70% of the data is used for training and the remaining 30% is reserved for testing. Our evaluation employs common yield prediction metrics: mean absolute error (MAE) and root mean square error (RMSE). Additionally, we utilize the geometric mean of ℓ1 errors (G-Mean) as a supplementary metric. Lower values (↓) of MAE, RMSE, and G-Mean indicate better yield prediction performance.
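The three metrics can be computed per region as sketched below. This is a minimal implementation, assuming per-sample absolute errors; the small eps in G-Mean is our own guard against zero errors, not part of the original protocol.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error over the evaluated region.
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    # Root mean square error over the evaluated region.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def g_mean(y_true, y_pred, eps=1e-8):
    # Geometric mean of absolute (l1) errors; eps avoids log(0).
    return float(np.exp(np.mean(np.log(np.abs(y_true - y_pred) + eps))))
```

Applying each metric separately to the many-, medium-, and few-shot subsets exposes the imbalance that an all-region average would hide.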

2.2.4 Implementation details.
For RF, XGBoost, SVM, and MLP, the input features include structural fingerprints (e.g., ECFP), chemical properties (e.g., NMR shifts, HOMO/LUMO energies, vibrations, dipole moments), and reaction-specific parameters (e.g., scale, volume, temperature). For Yield-BERT, the SMILES representation of the reaction is used as input; an encoder based on a pre-trained BERT [6] for SMILES is employed, with a k-Nearest Neighbors (kNN) regressor serving as the decoder. For YieldGNN, we construct graph structures to represent the molecules involved in the reaction; a GNN is used to encode the reaction, while an MLP is used to decode the reaction embedding into yield predictions. We employ the ℓ1 distance as the training loss L in all experiments.

Table 1: Reaction yield prediction results on three real-world datasets. We report the average performance across 10 repetitions in all experiments and present the average rankings for the many-shot, medium-shot, and few-shot regions, respectively.
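As a sketch of the kNN decoding step used with Yield-BERT above: given reaction embeddings from the pre-trained encoder, the yield of a query reaction is predicted from its nearest training neighbors. This is a minimal illustration assuming Euclidean distance and uniform averaging; the embeddings themselves are taken as given.

```python
import numpy as np

def knn_yield_predict(train_emb, train_yields, query_emb, k=5):
    """Predict yield as the mean yield of the k nearest training
    reactions in embedding space (Euclidean distance)."""
    dists = np.linalg.norm(train_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(train_yields[nearest]))
```

Note that such a decoder inherits the training set's skew directly: a query near the sparse high-yield region has few genuine neighbors to average over.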

Uncovering and Understanding the Performance Gap in High-Yield Predictions
Figure 2 provides an insightful comparison between yield distributions and test error distributions, both as functions of yield values.
Regarding the yield distribution, we note that the few-shot region predominantly comprises high-yield reactions, while the many-shot region primarily consists of low-yield reactions. Specifically, in the B-H and S-M datasets, 81% and 100% of the few-shot reactions fall into the high-yield category, while 89% and 100% of the many-shot reactions belong to the low-yield category. This finding establishes a clear connection between yields and their distributions.
To quantify the impact of data imbalance on prediction errors, we compute the Pearson correlation coefficients between the test error distribution and the training yield distribution. Across all three reaction yield prediction datasets, we consistently observe negative correlation coefficients, with values of -0.42, -0.28, and -0.07, respectively. Moreover, in the right part of Figure 2, it is evident that the few-shot region (in orange) exhibits the largest test error, while the many-shot region (in blue) demonstrates the smallest error. To complement this observation, we further evaluate the yield prediction performance of six state-of-the-art models using three metrics and report the average performance rankings for the three regions in Table 1. The average rankings indicate a decline in model performance as we transition from the many-shot region to the medium-shot and few-shot regions.
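The correlation analysis above can be sketched as a Pearson coefficient between per-bin training counts and per-bin mean test errors. This is a minimal from-scratch implementation; the aggregation of counts and errors into bins is assumed to be done beforehand.

```python
import numpy as np

def count_error_correlation(train_counts, test_errors):
    """Pearson correlation between the per-bin training yield
    distribution and the per-bin mean test error. A negative
    value indicates that better-populated bins are predicted
    more accurately."""
    x = np.asarray(train_counts, dtype=float)
    y = np.asarray(test_errors, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```

An equivalent result can be obtained with `scipy.stats.pearsonr`; the explicit form above just makes the computation transparent.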
Therefore, the observations from Figure 2 and Table 1 converge to the same conclusion: during training, low-yield values, backed by a larger number of data samples, tend to be learned better than high yields with fewer samples. This highlights the demand for specialized machine learning techniques that can effectively address the challenge of data imbalance.

MITIGATING DATA IMBALANCE FOR BETTER HIGH-YIELD PREDICTIONS
In this section, we present two cost-sensitive re-weighting methods for imbalanced regression, which can be seamlessly integrated into the learning process of existing yield prediction models. We also provide evidence of their effectiveness in high-yield predictions.

Cost-sensitive Re-weighting Methods
The key idea is to assign a weight w_i ∈ W to each training sample (x_i, y_i), resulting in the following modified loss function L′:

L′ = (1/N) Σ_i w_i · L(ŷ_i, y_i),

where L is a loss function for regression tasks, such as the ℓ1 loss, MSE loss, or Huber loss. The distinctiveness of each re-weighting method arises from the various designs of the training weights W.
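A minimal sketch of the re-weighted ℓ1 loss is given below, using inverse bin-frequency weights as one simple choice of W. This is an illustration of the general scheme, not the paper's exact method; label distribution smoothing [16], for example, smooths the bin counts before inverting them.

```python
import numpy as np

def inverse_frequency_weights(train_yields, bin_size=1.0):
    """w_i proportional to 1 / |C_b| for the bin b containing y_i,
    normalized so the weights average to 1."""
    bins = np.floor(np.asarray(train_yields) / bin_size).astype(int)
    counts = np.bincount(bins)
    w = 1.0 / counts[bins]
    return w * len(w) / w.sum()

def weighted_l1_loss(y_pred, y_true, weights):
    # L' = (1/N) * sum_i w_i * |y_hat_i - y_i|
    return float(np.mean(weights * np.abs(y_pred - y_true)))
```

Samples in sparsely populated (typically high-yield) bins thus receive larger weights, so the model is penalized more heavily for errors on exactly the reactions chemists care most about.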

Figure 1: (a) An example of reaction yield prediction. (b) A depiction of the data-driven yield prediction workflow. (c) An illustration of the imbalanced regression problem.

Definition 1: Reaction yield prediction. Reaction yield prediction is a regression problem that predicts the yield value y ∈ [0, 100] of a chemical reaction x = (R, p) composed of multiple reactant molecules R and a single product molecule p. A yield prediction model comprises an encoder f : x → z ∈ R^d and a decoder g : z → ŷ ∈ [0, 100]. The encoder f(·) takes the form of a graph neural network (GNN) or a sequential model, depending on the actual representations of molecules. The decoder g(·) outputs the yield prediction ŷ based on the encoded reaction embedding z.

Definition 2: Imbalanced regression. Let D = {(x_i, y_i)}_{i=1}^N denote the training dataset of a regression problem, where x_i ∈ R^d is the input feature vector and y_i ∈ R is the label. We divide the label space Y into B disjoint bins with equal intervals that cover the entire range of target values, i.e., [b_0, b_1), [b_1, b_2), …, [b_{B-1}, b_B). Let C_b be the set of data samples in the b-th bin, with b ∈ {1, 2, …, B}. Data imbalance occurs when the label distribution is highly skewed, i.e., max_b |C_b| / min_b |C_b| ≫ 1. To facilitate fair evaluation under data imbalance, the bins are classified into many-shot, medium-shot, and few-shot regions based on the number of training samples.

Figure 2: A comparison between yield distributions (left) and test error distributions (right) on three benchmark datasets.