Fusion of Graph and Natural Language Processing in Predictive Analytics for Adverse Drug Reactions

Adverse Drug Reactions (ADRs) pose a critical challenge to patient safety and healthcare economics worldwide. This study presents a novel graph-assisted machine learning algorithm applied to a comprehensive dataset provided by the Commonwealth Bank Health Society (CBHS), spanning 1976 to 2018, to predict ADRs. Utilizing the narrative-like structure of patients’ medical histories through Natural Language Processing (NLP), the research uniquely encodes International Classification of Diseases (ICD) codes into word embeddings. In a departure from traditional methods, it also employs network analytics to transform disease histories into a knowledge graph structure that captures temporal and contextual relationships between ICD codes. The results demonstrate that integrating Node2Vec with Word2Vec improved model performance across various metrics, with significant enhancements in recall for K-nearest neighbors (KNN) and in area under the receiver operating characteristic curve (AUROC) for all models. The findings underscore the potential of NLP in medical contexts and highlight the advantages of graph-based approaches in capturing complex, non-linear interdependencies inherent in patient data. This research marks a significant stride toward harnessing the power of NLP and graph theory in the predictive modeling of ADRs, aiming to improve patient outcomes by preempting potential adverse drug interactions.


INTRODUCTION
Adverse drug reactions (ADRs) stand at the forefront of pharmaceutical safety discussions, with profound implications for both patients' well-being and the economics of the healthcare sector. The European Medicines Agency (EMA) delineates an ADR as any noxious and unintended response to a medicinal product related to any dose [1]. In the U.S., the occurrence of ADRs correlates with a marked uptick in patient fatality rates: patients who have experienced an ADR have a 19.18% higher death rate, as reported by Bond [2]. In South Africa, 2.9% of medical admissions were due to adverse drug reactions [3]. In Australia, a strong correlation between increases in medication use and rates of ADR-associated hospitalization has been identified and analyzed [4]. However, many of these incidents are arguably preventable. This research sets out to apply graph-assisted machine learning algorithms to data from the Commonwealth Bank Health Society (CBHS) [5] to forge anticipatory defenses against the rise of ADRs.
The integration of machine learning into the realm of ADR prediction can be a game-changer for medical protocol by facilitating preemptive action. Investigative work by Pauwels et al. revolved around the chemical makeup of drugs to build predictive models using a spectrum of data analysis techniques, including k-nearest neighbours and support vector machines [6]. In parallel, Liu et al. employed a variety of machine learning classifiers, such as logistic regression and random forest, to interpret drug characteristic vectors [7]. Huang et al. used information from drug-target interactions [8], while Zhou and Uddin innovated with patient diagnostic histories to enhance prediction accuracy [9]. All of these methods rely on comprehensive datasets and complex algorithms. Despite these strides, the potential of natural language processing (NLP) for ADR prediction has been relatively untapped, notwithstanding its proven efficacy in analogous analytical contexts.
NLP is engineered to parse and understand human language and has been utilized in various fields, providing substantial benefits to customer service through intelligent chatbots and to human resources for optimizing job placements. Beyond human language, its versatility extends to the analysis of biological sequences in genomics [10], the scrutiny of programming code in software development [11], and the examination of musical compositions [12], all of which are not natural human language but nonetheless carry sequential information that can be learned and predicted. This investigation endeavors to harness NLP to extract patterns from patients' medical histories, viewing these histories as narratives that unfold over time, thereby bolstering the predictive detection of ADRs. In addition, the proposed method uses a novel graph-based approach to embed disease history information.

DATASET AND METHOD
This research analyzes health insurance claim data provided by CBHS, a private health insurer in Australia, covering the period from 1976 to 2018. The dataset includes de-identified records of 419,952 individuals, capturing demographic details such as age and gender, along with clinical information like the type of service (coded as ICD-9 or ICD-10), specific item numbers (ICD codes), the date of records, and the diagnostic procedure codes. The crux of this study lies in the utilization of sequenced ICD codes, which are abbreviations for disease classifications detailed in the 9th and 10th Australian modifications of the International Classification of Diseases. These codes chronicle the disease progression of patients over time. To maintain uniformity, this investigation focuses solely on patients recorded with ICD-10 code entries. Drawing on a systematic compendium of ICD-10 codes aggregated by Hohl [13] from 41 scholarly articles, we have been able to categorize patients and integrate their disease code histories for analysis via a novel graph-based learning approach. In the initial selection phase, 118,695 individuals were identified as having ICD-10 codes in their medical records, qualifying them for additional scrutiny. The selection criteria further narrowed down the subjects by: (1) including only patients with a minimum of two claim instances, and (2) selecting patients in line with the ICD-10 code review conducted by Hohl. These criteria resulted in a cohort of 1,660 patients for the subsequent phase of the research. The labeling process in this context focuses on individuals at risk of adverse drug reactions (ADRs) and the specific timing of these occurrences, taking into consideration previous ICD-10 codes that may precipitate an ADR. We analyzed the cohort of 1,660 patients, treating all ICD codes before each time point as a distinct record. For each individual, the ICD codes are systematically organized, first by the date of record and then in alphabetical order of the codes.
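The per-patient ordering described above (by date of record, then alphabetically by code) can be sketched as follows. The patient identifiers, dates, and ICD-10 codes here are illustrative placeholders, not values from the CBHS dataset:

```python
from datetime import date

# Hypothetical claim records: (patient_id, record_date, icd10_code).
claims = [
    ("p1", date(2015, 3, 2), "K52"),
    ("p1", date(2015, 3, 2), "E11"),
    ("p1", date(2014, 7, 9), "I10"),
    ("p2", date(2016, 1, 5), "J45"),
]

def build_histories(records):
    """Group claims by patient, sorted by date and then alphabetically by code."""
    histories = {}
    for pid, d, code in records:
        histories.setdefault(pid, []).append((d, code))
    for pid in histories:
        histories[pid].sort()  # tuples sort by date first, then code
    return histories

histories = build_histories(claims)
# histories["p1"] == [(date(2014, 7, 9), "I10"),
#                     (date(2015, 3, 2), "E11"),
#                     (date(2015, 3, 2), "K52")]
```

Relying on tuple ordering keeps the two-level sort (date, then code) in a single `sort()` call.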
The general method of this research is detailed in Figure 1. After processing the raw data, each patient is associated with an ICD code history covering the entire duration within this dataset. This history can be segmented at each ICD code, using the prior history within a specific window of time (7 days in this experiment) to predict the likelihood of a subsequent ADR. Based on such a history, or portions thereof, each ICD code is treated as a word and can be transformed into word embeddings to be input into machine learning models.
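A minimal sketch of the windowing step, under the assumption that "prior history" means codes recorded on strictly earlier dates within the 7-day window (the paper does not spell out how same-date codes are handled):

```python
from datetime import date, timedelta

def window_examples(history, window_days=7):
    """For each code in a dated history, collect the codes recorded in the
    preceding `window_days` (strictly earlier dates only) as its context."""
    examples = []
    for i, (d, code) in enumerate(history):
        lo = d - timedelta(days=window_days)
        context = [c for (dd, c) in history[:i] if lo <= dd < d]
        examples.append((context, code))
    return examples

history = [(date(2020, 1, 1), "A"), (date(2020, 1, 5), "B"), (date(2020, 1, 20), "C")]
examples = window_examples(history)
# examples == [([], "A"), (["A"], "B"), ([], "C")]
```

Codes older than the window (here, "A" and "B" relative to "C") are dropped from the context, mirroring the 7-day cutoff in the experiment.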
To find embeddings that can measure similarities between the words, which in this context are the ICD codes, we used Word2Vec. The Continuous Bag-of-Words model proposed by Mikolov et al. uses a continuous distributed representation of the context [14]. The main idea is that words appearing in similar contexts tend to have similar meanings. There are two primary training models for Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram. Given the prediction task in this study, CBOW is the appropriate choice because it is designed to predict a target word from its surrounding context; here, the prediction is based on the previous ICD-10 codes, which are regarded as the surrounding words. However, CBOW has the disadvantage of not taking the ordering of words into consideration, which might make it less effective in this study, as the ordering of ICD-10 codes is a key component of the prediction.
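To make the CBOW mechanism concrete, here is a minimal from-scratch sketch (in practice a library such as gensim's `Word2Vec` with `sg=0` would be used); the dimensionality, learning rate, and toy code sequences are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def train_cbow(sequences, dim=8, window=2, lr=0.05, epochs=30, seed=0):
    """Minimal CBOW: average the context vectors, softmax over the vocabulary,
    and take one gradient step per target position."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sequences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # word embeddings
    W_out = rng.normal(scale=0.1, size=(dim, len(vocab)))  # output layer
    for _ in range(epochs):
        for s in sequences:
            ids = [idx[w] for w in s]
            for pos, target in enumerate(ids):
                ctx = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not ctx:
                    continue
                h = W_in[ctx].mean(axis=0)          # averaged context vector
                scores = h @ W_out
                p = np.exp(scores - scores.max())
                p /= p.sum()                         # softmax probabilities
                p[target] -= 1.0                     # d(loss)/d(scores)
                grad_h = W_out @ p
                W_out -= lr * np.outer(h, p)
                W_in[ctx] -= lr * grad_h / len(ctx)
    return vocab, idx, W_in

# Toy ICD-10-like sequences (codes are illustrative only).
seqs = [["I10", "E11", "K52"], ["I10", "E11", "J45"], ["E11", "K52", "J45"]]
vocab, idx, emb = train_cbow(seqs)
```

Note that the context here is symmetric around the target; CBOW averages the context vectors, which is exactly why code ordering is lost, as discussed above.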
Based on the ICD histories, we can create a knowledge graph where nodes represent each ICD code. In this graph, an edge is formed between two nodes if the corresponding ICD codes appear consecutively or concurrently within a patient's disease history at any point in the dataset, signifying a temporal relationship between the codes for individual patients. The edges in this experiment are also weighted: the more frequently two codes are associated within patients' histories, the greater the weight of the edge. After the construction of this graph, an embedding method is employed to derive a representation for each node, which can then be concatenated with the word embeddings. In our experiment, Node2Vec is utilized to generate the node representations. Node2Vec is an algorithmic framework specifically designed for learning continuous feature representations of nodes in networks [15]. This approach efficiently explores diverse neighborhoods of nodes, utilizing a balance between breadth-first and depth-first search strategies; the core idea is to encapsulate the network's structure by representing nodes as low-dimensional vectors, thus preserving both local and global network properties.
In many datasets, especially in the medical domain, the distribution of classes can be heavily skewed. This is true here as well: we observed a significant disparity between the number of negative and positive labels. When one class heavily outnumbers the other, machine learning algorithms can become biased towards the majority class, often leading to misleadingly high accuracy but poor generalization to new data. Therefore, instead of using the entire set of negative labels, we randomly selected a subset of the negative cohort. This approach aims to create a more balanced dataset, allowing the machine learning algorithms to better discern patterns and make more accurate predictions.
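The weighted-edge construction can be sketched as below. This is an interpretation of the rule stated above: "concurrent" is taken to mean two codes recorded on the same date, "consecutive" to mean codes on adjacent dates in a patient's history, and the input format (one code list per date, in chronological order) is an assumption for illustration:

```python
from collections import defaultdict
from itertools import combinations

def build_icd_graph(patient_histories):
    """Weighted undirected graph over ICD codes: an edge links two codes that
    appear on the same date or on consecutive dates in a patient's history;
    its weight counts how often that association occurs across all patients."""
    weights = defaultdict(int)
    for days in patient_histories:      # days: chronological list of code lists
        for codes in days:              # concurrent: pairs on the same date
            for a, b in combinations(sorted(set(codes)), 2):
                weights[(a, b)] += 1
        for cur, nxt in zip(days, days[1:]):  # consecutive: adjacent dates
            for a in set(cur):
                for b in set(nxt):
                    if a != b:
                        weights[tuple(sorted((a, b)))] += 1
    return dict(weights)

g = build_icd_graph([[["I10", "E11"], ["K52"]],
                     [["I10"], ["K52"]]])
# g[("I10", "K52")] == 2 : I10 precedes K52 in both toy patients' histories
```

The resulting weight dictionary maps directly onto a weighted graph (e.g. NetworkX edges with a `weight` attribute) for the subsequent Node2Vec step.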

RESULTS
Using Word2Vec embeddings with different machine learning methods yields Table 1; adding Node2Vec embeddings produces Table 2.
The integration of Node2Vec with Word2Vec generally enhances model performance across all metrics, albeit with some tradeoffs in certain models. Notably, LR and SVM show the most significant improvements in accuracy (from 0.8065 to 0.8380 and from 0.8226 to 0.8428, respectively), suggesting that combining Word2Vec with Node2Vec is particularly beneficial for these models. There is a significant improvement in recall for KNN (from 0.8785 to 0.9243), indicating better identification of positive cases. There is also an improvement in AUROC for all models, suggesting that every model's ability to distinguish between the classes has been enhanced by the addition of Node2Vec embeddings.
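For readers unfamiliar with the reported metrics, recall and AUROC can be computed from predictions with a few lines of standard-library Python (the label vectors below are toy values, not results from the study):

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auroc(y_true, scores):
    """AUROC as the probability that a random positive case outranks a
    random negative one (Mann-Whitney U formulation; ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.1]))  # 1.0: every positive outranks every negative
```

This rank-based view of AUROC is why an improvement in it, as observed for all models, means better class separation regardless of the decision threshold.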

DISCUSSION
The utilization of word embeddings to represent ICD-10 codes offers a unique perspective on the potential associations between different medical conditions and ADRs. By converting these codes into dense vector representations, we can capture the inherent relationships and similarities between various medical conditions. This is analogous to how word embeddings in traditional NLP capture semantic relationships between words. The performance metrics obtained from our models suggest that these associations, as captured by the embeddings, play a significant role in predicting ADRs. The higher accuracy rates indicate that the embeddings effectively encapsulate the necessary information to discern patterns leading to ADRs.
In the realm of NLP, semantic analysis typically pertains to understanding the meaning and context of words within a text. When applied to ICD-10 codes, this concept takes on a different nuance. Instead of traditional linguistic context, the 'semantics' of an ICD-10 code pertains to the medical implications and intricacies it represents. By employing NLP techniques, we aim to understand the deeper medical contexts and relationships between different codes. The results from our models shed light on how effectively the word embeddings capture these medical 'semantics'.
The utilization of graphs in the field of ADR prediction underscores the significance of network properties in understanding complex drug interactions. Graphs inherently capture the interconnected nature of drug-taking processes, where disease and drug entities do not operate in isolation but rather influence each other in a vast network. When predicting ADRs, a graph approach can model the multifaceted relationships between different ICD codes, illustrating how a combination or sequence of ailments may lead to potential drug reactions. This method mirrors the holistic view required in clinical settings, where patient history is not merely a list of discrete events but a narrative with interdependencies. Furthermore, graphs can efficiently encode the temporal dimension of patient histories, beyond what the ICD code embeddings from Word2Vec capture, reflecting the chronological progression of medical events that could be pivotal for predicting ADRs.

CONCLUSION AND FUTURE WORK
The novel contribution of this experimental design lies in its application of NLP methodologies to datasets that do not conform to traditional human language. Additionally, the incorporation of graph representations has been validated as beneficial in such contexts. This approach is distinguished by its exclusive reliance on disease history, eschewing the demographic data and drug features commonly employed in more complicated algorithmic frameworks. The elegance of this method is evident in its streamlined nature and the remarkable predictive performance it has achieved.
Future research directions may include the application of advanced word and graph embedding techniques to enhance the model's predictive capability. It would also be prudent to investigate the impact of varying the temporal window for historical data analysis, expanding beyond the current 7-day window. A noted limitation of the present study is the non-sequential character of the dataset: because of the nature of claims, numerous entries are cumulatively recorded on the same dates, which could have reduced prediction accuracy.

Figure 1 :
Figure 1: General mechanism of the fusion method.

Table 1 :
Results of the baseline method using Word2Vec embeddings.

Table 2 :
Results of the proposed method using Word2Vec and Node2Vec embeddings.