Drug–Drug Interaction Relation Extraction Based on Deep Learning: A Review

Drug–drug interaction (DDI) is an important part of drug development and pharmacovigilance. DDI is also an important factor in treatment planning, monitoring drug effects, and patient safety, and it has a significant impact on public health. Therefore, using deep learning technology to extract DDI from the scientific literature has become a valuable research direction. In existing DDI datasets, the number of positive instances is relatively small, which makes it difficult for existing deep learning models to obtain sufficient feature information directly from the text. Existing models therefore rely mainly on feature supplement methods that collect additional feature information from different types of data. In this study, the general process of deep learning-based DDI relation extraction is introduced first for comprehensive analysis. Next, we summarize the various feature supplement methods and analyze their merits and demerits. We then review the state-of-the-art literature on DDI extraction from the deep neural network perspective. Finally, all the feature supplement methods are compared, and suggestions are given for addressing current problems and future research directions. The purpose of this article is to give researchers a more complete understanding of the feature supplement methods used in DDI extraction so that they can rapidly design and implement custom DDI relation extraction methods.


INTRODUCTION
In the treatment of patients, drug-drug interaction (DDI) affects the medication strategy and the effect of therapy, and can even pose a great threat to the health of patients and the public [137]. DDI refers to the biological effects, such as drug antagonism and synergy, of two or more drugs taken by a patient at the same time [20]. If the DDI information of a drug is not fully available, it may lead to negative consequences, namely, adverse drug reactions (ADRs) [22]. For example, chloroquine or hydroxychloroquine, which are commonly used in the treatment of COVID-19, may cause myopathy when taken with statins [45]. Most patients with COVID-19 are elderly people who often suffer from multiple diseases [33,139], including hyperlipidemia; thus, these patients may be taking statins at the same time [13]. If chloroquine or hydroxychloroquine is given to an elderly patient with COVID-19 without considering the other drugs the patient is taking, it is likely to cause serious ADR events. Therefore, the acquisition of DDI is an important clinical task for drug researchers and doctors [84].
At present, there are multiple DDI databases that can be accessed and searched by researchers, doctors, and patients, including Drug Interactions Facts [132], Stockley [12], DailyMed [30], WebMD [86], National Drug File [19], DrugBank [141], and TWOSIDES [131]. However, it is difficult to update these databases in time; thus, a large amount of DDI information remains buried in the vast biomedical literature [39,105]. If this information cannot be fully extracted and utilized, the huge investment of previous researchers is wasted. Extracting DDI information quickly and accurately from unstructured text can keep DDI databases up to date, reduce the time and energy researchers spend on literature retrieval [119], and provide DDI information to drug developers and patients in a timely fashion.
With the continuous progress of technology, a large amount of medical literature is published every day [71]. In order to find the key DDI information in massive text data, many machine learning methods have been proposed [15,16,29,47,67,112,133]. Since deep learning has shown good performance in computer vision and Natural Language Processing (NLP) [39,49,80,123,130], the major DDI relation extraction methods in recent years are all based on deep learning. They can be divided into Convolutional Neural Network (CNN)-based, Recurrent Neural Network (RNN)-based, Recursive NN-based, and Graph Convolutional Network (GCN)-based methods. The CNN-based method has a simple network structure for which there is almost no need to define features manually. It extracts the context information of the target drugs well by using multi-scale convolution windows; thus, it shows great potential [82]. The RNN-based method is suitable for sequence data because it can capture the connection between words with a large span in long sentences [159]. The structure of natural language sentences is recursive; thus, some scholars believe that Recursive NN-based methods can understand natural language sentences more effectively [76]. GCN-based methods can make full use of dependency information and integrate it into word representations to obtain potential feature information [143].
Besides the meanings of its words, the semantics of a sentence is also influenced by the structural combination of words [99]. In addition, each word and grammatical structure may contain complex semantic information [74] or semantic ambiguity [43,66,152], which seriously affects the subsequent processing of sentences, especially long and complex ones [98]. Thus, a deep learning model finds it hard to extract complete feature information from a sentence, and harder still to obtain the real relationship between the two target drugs in the sentence. A feature supplement method obtains, through other methods, tools, or data, additional feature information that a deep learning model cannot extract directly from the original text. Feature supplement methods therefore allow the computer to deeply understand the connection between each word and its surrounding words in the sentence, as well as the inner meaning of the word itself. In summary, related NLP technology has been used for DDI text extraction [44,68,78,83], and various feature supplement methods applicable to DDI relation extraction have been proposed, such as word embedding [70,87,88] and position embedding [5]. With the combination of deep learning and various feature supplement methods, deep learning-based DDI relation extraction technology has made continuous progress.
The rest of the article is organized as follows. In Section 2, we summarize the general process of DDI text extraction based on deep learning. In Section 3, a complete summary of the literature in recent years is made from the perspective of feature supplements. Section 4 contains a review of deep learning frameworks. Section 5 compares the performance of each method. Finally, in Section 6, the existing challenges and future research prospects are discussed.

GENERAL PROCESS OF DDI RELATION EXTRACTION
In this section, we summarize the general process of DDI relation extraction according to the existing deep learning-based methods. Limited by article length, many papers focus on highlighting their own innovations without a complete description of the process. Hence, this section aims to provide a complete DDI text extraction framework for researchers.

Preprocessing
In DDI relation extraction, we define all operations before the data enters the embedding layer as preprocessing, which has an important impact on the subsequent computation [63,64]. Preprocessing generally includes tokenization, lowercasing, drug blinding, digit string processing, negative instance filtering, and sentence length fixing.
Tokenization is the process of segmenting the original text sentence, splitting a long string into several small strings [124,128]. Then, all letters are converted to lowercase. The purpose is to group terms containing the same information correctly and to obtain features with lower dimensions but greater differences [136]. These steps are designed to facilitate subsequent operations on each word or phrase and commonly rely on the Natural Language Toolkit [85,116,155].
Drug blinding is the process of replacing the target drugs in a sentence with simple words [59]. In this operation, DRUG1, DRUG2, and DRUGN are generally used to represent the drugs of interest and non-interested drug entities, respectively [145], to maintain the generalization of the model and avoid embedding errors caused by complex drug names. Some researchers instead use (DRUG1, DRUG2, DRUGOTHER) [8], (Drug1, Drug2, DrugOther) [7], or (drug1, drug2, drug0) [125] to represent the corresponding drugs. During drug blinding, multiple drug entities may share the same prefix or suffix. If no special treatment is adopted, such drug entities will not be found. New researchers find it difficult to detect this situation because they are not familiar with the original dataset, and it is rarely mentioned in existing papers; as a result, researchers often spend a lot of time and energy finding and correcting errors in data preprocessing. For example, the phrase "inactivated ND and AI vaccines" stands for the two target drugs inactivated ND vaccines and inactivated AI vaccines [28]. If we search the sentence directly for "inactivated ND vaccines" and "inactivated AI vaccines", the target drugs cannot be found. In the process of drug blinding, we must write special programs to handle this situation according to our own conventions.
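As a minimal sketch of the drug blinding step (the function name and token-index interface are illustrative, not from any cited implementation), the placeholder replacement might look like:

```python
def blind_drugs(tokens, target1_idx, target2_idx, other_idxs):
    """Replace the two target drugs with DRUG1/DRUG2 and every other
    (non-interested) drug mention with DRUGN."""
    out = list(tokens)
    out[target1_idx] = "DRUG1"
    out[target2_idx] = "DRUG2"
    for i in other_idxs:
        out[i] = "DRUGN"
    return out

sent = "chloroquine may cause myopathy when taken with statins".split()
print(blind_drugs(sent, 0, 7, []))
# -> ['DRUG1', 'may', 'cause', 'myopathy', 'when', 'taken', 'with', 'DRUG2']
```

A real implementation must also resolve shared prefixes or suffixes (the "inactivated ND and AI vaccines" case above) before this token-level replacement can locate the entities.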
In the original text, numbers will inevitably appear. To express the meaning of numbers better, all numbers appearing in sentences are replaced with "NUM" [144]; "DG" is used by some scholars to represent numbers as well [114]. This method improves the generalization of the model and avoids word segmentation errors. In order to distinguish the types of numbers, some researchers denote integers with "num" and real numbers with "float" [158].
Among DDI relation extraction corpora, DDI Extraction 2013 has the following problems: (1) the number of positive instances is small, and (2) the ratio of negative to positive instances is too large (5.8:1 in the training set) [82], which seriously affects the performance of deep learning network models [28,126]. To reduce the negative effects of sample imbalance, most articles use negative instance filtering to remove negative instances (instances in which the two target drugs in the same sentence do not interact with each other). The common filtering rules are as follows: (1) in the same sentence, the two target drugs have the same name; (2) in the same sentence, one target drug is the abbreviation or another special form of the other target drug; (3) in the same sentence, the two target drugs appear in the same parallel structure (which contains one or more non-interested drugs); (4) in the same sentence, one target drug is a special case of the other target drug.
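A sketch of the first two filtering rules, assuming the drug names and sentence strings are already available (the abbreviation check shown is a simplification; real filters use more patterns):

```python
def is_trivial_negative(drug1, drug2, sentence):
    """Rules (1)-(2): same name, or one drug is introduced as the
    abbreviation of the other, e.g. 'acetylsalicylic acid (ASA)'."""
    if drug1.lower() == drug2.lower():                     # rule (1)
        return True
    if (f"{drug1} ({drug2})" in sentence
            or f"{drug2} ({drug1})" in sentence):          # rule (2)
        return True
    return False

print(is_trivial_negative("aspirin", "Aspirin", "..."))                    # True
print(is_trivial_negative(
    "acetylsalicylic acid", "ASA",
    "acetylsalicylic acid (ASA) reduces fever"))                           # True
print(is_trivial_negative("aspirin", "warfarin", "aspirin and warfarin"))  # False
```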
In the corpus, sentence lengths are not fixed. To facilitate subsequent operations, the length of all sentences is fixed before the embedding operation: sentences that are too long are truncated, and "#" tokens are appended to sentences that are too short [81]. Although an RNN can handle variable-length sentences, RNN-based articles do not mention whether sentence length needs to be fixed; in the code published with one RNN-based article, we did find a sentence length fixing step [114].
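The truncation-and-padding step above can be sketched as follows (the maximum length and pad token are illustrative):

```python
def fix_length(tokens, max_len, pad="#"):
    """Cut sentences longer than max_len; append '#' to shorter ones."""
    return tokens[:max_len] + [pad] * max(0, max_len - len(tokens))

print(fix_length(["a", "b", "c", "d", "e"], 3))  # ['a', 'b', 'c']
print(fix_length(["x"], 3))                      # ['x', '#', '#']
```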

Complete Process of DDI Relation Extraction
The entire DDI relation extraction process includes (1) data preprocessing, (2) word embedding, (3) inputting the features of the training data into the user-defined network, (4) training the network and obtaining the final model, and (5) processing the test data in the same way as the training data and then feeding it into the final model. The detailed process is shown in Figure 1.
Since computers cannot understand natural language directly, scholars use word embedding in NLP to transform words into real-valued vectors, which can capture some potential semantic and grammatical information by learning from a large amount of unlabeled data [70,87,88]. Word embedding reflects the topical similarity between words and improves the generalization of words. Compared with one-hot vectors [18,106], word embedding requires a lower dimension and contains denser information. In addition, it takes into account the semantic relationship between words [26]; thus, it has been widely used in DDI relation extraction. The word embeddings in existing work can be roughly divided into two categories from the perspective of generation: the first is trained by the researchers themselves [82], and the second is provided by other researchers [52]. In terms of training methods, they can be roughly divided into three categories: word2vec/GloVe [26,52,75,92,104,120], character-level embedding [6,65,115], and BERT (Bidirectional Encoder Representations from Transformers) [4,34,56,138]. Among them, BERT embedding has recently been applied to DDI relation extraction and has achieved good results [9]. The datasets involved in word embedding training generally include the abstract data of PubMed containing the keyword "drug" [21,156], the DDI Extraction 2013 corpus [51,118], PMC literature data, MEDLINE, Wikipedia, the combination of Wikipedia and PubMed, and the combination of PMC and PubMed [91]. In practice, some scholars adopt multiple word vector models for word embedding [35] to improve the performance of DDI relation extraction.
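Whichever model produces the vectors, the embedding layer itself reduces to a table lookup with an out-of-vocabulary fallback. A toy sketch (the vocabulary, 50-dimensional size, and random table stand in for real word2vec/GloVe/BERT vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"DRUG1": 0, "DRUG2": 1, "inhibits": 2, "<UNK>": 3}
emb_table = rng.standard_normal((len(vocab), 50))  # toy values

def embed(tokens):
    """Map tokens to their vectors; unknown words fall back to <UNK>."""
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
    return emb_table[ids]                          # shape (len(tokens), 50)

print(embed(["DRUG1", "inhibits", "DRUG2"]).shape)  # (3, 50)
```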
Researchers can add a variety of features in the word embedding stage according to their own interests, including position embedding and Part-Of-Speech (POS) embedding [96,97] (detailed in the next section), to obtain more abundant input features. Researchers can also construct a custom network in the Network section according to the actual situation. For example, the Attention mechanism [58,60,134] can be added to the word embedding.
Almost all deep learning models used for DDI relation extraction adopt various feature supplement methods. To display the table better, all the feature supplement methods introduced in the third section are abbreviated as shown in Table 1; for the internal supplement methods, Position and POS Embedding are marked as I1 and I2, respectively, and the remaining feature supplement methods are abbreviated in the same way. In Table 2, we have summarized the relevant literature, focusing on the basic network framework, the feature supplement method adopted, and whether the code is available to researchers.

FEATURE SUPPLEMENT METHOD FOR DDI RELATION EXTRACTION
Word embedding technology can obtain syntactic and semantic information from a large-scale unlabeled corpus and has shown good results in many NLP tasks. Unlike the pixel values in images, every word and syntactic structure in DDI text data contains rich semantic, grammatical, and domain information [27,46,94]. Hence, it is difficult to mine enough features from text data by word embedding alone. To obtain enough feature information from text data, many feature supplement methods have been proposed, which can be divided into internal supplement methods and external supplement methods.

Internal Supplement Method
Text data contains rich information. For example, a word may carry a lot of domain knowledge, and a syntactic structure may determine the relationship between two words. Thus, fully mining the feature information contained in a sentence is very important for improving network performance [102]. The internal supplement method, which adopts certain rules or auxiliary tools to find potential feature information within the text data (e.g., DDI Extraction 2013) itself, has been widely applied in NLP.

Position Embedding.
The DDI relation extraction task aims to determine the relationship between two drug entities in a sentence. However, the positional relationship between the two target drugs cannot be obtained by relying on word embedding alone. Zeng et al. [5] proposed position embedding so that the network can judge the importance of each word in the sentence (the closer to a target drug, the more important) by obtaining the distances [d1, d2] of each word relative to the two target drugs, as shown in Figure 2. Generally, d1 and d2 are converted into 10-dimensional vectors w_p1 and w_p2, respectively. Then, w_p1 and w_p2 are concatenated with the vector w_i obtained by word embedding to compose the final word representation w_i_emb = [w_i; w_p1; w_p2]. All the words in the sentence are converted into w_i_emb and then entered into the subsequent network for the next step. Since position embedding improves the performance of DDI relation extraction, it is adopted by most studies. When the relative position information is converted into a real vector, the general procedure is to first obtain the maximum value (denoted as P_max) and the minimum value (denoted as P_min) of the relative distance. Then, the absolute value of P_min is added to all relative distance values so that negative numbers are eliminated. After setting the dimension of the position embedding, all possible values can be initialized randomly by using related library functions.
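The distance computation and the P_min shift described above can be sketched as follows (function names are illustrative, not from any cited implementation):

```python
def relative_distances(tokens, i1, i2):
    """d1/d2: signed distance of every token to the two target drugs
    at positions i1 and i2."""
    return [(i - i1, i - i2) for i in range(len(tokens))]

def to_embedding_indices(dists):
    """Shift all distances by |P_min| so they can index an embedding table."""
    p_min = min(min(pair) for pair in dists)
    return [(d1 - p_min, d2 - p_min) for d1, d2 in dists]

d = relative_distances(["DRUG1", "inhibits", "DRUG2"], 0, 2)
print(d)                        # [(0, -2), (1, -1), (2, 0)]
print(to_embedding_indices(d))  # [(2, 0), (3, 1), (4, 2)]
```

Each shifted index then selects a row of a randomly initialized 10-dimensional position embedding table, as described in the text.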

POS Embedding.
When feature information is extracted from a natural language sentence, the deep learning network cannot clearly distinguish the part of speech (POS tag [69]) of each word, which is significant for relation extraction of drug entities. POS embedding uses existing natural language parsing tools, such as the Stanford parser [17,85], to label the POS of each word in a sentence. Then, the labeled symbols are mapped to a low-dimensional vector w_i_POS. The final word representation (word embedding + POS embedding) can be written as w_i_emb = [w_i; w_i_POS]. In this way, the original text data can be supplemented with POS feature information. Figure 3 shows the result of using Stanford CoreNLP, where "JJ", "NN", "IN", "CC", "VBZ", "RB", and "VBN" represent adjectives, nouns, prepositions or subordinating conjunctions, coordinating conjunctions, third-person-singular verbs, adverbs, and past participles, respectively. In addition, different parsing tools have different POS tag sets. For example, the Stanford parser produces 36 types of POS tags, while the Enju parsing tool produces 37 [89,90,158].
There are two ways to convert POS tags into vectors. The first transforms the POS tags into vectors using the pre-trained word vector model. The second groups all POS tags and converts each tag into a vector following the same procedure as position embedding.
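Assuming the parser has already produced (token, tag) pairs, the second conversion method amounts to a tag-indexed lookup table. A sketch (the reduced tag set and 10-dimensional size are illustrative):

```python
import numpy as np

TAGS = ["JJ", "NN", "IN", "CC", "VBZ", "RB", "VBN"]
tag2id = {t: i for i, t in enumerate(TAGS)}
rng = np.random.default_rng(1)
pos_table = rng.standard_normal((len(TAGS), 10))   # randomly initialized

def pos_vectors(tagged):
    """Look up a 10-dim vector for each (word, POS-tag) pair."""
    return np.stack([pos_table[tag2id[tag]] for _, tag in tagged])

tagged = [("aspirin", "NN"), ("inhibits", "VBZ"), ("strongly", "RB")]
print(pos_vectors(tagged).shape)  # (3, 10)
```

Each w_i_POS row would then be concatenated with the corresponding word vector w_i, as in the formula above.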

Feature Supplement Method Based on Dependency.
In natural language sentences, the syntactic structure can provide important semantic information, just as words do. There is a dependency [122,143] between words in a sentence. This relationship connects all the words according to the grammatical structure, forming a dependency tree (also called the dependency structure of a sentence) [147,148]. Figure 4 shows the dependency tree obtained by Stanford CoreNLP, where "Root" represents the sentence that needs to be processed.
The dependency parse tree of a sentence contains rich syntactic information, which can help the network understand the true meaning of the sentence more accurately. To improve the performance of DDI text extraction, many methods have been proposed to make full use of the dependencies in sentences. A common method is to obtain the dependency tree of a sentence and then input the tree into a GCN for feature extraction [72,143]. Some scholars reorder sentences according to the dependency tree and special traversal rules; a network is then used to extract feature information [140]. In addition, some scholars employ the dependency parse tree of the sentence as the basis for judging the importance of each word [76].
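In the GCN setting, the dependency tree is usually consumed as an adjacency matrix. A sketch, assuming the parser output is available as a head-index list (a common dependency format; -1 marks the root):

```python
import numpy as np

def dep_adjacency(heads):
    """Symmetric adjacency matrix with self-loops, built from a
    head-index list (heads[i] = head of token i, -1 for the root)."""
    n = len(heads)
    adj = np.eye(n)                        # self-loops
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i, h] = adj[h, i] = 1.0    # undirected dependency edge
    return adj

# "aspirin inhibits clotting": 'inhibits' is the root, the others attach to it.
print(dep_adjacency([1, -1, 1]))
```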

Feature Supplement Method Based on SDP.
In a DDI corpus, most sentences are long and complex. This may prevent the network from obtaining the key feature information in the sentence during training and from understanding the semantics of the sentence correctly. To reduce the adverse effects caused by complex sentences, researchers use natural language parsing tools to simplify sentences. The most important strategy is the SDP (shortest dependency path) [73,146], the shortest path between the two target drugs in the dependency structure of a sentence. The process of finding it is as follows. First, use grammar analysis tools to obtain the grammatical dependency relationships of all words in the current sentence. Second, find all paths between the two target entity nouns. Finally, select the shortest of these paths as the shortest dependency path between the two target entity nouns. This process is shown in Figure 5, where "amod", "det", "nsubj", "dobj", "poss", "infmod", "advmod", "aux", and "prepto" represent adjectival modifiers, articles, noun subjects, direct objects, possessive cases, infinitival modifiers, adverbial modifiers, auxiliary words, and prepositions, respectively. Thus, the SDP contains important semantic and grammatical information for relation classification. A common approach is to input the SDP of the original sentence into the network and concatenate the obtained features with other features for feature extraction [154,156].
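The three-step SDP search above is an ordinary shortest-path problem over the (undirected) dependency edges. A self-contained BFS sketch (the edge list and node labels are illustrative):

```python
from collections import deque

def shortest_dep_path(edges, src, dst):
    """BFS over undirected dependency edges; returns the token sequence
    of the shortest dependency path between the two target entities."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:                   # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                           # no path between the entities

edges = [("DRUG1", "enhances"), ("enhances", "increases"),
         ("increases", "DRUG2"), ("enhances", "Only")]
print(shortest_dep_path(edges, "DRUG1", "DRUG2"))
# -> ['DRUG1', 'enhances', 'increases', 'DRUG2']
```

Because a dependency parse is a tree, the path between two tokens is unique; BFS simply recovers it without enumerating all paths.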
Both SDP-based and dependency-based feature supplement methods are implemented using NLP parsing tools. It should be noted that the two methods are different. The latter utilizes all the dependency information contained in the sentence, whereas the former uses only the shortest dependency path between the two entities. In addition, the purposes of the two methods differ. The dependency-based method aims to obtain more grammatical information from the text in order to reduce the occurrence of text ambiguity. The SDP-based method aims to simplify complex and lengthy sentences in order to reduce the noise and errors contained in the data.

Feature Supplement Method Based on Network Combination.
For the same sentence, the features extracted by different network structures are different. To obtain more feature information, researchers adopt the feature supplement method based on network combination to process text data. There are three commonly used approaches. The first adopts bidirectional long short-term memory (Bi-LSTM) or a bidirectional gated recurrent unit (Bi-GRU) to extract preliminary features from sentences and perform specific processing; the processed features are then input into the next network for feature extraction and classification (also called semantic embedding) [92,127,143,157], as shown in Figure 6(a). The second extracts features using a variety of networks with different structures, and the obtained features are concatenated and input into the subsequent network layer [52,114], as shown in Figure 6(b). In the third, different types of data are input into different networks for feature extraction; the obtained features are then concatenated and transferred to the subsequent network layer [81,140], as shown in Figure 6(c).
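The second combination approach reduces to concatenating the fixed-size outputs of two encoders. A shape-level sketch with stand-in encoders (real work would use trained CNN and RNN layers; the pooling functions here only mimic their output shapes):

```python
import numpy as np

rng = np.random.default_rng(2)

def cnn_branch(x):   # stand-in for a CNN encoder: max-pooling over time
    return x.max(axis=0)

def rnn_branch(x):   # stand-in for an RNN encoder: mean over time steps
    return x.mean(axis=0)

x = rng.standard_normal((12, 50))   # (sentence length, embedding dim)
fused = np.concatenate([cnn_branch(x), rnn_branch(x)])  # method (b)
print(fused.shape)  # (100,)
```

The fused vector is what the subsequent (e.g., fully connected) layer would consume.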

Feature Supplement Method Based on Special Preprocessing.
To capture more potential feature information in a sentence, a variety of feature supplement methods based on special preprocessing have been proposed. In the preprocessing stage, these methods process or label the data in a specific way to supplement more feature information.
Some scholars segment the sentences in the DDI Extraction 2013 dataset according to the positions of the two target drugs and obtain five kinds of data: left clause, e1, middle clause, e2, and right clause. By extracting the features of these five types of data separately, the impact of long sentences can be reduced [92,156], and the network also obtains extra location information for each word (as shown in Figure 7(a)). To capture the locations and drug types of the two drug entities, some researchers use "<e1i>", "</e1i>" and "<e2i>", "</e2i>" to mark the location and type information of the target drugs [95], where the index of the drug type is specified as: 1, drug; 2, brand; 3, group; 4, drug_n. For example, "ibogaine" and "cocaine" are the target drugs in "Only ibogaine enhances cocaine-induced increases in accumbal dopamine". The original sentence then becomes "Only <e13> ibogaine </e13> enhances <e20> cocaine </e20> -induced increases in accumbal dopamine". In order to obtain more positional information about the tokens in the sentence, some researchers assign a skeleton tag to each token [61]. There are four types of skeleton tags: R (red box area), BG (cross area between the blue and green boxes), B (blue box area), and G (green box area), which represent the target drugs, internal words, surrounding words, and internal surrounding words, respectively (as shown in Figure 7(b)). The skeleton tag of each word (converted to a vector using embedding) is concatenated with the position embedding feature as part of the final input.
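The five-segment split on a blinded sentence can be sketched as follows (the target indices are assumed known from preprocessing; the function name is illustrative):

```python
def five_segments(tokens, i1, i2):
    """Split a sentence around the two targets into
    (left clause, e1, middle clause, e2, right clause)."""
    return (tokens[:i1], tokens[i1], tokens[i1 + 1:i2],
            tokens[i2], tokens[i2 + 1:])

toks = "Only DRUG1 enhances DRUG2 induced increases".split()
print(five_segments(toks, 1, 3))
# -> (['Only'], 'DRUG1', ['enhances'], 'DRUG2', ['induced', 'increases'])
```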

External Supplement Method
Feature Supplement Method Based on Text Data.
Extracting DDI from the literature with a deep learning method requires data of sufficient scale and quality. Constrained by cost, the existing annotated datasets are small in scale, and their update speed is relatively slow. In contrast, some external text data resources are updated quickly and have sufficient scale. If the rich feature information contained in external datasets can be fully utilized, the performance of DDI text extraction will be further improved. Based on DDI Extraction 2013, previous scholars introduced User-Generated Content (UGC) embedding. This method first generates UGC documents from the comments in the SIDE EFFECTS and COMMENTS columns of the Ask a Patient website, then searches the UGC documents for content related to the drugs contained in DDI Extraction 2013. Finally, the UGC embedding is obtained by training [145] (as shown in Figure 8). In addition, some scholars apply the text data in DrugBank to supplement feature information. For the drug entities in DDI Extraction 2013, they first search DrugBank to find the description text corresponding to each drug entity. Then, [9] converts each token of the drug description text into a vector through SciBERT and uses a CNN to extract the semantic features in it (as shown in Figure 9(a)). In contrast to [9], [163] treats the drug description text as a document and adopts Doc2vec to obtain relevant feature information about drug entities (as shown in Figure 9(b)).

Feature Supplement Method Based on Molecular Structure Information.
In addition to external text data, the drug molecular structure information in external data resources can be used for feature supplementation. There are two main types of applications of molecular structure information: two-dimensional (2D) molecular structure-based methods and three-dimensional (3D) molecular structure-based methods. For the former, researchers first extract the SMILES (Simplified Molecular Input Line Entry System) string [40,107] of a drug entity from the DrugBank database. Then, the SMILES string is converted into a two-dimensional molecular structure graph by RDKit (an open-source cheminformatics toolkit). Based on the molecular Graph Neural Network (GNN) method proposed in [135], the representations of all atomic nodes within a molecular radius of 2 are obtained. Finally, the molecular structure features of the drug entity are obtained by accumulating all node representations and inputting them into a fully connected layer [9,37], as shown in Figure 10.
For the method based on the 3D molecular structure, researchers need to obtain the SMILES string of a drug entity from the PubChem database. Then, MMFF94 (a commonly used force field optimization method that provides energy-optimized conformations of compounds in space) in RDKit is adopted to obtain the energy-optimized conformation of the drug molecule in 3D space. Finally, the molecular structure features of the drug entity are obtained by a 3D GNN [48], as shown in Figure 11.

Biomedical Embedding.
The text data in DDI relation extraction contains a large number of biological entities and syntactic blocks, which carry a wealth of biomedical knowledge.
Because domain knowledge has a strong connection with the true semantics of a sentence, mining the potential domain knowledge in the sentence is important for DDI relation extraction. Researchers proposed biomedical embedding [62] for DDI text extraction, including stem embedding, chunk embedding [103,111], and entity embedding (also called concept embedding) [42,55,101]. Among them, stem embedding obtains the stem information of a word, chunk embedding is designed to find the chunk information of the current word, and entity embedding utilizes the biological concept information of the current noun, as shown in Figure 12 ("NP", "VP", and "PP" represent noun phrases, verb phrases, and prepositional phrases, respectively). It should be noted that these three biomedical embeddings are trained on MEDLINE. A recent work [10] constructed heterogeneous networks using various types of biomedical information. The features produced by such assisting networks are fed, together with the features extracted from the raw medical texts, into a primary network (a network that specifically processes the features extracted from the raw medical texts and the features extracted from the assisting networks) to obtain the final extraction results. Almost all DDI relation extraction work consists of a primary network and several assisting networks, as shown in Figure 13. Thus, our network classification is based on the primary network structure; that is, the type of assisting network structure is not considered when classifying the network structure.

DDI Relation Extraction Based on CNN
A CNN is one of the most representative deep learning models and has been applied in many fields, such as image classification, image segmentation, and natural language processing. In general, a basic CNN consists of convolutional layers, pooling layers, and several fully connected layers. Compared with traditional machine learning methods, a CNN can automatically extract feature information from input data without handcrafted features [150]. This endows the CNN with a powerful feature extraction ability and excellent performance. In 2016, a CNN was first applied to DDI relation extraction and achieved great performance improvements [82]. Due to the simple structure and excellent performance of CNNs, some DDI relation extraction work still uses CNN networks.

CNN.
Asada et al. [9] proposed a method that utilizes external drug information to assist DDI relation extraction. Specifically, the drug description text and molecular structure information of the drug entities are retrieved from external databases. Then, the external features of the drug entities are extracted from the drug description text and the molecular structure through a CNN and a GNN, respectively. For the text data in DDI Extraction 2013, the SciBERT pre-trained word vector model is used to convert the words in the sentence into vectors. A CNN is then used to extract the semantic features, which are the internal features of the drug entity. Finally, the two kinds of features are combined and input to the softmax layer to obtain the extraction result. He et al. [48] introduced 3D molecular structure information to fully mine the hidden drug attribute information in molecular structure data for the DDI relation extraction task. For the input text data, SciBERT is also used to represent the words in the sentence, and a CNN is then applied to mine the semantic features of the drug entities from the sentence. The structure of the CNN-based method is shown in Figure 14.

Dilated Convolutions.
In addition to traditional CNNs, dilated convolutions have been applied to DDI relation extraction. Sun et al. [127] proposed a novel Recurrent Hybrid Convolutional Neural Network (RHCNN) for DDI relation extraction. This method obtains more accurate semantic embeddings through word embedding and context information. Using a CNN and dilated convolutions simultaneously, sentence-level features composed of local context features and dependency features between separated words are learned from consecutive words, respectively, as shown in Figure 15. Finally, Improved Focal Loss is used to alleviate the negative impact of data imbalance.

DDI Relation Extraction Based on RNN
Similar to a CNN, an RNN is one of the most classic network frameworks in deep learning and is currently used mainly in NLP and related fields. An RNN can be viewed as multiple copies of the same network, where each copy transmits a message to the next [3]. This structure makes an RNN well suited to processing sequence data. However, an RNN also suffers from the vanishing gradient problem. Hence, in practical applications, researchers often adopt improved variants of the RNN, such as long short-term memory (LSTM) and the GRU. In 2017, Wang et al. [140] applied LSTM to DDI relation extraction for the first time, employing it to extract feature information from training data and the corresponding dependency parse trees. Because of this suitability for sequences, many researchers add an RNN to their DDI relation extraction models in order to accurately extract the semantic features of drug entities contained in text data.
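The message-passing structure can be made concrete with a vanilla RNN step; this is a generic textbook sketch (all names are illustrative), and it also shows where the vanishing gradient problem comes from, since the same tanh-squashed transformation is composed once per word:

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One vanilla RNN step: h' = tanh(Wx . x + Wh . h + b)."""
    return [math.tanh(
                sum(w_x[i][j] * x[j] for j in range(len(x))) +
                sum(w_h[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def run_rnn(seq, w_x, w_h, b):
    # Each step passes its hidden state to the next: the "former
    # network transmits messages to the latter network".
    h = [0.0] * len(b)
    for x in seq:
        h = rnn_step(x, h, w_x, w_h, b)
    return h
```

LSTM and GRU cells replace the single tanh update with gated updates so that gradients survive over long sentences.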

RNN.
In recent years, RNN-based methods have been widely used in DDI relation extraction tasks. Zhu et al. [163] proposed a DDI relation extraction method based on a multi-entity-aware attention mechanism. This method adopts BioBERT to convert each word into a vector representation and then fuses the word representations through a single Bi-GRU layer to obtain a sentence representation. Finally, after passing through a multi-layer multi-entity-aware attention layer, the final extraction result is obtained. Fatehifar and Karshenas [41] proposed a new DDI extraction method that concatenates three different features (words, POS tags, and distances) and inputs them into a Bi-LSTM layer. Finally, a new hybrid attention mechanism based on word similarity and position is used to highlight the output of key words. Shi et al. [121] adopted a similar approach: the features generated from word embedding and POS embedding are concatenated and input into a Bi-LSTM network, and finally fused with the dependency information of the text to obtain the final result. Similar to [163], Chen et al. [25] used BioBERT to obtain the vector representation of each word and adopted the Key Semantic Sentence (KSS) method to filter out unimportant words in the sentence while retaining the most important DDI interaction information. Then, through a Bi-LSTM layer and multiple fully connected layers, the final extraction result is obtained. In order to learn additional word features and avoid overfitting, Wu et al. [142] proposed a stacked Bi-GRU network framework (SGRU-CNN) to complete the DDI relation extraction task. The method applies a stacked Bi-GRU network and a CNN to lexical information and entity location information, respectively. The SGRU-CNN model assigns a weight to each word feature through an attentive pooling layer to improve performance. Inspired by the stacked Bi-GRU network [142], Zaikis and Vlahavas [151] proposed a stacked Bi-LSTM network structure: the word embedding, POS label embedding, and distance embedding are fed into a two-layer Bi-LSTM and a CNN, respectively, and the final result is then obtained through an attention layer. Deng et al. [31] proposed a joint learning framework based on Bi-LSTM that fully utilizes the connections between words by designing four different tasks. The structure of the RNN-based method is shown in Figure 16.

Pack Bi-GRU.
In addition to the common RNN networks, some improved RNN methods have been proposed. Huang et al. [56] proposed the EGFI method, which uses BioBERT to obtain sequence and semantic representations of textual sentences. EGFI is then adapted to the target data through a multi-head attention mechanism and a Pack Bi-GRU. Finally, combining these with features of text sentences generated by BioGPT-2 effectively improves the extraction performance of EGFI. The structure of a Pack Bi-GRU is shown in Figure 17.

DDI Relation Extraction Based on GCN
With the continuous development of deep learning, a neural network that can operate on arbitrary graph structures (knowledge graphs, social networks, molecular structures, etc.) has been proposed: the graph convolutional network (GCN). A typical GCN uses convolution operators, recurrent operators, sampling modules, and skip connections to propagate information at each layer, and then adds a pooling module to extract high-level feature information [161]. In contrast to general image and text data, graph-structured data models a group of objects (nodes) and their relationships (edges). Therefore, CNNs and RNNs cannot directly process graph-structured data. For the drug entity dependencies existing in sentences, a GCN can be adopted to capture the relationships between distant nodes and reduce the influence of noise.
At present, in DDI relation extraction, the application of a GCN as a primary network revolves around sentence dependencies. To fully exploit the dependency information of sentences, Xiong et al. [143] proposed a method combining a GCN with a single-layer Bi-LSTM: all words are first converted into vectors through a pre-trained word vector model and then fed into one Bi-LSTM layer to obtain the final word representations. Finally, the relationship between drug entities is extracted from the dependency parse tree using a GCN. Zhao et al. [157] proposed a method combining a Bi-GRU and a GCN, using them to automatically learn the sequential representation and the syntactic graph representation of sentences, respectively. Park et al. [100] proposed an attention-based GCN named AGCN. The AGCN employs a GCN to model the dependency parse tree as a graph structure and introduces a new attention-based pruning strategy that optimizes the use of grammatical information while ignoring irrelevant information. The application of a GCN in the primary network for DDI relation extraction is shown in Figure 18.
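The propagation step these methods build on aggregates each node's neighborhood over the dependency graph and applies a shared linear transform, roughly H' = ReLU(Â H W). The sketch below is a simplified, generic GCN layer in plain Python, not code from any cited paper; neighbor averaging stands in for the usual symmetric normalization:

```python
def gcn_layer(adj, feats, weight):
    """One simplified GCN propagation step over a dependency graph.

    adj: adjacency lists with self-loops (node -> neighbors incl. itself).
    feats: one feature vector per node (e.g., Bi-LSTM word states).
    weight: input-dim x output-dim matrix shared by all nodes.
    """
    n_out = len(weight[0])
    new_feats = []
    for node, neigh in enumerate(adj):
        # Average the neighbor features (including the node itself).
        agg = [sum(feats[m][d] for m in neigh) / len(neigh)
               for d in range(len(feats[node]))]
        # Shared linear transform followed by ReLU.
        row = [max(0.0, sum(agg[d] * weight[d][o] for d in range(len(agg))))
               for o in range(n_out)]
        new_feats.append(row)
    return new_feats
```

Stacking a few such layers lets information flow between words that are distant in the sentence but close in the dependency tree, which is exactly the property these DDI methods exploit.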

DDI Relation Extraction Based on Simplified Network
Recent works have also shown a common characteristic: the network has been simplified, that is, many works use only one or several fully connected layers in the main network. Duan et al. [36] used BioBERT to encode sentences and fused them with features extracted from molecular structures, ultimately inputting the result into a simple fully connected network to obtain DDI extraction results. Huang et al. [57] mainly optimized the pre-training process of BERT and directly fed the output of BERT into a fully connected network to obtain the probability of each category. Asada et al. [10] introduced KG embedding into the text representation and achieved excellent extraction performance using only two fully connected layers.

PERFORMANCE EVALUATION
DDI text extraction corresponds to the second subtask of DDI Extraction 2013: extraction of drug-drug interactions. In this task, the drug entities contained in the text are considered known [118]. In other words, DDI text extraction can be regarded as a multi-class classification problem [5,162].
In DDI text extraction, the evaluation is based on standard precision, recall, and F-score, which are defined as Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-score = 2 × Precision × Recall / (Precision + Recall), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
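These three metrics are straightforward to compute from the confusion counts; a small self-contained helper (function name is our own):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from confusion counts."""
    # Precision: fraction of predicted DDIs that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of gold DDIs that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For the multi-class DDI setting, these are computed per interaction type and then micro- or macro-averaged.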

Performance Evaluation
In Table 4, we systematically compare the deep learning methods for DDI text extraction in chronological order. Since this article focuses on the classification performance of DDI extraction, we show the best performance of each method. As can be seen, the best-performing method recorded in existing papers is Span-LSTM [113]. However, we believe this is because Span-LSTM first identifies all drug pairs with or without a DDI and then classifies the interaction type of only those drug pairs that have a DDI. The results recorded in that article are thus only the results of classifying the already identified DDIs, not the results of the whole DDI extraction task. Therefore, Span-LSTM is not the best-performing method. Compared with other works of the same period, DeepCNN1 [35] and DeepCNN2 [129] show a huge performance improvement, even outperforming current state-of-the-art methods. This may be because the deep network extracts sufficiently rich feature information, greatly improving the performance of DDI text extraction. However, in subsequent works, the number of primary network layers used in DDI relation extraction is relatively small; thus, very deep networks may not achieve good results in practical applications. We also note that most of the works using BERT pre-trained models as well as external feature supplement methods achieve high performance.

CHALLENGES AND OPPORTUNITIES
Previous work based on deep learning has achieved strong performance, but some problems still restrict the further improvement of DDI text extraction.

Data Imbalance
In the field of computer vision, commonly used datasets include MNIST [32,117], CIFAR-10 [110], and others. Taking CIFAR-10 as an example, the dataset contains 60,000 images and the number of images in each category is equal (10 categories, 6,000 images each). Therefore, a deep learning network can achieve good performance on such datasets. Take the SemEval dataset in NLP as another example: although it contains only somewhat more than 10,000 instances, the amount of data in each category is relatively balanced, and the ratio between the two categories with the largest gap is 2.82:1 [50]. By contrast, DDI Extraction 2013 contains more than 30,000 instances, and the ratio between the two categories with the largest gap is 100.03:1. This severe imbalance leads to large errors when deep learning networks perform DDI text extraction.

The Ability to Obtain Key Features
In DDI text extraction, the most commonly used network architectures are CNNs and Bi-LSTMs. Although these two architectures achieved good results in the early development of DDI relation extraction, it is difficult to achieve new breakthroughs in later stages by relying on attention mechanisms alone. With the continuous emergence of new feature supplement methods, the feature information received by deep learning networks is becoming increasingly rich, but it is difficult for a single network to capture key feature information from so many different types of features. Moreover, the feature information obtained by a feature supplement method may sometimes contain errors, which negatively affects the result. Hence, accurately extracting the key feature information from a large number of different sets of feature information becomes very important.

Utilization of External Data Resources
In early research on DDI extraction, researchers focused on improving network models and internal text feature supplement methods. However, external data resources related to drug entities also hide a large amount of key feature information. Sometimes it is even possible to determine the interaction relation between two drug entities from external data resources alone (DDI prediction [53,77]). Thus, making full use of external data resources can further improve the performance of DDI text extraction. However, only a few researchers employ external data resources (drug molecular structure, drug description, etc.) in DDI text extraction. Therefore, there is still a lot of room for improvement.

Future Opportunities
To address these problems, we propose some directions that may improve the performance of DDI text extraction:

-Combination of oversampling and undersampling
In existing work, undersampling is generally used to reduce the impact of the imbalance problem. As one undersampling method, negative instance filtering can reduce the number of negative instances and improve the ratio of positive to negative instances [1,2]. However, excessive negative instance filtering will inevitably filter out some valuable positive instances, resulting in wasted data. Consequently, undersampling alone is not the most appropriate way to process the data. Oversampling [24,93], which adopts certain rules to increase the number of positive samples, is also frequently used to address data imbalance. How to combine oversampling and undersampling to reduce the impact of data imbalance is a direction worth exploring.
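One simple way to combine the two strategies is to duplicate positive instances up to some factor and then randomly drop negatives until a target ratio is met. The sketch below is purely illustrative (the function name, the 2x oversampling factor, and the 1:1 target ratio are our own assumptions, not recommendations from the literature):

```python
import random

def rebalance(positives, negatives, target_ratio=1.0, seed=0):
    """Combine oversampling and undersampling (illustrative sketch).

    Oversamples positives by random duplication (doubling them here)
    and undersamples negatives by random selection until the
    negative:positive ratio is at most `target_ratio`.
    """
    rng = random.Random(seed)
    # Oversample: duplicate positives (here, double them).
    pos = positives + [rng.choice(positives) for _ in range(len(positives))]
    # Undersample: keep only as many negatives as the ratio allows.
    keep = min(len(negatives), int(len(pos) * target_ratio))
    neg = rng.sample(negatives, keep)
    return pos, neg
```

More informed variants would filter negatives by rules (negative instance filtering) instead of uniformly at random, and synthesize rather than duplicate positives.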

-Network combination
As can be seen from Table 4, extracting features in a serial or parallel manner through multiple networks may achieve better performance. This suggests that a single network architecture alone may not be able to capture the key feature information from the rich feature information. Therefore, combining multiple identical or different network architectures to extract feature information is more likely to capture the key feature information.
-Utilize the latest NLP technology
The performance of DDI text extraction depends heavily on the development of NLP-related technologies. For example, some articles use token segmentation tools and sentence dependency analysis tools. However, due to the limitations of the tools themselves, the results obtained may not be accurate [82]. As a consequence, the feature information obtained by feature supplement methods based on dependencies or the SDP is not necessarily accurate; it may fail to improve the performance of DDI text extraction and may even have the opposite effect. In the future, better text processing tools may achieve better results. For example, BERT, recently proposed in NLP as a substitute for word2vec, has produced better results on many NLP problems [23,54]. Researchers have applied this technology to DDI text extraction in place of word embedding and achieved good results. Thus, future DDI text extraction needs to be supported by the latest NLP technology.
-Feature mining of external data resources
There are abundant data resources related to drugs, which often contain important feature information. Some databases contain rich domain knowledge related to drugs. For example, DrugBank holds a large amount of information about drug targets, action pathways, and chemical formulas, which is significant for the interaction between drugs. In addition to professional drug datasets, other datasets (such as electronic medical records [38] and website user reviews [11]) may also contain a lot of valuable data and hide further key information about drug interactions. Therefore, using this information to assist DDI text extraction will be one of the main research directions in the future.

CONCLUSION
DDI is crucial to the safety of patients and has an important impact on drug management and patient treatment. Although DDI databases can provide DDI information, their update speed is slow. For researchers and doctors, failing to obtain sufficient DDI information in time can cause immeasurable losses. Thus, quickly obtaining DDI information from the scientific literature is of great significance for pharmacovigilance. Methods based on deep learning have been widely applied to DDI text extraction, and various feature supplement methods have been proposed. This article reviews the existing feature supplement methods and provides a new perspective for researchers to understand DDI relation extraction.

Fig. 4. An example of constructing a dependency parse tree from a sentence.

Fig. 5. An example of determining the SDP of two word entities from a sentence.

Fig. 6. Different network combinations. A represents the serial combination method, B represents the single-feature parallel combination method, and C represents the multi-feature parallel combination method.

Fig. 7. Two feature supplement methods based on special preprocessing. A represents sentence segmentation and B represents the skeleton structure.

Fig. 9. Extracting drug entity features from drug description text. A indicates the method that uses SciBERT and B indicates the method that uses Doc2vec.

Fig. 10. The feature supplement method based on two-dimensional molecular structures.

Fig. 11. The feature supplement method based on three-dimensional molecular structures.

Fig. 12. An example of constructing stem embedding, chunk embedding, and entity embedding in biomedical embedding.

and used knowledge graph (KG) embedding methods to extract drug feature representations. This is a new type of biomedical embedding, which essentially introduces biomedical information into the embedding process, thus making the features of the text carry biomedical-related information.

4 FRAMEWORKS
In this section, we review existing works on deep learning network frameworks. Deep learning has shown strong performance when applied to DDI relation extraction. Summarizing and analyzing existing deep learning network frameworks will help new researchers to quickly design and implement DDI relation extraction tasks. However, the existing review of DDI relation extraction [153] mainly covers related work before 2019; thus, its contents urgently need to be updated. In addition, [108] reviews some related content on DDI relation extraction, but it mainly describes DDI prediction and its review of DDI relation extraction is not detailed enough. Therefore, we focus on reviewing the deep learning networks for DDI relation extraction that have appeared since 2019 (inclusive) in order to show and analyze the latest developments for researchers. When reviewing the network frameworks of recent work, we mainly classify existing frameworks according to their feature extraction methods for raw text data. In the previous section, the feature supplement methods for DDI relation extraction are divided into two basic types, internal and external, each with several different branches. Different feature supplement methods require different feature extraction assisting networks (networks specifically designed for extracting supplementary features from different types of supplement information). The key features in DDI relation extraction are mainly derived from the text data of DDI Extraction 2013. These key features are processed by the

Fig. 13. The different positions of the primary network and the assisting network in the DDI relation extraction framework.

Fig. 16. Primary network based on an RNN. The red dashed box represents at least one Bi-LSTM layer. The yellow dashed boxes represent arbitrary network structures and attention layers.

Fig. 17. An example of using a Pack Bi-GRU to process a sentence.

Table 1. Symbolization of Feature Supplement Methods

Table 2. Summary of the Articles Used for DDI Relation Extraction

Table 3. Statistics for the DDI Corpus Datasets and Evaluation Metrics

DDI text extraction is based on the DDI corpus (XML format, from the SemEval 2013 DDI extraction task), composed of documents that describe drug interactions in DrugBank and MEDLINE. The DDI corpus, which consists of 784 DrugBank texts and 233 MEDLINE abstracts, is divided into a training set and a test set. There are four types of interaction between two drug entities: mechanism, effect, advise, and int. The detailed data are shown in Table 3. Because BERT has excellent performance in extracting text semantics, pre-trained word vector models based on BERT have gradually replaced traditional word embedding in DDI relation extraction. In addition, external data such as text descriptions and molecular structures related to drug entities can provide rich drug attribute information. Therefore, DDMS-CNN, 3DGT-DDI, CDBERT, IMSE, HKG-DDI, and the like all achieve nearly the highest results. In summary, adopting BERT-series models and external feature supplementation in DDI relation extraction can achieve better performance than traditional methods. Therefore, improving the BERT model and making full use of external data offer huge development prospects, and researchers can continue work on the BERT model and external data processing in the future.