Multi-source information comprehensive malicious domain name detection based on convolutional neural network

As the communication carrier of malware, viruses and malicious servers, malicious domain names pose a threat to public information security. The characteristics of malicious domain names change constantly, which leads to low accuracy in traditional malicious domain name detection models. To address this problem, this paper proposes a multi-source information comprehensive malicious domain name detection model based on convolutional neural network. Firstly, different types of features in the Domain Name System are divided into five categories, and a convolutional neural network framework is designed for each category to reduce the mutual influence between different types of features. Secondly, feature extraction is carried out and the corresponding weights are trained, and multiple classification results are fused by a decision pool to capture information among the various features. Finally, the experimental results show that our scheme has higher prediction accuracy than other machine learning algorithms.


INTRODUCTION
According to the 49th report issued by the China Internet Information Center [1], by July 2023 the number of netizens in China had reached 1.067 billion. On the one hand, the Internet makes communication, transportation and medical care more convenient; on the other hand, people are more vulnerable to cyber threats. The analysis report on China's Internet network security detection data for the first half of 2021, released by the National Internet Emergency Center [2], shows that in the first half of 2021 alone the number of malicious program samples was about 23.07 million, with an average daily spread of more than 5.82 million times. In addition, according to the statistics of AV-TEST GmbH, a total of 62.29 million new malware samples were detected in the first three quarters of 2022, which is equivalent to about 228,164 malware threats every day. Among them, attacks against the Domain Name System (DNS) emerge endlessly. For example, some malicious attacks use botnets [3]: attackers can use the command and control channels behind DNS services to improve the survivability of botnets and to launch distributed denial of service attacks, online identity theft and other malicious behaviors. In addition, phishing scams are common, with attackers committing online fraud by sending suspicious spam.
Reference [4] uses the text content of DNS logs and phishing history data to actively find phishing web pages. Although this method can actively find phishing attacks, it does not detect malicious domain names according to the Whois characteristics and resolved IP characteristics of domain names, which leads to a high false positive rate. Yin [5] improved this method by using the random forest machine learning algorithm. However, the model uses too many domain name features, resulting in a long detection time. Zhao [6] designed a detection algorithm based on lexical analysis and feature quantization. This detection method uses the N-gram model from natural language processing and judges whether a domain name is malicious according to a threshold. Zhou [7] argues that features can be built from the correlation of multiple IP addresses resolved by the same domain name, and uses random forest as the malicious domain name detection algorithm. In reference [8], the author summarized nearly 40 features of malicious domain names; the classification covers essentially all characteristics of malicious domain names, but no corresponding algorithm was designed to implement it.
Using network traffic to monitor malicious domain names has recently attracted extensive attention from scholars because it better matches the dynamic characteristics of domain names. Reference [9] not only obtains user information indirectly from flow record data, but also puts forward a "domain name dependence" measure group based on the access relationship between domain names, and uses this measure group to identify malicious domain names. In reference [10], a forward feature selection algorithm based on information gain rate is proposed to solve the problem of the high dimensionality of network traffic. Using a greedy strategy, the difficulty of selecting candidate feature subsets is addressed by calculating the information gain of each feature in the traffic records.
Graph representation learning has attracted more and more attention in the field of network security because it can capture the characteristics of network data more comprehensively and achieve better detection results. Reference [11] classifies publicly available network security data sets according to logs and network traffic, executable files, social networks and trading networks, and designs malicious domain name recognition algorithms based on dimensionality reduction and random walk. Reference [12] takes domain names and host IPs as data sources to construct a DNS graph and mine the internal relationship between domains and host IPs. Reference [13] constructs a weighted undirected graph based on the correlation of domain names across multiple sources of information; the weighted undirected graph is partitioned by the Louvain algorithm, and several subgraphs with organizational structure are obtained.
The above three methods often require the labels of malicious domain names to be given in advance, but this assumption is often untenable in practical applications. Faced with the ever-changing network environment and various types of attacks, the characteristics of malicious domain names change all the time. Therefore, this paper uses deep learning technology [14] [15] to seek a more flexible and robust malicious domain name detection method that can cope with the ever-changing evasion strategies of malicious domain name creators. Secondly, because threat intelligence vendors rely on different data sources, the coverage and accuracy of their malicious domain name detection differ [16], and a single intelligence source cannot provide accurate malicious domain name classification results. Therefore, this paper fuses multi-source data to improve the accuracy of the malicious domain name detection model.

MULTI-SOURCE INFORMATION COMPREHENSIVE MALICIOUS DOMAIN NAME DETECTION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORK
2.1 Overall process
In this paper, a deep convolutional neural network is used to extract key features automatically. Firstly, the features are divided into five main categories to reduce the interaction between different types of features. Secondly, five convolutional neural network frameworks are designed to extract the features of the five categories, and the extracted features are input into the convolutional neural network model to train the corresponding weights. Finally, the decision pool is used to integrate the feature information under the different categories. By fusing multiple classification results, the information between the various features can be captured, thus improving the accuracy of the overall classification. The malicious domain name detection framework is shown in Figure 1.
The structural characteristics of a domain name mainly refer to the number of top-level domain names, whether it includes an IP address, the Shannon entropy of the domain name, etc., as shown in Table 3.
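As an illustration of the structural characteristics mentioned above, the following minimal Python sketch computes the Shannon entropy of a domain name and checks whether the name is a literal IP address. The function names, the `num_labels` field (used here as a simple stand-in for counting domain levels) and the exact feature set are assumptions for illustration; the paper's actual sub-features are those listed in Table 3.

```python
import ipaddress
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def contains_ip(host: str) -> bool:
    """True if the host string is itself a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def structural_features(domain: str) -> dict:
    """Hypothetical structural feature vector for one domain name."""
    host = domain.strip().lower().rstrip(".")
    labels = host.split(".")
    return {
        "length": len(host),
        "num_labels": len(labels),  # assumed proxy for the number of domain levels
        "is_ip": contains_ip(host),
        "entropy": shannon_entropy(host.replace(".", "")),
    }


print(structural_features("example.com"))
print(structural_features("192.168.0.1"))
```

Randomly generated names tend to have a flatter character distribution, and therefore higher entropy, than dictionary-based names, which is why entropy is a common structural signal.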

d) Whois characteristics
Whois features provided by well-known websites also supply many of the feature vectors required for malicious domain name identification, as shown in Table 4.
e) Threat intelligence characteristics
Many threat intelligence sources, such as Weibu, Tencent, Qianxin, Green Alliance, Qiming Stars, Anheng, etc., provide malicious domain name identification schemes and publish their identification results for malicious domain names, as shown in Table 5.

Feature extraction
Since there are different statistical relationships among the various categories of features, this paper designs five separate convolutional neural network frameworks, one for each of the above five categories, in order to avoid mutual interference among the features and to better distinguish the importance of each feature.

2.3.1 Multi-source information processing.
The attention mechanism and the residual connection need to be introduced before the structure of the five-block neural network is constructed.
The attention mechanism [17] uses weights to calculate the contribution of each part of the input sequence to the overall model, and reflects the context position of the important elements of the input sequence according to the weight values. Because convolutional neural networks have many features and a complex structure, their generalization performance is often poor; the attention mechanism can effectively mitigate this problem.
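The weighting idea can be sketched as follows. This is a minimal dot-product attention example in plain Python, given only as an assumption about the general mechanism: the paper does not state which attention variant is used, and the names `softmax`, `attend` and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: turns raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(elements, query):
    """Weight each input element by its dot-product similarity to the query.

    elements: list of equal-length feature vectors (the input sequence)
    query:    vector of the same length (assumed learned in a real model)
    Returns the attention weights and the weighted sum (context vector).
    """
    scores = [sum(a * b for a, b in zip(element, query)) for element in elements]
    weights = softmax(scores)
    dim = len(elements[0])
    context = [sum(w * e[i] for w, e in zip(weights, elements)) for i in range(dim)]
    return weights, context


weights, context = attend([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
print(weights, context)
```

Elements that align with the query receive larger weights, so they contribute more to the context vector passed to later layers.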
In a deep neural network with many hidden layers, gradients can vanish or explode. The residual connection [18] allows information to pass across layers, avoids information loss in the deep layers of the network, and reduces the difficulty of training. The specific feature extraction framework of the convolutional neural network under the five features is shown in Figure 2.
The decision pool fuses the results of the five models in four steps:
(1) Individual model training: Before fusion, the five feature models need to be trained. Each model is trained using the training data and the feature set corresponding to one of the five feature categories, so that it can produce accurate malicious/non-malicious predictions when given a domain name.
(2) Generate individual predictions: The domain name to be detected is input into each individual model to obtain the respective malicious/non-malicious prediction results.
(3) Fuse prediction results: The prediction results of the individual models are combined to generate a final comprehensive prediction result.
(4) Comprehensive prediction result: The combined prediction result becomes the final output. This combined result is generally considered to be more accurate and robust than the prediction of a single model because it takes full advantage of the different characteristics and predictive power of multiple models.
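Steps (2) to (4) can be sketched as a weighted average of the per-model malicious probabilities. The fusion rule below (weighted soft voting with a 0.5 decision threshold) is an assumption chosen for illustration, since the exact rule used by the decision pool is not given here; the function name and the example scores are hypothetical.

```python
def fuse_predictions(probabilities, weights=None, threshold=0.5):
    """Combine per-model malicious probabilities into one decision.

    probabilities: one score in [0, 1] per feature model
    weights:       optional per-model weights (assumed learned or hand-set)
    Returns (fused_score, is_malicious).
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for w, p in zip(weights, probabilities)) / total
    return score, score >= threshold


# Hypothetical outputs of the five models for one domain name, in the order:
# linguistic, behavioral, structural, Whois, threat intelligence.
scores = [0.9, 0.6, 0.7, 0.2, 0.8]
fused, is_malicious = fuse_predictions(scores)
print(round(fused, 2), is_malicious)
```

With equal weights this reduces to a plain average; unequal weights let a more reliable feature category dominate the final decision.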

Decision pool loss function selection.
This paper chooses MHF (Multi Huber-Focal loss Function) as the comprehensive loss function, which combines an improved cross-entropy loss function with the Huber loss function. A hyperparameter that needs to be set manually controls the proportion of the Huber loss function in the total loss function; in this paper it is set to 0.5.
The improved cross-entropy loss function works as follows: when the predicted value of the model is close to 0 or 1, the model has already achieved a good prediction and the sample is easy to fit; otherwise, the value of the loss function is increased so that the loss values of difficult samples play a larger role in guiding model optimization.
The parameter of the Huber loss function is a trainable hyperparameter. The Huber loss function enhances robustness to outliers, which mitigates the problem of initial label noise.
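A minimal sketch of such a combined loss for a single sample is given below. Several details are assumptions rather than the paper's definition: the improved cross-entropy is taken to be the standard focal loss with focusing parameter `gamma`, the Huber threshold is called `delta`, and the two terms are mixed as a convex combination with weight `lam = 0.5` on the Huber term.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss for one binary sample: down-weights easy, confident predictions."""
    p = min(max(p, eps), 1.0 - eps)       # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def huber_loss(p, y, delta=0.5):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(y - p)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def mhf_loss(p, y, lam=0.5, gamma=2.0, delta=0.5):
    """Assumed mixing rule: (1 - lam) * focal + lam * Huber."""
    return (1.0 - lam) * focal_loss(p, y, gamma) + lam * huber_loss(p, y, delta)


# A confident correct prediction costs less than an uncertain one.
print(mhf_loss(0.9, 1), mhf_loss(0.6, 1))
```

The linear tail of the Huber term limits how strongly a single mislabeled sample can pull on the weights, which is the robustness property described above.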

EXPERIMENTAL ANALYSIS
3.1 Experimental environment and dataset
The algorithm proposed in this paper is written in the Python programming language and implemented in the PyCharm integrated development environment. The hardware environment of the experiment is an AMD Ryzen 5 3600 3.6GHz CPU and 16GB RAM. A total of 10,000 domains are obtained from Alexa and Malware, including 7,000 legitimate domains and 3,000 malicious domains. The 10,000 domains are divided into two parts: 80% as the training dataset and the remaining 20% as the test dataset.
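The 80/20 division can be sketched as below. The split is stratified by class so that the 7:3 ratio of legitimate to malicious domains is preserved in both parts; stratification, the fixed seed and the placeholder domain names are assumptions, as the paper only states the 80/20 proportion.

```python
import random


def stratified_split(samples, labels, train_fraction=0.8, seed=0):
    """Split (sample, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(s, label) for s in items[:cut]]
        test += [(s, label) for s in items[cut:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder names standing in for the 7,000 legitimate and 3,000 malicious domains.
domains = [f"legit{i}.example" for i in range(7000)]
domains += [f"bad{i}.example" for i in range(3000)]
labels = [0] * 7000 + [1] * 3000

train, test = stratified_split(domains, labels)
print(len(train), len(test))
```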

Analysis of experimental results
In this experiment, five evaluation indicators are used to evaluate the model: accuracy, recall, precision, F1 value and AUC. The following three methods are reproduced in the experimental environment of Table 6 and compared with the method in this paper:
1) Single feature-based [19]: a malicious domain name detection method based on text features.
2) Graph-based [20]: a malicious domain name detection method based on knowledge graph.
3) Domain name relationship graph-based [21]: a method for detecting typical malicious domain names based on the relationship between domain words.
The experimental results are shown in Table 7.
As can be seen from Table 7, the results of the proposed scheme are slightly higher than those of the malicious domain name detection scheme based on a single feature, and the advantage over the other two schemes is larger. This shows that the proposed method can detect malicious domain names effectively.
In a model ablation experiment, malicious domain name detection schemes with different modules are compared using accuracy, recall, precision, F1 value and AUC as the evaluation criteria. The results are shown in Table 8.
The proposed scheme is compared with the baseline neural network algorithm, the attention mechanism + neural network algorithm, and the residual connection + neural network algorithm. On all five evaluation criteria, our scheme is superior to the other schemes and shows an obvious improvement, which indicates that it is more efficient and accurate than the other schemes.
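For reference, the five evaluation indicators used in this section can be computed from the true labels and the predicted scores as in the following sketch. The function names and the 0.5 decision threshold are assumptions; AUC is computed directly from its pairwise-ranking definition, which is simple but quadratic in the number of samples.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall and F1 with label 1 meaning malicious."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc(y_true, scores):
    """Probability that a random malicious sample outscores a random benign one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(classification_metrics(y_true, scores), auc(y_true, scores))
```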

CONCLUSION
The multi-source information comprehensive malicious domain name detection model based on convolutional neural network proposed in this study obtained good results in the experiments. Features from multiple information sources are classified into linguistic features, behavioral features, structural features, Whois features and threat intelligence features. Five convolutional neural network frameworks are designed to extract features from these five major categories and train the corresponding weights, and the results are then integrated to obtain a more accurate classification model, so that malicious domain names can be detected more effectively. Compared with the traditional single-algorithm method, the proposed model shows higher accuracy in malicious domain name detection. Future research could further explore and refine the model to address challenges and changes in the malicious domain name landscape.

Figure 1: Multi-source information comprehensive malicious domain name detection model based on convolutional neural network

Figure 2: Feature extraction framework of convolutional neural network under five features

2.2.1 Description of features of multi-source information.
a) Language features
Language features refer to the features extracted from the text information of domain names. Malicious domain names usually adopt different naming methods, character combinations, or have abnormal domain names. Therefore, language features can be used to distinguish normal domain names from malicious domain names, as shown in Table 1.
b) Behavioral characteristics
Behavioral characteristics describe the DNS exchange type, TXT record, and SPF record of the domain name. Table 2 shows the behavioral characteristics. For example, if you query the MX record of example.com, you can obtain the MX record of the domain name.
c) Structural characteristics

Table 1: Description of specific sub-features under linguistic features

Table 2: Description of specific sub-features under behavioral characteristics

Table 3: Description of specific sub-features under structural features

Table 4: Description of specific sub-features under Whois

Table 5: Description of specific sub-features under threat intelligence

Table 6: Software and hardware configuration of the experimental environment

Table 7: Performance comparison of the four models

Table 8: Results of model ablation experiment