TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-grained Encrypted Traffic Classification

Encrypted traffic classification is receiving widespread attention from researchers and industrial companies. However, existing methods either extract only flow-level features, failing on short flows whose statistical properties are unreliable, or treat the header and payload equally, failing to mine the potential correlation between bytes. Therefore, in this paper, we propose a byte-level traffic graph construction approach based on point-wise mutual information (PMI), and a model named Temporal Fusion Encoder using Graph Neural Networks (TFE-GNN) for feature extraction. In particular, we design a dual embedding layer, a GNN-based traffic graph encoder, and a cross-gated feature fusion mechanism, which first embed the header and payload bytes separately and then fuse them together to obtain a stronger feature representation. The experimental results on two real datasets demonstrate that TFE-GNN outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification tasks.


INTRODUCTION
To protect user privacy and anonymity, various encryption techniques are used to encrypt the transmission of network traffic [28]. Although Internet security is improved for regular users, encryption technologies also provide a convenient disguise for malicious attackers. Moreover, some privacy-enhancing tools like VPN and Tor [26] may be utilized to conduct illegal network transactions, such as weapon trading and drug sales, where it is difficult to trace the traffic source [13]. Traditional deep packet inspection (DPI) methods concentrate on mining potential patterns or keywords in data packets, which is time-consuming and loses accuracy when facing encrypted traffic [24]. Consequently, how to effectively represent encrypted network traffic for more accurate detection and identification is a significant challenge.
To solve the above problems, many approaches have been proposed. The earliest port-based works are no longer effective due to the use of dynamic ports. Subsequently, a series of statistic-based methods emerged [8, 31, 37, 40, 43], which rely on statistical features of traffic flows (e.g., mean packet length). A machine learning classifier (e.g., random forest) is then adopted to get the final prediction results. Unfortunately, these methods require hand-crafted feature engineering and may fail due to unreliable or unstable flow-level statistical information in some cases [36]. Most statistical features of relatively short flows have higher deviations compared with long flows. For example, the flow length generally obeys a long-tailed distribution [45], implying the universal existence of unreliable statistical features. Therefore, we use packet bytes instead of those statistical features.
Recently, graph neural networks (GNNs) [14] have been widely used in many applications that process unstructured data. Due to their powerful expressiveness, GNNs can recognize specific topological patterns implied in graphs, so that each graph can be classified with a predicted label. For the traffic classification task, most current GNN-based methods [1, 11, 21, 25, 29] construct graphs according to the correlation between packets, which is actually another usage form of statistical features and also suffers from the issue mentioned above. The remaining methods do utilize packet bytes but have two major flaws: 1) Mixed usage of the header and payload. Existing methods simply treat the header and payload of a packet equally but ignore the difference in meaning between them. 2) Inadequate utilization of raw bytes. Although the packet bytes are utilized, most methods regard packets as nodes and just take their raw bytes as node features, which does not make the most of them [11].
Based on the above observations, in this paper, we propose a byte-level traffic graph construction approach based on point-wise mutual information (PMI) and a novel model named Temporal Fusion Encoder using Graph Neural Networks (TFE-GNN) for encrypted traffic classification. The byte-level traffic graphs are constructed by mining the correlation between bytes and serve as inputs to TFE-GNN. TFE-GNN consists of three major submodules (i.e., dual embedding, traffic graph encoder, and cross-gated feature fusion mechanism). The dual embedding treats the header and payload of a packet separately and embeds them using two independent embedding layers. The traffic graph encoder, which consists of multilayer GNNs, encodes each graph into a high-dimensional graph vector. Finally, we use the cross-gated feature fusion mechanism to integrate header graph vectors and payload graph vectors, obtaining an overall representation vector of a packet. For end-to-end training, we employ a time series model to get final prediction results for downstream tasks. In the experiment section, we adopt a self-collected WWT dataset (including data from WeChat, WhatsApp and Telegram) as well as the public ISCX dataset to compare TFE-GNN with more than a dozen baselines. The experimental results show that TFE-GNN surpasses almost all baselines and achieves the best overall performance on the adopted datasets (e.g., 10.82% ↑ on the Telegram dataset, 4.58% ↑ on the ISCX-Tor dataset).
In summary, the main contributions of this paper include:
• We first construct the byte-level traffic graph by converting a sequence of packet bytes into a graph, supporting traffic classification from a different perspective.
• We propose TFE-GNN, which treats the packet header and payload separately and encodes each byte-level traffic graph into an overall representation vector for each packet. Thus, TFE-GNN utilizes a packet-level representation vector rather than a flow-level one.
• To evaluate the performance of the proposed TFE-GNN, we compare it with several existing methods on the self-collected WWT dataset and the public ISCX dataset [5, 15]. The results show that, for user behaviour classification, TFE-GNN outperforms these methods in effectiveness.

PRELIMINARIES

Notations
In this paper, a graph is denoted by $G = \{V, E, X\}$, where $V$ is the node set, $E$ is the edge set, and $X \in \mathbb{R}^{|V| \times d}$ is the initial feature matrix of nodes, whereby the initial feature of node $v$ can be represented by $x_v$. We use $A \in \{0, 1\}^{|V| \times |V|}$ to represent the adjacency matrix of $G$, which satisfies that the entry $(i, j)$ of $A$, i.e., $A_{ij}$, equals 1 if there is an edge between nodes $i$ and $j$, and 0 otherwise. We use $N(v)$ to represent the neighborhood of node $v$. Moreover, we use $d_l$ to represent the embedding dimension in the $l$-th layer.
For brevity and convenience, we extend the concept of traffic flows by introducing time-induced Traffic Segments (TS), which are collectively referred to as traffic samples in the rest of the paper:

$$\mathrm{TS} = \left(p_1^{t_1}, p_2^{t_2}, \ldots, p_n^{t_n}\right), \quad t_1 \le t_i \le t_n,$$

where $p_i^{t_i}$ denotes a single packet with its time stamp $t_i$, $n$ is the sequence length of a traffic segment, and $t_1$, $t_n$ are the start and end times of a traffic segment, respectively. From the definition above, the traffic segment has a broader scope than the traffic flow, i.e., each traffic flow can be seen as a traffic segment, but the reverse does not necessarily hold. In this way, we can directly take traffic segments as training samples and do inference using either traffic flows or traffic segments, which helps to improve flexibility and unleash the expressiveness of an end-to-end model.

Encrypted Traffic Classification
The encrypted traffic classification task aims to differentiate the traffic generated from various sources (e.g., applications, web pages or services) by using the information of traffic packets captured by professional software or programs. In this paper, we concentrate on in-app user behaviour classification, which differentiates fine-grained user actions such as sending texts and sending pictures. Assume that there are $N$ training samples and $C$ categories in total. Let the $i$-th traffic sample be a sequence

$$x_i = \left(b_1^i, b_2^i, \ldots, b_n^i\right),$$

where $n$ is the sequence length and $b_j^i$ is the $j$-th byte sequence of the $i$-th traffic sample, denoted by

$$b_j^i = \left(v_{j,1}^i, v_{j,2}^i, \ldots, v_{j,m}^i\right),$$

where $m$ is the byte sequence length and $v_{j,k}^i$ denotes the $k$-th byte value in the $j$-th byte sequence of the $i$-th traffic sample. According to the definition above, the (segment-level) encrypted traffic classification task can be described formally as predicting the category $\hat{y}$ of an unseen test sample $x$ with a designed and well-trained end-to-end model $f(\cdot)$ on $N$ training samples, where $\hat{y} = f(x)$.

Message Passing Graph Neural Networks
Graph Neural Networks (GNNs) [14] are powerful models for handling unstructured data. With the application of the message passing paradigm (MP) [6] to GNNs (MP-GNNs), the node embedding vectors can be updated iteratively by integrating the embedding vectors of neighboring nodes through a specific aggregation strategy. Generally, the $l$-th MP-GNN layer can be formalized as two procedures (i.e., message computation and aggregation):

$$m_u^{(l)} = \mathrm{MSG}^{(l)}\left(h_u^{(l-1)}\right),$$

$$h_v^{(l)} = \mathrm{AGG}^{(l)}\left(h_v^{(l-1)}, \left\{m_u^{(l)} : u \in N(v)\right\}\right),$$

where $h_v^{(l)}, h_u^{(l)} \in \mathbb{R}^{d_l}$ are the embedding vectors of nodes $v$ and $u$ in layer $l$, $m_u^{(l)}$ is the computed message from node $u$ in layer $l$, and $\mathrm{MSG}^{(l)}(\cdot)$ is a message computation function parameterized by layer-specific weights. Due to the high scalability of our proposed model, various GNN architectures can be easily adapted. Section 3.3 discusses the concrete choice of message aggregation strategies and our designed GNN architecture according to the design space of GNNs [42].
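To make the message passing paradigm concrete, the following minimal sketch implements one such layer in PyTorch; the linear message function and mean aggregation are illustrative placeholders, not the specific choices made in Section 3.3.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One MP-GNN layer: compute messages from neighbors, then aggregate.
    The linear MSG function and mean AGG are illustrative choices only."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.msg = nn.Linear(in_dim, out_dim)                # MSG^(l)
        self.update = nn.Linear(in_dim + out_dim, out_dim)   # update step of AGG^(l)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (|V|, d_{l-1}) node embeddings; adj: (|V|, |V|) 0/1 adjacency matrix
        messages = self.msg(h)                               # per-node messages m_u^(l)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ messages / deg                           # mean over N(v)
        return torch.relu(self.update(torch.cat([h, agg], dim=-1)))
```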

METHODOLOGY

Byte-level Traffic Graph Construction
We attempt to convert a sequence of bytes into a graph $G = \{V, E, X\}$ by mining the potential correlation between bytes, where each element in $V$ denotes a byte (i.e., a byte corresponds to a node in $G$). Note that all bytes with the identical value share the same node, so there are no more than 256 nodes in $G$, which ensures a relatively small scale of traffic graphs.
Correlation representation between bytes. For edges, we could easily connect all bytes chronologically, i.e., create an edge from byte $i$ to byte $j$ if byte $i$ comes before byte $j$ in a byte sequence. But we do not adopt this method, since it would lead to a very dense graph whose topological structure lacks distinguishability. Therefore, inspired by [22], which uses cosine similarity to measure the correlation between two bytes, we adopt point-wise mutual information (PMI) [41], a prevalent measure for word association computation in natural language processing (NLP), to model the correlation between two bytes. In this paper, we represent the PMI value of bytes $i$ and $j$ as $\mathrm{PMI}(i, j)$.
Edge creation. The PMI value comprehensively measures two co-occurring bytes from the perspective of their semantic associativity. We utilize it to create an edge between two bytes. A positive PMI value implies a high semantic correlation between bytes, while a zero or negative one implies little or no semantic correlation. Consequently, we only create an edge between two bytes whose PMI value is positive.
Graph construction. Below, we give the formal description of edges through the entries of the adjacency matrix $A$ for nodes $i$ and $j$:

$$A_{ij} = \begin{cases} 1, & \mathrm{PMI}(i, j) > 0, \\ 0, & \text{otherwise}. \end{cases}$$

The initial feature of each node in graph $G$ is given by the corresponding byte value, which ranges from 0 to 255. Notably, since $\mathrm{PMI}(i, j) = \mathrm{PMI}(j, i)$, the byte-level traffic graphs are undirected.
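As an illustration of the construction above, the following sketch estimates PMI from sliding-window co-occurrence counts and creates undirected edges for positive PMI values. The window-based estimator is a common choice in NLP (the default window size of 5 matches Section 4.1), but the exact estimator used in the original implementation may differ.

```python
import math
from collections import Counter
import numpy as np

def build_byte_graph(byte_seq: list[int], window: int = 5) -> np.ndarray:
    """Build a byte-level graph: nodes are byte values (0-255); an edge
    (i, j) is created iff PMI(i, j) > 0, estimated over sliding windows."""
    n_windows = max(len(byte_seq) - window + 1, 1)
    count = Counter()       # number of windows containing byte i
    pair_count = Counter()  # number of windows containing both i and j
    for start in range(n_windows):
        uniq = set(byte_seq[start:start + window])
        for i in uniq:
            count[i] += 1
        for i in uniq:
            for j in uniq:
                if i < j:
                    pair_count[(i, j)] += 1
    adj = np.zeros((256, 256), dtype=np.uint8)
    for (i, j), c in pair_count.items():
        # PMI(i, j) = log( p(i, j) / (p(i) * p(j)) )
        pmi = math.log((c / n_windows) /
                       ((count[i] / n_windows) * (count[j] / n_windows)))
        if pmi > 0:
            adj[i, j] = adj[j, i] = 1  # undirected, since PMI is symmetric
    return adj
```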

Dual Embedding
The byte value is commonly utilized to serve as the initial feature for further vector embedding. Two bytes with different values correspond to two distinct embedding vectors. However, the meaning of a byte varies not only with the byte value itself, but also with the part of the byte sequence in which it is located. In other words, the representational meanings of two bytes with the identical value, one within the header and one within the payload of a packet, may be completely different. The reason is that the payload carries the transmission contents of a packet, while the header is the first part of a packet that describes its contents. If two bytes with the identical value in the header and payload correspond to the same embedding vector, it is difficult for a model to converge to the optimum on these embedding parameters because of the obfuscated meaning.
For the rationale mentioned above, we treat the header and payload of a packet separately and construct byte-level traffic graphs for the two parts, respectively (i.e., byte-level traffic header graphs and byte-level traffic payload graphs). We adopt dual embedding with two embedding layers that do not share parameters to embed the initial byte value features into high-dimensional embedding vectors for the two kinds of graphs, respectively.
Dual embedding layer. Assume that $d_0$ denotes the embedding dimension and $n$ is the number of embedding elements (i.e., byte values). The dual embedding matrices, which consist of two embedding matrices, can be viewed as $E_h \in \mathbb{R}^{n \times d_0}$ and $E_p \in \mathbb{R}^{n \times d_0}$, where each row-wise entry represents the embedding vector of one byte value.
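A minimal PyTorch sketch of the dual embedding layer is given below; the embedding dimension d0 = 64 is an assumed placeholder value.

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Two independent embedding tables (no parameter sharing): one for
    header bytes and one for payload bytes. n = 256 byte values."""

    def __init__(self, n: int = 256, d0: int = 64):  # d0 is an assumed value
        super().__init__()
        self.header_emb = nn.Embedding(n, d0)   # E_h in R^{n x d0}
        self.payload_emb = nn.Embedding(n, d0)  # E_p in R^{n x d0}

    def forward(self, header_bytes: torch.Tensor, payload_bytes: torch.Tensor):
        # the same byte value maps to different vectors in the two parts
        return self.header_emb(header_bytes), self.payload_emb(payload_bytes)
```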

Traffic Graph Encoder with Cross-gated Feature Fusion
Since we construct byte-level traffic graphs based on the header and payload of packets, respectively, the following modules of TFE-GNN in this section are also dual: they do not share parameters (the architectures are identical) and can process the two graph types in parallel.
Traffic graph encoder. To encode each traffic graph into a graph feature vector, we elaborately design a traffic graph encoder using stacked GraphSAGE [7], a powerful graph neural network. For every node $v$ in graph $G$, GraphSAGE computes the message from each neighboring node $u \in N(v)$ by normalizing its embedding vector using the degree of node $v$. Then, GraphSAGE computes the overall message of all neighboring nodes $N(v)$ through an element-wise mean operation and aggregates the overall message together with the embedding vector of node $v$ through a concatenation operation. Finally, a nonlinear transformation is applied to the embedding vector of node $v$, finishing the forward procedure of one GraphSAGE layer. Formally, the message computation and aggregation of GraphSAGE can be described by

$$m_{N(v)}^{(l)} = \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(l-1)},$$

$$h_v^{(l)} = \sigma\left(\mathbf{w}^{(l)} \cdot \mathrm{CONCAT}\left(h_v^{(l-1)}, m_{N(v)}^{(l)}\right)\right),$$

where $|N(v)|$ is the number of neighbors of node $v$, $\mathbf{w}^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ is the parameter in layer $l$, $\mathrm{CONCAT}(\cdot)$ denotes the concatenation operation and $\sigma(\cdot)$ denotes the activation function. Specially, we employ parametric ReLU (PReLU) [9] as the activation function. PReLU scales each negative element by a learnable factor, which not only provides a nonlinear transformation but also plays a role similar to that of the attention mechanism through different scale factors for each channel on the negative axis. Lastly, we normalize the updated feature vector $h_v^{(l)}$ by batch normalization (BN) [12]. Due to the over-smoothing issue [2] in deep GNN models, we stack at most 4 GraphSAGE layers and concatenate the output feature vectors of each layer for each node $v$ to alleviate this problem, similar to the Jumping Knowledge Network (JKN) [39]:

$$h_v^{\mathrm{final}} = \mathrm{CONCAT}\left(h_v^{(1)}, h_v^{(2)}, \ldots, h_v^{(L)}\right),$$

where $h_v^{\mathrm{final}}$ is the final feature vector of node $v$. Finally, we apply mean pooling over all nodes to get a graph feature vector $\mathbf{g}$:

$$\mathbf{g} = \frac{1}{|V|} \bigoplus_{v \in V} h_v^{\mathrm{final}},$$

where $\oplus$ denotes element-wise addition. For simplicity, we use $\mathbf{g}_h$ and $\mathbf{g}_p$ to represent the graph feature vectors extracted from traffic header graphs and traffic payload graphs, respectively.
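The encoder described above can be sketched with torch_geometric's SAGEConv as follows; the hidden width is an assumed placeholder, and the per-layer ordering (convolution, then PReLU, then batch normalization) follows the description above.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class TrafficGraphEncoder(nn.Module):
    """Stacked GraphSAGE (up to 4 layers) with PReLU + BN per layer,
    JKN-style concatenation across layers, then mean pooling over nodes."""

    def __init__(self, d0: int = 64, hidden: int = 64, layers: int = 4):
        super().__init__()
        self.convs, self.bns, self.acts = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        dim = d0
        for _ in range(layers):
            self.convs.append(SAGEConv(dim, hidden))
            self.acts.append(nn.PReLU())
            self.bns.append(nn.BatchNorm1d(hidden))
            dim = hidden

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (|V|, d0) initial node embeddings; edge_index: (2, |E|) edges
        outs, h = [], x
        for conv, act, bn in zip(self.convs, self.acts, self.bns):
            h = bn(act(conv(h, edge_index)))
            outs.append(h)
        h_final = torch.cat(outs, dim=-1)  # JKN-like concatenation
        return h_final.mean(dim=0)         # mean pooling -> graph vector g
```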
Cross-gated feature fusion. Since we extract features from traffic header graphs and traffic payload graphs, respectively, as mentioned in Section 3.2, we aim to create a reasonable relationship between $\mathbf{g}_h$ and $\mathbf{g}_p$ to get an overall representation of packet bytes. To this end, we carefully design a feature fusion mechanism, named cross-gated feature fusion, to fuse $\mathbf{g}_h$ and $\mathbf{g}_p$ into a final encoded feature vector for each packet. As shown in Figure 1, we adopt two filters, each of which consists of two linear layers with a PReLU activation function between them. First, the two filters, which do not share parameters, are applied to $\mathbf{g}_h$ and $\mathbf{g}_p$, respectively, and then an element-wise sigmoid function is used to scale each element to $[0, 1]$. We consider the scaled vectors as gated vectors ($\mathbf{s}_h$ and $\mathbf{s}_p$ for the header and the payload) and use them to crosswise filter the corresponding $\mathbf{g}_h$ and $\mathbf{g}_p$. Such a mechanism allows the model to filter out unimportant information and retain the significant information in the two feature vectors. As the first part of the packet, the header describes its important features. Thus, it is reasonable to use the header gated vector $\mathbf{s}_h$ to filter the payload graph feature vector $\mathbf{g}_p$ and, conversely, to use the payload gated vector $\mathbf{s}_p$ to filter the header graph feature vector $\mathbf{g}_h$.
The cross-gated feature fusion can be formally represented by

$$\mathbf{s}_h = \mathrm{sigmoid}\left(\mathbf{w}_2^h \cdot \delta\left(\mathbf{w}_1^h \cdot \mathbf{g}_h + \mathbf{b}_1^h\right) + \mathbf{b}_2^h\right),$$

$$\mathbf{s}_p = \mathrm{sigmoid}\left(\mathbf{w}_2^p \cdot \delta\left(\mathbf{w}_1^p \cdot \mathbf{g}_p + \mathbf{b}_1^p\right) + \mathbf{b}_2^p\right),$$

$$\mathbf{z} = \mathrm{CONCAT}\left(\mathbf{s}_p \odot \mathbf{g}_h,\; \mathbf{s}_h \odot \mathbf{g}_p\right),$$

where $\mathbf{w}_{1,2}^{h,p}$ and $\mathbf{b}_{1,2}^{h,p}$ are the weights and biases of the linear layers and $\delta(\cdot)$ denotes the PReLU activation. The symbol $\odot$ denotes the element-wise product and $\mathbf{z}$ is the overall representation vector of the packet bytes, which can be used for the downstream tasks.
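A minimal sketch of this mechanism is shown below; the way the two crosswise-filtered vectors are combined into z (concatenation here) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class CrossGatedFusion(nn.Module):
    """Two non-shared filters produce gated vectors s_h and s_p; each gate
    crosswise scales the *other* part's graph vector, and the two filtered
    vectors are combined into the packet representation z."""

    def __init__(self, dim: int):
        super().__init__()
        self.filter_h = nn.Sequential(nn.Linear(dim, dim), nn.PReLU(), nn.Linear(dim, dim))
        self.filter_p = nn.Sequential(nn.Linear(dim, dim), nn.PReLU(), nn.Linear(dim, dim))

    def forward(self, g_h: torch.Tensor, g_p: torch.Tensor) -> torch.Tensor:
        s_h = torch.sigmoid(self.filter_h(g_h))  # header gated vector
        s_p = torch.sigmoid(self.filter_p(g_p))  # payload gated vector
        # crosswise filtering: the header gate filters the payload, and vice versa
        return torch.cat([s_p * g_h, s_h * g_p], dim=-1)  # combination op is assumed
```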

End-to-End Training on Downstream Tasks
Based on the overall representation vector $\mathbf{z}$ for each packet, a packet-level or a segment-level classification task can be easily solved using a downstream classifier. We primarily focus on the segment-level task in this paper.
Temporal information extraction. Since we have already encoded the raw bytes of each packet in a traffic segment into a representation vector $\mathbf{z}$, the segment-level classification task can be considered a time series prediction task. Here, we adopt long short-term memory (LSTM) [10], a classical and well-known time series model, as our baseline downstream model. The LSTM is bidirectional with two layers, and its output vectors are fed into a two-layer linear classifier with PReLU as its activation function to get the final prediction results. Since we need to compute the difference between the prediction results and the ground truth, we adopt the cross entropy function as the loss function:

$$\mathcal{L} = \mathrm{CE}\left(f\left(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_M\right), y\right),$$

where $M$ is the segment length, $y$ is the ground truth and $\mathrm{CE}(\cdot)$ denotes the cross entropy function. Specially, we also attempt to employ a transformer layer [33], another effective time series model based on the self-attention mechanism, as the downstream model. The experimental results for transformers are also presented in the experiment section.
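A sketch of the downstream model under the settings above (two-layer bidirectional LSTM, two-layer linear head with PReLU, cross entropy loss); taking the last time step's output as the sequence summary is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class DownstreamClassifier(nn.Module):
    """Two-layer bidirectional LSTM over per-packet vectors z_1..z_M,
    followed by a two-layer linear head with PReLU activation."""

    def __init__(self, dim: int, n_classes: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.PReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, z_seq: torch.Tensor) -> torch.Tensor:
        # z_seq: (batch, M, dim); the last time step summarizes the segment (assumed)
        out, _ = self.lstm(z_seq)
        return self.head(out[:, -1, :])

# training step sketch:
# loss = nn.CrossEntropyLoss()(model(z_seq), labels)
```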

EXPERIMENTS
In this section, we first present the experimental settings. Then, we conduct experiments on multiple datasets and baselines and analyze the results. We also conduct an ablation study to show the effectiveness of each component in TFE-GNN. For a comprehensive analysis, we design some model variants to evaluate the scalability of TFE-GNN and compare several baselines w.r.t. their model complexity. Finally, we analyse the model sensitivity of TFE-GNN. In detail, we conduct experiments to answer the following questions:
RQ1: How useful is each component (Section 4.3)?
RQ2: Which GNN architecture performs best (Section 4.4)?
RQ3: What is the complexity of the TFE-GNN model (Section 4.5)?
RQ4: To what extent do changes in hyper-parameters affect the effectiveness of TFE-GNN (Section 4.6)?
Datasets. ISCX VPN-nonVPN is a public traffic dataset which contains the ISCX-VPN and ISCX-nonVPN datasets. The ISCX-VPN dataset is collected over virtual private networks (VPNs), which are used for accessing blocked websites or services and whose traffic is difficult to recognize due to obfuscation technology. Conversely, the traffic in ISCX-nonVPN is regular and not collected over VPNs.
Similarly, ISCX Tor-nonTor is a public dataset; the ISCX-Tor dataset is collected over the onion router (Tor), whose traffic can be difficult to trace, while ISCX-nonTor is regular traffic not collected over Tor. For comparison, we use the ISCX VPN-nonVPN and ISCX Tor-nonTor datasets with six and eight user behaviour categories, respectively. We use SplitCap to obtain bidirectional flows from the public datasets. Specially, due to the scarcity of flows in the ISCX-Tor dataset, we increase the number of training samples by dividing each flow into 60-second non-overlapping blocks in our experiments [27], as sketched below. Finally, we utilize stratified sampling to partition each dataset into training and testing sets at a ratio of 9:1.
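The 60-second division used for ISCX-Tor can be sketched as follows, assuming each flow is a time-sorted list of (timestamp, packet) pairs.

```python
def split_into_blocks(packets, block_seconds: float = 60.0):
    """Divide a flow (time-sorted list of (timestamp, packet) pairs) into
    non-overlapping 60-second blocks to increase the number of samples."""
    if not packets:
        return []
    blocks, current, block_start = [], [], packets[0][0]
    for ts, pkt in packets:
        if ts - block_start >= block_seconds:
            blocks.append(current)          # close the current block
            current, block_start = [], ts   # start a new one at this packet
        current.append((ts, pkt))
    blocks.append(current)
    return blocks
```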
The WWT dataset includes fine-grained user behaviour traffic data from three social media apps (i.e., WhatsApp, WeChat and Telegram), which have twelve, nine and six user behaviour categories, respectively.Unlike the public ISCX dataset, we additionally record the start and end timestamps of each user behaviour sample for traffic segmentation.
Pre-processing. For each dataset, we define and filter out two kinds of "anomalous" samples: (1) Empty flows or segments: traffic flows or segments in which no packet has a payload.
(2) Overlong flows or segments: traffic flows or segments whose length (i.e., the number of packets) is larger than 10,000. An empty flow or segment does not contain any payload, so we cannot construct the corresponding graph. In fact, such samples are generally used to establish connections between clients and servers, carrying little discriminating information that helps classification. An overlong flow or segment contains too many packets, and a large number of bad packets or retransmission packets may appear in it due to a temporarily bad network environment or other potential reasons. In most cases, such samples introduce too much noise, so we also consider overlong flows or segments anomalous and remove them. Additionally, for each remaining sample, we remove the bad packets and retransmission packets within it.
For each packet in a flow or segment, we first remove the ones without payload. Then we remove the Ethernet header, which only provides information irrelevant for classification. The source and destination IP addresses and the port numbers are also removed, eliminating interference from the sensitive information they carry.
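A sketch of this per-packet preprocessing with scapy is given below; the exact anonymization details (zeroing versus fully removing the address and port fields) are assumptions, and the maximum byte lengths are taken from Section 4.1.

```python
from scapy.all import IP, TCP, rdpcap  # assumes scapy is installed

def preprocess_packet(pkt):
    """Drop the Ethernet header, mask IPs/ports, and split the remaining
    bytes into header and payload parts (field handling is an assumption)."""
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        return None
    payload = bytes(pkt[TCP].payload)
    if len(payload) == 0:
        return None                           # packets without payload are removed
    ip = pkt[IP].copy()                       # start from the IP layer (no Ethernet)
    ip.src, ip.dst = "0.0.0.0", "0.0.0.0"     # mask sensitive addresses
    ip[TCP].sport = ip[TCP].dport = 0         # mask port numbers
    del ip.chksum, ip[TCP].chksum             # let scapy recompute checksums on rebuild
    raw = bytes(ip)
    header = raw[: len(raw) - len(payload)]
    return list(header[:40]), list(payload[:150])  # max lengths from Sec. 4.1

# usage: samples = [preprocess_packet(p) for p in rdpcap("capture.pcap")]
```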

Implementation Details and Baselines.
In the stage of traffic graph construction, we set the maximum packet number of one sample to 50. The maximum payload byte length and the maximum header byte length are set to 150 and 40, respectively. The PMI window size is set to 5 by default. In the training stage, we set the maximum number of training epochs to 120. The initial learning rate is set to 1e-2, and we use the Adam optimizer with a learning rate scheduler, which gradually decays the learning rate from 1e-2 to 1e-4. The batch size is 512, the warmup ratio is 0.1 and the dropout rate is 0.2. We implement all models with PyTorch, run each experiment 10 times independently on a single NVIDIA RTX 3080 GPU, and report the average.

Comparison Experiments
The comparison results on the WWT and ISCX datasets are shown in Tables 1 and 2. According to Tables 1 and 2, we can draw the following conclusions: (1) TFE-GNN reaches the best performance compared with several baselines on the WWT dataset. Additionally, TFE-GNN also achieves the best results on four metrics, which further comprehensively demonstrates the effectiveness of our method.

Ablation Study (RQ1)
In this section, we conduct an ablation study of TFE-GNN on the ISCX-VPN and ISCX-Tor datasets and show the experimental results in Table 3. To facilitate the presentation of results, we denote the header, payload, dual embedding module, jumping knowledge network-like concatenation, cross-gated feature fusion, and activation function and batch normalization as 'H', 'P', 'DUAL', 'JKN', 'CGFF' and 'A&N', respectively. Specially, we not only verify the effectiveness of each component in TFE-GNN, but also test the impact of some alternative modules or operations, including the 'SUM' and 'MAX' operations on node features to get graph representation vectors instead of the default 'MEAN', and 'GRU' or 'TRANSFORMER' modules serving as downstream models instead of LSTM.
From the component ablation part of Table 3, we can draw the following conclusions: (1) The packet headers play a more important role in classification than the packet payloads, and different datasets exhibit different levels of header and payload importance (the f1-score decreases by 2.5% when switching from the header to the payload on the ISCX-VPN dataset and by 21.06% on the ISCX-Tor dataset). (2) The usage of dual embedding increases the f1-score by 3.63% and 0.95% on the two datasets, respectively, which indicates its general effectiveness. JKN-like concatenation and cross-gated feature fusion both enhance the performance of TFE-GNN by a similar margin on the two datasets. (3) We further verify the impact of the activation function and batch normalization; a significant performance drop can be seen on both datasets, which demonstrates the necessity of these two operations.
From the rest of Table 3, we can also make the following observations: (1) The element-wise summation on node features performs worse than the mean operation by a margin of 11.1% and
GNN Architecture Variants Study (RQ2)

From Figure 2, we can find that GraphSAGE [7] achieves the best f1-score on all three datasets. As for the remaining variants, a noticeable drop in performance can be observed, especially for GAT [34]. The rationale behind these results is that GNN models overfit easily on small-scale graphs like ours (the number of nodes is at most 256). For GAT [34], the application of the attention mechanism in neighborhood feature aggregation exacerbates overfitting, which leads to a significant decline in f1-score. Among the three datasets, the relatively small fluctuation of the results on the Telegram dataset, which benefits from its larger number of training samples, further validates the analysis above.

Model Complexity Analysis (RQ3)
To comprehensively evaluate the trade-off between model performance and model complexity, we present the floating point operations (FLOPs) and the model size of all baselines except for the traditional models in Table 4.
From Tables 4 and 2, we can conclude that TFE-GNN achieves the most significant improvement on the public datasets with a relatively slight increase in model complexity. Although ET-BERT reaches comparable results on the ISCX-nonVPN dataset, the FLOPs of ET-BERT are approximately five times those of TFE-GNN and its number of model parameters is roughly double, which generally indicates longer model inference time and requires more computation resources. Furthermore, the pre-training stage of ET-BERT is very time-consuming and costly due to the large amount of extra data used during pre-training and the high model complexity. In comparison, TFE-GNN can achieve higher accuracy while reducing training and inference costs.

Table 4: FLOPs and Model Size of Baselines

Model            FLOPs (M)   Parameters (M)
FS-Net [18]      1.0e+2      3.2e+0
DF [30]          2.8e+0      9.3e-1
EDC [16]         2.2e+1      2.2e+1
FFB [44]         2.6e+2      1.7e+0
MVML [4]         7.2e-4      3.7e-4
ET-BERT [17]     1.1e+4      8.6e+1
GraphDApp [29]   3.8e-2      1.1e-2
ECD-GNN [11]     2.9e+1      1.4e+0

Model Sensitivity Analysis (RQ4)

(2) The Impact of PMI Window Size. From Figure 3b, we can find that a smaller window size usually results in a better f1-score. The larger the window size, the more edges are added to the traffic graphs, and the harder it becomes for the model to discriminate different traffic categories because the graphs are too dense. (3) The Impact of Segment Length. From Figure 3c, we can conclude that a short segment length for training usually improves performance. When the segment length becomes longer, more noise is introduced, and the downstream LSTM model has shortcomings in long sequence modeling, affecting the evaluation results. On the other hand, our method can achieve high accuracy when facing a short traffic flow or segment, reducing the amount of computation while improving performance.

RELATED WORK

Statistical Feature Based Methods. A series of methods rely on flow-level statistical features [31]. CUMUL uses the features of cumulative packet length [23] and GRAIN [43] uses payload length as its features. ETC-PS utilizes the path signature theory to enhance the original packet length features [40], and Liu et al. exploited packet length sequences using wavelet decomposition [19]. Conti et al. [3] adopt hierarchical clustering for feature extraction. Fingerprint matching is also used in the traffic classification task. FlowPrint [32] constructs correlation graphs as traffic fingerprints by computing activity values between destination IPs. K-FP [8] creates fingerprints using random forests and matches unseen samples by k-nearest neighbors. All of these methods suffer from unreliable features (as mentioned in Section 1).
Deep Learning Based Methods. With the popularity of deep learning models, many traffic classification approaches have been developed based on them. EDC [16] uses some header information of packets (e.g., protocol types, packet length and time duration) to build features for multilayer perceptrons (MLPs). MVML [4] designs local and global features using packet length and time delay sequences, and simply employs a fully-connected layer for classification. Furthermore, FS-Net [18], DF [30] and RBRN [47] all utilize traffic flow sequences, such as packet length sequences, as the inputs of deep learning models. Additionally, DF and RBRN use convolutional neural networks (CNNs), while FS-Net utilizes gated recurrent units (GRUs) to extract temporal information from such sequences. Some other methods use packet bytes as model inputs to extract features. FFB [44] feeds raw bytes and packet length sequences into CNNs and RNNs, while Deep Packet [20] utilizes CNNs and autoencoders for feature extraction. Recently, pre-training models have been utilized to pre-train on large-scale traffic data. For example, ET-BERT [17] designs two novel pre-training tasks for traffic classification, which enhance the representation ability of raw bytes but are very time-consuming and costly. In a word, these methods cannot efficiently exploit the discriminative information contained in raw bytes, while our approach addresses this difficulty by introducing byte-level traffic graphs.
Graph Neural Network Based Methods. Graph neural networks have strong potential in processing unstructured data and can be migrated to many fields. For encrypted traffic classification, GraphDApp [29] constructs traffic interaction graphs using traffic bursts and employs a graph isomorphism network [38] to learn representations. MAppGraph [25] constructs traffic graphs based on different flows and time slices within a traffic chunk, which makes it almost impossible to construct a complete graph for a short traffic segment. GCN-ETA [46] is a malicious traffic detection method; to construct a graph, it creates an edge whenever two flows share a common IP, which may result in a very dense graph. MEMG [1] utilizes Markov chains to construct graphs from flows, while GAP-WF [21] maps each flow to a node in a graph and connects edges between flows that share the same client identity. Besides, Huoh et al. [11] directly create edges based on the chronological relationship of packets within a flow, which lacks specificity. These methods all construct graphs at the level of traffic flows, which are vulnerable when there is too much noise within flows.

CONCLUSION AND FUTURE WORK
We propose an approach to construct byte-level traffic graphs and a model named TFE-GNN for encrypted traffic classification. The byte-level traffic graph construction approach can mine the potential correlation between raw bytes and generate discriminative traffic graphs. TFE-GNN is designed to extract high-dimensional features from the constructed traffic graphs. Finally, TFE-GNN can encode each packet into an overall representation vector, which can be used for downstream tasks like traffic classification. Several baselines are selected to evaluate the effectiveness of TFE-GNN. The experimental results show that our proposed model comprehensively surpasses all the baselines on the WWT and ISCX datasets, and elaborately designed experiments further demonstrate its strong effectiveness.
In the future, we will attempt to improve TFE-GNN with respect to the following limitations. (1) Limited graph construction approach. The graph topology of the proposed model is determined before the training procedure, which may result in non-optimal performance. Moreover, TFE-GNN cannot cope with the byte-level noise implied in the raw bytes of each packet. (2) Unused temporal information implied in byte sequences. The byte-level traffic graphs are constructed without introducing the explicit temporal characteristics of byte sequences.

Figure 2: GNN Architecture Variants Study w.r.t. F1-score

Table 1: Experimental Results on Self-collected WeChat, WhatsApp and Telegram Datasets