Identifying Network Congestion Using Knowledge Graphs and Link Prediction

In this work, we introduce a dynamic, context-aware system for managing communication networks using knowledge graphs. We utilize graph embeddings to create vector representations of a network's properties, while preserving each node's topological data. Our model employs link prediction techniques to proactively identify potential network congestion events. This methodology has been applied in a simulated communication network environment. The results demonstrate its promising ability to enhance network performance and reliability by predicting and mitigating congestion before it disrupts service delivery. By leveraging this enriched representation, our model identifies events that could disturb efficient network function, hence enabling more efficient and reliable delivery of digital services. This approach significantly contributes to the proactive and predictive management of digital communication networks, establishing a new way of enhancing network performance and reliability.


INTRODUCTION
The exponential growth in networking capacity today is primarily driven by the advancements in 5G and 6G wireless networks and the emergence of smart environments in various sectors such as urban cities, industrial setups, transportation systems, and energy grids. A significant portion of applications and services in these sectors rely on centralized computing and storage infrastructures, placing data centers at the forefront of the contemporary digital ecosystem (Fig. 1). However, this centralized model faces challenges in meeting specific performance requirements such as low latency, high bandwidth, and energy efficiency, necessary for certain application services [1].
Content distribution networks and edge computing have eased the core network's burden [2,3]. Yet, the surge in data from IoT devices and diverse data sources strains the reliability of communication networks interlinking these distributed resources [4]. Consequently, the urgency for adaptive network management solutions, capable of real-time monitoring and decision-making, has grown [5]. These solutions facilitate adjustments in traffic management based on current and projected link statuses. Addressing these challenges, our work introduces an ML-driven approach to predict potential network congestion by using a graph-based representation of the network. The key lies in employing a Knowledge Graph (KG) for modeling the network, ensuring a comprehensive understanding of each link's state. Contrary to many ML models that focus on node properties, our strategy emphasizes both nodes and their interrelationships. Utilizing graph embeddings, we convert the network elements into vectors, capturing the network's topology. This method aids traditional ML algorithms in forecasting congestion events while considering each sample's topological data. These forecasts can guide administrators in proactively adjusting traffic, enhancing the network's efficacy.
The paper proceeds with Section 2 discussing prior work on automated network traffic management. Section 3 delves into our methodology, Section 4 outlines the experimental design and findings, Section 5 discusses the implications of the approach, and Section 6 concludes the study.

RELATED WORK
The growing complexity in managing communication networks, particularly in routing and traffic optimization, is intensified by the rise of new technologies like the Internet of Things (IoT) and 5G [6]. Network management strategies need to adapt to changing conditions, optimizing resources while ensuring Quality of Service (QoS) and Quality of Experience (QoE) [7].
Advances in machine learning (ML) have brought significant improvements to network management, including traffic prediction, anomaly detection, and resource allocation [8]. Various traditional ML methods have been utilized, ranging from Support Vector Machines (SVM) and Long Short-Term Memory Networks (LSTM) to reinforcement learning, for optimizing resources and satisfying QoS requirements [9]. Techniques like federated learning are being applied for power consumption optimization in edge-based infrastructures [10][11][12][13].
However, the dynamic nature and complexity of communication networks challenge traditional ML models, which often need extensive, annotated training data [14]. Graph Neural Networks (GNNs) have gained traction for their effectiveness in managing complex, interconnected data typical in communication networks. GNNs are being used for tasks such as predicting network congestion and integrating with anomaly detection methods for identifying events like processing and memory failures [15][16][17][18][19][20].
The integration of Knowledge Graphs (KG) and ML techniques forms a potent combination for automated network management, enabling the representation and reasoning about diverse data sources [21]. While many resource allocation problems are approached through traditional methods like queuing theory or Q-learning [22], recent works explore using edge graph neural networks and unsupervised learning for resource allocation and QoS prediction [23,24]. However, the full potential of an end-to-end GNN implementation, combining graph-based embeddings with KG-based event detection, remains largely untapped. This approach could provide a scalable solution for comprehensive network management, addressing topology modeling and anomaly prediction.

PROPOSED METHODOLOGY
We propose an ML-driven pipeline utilizing knowledge graphs and link prediction for traffic management between data centers. This pipeline allows for recurring inference and continuous control. Knowledge graphs serve to represent the network intuitively and provide rapidly accessible information, which is essential for human operators and for generating actionable insights for data-driven methods.
Our method involves four steps (Fig. 2): 1) Utilizing simulated network topologies and communication network assumptions as initial data due to limited real-world data availability. 2) Transforming this simulated data into a knowledge graph, representing entities (data center nodes, routers, network nodes) and their interrelations. 3) Extracting topological embeddings from the knowledge graph. 4) Applying link prediction on these embeddings, identifying potential network issues, like congestion, thus aiding in proactive traffic management.
The network topology includes four node types: Data Centers, Routers, Subnetwork Nodes, and Exchange Points, linked by different edge types and properties, such as "data packet" (linking Data Centers to Routers with attributes like processing and memory usage), "regional connection" (Routers to Subnetwork Nodes), and "backbone connection" (Subnetwork Nodes to Exchange Points).
The assumptions made during the generation of the above infrastructure are listed below:
• One Data Center node originates many data packets, each processed by only one Router node.
We simulated two types of network events: "Processing Failure" (Router's processing demand exceeds capacity) and "Memory Failure" (data packets' cumulative memory demand surpasses Router's memory). Both events assume up to thirty percent capacity overrun, simulating unexpected traffic spikes and potential network strain or failures.
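The event rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulation code used in this work: the `Router` class, the helper names, and all capacity and demand values are hypothetical. Only the rule itself (cumulative demand exceeding capacity, with overruns of at most thirty percent) comes from the description.

```python
from dataclasses import dataclass, field

MAX_OVERRUN = 0.30  # events assume at most a 30% capacity overrun


@dataclass
class Router:
    name: str
    processing_capacity: float
    memory_capacity: float
    # Each packet is a (computing_demand, data_size) pair routed to this Router.
    packets: list = field(default_factory=list)


def detect_events(router):
    """Return the event labels triggered by the packets assigned to a Router."""
    processing = sum(p[0] for p in router.packets)
    memory = sum(p[1] for p in router.packets)
    events = []
    if processing > router.processing_capacity:
        events.append("Processing Failure")
    if memory > router.memory_capacity:
        events.append("Memory Failure")
    return events


def overrun_ratio(demand, capacity):
    """Fraction by which demand exceeds capacity (0.0 when within capacity)."""
    return max(0.0, demand / capacity - 1.0)


# Hypothetical Routers: r1 is overloaded on processing, r2 on memory.
r1 = Router("router-1", processing_capacity=100.0, memory_capacity=64.0,
            packets=[(60.0, 20.0), (50.0, 30.0)])   # 110 processing, 50 memory
r2 = Router("router-2", processing_capacity=100.0, memory_capacity=64.0,
            packets=[(40.0, 40.0), (30.0, 40.0)])   # 70 processing, 80 memory

events_r1 = detect_events(r1)
events_r2 = detect_events(r2)
```

In this toy instance both overruns (10% and 25%) stay below the `MAX_OVERRUN` bound.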

Knowledge Graph Representation
We leveraged the simulated infrastructure of the previous step to populate a Knowledge Graph (KG) that can be queried using the Cypher query language [25].Each corresponding entity of the modelling step is assigned to a different node type, while their in-between relationships are represented as different edge types in the graph.
Overall, we represented the connection of 3200 Data Center nodes with 2400 Routers via 9660 [:Data Packet] relationships. These Routers are connected to 120 different Subnetwork Nodes through 3600 [:Regional Connection] relationships. Finally, the Subnetwork Nodes are allocated to 80 Exchange Points via 240 [:Backbone Connection] relationships. It is also noted that each Router includes a total processing usage and a total memory usage property, while each [:Data Packet] relationship contains a computing demand and data size property, denoting the upcoming infrastructure needs.
Apart from the infrastructure-type nodes, the KG was enriched with two event-type nodes: the Processing Failure Event node and the Memory Failure Event node, corresponding to cases of processing and memory overusage of Router nodes, respectively. Given that link prediction relies on training data derived from a subset of edges that have ground-truth labels to predict similar connections on unseen data, the simulated instance encompassed 80 Router nodes with overusage properties: 40 of them with processing overusage and 40 of them with memory overusage. However, only 30 Router nodes of each type were connected to their corresponding event node, serving as a train set for the link prediction algorithm. The remaining edges were intentionally excluded in order to validate the performance of the link prediction model. The virtual graph that represents all node labels and relationship types available in the above-described knowledge graph is shown in Fig. 3.
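The hold-out scheme for the event edges can be illustrated as follows. This is a sketch only: the Router identifiers, the event-node names, the seeds, and the `split_event_edges` helper are hypothetical, and the random selection is an assumption, since the text does not state how the 30 linked Routers of each type were chosen.

```python
import random


def split_event_edges(router_ids, n_train, seed=0):
    """Shuffle overused Routers and split them into linked (train) and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(router_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 40 Routers per overusage type, as in the simulated instance; the IDs are made up.
processing_overused = [f"router-p{i}" for i in range(40)]
memory_overused = [f"router-m{i}" for i in range(40)]

proc_train, proc_held_out = split_event_edges(processing_overused, n_train=30)
mem_train, mem_held_out = split_event_edges(memory_overused, n_train=30, seed=1)

# Only the training Routers receive an edge to their event node in the graph.
train_edges = (
    [(r, "ProcessingFailureEvent") for r in proc_train]
    + [(r, "MemoryFailureEvent") for r in mem_train]
)
```

The held-out Routers keep their overusage properties but have no event edge, so a link predicted for them can be checked against the known ground truth.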

Creation of Graph Embeddings
The graph from the previous phase models our infrastructure using graph embeddings to transform nodes and edges into low-dimensional vectors, maintaining the graph's topology. We used GraphSAGE [26], a neural graph embedding approach, leveraging unsupervised learning by aggregating features from a node's local neighborhood through random walk processes.
GraphSAGE builds a tree for each graph node $v \in V$, with depth equal to the search depth $K$, and children as adjacent nodes. Fixed-size uniform sampling of immediate neighbors reduces computational demands (Fig. 4, Step 1). Node embeddings are formed by aggregating features from this tree.
Node representation updates depend on network topology and neighboring features. For each depth $k \in \{1, ..., K\}$, nodes at the $(k-1)$-th layer are updated based on $k$-th layer features (Fig. 4, Step 2 and 3). The final node embeddings are derived by aggregating the updated representations of its immediate neighbors.
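For reference, the per-layer update can be written in the notation of the original GraphSAGE formulation [26]. Here $\mathcal{N}(v)$ denotes the sampled neighborhood of node $v$, $\mathbf{h}_v^{0}$ is the input feature vector of $v$, $\sigma$ is a nonlinear activation, and $k = 1, ..., K$ indexes the layers:

```latex
\[
\mathbf{h}_{\mathcal{N}(v)}^{k} = \mathrm{AGGREGATE}_k\left(\left\{\mathbf{h}_u^{k-1} : u \in \mathcal{N}(v)\right\}\right),
\qquad
\mathbf{h}_v^{k} = \sigma\left(\mathbf{W}^{k} \cdot \mathrm{CONCAT}\left(\mathbf{h}_v^{k-1},\, \mathbf{h}_{\mathcal{N}(v)}^{k}\right)\right)
\]
```

The embedding of node $v$ is the representation obtained after the last layer, $\mathbf{z}_v = \mathbf{h}_v^{K}$.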
Updates involve aggregating immediate neighbors' previous representations, concatenating this with the node's representation, and transforming through a nonlinear activation in a fully connected layer to a fixed size. Training employs a graph-based loss function in an unsupervised setting, tuning the per-layer weight matrices $W^k$ through stochastic gradient descent, incorporating negative sampling to differentiate node vector representations.
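A single update of this kind can be sketched in plain Python. The sketch is illustrative rather than the implementation used in this work: it substitutes a simple mean aggregator for brevity, the weights are fixed by hand instead of learned, and the graph, feature values, and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_aggregate(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sage_layer(features, neighbors, weights):
    """One GraphSAGE-style update: aggregate, concatenate, transform, activate.

    features:  dict node -> previous representation (list of floats)
    neighbors: dict node -> list of sampled neighbor nodes
    weights:   matrix with one row per output dimension and
               2 * len(previous representation) columns
    """
    updated = {}
    for node, h_prev in features.items():
        h_neigh = mean_aggregate([features[u] for u in neighbors[node]])
        concat = h_prev + h_neigh  # list concatenation
        updated[node] = [
            sigmoid(sum(w * x for w, x in zip(row, concat))) for row in weights
        ]
    return updated


# Toy graph: one Router connected to two Data Centers (all values are made up).
features = {"router": [1.0, 0.0], "dc1": [0.0, 2.0], "dc2": [0.0, 4.0]}
neighbors = {"router": ["dc1", "dc2"], "dc1": ["router"], "dc2": ["router"]}
weights = [[1.0, 0.0, 0.0, 0.0],   # output dim 0 reads the node's own feature 0
           [0.0, 0.0, 0.0, 1.0]]   # output dim 1 reads the neighbors' mean feature 1

h1 = sage_layer(features, neighbors, weights)
```

Stacking several such layers lets each node's representation depend on progressively larger neighborhoods.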
GraphSAGE's inductive approach enables embedding generation for unseen nodes, leveraging local node attribute information, making it adaptable to unseen data unlike static-graph dependent transductive embeddings.

Event Prediction
The embeddings created as described in the previous section were utilized for link prediction, aiming to anticipate two types of events: processing failure and memory failure events. Every vector element was used as a feature for predicting links, with the dataset size corresponding to the number of graph nodes and vector lengths.
Link prediction algorithms aim to predict future or missing links among nodes in a graph. This prediction largely depends on the nodes' properties and the network structure. Nodes sharing similar characteristics typically form connections. For instance, Fig. 5 illustrates a scenario where a Router node (characterized by properties: total processing usage = 110, total memory usage = 60, 1st embedding element = 8.23, 2nd embedding element = 0.75) links to a Processing Failure node. Another Router node with similar properties is a candidate for a similar link. A Graph Neural Network (GNN) method was used for link prediction, utilizing the PyTorch Geometric (PyG) library [27].
GNNs are designed to produce node representations integrating local graph structure and node attributes [28]. Through iterative updating of node embeddings, informed by neighboring nodes, GNNs effectively model node interdependencies. After training the GNN with GraphSAGE embeddings for each node, we acquired a refined set of node embeddings. These embeddings were then employed to compute similarity scores between node pairs, with Cosine Similarity as the chosen metric [29]. Node pairs were ranked based on these scores, with higher scores indicating a stronger likelihood of future or missing links.
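The scoring and ranking step can be illustrated with a short sketch. The two-dimensional embeddings, node names, and helper functions below are hypothetical and chosen only to make the geometry easy to follow; in practice the vectors would be the refined embeddings produced by the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(target, candidates):
    """Sort candidate nodes by decreasing similarity to the target embedding."""
    scored = [(name, cosine_similarity(vec, target)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings for one event node and three Router nodes.
processing_failure = [1.0, 0.0]
routers = {
    "router-a": [2.0, 0.0],   # same direction as the event node
    "router-b": [1.0, 1.0],   # 45 degrees away
    "router-c": [0.0, 3.0],   # orthogonal
}

ranking = rank_candidates(processing_failure, routers)
```

Because cosine similarity depends only on direction, `router-a` ranks first even though its vector is twice as long as the event node's.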
This methodology facilitates the anticipation of processing or memory failure events. For instance, a high similarity score between a Data Center node and a Router node suggests a probable future association and potential overload in processing resources, indicating a network bottleneck risk. Likewise, a potential memory overuse event is anticipated if a high similarity score is detected between a Data Center node and a Router node, implying a demand for memory that might surpass the Router's capacity.
In essence, our link prediction strategy allowed us to anticipate and respond to potential failure events in our infrastructure, facilitating proactive resource management and performance optimization.

EXPERIMENTAL SETUP
In this work, we used the NetworkX Python package [30] to construct the graph of the infrastructure and generate the events. We utilized the Neo4j Python Driver [31] to import the graph into the Neo4j graph database management system, thereby representing it as a knowledge graph.
To create node embeddings, we relied on Neo4j's built-in GraphSAGE algorithm. We set up a 3-layer GraphSAGE architecture with a pool aggregation strategy, a random walk search depth of 5, and a sigmoid activation function [32]. The model was trained in batches of size 10 for 30 epochs, using a learning rate of 0.1, to produce node embeddings of dimension 16. The training process took approximately 2.5 seconds, and the trained model was used to derive embeddings for all nodes of the sub-graph. These were then added as additional properties of type embeddingGraphSage to each node. It should be noted that, despite the existence of various graph embedding methods such as Fast Random Projection [33] and Node2Vec [34], the notable difference of GraphSAGE is that it can be applied to heterogeneous graphs and produce embeddings for heterogeneous nodes.
Following the embedding process, we utilized the PyTorch Geometric (PyG) library's link prediction method to forecast the processing failure and memory failure events. We trained the link prediction model on the generated node embeddings, and used cosine similarity as the measure to compute similarity scores between node pairs. These scores were used to rank potential future links, with high scores indicating likely future connections. We trained our model for 300 epochs using the mean squared error loss function, on 1800 Router nodes: 1740 of them with normal usage properties plus 60 with processing and memory overusage properties. The test set comprised 580 Router nodes with normal usage properties plus the 20 excluded nodes: 10 with processing overusage and 10 with memory overusage properties.
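To make the training objective concrete, the snippet below computes a mean squared error between similarity scores and binary link labels. It is a plain-Python sketch rather than PyG code; the scores, the labels, and the assumption that link labels are encoded as 0/1 targets are illustrative only.

```python
def mean_squared_error(scores, labels):
    """Average squared difference between predicted scores and 0/1 link labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


# Made-up similarity scores for four candidate (Router, event) pairs.
scores = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 1, 0]   # 1 = link exists, 0 = no link

loss = mean_squared_error(scores, labels)
```

The loss shrinks as scores for existing links approach 1 and scores for absent links approach 0.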
The training process was run on an Intel Core i7 processor with 16GB RAM and the results obtained are presented in the following subsections.
By using this approach, we were able to successfully apply machine learning methodologies to a graph-based representation of our infrastructure. This not only allowed us to predict potential traffic congestion events, but also provided a clear and intuitive understanding of the underlying infrastructure and its usage patterns.

Event Detection Evaluation
In this subsection we present the simulation results for a particular instance of the setting. The evaluation metrics used to represent the results of the implemented link prediction algorithm are the Confusion Matrix, Accuracy, Precision, Recall, and F1 Score. A brief description of each is as follows:
Confusion Matrix: A confusion matrix details classification errors; $C_{ij}$ in cell $(i,j)$ denotes the count of group $i$ observations predicted as group $j$. In binary classification, True Negatives (TN, $C_{0,0}$) and True Positives (TP, $C_{1,1}$) are correct negative and positive predictions. False Positives (FP, $C_{0,1}$) and False Negatives (FN, $C_{1,0}$) are negatives incorrectly predicted as positive and positives incorrectly predicted as negative, respectively.
Accuracy: The ratio of correct predictions (TP + TN) to total predictions (TP + FP + FN + TN).
F1 Score: Balances Precision and Recall, optimal when they are equal.
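The metrics can be computed directly from a binary confusion matrix, as in the sketch below. It uses the standard definitions (Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 as the harmonic mean of the two) and assumes every denominator is non-zero. The counts are illustrative and are not the results reported in this paper.

```python
def binary_metrics(confusion):
    """Compute metrics from a 2x2 matrix where confusion[i][j] counts
    observations of true class i predicted as class j (0 = negative, 1 = positive).
    Assumes all denominators are non-zero."""
    tn, fp = confusion[0]
    fn, tp = confusion[1]
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only; these are not the results reported in this paper.
example = [[90, 2],   # TN, FP
           [1, 7]]    # FN, TP
metrics = binary_metrics(example)
```

With few positive samples, a handful of false positives lowers Precision noticeably while Accuracy stays high, which is why the per-class figures matter for the failure classes.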
We composed a summary metrics table (Table 1) that includes performance metrics for each class. From Table 1, it can be seen that the accuracy of the link prediction model exceeds 95%. Additionally, it should be noted that the model performs similarly in predicting both types of Failures, while being more efficient in predicting Normal cases. This is an indication that the learned embeddings are not overfit on the training data but are robust enough to capture the complex interdependencies within the infrastructure.
The model exhibits a high recall score for both 'Memory Failure' and 'Processing Failure' cases, signifying that it can correctly identify a substantial proportion of overusage events. The lower precision in these classes is a result of the model predicting more false positives, which might be an acceptable trade-off in this context as a preventive measure. It is more crucial to flag potential overusage events to prevent them from escalating into larger issues that could disrupt the system performance or availability.
On the other hand, the 'Normal' class shows high precision, recall, and F1 score, indicating the model's effectiveness in correctly identifying normal system usage scenarios and reducing false alarms.
It is important to remember that the performance metrics of the model are tied directly to the quality of the node embeddings created by the GraphSAGE algorithm. Given that the embeddings encode both topological and node feature information, the success of the link prediction model in identifying possible overusage events speaks to the expressive power of the embeddings.
Despite the imbalanced distribution of classes in the dataset, the weighted average of precision, recall, and F1 score surpasses the 96% mark, further reinforcing the efficacy of the link prediction model in this task. These high scores suggest that the model is able to generalize well across the different classes, exhibiting its ability to handle both normal and overusage events effectively.
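The effect of class imbalance on a weighted average can be seen in a small sketch. The class sizes follow the test set described in the experimental setup (580 normal, 10 processing overusage, 10 memory overusage), but the per-class F1 values are made up for illustration and are not the figures of Table 1. The helper names are hypothetical.

```python
def weighted_average(per_class_scores, support):
    """Average per-class scores, weighting each class by its number of samples."""
    total = sum(support.values())
    return sum(per_class_scores[c] * support[c] for c in support) / total


def macro_average(per_class_scores):
    """Unweighted mean over classes, which treats rare classes as equally important."""
    return sum(per_class_scores.values()) / len(per_class_scores)


# Class sizes follow the described test set; the F1 values are made up.
support = {"Normal": 580, "Processing Failure": 10, "Memory Failure": 10}
f1_scores = {"Normal": 0.98, "Processing Failure": 0.70, "Memory Failure": 0.70}

weighted_f1 = weighted_average(f1_scores, support)
macro_f1 = macro_average(f1_scores)
```

In this toy case the weighted average stays close to the majority-class score, while the macro average is pulled down by the two rare classes. Reading the weighted figure alongside the per-class metrics therefore gives a fuller picture.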
For an alternative visualization of model performance, we provide a confusion matrix for the link prediction model (Fig. 6).

DISCUSSION
In this section, we discuss the potential benefits and implications of our proposed methodology from two main perspectives: network service providers and end users.
For providers, our graph-based methodology effectively captures and interprets the multifaceted and evolving nature of network management and client needs. It enables proactive detection of communication events, aiding in pattern recognition and outlier prediction. This anticipatory strategy not only reduces SLA violations (availability, response time, reliability, cost limits) but also enhances management of diverse resources, leading to improved network efficiency, billing models, and readiness for future demands.
End users benefit from increased reliability and trust in network services through our event detection strategy. By ensuring high QoS and alerting users to potential abnormal events, our approach enhances user experience and transparency. Such events, indicative of compromised services, malfunctioning components, or adversarial attacks, offer users crucial insights. Our method can be employed to add security layers by identifying harmful or malicious activities through combined resource usage patterns, represented in network embeddings.
Thus, our approach addresses a dual need: equipping service providers with the tools to efficiently manage their networks and helping end users experience a reliable, secure, and well-maintained communication service.

CONCLUSIONS
In this work, we propose an intelligent knowledge graph (KG)-based event prediction methodology for communication networks. We encapsulate the intricate features of a communication network into a KG, where node embeddings are generated using GraphSAGE, an inductive learning approach. This technique takes into account both the individual features of each node and the features of their local neighborhood, allowing for a robust representation that can generalize to evolving graphs characterized by previously unseen data.
The resulting node embeddings transform the graph entities (nodes, edges) into fixed-length vectors, enabling the application of data-driven machine learning algorithms for event detection in the network. We then illustrate the efficacy of our approach in an event prediction scenario, using PyG's link prediction functionality to identify potential network events. Our results show high accuracy in the detection of different types of network failures, such as memory and processing failures.

Figure 2: Overview of the proposed KG-based modelling and event detection methodology.

Figure 4: The three-step process of the GraphSAGE inductive representation method.

Figure 5: The link prediction process.

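$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \quad (1)$$
Precision: Indicates the proportion of TP out of all positive predictions, key when FP must be low.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$
Recall: Measures TP proportion out of actual positives, critical when reducing FN is vital.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$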
• A Router node processes numerous data packets; a Subnetwork node connects multiple Router nodes; multiple Subnetwork nodes associate with each Router node.
• Exchange Points handle traffic from several Subnetwork nodes, allowing for diverse traffic paths based on the network's topology and traffic conditions.

Table 1: Link prediction evaluation metrics.