Multi-Agent Deep Reinforcement Learning for Weighted Multi-Path Routing

Traditional multi-path routing methods distribute traffic evenly across the multiple paths of a network, which can lead to inefficient use of resources when some paths are significantly longer or less reliable than others. Weighted multi-path routing addresses this issue by introducing weights that distribute traffic across the available paths according to their state. This paper proposes a novel approach to weighted multi-path routing based on a multi-agent actor-critic framework, designed to meet the Quality of Service requirements of contemporary, bandwidth-intensive applications.


INTRODUCTION
In recent years, we have witnessed the emergence of Next-Gen applications, such as XR services [12]. In the context of the Cloud and Edge Computing paradigms, these applications [20] are designed as numerous sub-components that may be distributed across a multitude of locations. Furthermore, they are often accompanied by various Quality of Service (QoS) requirements [21], such as the need for high bandwidth and reduced latency. Thus, it is of paramount importance to leverage networking solutions that enable these sub-components to communicate with each other in a manner that satisfies the aforementioned QoS requirements in terms of bandwidth and latency. Unfortunately, traditional networking approaches suffer from a number of limitations that prevent them from providing reduced latency and high bandwidth in a reliable and deterministic manner.
Software Defined Networking (SDN) [23] was introduced to mitigate these limitations. The concept of SDN involves separating the control plane of network equipment from the data plane, which allows for programmable network control and flexible distribution of network traffic. While routing is a fundamental aspect of both traditional and SDN networks, the shortest-path algorithms used in mainstream SDN routing modules are not ideal, as they may cause link congestion and reduce link utilization. Therefore, better routing strategies are needed in SDN networks to improve network performance and to ensure that QoS requirements are met.
Multi-path routing [5] is a method of routing that uses multiple paths between a pair of nodes to improve efficiency, as it can provide load balancing, improve bandwidth utilization, and mitigate congestion in networks. There are two important issues that need to be addressed when designing multi-path routing: finding multiple paths and distributing traffic among them. In Open Shortest-Path First (OSPF) networks, routers calculate the paths using Dijkstra's shortest path algorithm based on the cost of each link. If there are multiple shortest paths between the source and destination nodes, routers apply the Equal-Cost Multi-Path (ECMP) split rule, which splits the traffic equally among all available next hops corresponding to the shortest paths. The even splitting is based on a hash function applied to the packet header, which evenly maps hash values to all available next hops. However, even traffic distribution cannot achieve optimal load balancing.
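To make the ECMP split rule concrete, the following is a minimal Python sketch (with a hypothetical ecmp_next_hop helper, not drawn from any specific router implementation) of hash-based next-hop selection: hashing a flow's 5-tuple keeps all packets of that flow on the same path, while different flows are mapped evenly across the equal-cost next hops.

```python
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    # Hash the flow's 5-tuple so every packet of the same flow follows
    # the same path, while flows are spread evenly over the next hops.
    key = "-".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Example: a flow identified by (src_ip, dst_ip, proto, src_port, dst_port).
path = ecmp_next_hop(("10.0.0.1", "10.0.0.2", "tcp", 5555, 80), ["s2", "s3"])
```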
Weighted multi-path routing [2] is a type of routing algorithm used in computer networks to balance traffic over multiple paths between a source and a destination. In this approach, each available path is assigned a weight or metric that reflects its current state, such as congestion level, available bandwidth, delay, or packet loss. The goal of weighted multi-path routing is to distribute traffic across multiple paths such that the overall network performance is improved. By using multiple paths, the routing algorithm can avoid congested or faulty links and reduce the likelihood of network congestion or packet loss. The weights or metrics associated with each path can be adjusted dynamically based on the network conditions, which allows the algorithm to adapt to changing traffic patterns or link failures. If a link becomes congested, the routing algorithm can adjust the weights of the affected paths to reduce their traffic load and redirect traffic to less congested paths.
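By contrast with the equal split of ECMP, a weighted split can be sketched as a weighted random choice over the next hops, where the weights reflect the path state described above. The helper below is purely illustrative.

```python
import random

def weighted_next_hop(next_hops, weights):
    # Pick a next hop with probability proportional to its weight; weights
    # could be derived from available bandwidth, delay, or loss measurements.
    return random.choices(next_hops, weights=weights, k=1)[0]

# Example: send roughly 70% of traffic via path A and 30% via path B.
chosen = weighted_next_hop(["path_a", "path_b"], weights=[0.7, 0.3])
```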
Deep Reinforcement Learning (DRL) is a subfield of machine learning that uses neural networks to model and optimize decision-making in environments where feedback is delayed or sparse. In DRL, an agent learns to interact with an environment in order to maximize a reward signal. This is accomplished by the agent taking actions in the environment and receiving feedback in the form of rewards or penalties based on the quality of those actions. Through trial and error, the agent learns to take actions that lead to higher cumulative rewards over time.
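This trial-and-error loop can be sketched as follows, assuming a simplified environment interface exposing reset() and step(), and a hypothetical agent object with act() and learn() methods (names introduced here for illustration only).

```python
def run_episode(env, agent, max_steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # pick an action from the policy
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)  # trial-and-error update
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```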
Multi-Agent DRL [7] extends the idea of DRL to situations where there are multiple agents interacting with each other in a shared environment. In multi-agent environments, agents must not only learn to interact with the environment but also with each other in order to achieve their goals. This paper is dedicated to proposing a novel multi-agent DRL algorithm for weighted multi-path routing. Section 2 is dedicated to briefly exploring some of the related works. Section 3 describes the proposed multi-agent algorithm for weighted multi-path routing. Section 4 describes the experimental evaluation of a single-agent prototype version that showcases that the proposed representation paradigm is quite efficient. Finally, Section 5 summarizes the main points of this work, and describes future research directions.

RELATED WORK
There have been various advancements in RL-based routing optimization mechanisms for software-defined networks (SDN) that aim to simplify network operations and maintenance, while improving network performance metrics such as delay and throughput. For instance, Chiu et al. [4] proposed Reinforcement Discrete Learning-Based Service-Oriented Multi-Path Routing (RED-STAR) in order to understand the policy of distributing an optimal path for each service. Yu et al. [24] introduced the deep deterministic policy gradient routing optimization mechanism (DROM), while Sun et al. [18] developed the time-relevant deep reinforcement learning for routing optimization (TIDE) architecture. Similarly, Stampa et al. [16] designed a DRL agent that adapts to traffic conditions and minimizes network delay, while Pham et al. [14] used a DRL agent with convolutional neural networks for QoS-aware routing. Guo et al. [8] proposed a DRL-based QoS-aware secure routing protocol (DQSP) that optimizes routing policy dynamically by learning from traffic demands.
Rischke et al. [15] developed the QR-SDN approach that allows multiple routing paths while preserving flow integrity, and Ibrar et al. [9] proposed the intelligent hybrid SDN-based fog computing IoT system (IHSF) for improved performance of time-sensitive flows. Meanwhile, [19] proposes an RL-based agent that optimizes the routing strategy without human intervention, using a Deep Learning prediction model to capture the periodic nature of traffic [22]. In Ref. [17], the authors use Deep Reinforcement Learning for automated routing in optical transport networks, where the DRL agent selects the best path among candidate paths to maximize network usage. [1] introduces an RL algorithm to achieve efficient and privacy-preserving embedding of virtual graphs over multi-operator telecom infrastructures, while Doke et al. [6] exploit Deep Reinforcement Learning to design a cost-effective network load-balancer that continuously adapts to dynamic environments. All of the aforementioned DRL solutions are designed to be implemented on the basis of single-agent algorithms.
Cooperative control in multi-agent systems offers an alternative approach for solving complex problems that are difficult for a single agent to handle. In multi-agent learning, each router in the network can be treated as a separate agent that observes parts of the environment and decides on actions based on its own routing policy. One approach that is based on multi-agent DRL routing is referred to as DQN-routing [13]. In DQN-routing, each router is treated as an agent whose parameters are shared by all routers and updated simultaneously during centralized training. Comparative analysis with contemporary routing algorithms has confirmed a significant performance boost. In a similar vein, a multi-agent meta-proximal policy optimization (meta-MAPPO) [3] approach was proposed to optimize network performance under time-varying traffic demand. Inspired by the latter two, we propose a multi-agent actor-critic algorithm for weighted multi-path routing. Contrary to the aforementioned solutions, the proposed solution is designed to guarantee that a specified amount of bandwidth shall be available at all times despite changes in network dynamics and traffic.

MULTI-AGENT ACTOR-CRITIC ALGORITHM FOR WEIGHTED MULTI-PATH ROUTING
Actor-critic [10] is a class of Deep Reinforcement Learning (DRL) algorithms that combines elements of both policy-based and value-based methods. In the actor-critic algorithm, there are two main components: the actor and the critic. The actor is responsible for selecting actions, while the critic is responsible for evaluating the actor's actions. The actor network is a deep neural network that takes in the current state of the environment and outputs a probability distribution over possible actions. The critic network, on the other hand, takes in the current state of the environment and outputs an estimate of the expected reward for that state. During training, the actor uses the critic's evaluation to update its policy.
Specifically, the actor's policy is updated in the direction of the expected reward as estimated by the critic. This allows the actor to learn which actions are more likely to lead to higher rewards. The critic is also updated during training. It uses the difference between the predicted reward and the actual reward obtained by the agent to update its estimate of the expected reward for that state. This allows the critic to learn to better estimate the rewards associated with different states. The aforementioned process is depicted in Fig. 1.

Figure 1: The Actor-Critic Algorithm.
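The two networks can be realized, for instance, as small feed-forward networks. The following PyTorch sketch is illustrative rather than our exact architecture: the actor ends in a softmax that yields a probability distribution over actions, while the critic outputs a single scalar value estimate.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

    def forward(self, state):
        return self.net(state)  # probability distribution over actions

class Critic(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state)  # estimated expected return V(s)
```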
The algorithm works by interacting with the environment over a number of episodes, where each episode consists of a sequence of steps. At each step, the agent observes the current state of the environment and selects an action to take based on its current policy. The action is then executed, and the agent receives a reward and transitions to a new state. The goal of the actor-critic algorithm is to learn a policy function $\pi(a|s)$, which maps a state $s$ to a probability distribution over actions so as to maximize the expected cumulative reward $R = \sum_{t=0}^{T} \gamma^t r_t$, where $r_t$ is the immediate reward at time step $t$ and $\gamma$ is the discount factor, which determines the importance of future rewards.

In order to construct actor-critic algorithms that are capable of facilitating weighted multi-path routing, it is of paramount importance to introduce an appropriate problem formulation. According to the proposed formulation, the actor-critic algorithm is in charge of deciding how the network traffic shall be split among the two available paths that connect two network nodes. It is worth mentioning that the proposed algorithm can be extended to facilitate more than two available paths by incorporating a softmax function; that way, the actor-critic algorithm can be formulated in a manner compatible with the discrete action space paradigm, which in turn can accommodate multiple paths. However, in the context of this work, we only consider the use of two available paths between a source and a destination node. The traffic split is expressed as the actual percentage of the overall network traffic that each available path shall carry. As such, in the context of the actor-critic algorithm, each action corresponds to the percentage of network traffic that shall be sent via the first path, while the rest of the traffic shall be sent via the second path. The state of the network corresponds to the available bandwidth on each available path. Finally, the reward corresponds to the desired amount of bandwidth minus the amount of available bandwidth on the first path. The point of this approach is to leverage a desired bandwidth threshold, associated with the QoS requirements of a service in terms of bandwidth, to guarantee that the desired amount of bandwidth shall remain available so that the service can operate in an optimal manner when using this path.
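A minimal sketch of this formulation follows. The helper names are hypothetical, and the reward is read here as the negated shortfall of available bandwidth on the first path below the desired threshold, so that maximizing the return keeps the threshold satisfied.

```python
DESIRED_BW = 7.0  # Gbps; the QoS bandwidth threshold used in Section 4

def make_state(avail_bw_path_a, avail_bw_path_b):
    # State: the available bandwidth on each of the two paths.
    return [avail_bw_path_a, avail_bw_path_b]

def clip_action(split_percentage):
    # Action: the fraction of traffic sent via the first path;
    # the remaining (1 - split) is sent via the second path.
    return max(0.0, min(1.0, split_percentage))

def compute_reward(avail_bw_path_a):
    # Reward: negated (desired - available) bandwidth on the first path,
    # i.e. a penalty whenever less than DESIRED_BW remains available.
    return -(DESIRED_BW - avail_bw_path_a)
```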
At each time step $t$, the actor network outputs a probability distribution over all available paths $\pi_\theta(a_t|s_t)$, where $a_t$ is the chosen action (percentage) and $s_t$ is the current state of the network. The critic network estimates the value function $V_\phi(s_t)$, which is the expected total reward from the current state onwards. The aforementioned $\theta$ and $\phi$ are the dedicated policy and value-function parameters, respectively.
Advantage Actor-Critic (A2C) is a type of actor-critic algorithm that further enhances the learning process by using an advantage function to update the policy. In A2C, the advantage function is defined as the difference between the actual return obtained from taking an action in a given state and the estimated value of that state. The advantage function provides an estimate of how much better or worse an action is compared to the average action taken in that state. By incorporating the advantage function, A2C can learn policies that make more informed decisions by focusing on actions that provide better returns.
The actor-critic algorithm updates the actor and the critic at each time step using the following equations.

Actor update: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) A_t$, where $\alpha$ is the learning rate for the Actor, and $A_t$ is the advantage function: $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.

Critic update: $\phi \leftarrow \phi + \beta \delta_t \nabla_\phi V_\phi(s_t)$, where $\beta$ is the learning rate for the Critic, and $\delta_t$ is the temporal difference error, which represents the difference between the expected value of the current state and the actual reward obtained from that state: $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.

The actor-critic algorithm is described in Algorithm 1. At the end of training, the learned actor policy can be used to route traffic in the SDN network by selecting the optimal split percentage based on the current state $s_t$.
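The two update equations translate into the following PyTorch sketch, reusing the Actor and Critic networks from the earlier listing; the optimizer objects, tensor shapes, and discount value are illustrative assumptions.

```python
import torch

def a2c_update(actor, critic, actor_opt, critic_opt,
               state, action, reward, next_state, gamma=0.99):
    # state/next_state: 1-D tensors; action: index of the chosen
    # (discretized) split percentage; reward: float.
    value = critic(state)
    next_value = critic(next_state).detach()

    # A_t = delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    advantage = reward + gamma * next_value - value

    # Actor update: gradient ascent on log pi(a_t|s_t) * A_t.
    log_prob = torch.log(actor(state)[action])
    actor_loss = -log_prob * advantage.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic update: move V(s_t) toward the TD target.
    critic_loss = advantage.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```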
Multi-Agent Actor-Critic (MAAC) [11] is a reinforcement learning algorithm designed for multi-agent systems, where multiple agents interact with each other and the environment. MAAC extends the actor-critic algorithm by incorporating a centralized critic that estimates the value function of the joint state and action space, which is then used to update the decentralized actors. The single-agent algorithm that was explored in this Section (Algorithm 1) repeats the following steps at each time step of each episode:
(1) Observe a state $s_t$.
(2) Sample an action $a_t$ from the policy $\pi_\theta(a_t|s_t)$.
(3) Observe the reward $r_t$ and the next state $s_{t+1}$.
(4) Compute the advantage $A_t$ using the estimated value $V_\phi(s_t)$.
(5) Update the Critic parameters using the Critic update equation.
(6) Update the Actor parameters using the Actor update equation.
It can be extended in a way that incorporates multiple agents in the following manner:
(1) Each agent observes a state $s_t$.
(2) Each agent samples an action from its own actor policy.
(3) The joint action is executed, and the agents receive a reward and observe the next state $s_{t+1}$.
(4) Calculate the temporal difference error using the centralized critic.
(5) Update the centralized critic using the Critic update equation.
(6) Update each decentralized actor using the Actor update equation.
A sketch of such a centralized critic is given below.
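The following class is an illustrative sketch (not a full MAAC implementation): it scores the joint state and joint action of all agents, while each agent keeps its own decentralized actor as in the earlier listing. The number of agents and the input dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    def __init__(self, n_agents, state_dim, action_dim, hidden=128):
        super().__init__()
        # The critic sees every agent's state and action at once.
        joint_dim = n_agents * (state_dim + action_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, joint_states, joint_actions):
        # Concatenate all agents' states and actions into one joint input
        # and return a single value estimate for the joint state-action.
        x = torch.cat([joint_states.flatten(), joint_actions.flatten()])
        return self.net(x)
```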

EXPERIMENTAL EVALUATION
In the frame of this work, we conducted an experimental evaluation of a single-agent prototype version of the DRL model in order to showcase the efficiency of the proposed representation paradigm. The experimental setup is depicted in Fig. 2. It involves four switches, one SDN controller, and six virtual machines. This SDN testbed was built using the Ryu SDN controller and Mininet, while the switches were constructed using Open vSwitch. Network traffic consists of TCP packets. Four of the machines (gray) produce randomly generated packets (noise) that traverse the network. The remaining two (blue and red) are the ones whose traffic is distributed among the two available paths (green A and yellow B). The traffic gets split between the two available paths according to the split percentage that has been set by the DRL agent, which was implemented using a dedicated DRL framework.

The actions, rewards, and states of the agent have been implemented in a manner identical to the one explored in the previous section. More specifically, the green and yellow paths can each carry 10 Gbps, while the required lower bandwidth threshold was set to 7 Gbps. The DRL agent decides once every 5 seconds what percentage of network traffic shall be sent via path A. The SDN controller is responsible for gathering bandwidth statistics while these processes take place. The multi-path split is made possible via the use of group tables from the OpenFlow protocol.

Finally, the experimental results are depicted in Fig. 3. The x-axis corresponds to the number of episodes, while the y-axes correspond to the Loss and the Cumulative Reward, respectively. As one can see, as the number of episodes increases, so does the Cumulative Reward, while the Loss drops significantly after the first episodes. These are good indicators that the implemented actor-critic model is capable of devising efficient traffic-splitting strategies that guarantee the QoS requirements in terms of bandwidth.
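For reference, a weighted OpenFlow select group of the kind used in the testbed could be installed from a Ryu application along the following lines. This is a sketch: the port numbers, group id, and the mapping from the agent's split percentage to bucket weights are assumptions, and honoring bucket weights depends on the switch's select-group implementation.

```python
def install_weighted_group(datapath, group_id, port_a, port_b, pct_a):
    # Install/refresh an OpenFlow 1.3 SELECT group whose bucket weights
    # reflect the split percentage pct_a chosen by the DRL agent.
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser
    buckets = [
        parser.OFPBucket(weight=int(pct_a * 100),
                         watch_port=ofp.OFPP_ANY, watch_group=ofp.OFPG_ANY,
                         actions=[parser.OFPActionOutput(port_a)]),
        parser.OFPBucket(weight=int((1 - pct_a) * 100),
                         watch_port=ofp.OFPP_ANY, watch_group=ofp.OFPG_ANY,
                         actions=[parser.OFPActionOutput(port_b)]),
    ]
    datapath.send_msg(parser.OFPGroupMod(
        datapath, ofp.OFPGC_MODIFY, ofp.OFPGT_SELECT, group_id, buckets))
```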

CONCLUSION AND FUTURE WORK
In this work, we presented a novel weighted multi-path routing approach that is based on the multi-agent actor-critic algorithm and is capable of supporting bandwidth-intensive applications. Furthermore, we conducted an experimental evaluation of a single-agent prototype version of the DRL model in order to showcase the efficiency of the proposed representation paradigm. Our next steps will be to conduct a more advanced, large-scale experimental evaluation that examines the efficiency of the multi-agent actor-critic approach.