Pandia: Open-source Framework for DRL-based Real-time Video Streaming Control

Deep Reinforcement Learning (DRL) has rapidly gained traction as a viable method for optimizing control in real-time video streaming. Recent research is shifting towards enabling direct DRL control over multiple streaming parameters, moving away from traditional bitrate-only control. Despite this growing interest, there is a notable absence of a dedicated open-source framework to facilitate such research. In response, we introduce Pandia, a specialized open-source framework designed for the direct manipulation of multiple real-time video streaming parameters using DRL. Pandia effectively bridges advanced DRL frameworks such as RLlib and SB3 with WebRTC, a leading real-time video streaming platform. Our initial use case with Pandia showcases its capability in advancing DRL applications in WebRTC control. Our study identifies significant training challenges even in basic network settings, mainly due to the negative impacts of random exploration. To counter this, we adopt a curriculum learning approach that uses domain knowledge for more effective, guided exploration. Both the training methodology outlined in our use case and the Pandia framework are poised to contribute to ongoing research on applying DRL to WebRTC control.


1 INTRODUCTION
Real-time video streaming stands as a pivotal technology that underpins a plethora of real-world applications, including online meetings, e.g., Zoom [22] and Google Meet [1], video analytics for real-time insights, e.g., FastVA [27] and DDS [11], remote control systems, e.g., remote driving [20], and remote surgery [5]. In such applications, video streaming plays a crucial role in encoding (or compressing) the vast amount of raw RGB data into a compact form, enabling the timely delivery of each video frame, even under constrained network conditions. The core challenge involves dynamically adjusting several key aspects: data generation, which is governed by encoding parameters; packetization, managed through forward error correction (FEC) parameters; and transmission, dictated by egress speed limits (more details in Section 2.1 and Fig. 1). These adjustments must be finely tuned to align with the network's fluctuating conditions, such as bandwidth and round-trip time (RTT).
In contrast to conventional human-designed streaming control algorithms [8,9], recent studies [15,17,30,31] have highlighted the superior streaming performance brought by machine-learned algorithms, particularly those based on deep reinforcement learning (DRL). In support of research on DRL-based real-time video streaming control, AlphaRTC [12] has been introduced as the first open-source framework of its kind within the community. AlphaRTC, built on the widely-used WebRTC framework [7], implements DRL-inferred bitrates by modifying WebRTC's bandwidth estimation mechanism. While it serves as a commendable pioneer, AlphaRTC falls short of meeting the requirements of new research. Firstly, it is limited to supporting bitrate only, despite evidence that additional control parameters can significantly enhance performance [15,17]. Secondly, its approach to bitrate manipulation is indirect, leading to compensations by other built-in bitrate control algorithms within WebRTC. This indirect control constrains the learning potential of DRL in a multi-parameter context [17].
In this paper, we introduce Pandia, an innovative open-source framework designed for direct manipulation of the WebRTC control parameters using DRL. Pandia facilitates control over key parameters such as bitrate, resolution, frames per second (FPS), FEC redundancy ratio, and pacing rate, ensuring that all parameters steered by DRL are directly implemented in the streaming pipeline. Additionally, Pandia is equipped with enhanced features including hardware codec support and comprehensive logging capabilities for in-depth performance analysis.
To validate the effectiveness of Pandia, we present a demonstrative use case involving the training of a DRL model with Pandia. We encountered challenges in training, even in a simple network environment, as the default random exploration can lead the training towards a local optimum. To overcome this, we introduce a curriculum learning-based approach employing an action cap to utilize domain knowledge for guided exploration. This method successfully trains a model that outperforms human-crafted algorithms. The training method not only demonstrates the efficacy of Pandia but also paves the way for future research on effectively applying DRL directly to real-world networking systems.
The contributions of this paper are as follows:
• We design and build the first open-source framework for directly applying DRL actions to multiple video streaming control parameters. The code can be downloaded from https://github.com/johnson-li/Pandia.
• Through an in-depth analysis of Pandia, we demonstrate that random exploration poses significant challenges in training a DRL agent with direct control capabilities. We propose a curriculum learning training framework to address this issue through guided exploration.
The rest of this paper is organized as follows. We first give the background knowledge of real-time video streaming and deep reinforcement learning in Section 2. Then, in Section 3, we present the design, as well as the implementation, of Pandia. Afterwards, we demonstrate the effectiveness of Pandia with a training use case in Section 4. Finally, we explore open challenges in Section 5 before concluding in Section 6.

2 BACKGROUND AND RELATED WORK

2.1 WebRTC
Architecture. Fig. 1 illustrates the WebRTC architecture. On the sending side, the data path encompasses video encoding, packetization, and egress operations. Initially, the video adapter retrieves raw video frames and, depending on the resolution and FPS settings, may downscale or drop frames. Subsequently, the encoder processes the raw frame, with the compression ratio determined by the quantization parameter (QP) and regulated by the bitrate and the FPS. The resulting binary data is then packetized into real-time transport protocol (RTP) packets, during which process the FEC generator may add redundant data to mitigate random packet loss, controlled by the redundancy ratio. Ultimately, all packets are dispatched by the pacer, whose egress speed is governed by the pacing rate and the congestion window. Upon receiving these packets, the receiver executes a reverse pipeline involving packet assembly and video decoding to reconstruct the original frame. The latency incurred throughout this entire process is referred to as the glass-to-glass (G2G) delay.

Receiver feedback. In addition to the sender-to-receiver data path, WebRTC employs the real-time control protocol (RTCP) to transmit feedback from the receiver back to the sender of the video stream. This feedback encompasses several key elements: 1) packet acknowledgments for reporting reception and loss, 2) bandwidth estimation to aid in receiver-assisted congestion control, and 3) controls for the video streaming session, such as forcing a key frame to preclude the need to recover previous frames.

Streaming control algorithms. WebRTC inherently supports two congestion control algorithms (CCAs), Google congestion control (GCC) and performance-oriented congestion control (PCC), for bandwidth estimation and egress control. A number of other algorithms adjust the remaining control parameters based on the estimated bandwidth. For example, the FEC redundancy ratio is derived from the packet loss rate estimated by the CCA, and the bitrate is derived from the estimated bandwidth. Despite the architectural clarity of this design, previous work has demonstrated the benefit of jointly considering multiple control parameters [16,21,23]. Meanwhile, many more machine learning-based control algorithms have been proposed for the control of a single parameter [30-32] and multiple parameters [15,17].

2.2 DRL
DRL addresses real-world control challenges by framing them within the Markov Decision Process (MDP) framework. In this context, an agent gathers observations $o$ (such as streaming statistics) from its environment (e.g., WebRTC) and executes actions $a$ (like adjusting bitrate or pacing rate) to influence that environment. The effect of these actions on the environment, after a brief operational period, is quantified as a scalar reward $r$, which reflects the current quality of video streaming.
Throughout the training process, DRL is tasked with learning a policy $\pi_\theta(o_t) \rightarrow a_t$ that generates actions aimed at maximizing the discounted rewards:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \qquad (1)$$

Here, $\theta$ represents the parameters of the policy. This formula encapsulates the immediate reward as well as future rewards, considering both the short-term and long-term implications of actions. The discount factor $\gamma$, a constant ranging between 0 and 1, determines the extent to which the policy accounts for future rewards, effectively setting the temporal scope of the decision-making process.

Actor-critic method. Most contemporary DRL algorithms employ the actor-critic model. In this framework, the actor denotes the policy function responsible for generating the optimal action in a given state to maximize the expectation of the discounted rewards, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. During the gradient ascent phase, which addresses the problem of maximizing $J(\theta)$, the critic comes into play. The critic's role is to model the value of each state, $V(s)$, and it is pivotal in reducing the variance of the policy gradient estimated by the actor, thereby enhancing the overall training efficacy.
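As a concrete illustration of Equation 1, the following sketch computes the discounted return from a list of per-step rewards; the function and the example trace are ours, for illustration only:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k} for t = 0 (Equation 1)."""
    ret = 0.0
    # Fold the trace backwards so each step adds its reward plus the
    # already-discounted value of everything that follows it.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# With gamma = 0.9, a reward arriving 12 steps in the future is weighted
# by 0.9**12 ~= 0.28, which is why the delayed congestion penalties
# observed in Section 4.2 are hard to attribute to the action that
# caused them.
print(discounted_return([1.0] * 10))  # ~6.51 for ten unit rewards
```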

2.3 Application of DRL in Networking Systems
Deep Reinforcement Learning (DRL) has been successfully applied to the control planes of various networking systems, consistently demonstrating superior performance compared to traditional human-crafted algorithms. Key studies such as [3,4,14,19] highlight the implementation of DRL in transport-layer congestion control. Additionally, [28] and [15,17,30,31] have employed DRL for video streaming control.

Open platforms. Genet [28] is a specialized training framework for DRL in networking systems, but it does not include support for real-time video streaming. AlphaRTC [12] emerges as the pioneering framework that enables the integration of DRL with WebRTC. Our research is primarily built upon the foundations of AlphaRTC, with a specific concentration on the streaming endpoint. This approach marks a shift from AlphaRTC's broader focus on the global training infrastructure.

3 PANDIA DESIGN AND IMPLEMENTATION
As shown in Fig. 2, Pandia adopts an asynchronous design that connects a C++-implemented video streaming program (WebRTC) with a Python-implemented DRL program (the DRL agent). This setup employs shared memory and sockets for efficient inter-process communication (IPC). Events from each frame and packet in WebRTC are conveyed to the DRL agent, forming the observation $o$. Conversely, the action $a$ from the DRL agent is written to shared memory and subsequently read by WebRTC for video streaming control. This integration necessitates modifications to WebRTC and the development of specialized tools within the DRL agent to process the video streaming data.
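A minimal sketch of the agent side of this IPC pattern follows; the shared-memory name, socket path, and action layout are hypothetical stand-ins for illustration, not Pandia's actual definitions:

```python
import socket
import struct
from multiprocessing import shared_memory

# Hypothetical shared-memory block that the WebRTC side polls for the
# latest action; the name, size, and layout are assumptions.
shm = shared_memory.SharedMemory(name="pandia_action", create=True, size=64)

def write_action(bitrate_kbps, pacing_rate_kbps):
    # Pack the action as fixed-width little-endian integers so the C++
    # side can read it without any parsing.
    struct.pack_into("<ii", shm.buf, 0, bitrate_kbps, pacing_rate_kbps)

# Hypothetical non-blocking UNIX socket carrying serialized frame/packet
# events from WebRTC; the path is an assumption.
sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
sock.bind("/tmp/pandia_events.sock")
sock.setblocking(False)

def poll_event(max_bytes=4096):
    """Return one serialized WebRTC event, or None if none is pending."""
    try:
        return sock.recv(max_bytes)
    except BlockingIOError:
        return None
```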

3.1 WebRTC Modification
Pandia is developed using WebRTC version M105 as its foundation. In its construction, we have leveraged the test suites provided by WebRTC and drawn implementation insights from AlphaRTC [12].

Event logging. As detailed in Table 1, we log the timestamps marking each phase in the lifecycle of every frame and packet. A frame's journey begins with its capture timestamp $t_{cap}$. Its encoding start and finish timestamps are $t_{enc0}$ and $t_{enc1}$, respectively. Upon the receiver's accumulation of sufficient packets, the frame's assembly timestamp is recorded as $t_{asm}$. Subsequently, the decoding start and finish timestamps are noted as $t_{dec0}$ and $t_{dec1}$. For packets, the generation timestamp is $t_{gen}$, followed by the sender's dispatch timestamp $t_{sent}$. The receiver logs the packet's arrival at $t_{recv}$, and the timestamp at which the sender receives the packet's acknowledgment is $t_{ack}$.
We augment WebRTC's logging capabilities to trace the relationship of packets to their respective frames. Additionally, we modify RTCP to enable the receiver to relay frame decoding statistics back to the sender. All the aforementioned timestamps are observable from the sender's perspective. The logging data, once serialized within WebRTC, is transmitted to the DRL agent through a dedicated, non-blocking UNIX socket, ensuring a smooth and efficient data flow for processing.

Codecs. WebRTC natively supports four codecs: H264, VP8, VP9, and AV1. While these codecs are implemented in software, they often incur significant encoding/decoding delays, particularly at higher streaming qualities. To facilitate experiments with high resolutions and bitrates, we integrate support for NVENC and NVDEC, enabling hardware acceleration on Nvidia GPUs.

Application of the Control Parameters. To keep modifications to a minimum, we directly apply control parameters within the data plane. For instance, Pandia retrieves the bitrate from the shared memory and invokes the codec's rate-setting function prior to encoding each frame. Each control parameter is assigned a predefined invalid value that overrides the agent's output and reverts to WebRTC's original algorithm as needed; a sketch of this mechanism is given at the end of this subsection. This approach is particularly beneficial when allowing the DRL agent to manage only a subset of the parameters.

Executable. Pandia offers two modes of running the streaming program to accommodate different needs:
• Separated sender and receiver. This mode allows the streaming sender and receiver to run independently on different machines, making it well-suited for real-world testing scenarios.
• Integrated sender and receiver. In this mode, both the sender and the receiver are started within a single program, with data transmission occurring via localhost. This setup is ideal for training within an emulated network environment.
To support large-scale training and evaluation, all Pandia binaries are designed to be operable within Docker containers. In this arrangement, each instance of the Pandia environment spawns its own dedicated container, ensuring isolated and scalable execution.
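The invalid-value override described above can be pictured with the following sketch; the sentinel constant and field names are our illustration, not Pandia's actual definitions:

```python
# Hypothetical sentinel: a parameter set to INVALID tells the WebRTC side
# to fall back to its built-in control algorithm for that parameter.
INVALID = -1

DEFAULT_ACTION = {
    "bitrate_kbps": INVALID,
    "fps": INVALID,
    "resolution": INVALID,
    "pacing_rate_kbps": INVALID,
    "fec_ratio": INVALID,
}

def build_action(agent_output):
    """Let the DRL agent steer only a subset of the parameters.

    Any parameter absent from agent_output keeps the INVALID sentinel,
    so WebRTC's original algorithm keeps controlling it.
    """
    action = dict(DEFAULT_ACTION)
    action.update(agent_output)
    return action

# Example: a bitrate-only agent leaves the other four knobs to WebRTC.
print(build_action({"bitrate_kbps": 2000}))
```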

3.2 DRL Agent
Building upon the methodologies employed by prior DRL-based networking systems [3], we utilize a monitor block that persistently gauges the streaming statistics over brief intervals, referred to as steps. The observation $o$ is aggregated at the conclusion of each step, whereas the action $a$ is implemented at the onset of the subsequent step.

Observation. Pandia offers a broad array of statistics for observations, including 1) delay metrics based on timestamp differences from Table 1, 2) action history, 3) frame encoding information such as the QP and the encoded size, and 4) packet transmission details such as delay intervals [10] and reception rates. To maintain efficiency, the DRL agent should judiciously use a portion of these statistics, thereby avoiding an overly large observation space.

Action. Pandia enables the manipulation of five control parameters as actions: resolution, bitrate, FPS, pacing rate, and FEC redundancy ratio.

Data range and normalization. Pandia is capable of supporting network bandwidths ranging from 100 Kbps to 200 Mbps. The lower end of this spectrum, at 100 Kbps, is generally inadequate for low-resolution streaming (e.g., 240p) and is typically regarded as a threshold below which video streaming becomes infeasible. On the other hand, the upper limit of 200 Mbps is sufficient for streaming in 4K resolution. Furthermore, it is important to note that, due to limitations in the software implementation, the maximum egress rate attainable by WebRTC in our hardware setup is approximately 200 Mbps.
To address the challenge posed by the extensive value range, which may lead the agent to overlook cases with smaller values, Pandia uses a non-linear normalization function, the cube root, to transform all data into a normalized range between -1 and 1.

DRL algorithm. Pandia adopts the gym API design for its video streaming environment implementation, ensuring compatibility and ease of use. We offer practical use cases compatible with two prominent DRL libraries: RLlib [18] and SB3 [25]. This integration inherently supports state-of-the-art DRL algorithms such as Proximal Policy Optimization (PPO) [26] and Soft Actor-Critic (SAC) [13]. Additionally, Pandia is compatible with advanced hyper-parameter tuning tools such as SB3-zoo.

Playback and illustration. Leveraging its asynchronous DRL-WebRTC communication design, Pandia effectively uses saved streaming logs for off-policy training, playback, and visualization. It features over 10 detailed figures for in-depth analysis and illustration.
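To make the environment design concrete, below is a minimal sketch of the cube-root normalization and a gym-style environment skeleton; the observation layout, bounds, and class name are simplified assumptions rather than Pandia's actual implementation:

```python
import numpy as np
import gym
from gym import spaces

MAX_RATE = 200e6  # 200 Mbps, the upper end of the supported range

def normalize(x, max_abs=MAX_RATE):
    """Cube-root normalization into [-1, 1]; it compresses the wide
    100 Kbps - 200 Mbps range so small values are not dwarfed."""
    return np.cbrt(x / max_abs)

def denormalize(y, max_abs=MAX_RATE):
    return (y ** 3) * max_abs

class StreamingEnvSketch(gym.Env):
    """Simplified bitrate-only environment following the classic gym API."""

    def __init__(self):
        # e.g., the last 5 monitor blocks, each with 4 normalized statistics
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(5, 4))
        self.action_space = spaces.Box(0.0, 1.0, shape=(1,))  # bitrate knob

    def reset(self):
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        bitrate_bps = denormalize(float(action[0]))  # applied via IPC in Pandia
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, False, {}  # placeholders in this sketch
        return obs, reward, done, info
```

Note that denormalize(normalize(x)) recovers x exactly, since the cube root is invertible over the whole signed range.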

3.3 Training
Networking emulation. Aligning with prior DRL networking applications, Pandia supports synthetic network emulation [14]. The network parameters, such as bandwidth, RTT, packet loss rate, and buffer size, are determined by a sampling policy. This approach is well-suited for Ethernet simulations and provides a reasonable approximation for cellular network simulations.

Training methods. Directly training a DRL model in complex environments is often difficult. For example, [28] notes that agents trained over a wider bandwidth range perform worse than those trained in a narrower range. To address this, curriculum learning is employed, as seen in previous works [17,28]. This method sequentially introduces environments from simple to complex, allowing the agent to gradually develop an effective policy for more challenging scenarios.
Pandia facilitates curriculum learning by enabling the configuration of various network environment settings in its config files. Additionally, it employs an action cap parameter to set upper/lower bounds on actions, thus offering action-selection guidance tailored to each specific environment. As we will demonstrate in Section 4, constraining actions such as the bitrate to the environment's bandwidth setting can significantly enhance training efficiency. The action cap alters actions without modifying the action-sampling algorithm of the DRL algorithms; thus, it is a training methodology adaptation rather than a policy enhancement.
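A sketch of the per-episode network sampling and of how an action cap could be applied follows; we assume the cap is expressed as a fraction of the emulated bandwidth, consistent with the values of at most 1 used in Section 4.1:

```python
import numpy as np

# Curriculum stages: progressively relax the action cap (None = disabled),
# mirroring the schedule used in Section 4.1.
CURRICULUM = [0.8, 0.9, 1.0, None]

def sample_network():
    """Sample an emulated network per episode (ranges from Section 4.1)."""
    return {
        "bandwidth_mbps": np.random.uniform(1.0, 5.0),
        "rtt_ms": np.random.uniform(0.0, 10.0),
    }

def apply_action_cap(bitrate_mbps, bandwidth_mbps, action_cap):
    """Clamp the executed bitrate to action_cap * bandwidth.

    Only the executed action is altered; the DRL algorithm's action
    sampling is untouched, which is why this is a training-methodology
    adaptation rather than a policy change (Section 3.3).
    """
    if action_cap is None:
        return bitrate_mbps
    return min(bitrate_mbps, action_cap * bandwidth_mbps)
```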

4 A DEMO USE CASE
In this section, we illustrate the application of Pandia by training a model within a synthetic network environment. During the training of Pandia, a practical networking system, we encounter challenges even with moderate network configurations. The underlying issues are explored in Section 4.2, with our proposed solution detailed in Section 4.3. The solution is anticipated to inspire further research into the deployment of DRL within Pandia and various other real-world networking systems.

4.1 Training Setup
Hardware and networking setup. We conducted training using the emulator environment on a desktop computer. This system is outfitted with a 6-core Intel i5 CPU, 32 GB of RAM, and an Nvidia 4070 GPU. It runs Ubuntu 22.04 LTS. Network conditions, including a bandwidth ranging from 1 Mbps to 5 Mbps and an RTT of 0 to 10 ms, are randomly sampled at the beginning of each training episode. We set the network buffer size to 0.25 × bandwidth. Any packets arriving once the buffer is full are dropped.

DRL setup. We set the step duration to 100 ms and the episode length to 10 s (100 steps). Observations are aggregated from the monitoring blocks of the last 5 steps. Each monitoring block includes the G2G delay ($d_{G2G} = t_{dec1} - t_{cap}$), the encoded bitrate of the frames, the packet transmission delay ($t_{recv} - t_{sent}$), the delay interval [10], and the packet loss rate. The reward combines the encoded bitrate with penalties on the delay metrics, which incentivizes higher bitrates while penalizing greater delays. To prevent excessively large policy updates, the reward is capped between -10 and 10. We set the discount factor $\gamma$ to 0.9. This is based on the rationale that the step duration is adequate for the impact of an action to manifest, considering our RTT settings; hence, the model does not require excessive forward-looking. For the training algorithm, Proximal Policy Optimization (PPO) [26] is employed.

Training methods. We compared two training strategies: 1) direct training and 2) curriculum training with the action cap mechanism. For curriculum training, four environments were used, with action cap values of 0.8, 0.9, 1, and disabled, respectively. The model is trained sequentially, shifting to the next environment once the training reward in the current one shows convergence.
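With these settings, the SB3 training call looks roughly as follows; StreamingEnvSketch is the simplified environment sketched in Section 3.2, and reward capping to [-10, 10] is assumed to happen inside the environment:

```python
from stable_baselines3 import PPO

env = StreamingEnvSketch()  # the simplified env from the Section 3.2 sketch

model = PPO(
    "MlpPolicy",
    env,
    gamma=0.9,       # short temporal scope: action effects appear quickly
    n_steps=100,     # one 10 s episode = 100 steps of 100 ms each
    batch_size=50,   # divides the rollout buffer evenly
    verbose=1,
)
# The first curriculum stage converged within roughly 40,000 steps in our
# use case (Section 4.3); later stages fine-tune from such a checkpoint.
model.learn(total_timesteps=40_000)
model.save("pandia_ppo_stage1")  # hypothetical checkpoint name
```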

4.2 Bitrate and Reward Correlation
Prior to training, we evaluate Pandia under various bitrate settings while maintaining a constant network bandwidth. In each episode, a single bitrate setting is consistently applied to all steps. The resulting G2G delay and rewards are compiled in Fig. 3. It is reasonable to expect increased rewards as the bitrate approaches the bandwidth limit. However, when the bitrate exceeds a certain proportion of the bandwidth, rewards decrease significantly. This is because the larger encoded size causes extensive packet loss and retransmission, prolonging the G2G delay. Hence, the apparent correlation between the action and the reward does not hold when a large bitrate is set.

Delayed action impact. We conducted the experiment again, this time with a fixed bitrate of 2 Mbps, except for two consecutive steps during which it was set to 5 Mbps. The outcomes are presented in Fig. 4. We observed a significant drop in the reward when the bitrate was increased. Despite the bitrate reverting to 2 Mbps after only two steps, it took 12 steps for the reward to return to its previous level.

In summary, the bitrate-reward correlation is well maintained when the network is not congested. Otherwise, the congestion will cause continuously bad rewards even if a good action is taken. The delayed reward is challenging for DRL because, according to Equation 1, rewards in the near future always contribute most to the discounted rewards $R_t$.

4.3 Result: Direct Training vs. Training with Action Cap
The direct training approach resulted in a training episode reward of approximately 0. In this scenario, the trained agent consistently chose the minimal bitrate, specifically 200 Kbps, regardless of the observation values. In contrast, the model trained using the action cap mechanism effectively learned to adjust the bitrate in accordance with the network bandwidth, achieving a training episode reward of around 500. Fig. 5 illustrates the value function loss incurred during direct training. The loss eventually stabilizes near zero, indicative of the model reaching a local optimum by consistently selecting the minimal bitrate, thereby eliminating exploration. Prior to this, the loss hovered around 200. In contrast, the loss recorded in the last environment of curriculum learning is notably lower, at approximately 20. This indicates that randomly sampled higher bitrates significantly reduce the efficiency of approximating the value function. However, an accurate and well-calibrated value function is essential for minimizing the variance in policy updates and thereby enhancing training efficiency.
The curriculum learning approach addresses this issue by incrementally increasing the action cap. Initially, the first training environment sets the action cap to 0.8, strictly limiting the maximum sampled action to avoid network congestion. Consequently, training converges rapidly within 40,000 steps, achieving a training reward of approximately 450. This phase effectively teaches the model exploratory behavior with higher bitrates, as the potential negative impact of large bitrates is mitigated by the action cap. Subsequently, the model undergoes fine-tuning with progressively larger action cap values. The inclusion of large actions in the training dataset, however, curtails exploration and prompts the model to be more cautious in selecting higher bitrates due to their side effects, motivating exploitation. Through several iterations with a gradually increased action cap, the model eventually learns to adjust the bitrate in response to the bandwidth without the use of the action cap.

5 DISCUSSION AND OPEN CHALLENGES
Training efficiency. Training with real-world systems is notoriously time-intensive. For example, the curriculum learning process outlined in Section 4 spans over 10 hours on our hardware setup. This duration is likely to extend for more complex networking environments. One approach to enhance the efficiency of curriculum learning involves optimizing the selection of environments during training, as suggested in [28]. Meta reinforcement learning [6] also presents itself as a potential solution. Alternatively, simulation offers a viable path. Real-world system steps are limited by the wall-clock time, whereas simulators eliminate this constraint by utilizing a fast-paced simulated clock. However, if not properly abstracted, the simulated clock can end up being slower than real time due to computational overhead, as noted in [2]. Therefore, the challenge lies in designing a simulator with the right level of abstraction that not only runs significantly faster than real-world systems but also accurately represents the data distribution of those systems. Finally, while much of the existing research employs the on-policy algorithm PPO for training, off-policy algorithms such as SAC may offer increased efficiency through the reuse of historical data.

Multiple control parameters. Although our demonstrative use case in Section 4 employs only a single control parameter, Pandia is adept at managing multiple control parameters. This can be achieved either through a high-dimensional action space, as utilized by [15], or by employing multiple agents, a method adopted by [17]. Training with multiple control parameters typically poses challenges due to the expansive exploration space. However, as illustrated in Section 4 and corroborated by [17], curriculum learning is pivotal in navigating such complex tasks. Additionally, it would be intriguing to explore how DRL discerns the coordination among control parameters and whether it can autonomously learn human-devised policies, e.g., [21,23].

Dealing with complex real-world networking environments. This paper showcases the application of Pandia within a simplified networking environment. However, real-world networks often span a much broader spectrum. For instance, cutting-edge cellular networks like 5G can offer bandwidths ranging from hundreds of Kbps to several Gbps, contingent on signal quality [29]. Additionally, the RTT may vary from milliseconds to seconds [24]. Developing a DRL model capable of handling such complex network scenarios necessitates specialized attention to both the model architecture and the training framework. Addressing these challenges remains an area for future research.

Figure 4: The effect on the reward due to a short period of elevated bitrate settings

Figure 5: The value function loss during direct training

Figure 6: Comparison of episode reward when evaluating with GCC and DRL


Table 1: Logging of the frame and packet timestamps. Frame: capture time $t_{cap}$; encoding start/finish time $t_{enc0}$/$t_{enc1}$; assembly time $t_{asm}$; decoding start/finish time $t_{dec0}$/$t_{dec1}$. Packet: generation time $t_{gen}$; send time $t_{sent}$; reception time $t_{recv}$; ACK time $t_{ack}$.