Offline Reinforcement Learning for Bandwidth Estimation in RTC Using a Fast Actor and Not-So-Furious Critic

The increasing demand for real-time communication (RTC) applications necessitates robust and reliable systems. Seamless media delivery depends on an accurate assessment of the network conditions, with bandwidth estimation (BWE) being crucial for maintaining system reliability and achieving good quality of experience (QoE) for the users. BWE poses a significant challenge due to dynamic network conditions, limited information availability and computational complexity. The Second Bandwidth Estimation Challenge, organized within ACM MMSys 2024, aims to enhance RTC user QoE by developing a deep learning-based bandwidth estimator using offline reinforcement learning. This paper presents our solution, ranked second in the grand challenge. This solution employs an actor-critic approach to achieve accurate real-time BWE by relying solely on observed network statistics. Due to the offline setting of the challenge, the critic network is trained separately from the actor network to estimate the action quality without interacting with the real environment. Furthermore, the quality prediction by the critic is adjusted by a predefined conservation factor to mitigate overshooting of the available bandwidth. The solution's source code is publicly available at https://github.com/streaming-university/FARC.


INTRODUCTION
In recent years, the necessity for dependable real-time communication (RTC) systems has increased due to a surge in demand [14]. As businesses increasingly adopt remote work models, reliable and efficient RTC systems have become more crucial for seamless collaboration and improved productivity. Immersive technologies further push the boundaries of RTC, requiring low latency and high bandwidth for responsive, immersive experiences.
Seamless delivery of real-time media relies on accurate estimation of the underlying network conditions, as dealing with unknowns poses challenges in running reliable communication systems. The main challenge is to accurately find the bottleneck link's available bandwidth, which is the part of the network with the least bandwidth and controls the data flow speed. Bandwidth estimation (BWE) is vital for the smooth running of RTC systems as they are sensitive to bandwidth and latency changes.
The Bandwidth Estimation Challenge hosted by Microsoft Research's Academic Program [9] aims to improve the quality of experience (QoE) for RTC scenarios. The goal of this challenge is to design an offline reinforcement learning (RL)-based solution for real-time BWE. This paper introduces our solution to this challenge, called Fast ActoR Conservative Critic (FARC).
FARC utilizes an actor-critic architecture, trained offline, eliminating the need for environmental interaction during its training phase. Furthermore, FARC is engineered to be exceptionally lightweight, aiming for deployment in RTC scenarios where additional latency in the workflow could be counterproductive. The source code of FARC is available at https://github.com/streaming-university/FARC.

RELATED WORK
Numerous approaches have been developed to estimate the bandwidth and enhance bitrate adaptation for RTC scenarios.
In [1], a hybrid approach is proposed for low-latency live streaming scenarios. This method combines both heuristic and learning-based algorithms. It uses a dynamic model selection algorithm, which selects the most suitable prediction model based on the current network conditions. This dynamic selection is critical for ensuring and sustaining high streaming quality.
Petrangeli et al. [12] propose a framework for efficient WebRTC-based remote teaching applications. It addresses the challenge of scaling real-time communication in virtual classrooms using a limited number of encoders. A conference controller dynamically forwards the most suitable stream to receivers based on their bandwidth conditions and recomputes encoding bitrates to follow long-term variations. The proposed approach, evaluated on a testbed, outperforms static bitrate associations, achieving an 11% increase in received video bitrate with three encoders. The dynamic framework proves more efficient, performing similarly to static associations with fewer encoders.
Mei et al. [11] propose a Long Short-Term Memory (LSTM) [6] based approach for real-time mobile bandwidth prediction. They focus on increasing QoE in bandwidth-sensitive applications by accurately estimating bandwidth in various mobile networking environments, such as subways and buses. They use an LSTM model capable of predicting bandwidth one second and multiple seconds ahead. A notable aspect of this work is its use of Multi-Scale Entropy (MSE) analysis to understand the predictability of network bandwidth under different scenarios. The MSE analysis provides insights into the regularity patterns of each scenario, enabling the study of the prediction accuracy of LSTM models in various contexts.
Ruan et al. [13] propose MLP-DBA, a machine learning-based algorithm for dynamic bandwidth allocation in human-to-machine (H2M) communications over optical and wireless networks. It utilizes an artificial neural network (ANN) to predict H2M packet bursts at each access point. MLP-DBA classifies access points based on predicted ON/OFF status and estimated bandwidth, enabling adaptive bandwidth allocation decisions.
Furthermore, RL-based solutions have started to play a vital role in bandwidth estimation. Souane et al. [15] propose a deep reinforcement learning method targeted for Dynamic Adaptive Streaming over HTTP (DASH). In DASH, maintaining a continuously perceived high video quality throughout the streaming session is crucial to improve the user experience. Therefore, the proposed method uses a dynamically controlled quality distance factor between consecutive video segments to maximize QoE.
Fang et al. [3] propose an RL-based solution to address bandwidth estimation and congestion control challenges in RTC. They introduce R3Net, an RL-based recurrent network tailored for rapid adaptation to dynamic RTC network conditions. The study emphasizes the distinctive constraints of RTC, including minimal latency and user-uploaded content, which require rapid responses to bandwidth changes to maintain a high QoE.
Bentaleb et al. [2] propose a system that employs both heuristic algorithms and RL techniques to predict bandwidth requirements accurately. They combine traditional rule-based approaches with machine learning strategies to improve the precision of predictions. Deep reinforcement learning (DRL) methods typically require substantial offline training data, but they operate with limited online data, leading to performance inconsistencies and sub-optimal actions in the real world. To address this issue, they incorporate an adaptive bandwidth prediction selector. Initially, it utilizes a heuristic-based controller, mitigating the cold start problem. As the system accumulates adequate input data during the session, it dynamically transitions to a learning-based controller.
Gottipati et al. [5] propose Merlin, an offline, data-driven approach for bandwidth estimation in RTC systems. Merlin learns to mimic the performance of an expert Unscented Kalman Filter (UKF) model through behavioral cloning from offline, simulated examples, effectively turning policy learning into a supervised learning task. This methodology eliminates the need for network interactions during training, as all necessary data are pre-collected. Additionally, Merlin's training process does not require specialized hardware or testbed environments, making advanced learning-based network control more accessible.
FARC shares a similarity with Merlin in targeting the same objective: both are trained offline to address bandwidth estimation in RTC scenarios.The primary distinction lies in the underlying network architecture of FARC and its unique training approach.

PROPOSED METHOD
To address the challenge of offline RL for BWE in real-time communications, we introduce Fast ActoR Conservative Critic (FARC).

Neural Network Architecture
FARC employs an actor-critic architecture trained in an asymmetric manner. The critic takes the observation vector and the bandwidth estimate generated by the actor as inputs and produces predictions for video and audio quality. Since we operate in an offline setting, we cannot observe the direct impact of the actor's actions on the environment. Therefore, we train the critic network to estimate the performance of the actor network in real-world scenarios.
The critic network of FARC is responsible for predicting the resulting QoE based on the given state and action. It takes the observation vector and processes it by applying a feature mask and an action scale. The feature mask is explained in detail in Section 3.2. The action scale component converts the bandwidth rate values from bps to Mbps for better normalization across the features.
The processed observation vector is passed through three fully connected layers, each followed by a ReLU [4] activation function. These layers act as a quality predictor using only the observed network values. Once these values are predicted, the bandwidth value estimated by the actor is concatenated, and the critic predicts the corresponding quality if the predicted action is taken. It should be noted that, during the training process, the estimated bandwidth value is the action taken by the reference policy model; thus, the critic implicitly learns the quality of actions taken by another actor model. The critic network architecture is illustrated in Figure 1.

The actor network (depicted in Figure 2) follows a similar approach to the critic network and filters the observation vector to remove unused features. The vector is then split into two parts: long-term and short-term features. Since these features convey different characteristics of the underlying network conditions, they are initially processed separately. Each path acts as a bandwidth predictor, utilizing the available information.
Subsequently, these predictions are concatenated and passed through another set of fully connected layers to predict the final bandwidth value. These layers implicitly learn the importance of short-term and long-term features and adjust their weights accordingly. The multi-path structure enables the actor to quickly adapt to changing network conditions by shifting the importance weight from long-term to short-term feature values. Finally, the final bandwidth value is scaled back to bps per the challenge requirements.
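To make the two-path design concrete, the sketch below runs a forward pass of such an actor in plain NumPy. The layer widths, the five-feature split, and the random initialization are illustrative assumptions only; the actual dimensions are defined in the released FARC source code.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init(shape):
    # Illustrative random weights; a trained model would load real ones.
    return rng.normal(scale=0.1, size=shape)

# One branch per feature group (widths are assumptions, not FARC's).
W_s, b_s = init((5, 32)), np.zeros(32)   # short-term path
W_l, b_l = init((5, 32)), np.zeros(32)   # long-term path
# Head that fuses both branch outputs into a single estimate (in Mbps).
W_h1, b_h1 = init((64, 32)), np.zeros(32)
W_h2, b_h2 = init((32, 1)), np.zeros(1)

def actor_forward(short_term, long_term):
    s = relu(short_term @ W_s + b_s)          # short-term bandwidth predictor
    l = relu(long_term @ W_l + b_l)           # long-term bandwidth predictor
    fused = relu(np.concatenate([s, l]) @ W_h1 + b_h1)  # learn path weighting
    mbps = (fused @ W_h2 + b_h2)[0]
    return mbps * 1e6  # scale back from Mbps to bps per the challenge spec

estimate_bps = actor_forward(rng.random(5), rng.random(5))
```

The two branches see disjoint slices of the observation, so the fusing head is where the model learns how much to trust recent versus averaged statistics.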

Feature Selection
The dataset provided for the challenge contains 15 network features, as described on the challenge website.
In our experiments, we initially used the entire observation vector but found that the network struggled to generalize. To improve model performance, we applied a feature mask to the observation vector, selecting relevant network statistics and omitting the unused ones. We utilized the following features while ignoring the remaining ones: (1) receiving rate, (2) number of received packets, (3) queuing delay, (4) packet loss ratio, and (5) average number of lost packets. These features were selected after experimenting with various subsets. Through our experiments, we discovered that features directly related to packet loss and bandwidth values had the highest correlation with video quality. Consequently, we proceeded with the aforementioned feature set in our method.
Additionally, we applied an action scaler to the receiving rate feature in the observation vector, converting it from bits per second (bps) to megabits per second (Mbps) for improved normalization. The same scaler was also applied to the bandwidth estimate provided by the actor.
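As an illustration, the masking and scaling step can be written as below. The index positions are hypothetical placeholders; the actual layout of the 15-feature observation vector is defined by the challenge.

```python
import numpy as np

# Hypothetical index positions within the 15-entry observation vector;
# the real layout is defined by the challenge's observation format.
SELECTED = [0, 1, 5, 8, 9]   # receiving rate, received packets, queuing
                             # delay, loss ratio, avg lost packets
RECV_RATE_IDX = 0            # position of the receiving rate after masking
BPS_TO_MBPS = 1e-6

def preprocess(observation):
    """Apply the feature mask, then rescale the receiving rate to Mbps."""
    obs = np.asarray(observation, dtype=np.float64)[SELECTED]  # fancy indexing copies
    obs[RECV_RATE_IDX] *= BPS_TO_MBPS
    return obs

raw = np.zeros(15)
raw[0] = 3_000_000.0         # receiving rate: 3 Mbps expressed in bps
masked = preprocess(raw)     # 5 selected features, rate now 3.0 (Mbps)
```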

Training Process
The critic network was independently trained using the testbed traces in the provided dataset. These traces contained bandwidth predictions made by a sample policy model. During training, these predictions were assumed to be actions taken by the actor, and the critic network was trained to predict the resulting video and audio quality. Since the traces provided the actual video and audio quality values as observed outcomes, these values were utilized as the target values for training. It is important to note that there were NaN values for video quality, particularly at the beginning of the calls. To address these missing qualities, we applied backward fill, which replaces each NaN value with the first subsequent non-NaN value.
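A minimal, dependency-free version of this backward fill looks as follows (trailing NaNs, which have no subsequent value, are left as NaN):

```python
import math

def backward_fill(values):
    """Replace each NaN with the first subsequent non-NaN value."""
    filled = list(values)
    nxt = math.nan  # last non-NaN value seen while scanning from the end
    for i in range(len(filled) - 1, -1, -1):
        if math.isnan(filled[i]):
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled

# Early-call samples are missing, as in the testbed traces:
video_quality = [math.nan, math.nan, 3.8, 4.1, math.nan, 4.0]
print(backward_fill(video_quality))  # [3.8, 3.8, 3.8, 4.1, 4.0, 4.0]
```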
After finalizing the training of the critic, we commenced the training process for the actor network. The training of the actor network followed an asymmetric RL approach. The actor functions as the bandwidth predictor in our setup and was trained using the emulated dataset provided by the challenge organizers. This dataset included the true capacity information for each trace, serving as the optimal bandwidth value. We utilized the true capacity information during training and ignored it during the inference, resulting in an asymmetric RL scheme.
The training process for the actor unfolds as follows:
(1) The actor takes the current observation vector and predicts the bandwidth value.
(2) The critic takes the bandwidth value predicted by the actor and predicts the expected quality (actor reward).
(3) The critic takes the true capacity, applies a discount using the conservation factor, and predicts the optimal quality (max reward).
(4) The loss is calculated using the actor reward and max reward and is back-propagated to the actor network.
Since the critic network was trained with actions taken by a reference model as the bandwidth estimates, we observed that it tends to push the actions made by the actor to the network limits. However, this could lead to overshooting (i.e., predicting a bandwidth value exceeding the available bandwidth), thus significantly degrading the QoE. The conservation factor served the purpose of preventing the actor from overshooting by reducing the effect of the critic's decision.
We experimented with various values for the conservation factor to determine a good setting; the final value used in the proposed method was 0.96. Finally, during the inference phase, steps 2 and 3 were omitted and the actor predicted the bandwidth value relying solely on the current observation vector.
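The four training steps above can be sketched as a single loss computation. Here `actor` and `critic` are hypothetical stand-ins for the trained networks (the toy critic simply rewards estimates near the discounted capacity), so only the structure, not the numbers, reflects FARC.

```python
# Conservation factor discounts the "optimal" action fed to the critic.
CONSERVATION_FACTOR = 0.96

def actor_loss(obs, true_capacity, actor, critic):
    predicted_bw = actor(obs)                                      # step 1
    actor_reward = critic(obs, predicted_bw)                       # step 2
    max_reward = critic(obs, CONSERVATION_FACTOR * true_capacity)  # step 3
    return (actor_reward - max_reward) ** 2                        # step 4 (MSE)

# Toy stand-ins: a critic that rewards estimates close to the discounted
# 4.0 Mbps capacity, and two actors, one undershooting and one on target.
toy_critic = lambda obs, bw: -abs(bw - 0.96 * 4.0)
loss_under = actor_loss(None, 4.0, lambda obs: 3.0, toy_critic)
loss_exact = actor_loss(None, 4.0, lambda obs: 0.96 * 4.0, toy_critic)
```

The loss is zero exactly when the critic scores the actor's estimate as well as the (discounted) true capacity, which is what pulls the actor toward a slightly conservative target.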

EXPERIMENTAL RESULTS
This section details the results we obtained in the challenge.

Training Setup
The critic network was trained on the entire testbed dataset. For the actor network, we used the emulated dataset. We randomly selected 150 call traces and used them as the test data while using the remaining call traces for training. The Adam [10] optimizer was used for both the actor and critic networks. The learning rate was initially set to 0.0001 and multiplied by 0.1 every 10 epochs.
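The step decay described above corresponds to the following schedule (a sketch under the assumption that the decay is applied at every 10th epoch boundary):

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.1, step=10):
    """Step decay: multiply the base rate by gamma once per `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Epochs 0-9 train at 1e-4, epochs 10-19 at 1e-5, and so on.
```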
The mean squared error (MSE) loss was used as the loss function. For the critic training, MSE compared the quality prediction made by the critic with the actual observed quality. For the actor training, it evaluated the discrepancy between the quality values determined by the critic for the bandwidth predicted by the actor and the true capacity.
We used a batch size of 256 for both training sessions, meaning 256 traces were taken at each timestep for the given call. We used only full batches and truncated the traces early. All training sessions were conducted on a MacBook Air equipped with an Apple M2 processor.

Evaluation Setup
The evaluation of participant models was based on the following scoring function, which calculates the expected sum of the objective audio and video quality scores, averaged over time for each call and across all calls:

score = (1/N) * sum_{n=1..N} (1/T_n) * sum_{t=1..T_n} [ q_v(n, t) + q_a(n, t) ],    (1)

where q_v and q_a represent the objective video and audio quality values, respectively, which are predicted by advanced ML models, N is the number of calls, and T_n is the number of timesteps in call n. It is worth noting that these ML models have demonstrated a remarkable correlation with the subjective video and audio quality scores, as determined by ITU-T's P.808 [7] and P.910 [8], respectively.
The dataset provided for this challenge is a collection of real-world trajectories extracted from audio/video calls. Further details about the dataset can be found on the challenge's website.

Preliminary Results
Since the exact dataset for the final evaluation was not publicly available, we evaluated the performance of the proposed method on a self-constructed test set using the provided emulated dataset, as it included the true bandwidth value. We randomly selected 150 call traces from the emulated dataset and used them to evaluate the performance of the proposed method.
Figure 3 shows that FARC can follow the true capacity of the network quite accurately in most scenarios, while still overshooting occasionally. Compared to the baseline bandwidth estimator, FARC acts more aggressively in terms of requesting a higher bitrate.
One can notice that FARC faces challenges in accurately estimating the available bandwidth and often overshoots at the beginning of the calls, as notably illustrated in Figure 3a. This issue can be attributed to FARC's inadequate handling of the cold start problem, since it relies solely on the observation vector without incorporating any heuristic solutions to tackle this problem. Moreover, since execution time is critical for solutions in RTC systems, we measured the latency of FARC using only a CPU. FARC proved to be extremely efficient, as the ONNX model inference averaged only about 2 ms on an Apple M2 chip. Overall, the performance of FARC was promising in our preliminary evaluation.
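Such a latency figure can be obtained with a simple wall-clock harness of the kind sketched below; in practice the measured callable would wrap an onnxruntime `InferenceSession.run` call, replaced here with a trivial stand-in to keep the example self-contained.

```python
import time

def mean_latency_ms(fn, warmup=10, runs=100):
    """Average wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):      # warm-up iterations, excluded from timing
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1e3

# Stand-in workload; swap in the ONNX inference call for real measurements.
latency_ms = mean_latency_ms(lambda: sum(range(1000)))
```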

Challenge Results
As mentioned before, the exact dataset for the final evaluation is unknown. However, the evaluation criteria remained the same as given in (1). The proposed solution, FARC, was ranked second in the final evaluation stage. The results are given in Table 1.

CONCLUSIONS
In this paper, we introduced FARC, an actor-critic architecture trained with offline RL. Our approach innovatively tackles the challenge of predicting bandwidth in real-time communication scenarios, with a particular focus on enhancing the QoE. FARC was ranked second in the grand challenge, indicating its promising performance in accurately predicting the bandwidth capacity of the network in a wide range of scenarios. The actor-critic architecture of FARC is highly effective in managing the changing and unpredictable conditions of network environments, making it well-suited for real-time applications where bandwidth changes often occur. This flexibility is essential for ensuring a high-quality user experience, as it guarantees smooth and continuous delivery of real-time media.
We acknowledge potential limitations in our current methodology, particularly regarding the model's overshooting in extremely volatile or extremely stable network conditions. Moreover, the lack of a heuristic approach to tackle the cold start problem results in unstable predictions at the beginning of a call. Future research could explore the application of FARC in such environments, potentially enhancing its robustness and adaptability.

Figure 3: Performance comparison of FARC in different call traces.