Large-scale Video Analytics with Cloud–Edge Collaborative Continuous Learning

Deep learning–based video analytics demands high network bandwidth to ferry the large volume of video data when deployed on the cloud. When deployed at the edge, only lightweight deep neural network (DNN) models are affordable due to computational constraints. In this article, a cloud–edge collaborative architecture is proposed that combines edge-based inference with cloud-assisted continuous learning. Lightweight DNN models are maintained at the edge servers and continuously retrained with the help of a more comprehensive model on the cloud to achieve high video analytics performance while reducing the amount of data transmitted between edge servers and the cloud. The proposed design faces the challenge that both the computation resources at the edge servers and the network bandwidth of the edge–cloud links are constrained. An accuracy gradient-based resource allocation algorithm is proposed to allocate the limited computation and network resources across video streams to maximize the overall performance. A prototype system is implemented, and experimental results demonstrate the effectiveness of our system, with up to 28.6% absolute mean average precision gain compared with alternative designs.


INTRODUCTION
Empowered by advances in deep learning, video analytics based on deep neural network (DNN) models is applied across a wide range of tasks including image classification, object detection, and semantic segmentation, which are essential for growing applications such as intelligent video surveillance [46,50,65], smart transportation [32,49,60], and autonomous vehicles [33,56]. With the increasing presence of networked cameras, the volume of video data to be processed is growing rapidly, which in turn challenges the scalability of existing video analytics system designs.
Deep learning-based video analytics demands high network bandwidth if the large volume of data is streamed through the network, and requires rich computation resources at inference servers for reasonable performance. Generally, video analytics systems are deployed on the cloud, e.g., Microsoft Azure [8]. Cloud-based DNN inference enables the use of complex but resource-consuming DNN models, e.g., Faster-RCNN [40] for object detection. All video sources, however, have to stream their data to the centralized cloud server, which demands excessive data transmission over wide-area networks (WAN) due to the geographically large scale. When the network bandwidth is limited, data compression techniques such as frame filtering [25] and resolution downscaling [12] are applied, which heavily degrade the performance of DNN models.
In contrast, edge-based DNN inference executes inference tasks on regional edge servers geo-colocated with nearby video sources and interconnected with high-bandwidth local area networks (LAN). While the network condition allows high-quality, low-latency video data to be streamed to the edge servers, the generally constrained computation resources at edge servers limit the scale and complexity of the DNN models that can be adopted. In such a case, only lightweight models can be used, e.g., SSD [30]. As a result, edge-based inference is less resilient to concept drift [52] and may suffer from low inference accuracy at runtime.
To leverage the advantages of both cloud- and edge-based inference, this paper proposes a cloud-edge collaborative architecture. Fig. 1 depicts a typical setting of the proposed design. The video sources, edge servers, and the cloud form a three-layer architecture. Instead of deploying standalone DNN models at edge servers, the architecture adopts a continuous learning approach: lightweight DNN models are deployed at edge servers for the inference task of each video stream, and are periodically retrained on the cloud and sent back to the edge servers. In such a design, frames uploaded to the cloud are used as training samples instead of for cloud inference. With periodic model retraining, the cloud-edge collaborative architecture helps the lightweight models learn the up-to-date data distributions from the real-time ingested video content and alleviates the performance degradation caused by concept drift.
A major challenge in adopting the proposed cloud-edge collaborative architecture is that both the computation resources at edge servers and the network bandwidth between the cloud and edge servers are constrained, and it is non-trivial to judiciously allocate the limited resources across video streams to yield higher inference accuracy. Due to constrained computation and network resources, edge servers may not be able to process video streams at full frame rates or send all video frames to the cloud for retraining in real time. Meanwhile, some video contents are more sensitive to resource dynamics while others are more resilient, and such diversity varies both over time and across streams. Consequently, appropriate inference and retraining configurations are needed to decide the amount of video data used for inference and retraining, which corresponds to a comprehensive resource allocation problem over the available computation and network resources across video streams.
A general solution to such a resource allocation problem is of high complexity for two main reasons. First, because video contents vary dynamically, the expected accuracy of different retraining and inference configurations needs to be probed online, regularly and for each video stream, which involves computationally intensive operations including model training and inference. Second, the size of the search space for the optimal combination of configurations that yields the maximum overall accuracy grows with the number of available configurations as well as video streams. In this paper, we propose a heuristic named accuracy gradient to quantify the sensitivity of different video contents toward resource variation. Leveraging the accuracy gradient, we devise a depth-first search (DFS) algorithm that greatly reduces the probing overhead by pruning the search space and reducing the probing count. Besides, the proposed scheme embeds a non-trivial system solution to probe the inference accuracy of each stream with regard to the amount of allocated computation and network resources, which reduces the amortized cost of a single probing operation. Combining the two designs, the computational overhead caused by resource allocation can be well constrained without compromising the accuracy gain from continuous learning.
Model retraining, while essential in our system, imposes computational demands on cloud resources and introduces delays due to the training time, which subsequently affects the timeliness of model updates. Though the cloud is often perceived to possess abundant computational capacity for such retraining, it is impractical to anticipate infinite scalability, especially as the system expands to accommodate an increasing number of video streams. Furthermore, prolonged training periods can make retrained models less up-to-date, undermining the very purpose of retraining. To solve the problem, we devise an aggregated model training technique to curtail the training overhead on the cloud. Instead of retraining the full model of each stream independently, models are divided into a backbone and a head. Video streams are grouped and a common model backbone is shared within each group. The shared backbone is retrained at a lower frequency than the heads, using data aggregated over time, which improves model update timeliness. The proposed system is implemented and fully evaluated with real-world video traces including Bellevue Traffic [3] and the UA-DETRAC dataset [51]. Mean average precision (mAP) is used as the evaluation metric, and the experiment results suggest up to 28.6% absolute mAP gain over alternative solutions on object detection tasks, proving the effectiveness of the proposed cloud-edge collaborative continuous learning approach in real-world scenarios. A comprehensive ablation study is conducted to showcase the performance improvement achieved by each design component separately.
The rest of the paper is organized as follows. Sec. 2 provides preliminary experiment results that motivate the study. Sec. 3 details the system design and implementation. Sec. 4 presents the evaluation results. Sec. 5 presents related works and Sec. 7 concludes the paper.

MOTIVATION

2.1 Benefit of Cloud-Edge Collaborative Continuous Learning
As Fig. 1 illustrates, the three-layer cloud-edge collaborative architecture contains a cloud center connected with many edge servers, each connected to multiple video sources. Each edge server and its corresponding video sources are geo-colocated and connected with a bandwidth-sufficient LAN, e.g., Gigabit Ethernet, while between edge servers and the cloud are WAN links with limited bandwidth, e.g., 4G or 5G cellular networks. During inference, an edge server maintains one lightweight DNN model for each connected video source, and at the same time uploads video frames as training data for retraining its DNN models on the cloud. The retraining process for each video stream takes place periodically in retraining windows on the cloud, where the frames collected during the previous retraining window are labeled by a golden model maintained at the cloud and used to retrain the corresponding lightweight model. Each retrained model is thereafter downloaded to the edge server and used for future inference tasks. Under the computation and network resource constraints, video frames are downsampled, i.e., filtered, from the stream both for supplying to inference models at edge servers and for uploading to the cloud for model retraining. In our design, two frame rates are used as the knobs for frame filtering, referred to as the inference frame rate and the retraining frame rate, according to which frames are uniformly sampled before inference and uploading. The inference frame rate of a stream corresponds to the computation cost it incurs at the edge, and the retraining frame rate of a stream corresponds to the network resources it consumes on the cloud-edge link.
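Concretely, the uniform sampling behind both knobs can be sketched as follows. This is a minimal illustration; the function name and signature are ours, not from the paper's implementation:

```python
def sample_frames(frames, src_fps, target_fps):
    """Uniformly sample a frame sequence captured at src_fps down to target_fps."""
    if target_fps >= src_fps:
        return list(frames)
    step = src_fps / target_fps  # keep one frame out of every `step` originals
    count = int(len(frames) * target_fps / src_fps)
    return [frames[int(i * step)] for i in range(count)]
```

The same routine serves both knobs: applied with the inference frame rate it selects frames for the edge model, and with the retraining frame rate it selects frames to upload.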
To demonstrate the benefit of combining edge inference and cloud-assisted continuous learning, we compare the approach with two straightforward alternatives, namely edge-based inference and cloud-based inference. For edge-based inference, video frames are processed by a lightweight object detection DNN, SSDLite with a MobileNetV2 backbone. For cloud-based inference, frames are streamed to the cloud and inferred by a complex DNN, Faster-RCNN with a ResNet101 backbone. Both models are pre-trained on the COCO dataset [26]. For the continuous learning approach, video frames are processed at edge servers using SSDLite models, which are retrained every 20 seconds using pseudo-labels generated by the Faster-RCNN model on the cloud. To conduct the measurement, 20 video streams are selected from the UA-DETRAC dataset and allocated across two servers. The same inference and retraining frame rates are assigned to all streams, hence both computation and network resources are equally shared. The inference accuracy of each stream is represented by mAP, and the average mAP across all streams is used to represent the overall performance of the system. Fig. 2 compares the achieved mAP of the cloud-edge collaborative approach with the edge-based and cloud-based approaches respectively.
Fig. 2a gives the average mAP with solely edge-based inference and that of cloud-edge collaboration with an average bandwidth of 5 Mbps per stream, so the 20 video streams consume 100 Mbps of uplink bandwidth in total. Computation resources on each edge server are denoted as the total inference frame rate, ranging from 50 to 250 FPS, i.e., how many frames the edge can process per second with the lightweight model (SSDLite in this case). Experiment results show that the continuous learning approach achieves a significant mAP gain even under relatively constrained bandwidth. With the total inference frame rate capped at 150 FPS at the edge and a typical 100 Mbps 4G LTE cellular link [9], cloud-edge collaborative continuous learning achieves up to 5.7% mAP gain compared with edge-based inference. The results indicate that periodic retraining and update can well alleviate the model degradation caused by concept drift compared with using a fixed model. Fig. 2b illustrates the average mAP with solely cloud-based inference and with edge inference plus cloud-assisted continuous learning, with the total inference frame rate of each edge server at 150 FPS. Network resources are denoted as the total bandwidth on the cloud-edge links, ranging from 0 to 200 Mbps, which is typical of today's 5G uplinks [37]. With a low bandwidth, e.g., 100 Mbps, few frames can be uploaded for cloud-based inference, resulting in mAP as low as 25.7%. However, the cloud-edge collaborative design achieves much better accuracy, with an 18.6% mAP gain. The rationale is that network resources have higher utility when uploaded frames are used to enhance edge models instead of for inference when the network bandwidth is limited.
Compared with a purely edge-based solution, one concern about cloud-edge collaboration is the potential privacy violation, since it requires training data to be uploaded to the cloud. However, existing solutions [5,31,54] have proposed privacy-preserving video streaming and analytics techniques which adapt well to our cloud-edge collaborative architecture. Therefore, privacy issues are not considered in this paper, but the system is compatible with such privacy-preserving designs.

2.2 Content Diversity
In the measurement above, the computation and network resources are equally allocated to all video streams. However, video streams with varied contents may have different sensitivity of inference accuracy to resource variation, i.e., some video contents may gain high accuracy when more resources are allocated for inference or model retraining while others may not, which we call content diversity. For example, if a video stream only contains static scenes or objects in slow motion, then variation of the inference frame rate may not greatly affect its inference accuracy. In other words, such a stream has low sensitivity towards computation resource variation. Conversely, a video stream containing high motion may be more sensitive to the inference frame rate, and thus has high sensitivity towards computation resource variation. Such diversity also applies to network resources, since some streams experience larger concept drift and are more likely to achieve accuracy gains from online retraining than others.
We perform trace-driven emulation with the same UA-DETRAC dataset. Fig. 3a and 3b demonstrate the spatial content diversity by selecting 3 different video streams and applying different inference and retraining frame rates respectively to them. Fig. 3c and 3d demonstrate the temporal content diversity by selecting 3 different segments from the same stream and applying different inference and retraining frame rates. Inconsistent slopes of the polylines corresponding to different video streams or segments in the figures indicate different sensitivity towards resource variation originating from content diversity, which exists not only spatially across video streams but also temporally over time within the same stream. These observations motivate us to exploit such diversity to best allocate the limited computation and network resources to video streams. We later introduce a term to quantitatively express the content diversity and use the quantified diversity to guide resource allocation (Sec. 3.2).

SYSTEM DESIGN

3.1 Overview
In the proposed cloud-edge collaborative architecture shown in Fig. 4, all edge servers connected with the cloud form a set E. Each edge server e ∈ E is connected with a set of video sources and processes their video streams S_e. All video frames captured at end devices can be fully streamed to the edge.
Key components on each edge server include a frame scheduler, an inference engine, and an inference model pool. Each edge server e maintains an inference model pool containing lightweight models paired with every connected video stream s ∈ S_e. Besides, the edge server maintains a frame scheduler which schedules all video frames streamed from each source either to the inference engine for inference or to the cloud for retraining the edge model, according to its configuration. As stated in Sec. 2.1, the configuration of a stream consists of two knobs, namely the inference frame rate and the retraining frame rate. Frames are uniformly sampled, i.e., filtered, for inference and uploading according to the inference and retraining frame rates respectively. The frame scheduler uses the weighted round robin (WRR) algorithm [6] to schedule the frames of each stream to control network and GPU utilization, so that the resource consumption of each stream follows the allocated amount (Sec. 3.4). [Table 1 lists the mathematical notations, e.g., the resource consumption of a stream at a given frame rate, and A^s_i(ϕ, ω), the accuracy of stream s in window i with retraining frame rate ϕ and inference frame rate ω.]
Besides, an aggregated model training approach is proposed to reduce the retraining overhead (Sec. 3.3). Finally, retrained edge models and configurations are sent to the edge server. Retrained edge models are stored in the inference model pool and the configured frame rates are updated for the frame scheduler.

3.2 Resource Allocation
The proposed system aims at allocating the available computation and network resources across multiple video streams. A set of discrete frame rates is used as a bridge between inference accuracy and resource consumption: each video stream is assigned a proper inference frame rate and retraining frame rate, which correspond to the GPU resources consumed at the edge server and the network bandwidth used for sending data to the cloud respectively. The aim is to achieve the maximum average accuracy across all video streams under the constraints of the total computation resources at the edge and the network bandwidth on the cloud-edge links.
Problem formulation. In each retraining window i, the resource allocator needs to decide on a combination of retraining frame rate ϕ and inference frame rate ω for each stream s ∈ S as an allocation plan. The allocation target is to maximize the overall inference accuracy with the resource usage not exceeding the computational constraint C^i_e and network constraint U^i, which can be formulated as a discrete optimization problem defined in Eq. (1). A complete set of relevant notations is presented in Table 1.
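Although Eq. (1) itself is not reproduced here, a formulation consistent with the surrounding notation would read as follows (a reconstruction; the binary selection variables are folded into the direct choice of a pair (ϕ_s, ω_s) per stream):

```latex
\max_{\{(\phi_s,\,\omega_s)\}_{s \in S}} \;
  \frac{1}{\|S\|} \sum_{s \in S} A_i^s(\phi_s, \omega_s)
\quad \text{s.t.} \quad
  \sum_{s \in S_e} R_c(\omega_s) \le C_e^i \;\; \forall e \in E, \qquad
  \sum_{s \in S} R_n(\phi_s) \le U^i, \qquad
  \phi_s \in \Phi,\; \omega_s \in \Omega \;\; \forall s \in S.
```

Here R_c and R_n map a frame rate to its computation and network resource consumption, matching the notations in Table 1.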
Note that in retraining window i, edge models are retrained with training samples collected in retraining window i − 1, so the decision variable x^{s,i−1}_ϕ is used instead of x^{s,i}_ϕ when deriving the inference accuracy in the i-th retraining window in Eq. (1).
An optimal solution for such a resource allocation problem could be obtained if we had perfect knowledge of how each possible configuration (ϕ, ω) corresponds to the inference accuracy, i.e., the function A^s_i for every stream s and retraining window i. Then we could apply optimization techniques for solving the multi-dimensional multiple-choice knapsack problem such as dynamic programming (DP). In practice, however, we can only obtain an estimation function Â^s_i based on probing the validation results of the previous retraining window online. Besides, compared with solving traditional optimization problems where the computational complexity measures the overall overhead, in our problem context the dominating overhead comes from the number of times the function Â^s_i is probed, because each probe of Â^s_i invokes a complete DNN inference and result evaluation, which is much more time-consuming than the other computational operations involved in solving conventional knapsack problems. The probing cost incurred by DP grows linearly with the number of possible configurations (ϕ, ω) as well as video streams S. Suppose the numbers of retraining frame rates ∥Φ∥, inference frame rates ∥Ω∥, and video streams ∥S∥ are m, n, and p respectively; then the dominating time complexity of DP-based online resource allocation is O(mnp), which indicates a large allocation overhead when there are many possible configurations and streams. Such overhead may lead to excessive computation resource consumption and a large delay in executing the allocation algorithm, which negatively affects the timeliness of configuration updates.
Accuracy Gradient-based Resource Allocation (AGRA). AGRA is a pruning-based DFS algorithm. Each unique path from the root to a leaf node in the search tree represents a possible allocation plan. Different from the general DP-based method, we exploit a practical observation, i.e., the resource-accuracy curves along both the computation and network dimensions are concave, as shown previously in Fig. 2. We define the accuracy gradient as the amount of accuracy variation per unit of computation or network resource allocated to a video stream. The accuracy gradient of computation (network) resources can be viewed as the first derivative of the accuracy function A^s_i along the inference (retraining) dimension. DFS is then performed to search for an optimal allocation plan. Leveraging the concavity observation, at each branch (non-leaf) node of the search tree, we may use the accuracy gradient and linear programming relaxation [43] to derive an upper-bound accuracy for the pending allocation plan corresponding to the node. A pending plan is found not to be optimal if its upper-bound accuracy is lower than that of the best allocation plan identified so far. With such a guarantee, the allocation plan can be pruned early without reaching any leaf node of the search tree, reducing the total number of probing operations.
In addition, we observe that the effects of computation and network resources on the inference accuracy are not highly entangled. Fig. 5 demonstrates the inference accuracy after retraining with varied computation and network resources in a 3D format using the same setting as Sec. 2.1. It shows that the inference accuracy increases monotonically with both computation and network resources. This observed relationship facilitates the decomposition of the two-dimensional allocation problem into two separate one-dimensional problems without encountering local optima. Such a decomposition avoids jointly allocating network and computation resources and reduces complexity. We first determine the retraining frame rate for each stream assuming a fixed inference frame rate, and then determine the inference frame rate after the retraining frame rate is fixed. With the above two heuristics, we reduce the probing overhead to O((m + n)p) with a small constant on average and O(p) in the best case, which greatly reduces the online allocation overhead and improves the timeliness of configuration updates.
Next, we first use examples to illustrate how the estimation function Â^s_i can be probed online for both the retraining and inference frame rates. We then present the complete AGRA algorithm with pseudocode and a detailed explanation.
Probing Â^s_i for retraining frame rate ϕ. Fig. 6 illustrates how Â^s_i is probed with different retraining frame rates, which correspond to the network resources allocated across video streams. Suppose a total of m candidate retraining frame rates form a finite set {ϕ_1, ..., ϕ_m} ordered by their network bandwidth consumption, i.e., ∀i, j, i > j ⇒ R_n(ϕ_i) > R_n(ϕ_j). At the beginning of retraining window i, for one specific video stream s, the cloud receives the frames sampled at the retraining frame rate during the previous window. Suppose the previously used retraining frame rate is ϕ_k. The training frames are split into a downsampled set and an incremental set, where the downsampled set contains frames sampled at the lower frame rate ϕ_{k−1}, and the rest of the frames form the incremental set. The downsampled set is first used to retrain the lightweight model, which is thereafter tested on a validation set and yields a validation accuracy Â^s_i(∗, ϕ_{k−1}), which gives the estimated test accuracy using retraining frame rate ϕ_{k−1}. Here the asterisk represents a certain inference frame rate used for the model. The incremental set is then fed to the retrained model to generate a new model retrained on the full training dataset, from which Â^s_i(∗, ϕ_k) is estimated. Instead of training the model from scratch every time, by recursively splitting the original training set into downsampled and incremental sets over multiple rounds and training the edge model incrementally, the estimated test accuracies with different retraining frame rates can be probed with amortized overhead.
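The recursive split can be sketched as follows. This is illustrative code under our own assumptions: the function name is ours, and each candidate rate is assumed to evenly divide the next one:

```python
def incremental_splits(frames, rates):
    """Split `frames` (sampled at rates[-1]) into a nested sequence of
    (rate, training_subset) pairs: training on the first subset probes the
    lowest rate, then feeding each incremental subset probes the next rate.
    Assumes each rate evenly divides the next one (e.g., 3, 6, 12 FPS)."""
    splits, current = [], list(frames)
    for lo, hi in zip(reversed(rates[:-1]), list(reversed(rates))[:-1]):
        down = current[::hi // lo]                   # downsampled set at rate lo
        inc = [f for f in current if f not in down]  # incremental set -> rate hi
        splits.append((hi, inc))
        current = down
    splits.append((rates[0], current))               # innermost downsampled set
    return list(reversed(splits))                    # train from lowest rate up
```

For example, 12 frames collected at 12 FPS with candidate rates {3, 6, 12} yield a base set probed at 3 FPS plus two incremental sets, so three accuracies are estimated from one incremental training pass.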
For validation, an extra set with a fixed number of consecutive frames at the beginning of the retraining window, at the original frame rate, is conveyed to the cloud as the validation dataset. In practice, we find that a validation set lasting for ∼2 seconds is sufficient to produce accurate validation results, and the associated network bandwidth consumption is negligible.
Probing Â^s_i for inference frame rate ω. Fig. 7 illustrates how Â^s_i is probed with different inference frame rates, which correspond to the computation resources allocated across video streams. A total of n candidate inference frame rates form a finite set {ω_1, ..., ω_n} ordered by their resource demand. At the beginning of retraining window i, for a certain stream s, after the retraining frame rate ϕ is decided and the retrained model is obtained, the validation dataset is fed to the retrained model to obtain the inference results. We can derive the inference accuracy obtained with the highest inference frame rate, Â^s_i(ω_n, ϕ), by comparing the inference results with the pseudo-labels. The inference results are then uniformly downsampled to a lower frame rate, i.e., a proportion of inference results are dropped and padded with results from previous frames. The downsampled results are compared with the pseudo-labels to yield an inference accuracy corresponding to Â^s_i(ω_{n−1}, ϕ). By gradually downsampling the inference results at multiple levels, the inference accuracies with different inference frame rates are estimated.
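The drop-and-pad step can be sketched as follows (our own minimal version; the real system operates on detection results rather than scalars):

```python
def downsample_results(results, high_fps, low_fps):
    """Emulate a lower inference frame rate from results obtained at high_fps:
    keep results at positions matching low_fps and pad each dropped position
    with the most recent kept result."""
    step = high_fps / low_fps
    kept = {int(i * step) for i in range(len(results))}
    out, last = [], None
    for i, r in enumerate(results):
        if i in kept or last is None:
            last = r           # this frame's result is kept at the lower rate
        out.append(last)       # dropped positions reuse the previous result
    return out
```

Comparing the padded sequence against the pseudo-labels then yields the estimated accuracy at the lower inference frame rate without rerunning inference.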
AGRA algorithm. Algorithm 1 describes the detailed AGRA procedure. The resource monitor on each edge server proactively measures the available GPU and uplink bandwidth resources and periodically synchronizes its measurements with the cloud. On the cloud, the overall uplink network bandwidth U is allocated across multiple video streams (line 2). After the retraining frame rates (i.e., the network bandwidth allocation) are decided, for each edge server e, its computation resources C_e are allocated across the video streams associated with it (line 4).
DFS is adopted to allocate the resources. A path from the root to a leaf node in the DFS tree represents a complete allocation plan where the frame rates of all streams are decided. Meanwhile, a path from the root to a branch node represents a pending allocation plan where the frame rates of some streams are not yet decided. Starting from the root, when reaching a node on the i-th layer of the search tree, the frame rate of the i-th stream participating in the allocation is to be decided. So the DFS tree for the resource allocation problem of m streams and n candidate frame rates has m layers, with each branch node having n children.
When searching each node (lines 6-19), S represents the set of streams whose frame rates are not decided, resCurr represents the amount of remaining resources, accCurr represents the overall accuracy of the settled video streams, and accBest records the best overall accuracy achieved so far. When visiting a leaf node (line 8), a complete allocation plan is obtained and the current overall accuracy is directly returned. Otherwise, when visiting a branch node after at least one complete allocation plan has been found, whose overall accuracy is stored in accBest (line 9), an estimation function is used to estimate the upper bound of the overall accuracy of all unsettled streams at the current branch node. The branch can be confidently pruned without visiting its children if the estimated upper bound is still lower than the previously recorded accBest (line 10). Otherwise, one video stream is selected and tried with all candidate frame rates, which generates multiple downsized problems with one fewer unsettled stream. The downsized problems are then solved recursively (lines 11-19), going one layer deeper in the search tree. Note that the function used to obtain resource consumption from a configuration (line 13) is R_n for retraining frame rates and R_c for inference frame rates.
The bound function takes in a set of streams and an amount of resources, and produces the upper bound of the overall accuracy of those streams (lines 20-30). The accuracy gradient is derived by the GRAD function (line 23) using the probing results described above. Mathematically, the gradient of a stream s in window i for retraining frame rate ϕ_k can be estimated as the accuracy difference over the resource difference between adjacent candidate rates, i.e., (Â^s_i(∗, ϕ_k) − Â^s_i(∗, ϕ_{k−1})) / (R_n(ϕ_k) − R_n(ϕ_{k−1})). Likewise, the accuracy gradient of a stream s in window i for inference frame rate ω_k can be estimated as (Â^s_i(ω_k, ϕ) − Â^s_i(ω_{k−1}, ϕ)) / (R_c(ω_k) − R_c(ω_{k−1})). When deriving the accuracy upper bound, all streams start with the lowest frame rate and the corresponding accuracy gradients are calculated (line 23). The resources are thereafter repeatedly allocated to the stream with the highest gradient, with linear relaxation, until all resources are allocated (lines 24-29).
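A gradient-guided bound with linear relaxation can be sketched as follows. This is our own simplification of the bound function, not the paper's Algorithm 1: `acc[s]` holds probed accuracy values per candidate-rate index, and `cost` maps a rate index to its resource consumption, standing in for R_n or R_c:

```python
def upper_bound(streams, budget, acc, cost, rates):
    """Upper-bound the total accuracy of `streams` under resource `budget`.
    With acc concave in the allocated resource, greedily granting the next
    discrete step to the stream with the highest accuracy gradient, plus a
    final fractional (linearly relaxed) step, dominates any discrete plan."""
    level = {s: 0 for s in streams}                 # start at the lowest rate
    total = sum(acc[s][0] for s in streams)
    budget -= sum(cost(0) for _ in streams)
    while True:
        best, best_g = None, 0.0
        for s in streams:                           # highest-gradient next step
            k = level[s]
            if k + 1 < len(rates):
                g = (acc[s][k + 1] - acc[s][k]) / (cost(k + 1) - cost(k))
                if g > best_g:
                    best, best_g = s, g
        if best is None or best_g <= 0:
            break
        step_cost = cost(level[best] + 1) - cost(level[best])
        if step_cost <= budget:                     # take the full step
            total += acc[best][level[best] + 1] - acc[best][level[best]]
            budget -= step_cost
            level[best] += 1
        else:                                       # linear relaxation
            total += best_g * budget
            break
    return total
```

Because the fractional step never underestimates what the remaining budget could buy under concavity, the returned value can safely prune a branch whenever it falls below accBest.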
Theoretically, if the resource-accuracy curve is concave, the accuracy gradient decreases monotonically with the amount of allocated resources, and the bound function is guaranteed to provide an upper bound for a pending allocation plan. The upper bound can be used to prune branches and avoid unnecessary probing over the entire search tree, leading to a much smaller constant factor in the complexity. In addition, AGRA avoids the joint allocation of network and computation resources by separately determining the retraining and inference frame rates. Such decomposition reduces the search space from O(mnp) to O((m + n)p).

3.3 Aggregated Model Training
Though the computation resources on the cloud are assumed to be ample for retraining all edge models, model retraining can still be time-consuming and computationally demanding when the system scales up in practice (e.g., tens to hundreds of edge servers, each with many video streams). The retraining cost may impair the timeliness of the retrained models deployed at the edge. The overall retraining cost on the cloud would grow linearly with the total number of video streams if each model were individually retrained.
To tackle this problem, we take advantage of the fact that most modern DNN models can be split into a feature extraction backbone and a task-specific head [30,45], where the former is generally much more complicated than the latter and training it contributes the major part of the training cost. Meanwhile, the feature distribution of a video stream shows less data drift compared with task-specific contents [4,15,58], suggesting that the backbone tends to remain relatively stable over time. With such an observation, we choose to retrain the head and backbone of an edge model separately to reduce the training cost. In this approach, the model head is retrained and updated in every retraining window, while the model backbone is retrained every k windows (k > 1) using data aggregated over time. Besides such temporal aggregation, we leverage the spatial correlation of video content distributions across geographically colocated video streams. In our setting, each edge server is regional and the multiple video sources connected to the same edge server are geo-colocated, so their video contents share similar feature distributions, e.g., lighting conditions and motion. Such an assumption has been adopted and proven valid in existing works [20]. Meanwhile, the difference in data distribution across individual video streams is generally reflected by the task-specific heads. Accordingly, our design retrains a common backbone shared across multiple geo-colocated streams, while each stream individually retrains its personalized head. The design ensures that only one model backbone is maintained for each edge server, so the overall retraining cost is constrained by the number of edge servers instead of growing indefinitely with the number of video streams. The aggregated training set is downsampled to a fixed size to ensure uniform resource consumption across different edge servers and retraining windows.
[Algorithm 1 (AGRA pseudocode) appears here. Input: network bandwidth U, computation resources {C_e | e ∈ E}, a set of streams S, retraining configurations Φ, inference configurations Ω. Output: retraining frame rates retCfgs and inference frame rates infCfgs for each stream.]
Fig. 8 illustrates the detailed process of aggregated model training. Suppose n streams are connected to one edge server. The edge models of these streams share a common backbone and have individual heads; only one such model is illustrated in the figure. Generally, in each retraining window, the model backbone is frozen and only the task-specific head of each edge model is retrained individually using frames sampled from its corresponding stream, shown as ①. Every k retraining windows, aggregated retraining is triggered to update the model backbone of all edge models. When training the shared model backbone, the training datasets are aggregated not only from the past k windows (temporal aggregation) but also from the n streams of the same edge server (spatial aggregation). The aggregated training set is denoted by the dashed rectangle in the figure. Note that this aggregated training, shown as ②, takes place only once across the n streams, after which the updated backbone is shared with all streams. Finally, the system resumes general training with the backbone frozen, and each stream uses its individual training set from the previous window to retrain its personalized head.
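The schedule above can be sketched in a few lines of Go. This is an illustrative reconstruction under our reading of Fig. 8, not the system's actual code; all type and function names (`EdgeModel`, `retrainHead`, `retrainBackbone`, `runWindow`) are assumptions, and the retraining steps are stubs.

```go
package main

import "fmt"

// Frame is a placeholder for a sampled training frame.
type Frame struct{ StreamID, Window int }

// EdgeModel pairs a backbone shared across streams with a per-stream head.
type EdgeModel struct {
	Backbone string // shared across streams on one edge server
	HeadID   int    // personalized per stream
}

// retrainHead stands in for per-stream head retraining on one window's samples.
func retrainHead(m *EdgeModel, samples []Frame) {}

// retrainBackbone stands in for aggregated backbone retraining; it returns
// an identifier for the updated backbone.
func retrainBackbone(aggregated []Frame) string { return "backbone'" }

// runWindow executes one retraining window for n streams; buffer accumulates
// samples across windows for temporal aggregation.
func runWindow(window, k, n int, models []*EdgeModel, buffer *[]Frame) {
	for s := 0; s < n; s++ {
		samples := []Frame{{StreamID: s, Window: window}} // frames sampled this window
		*buffer = append(*buffer, samples...)             // spatial + temporal aggregation
		retrainHead(models[s], samples)                   // step ①: per-stream head update
	}
	if window%k == 0 { // step ②: every k windows, aggregated backbone update
		newBackbone := retrainBackbone(*buffer)
		for _, m := range models {
			m.Backbone = newBackbone // one backbone shared by all streams
		}
		*buffer = (*buffer)[:0] // start a fresh aggregation window
	}
}

func main() {
	k, n := 2, 3
	models := make([]*EdgeModel, n)
	for s := range models {
		models[s] = &EdgeModel{Backbone: "backbone0", HeadID: s}
	}
	var buffer []Frame
	for w := 1; w <= 4; w++ {
		runWindow(w, k, n, models, &buffer)
	}
	fmt.Println(models[0].Backbone) // backbone was refreshed at windows 2 and 4
}
```

Note that the aggregated step runs once per edge server regardless of n, which is exactly why the retraining cost scales with the number of edge servers rather than streams.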

Implementation Details
We provide other implementation details in this section.
Network and GPU utilization. The frame scheduler on an edge server maintains one inference queue and one upload queue for each stream. When a frame arrives at the edge server, it is duplicated and stored in both queues. Frames are taken from the two queues in sequence and supplied to the edge model for inference or to the network interface for uploading to the cloud, respectively. The frame scheduler uses a weighted round-robin (WRR) algorithm to iteratively select frames from each stream's queue, where the weight assigned to each upload (inference) queue is proportional to the retraining (inference) frame rate of the corresponding stream. The WRR scheduler ensures that the computation and network resource consumption of each video stream is exactly the allocated amount, so there is no need for explicit low-level resource management, e.g., hardware virtualization.
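A minimal WRR sketch follows, assuming integer weights derived from the allocated frame rates; `wrrOrder` is an illustrative helper, not the scheduler's real interface.

```go
package main

import "fmt"

// wrrOrder returns the order in which stream queues are served over one
// full WRR cycle: stream s appears weights[s] times, interleaved round by
// round so no stream is starved within a cycle.
func wrrOrder(weights []int) []int {
	order := []int{}
	remaining := append([]int(nil), weights...) // per-stream credits this cycle
	for {
		served := false
		for s := range remaining {
			if remaining[s] > 0 { // serve one frame from stream s's queue
				order = append(order, s)
				remaining[s]--
				served = true
			}
		}
		if !served {
			return order // all credits spent; cycle complete
		}
	}
}

func main() {
	// Stream 0 is allocated twice the frame rate of stream 1.
	fmt.Println(wrrOrder([]int{2, 1})) // [0 1 0]
}
```

Because each stream is served exactly in proportion to its weight per cycle, the long-run share of GPU time (or uplink bandwidth) matches the allocated frame rate without any hardware-level partitioning.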
Pipelined model download. Transmitting multiple models from the cloud to the edge servers incurs high downlink bandwidth usage, which may cause network congestion and also reduce the timeliness of retrained model updates. A pipelined model download technique is adopted to determine a proper retraining window duration and arrange the model retraining and delivery of individual streams in a pipelined manner, so the downlink bandwidth is fully utilized without congestion. When pipelining the model download, only one edge model takes the full downlink bandwidth at a time, because sequential transmission provides better timeliness: only a completely downloaded model can take effect for later inference. The desired duration of the retraining window T is the product of the number of streams and the delay of a single model download. The downlink bandwidth can be measured at runtime to adjust T dynamically.
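The window sizing rule above amounts to one line of arithmetic. The sketch below is a back-of-the-envelope illustration; the function name and the assumption that model size and measured downlink fully determine the download delay are ours.

```go
package main

import "fmt"

// windowDuration returns the desired retraining window length T in seconds:
// with n streams whose models are downloaded sequentially at full downlink
// bandwidth, T must cover n back-to-back downloads.
func windowDuration(nStreams int, modelSizeMb, downlinkMbps float64) float64 {
	downloadDelay := modelSizeMb / downlinkMbps // seconds per model download
	return float64(nStreams) * downloadDelay
}

func main() {
	// 10 streams, a 100 Mb model, and a measured 50 Mbps downlink
	// give a 20 s retraining window.
	fmt.Println(windowDuration(10, 100, 50))
}
```

Re-measuring `downlinkMbps` each window and recomputing T realizes the dynamic adjustment described above.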
Adaptive strategy. Due to the complexity of online scenarios and the diversity of video contents, neither continuous learning nor model aggregation is guaranteed to improve model performance. Inference accuracy after retraining may decrease if the data from consecutive retraining windows or geo-colocated streams are not strongly correlated. We apply an adaptive strategy to avoid possible performance degradation caused by model training. Instead of directly replacing models at the edge server with retrained ones, our system constantly monitors the validation accuracy of retrained models during aggregated model training. When the system observes decreasing post-retraining accuracy, it drops the retrained model instead of adding it to the inference model pool, which ensures stability and robustness.
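The accept/reject rule can be sketched as a single guarded update. The function name `maybeUpdate` and the tie-breaking choice (accepting a retrained model whose validation accuracy is equal) are our assumptions; the paper only specifies that models with decreasing accuracy are dropped.

```go
package main

import "fmt"

// maybeUpdate decides whether a retrained model should replace the deployed
// one. It returns the validation accuracy of the model kept for inference
// and whether the retrained model was adopted.
func maybeUpdate(currentAcc, retrainedAcc float64) (float64, bool) {
	if retrainedAcc >= currentAcc {
		return retrainedAcc, true // adopt: retraining did not hurt accuracy
	}
	return currentAcc, false // drop the retrained model to stay robust
}

func main() {
	acc, adopted := maybeUpdate(0.62, 0.58)
	fmt.Println(acc, adopted) // the degraded retrained model is rejected
}
```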
Video compression. One option when streaming video frames to the cloud is to compress frames into video, e.g., with H.264 [41] or VP8 [2]. Our observation in practice, however, suggests that network traffic reduction benefits little from temporal compression, mainly because the retraining frame rate is already very low. On the other hand, video compression may introduce additional content distortion as well as extra computation cost at the edge. After careful consideration, we keep uploading original frames without compression, but leave the system open to such alternatives.
Compatibility with other models, metrics and tasks. Our system allows the use of models, metrics and tasks other than the ones used in the motivation and evaluation, as long as they follow the input and output conventions of the system, i.e., the DNN model takes frames as input and produces inference results, and the evaluation metric takes in inference results and yields a value representing inference accuracy. For example, we can use more state-of-the-art models such as those with deep layer aggregation (DLA) [59], and mean intersection over union (mIoU) as the evaluation metric. We make these components modifiable as user-defined plugins.
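The plugin boundary described above can be expressed as two small interfaces. This is a hypothetical sketch of what such a contract might look like in Go, not the system's real plugin API; all identifiers (`Model`, `Metric`, `dummyModel`, `exactMatch`) are illustrative.

```go
package main

import "fmt"

// Frame is raw frame data fed to a model plugin.
type Frame []byte

// Detection is one inference result produced by a model plugin.
type Detection struct {
	Label string
	Score float64
}

// Model is the user-defined inference plugin: frames in, results out.
type Model interface {
	Infer(f Frame) []Detection
}

// Metric is the user-defined accuracy plugin (e.g., mAP or mIoU):
// results in, a scalar accuracy out.
type Metric func(pred, truth []Detection) float64

// dummyModel is a toy plugin returning a fixed detection.
type dummyModel struct{}

func (dummyModel) Infer(f Frame) []Detection {
	return []Detection{{Label: "car", Score: 0.9}}
}

// exactMatch is a toy metric: 1.0 if the labels match pairwise, else 0.0.
func exactMatch(pred, truth []Detection) float64 {
	if len(pred) != len(truth) {
		return 0
	}
	for i := range pred {
		if pred[i].Label != truth[i].Label {
			return 0
		}
	}
	return 1
}

func main() {
	var m Model = dummyModel{}
	var score Metric = exactMatch
	pred := m.Infer(Frame{})
	fmt.Println(score(pred, []Detection{{Label: "car"}})) // 1
}
```

Any model and metric satisfying these two signatures can be swapped in without touching the scheduling or allocation logic.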

EVALUATION
In this section, the overall performance improvement of the proposed system is demonstrated with end-to-end experiments. Breakdown experiments are then performed to examine how individual design components contribute to this improvement.

Experiment Setting
Testbed. The proposed framework is implemented in Golang. Roles including video sources, edge servers and the cloud are implemented as standalone processes and communicate using the Google Remote Procedure Call (gRPC) protocol. MMDetection [7], an object detection framework built upon PyTorch [38], is used for DNN inference and model retraining. Hyperparameters for retraining and inference are specified by the default configuration files in MMDetection, except that the number of training epochs is fixed at 40.
Edge servers are equipped with Nvidia Tesla T4 GPUs for inference, and the cloud uses Nvidia GeForce RTX3090 GPUs for retraining and generating training labels. The network connection between the cloud and edge is a 100 Mbps WAN link. Besides, we also implement a software-based resource emulator that can limit the computation resource and network bandwidth to a user-specified amount, allowing us to test system performance at finer resource granularity.

Task and model selection. Object detection is used as the vehicle DNN task to study system performance. Faster-RCNN with a ResNet101 backbone is used as the golden model on the cloud, and SSDLite with a MobileNetV2 backbone is used as the lightweight model deployed at edge servers. We use hard labels generated by the golden model with a threshold of 0.5. Both models are pre-trained on the COCO dataset. Average mAP over all streams is used as the accuracy metric.

End-to-end Study
We compare the proposed system with 3 baseline approaches, namely (1) edge-based inference, where video streams are processed solely at edge servers [47], (2) cloud-based inference, where video streams are streamed to and processed on the cloud [12], and (3) hybrid inference, which combines the above two, sending frames that are hard to infer to the cloud while processing the rest at edge servers [17]. To give hybrid inference the benefit of the doubt, an oracle obtained offline is used for frame selection, i.e., the system always selects the set of frames to upload to the cloud that yields the maximum accuracy improvement, representing the accuracy upper bound for hybrid inference.
Apart from the 3 baselines described above, we also consider Ekya [3], an alternative that also leverages continuous learning but conducts both model retraining and inference solely at the edge. However, we find that resource contention among inference, retraining and label generation on computation-constrained edge servers leads to excessive training delay, even longer than the retraining window, making it hard to accommodate such a solution on typical edge server hardware.
We start by experimenting with edge servers each connected to 10 concurrent streams, and vary the available computation resource on each edge server as well as the network bandwidth on edge-cloud links. Fig. 9 presents the results. The computation resource at edge servers is indicated by the total inference frame rate. Fig. 9a gives the achieved mAP on the UA-DETRAC dataset. The edge server's total inference frame rate is varied from 50 to 250 FPS with the edge-cloud bandwidth fixed at 50 Mbps (left figure), and then the edge-cloud bandwidth is varied from 20 to 100 Mbps with the edge computation resource fixed at 100 FPS (right figure). Fig. 9b shows the same set of results on the Bellevue Traffic dataset. The experimental results show that the performance of the proposed system grows monotonically with computation and network resources, and consistently outperforms all baselines under different resource bottlenecks. On the UA-DETRAC dataset, the cloud-edge collaborative design achieves up to 12.5%, 25.7% and 4.8% mAP gain compared with edge-based, cloud-based and hybrid inference respectively, while on the Bellevue Traffic dataset these gains are 26.4%, 28.6% and 22.3%.
These results quantify how the proposed system benefits from cloud-edge collaborative continuous learning. On one hand, when the network bandwidth is constrained, our approach gains higher network utility by using the uploaded frames as training samples rather than directly using them for cloud-side inference. On the other hand, the system can consistently improve the performance of lightweight models at the edge without introducing additional computation overhead. Note that each video sequence lasts 4 minutes in the UA-DETRAC dataset and 1 hour in Bellevue Traffic. The improved performance of our system on both datasets demonstrates the effectiveness of the continuous learning approach, which enhances model accuracy across time scales ranging from short intervals spanning several retraining windows to prolonged video streams.
We test system scalability by increasing the number of video streams under constrained computation and network resources, i.e., with two edge servers capped at 100 FPS processing power and the network link capped at 50 Mbps bandwidth. Fig. 10a presents the inference accuracy of the proposed approach in comparison with the baselines. With an increased number of concurrent streams, all approaches experience throttled performance, while the proposed system exhibits a consistent advantage over the baselines with up to 26.4% mAP gain, which indicates high scalability of our system in servicing large-scale applications. Fig. 10b presents the CDF of achieved mAPs across the 60 individual video streams with the proposed cloud-edge collaborative approach. We see balanced performance over all streams: over 95% of the streams have an mAP within two standard deviations of the average, and all streams are within three standard deviations. No particular stream suffers performance degradation due to persistent resource starvation.

Breakdown Study
Resource allocation. We evaluate the effectiveness of our resource allocation approach, AGRA, compared with two alternatives: (1) even, where the system allocates computation and network resources evenly across all streams, and (2) oracle, where the system is assumed to have complete knowledge of how allocated resources map to inference accuracies and allocates resources with a DP algorithm. The oracle scheme gives the theoretical upper bound of the gain from resource allocation, and is practically unattainable. Fig. 11 summarizes the achieved average mAP over all streams under the different resource allocation schemes. In the experiment, 10 video streams are selected from the UA-DETRAC dataset and placed at one edge server whose total inference frame rate is varied from 50 to 250 FPS. When the network resource is more constrained (e.g., in Fig. 11a, where the average bandwidth per stream is 1 Mbps), oracle allocation achieves up to 8.0% mAP gain on average compared with even allocation. Meanwhile, AGRA achieves up to 6.0% mAP gain and consistently outperforms even allocation under varied computation resource bottlenecks, which indicates that AGRA, though not optimal, comes very close to oracle allocation and achieves a non-trivial accuracy gain. When the network resource increases to 1.5 Mbps per stream in Fig. 11b, the performance gain of both oracle and AGRA over even allocation becomes slimmer, i.e., 4.2% mAP gain for oracle and 3.5% for AGRA. The reason is that the marginal gain from model retraining gradually diminishes with the retraining frame rate, which makes both oracle and AGRA eventually converge to even allocation.
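To make the accuracy-gradient intuition concrete, the sketch below greedily gives each unit of a single resource budget to the stream with the largest marginal accuracy gain. This is a simplified 1D stand-in for AGRA, which jointly allocates network and computation resources with a pruned DFS; the profile tables and the `allocate` helper are illustrative assumptions. The greedy rule is optimal only when per-stream accuracy profiles are concave (diminishing returns), which matches the marginal-gain behavior observed above.

```go
package main

import "fmt"

// allocate distributes `budget` frame-rate units across streams.
// profiles[s][r] is the (assumed known) accuracy of stream s when
// allocated r units; each unit goes to the largest marginal gain.
func allocate(profiles [][]float64, budget int) []int {
	alloc := make([]int, len(profiles))
	for b := 0; b < budget; b++ {
		best, bestGain := -1, 0.0
		for s, p := range profiles {
			if alloc[s]+1 < len(p) {
				gain := p[alloc[s]+1] - p[alloc[s]] // marginal accuracy gain
				if gain > bestGain {
					best, bestGain = s, gain
				}
			}
		}
		if best < 0 {
			break // no stream can productively use more resources
		}
		alloc[best]++
	}
	return alloc
}

func main() {
	profiles := [][]float64{
		{0.0, 0.30, 0.40, 0.45}, // stream 0: steep, then diminishing returns
		{0.0, 0.10, 0.15, 0.18}, // stream 1: flatter accuracy curve
	}
	fmt.Println(allocate(profiles, 3)) // [2 1]
}
```

With 3 units of budget, the steep stream receives the first two units and the flatter stream the third, illustrating why gradient-based allocation beats even allocation under tight budgets and converges toward it as budgets grow.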
We also evaluate the cost of our resource allocation scheme. We compare AGRA with two alternatives: (1) DP, the dynamic programming approach on the 2D profile matrix without pruning, and (2) DP-split, which also uses dynamic programming but splits the 2D allocation problem into two 1D problems and allocates network and computation resources separately.
Table 2 presents the cost of online allocation, measured as the average operation time per video stream per retraining window. The computation resource is fixed to a total inference frame rate of 100 FPS and the network bandwidth is fixed at 100 Mbps. The number of concurrent video streams for resource allocation is varied from 5 to 100. The measured results show that the average online allocation cost of both DP and DP-split stays constant as the number of streams increases, implying linear growth of the total cost as the system scales to more streams. On the other hand, with the proposed pruning-based DFS approach, the amortized cost of AGRA gradually decreases, with up to a 5.7× speed-up. This is because with more streams the DFS tree becomes shallower and the optimal allocation plan can be found with fewer probes, leading to more branches being pruned, which makes AGRA scalable with a growing number of streams.

Aggregated model training. We perform experiments to demonstrate how aggregated model training can reduce online training cost at a small cost in inference accuracy. We pick 16 video streams from the UA-DETRAC dataset and allocate them across 4 edge servers based on camera locations. The size of spatial aggregation is fixed to 4 (1 for each edge server). The temporal aggregation length k is varied from 1 to 4. We compare the achieved average mAP and training cost with a common retraining technique that does not use aggregation.
Table 3 summarizes the average accuracy and the cost of model retraining amortized across streams and retraining windows. The experiment is performed with two network bandwidth settings, i.e., an average uplink bandwidth of 1 Mbps and 2 Mbps per stream, respectively. The results show that aggregated model training significantly reduces the online retraining cost. The gain grows as temporal aggregation is applied across more retraining windows (as k increases from 1 to 4). When the average uplink bandwidth is set to 1 Mbps, aggregated and non-aggregated training achieve similar mAP. With a larger uplink bandwidth (2 Mbps per stream), aggregated training may lead to a slight drop in mAP (by 4% at most) but reduces the training cost by up to 68.8%. In our prototype implementation, we choose k = 2 to achieve a balanced trade-off between accuracy and training cost.

RELATED WORK
Distributed video analytics. Distributed video analytics is a common way to improve system scalability. Existing studies aim at splitting a task either into multiple stages of a processing pipeline [14,17,28,62] or into parallel subtasks [21,48,64], and distributing them across machines. Among these studies, EdgeDuet [48] specifically adopts an edge-end collaborative design where hard data samples are offloaded to a powerful edge server while easy ones are kept at the end device for local processing. In such a parallel offloading setting, the cloud or edge servers are treated as a simple extension of computation capability. In our design, the cloud plays the role of retraining edge models to ensure inference accuracy.
Continuous learning. Continuous learning aims at tackling the problem of concept drift. Continuous learning approaches have been adopted in existing video analytics designs to refine DNN models continuously online [10,11,13,27,35,42]. Most existing designs, however, mainly focus on training techniques without considering systematic design issues such as multi-stream resource contention. Ekya [3] proposes a continuous learning framework, which however places both model inference and retraining at the edge server. Without cloud-edge collaboration, that design is fundamentally limited by the computation resources available at the edge. This paper integrates the existing continuous learning rationale into a cloud-edge collaborative architecture with a system implementation, and addresses the specific challenges involved in resource allocation and management.
Cloud-edge learning systems. Some learning systems adopt a cloud-edge collaborative architecture [16,19]. In such configurations, a learning task is divided into multiple steps, with a portion offloaded to the edge. Edge servers typically handle the less computationally demanding steps, such as data retrieval and preprocessing. This design reduces uplink data transmission and enhances resource efficiency, especially compared with cloud-only learning systems. Moreover, cloud-edge learning systems can integrate with federated learning [29,55,57]; in this scenario, edge servers preprocess data to bolster privacy and anonymity prior to its transmission to the cloud. While much of the existing research on cloud-edge systems explores the use of edge computing for efficient or privacy-focused training, our system focuses on video analytics, where the emphasis is on model inference, with training primarily serving to improve the accuracy of the lightweight models deployed at edge servers.
Dynamic configurations. Many video analytics systems support adapting configurations dynamically in an online manner [1,12,20,22,23,25,47,63]. Such adaptation aims at achieving a desired trade-off between inference accuracy and resource consumption by choosing a proper configuration, which can be spatial (resolution), temporal (frame rate) or model-related (specific model settings). Most existing works, however, only consider computation resources on inference servers and focus on a single-stream scenario, where resource competition and multi-stream diversity are largely neglected. This paper considers dynamic configurations within a cloud-edge collaborative system, and adapts both the inference and retraining configurations for resource allocation.

DISCUSSION
Quantization and model pruning. Edge servers face computational resource limitations, making it challenging to run complex DNN models. To address this, two model compression techniques have been introduced, i.e., quantization [18,39] and model pruning [24,34]. In the quantization method, model parameters are represented using lower-precision formats, while model pruning discards parameters that have minimal impact on inference. Both methods aim to decrease the model's size and computation cost, allowing the model to better fit edge hardware constraints. Our system does not prescribe how the lightweight model at edge servers is derived. In the experiments, SSDLite with a MobileNetV2 backbone is used without any quantization or model pruning and undergoes conventional training procedures. However, when a lightweight model is derived from a compressed version of a complex model, specific training techniques designed for such compressed models, e.g., quantization-aware training [44], can be incorporated in the retraining phase to optimize continuous learning.
Usage in wireless scenarios. In our setting, the uplink bandwidth between edge servers and the cloud is relatively consistent over time. However, when edge servers connect to the cloud through wireless access networks, several issues emerge. The primary concern is the simultaneous connection of multiple edge servers, which, combined with the natural interference of wireless signals, can cause significant fluctuations in wireless bandwidth. This variability can compromise the accuracy of the resource monitor in our system, leading to potential misestimation of available network resources in subsequent retraining windows and thereby negatively affecting the decisions of the resource allocator. This issue can be addressed at the physical layer by mitigating Wi-Fi or cellular interference [36,66], or at the transport layer by implementing bandwidth prediction algorithms tailored to wireless networks [53,61]. Further exploration of these solutions is reserved for future research.

CONCLUSION
In this paper, we present a systematic design of a video analytics framework with cloud-edge collaborative continuous learning. The resource allocator of the system manages frame rate selection for multiple streams to best utilize the limited computation and network resources, and the aggregated model training technique reduces the computation resources consumed by retraining. Evaluation results show the system can achieve up to 28.6% absolute mAP gain on object detection tasks. We conclude that the system is effective for real-world applications.

Fig. 2 .
Fig. 2. Performance comparison with (a) the edge-based approach and (b) the cloud-based approach.

Fig. 3 .Fig. 4 .
Fig. 3. The achieved mAP spatially across different video streams with varied (a) computation and (b) network resources, and temporally across different segments from the same stream with varied (c) computation and (d) network resources.

Fig. 5 .
Fig. 5. The achieved mAP with varied computation and network resources.

Fig. 8 .
Fig. 8. Aggregated model training for a DNN model composed of backbone and head.

Fig. 9 .
Fig. 9. Inference accuracy of different approaches with varied computation or network resources.

Fig. 10 .
Fig. 10. (a) Inference accuracy of different approaches with a varied number of concurrent streams. (b) The CDF of achieved mAPs across 60 video streams.

Fig. 11 .
Fig. 11. Comparison with even and oracle schemes under different computation and network bottlenecks.

Table 2 .
The cost of resource allocation when different schemes are used.

Table 3 .
Accuracy and training cost with and without aggregated model training.