Scalable Solutions for Efficient Real-Time Distributed Video Analytics with Vehicle Detection on CPU Edge Nodes

Traditional video analytics is typically performed on a single node with limited processing power and vertical scalability, resulting in a lack of real-time performance and a potential single point of failure. To address these issues, this research pioneers a distributed video processing pipeline hosted on Microsoft Azure Cloud, employing the Apache Kafka distributed messaging framework and optimizing the YOLOv8 model for real-time video analytics on CPU edge nodes. The study explores two optimization strategies: post-training optimization with magnitude pruning and pre-training optimization using OpenVINO. Through meticulous experimentation, the research delves into the nuanced relationship between model complexity, object detection accuracy, and real-time processing, which is crucial for resource-constrained environments. The experiments show that the pre-training optimization not only enhances accuracy but also significantly improves processing speed, rendering it particularly well suited for real-time scenarios. The Kafka-based infrastructure ensures fault tolerance and exhibits scalability, providing reliable distributed video processing. This research establishes the significance of distributed video analytics for real-time applications, offering insights into optimal configurations on lower-end CPU devices.


INTRODUCTION
Video surveillance has emerged as a prominent use case in various domains, with the introduction of high-tech surveillance cameras and IoT devices contributing significantly to the field. In cities and urban areas, thousands of surveillance cameras are deployed at various locations to ensure real-time monitoring and enhance security measures. In both the governmental and private sectors, video surveillance has become a critical concern, emphasizing the need to provide real-time visibility into these environments. Real-time visibility, often supported by advanced analytics, has become an essential component of video surveillance in municipal governments, national defense, traffic management, businesses, and residential security and health services.
However, the increasing demand for video storage and the complexities associated with processing vast amounts of multi-channel video data pose significant challenges for video processing technology. For instance, in traffic monitoring applications [10,12,13,15,16], efficiently processing large volumes of video data from diverse sources in real time requires robust technology capable of handling the substantial data load and meeting strict timeliness requirements. The need to process such extensive video data in real time necessitates powerful processors and sufficient memory to accommodate the workload. Consequently, popular big data tools like Apache Kafka and Apache Spark become indispensable for real-time video analysis by providing distributed storage and processing capabilities. Traditional approaches to video analytics typically rely on GPU-based object detection models, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) [7,11]. However, these models are inherently heavy in terms of computational requirements and are primarily designed for GPU environments. This poses compatibility challenges when deploying video analytics on distributed edge devices consisting of CPU-based nodes. To overcome these challenges, this research explores scalable solutions for efficient real-time distributed video analytics on CPU edge nodes. The primary objective is to optimize object detection models specifically for CPU-based distributed edge devices. By developing and fine-tuning lightweight object detection models tailored to the resource constraints of CPU-based nodes, real-time video processing performance on distributed edge devices can be enhanced. The research investigated various techniques and approaches to maximize the efficiency and accuracy of object detection models while ensuring compatibility with CPU-based distributed edge nodes. By addressing the limitations of existing GPU-based approaches and leveraging the computational capabilities of CPU edge nodes, this research aims to enable real-time video processing and analysis in distributed environments, with the goal of providing efficient and scalable solutions for real-time video analytics.

MATERIALS AND METHODS
Figure 1 below represents the overall high-level system block diagram for a real-time video surveillance pipeline.

Video Source
In this research, multiple video files were used to simulate a real-time video feed from a camera. In a practical environment, there would be multiple surveillance cameras continuously capturing video, with their feeds collected over Ethernet or wireless media using security protocols such as RTSP.

Kafka for Distributed Video Processing
For the implementation of distributed video processing, Apache Kafka was used because of its pivotal role as a distributed and fault-tolerant messaging system. Its significance lies in its ability to seamlessly handle the streaming of real-time video frames, ensuring that data from multiple sources is reliably and concurrently ingested by the processing system. The necessity of Kafka [6] becomes evident in scenarios demanding low-latency, high-throughput data processing, making it a cornerstone technology for building scalable, real-time video analytics solutions.
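To make the ingestion path concrete, the following is a minimal sketch of a frame producer, assuming the kafka-python and opencv-python packages; the broker address, topic name, and video file are illustrative placeholders rather than the exact configuration used in this work.

```python
# Minimal sketch of a frame producer (illustrative, not the exact setup used here).
import cv2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="KAFKA_SERVER_IP:9092")  # placeholder broker

cap = cv2.VideoCapture("traffic.mp4")  # a video file simulating a live camera feed
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # JPEG-encode the frame so it can be sent as a compact byte payload.
    _, buffer = cv2.imencode(".jpg", frame)
    producer.send("video-frames", buffer.tobytes())  # placeholder topic name

producer.flush()
cap.release()
```

Encoding frames as JPEG keeps individual messages small, which suits Kafka's design of handling many modest-sized messages rather than raw uncompressed frames.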

Kafka configuration and communication
For the Kafka configuration, Microsoft Azure served as the chosen platform, where a comprehensive testing environment was established using a combination of virtual machines (VMs), as shown in Figure 2. The configuration involved five VMs (Ubuntu 20), with one designated as the Kafka server and the remaining four partitioned into two consumer groups. All servers were situated within an Azure Virtual Network, facilitating secure and controlled communication. Communication between these servers was established using the TCP/IP protocol, ensuring reliable data exchange. Additionally, each server was configured with its respective security group, specifying the inbound and outbound traffic rules required for communication.
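As a hypothetical illustration of how the topic feeding the two consumer groups could be provisioned, the sketch below uses the kafka-python admin client; the broker address, topic name, and partition count are placeholders, not the exact values used in this study.

```python
# Illustrative topic provisioning for the frame stream.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="KAFKA_SERVER_IP:9092")  # placeholder broker

# Multiple partitions allow the consumer VMs within each consumer group
# to read frames from the topic in parallel.
admin.create_topics([
    NewTopic(name="video-frames", num_partitions=4, replication_factor=1)
])
admin.close()
```

Because Kafka divides a topic's partitions among the consumers of a group, giving the topic at least as many partitions as there are consumer VMs lets both groups process frames concurrently.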

Model Optimization for CPU Nodes
For the object detection model, the state-of-the-art YOLOv8 was used, which has the capability [2][3][4] of building on the achievements of its YOLO predecessors with innovative features and enhancements that elevate its performance and versatility. This model was run through different optimization techniques for deployment on CPU edge nodes. The following techniques were used for optimization.

Post-Training Optimization
The process involves pruning certain channels or units from the network of a pretrained model to reduce computational complexity, followed by fine-tuning or retraining to regain any performance that was compromised during the optimization steps. This helps achieve a good balance between model efficiency (reduced parameters and computations) and model accuracy.
Magnitude pruning is a type of weight pruning technique used to reduce the size of a neural network by removing unimportant weights, which plays a significant role [1] in running inference on low-end devices such as CPUs. The idea behind magnitude pruning is to identify and eliminate weights with low magnitude (small absolute values), since they are considered less critical for the network's performance. By removing these less important weights, the network becomes sparser, which can lead to reduced memory requirements and faster inference.
There are basically three steps that were followed for the magnitude pruning process, as shown in the block diagram in Figure 3.
The model goes through different pruning rates/ch_sparsity values (0.2, 0.3, 0.4, 0.5, 0.6, 0.7), followed by importance estimation using the Lp-norm. The Lp-norm of a weight vector w is given by equation 1, where n is the number of elements in the weight vector and p is a hyperparameter specified during the initialization of the Magnitude Importance class (here the L2-norm was used). Finally, based on a threshold, the pruned networks were subjected to a rigorous fine-tuning regime on the training data to recover any slight loss in accuracy caused by pruning. This helps achieve a good balance between model efficiency (reduced parameters and computations) and model accuracy.
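For completeness, the Lp-norm referred to as equation 1 takes the standard form below (written in LaTeX), where w_i denotes the i-th element of the weight vector:

```latex
\lVert w \rVert_p \;=\; \Big( \sum_{i=1}^{n} \lvert w_i \rvert^{p} \Big)^{1/p},
\qquad p = 2 \text{ (L2-norm) in this work.}
```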
This stage aimed to rejuvenate the model's predictive prowess on the target problem domain, vehicle detection [8]. Leveraging an expansive dataset comprising approximately 4500 images spanning six distinct vehicle classes, fine-tuning was done in a Google Colab environment. The process encompassed 120 training epochs, each with a batch size of 16. The input images, resized to 640x640 pixels, were meticulously processed to further refine the model's learned feature representations. This comprehensive fine-tuning strategy not only mitigated any degradation in model accuracy caused by pruning but also reinvigorated its ability to accurately detect and classify vehicles across the diverse set of classes.
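The terminology above (ch_sparsity, Magnitude Importance, Lp-norm) matches the Torch-Pruning library, so a minimal sketch of the pruning step might look like the following. It assumes the torch-pruning and ultralytics packages; the checkpoint path, ignored layers, and pruning rate are illustrative, and in practice YOLOv8's detection head and concatenation blocks may need additional handling.

```python
# Sketch of channel pruning with L2 magnitude importance (illustrative only).
import torch
import torch_pruning as tp
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")            # pretrained YOLOv8 checkpoint (placeholder path)
model = yolo.model                   # underlying torch.nn.Module
example_inputs = torch.randn(1, 3, 640, 640)

importance = tp.importance.MagnitudeImportance(p=2)   # L2-norm importance scores

# Keep the detection head intact and prune only backbone/neck channels.
ignored_layers = [model.model[-1]]

pruner = tp.pruner.MagnitudePruner(
    model,
    example_inputs,
    importance=importance,
    ch_sparsity=0.5,                 # one of the pruning rates explored (0.2-0.7)
    ignored_layers=ignored_layers,
)
pruner.step()                        # physically removes the low-magnitude channels

# The pruned model is then fine-tuned on the custom vehicle dataset
# (120 epochs, batch size 16, 640x640 input) to recover the lost accuracy.
```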

Pre-Training Optimization
In this optimization technique no pruning is involved. The optimization process mainly focuses on transforming the model into a format suitable for inference optimization with OpenVINO, rather than applying sparsity pruning or other model compression techniques. Finally, post-training quantization using NNCF was applied, which is a separate step performed after the model has been converted to the OpenVINO IR format.
OpenVINO (Open Visual Inference and Neural Network Optimization) [4,9] is a toolkit designed by Intel for enhancing the performance of deep learning models, particularly for deployment on edge devices. It offers a wide range of tools and optimizations to boost the performance of these models on Intel hardware, including CPUs, GPUs, VPUs, and FPGAs. OpenVINO's optimization process involves model conversion, precision calibration, and hardware-specific enhancements, allowing developers to achieve faster inference without compromising accuracy.
The following steps were used for optimizing the YOLOv8 model with OpenVINO; a condensed code sketch of these steps is given after the list.
• Load the pre-trained YOLOv8 model.
• Pre-process the model for OpenVINO conversion (fuse convolutional layers).
• Convert the YOLOv8 model to OpenVINO IR format with FP32 precision.
• Perform inference using the FP32 model to evaluate accuracy on the test dataset.
• Apply NNCF post-training quantization to the converted model.
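The sketch below condenses these steps, assuming the ultralytics, openvino, numpy, and nncf packages. The model path, output paths, and calibration data are illustrative placeholders; in practice a few hundred real, preprocessed frames from the training set would be supplied as calibration data.

```python
# Sketch of OpenVINO conversion and NNCF post-training quantization (illustrative).
import numpy as np
import nncf
import openvino as ov
from ultralytics import YOLO

# 1. Load the fine-tuned YOLOv8 model and export it to OpenVINO IR (FP32).
yolo = YOLO("yolov8n_vehicles.pt")      # fine-tuned weights (placeholder path)
yolo.export(format="openvino")          # fuses layers and writes an IR directory

core = ov.Core()
fp32_model = core.read_model("yolov8n_vehicles_openvino_model/yolov8n_vehicles.xml")

# 2. NNCF post-training quantization driven by a small calibration set.
#    Placeholder data is used here; real preprocessed frames should be used instead.
calibration_frames = [np.random.rand(1, 3, 640, 640).astype(np.float32)
                      for _ in range(100)]
calibration_dataset = nncf.Dataset(calibration_frames)
int8_model = nncf.quantize(fp32_model, calibration_dataset)

# 3. Save the quantized IR and compile it for CPU inference.
ov.save_model(int8_model, "yolov8n_vehicles_int8.xml")
compiled = core.compile_model(int8_model, "CPU")
```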

Data Collection and Annotation
Altogether, 4500 images were collected, divided into 6 classes, i.e., car, motorcycle, bus, truck, van, and tampoo. Each of these classes appears in at least 750 images. The class counts are not exact because most images contain more than one object. Of the 4500 images, around 3000 were taken from the web and were already annotated, while 1500 were collected manually from the streets of Kathmandu, with some gathered via image APIs. Once the data was collected, the images had to go through custom annotation to localize each object in every image, which was the most tedious and time-consuming phase.

Development Tools and Environment
• Google Colab Pro (500 compute units received from Google upon request for this thesis)

Optimized Model Performance
Two methods of model optimization were used for optimizing the YOLOv8 model for CPU edge nodes. Below are some of the findings of both strategies.

Post-Training Optimization
Result. In the post-training optimization process, the model first underwent magnitude pruning, which involves pruning certain channels or units from the network to reduce computational complexity, followed by fine-tuning or retraining to regain any performance that was compromised during the optimization steps.
Initially, the pretrained model was passed through magnitude pruning at one pruning rate and was then fine-tuned on the custom dataset. Table 1 reveals how different pruning rates impact the YOLOv8 model's accuracy and FPS. Lower pruning rates may preserve more information but result in a larger model, while higher pruning rates reduce model size but may lead to information loss. Finding the right balance is crucial for optimizing the model's performance and efficiency. The table also highlights how varying pruning rates affect the model's precision, recall, mean average precision at different IoU thresholds, and processing speed, providing valuable insights into the trade-offs involved in model optimization. These results shed light on the delicate balance between model complexity, object detection accuracy, and real-time processing capability, all essential considerations for deploying efficient computer vision systems on resource-constrained edge devices.
The metrics plots for Table 1 are shown below. They clearly depict that as the pruning rate/ch_sparsity increases, accuracy (mAP) decreases whereas FPS gradually increases. Finding the right balance between performance (FPS) and accuracy is therefore the cornerstone, and selecting a proper pruning rate is mandatory. Plotting all three metrics together shows that optimum accuracy with an average FPS of around 12 is obtained when a pruning rate of around 0.5 is used.

Pre-Training Optimization
Result. Unlike the post-training optimization, here the pretrained model was first fine-tuned with the custom dataset and only afterwards optimized via OpenVINO. The fine-tuned YOLOv8 model provided good results in terms of five key metrics: precision, recall, F1 score, mAP@0.5, and mAP@0.95. Table 2 summarizes the results obtained after training the model for 120 epochs with a batch size of 16. After fine-tuning, the model underwent the OpenVINO optimization technique, which includes a series of steps: conversion of the model to the OpenVINO format by fusing convolutional layers, kernel tuning, and finally NNCF post-training quantization. As shown in Table 3, there is a reduction in certain metrics such as precision, mAP@0.5, and F1 score after quantization, but there is a substantial improvement in processing speed (FPS). These trade-offs are common when optimizing models for deployment on resource-constrained devices. The specific choice of whether to apply quantization should consider the balance between model performance and efficiency based on the requirements of the application. Figure 5 provides a clear visualization of the above tables: although there is a slight drop in some accuracy metrics after quantization, the detection FPS increases massively, from 6 FPS to 22 FPS.

Evaluation of Real-time Video Processing Capability in Distributed Mode
For distributed video processing, Apache Kafka was used and the whole cluster was set up in the Microsoft Azure cloud. Full-scale testing was not performed; partial testing was done using 5 VMs.
The main reason for partial testing with a few VMs instead of full-scale testing was to determine the performance of the optimized model on the individual VM configurations (summarized in Table 4). In the context of real-time video processing on distributed nodes, the pre-training optimized model was used after a careful evaluation of multiple metrics, including FPS and F1 score. The pre-training optimization approach demonstrated a favorable balance between accuracy and speed, making it the preferred choice for processing real-time video streams in the distributed environment. This decision ensures efficient and accurate analysis, highlighting the significance of the selected optimization technique in enhancing the overall performance of the system for real-time applications.
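For illustration, a consumer node in this setup could look roughly like the sketch below, assuming the kafka-python, opencv-python, numpy, and openvino packages; the topic, consumer group, and model path are placeholders, and YOLOv8 post-processing (decoding boxes and applying NMS) is omitted.

```python
# Sketch of a consumer node running the quantized model on received frames (illustrative).
import cv2
import numpy as np
import openvino as ov
from kafka import KafkaConsumer

core = ov.Core()
model = core.read_model("yolov8n_vehicles_int8.xml")   # placeholder IR path
compiled = core.compile_model(model, "CPU")

consumer = KafkaConsumer(
    "video-frames",                                    # placeholder topic
    bootstrap_servers="KAFKA_SERVER_IP:9092",          # placeholder broker
    group_id="detector-group-1",                       # consumers in a group share partitions
)

for message in consumer:
    # Decode the JPEG payload back into an image.
    frame = cv2.imdecode(np.frombuffer(message.value, dtype=np.uint8), cv2.IMREAD_COLOR)
    # Resize and reorder to the 1x3x640x640 float layout expected by the model.
    blob = cv2.resize(frame, (640, 640)).transpose(2, 0, 1)[None].astype(np.float32) / 255.0
    results = compiled([blob])     # raw YOLOv8 predictions; post-processing omitted here
```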

DISCUSSION
In this research, the focus for optimizing the object detection model is primarily on the YOLOv8 model, presenting a tailored solution for real-time video analytics on CPU-based edge nodes. However, it is crucial to recognize certain limitations and opportunities for future exploration. The study's scope is confined to YOLOv8, and the generalizability of its findings to other object detection models remains unexplored. Future investigations could extend their purview to include diverse models, such as other lightweight SSD models, evaluating their performance on CPU-based edge devices and gaining insights into their unique strengths in terms of accuracy and performance.
Potential enhancements encompass a more diverse exploration of object detection models to understand the trade-offs between accuracy, processing speed, and resource efficiency. Tailoring optimization strategies for low-end devices is another avenue, ensuring adaptability to a broader range of hardware configurations. Comparative analyses of various optimized models, considering factors such as FPS, accuracy, and resource utilization, would contribute valuable insights for selecting the most suitable model based on deployment scenarios and hardware constraints.
A potential avenue for future research involves a nuanced exploration of performance and accuracy trade-offs in the context of dynamic frame skipping within continuous video streams.By intentionally skipping certain frames during processing, researchers can analyze the resultant impact on Frames Per Second (FPS) and overall accuracy, considering potential information loss.

CONCLUSION
In conclusion, this research has presented a comprehensive exploration of efficient real-time distributed video analytics in cloud-based environments, focusing on the optimization of YOLOv8 models for CPU-based edge devices. The post-training optimization, featuring magnitude pruning and subsequent fine-tuning, demonstrated notable results in terms of model efficiency. Furthermore, the pre-training optimization strategy, leveraging OpenVINO, showcased a superior balance between FPS and accuracy (precision, recall, and F1 scores). The distributed video processing performance, evaluated using Apache Kafka on Microsoft Azure VMs, revealed the significance of VM configurations in achieving optimal results. The chosen VM types, ranging from Standard DS1 v2 to Standard B4ms, displayed varying capabilities, emphasizing the crucial role of CPU cores and RAM in real-time video analytics.
The evaluation of real-time video processing capability in distributed mode underscored the importance of selecting appropriate VM types for seamless processing of surveillance camera videos at different frame rates. Additionally, the scalability of the system was discussed, highlighting its flexibility in accommodating increased source loads. The use of Kafka as a messaging buffer, combined with the ability to dynamically scale VMs, ensures the scalability of the architecture. Whether handling additional video sources or vertically scaling VMs, the system exhibits adaptability to evolving demands. In summary, the research provides valuable insights into optimizing model performance, evaluating distributed video processing, and ensuring scalability in real-world applications. The effective utilization of pre-training optimization with OpenVINO emerges as a key contribution, offering a powerful solution for achieving efficient and accurate real-time video analytics in cloud-based distributed environments.

Figure 5: FPS Comparison Before and After Quantization


Table 1: Post-Training Model Optimization Metrics

Table 2: Performance Metrics Summary of Fine-tuned Model

Table 3: Metrics Before and After Quantization

Table 4: Virtual Machine Specifications with Max FPS

[...] CPUs and 8 GB RAM, the optimized model showcased significant improvement, achieving a maximum FPS of 12. The most robust performance was observed on the Standard B4ms VM, with 4 CPUs and 16 GB RAM, achieving the highest FPS of 22. These results underscore the crucial role of CPU cores and adequate RAM in optimizing real-time video analytics, providing valuable insights into the ideal configurations for running our optimized model effectively on Azure VMs.