Poster: Profiling Event Vision Processing on Edge Devices

As RGB camera resolutions and frame rates improve, their increased energy requirements make it challenging to deploy fast, efficient, and low-power applications on edge devices. Newer classes of sensors, such as the biologically inspired neuromorphic event-based camera, capture only per-pixel changes in light intensity to achieve superior sensing latency (O(μs)), energy consumption (O(mW)), dynamic range (140 dB), and task accuracy (e.g., in object tracking) over traditional RGB camera streams. However, highly dynamic scenes can yield an event rate of up to 12 MEvents/second, the processing of which could overwhelm resource-constrained edge devices. Efficient processing of high volumes of event data is crucial for ultra-fast machine vision on edge devices. In this poster, we present a profiler that processes simulated event streams from RGB videos into 6 variants of framed representations for DNN inference on an NVIDIA Jetson AGX Orin, a representative edge device. The profiler evaluates the trade-offs between the volume of events evaluated, the quality of the processed event representation, and the processing time incurred, to present the design choices available to an edge-scale event camera-based application observing the same RGB scenes. We believe this analysis opens up the exploration of novel system designs for real-time, low-power event vision on edge devices.


INTRODUCTION
Accurate and efficient visual perception on resource-constrained edge devices is crucial to the realization of applications such as autonomous navigation on next-generation low-power edge platforms. Recent advancements in camera quality and resolution come at the cost of increased energy consumption during operation, creating a bottleneck to low-power, efficient sensing on edge devices. In comparison, biologically inspired neuromorphic event cameras mimic the human retina to provide extremely low-power (∼10–30 mW), highly reactive (O(μs) latency) sensing with a higher dynamic range (up to 140 dB), showing competitive DNN task accuracy against RGB cameras [8]. Neuromorphic event vision cameras (hereafter referred to as event cameras) move away from the CMOS sensor and the concept of a "frame" to capture delta changes in light intensity (both positive and negative changes/polarity) incident on every pixel asynchronously, as illustrated in Figure 1(a). As such, "images" can be synthesized/aggregated from a continuous stream of events, each reported as a tuple of (x, y, timestamp, polarity). These "images", or framed representations of the event stream, can then be used as inputs to an off-the-shelf DNN, leveraging decades of computer vision research on RGB frames. However, highly dynamic scenes (e.g., if the camera is in motion or observing many fast-moving objects) can yield a high volume of events (O(MEvents/s)), which can quickly overwhelm the CPU on a resource-constrained edge device as it cycles through all the events to create a framed representation for DNN inference. Prior works on event vision disregard event processing costs and either (i) focus on offline applications which post-process large volumes of event data for analysis, (ii) assume the deployment of a server-class GPU, or (iii) characterize event processing as an offline task, leaving real-time event processing at the edge an open problem.
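As a concrete illustration of the frame-synthesis idea above (a minimal sketch, not the poster's implementation; the function name and toy event window are hypothetical), a window of (x, y, timestamp, polarity) tuples can be aggregated into a simple two-channel binary frame:

```python
import numpy as np

def events_to_binary_frame(events, height, width):
    """Aggregate a window of (x, y, timestamp, polarity) events into a
    2-channel binary frame: channel 0 marks pixels that saw any positive
    event in the window, channel 1 pixels that saw any negative event."""
    frame = np.zeros((2, height, width), dtype=np.uint8)
    for x, y, _t, p in events:
        channel = 0 if p > 0 else 1
        frame[channel, y, x] = 1
    return frame

# Toy window: two positive events at pixel (3, 1), one negative at (7, 2).
window = [(3, 1, 0.001, 1), (3, 1, 0.004, 1), (7, 2, 0.008, -1)]
frame = events_to_binary_frame(window, height=4, width=8)
print(frame[0, 1, 3], frame[1, 2, 7])  # 1 1
```

Per-pixel aggregation like this is what lets decades of frame-based DNN tooling consume an inherently asynchronous sensor stream.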
In this work, we are the first to characterize the cost of event sensor processing on edge devices. We describe the design of our profiler and the trade-offs observed between the volume of events evaluated, the pre-processing techniques used, the quality of the event representation generated, and the processing time incurred, to derive the Pareto-optimal setting for event processing on the NVIDIA Jetson AGX Orin [7], a representative edge device. We present the profiler's operation over the CityFlow AI [2] person detection dataset.

PROFILER OVERVIEW AND RESULTS
The profiler comprises 4 stages, described below.

1. Simulating Events: The profiler first synthesizes event streams from RGB videos using the v2e (video to events) software tool [3], aggregating events at a user-defined temporal resolution (default T = 10 ms, i.e., 100 FPS) to evaluate the expected volume of events if an event camera were to replace the RGB camera. The profiler's goal is to identify the {framed representation, pre-processing technique} combination for the given dataset that provides the best quality of framed representation of the events before the next window of events is aggregated, i.e., within T = 10 ms.

2. Applying Pre-processing Techniques: The profiler then evaluates sampling techniques to reduce the volume of events and the resulting processing time: (i) no sampling, (ii) spatiotemporal filtering (drop isolated events in a 1-pixel radius over a 1 ms window), (iii) temporal downsampling (drop every 2nd event per pixel), and (iv) random sampling (drop events with probability p = 0.5).

3. Creating Framed Representations: Next, the profiler loads the events in each time bin into memory to convert them into one of the following event representations: (i) Voxel Grid, (ii) Binary Histogram, (iii) Time Surface [5], (iv) Mixed Density Event Stack (MDES) [6], (v) 12-channel representation [9], and (vi) Time Ordered Recent Events (TORE) [1].

4. Calculating Quality of Generated Representation: Lastly, the profiler leverages the Gromov-Wasserstein Discrepancy (GWD) as a metric for efficiently comparing the quality of event representations by measuring the distortion arising from the conversion of raw events to the framed representations. The GWD metric calculates the similarity between event and feature pairs during the construction of an event representation, with a lower GWD score indicating less distortion (or, equivalently, better representation quality and preservation of events) and better DNN accuracy [9].
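Stages 2 and 3 can be sketched for one combination, random sampling followed by Voxel Grid construction (a hedged illustration: the function names are hypothetical, and this nearest-bin voxel grid omits the temporal interpolation that full implementations typically apply):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sample(events, p=0.5):
    """Pre-processing (iv): drop each event independently with probability p."""
    return events[rng.random(len(events)) >= p]

def events_to_voxel_grid(events, n_bins, height, width):
    """Representation (i): accumulate signed polarity into n_bins temporal
    slices of the window (nearest-bin variant)."""
    grid = np.zeros((n_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    t = events[:, 2]
    t0, t1 = t.min(), t.max()
    # Normalize timestamps into [0, n_bins) and clip to the last bin.
    bins = np.clip(((t - t0) / max(t1 - t0, 1e-9) * n_bins).astype(int),
                   0, n_bins - 1)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(grid, (bins, y, x), pol)  # signed accumulation per voxel
    return grid

# Synthetic window: ~40000 events in one T = 10 ms bin at 346x260 resolution.
n = 40000
events = np.column_stack([
    rng.integers(0, 346, n),        # x
    rng.integers(0, 260, n),        # y
    rng.uniform(0.0, 0.010, n),     # timestamp within the window (s)
    rng.choice([-1, 1], n),         # polarity
]).astype(np.float64)

kept = random_sample(events, p=0.5)   # roughly halves the event volume
grid = events_to_voxel_grid(kept, n_bins=5, height=260, width=346)
print(kept.shape[0], grid.shape)
```

Because the per-event work is a constant-time scatter, the cost of building such a representation grows linearly with the number of events kept after sampling, which is the relationship the profiler measures.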

Evaluation Results.
In our experiments we utilize the CityFlow AI dataset [2] (1920×1080-resolution videos captured at 30 FPS) for a person detection application. In lieu of comparing multiple datasets that generate different event volumes, we utilize the same dataset and vary the sensor resolution settings to control the average number of events accumulated in a single time bin, simulating both (i) sudden/unexpected bursts of high-volume events and (ii) different choices of event camera resolutions. We simulate events with resolution settings (120×90, 346×260 (the resolution of DVS346 [4], a widely deployed event camera), 480×270, 512×290, 640×480) to generate (10000, ..., 50000) events on average in a single time bin of T = 10 ms, to understand the processing latency on the Jetson AGX Orin over different event volumes. Table 1 describes a linear relationship between event volume and processing time, with Voxel Grids and Binary Histograms processing up to 20000 events within the processing deadline T of 10 ms. This indicates that, in general, if we can pre-process high volumes of event data down to ∼20000 events, we can achieve fast, non-blocking event processing on a Jetson AGX Orin.
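The deadline check behind Table 1 can be sketched as a simple timing harness (a hypothetical stand-in, not the profiler's actual measurement code; `count_frame` is an illustrative placeholder for a representation builder):

```python
import time
import numpy as np

def profile_processing(process_fn, events, deadline_s=0.010, repeats=10):
    """Time a representation-building function over one event window and
    report the median latency and whether it meets the T = 10 ms deadline."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        process_fn(events)
        times.append(time.perf_counter() - start)
    median = sorted(times)[len(times) // 2]
    return median, median <= deadline_s

# Hypothetical stand-in representation: a per-pixel event-count frame.
def count_frame(events):
    frame = np.zeros((260, 346), dtype=np.int32)
    np.add.at(frame, (events[:, 1], events[:, 0]), 1)
    return frame

rng = np.random.default_rng(1)
events = np.column_stack([rng.integers(0, 346, 20000),   # x
                          rng.integers(0, 260, 20000)])  # y
median_s, meets_deadline = profile_processing(count_frame, events)
print(f"{median_s * 1e3:.2f} ms, meets 10 ms deadline: {meets_deadline}")
```

Measuring a median over repeated windows smooths out scheduler jitter, which matters when the pass/fail threshold is as tight as 10 ms.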
Figure 1 describes the trade-off between distortion and processing time with pre-processing techniques applied to an event stream that generates ∼40000 events on average in a 10 ms window. Nuanced event representations (e.g., MDES and TORE) achieve a lower distortion rate (≤30%) but suffer 30×–50× higher processing times than the required 10 ms. Random sampling gives the most control over the volume of events filtered without suffering much distortion compared to the original event stream. Temporal filtering, on the other hand, suffers the most distortion with no relative gains in processing time. Finally, the profiler determines that for the CityFlow AI person detection dataset [2], Voxel Grids can achieve both a low distortion rate of 0.31% and a fast processing time of 7.89 ms when random sampling is applied to reduce the volume of events to ∼20000 events.

Figure 1: (a) Event stream generated when observing a marker-equipped rotating disk (b) Distortion rate (lower, better) vs processing time (lower, better) of ∼40000 events accumulated in T = 10 ms

Table 1: Processing time vs event volumes in a 10 ms window