Poster: Towards Efficient Spatio-Temporal Video Grounding in Pervasive Mobile Devices

As the use of pervasive devices expands into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, these devices are equipped with a myriad of sensors that facilitate natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field-of-view referred to by a language instruction, is a key capability for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame. This results in runtime complexity that increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining low latency poses additional challenges. Hence, this paper explores the latency and energy requirements of implementing STVG models on a pervasive device.


INTRODUCTION
Natural language interfaces are increasingly integral to interactive or immersive pervasive computing applications. Examples of such applications include (a) surveillance [3] ("Show me the footage of two men tampering with the ATM") and (b) memory/cognitive assistants [1] ("What was my dog doing when I was reading the newspaper?" or "Did I water the plants properly before I left home?"). In such applications, the user's verbal command is interpreted relative to the context of a visual scene captured by a camera. Performing Spatio-Temporal Video Grounding (STVG) on natural language commands is essential for these applications. STVG takes a video feed and a natural language query as input and identifies (a) the target object/region-of-interest and (b) the start and end of the temporal segment during which this object is visible.
STVG poses a significant challenge because it requires both temporal and spatial reasoning across multiple video frames. Implementing STVG on pervasive devices is particularly challenging due to the high processing overhead. TubeDETR [6] and STCAT [2], transformer-based single-stage models, are the current state-of-the-art (SOTA) in this vision-language task. However, these models rely on cross-modal attention mechanisms applied across video frames, making them computationally intensive. Typically, they load the entire video and execute transformer encoders on each frame to retain intermediate feature maps. Subsequently, the decoder utilizes these attention-based features to generate three outputs for each frame: (a) a bounding box, (b) the probability that the frame is the start of the target object's appearance, and (c) the probability that it is the end. Our evaluation of TubeDETR [6] and STCAT [2] reveals that processing latency increases steadily with video length (refer to Figure 2). Therefore, reducing the computational latency and energy consumption of STVG models is crucial, especially for practical scenarios involving long untrimmed videos (e.g., security footage or instructional videos) or integrating grounding into interactive applications (e.g., service bots responding to human assistance requests).
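To make the per-frame cost concrete, the following is a minimal PyTorch sketch of the kind of single-stage, frame-wise processing described above. The module and variable names (ToySTVG, frame_encoder, bbox_head, etc.) are illustrative placeholders of our own, not the actual TubeDETR or STCAT implementations.

```python
import torch
import torch.nn as nn

class ToySTVG(nn.Module):
    """Illustrative stand-in for a single-stage STVG model (not TubeDETR/STCAT)."""
    def __init__(self, dim=256):
        super().__init__()
        # Per-frame visual encoder and cross-modal decoder layers (placeholders).
        self.frame_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.bbox_head = nn.Linear(dim, 4)    # (a) bounding box per frame
        self.start_head = nn.Linear(dim, 1)   # (b) start-frame probability
        self.end_head = nn.Linear(dim, 1)     # (c) end-frame probability

    def forward(self, frames, text):
        # frames: (T, N_patches, dim) per-frame visual tokens; text: (L_tokens, dim) query tokens.
        outputs = []
        for f in frames:                                  # encoder runs once per frame,
            feat = self.frame_encoder(f.unsqueeze(0))     # so cost grows linearly with T
            query = feat.mean(dim=1, keepdim=True)
            fused = self.decoder(query, text.unsqueeze(0))
            outputs.append((self.bbox_head(fused),
                            torch.sigmoid(self.start_head(fused)),
                            torch.sigmoid(self.end_head(fused))))
        return outputs

model = ToySTVG()
frames = torch.randn(240, 49, 256)   # the real models exhaust device memory beyond ~240 frames
text = torch.randn(12, 256)
preds = model(frames, text)          # latency scales with the number of frames
```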
In this paper, we investigate the feasibility of executing STVG on a pervasive device, along with some early thoughts on reducing latency and processing overheads to enable real-time performance for the practical use-cases mentioned. For this, we deploy the SOTA STVG models on a Jetson Xavier AGX. We then execute either STCAT [2] or TubeDETR [6] on the video frames and verbal instructions captured between the user's "start" and "done" commands. The STVG model then returns the bounding boxes of the target object along with the temporal boundary.
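A minimal sketch of this on-device control loop is shown below, assuming hypothetical callables capture_frame, speech_to_text, and stvg_model that stand in for the camera feed, the on-device ASR engine, and the grounding model; it only illustrates how the video buffer and the verbal query are delimited by the "start" and "done" commands.

```python
from collections import deque

def run_grounding_session(capture_frame, speech_to_text, stvg_model):
    """Buffer frames between the user's "start" and "done" commands, then ground.

    capture_frame, speech_to_text, and stvg_model are hypothetical callables
    standing in for the Kinect DK feed, the on-device speech-to-text engine,
    and an STVG model such as STCAT or TubeDETR.
    """
    frames, query, recording = deque(), [], False
    while True:
        command = speech_to_text()          # e.g. "start", "done", or part of the query
        if command == "start":
            recording, frames, query = True, deque(), []
        elif command == "done" and recording:
            break
        elif recording and command:
            query.append(command)           # accumulate the verbal instruction
        if recording:
            # A real system would capture frames asynchronously; this is simplified.
            frames.append(capture_frame())
    # Returns per-frame bounding boxes plus the temporal boundary (start, end).
    return stvg_model(list(frames), " ".join(query))
```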

EARLY RESULTS
We evaluate the performance of our system on the VidSTG dataset [7], a benchmark dataset for STVG. Our evaluation employs m_vIoU as the accuracy metric, consistent with the methodology outlined in STCAT. We also utilize the Jetson Xavier AGX as our chosen pervasive device, measuring both latency and energy consumption.
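For reference, the sketch below computes vIoU as it is commonly defined in the STVG literature (the sum of per-frame box IoU over the intersection of the predicted and ground-truth temporal segments, normalized by the length of their union); m_vIoU then averages vIoU over all evaluation samples. This is our paraphrase of the metric, not code taken from STCAT.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def viou(pred_boxes, gt_boxes, pred_seg, gt_seg):
    """pred_boxes/gt_boxes: dicts frame_idx -> box; pred_seg/gt_seg: inclusive (start, end) frames."""
    inter_start = max(pred_seg[0], gt_seg[0])
    inter_end = min(pred_seg[1], gt_seg[1])
    inter_len = max(0, inter_end - inter_start + 1)
    union_len = (pred_seg[1] - pred_seg[0] + 1) + (gt_seg[1] - gt_seg[0] + 1) - inter_len
    overlap = sum(box_iou(pred_boxes[t], gt_boxes[t])
                  for t in range(inter_start, inter_end + 1)
                  if t in pred_boxes and t in gt_boxes)
    return overlap / union_len

# m_vIoU: the mean of vIoU over all (video, query) pairs in the evaluation set.
```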
First, we evaluate accuracy against FPS and energy per frame in Table 1. Our findings reveal that the SOTA STVG models, STCAT and TubeDETR, achieve only 2.2 and 2.5 FPS, respectively, while maintaining comparable accuracy. Additionally, they exhibit high processing energy consumption, with STCAT and TubeDETR consuming 4.59 J and 4.21 J per frame, respectively, establishing the high runtime complexity of STVG models when executed on pervasive devices. Furthermore, we analyze the effect of the number of frames on latency, as depicted in Figure 2. Our results indicate a linear increase in latency with the number of frames. Because transformer-based models must process the entire video as input, both STVG models run out of memory beyond ∼240 frames.
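For context, the sketch below shows the kind of measurement loop such numbers can come from: wall-clock latency around a single inference call, and energy approximated as mean sampled power times latency. The run_stvg wrapper and the power-monitor sysfs path are assumptions for illustration only; the actual INA3221 path differs across Jetson boards and JetPack versions.

```python
import threading
import time

# Assumed power-monitor path on a Jetson; verify against your board/JetPack version.
POWER_PATH = "/sys/bus/i2c/drivers/ina3221x/7-0040/iio:device0/in_power0_input"

def read_power_mw(path=POWER_PATH):
    with open(path) as f:
        return float(f.read())              # board power in milliwatts

def profile(run_stvg, frames, query, sample_period=0.05):
    """Measure end-to-end latency and approximate energy for one STVG call."""
    samples, stop = [], threading.Event()
    def poll():                             # poll power while the model runs
        while not stop.is_set():
            samples.append(read_power_mw())
            time.sleep(sample_period)
    t = threading.Thread(target=poll)
    t.start()
    t0 = time.perf_counter()
    result = run_stvg(frames, query)        # hypothetical wrapper around STCAT/TubeDETR inference
    latency = time.perf_counter() - t0
    stop.set()
    t.join()
    avg_power_w = (sum(samples) / max(len(samples), 1)) / 1000.0
    energy_j = avg_power_w * latency        # energy ≈ mean power × latency
    return result, latency, energy_j, energy_j / len(frames)   # per-frame energy
```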
These results show that current STVG pipelines are unsuitable for real-time systems and unable to handle long, untrimmed videos.

DISCUSSION
We suggest the following directions as future work to tackle the current limitations.
• Static model pruning or quantization: Static model pruning entails eliminating redundant neurons during training, while quantization reduces computational costs by using lower-precision data types to represent weight parameters. These techniques have demonstrated efficacy in decreasing latency and processing overhead across various deep-learning tasks. Our objective is to employ similar strategies in STVG to enhance efficiency.
• Dynamic model pruning: Dynamic model pruning [4, 5] has proven highly effective, especially in multi-modal grounding tasks. This effectiveness stems from the presence of multiple modalities, wherein one modality can gauge task complexity to selectively skip or choose compute paths in another modality. We anticipate that employing analogous approaches in STVG could lead to faster execution and reduced processing overheads.
• STVG with long, untrimmed videos: Even if runtime complexity is successfully reduced through static and dynamic optimizations, existing pipelines remain unsuitable for processing long, untrimmed videos, primarily due to the inherent requirement of transformer-based models to ingest the entire video stream at once. To address this limitation, we propose incorporating windowed attention, which processes only a small buffer of video frames at a time rather than the entire video. To preserve longer temporal dependencies, we envision retaining some form of memory feature representation from previous video buffers (see the sketch after this list).
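As a first illustration of the windowed-attention direction, here is a minimal, hypothetical PyTorch sketch: frames are processed in fixed-size buffers, and a small set of learned memory tokens is carried from one buffer to the next to retain longer temporal context. It sketches the concept only and is not tied to the STCAT or TubeDETR code.

```python
import torch
import torch.nn as nn

class WindowedSTVGEncoder(nn.Module):
    """Process a long video in small frame buffers, carrying memory tokens across windows."""
    def __init__(self, dim=256, window=32, num_memory=8):
        super().__init__()
        self.window = window
        self.memory = nn.Parameter(torch.zeros(1, num_memory, dim))   # initial memory tokens
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.mem_update = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (T, dim) pooled per-frame features of an untrimmed video.
        mem = self.memory
        outputs = []
        for start in range(0, frame_feats.size(0), self.window):
            chunk = frame_feats[start:start + self.window].unsqueeze(0)   # (1, W, dim)
            # Attend over the current buffer together with the carried memory,
            # so per-step attention cost is bounded by the window size.
            fused = self.encoder(torch.cat([mem, chunk], dim=1))
            outputs.append(fused[:, mem.size(1):])                        # per-frame features
            # Update the memory tokens from the fused window before moving on.
            mem = self.mem_update(mem, fused)
        return torch.cat(outputs, dim=1)   # (1, T, dim)

enc = WindowedSTVGEncoder()
feats = enc(torch.randn(1000, 256))   # a long video processed window-by-window
```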

CONCLUSION
This paper explores the execution of STVG models on pervasive devices to facilitate real-time natural language interfaces. Although the SOTA models achieve a frame rate of 2.2-2.5 FPS on a Jetson AGX, this falls short of real-time performance. Consequently, the paper introduces preliminary ideas for enhancing runtime speed. To achieve this goal, the paper envisions leveraging static and dynamic model optimizations. Furthermore, it proposes the utilization of windowed attention with memory features to facilitate continuous Human-AI collaborative tasks involving long, untrimmed videos.

Figure 1
Figure 1 depicts our system deployed on a Jetson device. Specifically, we use a Kinect DK camera connected to the Jetson Xavier AGX to capture both video and audio instructions. A real-time speech-to-text engine, running on the AGX device, converts speech into textual commands. The system is initiated upon the user's "start" command and is subsequently terminated upon receiving the user's "done" command.

Figure 2
Latency vs. number of video frames for STVG.

Table 1:
Accuracy vs. FPS vs. energy trade-off for STVG.