Survey | Open Access

A Survey on Temporal Sentence Grounding in Videos

Published: 06 February 2023


Abstract

Temporal sentence grounding in videos (TSGV), which aims at localizing one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural language, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between the two modalities (i.e., text and video). In this survey, we give a comprehensive overview of TSGV, which (i) summarizes the taxonomy of existing methods, (ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) used in TSGV, and (iii) discusses in depth the potential problems of current benchmarking designs and research directions for further investigation. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, single-stage methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics used to assess current research progress. Finally, we discuss some limitations of TSGV by pointing out potential problems left unresolved in the current evaluation protocols, which may push forward more cutting-edge research in TSGV. Besides, we share our insights on several promising directions, including four typical tasks with new and practical settings based on TSGV.


1 INTRODUCTION

With the rapid development of multimedia technologies on mobile phones and other terminal devices, people have gained easier access to videos from all around the world. Compared with other media for information transmission and exchange, such as text and images, videos contain more dynamic activities and richer semantics, conveying complex yet understandable information. Basically, a video is composed of a continuous sequence of frame images, possibly accompanied by audio and subtitles. Moreover, videos from online websites in the wild are also surrounded by multiple forms of natural language text (e.g., comments written by video viewers, video descriptions uploaded by creators, and recommendation reasons edited by website editors). Thus, videos have natural advantages for multimedia intelligence exploration and research. However, raw videos (e.g., user-generated video data [8] or surveillance videos [93]) are highly redundant, and their content is sparse with respect to user-specific retrieval demands. Furthermore, it is also challenging to maintain and manage these raw videos since they occupy a huge amount of storage resources [22]. Therefore, the ability to quickly retrieve a specific video segment (i.e., moment) from a long untrimmed video can allow users to conveniently locate highlighted moments of their interest and help information providers optimize storage fundamentally, and is thus of great importance and interest in the research community.

Given the urgent need in both academia and industry, a vast number of studies attempt to automatically capture the key information within a video, e.g., video summarization [65, 123, 131] and video highlight detection [43, 114]. More fundamentally, some works [4, 44, 51, 64, 82, 84, 99, 117] treat the task of detecting a video segment that performs a specific action as a video classification problem, terming this type of task action detection or temporal action localization (TAL) [5]. Though TAL is able to extract effective information from untrimmed videos, it is restricted by predefined action categories. Even as the categorization becomes more and more fine-grained, it is still not adequate to cover all kinds of interactive activities. Thus, it is natural to use natural language to describe these varied and complex activities. Temporal Sentence Grounding in Videos (TSGV) is such a task: matching a descriptive sentence with one segment (or moment) in an untrimmed video that carries the same semantics. As shown in Figure 1, given the query “A little girl walks by a little boy and continues to blow the leaves” as input, the goal of TSGV is to predict the start and end points (i.e., 7.11 s to 12.7 s) of the target segment within the whole video, and the predicted segment should contain the activities indicated by the input query. Like other vision-and-language tasks (e.g., visual question answering [1, 2, 120], image/video captioning [18, 72, 73, 109, 115, 116], visual grounding [45, 98, 118] and vision-and-language pre-training [19, 24, 61, 90]), TSGV requires understanding of both visual and textual inputs. Moreover, it could also serve as an intermediate task for various downstream vision-and-language tasks such as video question answering [28, 48, 50, 107] and video summarization [23, 66, 81, 123, 140].
For example, related segments can first be grounded through the textual question and then analyzed to discover the final answer to the input question. Also, by providing concise sentence summaries of videos, semantically coherent video segments can be grounded, retrieved, and composed as visual summaries of the original videos. Hence, it is worthwhile to explore TSGV in depth, as it connects the computer vision and natural language processing communities and further promotes a variety of downstream applications. However, TSGV is much more challenging for the following reasons:


Fig. 1. An example of TSGV, i.e., to determine the start and end timestamps of the target video segment corresponding to the given sentence query.

Both videos and sentence queries are in the form of temporal sequences with rich semantic meanings. Therefore, matching the relationships between videos and sentences is quite complicated and needs to be modeled in a fine-grained manner for accurate temporal grounding.

The target segments corresponding to the provided sentence queries are quite flexible in terms of spatial and temporal scales in videos. It would be computationally expensive to fetch candidate video segments of different lengths at different locations via sliding windows and then individually match them with the sentence query. Therefore, efficiently obtaining video segments of different temporal granularities that comprehensively cover the target segment also poses challenges for TSGV.

Activities in a video often do not appear independently; instead, they have internal semantic correlations and temporal dependencies on each other. Therefore, modeling the video context information, together with the inner logical relations among different video contents under the semantic guidance of the sentence, becomes an important and challenging step to ensure the accuracy of temporal grounding approaches.

Despite the above challenges, many promising research works have brought continuous improvement to TSGV in the past few years, ranging from early two-stage matching-based methods [29, 32, 38, 57, 103], to single-stage methods [14, 122, 124, 128] and RL-based methods [35, 36, 105], to the recent weakly supervised setting that has drawn attention [26, 67]. Therefore, a systematic review of TSGV which summarizes the current works, analyzes their strengths and weaknesses, and promotes future research directions becomes a necessity for the community. Both Yang et al. [113] and Liu et al. [59] provide a review of existing TSGV methods with a discussion of future directions. Compared to these previous surveys, ours covers more newly published state-of-the-art (SOTA) models and provides a clearer taxonomy of existing methods. The in-depth analysis of the limitations of current evaluation protocols is an additional advantage. In this survey, we summarize the taxonomy of existing methods, present the evaluation protocols, critically reveal the potential problems of the current benchmarking designs, and further identify promising research directions to promote the development of this field.

The remainder of this article is organized as follows: Section 2 gives a detailed taxonomy and analysis of the existing approaches. Section 3 reviews benchmark datasets and evaluation metrics, summarizing the current research progress via comprehensive performance comparisons. Section 4 contains a discussion of the hidden risks behind the current evaluation setting and points out promising research directions, followed by Section 5, which concludes the whole article.


2 METHODS OVERVIEW

We establish the taxonomy of existing approaches based on their characteristics (c.f., Figure 2). Early works adopt a two-stage architecture (c.f., Figure 3(a)), i.e., they first scan the whole video and pre-cut various candidate segments (i.e., proposals or moments) via sliding window strategy or proposal generation network, and then rank the candidates according to the ranking scores produced by the cross-modal matching module. However, such a scan-and-localize pipeline is time-consuming due to too much redundant computation of overlapping candidate segments, and the individual pairwise segment-query matching may also neglect the contextual video information.
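The stage-one pre-cutting via sliding windows can be sketched in a few lines; the window sizes and stride ratio below are illustrative assumptions rather than values from any particular paper:

```python
import numpy as np

def sliding_window_proposals(video_len, window_sizes, stride_ratio=0.5):
    """Enumerate multi-scale candidate segments (start, end) over a video.

    video_len: video duration in clips/frames; window_sizes: scales to try;
    stride_ratio: fraction of the window size used as the sliding stride,
    so consecutive windows of the same scale overlap heavily.
    """
    proposals = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, video_len - w + 1, stride):
            proposals.append((start, start + w))
    return proposals

# A 32-clip video with two window scales yields 10 overlapping candidates.
props = sliding_window_proposals(video_len=32, window_sizes=[8, 16])
```

The heavy overlap between neighboring candidates is exactly the source of the redundant computation the two-stage pipeline is criticized for.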


Fig. 2. The taxonomy of existing approaches, which are grouped into early two-stage methods, typical single-stage methods, RL-based methods, and weakly supervised methods. According to the way proposals are generated, the two-stage methods can be further divided into sliding window-based and proposal-generated ones. Meanwhile, the single-stage methods can be divided into anchor-based and anchor-free ones, depending on whether anchors (candidate moments) are produced for ranking. The weakly supervised methods can be further grouped into MIL-based and reconstruction-based.


Fig. 3. Two-stage Methods vs. Single-stage Methods. (a): At stage 1, the entire video is pre-segmented into multi-scale candidate moments. At stage 2, the matching module takes query-moment pairs as inputs and outputs matching scores for ranking. (b): Single-stage methods can be either anchor-based or anchor-free. Anchor-based methods use different types of anchors (e.g., Conv-styled, RNN-styled) as candidate moments, while anchor-free methods use a prediction head to directly obtain moment boundaries (e.g., generate probability/attention weights for each position, or predict the offsets from a certain frame to the start/end groundtruth boundaries).

Considering the above concerns, some researchers have started to use single-stage methods that solve TSGV without pre-cutting candidate moments (c.f., Figure 3(b)). Instead, multi-scale candidate moments ending at each time step are maintained sequentially by LSTMs or hierarchically by convolutional neural networks; such single-stage methods are named anchor-based methods. Some other single-stage methods predict the probability of each video unit (i.e., frame or clip) being the start or end point of the target segment, or straightforwardly regress the target start and end coordinates based on the multimodal features of the provided video and sentence query. These methods do not depend on any candidate proposal generation process and are named anchor-free methods.

Besides, it is worth noting that some works resort to deep reinforcement learning techniques to address TSGV, treating the sentence localization problem as a sequential decision process; these are also anchor-free. To reduce the intensive labor of annotating the boundaries of groundtruth moments, weakly supervised methods with only video-level annotated descriptions have also emerged, which can be either MIL-based or reconstruction-based. In the following, we present all the approaches and perform a deep analysis of the characteristics of each type.

2.1 Two-stage Method

For a two-stage method, the pre-segmenting of proposal candidates is conducted separately from the model computation. It takes the pre-segmented candidates and the sentence query as inputs to a cross-modal matching module for target segment localization. The two-stage methods can be grouped into two categories based on how proposals are generated.

2.1.1 Sliding Window-based.

Early methods adopt a multi-scale sliding window sampling strategy for the generation of candidate proposals. Two pioneering works, Moment Context Network (MCN) [38] and Cross-modal Temporal Regression Localizer (CTRL) [29], define the TSGV task and construct benchmark datasets. First, Hendricks et al. [38] propose MCN, which samples all the candidate moments (i.e., segments) via a sliding window mechanism, and then projects the video moment representation and query representation into a common embedding space. The \(\ell _2\) distance between the sentence query and the corresponding target video moment in this space is minimized to supervise the model training (c.f., Figure 4(b)). Specifically, MCN encourages the sentence query to be closer to the target moment than negative moments in a shared embedding space. Since the negative moments either come from other segments within the same video (intra-video) or from different videos (inter-video), MCN devises two similar but different ranking loss functions: (1) \(\begin{equation} \begin{aligned}\mathcal {L}_i^{intra}(\theta) &= \sum _{n \in \Gamma \setminus \tau ^i}\mathcal {L}^R(D_\theta (s^i,\upsilon ^i,\tau ^i), D_\theta (s^i,\upsilon ^i, n)) \,, \\ \mathcal {L}_i^{inter}(\theta) &= \sum _{j\ne i}\mathcal {L}^R(D_\theta (s^i,\upsilon ^i,\tau ^i), D_\theta (s^i,\upsilon ^j,\tau ^i)) \,, \end{aligned} \end{equation}\) where \(\mathcal {L}^R(x, y) = \text{max}(0, x-y+b)\) and \(b\) is a margin. For training sample \(i\), the intra-video ranking loss encourages sentence \(i\) to be closer to the target moment at location \(\tau ^i\) than the negative moments from other possible locations within the same video, while the inter-video ranking loss encourages sentence \(i\) to be closer to the target one at location \(\tau ^i\) than the negative ones from other videos at the same location \(\tau ^i\).
The intra-video ranking loss is able to differentiate subtle differences within a video, while the inter-video ranking loss can differentiate broad semantic concepts.
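The hinge-based ranking loss \(\mathcal {L}^R\) behind both terms can be sketched in a few lines; the distances below are toy values for illustration:

```python
def hinge_rank_loss(pos_dist, neg_dists, margin=0.1):
    """L^R(x, y) = max(0, x - y + b), summed over negatives: the query's
    distance to the true moment should beat each negative by margin b."""
    return sum(max(0.0, pos_dist - d + margin) for d in neg_dists)

# Toy distances in the shared embedding space (illustration only).
pos = 0.2                  # query <-> groundtruth moment
intra_negs = [0.5, 0.25]   # other moments within the same video
inter_negs = [0.9, 0.8]    # same location tau in other videos

loss_intra = hinge_rank_loss(pos, intra_negs)  # one negative violates the margin
loss_inter = hinge_rank_loss(pos, inter_negs)  # both negatives already satisfied
```

Here the intra-video negative at distance 0.25 sits inside the margin and contributes a small penalty, while the inter-video negatives are far enough away to contribute nothing.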


Fig. 4. The CTRL and MCN frameworks, two pioneering works that first presented the TSGV task. CTRL uses joint language-segment representations to obtain the final alignment scores and refines the temporal boundaries with a location regressor, while MCN minimizes the \(\ell _2\) distance between the language and video representations in a common space, figures from [29, 38].

At the same time, Gao et al. [29] propose CTRL, which is the first to adapt the R-CNN [34] methodology from object detection to the TSGV domain. Particularly, CTRL also leverages a sliding window to obtain candidate segments of various lengths, and as shown in Figure 4(a), it exploits a multi-modal processing module to fuse the candidate segment representation with the sentence representation by three operators (i.e., add, multiply, and fully-connected layer). Then, CTRL feeds the fused representation into another fully-connected layer to predict the alignment score and location offsets between the candidate segment and the target segment. CTRL designs a multi-task loss function to train the model, including a visual-semantic alignment loss and a location regression loss: (2) \(\begin{align} \mathcal {L}_{aln} &= \frac{1}{N}\sum _{i=0}^N[\alpha _c\log (1+\exp (-cs_{i,i})) + \sum _{j=0,j\ne i}^N\alpha _w \log (1+\exp (cs_{i,j}))], \end{align}\) (3) \(\begin{align} \mathcal {L}_{reg} &= \frac{1}{N}\sum _{i=0}^N[R(t_{x,i}^*-t_{x,i}) +R(t_{y,i}^*-t_{y,i})], \end{align}\) where \(\mathcal {L}_{aln}\) is the visual-semantic alignment loss considering both aligned (video segment, query) pairs and misaligned pairs. \(cs_{i,j}\) measures the alignment score between video segment \(c_i\) and sentence \(s_j\). The location regression loss \(\mathcal {L}_{reg}\) is only computed for aligned pairs to predict the correct coordinates. \(R\) is a smooth-L1 function.
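A minimal NumPy sketch of the two CTRL losses, assuming an \(N \times N\) score matrix whose diagonal holds the aligned pairs (the weights and toy coordinates are illustrative):

```python
import numpy as np

def ctrl_alignment_loss(cs, alpha_c=1.0, alpha_w=1.0):
    """Eq. (2): cs[i, j] scores segment i against sentence j; the diagonal
    entries are aligned pairs, pulled up, and off-diagonals pushed down."""
    n = cs.shape[0]
    loss = 0.0
    for i in range(n):
        loss += alpha_c * np.log1p(np.exp(-cs[i, i]))
        for j in range(n):
            if j != i:
                loss += alpha_w * np.log1p(np.exp(cs[i, j]))
    return loss / n

def smooth_l1(x):
    """R(x): quadratic near zero, linear beyond |x| = 1."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5

def ctrl_regression_loss(pred, gt):
    """Eq. (3): smooth-L1 over start/end offsets, aligned pairs only."""
    return np.mean([smooth_l1(ps - gs) + smooth_l1(pe - ge)
                    for (ps, pe), (gs, ge) in zip(pred, gt)])
```

Increasing the diagonal scores while suppressing the off-diagonal ones drives the alignment loss toward zero, which is what the multi-task objective trades off against the boundary regression.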

Compared to the above CTRL that treats the query as a whole, Liu et al. [58] further make some improvements by decomposing the query and adaptively getting the important textual components according to the temporal video context. Meanwhile, TMN [53] dynamically generates a modular neural network layout based on the semantic structure of the query to reason over the video.

Since CTRL overlooks the spatial-temporal information inside the moment and the query, Liu et al. [57] further propose an attentive cross-modal retrieval network (ACRN). With a memory attention network guided by the sentence query, ACRN adaptively assigns weights to the contextual moment representations for memorization to augment the moment representation. SLTA [42] also devises a spatial and language-temporal attention model to adaptively identify relevant objects and interactions based on the query information. Considering that the inherent spatial-temporal structure of videos cannot be fully captured by one-dimensional vectors as in CTRL, Song et al. [86] propose to employ voxel- and channel-wise attention over the visual 3D feature maps to improve visual features and cross-modal correlation.

Wu and Han [103] propose a multi-modal circulant fusion (MCF) in contrast to the simple fusion ways employed in CTRL including element-wise product, element-wise sum, and concatenation. MCF extends the visual/textual vector to the circulant matrix, which can fully exploit the interactions of the visual and textual representations. By plugging MCF into CTRL, the grounding accuracy is further improved.

Previous works like CTRL, ACRN, and MCF directly calculate the visual-semantic correlation without explicitly modeling the activity information within the two modalities, and the candidate segments uniformly sampled by the sliding window may contain meaningless noisy content without any activity. Hence, Ge et al. [32] explicitly mine activity concepts from both the visual and textual parts as prior knowledge to provide an actionness score for each candidate segment, reflecting how likely it is to contain activities, which enhances localization accuracy. MMRG [126] employs a multi-modal relational graph explicitly considering the interactions among visual and textual objects. It also designs customized pre-training tasks to enhance the visual representations.

Despite the simplicity and effectiveness of such sliding window-based methods, they suffer from inefficient computation: many overlapping areas are re-computed due to the dense sampling process with predefined multi-scale sliding windows.

2.1.2 Proposal-generated.

Considering the inevitable drawbacks of sliding window-based methods, some approaches are devoted to reducing the number of proposal candidates; these are named proposal-generated methods. Such methods still adopt a two-stage scheme but avoid dense sliding-window sampling through different kinds of proposal networks.

Query-guided Segment Proposal Network (QSPN) [108] relieves such a computation burden by proposing temporal segments conditioned on the query so as to reduce the number of candidate segments (c.f., Figure 5). As shown in Figure 5(a), the query-guided SPN first incorporates the query embeddings into the video features to get the attention weight for each temporal location, and further integrates the temporal attention weights into the convolutional process for video encoding to propose query-aware representations of candidate segments. Afterward, the generated proposal visual feature from Figure 5(a) is incorporated into the sentence embedding process at each time step of the second layer of the two-layer LSTM in an early fusion way (c.f., Figure 5(b)). QSPN devises a triplet-based retrieval loss which is similar to MCN: (4) \(\begin{equation} \mathcal {L}_{RET} = \sum _{(S,R,R^\prime)}\text{max}\lbrace 0,\, \eta + \sigma (S,R^\prime) - \sigma (S,R)\rbrace \,, \end{equation}\) where \((S,R)\) is the positive (sentence, segment) pair while \(R^\prime\) is the sampled negative segment. QSPN also devises an auxiliary captioning task, which re-generates the query sentence from the retrieved video segment. The loss for captioning is as follows: (5) \(\begin{equation} \mathcal {L}_{CAP} = -\frac{1}{KT}\sum _{k=1}^K\sum _{t=1}^{T_k}\log P(w_t^k|f(R),h_{t-1}^{(2)},w_1^k,\ldots ,w_{t-1}^k)\,, \end{equation}\) where a standard captioning loss maximizes the normalized log-likelihood of the words generated at all \(T_k\) unrolled time steps, over all \(K\) groundtruth matching query-moment pairs.
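Equations (4) and (5) reduce to a triplet margin over similarities and an average negative log-likelihood; a minimal sketch with toy values (the margin and probabilities are illustrative):

```python
import numpy as np

def qspn_retrieval_loss(sim_pos, sim_negs, eta=0.2):
    """Eq. (4): similarity with the true segment R should exceed each
    sampled negative R' by at least the margin eta."""
    return sum(max(0.0, eta + sn - sim_pos) for sn in sim_negs)

def caption_nll(word_probs):
    """Eq. (5) in miniature: mean negative log-likelihood of the
    groundtruth words produced at each unrolled decoding step."""
    return -np.mean(np.log(word_probs))
```

Note the sign flip relative to MCN: QSPN ranks by similarity \(\sigma\) (higher is better), whereas MCN ranks by distance \(D_\theta\) (lower is better), so the positive and negative terms swap places inside the hinge.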


Fig. 5. The structure of QSPN, including the query-guided segment proposal network (c.f., Figure 5(a)) and a fine-grained early-fused similarity model for retrieval (c.f., Figure 5(b)), figures from [108].

Similarly, SAP proposed by Chen and Jiang [15] integrates the semantic information of sentence queries into the generation process of activity proposals. Specifically, the visual concepts extracted from the query sentence and video frames are used to compute the visual-semantic correlation score for every frame. Activity proposals are generated by grouping frames with high visual-semantic correlation scores.

Summary. Despite the intuitiveness and success of this two-stage matching-based paradigm, it also has some drawbacks. In order to achieve high localization accuracy (i.e., the candidate pool should have at least one proposal that is close to the groundtruth moment), the duration and location distribution of the candidate moments should be diverse, thus inevitably increasing the number of candidates, which leads to inefficient computation of the subsequent matching process.

2.2 Single-stage Method

Single-stage models follow a single-pass pattern. We divide them into two types, i.e., anchor-based and anchor-free, based on whether the method uses anchors (i.e., proposals) to make predictions.

2.2.1 Anchor-based.

Anchor-based methods employ different types of anchors (e.g., Conv-styled, RNN-styled) to yield candidate moments. TGN [11] adopts a typical single-stage deep architecture, which can localize the target moment in one single pass without handling heavily overlapped pre-segmented candidate moments. As shown in Figure 6, TGN dynamically matches the sentence and video units via a sequential LSTM grounder with fine-grained frame-by-word interaction, and at each time step, the grounder would simultaneously score a group of candidate segments with different temporal scales ending at this time step.


Fig. 6. The architecture of TGN, adopting a frame-by-word interaction single-stream framework. The grounder would generate multi-scale grounding candidates (anchors) that end at the same time step, figure from [11].

CMIN [136] sequentially scores a set of candidate moments of multi-scale anchors like TGN but with a sequential BiGRU network, and refines the candidate moments with boundary regression. To further enhance the cross-modal matching, it devises a novel cross-modal interaction network, which first leverages a syntactic GCN to model the syntactic structure of queries, and captures long-range temporal dependencies of video context with a multi-head self-attention module.

Likewise, CBP [97] builds a single-stream model with sequential LSTM, which jointly predicts temporal anchors and boundaries at each time step to yield precise localization. To better detect semantic boundaries, CBP devises a self-attention-based module to collect contextual clues instead of simply concatenating the contextual features like [29, 32, 38].

CSMGAN [56] also adopts such a single-pass scheme. It builds a joint graph for modeling the cross-/self-modal relations via iterative message passing, to capture the high-order interactions between two modalities effectively. Each node of the graph aggregates the messages from its neighbor nodes in an edge-weighted manner and updates its state with both aggregated message and current state through ConvGRU. Qu et al. [74] present a fine-grained iterative attention network (FIAN), which devises a content-oriented strategy to generate candidate moments differing from the anchor-based methods with sequential RNNs mentioned above. FIAN employs a refined cross-modal guided attention block to capture the detailed cross-modal interactions and further adopts a symmetrical iterative attention to generate both sentence-aware video and video-aware sentence representations.

TGN establishes the temporal grounding architecture through a sequential LSTM network, while Yuan et al. [122] propose SCDM, which exploits a hierarchical temporal convolutional network to conduct target segment localization, and couples it with a semantics-conditioned dynamic modulation to fully leverage sentence semantics to compose the sentence-related video contents over time. As shown in Figure 7, the multimodal fusion module fuses the entire sentence and each video clip in a fine-grained manner. The fused representation is formulated as: (6) \(\begin{equation} \mathbf {f}_t=\text{ReLU}(\mathbf {W}^f(\mathbf {v}_t||\bar{\mathbf {s}})+\mathbf {b}^f). \end{equation}\)


Fig. 7. The architecture of SCDM, which couples semantics-conditioned dynamic modulation into the temporal convolutional network to correlate the sentence-aware video contents over time, figure from [122].

With such fused representations as inputs, the semantic modulated temporal convolution module further correlates sentence-related video contents in a temporal convolution procedure, dynamically modulating the temporal feature maps conditioned on the sentence. Specifically, for each temporal convolutional layer, the feature map is denoted as \({\bf A} = \lbrace {\bf a}_i\rbrace\). The feature unit \({\bf a}_i\) will be modulated based on the modulation vectors \(\gamma _i^c\) and \(\beta _i^c\): (7) \(\begin{equation} \hat{{\bf a}}_i = \gamma _i^c \cdot \frac{{\bf a}_i - \mu ({\bf A})}{\sigma ({\bf A})} + \beta _i^c, \end{equation}\) where the modulation vectors are computed based on the sentence representation \({\bf S}=\lbrace {\bf s}_n\rbrace _{n=1}^N\): (8) \(\begin{equation} \begin{split} \rho _i^n &= \text{softmax}({\bf w}^T\text{tanh}({\bf W}^s {\bf s}_n + {\bf W}^a{\bf a}_i + {\bf b})), {\bf c}_i = \sum _{n=1}^N \rho _i^n{\bf s}_n, \\ \gamma _i^c &= \text{tanh}({\bf W}^{\gamma }{\bf c}_i + {\bf b}^{\gamma }),\; \beta _i^c = \text{tanh}({\bf W}^{\beta }{\bf c}_i + {\bf b}^{\beta }).\\ \end{split} \end{equation}\)
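Equations (7) and (8) can be illustrated with a minimal NumPy sketch; the parameter matrices below are randomly initialized stand-ins for learned weights, and the dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                            # toy feature dimension
A = rng.normal(size=(10, d))     # temporal feature map: 10 feature units a_i
S = rng.normal(size=(5, d))      # sentence representation: 5 word features s_n

# Hypothetical learned parameters, randomly initialized for illustration.
W_s, W_a = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w, b = rng.normal(size=d), rng.normal(size=d)
W_gamma, W_beta = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_gamma, b_beta = rng.normal(size=d), rng.normal(size=d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

mu, sigma = A.mean(), A.std()
A_mod = np.empty_like(A)
for i, a in enumerate(A):
    # Eq. (8): attend over sentence words conditioned on feature unit a_i ...
    scores = np.array([w @ np.tanh(W_s @ s + W_a @ a + b) for s in S])
    c = softmax(scores) @ S                  # attended sentence context c_i
    gamma = np.tanh(W_gamma @ c + b_gamma)   # modulation vectors
    beta = np.tanh(W_beta @ c + b_beta)
    # Eq. (7): ... then normalize the feature map and scale/shift it.
    A_mod[i] = gamma * (a - mu) / sigma + beta
```

The modulation thus acts like a sentence-conditioned normalization layer applied at every level of the temporal convolution hierarchy.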

Finally, the position prediction module outputs the location offsets and overlap scores of candidate video segments based on the modulated features. Similar to SCDM, RMN [54] also correlates video contents conditioned on the query semantics via a modulation module, and it further employs a cascade of several rectification-modulation layers for multi-step reasoning.

MAN [128] leverages a temporal convolutional network to address the TSGV task as well, where the sentence query is integrated as dynamic filters into the convolutional process. Specifically, MAN encodes the entire video stream using a hierarchical convolutional network to produce multi-scale candidate moment representations. The textual features are encoded as dynamic filters and convolved with such visual representations. Additionally, MAN exploits the graph-structured moment relation modeling adapted from Graph Convolution Network (GCN) [46] for temporal reasoning to further improve the moment representations. Similar to MAN, Soldan et al. [85] also adopt GCN and present a video-language graph matching network (VLG-Net) for modeling the fine-grained inter-modal interaction.

Both SCDM and MAN only consider 1D temporal feature maps, while the 2D temporal adjacent network (2D-TAN) [134] models the temporal relations of video segments via a two-dimensional map. As shown in Figure 8, it first divides the video into evenly spaced video clips with duration \(\tau\). The (\(i,\) \(j\))-th location on the 2D temporal map represents a candidate moment (or anchor) from time \(i \tau\) to (\(j\) + 1)\(\tau\). This kind of 2D temporal map covers diverse video moments with different lengths while representing their adjacent relations. The proposed temporal adjacent network fuses the sentence representation with each of the candidate moment features, leverages a convolutional neural network to embed the video context information, and finally predicts the confidence score of each candidate being the final target segment.
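The map's indexing scheme can be sketched as follows; `num_clips` and `tau` are illustrative values, and only the upper triangle (\(i \le j\)) holds valid moments:

```python
import numpy as np

def build_2d_map(num_clips):
    """Valid-entry mask of the 2D temporal map: location (i, j) with
    i <= j is the candidate moment spanning clip i through clip j."""
    return np.triu(np.ones((num_clips, num_clips), dtype=bool))

def moment_span(i, j, tau):
    """Time span of map entry (i, j): from i*tau to (j + 1)*tau."""
    return i * tau, (j + 1) * tau

# A 4-clip video yields 10 valid candidate moments on the map.
valid = build_2d_map(4)
```

In the actual model each valid cell holds a moment feature (fused with the query) rather than a boolean, and the convolution over the map lets neighboring candidates exchange context.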


Fig. 8. The architecture of a 2D-TAN, which consists of a text encoder for language representation, a 2D temporal feature map extractor for video representation, and a temporal adjacent network for moment localization, figure from [134].

Some works [9, 30, 96, 133] also adopt the same proposal generation approach as that of 2D-TAN. Wang et al. [96] propose a structured multi-level interaction network (SMIN), which makes further modifications on the 2D temporal feature map as its proposal generation module. SMRN [9] adds a residual connection within the hierarchical convolution network of 2D-TAN and further utilizes the query semantics to modulate short connections within residual blocks. Zhang et al. [133] present a visual-language transformer backbone followed by a multi-stage aggregation module to get discriminative moment representations for more accurate localization. Gao et al. [30] design a fine-grained semantic distillation framework for retrieving desired moments with superiority in both accuracy and efficiency.

It is worth noting that Bao et al. [3] present an anchor-based dense events propagation network (DepNet) for a more challenging task namely dense events grounding, which aims at localizing multiple moments given a paragraph. DepNet aggregates the visual-semantic information of dense events into a compact set and then propagates it to localize every single event, thus fully exploiting the temporal relationships between dense events.

Despite the superior performance anchor-based methods have achieved, their performance is sensitive to manually designed heuristic rules (i.e., the number and scales of anchors). As a result, such anchor-based methods cannot adapt to videos of variable length. Meanwhile, although pre-segmentation as in two-stage methods is not required, they still essentially depend on ranking proposal candidates, which also limits their efficiency.

2.2.2 Anchor-free.

Instead of ranking a vast number of proposal candidates, the anchor-free methods start from more fine-grained video units such as frames or clips, and aim at predicting the probability of each frame/clip being the start or end point of the target segment, or directly regressing the start and end points from a global view.

Yuan et al. [124] propose Attention-Based Location Regression (ABLR), which solves TSGV from a global perspective without generating anchors. Specifically, as shown in Figure 9, to preserve the context information, ABLR first encodes both video and sentence via bidirectional LSTM networks. Then, a multi-modal co-attention mechanism is introduced to generate not only video attention, which reflects the global video structure, but also sentence attention, which highlights the crucial details for temporal localization. Finally, an attention-based coordinate prediction module is designed to regress the temporal coordinates (i.e., the starting timestamp \(t^s\) and the ending timestamp \(t^e\)) of the sentence query from the output attentions. There are two different regression strategies (i.e., attention weight-based regression and attended feature-based regression) with the location regression loss \(L_{reg}^{ablr}=\sum _{i=1}^K [R(\tilde{t}^{s}_i-t^{s}_i)+R(\tilde{t}^{e}_i-t^{e}_i)]\), where \(R\) is a smooth L1 function. Besides the location regression loss, which minimizes the distance between the temporal coordinates of the predicted and groundtruth segments, ABLR also designs an attention calibration loss \(\mathcal {L}_{cal}\) to make the video attentions more accurate: (9) \(\begin{equation} \mathcal {L}_{cal}=-\sum _{i=1}^K\frac{\sum _{j=1}^Mm_{i,j}\log (a_j^{V_i})}{\sum _{j=1}^Mm_{i,j}}. \end{equation}\)

Fig. 9.

Fig. 9. The architecture of ABLR model, which regresses the target coordinates with a multi-modal co-attention mechanism, figure from [124].

Here, \(\mathcal {L}_{cal}\) encourages the attention weights of the video clips within the groundtruth segment to be higher.
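The two ABLR objectives can be sketched in a few lines. The code below is our own NumPy illustration (function names and array shapes are assumptions), not the authors' implementation.

```python
import numpy as np

def attention_calibration_loss(attn, mask):
    """Eq. (9): encourage high attention on clips inside the groundtruth.
    attn: (K, M) video attention weights (each row sums to 1).
    mask: (K, M) binary m_{i,j}, 1 for clips inside the groundtruth segment."""
    eps = 1e-8
    per_query = -(mask * np.log(attn + eps)).sum(axis=1)
    return float((per_query / np.maximum(mask.sum(axis=1), 1)).sum())

def smooth_l1(x):
    """Elementwise smooth-L1 penalty R(.) used in the regression loss."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def location_regression_loss(pred, gt):
    """Sum of smooth-L1 distances between predicted and groundtruth
    (start, end) coordinates over K queries; pred, gt: (K, 2)."""
    return float(smooth_l1(pred - gt).sum())
```

Note how the calibration loss is minimized when the attention mass falls entirely inside the groundtruth segment.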

LGI [68] formulates TSGV as attention-based location regression, like ABLR. It further presents a more effective local-global video-text interaction module, which models the multi-level interactions between semantic phrases and video segments. Chen et al. [14] propose pairwise modality interaction (PMI) via a channel-gated modality interaction model to explicitly model the channel-level and sequence-level interactions in a pairwise fashion. Specifically, a light-weight convolutional network is applied as the localization head to process the feature sequence and output the video-text relevance score and boundary prediction. HVTG [16] also computes frame-level relevance scores and makes boundary predictions based on them. To perform fine-grained interaction among visual objects and between each visual object and the language query, HVTG devises a hierarchical visual-textual graph to encode the features.

Unlike ABLR, which regresses the coordinates of the target moment directly, ExCL [33] borrows the idea from the reading comprehension task [10] in the natural language processing area: retrieving a segment from a video is analogous to extracting a text span from a passage. Specifically, as shown in Figure 10, ExCL employs three different variants of start-end frame predictor networks (i.e., MLP, Tied-LSTM, and Conditioned-LSTM) to predict start and end probabilities for each frame.

Fig. 10.

Fig. 10. The architecture of ExCL. It consists of three modules: a sentence encoder (shown in orange squares), a video encoder (shown in blue squares), and three variants of frame predictor (i.e., MLP, Tied-LSTM, and Conditioned-LSTM). The frame predictor outputs the start and end probabilities for each frame, figure from [33].
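The span-selection step shared by such start-end predictors can be sketched as follows: given per-frame start/end logits, pick the pair (s, e) with s ≤ e that maximizes the joint probability. This is our own minimal illustration with hypothetical names, not the authors' code.

```python
import numpy as np

def best_span(start_logits, end_logits):
    """Select the (start, end) frame pair with s <= e maximizing the
    product of start and end probabilities, as in span extraction."""
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    p_s = softmax(np.asarray(start_logits, dtype=float))
    p_e = softmax(np.asarray(end_logits, dtype=float))
    best, best_score = (0, 0), -1.0
    for s in range(len(p_s)):
        for e in range(s, len(p_e)):  # enforce the ordering constraint
            score = p_s[s] * p_e[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

A production implementation would vectorize this double loop, but the quadratic search above makes the s ≤ e constraint explicit.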

Likewise, VSLNet [130] employs a standard span-based Question Answering framework. VSLNet further distinguishes the differences between a video sequence and a text passage for better adaptation to the TSGV task. To address these differences, it designs a query-guided highlighting strategy to narrow down the search space to a smaller coarse highlight region. L-Net [12] introduces a boundary model to predict the start and end boundaries, semantically localizing the video segment given the language query. It devises a cross-gated attended recurrent network to emphasize the relevant video parts while gating out the irrelevant ones, and a cross-modal interactor for fine-grained interactions between the two modalities.

TMLGA [76] also predicts start and end probabilities for each video unit. It further models the uncertainty of boundary labels, using two Gaussian distributions as groundtruth probability distributions. CPN [141] devises a cascaded prediction network based on the segment-tree data structure. It performs two sub-tasks (i.e., decision navigation and signal decomposition) on each level in a top-down manner for final boundary prediction. PEARL [132] integrates the subtitles of videos and convolves the query filters into the visual and subtitle branches to locate the boundaries.

Lu et al. [60] propose a dense bottom-up grounding (DEBUG) framework, which localizes the target segment by predicting the distances to bidirectional temporal boundaries for all frames inside the groundtruth segment. In this way, all frames inside the groundtruth segment can be seen as positive samples, alleviating the severe imbalance issue caused by only regarding the groundtruth segment boundaries as positive samples. As shown in Figure 11, a typical dense anchor-free model usually contains a backbone framework for multimodal feature encoding and a head network for frame-level predictions. Specifically, DEBUG adopts QANet as its backbone network which models the interaction between videos and queries, and designs three branches as head networks which aim at separately predicting the classification score, boundary distances, and confidence score for each frame.

Fig. 11.

Fig. 11. The architecture of DEBUG, consists of a backbone framework (QANet) to model the multimodal interaction and a head module with three branches for dense regression, figure from [60].
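The dense labeling scheme that DEBUG builds on can be sketched as follows. This is a simplified illustration with our own function names; integer frame indices stand in for timestamps.

```python
def dense_targets(gt_start, gt_end):
    """Every frame inside the groundtruth segment is a positive sample;
    its regression target is the pair of distances to both boundaries."""
    return {t: (t - gt_start, gt_end - t) for t in range(gt_start, gt_end + 1)}

def decode_segment(frame, left, right):
    """Recover a segment from one frame's predicted boundary distances."""
    return frame - left, frame + right
```

Because every interior frame carries a target, the number of positive samples grows with segment length, which is what alleviates the imbalance mentioned above.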

Similarly, DRN [125] and GDP [13] also adopt such a dense anchor-free framework. For the backbone, DRN uses a video-query interaction module to obtain fused hierarchical feature maps. For the head network, DRN densely predicts the distances to boundaries, the matching score, and the estimated IoU for each frame within the groundtruth segment. Meanwhile, GDP leverages a Graph-FPN layer as the backbone, which conducts graph convolution over all nodes in the scene space to enhance the integrated frame features. For the head network, GDP predicts, for each frame, the distances from its location to the boundaries of the target moment, together with a confidence score to rank its boundary prediction.

Another graph-based method DORi [77] utilizes a spatio-temporal graph to model the temporally-changing inter-object interactive relationships based on the language query, which can further improve the activity representations. Instead of adopting well-designed graph-based structures, Yu et al. [119] propose a simple yet effective approach that only conducts bi-directional cross-modal interaction via multi-head attention with multiple training objectives.

Compared with anchor-based methods, anchor-free methods are more computation-efficient and robust to variable video durations. Despite these significant advantages, it is difficult for anchor-free methods to capture segment-level features for multimodal interactions.

Other Single-stage Methods. Different from the aforementioned single-stage methods, which either sample from multi-scale anchors or directly regress the final coordinates, some methods outside these patterns have emerged. The boundary proposal network (BPNet) [106] keeps the advantages of both anchor-based and anchor-free methods while avoiding their defects: it generates proposals with an anchor-free method and then matches them with the sentence query in an anchor-based manner. Wang et al. [95] propose a dual path interaction network (DPIN) containing two branches (i.e., a boundary prediction pathway for frame-level features and an alignment pathway for segment-level features) to complementarily localize the target moment. Inspired by the dependency tree parsing task in the natural language processing community, a biaffine-based architecture named context-aware biaffine localizing network (CBLN) [55] has been proposed, which can simultaneously score all possible pairs of start and end indices. Ding et al. [25] introduce a support-set cross-supervision (Sscs) module, which can serve as a plug-in branch to enhance multi-modal relation modeling for both anchor-based and anchor-free methods.

2.3 Reinforcement Learning-based Method

As another kind of anchor-free approach, RL-based frameworks view the task as a sequential decision process. The action space for each step is a set of hand-crafted temporal transformations (e.g., shifting, scaling).

He et al. [36] first introduce deep reinforcement learning techniques to address TSGV, formulating it as a sequential decision-making problem. As depicted in Figure 12, at each time step, the observation network outputs the current state of the environment for the actor-critic module to generate an action policy (i.e., the probabilistic distribution over all the actions predefined in the action space), based on which the agent performs an action to adjust the temporal boundaries. This iterative process terminates when encountering the STOP action or reaching the maximum number of steps (i.e., \(T_{max}\)). Specifically, at each step, the current state vector is computed as \(s^{(t)} = \Phi (E,V_G,V^{(t-1)}_L,L^{(t-1)})\), where \(s^{(t)}\) is generated by one FC layer whose inputs are the concatenated features including the segment-specific features (i.e., the normalized boundary pair \(L^{(t-1)}=[l_s^{(t-1)},l_e^{(t-1)}]\) and the local segment C3D feature \(V^{(t-1)}_L\)) and the global features (i.e., the sentence embedding \(E\) and the entire video C3D feature \(V_G\)). Then the actor-critic module employs a GRU to model the sequential decision-making process. At each time step, the GRU takes \(s^{(t)}\) as input and its hidden state is used for policy (denoted as \(\pi (a_i^{(t)}|s^{(t)},\theta _{\pi })\)) generation and state-value (denoted as \(v(s^{(t)}|\theta _v)\)) estimation. The reward \(r_t\) for each step is designed to encourage a higher tIoU compared to that of the last step. The accumulated reward function is then defined as (\(\gamma\) is a constant discount factor): (10) \(\begin{equation} R_t = {\left\lbrace \begin{array}{ll} r_t + \gamma * v(s^{(t)}|\theta _v),& t=T_{max} \\ r_t + \gamma * R_{t+1},& t=1,2,\ldots ,T_{max}-1 \end{array}\right.}. \end{equation}\)

Fig. 12.

Fig. 12. The architecture of the R-W-M framework. The observation network takes the environment (i.e., description feature, global video feature, local feature, and location feature) as input to compute the current state. Then one of those seven operators in the action space is determined to adjust the temporal boundaries of the current segment. Two auxiliary supervised tasks including tIoU regression and location regression are also leveraged, figure from [36].

Then they introduce the advantage function as the objective, which is approximated via Monte Carlo sampling to get the policy gradient: (11) \(\begin{equation} \mathcal {L}^{\prime }_A(\theta _{\pi }) = - \sum _{t}(\log \pi (a_i^{(t)}|s^{(t)},\theta _{\pi }))(R_t - v(s^{(t)}|\theta _v)). \end{equation}\)

They further leverage two supervised tasks (i.e., tIoU regression and location regression) so the parameters can be updated from both policy gradient and supervised gradient to help the agent obtain more accurate information about the environment.
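The return recursion in Eq. (10) is computed backwards from the final step, bootstrapping from the critic's value estimate. Below is our own minimal sketch, not the authors' code; the reward values and discount factor in the usage are placeholders.

```python
def accumulated_rewards(rewards, bootstrap_value, gamma):
    """Eq. (10): R_Tmax = r_Tmax + gamma * v(s_Tmax);
    R_t = r_t + gamma * R_{t+1} for t < Tmax.
    rewards: [r_1, ..., r_Tmax]; bootstrap_value: critic estimate v(s_Tmax)."""
    R = [0.0] * len(rewards)
    R[-1] = rewards[-1] + gamma * bootstrap_value
    for t in range(len(rewards) - 2, -1, -1):
        R[t] = rewards[t] + gamma * R[t + 1]
    return R
```

The resulting returns are exactly the \(R_t\) terms entering the advantage \(R_t - v(s^{(t)}|\theta_v)\) in Eq. (11).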

Wang et al. [100] propose an RNN-based RL model which sequentially observes a selective set of video frames and finally obtains the temporal boundaries given the query. Cao et al. [6] are the first to leverage the spatial scene tracking task, utilizing spatial-level RL to filter out information irrelevant to the text query. The spatial-level RL can enhance the temporal-level RL for adjusting the temporal boundaries of the video. TripNet [35] uses gated attention to align textual and visual features, leading to improved accuracy. It incorporates a policy network for efficient search, which moves a fixed-size temporal bounding box around without watching the entire video.

TSP-PRL [105] adopts a tree-structured policy that is different from conventional RL-based methods, inspired by a human’s coarse-to-fine decision-making paradigm. The agent receives the state from the environment (video clips) and estimates a primitive action via tree-structured policy, including root policy and leaf policy. The action selection is depicted by a switch over the interface in the tree-structured policy. The alignment network will predict a confidence score to determine when to stop. Meanwhile, AVMR [7] addresses TSGV under the adversarial learning paradigm, which designs an RL-based proposal generator to generate proposal candidates and employs Bayesian Personalized Ranking as a discriminator to rank these generated moment proposals in a pairwise manner.

2.4 Weakly Supervised Method

To annotate groundtruth data for TSGV, annotators must read the query and watch the video first, and then determine the start and end points of the query-indicated segment in the video. Such a labor-intensive process is very time-consuming. Therefore, some works extend TSGV to a weakly supervised scenario where the locations of groundtruth segments (i.e., the start and end timestamps) are unavailable during training. This is formally named weakly supervised TSGV. In general, weakly supervised methods for TSGV can be grouped into two categories (i.e., MIL-based and reconstruction-based). One representative work will be illustrated in detail for each category, after which we will introduce the remaining ones.

Some works [20, 31, 67, 91] adopt multi-instance learning (MIL) to address the weakly supervised TSGV task. When temporal annotations are not available, the whole video is treated as a bag of instances with bag-level annotations, and the predictions for instances (video segment proposals) are aggregated as the bag-level prediction.

TGA [67] is a typical MIL-based method that learns visual-text alignment at the video level by maximizing the matching scores of videos and their corresponding descriptions while minimizing the matching scores of videos and the descriptions of others. It presents text-guided attention (TGA) to get text-specific global video representations, learning the joint representation of both the video and the video-level description. As illustrated in Figure 13, TGA first employs a GRU for sentence embedding and a pretrained image encoder for extracting frame-level features. The similarity between the \(j\)th sentence and the \(k\)th temporal feature within the \(i\)th video, denoted as \(s^i_{kj}\), is computed, and a softmax operation is applied to get the text-guided attention weight for each temporal unit, denoted as \(a^i_{kj}\): (12) \(\begin{equation} s^i_{kj} = \frac{\mathbf {w}^{i^T}_j\mathbf {v}^i_k}{\left\Vert \mathbf {w}^{i}_j\right\Vert _2 \left\Vert \mathbf {v}^i_k\right\Vert _2} \,, \quad a^i_{kj} = \frac{\exp (s^i_{kj})}{\sum _{m=1}^{nv_i} \exp (s^i_{mj})}. \end{equation}\)

Fig. 13.

Fig. 13. The overall framework of TGA. It learns a joint embedding network to align the text and video features. The global video representation is generated by weighted pooling based on text-guided attentions, figure from [67].

Thus we could get the sentence-wise global video feature \({\bf f}^i_j= \sum _{k=1}^{nv_i} a^i_{kj}\mathbf {v}^i_k\).
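Eq. (12) together with the weighted pooling above amounts to a few lines of NumPy. The sketch below is our own illustration with hypothetical names, not the authors' code.

```python
import numpy as np

def text_guided_attention(w, V):
    """Eq. (12) plus pooling: cosine similarity between the sentence
    embedding w (d,) and each frame feature in V (n, d), softmax over
    frames, then weighted pooling into the global video feature f."""
    sims = V @ w / (np.linalg.norm(V, axis=1) * np.linalg.norm(w) + 1e-8)
    e = np.exp(sims - sims.max())   # numerically stable softmax
    a = e / e.sum()                 # text-guided attention weights
    return a @ V                    # sentence-wise global video feature
```

Frames that are more similar to the sentence embedding dominate the pooled representation, which is what makes the global feature text-specific.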

WSLLN [31] is another MIL-based end-to-end weakly supervised language localization network conducting clip-sentence alignment and segment selection simultaneously. Huang et al. [41] present a cross-sentence relations mining (CRM) method exploring the cross-sentence relations within paragraph-level scope to improve the per-sentence localization accuracy.

A video-language alignment network (VLANet) proposed by Ma et al. [63] prunes the irrelevant moment candidates with a Surrogate Proposal Module and utilizes multi-directional attention to get a sharper attention map for better multimodal alignment. Wu et al. [104] apply an RL-based model to weakly supervised TSGV, proposing a boundary adaptive refinement (BAR) framework for achieving boundary-flexible and content-aware grounding results. Chen et al. [20] propose a novel coarse-to-fine model (WSTG) based on MIL. First, the coarse stage selects a rough segment from a set of predefined sliding windows, which semantically corresponds to the given sentence. Afterward, the fine stage mines the fine-grained matching relationship between each frame in the coarse segment and the sentence. It thereby refines the boundary of the coarse segment by grouping the frames and gets more precise grounding results.

Tan et al. [91] propose a Latent Graph Co-Attention Network (LoGAN), a novel co-attention model that performs fine-grained semantic reasoning over an entire video. LoGAN is also a MIL-based method, which performs frame-by-word interaction similar to the supervised method TGN [11] and adapts the graph module from another supervised method, MAN [128], for iterative frame representation updates. Wang et al. [101] present a fine-grained semantic alignment network (FSAN), which enables iterative multi-head attention based cross-modal interaction to capture fine-grained video-language alignment. In order to learn more robust and discriminative moment features, VCA [102] devises a visual co-occurrence alignment NCE loss that maximizes the similarity between video moments from different videos with similar descriptions.

Since MIL-based methods typically learn the visual-text alignment with a triplet loss, these methods heavily depend on the quality of randomly-selected negative samples, which are often easy to distinguish from the positive ones and cannot provide strong supervision signals.

The reconstruction-based methods [17, 26, 52, 87] attempt to reconstruct the given sentence query based on the selected video segments and use the intermediate results for sentence localization. Unlike MIL-based methods, the reconstruction-based methods learn the visual-textual alignment in an indirect way. Lin et al. [52] propose a semantic completion network (SCN) to predict the masked important words within the query according to the visual context of generated and selected video proposals. Specifically, for each proposal \(G^k\), denoted by \(\hat{\mathbf {v}}^k=\lbrace \mathbf {v}_i\rbrace _{i=s_k}^{e_k}\), with the masked query representation \(\hat{q}\), the energy word distribution \(\mathbf {e}_i^k\) at \(i^{th}\) time step can be computed as \(\mathbf {e}_i^k = \mathbf {W}_v\mathbf {f}_i^k + \mathbf {b}_v\,,\) where \(\mathbf {f}^k=\lbrace \mathbf {f}_i^k\rbrace _{i=1}^{n_q}\) are the cross-modal semantic representations, computed by \(\mathbf {f}^k = \mathbf {Dec}_q(\hat{\mathbf {q}},\,\mathbf {Enc}_v(\hat{\mathbf {v}}^k))\), \(\mathbf {Dec}_q\) and \(\mathbf {Enc}_v\) are respectively the textual decoder and visual encoder based on bi-directional Transformer [94]. Afterward, the reconstruction loss can be computed by adding up all negative log-likelihood of masked words: (13) \(\begin{equation} \mathcal {L}_{rec}^k = - \sum _{i=1}^{n_q-1}\log p(\mathbf {w}_{i+1}|\hat{\mathbf {w}}_{1:i},\hat{\mathbf {v}}^k) = - \sum _{i=1}^{n_q-1}\log p(\mathbf {w}_{i+1}|\mathbf {e}_i^k). \end{equation}\)
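Given the per-step word distributions softmax(e_i^k), the reconstruction loss in Eq. (13) reduces to a sum of negative log-likelihoods over the masked words. The sketch below is our own minimal illustration with hypothetical names.

```python
import numpy as np

def reconstruction_loss(word_probs, target_ids):
    """Eq. (13): sum of negative log-likelihoods of the words to
    reconstruct, given per-step word distributions.
    word_probs: (n_q - 1, vocab) rows of softmax(e_i^k);
    target_ids: indices of the words w_2 .. w_{n_q}."""
    steps = np.arange(len(target_ids))
    return float(-np.log(word_probs[steps, np.asarray(target_ids)] + 1e-8).sum())
```

A proposal whose visual content supports the masked words yields sharper distributions on the targets and hence a lower loss, which is the signal SCN uses to rank proposals.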

Song et al. [87] present a Multi-Level Attentional Reconstruction Network (MARN), which leverages the idea of attentional reconstruction. MARN uses proposal-level attentions to rank the segment candidates and refine them with clip-level attention.

Duan et al. [26] formulate and address the problem of weakly supervised dense event captioning in videos (i.e., to detect and describe all events of interest in a video), which is a dual problem of weakly supervised TSGV. It presents a cycle system to train the model which can solve such a pair of dual problems at the same time. In other words, weakly supervised TSGV can be regarded as an intermediate task in such a cycle system. Similar to [26], Chen and Jiang [17] also employ a loop system for dense event captioning. They adopt a concept learner to construct an induced set of concept features to enhance the information passing between the sentence localizer and event captioner.

Besides, instead of proposing a reconstruction-based or MIL-based method, Zhang et al. [138] design a counterfactual contrastive learning paradigm to improve vision-and-language grounding tasks. A regularized two-branch proposal network (RTBPN) [137] is also presented to explore sufficient intra-sample confrontment with a sharable two-branch proposal module for distinguishing the target moment from plausible negative moments.

Skip 3DATASETS AND EVALUATIONS Section

3 DATASETS AND EVALUATIONS

In this section, we present benchmark datasets and evaluation metrics for TSGV and provide detailed performance comparisons among the above-mentioned approaches.

3.1 Datasets

Several datasets for TSGV, drawn from different scenarios with distinct characteristics, have been proposed in the past few years. The effort of creating these datasets and designing corresponding evaluation metrics has undoubtedly promoted the development of TSGV. Tables 1 and 2 provide an overview of the statistics of the public datasets. Table 1 summarizes the videos and annotated query-moment pairs. As we can see, some datasets (i.e., TACoS, Charades-STA) are constrained to a narrow and specific scene (e.g., kitchen or indoor activity), while others (i.e., DiDeMo, ActivityNet Captions) involve more complicated activities in open domains. Since each query refers to exactly one moment but multiple queries may refer to the same moment (a moment here means a video segment which can be identified by a {video id, start timestamp, end timestamp} triplet), the number of queries equals the number of all samples (query-moment pairs), while the number of moments, i.e., the number of unique triplets, is smaller. Moreover, the detailed language statistics are reported in Table 2. A larger vocabulary size and more verb/adjective/noun tokens per query indicate greater challenges in textual semantic understanding. Obviously, the sentences in ActivityNet Captions are the most difficult, while those of Charades-STA are relatively simple, with the smallest action (verb) set. We introduce these four datasets more concretely as follows:

Table 1.
| Dataset | # Videos | Aver. Video Duration (s) | Domain | Video Source | # Queries | # Moments | Aver. Moment Duration (s) | Aver. Query Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiDeMo | 10,642 | 29.3 | Open | Flickr | 41,206 | 28,925 | 6.9 | 7.5 |
| TACoS | 127 | 286.6 | Cooking | Lab Kitchen | 18,227 | 7,069 | 27.9 | 9.4 |
| Charades-STA | 9,848 | 30.6 | Indoor activity | Activity | 16,124 | 11,767 | 8.1 | 6.2 |
| ActivityNet Captions | 14,926 | 117.6 | Open | Activity | 71,957 | 71,718 | 37.1 | 14.4 |

Table 1. The Statistics of Videos and Annotations of the Benchmark Datasets

Table 2.
| Dataset | Verbs (vocab) | Adjectives (vocab) | Nouns (vocab) | Verbs (aver. per query) | Adjectives (aver. per query) | Nouns (aver. per query) |
| --- | --- | --- | --- | --- | --- | --- |
| DiDeMo | 1.50 K | 1.40 K | 4.30 K | 1.20 | 0.58 | 2.64 |
| TACoS | 0.58 K | 0.42 K | 0.98 K | 1.48 | 0.23 | 2.64 |
| Charades-STA | 0.25 K | 0.17 K | 0.63 K | 1.26 | 0.06 | 2.40 |
| ActivityNet Captions | 2.60 K | 2.90 K | 8.90 K | 2.56 | 0.66 | 3.73 |

Table 2. Language Statistics of the Benchmark Datasets

DiDeMo [38]. This dataset is collected from Flickr, and consists of various human activities uploaded by personal users. Hendricks et al. [38] split and label video segments from original untrimmed videos by aggregating five-second clip units, which means the lengths of groundtruth segments are multiples of five seconds. They claim that this trick avoids labeling ambiguity and accelerates the validation process. However, such a length-fixed setting makes the retrieval task easier since it compresses the search space into a limited set of candidates. The data splits are also provided by [38], with 33,005/4,180/4,021 video-sentence pairs for training/validation/test, respectively. Besides, a new dataset TEMPO [37] involving more temporally-related events has been collected based on DiDeMo, which has been explored by some works as well [88, 135].

TACoS [75]. TACoS is built upon the MPII-Compositive dataset [78]. It contains 127 complex videos featuring cooking activities, and each video has several segments annotated with sentence descriptions illustrating people's cooking actions. The average length of videos in TACoS is around 300 s, much longer than that of the other benchmark datasets. The dataset contains 18,227 query-moment pairs in total, of which 50%, 25%, and 25% are used for training, validation, and test, respectively.

Charades-STA [29]. Charades-STA is built upon Charades [83], which is originally collected for video activity recognition and consists of 9,848 videos depicting human daily indoor activities. Specifically, Charades contains 157 activity categories and 27,847 video-level sentence descriptions. Based on Charades, Gao et al. [29] construct Charades-STA with a semi-automatic pipeline, which first parses the activity label out of the video description and then aligns the description with the original label-indicated temporal intervals. As such, the yielded (description, interval) pairs can be seen as the (sentence query, target segment) pairs for TSGV. Since the original descriptions in Charades-STA are quite short, Gao et al. [29] further enhance their complexity by combining consecutive descriptions into a more complex sentence for test.

ActivityNet Captions [47]. ActivityNet Captions is originally proposed for dense video captioning upon the ActivityNet dataset [5], and the query-moment pairs in this dataset can naturally be utilized for TSGV. ActivityNet Captions contains the largest amount of videos, and it aligns videos with a series of temporally annotated sentence descriptions. On average, each of the 20 k videos contains 3.65 temporally localized sentences, resulting in a total of 100 k sentences. Each sentence has an average length of 13.48 words. The sentence length is also normally distributed. Since the official test set is withheld for competitions, most TSGV works merge the two available validation subsets "val1" and "val2" as the test set. In summary, there are 10,009 videos and 37,421 query-moment pairs in the training set, and 4,917 videos and 34,536 query-moment pairs in the test set.

3.2 Metrics

There are two types of metrics for TSGV, i.e., R@\(n\),IoU=\(m\) and mIoU, both of which are first introduced for TSGV in [29]. Since Intersection over Union (IoU) is widely used in object detection to measure the similarity between two bounding boxes, many TSGV methods similarly adopt temporal IoU, as illustrated in Figure 14, to measure the similarity between the groundtruth moment and the predicted one. The ratio of the intersection area over the union area ranges from 0 to 1, and it equals 1 when the two moments completely overlap.

Fig. 14.

Fig. 14. The illustration of Temporal IoU.

Thereby, one of the metrics is mIoU (i.e., mean IoU), which simply averages the temporal IoUs of all samples. The other commonly-used metric is \(\text{R@}n,\text{IoU=}m\) [40]. Sample \(i\) is counted as positive when at least one of the top \(n\) retrieved segments has a temporal IoU with the groundtruth segment over \(m\), which can be denoted as \(r(n,m,q_i) = 1\). Otherwise, \(r(n,m,q_i) = 0\). \(\text{R@}n, \text{IoU=}m\) is the percentage of positive samples over all samples (i.e., \(\frac{1}{N_q}\sum _i r(n,m,q_i)\)).
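Both metrics can be sketched in a few lines. The code below is our own illustration (segments are (start, end) pairs in seconds; function names are ours), not taken from any particular codebase.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n_iou_m(predictions, groundtruths, n, m):
    """R@n,IoU=m: fraction of samples whose top-n predictions contain at
    least one segment with tIoU >= m against the groundtruth. mIoU would
    instead average temporal_iou of the top-1 prediction over samples."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(predictions, groundtruths)
    )
    return hits / len(groundtruths)
```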

The community is accustomed to setting \(n\in \lbrace 1,5,10\rbrace\) and \(m\in \lbrace 0.3,0.5,0.7\rbrace\). Usually, \(n=1\) when the method adopts a proposal-free manner (i.e., belongs to either anchor-free or RL-based frameworks). Moreover, it is worth noting that MCN [38] adopts a particular metric with the IoU threshold \(m = 1.0\), since the groundtruth segments in DiDeMo are generated by aggregating five-second clip units; as MCN employs a matching-based method, the predicted moment has a chance to fully coincide with the target moment, satisfying such a high IoU threshold.

3.3 Performance Comparison

In this section, we give a thorough performance comparison of the aforementioned approaches on four benchmark datasets. For convenience and fairness, we uniformly adopt \(n = 1\) and \(m \in \lbrace 0.3,0.5,0.7\rbrace\) for the metric of R@\(n\),IoU=\(m\). Though different types of extracted visual features may influence the grounding accuracy, we uniformly report the best result for each method as reported in the literature. Table 3 lists all the experimental results grouped by their categories (i.e., two-stage, single-stage, RL-based, or weakly supervised methods). Table 4 separately reports the experimental results on the DiDeMo dataset with the metrics of R@\(\lbrace 1,5\rbrace\), IoU=1.0, and mIoU.

Table 3.
| Type | Method | DiDeMo (0.3 / 0.5 / 0.7) | TACoS (0.3 / 0.5 / 0.7) | Charades-STA (0.3 / 0.5 / 0.7) | ActivityNet Captions (0.3 / 0.5 / 0.7) |
| --- | --- | --- | --- | --- | --- |
| SW | MCN [38] | - / - / - | - / - / - | 13.57 / 4.05 / - | - / - / - |
| SW | CTRL [29] | - / - / - | 18.32 / 13.3 / - | - / 21.42 / 7.15 | - / - / - |
| SW | MCF [103] | - / - / - | 18.64 / 12.53 / - | - / - / - | - / - / - |
| SW | ROLE [58] | 29.4 / 15.68 / - | - / - / - | 25.26 / 12.12 / - | - / - / - |
| SW | ACRN [57] | - / - / - | 19.52 / 14.62 / - | - / - / - | - / - / - |
| SW | SLTA [42] | - / 30.92 / 17.16 | 17.07 / 11.92 / - | 38.96 / 22.81 / 8.25 | - / - / - |
| SW | VAL [86] | - / - / - | 19.76 / 14.74 / - | - / 23.12 / 9.16 | - / - / - |
| SW | ACL-K [32] | - / - / - | 24.17 / 20.01 / - | - / 30.48 / 12.2 | - / - / - |
| SW | MMRG [126] | - / - / - | 57.83 / 39.28 / - | 71.6 / 44.25 / - | - / - / - |
| PG | QSPN [108] | - / - / - | - / - / - | 54.7 / 35.6 / 15.8 | 45.3 / 27.7 / 13.6 |
| PG | SAP [15] | - / - / - | - / 18.24 / - | - / 27.42 / 13.36 | - / - / - |
| AB | TGN [11] | - / - / - | 21.77 / 18.9 / - | - / - / - | 45.51 / 28.47 / - |
| AB | MAN [128] | - / - / - | - / - / - | - / 46.53 / 22.72 | - / - / - |
| AB | CMIN [136] | - / - / - | 24.64 / 18.05 / - | - / - / - | 63.61 / 43.4 / 23.88 |
| AB | SCDM [122] | - / - / - | 26.11 / 21.17 / - | - / 54.44 / 33.43 | 54.8 / 36.75 / 19.86 |
| AB | CBP [97] | - / - / - | 27.31 / 24.79 / 19.1 | - / 36.8 / 18.87 | 54.3 / 35.76 / 17.8 |
| AB | 2D-TAN [134] | - / - / - | 37.29 / 25.32 / - | - / 39.7 / 23.31 | 59.45 / 44.51 / 26.54 |
| AB | FVMR [30] | - / - / - | 41.48 / 29.12 / - | - / 55.01 / 33.74 | 60.63 / 45 / 26.85 |
| AB | SMRN [9] | - / - / - | 42.49 / 32.07 / - | - / 43.58 / 25.22 | - / 42.97 / 26.79 |
| AB | RMN [54] | - / - / - | 32.21 / 25.61 / - | - / 59.13 / 36.98 | 67.01 / 47.41 / 27.21 |
| AB | FIAN [74] | - / - / - | 33.87 / 28.58 / - | - / 58.55 / 37.72 | 64.1 / 47.9 / 29.81 |
| AB | CSMGAN [56] | - / - / - | 33.9 / 27.09 / - | - / - / - | 68.52 / 49.11 / 29.15 |
| AB | SMIN [96] | - / - / - | 48.01 / 35.24 / - | - / 64.06 / 40.75 | - / 48.46 / 30.34 |
| AB | Zhang et al. [133] | - / - / - | 48.79 / 37.57 / - | - / - / - | - / 48.02 / 31.78 |
| AB | VLG-Net [85] | 25.57 / 71.65 / - | 45.46 / 34.19 / - | - / - / - | - / 46.32 / 29.82 |
| AF | ABLR [124] | - / - / - | 18.9 / 9.3 / - | - / - / - | 55.67 / 36.79 / - |
| AF | DEBUG [60] | - / - / - | 23.45 / - / - | 54.95 / 37.39 / 17.69 | 55.91 / 39.72 / - |
| AF | GDP [13] | - / - / - | 24.14 / - / - | 54.54 / 39.47 / 18.49 | 56.17 / 39.27 / - |
| AF | PMI [14] | - / - / - | - / - / - | 55.48 / 39.73 / 19.27 | 59.69 / 38.28 / 17.83 |
| AF | ExCL [33] | - / - / - | 44.4 / 27.8 / 14.6 | 61.4 / 41.2 / 21.3 | 62.1 / 41.6 / 23.9 |
| AF | DRN [125] | - / - / - | - / 23.17 / - | - / 45.4 / 26.4 | - / 42.49 / 22.25 |
| AF | HVTG [16] | - / - / - | - / - / - | 61.37 / 47.27 / 23.3 | 57.6 / 40.15 / 18.27 |
| AF | TMLGA [76] | - / - / - | 24.54 / 21.65 / 16.46 | 67.53 / 52.02 / 33.74 | 51.28 / 33.04 / 19.26 |
| AF | LGI [68] | - / - / - | - / - / - | 72.96 / 59.46 / 35.48 | 58.52 / 41.51 / 23.07 |
| AF | VSLNet [130] | - / - / - | 29.61 / 24.27 / 20.03 | 70.46 / 54.19 / 35.22 | 63.16 / 43.22 / 26.16 |
| AF | CPN [141] | - / - / - | 47.69 / 36.33 / 21.58 | 75.53 / 59.77 / 36.67 | 62.81 / 45.1 / 28.1 |
| AF | DORi [77] | - / - / - | 31.8 / 28.69 / 24.91 | 72.72 / 59.65 / 40.56 | 57.89 / 41.35 / 26.41 |
| AF | CI-MHA [119] | - / - / - | - / - / - | 69.87 / 54.68 / 35.27 | 61.49 / 43.97 / 25.13 |
| AF | PEARL [132] | - / - / - | 42.94 / 32.07 / 18.37 | 71.9 / 53.5 / 35.4 | - / - / - |
| OT | BPNet [106] | - / - / - | 25.96 / 20.96 / 14.08 | 55.46 / 38.25 / 20.51 | 58.98 / 42.07 / 24.69 |
| OT | DPIN [95] | - / - / - | 46.74 / 32.92 / - | - / 47.98 / 26.96 | 62.4 / 47.27 / 28.31 |
| OT | CBLN [55] | - / - / - | 38.98 / 27.65 / - | - / 61.13 / 38.22 | 66.34 / 48.12 / 27.6 |
| RL | R-W-M [36] | - / - / - | - / - / - | - / 36.7 / - | - / 36.9 / - |
| RL | SM-RL [100] | - / - / - | 20.25 / 15.95 / - | - / 24.36 / 11.17 | - / - / - |
| RL | TripNet [35] | - / - / - | - / - / - | 51.33 / 36.61 / 14.5 | 48.42 / 32.19 / 13.93 |
| RL | TSP-PRL [105] | - / - / - | - / - / - | - / 45.45 / 24.75 | 56.02 / 38.82 / - |
| RL | STRONG [6] | - / - / - | 72.14 / 49.73 / 18.29 | 78.1 / 50.14 / 19.3 | - / - / - |
| RL | AVMR [7] | - / - / - | 72.16 / 49.13 / - | 77.72 / 54.59 / - | - / - / - |
| WS | WSDEC [26] | - / - / - | - / - / - | - / - / - | 41.98 / 23.34 / - |
| WS | TGA [67] | - / - / - | - / - / - | 32.14 / 19.94 / 8.84 | - / - / - |
| WS | WSLLN [31] | - / - / - | - / - / - | - / - / - | 42.8 / 22.7 / - |
| WS | EC-SL [17] | - / - / - | - / - / - | - / - / - | 44.29 / 24.16 / - |
| WS | SCN [52] | - / - / - | - / - / - | 42.96 / 23.58 / 9.97 | 47.23 / 29.22 / - |
| WS | WSTG [20] | - / - / - | - / - / - | 39.8 / 27.3 / 12.9 | 44.3 / 23.6 / - |
| WS | VLANet [63] | - / - / - | - / - / - | 45.24 / 31.83 / 14.17 | - / - / - |
| WS | FSAN [101] | - / - / - | - / - / - | - / - / - | 55.11 / 29.43 / - |
| WS | MARN [87] | - / - / - | - / - / - | 48.55 / 31.94 / 14.81 | 47.01 / 29.95 / - |
| WS | RTBPN [137] | - / - / - | - / - / - | 60.04 / 32.36 / 13.24 | 49.77 / 29.63 / - |
| WS | BAR [104] | - / - / - | - / - / - | 44.97 / 27.04 / 12.23 | 49.03 / 30.73 / - |
| WS | CCL [138] | - / - / - | - / - / - | - / 33.21 / 15.68 | 50.12 / 31.07 / - |
| WS | VCA [102] | - / - / - | - / - / - | 58.58 / 38.13 / 19.57 | 50.45 / 31 / - |
| WS | LoGAN [91] | - / - / - | - / - / - | 51.67 / 34.68 / 14.54 | - / - / - |
| WS | CRM [41] | - / - / - | - / - / - | 53.66 / 34.76 / 16.37 | 55.26 / 32.19 / - |

Table 3. The Performance Comparison of All TSGV Methods Grouped by Their Categories (SW:sliding window-based, PG:proposal-generated, AB:anchor-based, AF:anchor-free, OT:other single-stage methods, RL:RL-based, WS:weakly Supervised)

Table 4.
| Type | Method | R@1, IoU=1.0 | R@5, IoU=1.0 | mIoU |
| --- | --- | --- | --- | --- |
| Fully supervised | TMN [53] | 22.92 | 76.08 | 35.17 |
| Fully supervised | TGN [11] | 24.28 | 71.43 | 38.62 |
| Fully supervised | VLG-Net [85] | 25.57 | 71.65 | - |
| Fully supervised | MCN [38] | 28.1 | 78.21 | 41.08 |
| Fully supervised | MAN [128] | 27.02 | 81.7 | 41.16 |
| Weakly supervised | TGA [67] | 12.19 | 39.74 | 24.92 |
| Weakly supervised | VLANet [63] | 19.32 | 65.68 | 25.33 |
| Weakly supervised | WSLLN [31] | 19.4 | 53.1 | 25.4 |
| Weakly supervised | FSAN [101] | 19.4 | 57.85 | 31.92 |
| Weakly supervised | RTBPN [137] | 20.79 | 60.26 | 29.81 |
| Weakly supervised | LoGAN [91] | 39.2 | 64.04 | 38.28 |

Table 4. The Evaluation Results on DiDeMo (The IoU Threshold m = 1.0)

Two-stage method. As shown in Table 3, the overall performance of two-stage methods is poorer than that of other approaches. The possible reasons are threefold: (1) Most two-stage methods combine video and sentence features coarsely and neglect the fine-grained visual and textual interactions required for accurate temporal sentence grounding. (2) Separating candidate segment generation from query-moment matching prevents the model from being globally optimized, which also hurts overall performance. (3) Matching sentence queries against individual segments separates the local video content from the global video context, which may further hurt the temporal grounding accuracy.

Specifically, all of the sliding window (SW)-based methods achieve their lowest grounding accuracy on TACoS compared to the other three datasets under the same metrics. The reason is that the cooking activities in TACoS take place in the same kitchen scene with only slightly varied cooking objects (e.g., chopping board, knife, and bread), making temporal location prediction for such fine-grained activities difficult. Meanwhile, the videos in TACoS are also longer, which greatly enlarges the search space for the target segment and adds further difficulty. Notably, MMRG outperforms the other SW-based methods by a large margin on both TACoS and Charades-STA. Despite using the same moment sampling strategy as CTRL, the multi-modal relational graph that MMRG employs can capture subtle differences between candidate moments from the same video, and its customized self-supervised pre-training tasks further improve the visual features. Apart from MMRG, ACL-K also significantly outperforms the remaining SW-based methods on TACoS and Charades-STA, proving the effectiveness of aligning the activity concepts mined from both the textual and visual modalities. MCN obtains the weakest results on Charades-STA, which shows that its simple multimodal matching and ranking strategy for candidate segments cannot cope with segments at varied and flexible locations. In contrast, CTRL, ACRN, ROLE, SLTA, VAL, ACL-K, and MMRG can adjust candidate segment boundaries based on predicted location offsets, which improves their performance. None of the sliding window-based methods report experiments on the large-scale ActivityNet Captions dataset, likely because multi-scale sliding window sampling is computationally costly.

The proposal-generated (PG) methods achieve even better performance than the SW-based methods, even though the number of proposal candidates decreases. QSPN, with its query-guided segment proposal network and auxiliary captioning loss, significantly outperforms the other two-stage methods (except MMRG) on Charades-STA, demonstrating that the query-guided proposal network provides more effective candidate moments at finer temporal granularity without processing redundant sliding-window-sampled moments. QSPN also reports results on ActivityNet Captions, which comprises richer scenes, and even achieves results competitive with single-stage anchor-based methods, further confirming the effectiveness of captioning supervision and query-guided proposals. Since the videos in Charades-STA are shorter and contain less diverse activities, it is necessary to focus on the metrics with higher IoU thresholds. SAP consistently outperforms the SW-based methods on Charades-STA at a higher IoU threshold, which is attributable to its discriminative generated proposals and additional refinement process.

Single-stage method. For anchor-based (AB) methods, TGN achieves the lowest performance on TACoS and ActivityNet Captions, and CMIN also performs poorly on TACoS. The common inferior accuracy of TGN, CMIN, and CBP may be attributed to their single-stream anchor-based localization frameworks: with sequential RNNs, they fail to reason about complex cross-modal relations on the datasets with longer videos (i.e., TACoS and ActivityNet Captions). Instead of RNN-style anchors, both SCDM and MAN use convolutional neural networks to better capture fine-grained interactions and diverse video contents at different temporal granularities, achieving better performance (e.g., SCDM outperforms TGN/CMIN on TACoS and TGN/CBP on ActivityNet Captions). To make further improvements, 2D-TAN extends this design to 2D feature maps to model the adjacent relations among candidate moments of multiple anchors. SMIN and Zhang et al. [133], which adopt a similar 2D structure for modeling the relationships among candidate moments, also achieve superior results among AB methods on TACoS, Charades-STA, and ActivityNet Captions. Specifically, the model presented by Zhang et al. [133] performs best on TACoS while SMIN surpasses the other methods on Charades-STA, which again proves the effectiveness of 2D moment-relationship modeling. Furthermore, CSMGAN, SMIN, and Zhang et al. [133] all achieve superior results on ActivityNet Captions. Notably, although CSMGAN adopts a sequential RNN similar to TGN, it builds a joint graph for modeling the cross-/self-modal relations, which captures the high-order interactions between the two modalities effectively.
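The 2D feature map idea behind 2D-TAN and its followers can be sketched briefly: cell (i, j) of a T-by-T grid represents the candidate moment spanning clips i through j, so adjacent cells correspond to temporally adjacent moments. Below is a minimal pure-Python illustration; the mean pooling and the function name are our simplifying assumptions, not the authors' implementation:

```python
def build_2d_moment_map(clip_feats):
    """clip_feats: list of T per-clip feature vectors (lists of floats).

    Returns a T x T grid where cell (i, j) holds the mean-pooled feature of
    clips i..j, representing that candidate moment. Cells with j < i are
    invalid (a moment cannot end before it starts) and stay None.
    """
    T = len(clip_feats)
    D = len(clip_feats[0])
    grid = [[None] * T for _ in range(T)]
    for i in range(T):
        for j in range(i, T):
            span = clip_feats[i:j + 1]
            grid[i][j] = [sum(v[d] for v in span) / len(span) for d in range(D)]
    return grid
```

A scoring head can then operate on this grid with 2D convolutions, so each candidate's score is informed by its neighbors, which is the relationship modeling credited above.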

For anchor-free (AF) methods, the overall performance lags slightly behind that of AB methods, especially on the challenging ActivityNet Captions dataset. More specifically, the reading comprehension-inspired methods (ExCL, VSLNet, TMLGA, CPN, DORi, CI-MHA, and PEARL) outperform the other anchor-free methods by a significant gap. However, TMLGA achieves the lowest performance on the metrics R@1,IoU = \(\lbrace 0.3,0.5\rbrace\) on ActivityNet Captions; one possible reason is that the annotation subjectivity of this challenging dataset is the hardest to model. The dense AF methods, including DRN, GDP, and DEBUG, outperform the early sparse regression network ABLR, justifying the importance of increasing the number of positive training samples. Meanwhile, the additional regression-based methods, including PMI, HVTG, and LGI, achieve superior performance on ActivityNet Captions, which may result from more effective interaction between visual and textual contents. An obvious observation is that DORi achieves the highest grounding accuracy (with IoU=0.7) among all single-stage methods on TACoS and Charades-STA, since its spatio-temporal graph can model fine-grained object interactions that change over time. Note that L-Net is not included in the table since the original article [12] did not report the specific experimental values.

Additionally, the other single-stage methods (BPNet, DPIN, and CBLN), which cannot be grouped into either the anchor-based or the anchor-free category, achieve comparable results on three datasets (except DiDeMo). Specifically, CBLN achieves superior performance among all single-stage methods on Charades-STA and ActivityNet Captions, highlighting the advantages of combining the anchor-based and anchor-free schemes as well as its special biaffine-based architecture.

RL-based method. Although the overall performance of RL-based methods cannot reach that of traditional single-stage SOTA methods, they provide brand-new thoughts for addressing the TSGV task, and their sequential decision-making process also improves interpretability. Particularly, TSP-PRL outperforms TripNet and R-W-M on ActivityNet Captions and Charades-STA, which may be attributed to its tree-structured policy design inspired by the coarse-to-fine human decision-making process. STRONG and AVMR achieve the best performance among the RL-based frameworks on TACoS, owing to the effectiveness of spatial RL for scene tracking and the employment of adversarial learning, respectively. R-W-M, TripNet, and SM-RL achieve relatively inferior performance; specifically, SM-RL performs the worst on Charades-STA and TACoS while TripNet performs the worst on ActivityNet Captions.

Weakly supervised method. Since temporal annotations of groundtruth moments are not available at the training stage for weakly supervised (WS) methods, their experimental results on Charades-STA and ActivityNet Captions are understandably not as good as those of the fully supervised methods above. We cannot tell which framework (i.e., MIL-based or reconstruction-based) holds an absolute advantage according to the overall performances. Among all WS methods, however, CCL, VCA, and CRM achieve superior performance on both Charades-STA and ActivityNet Captions, and their results are even competitive with those of some fully supervised methods. Investigating the reasons, one finding is that they all design special training objectives that enable better visual-semantic alignment even without annotated boundary information. Specifically, CCL constructs fine-grained supervision signals from counterfactual results for contrastive training; VCA re-defines the TSGV problem and designs a new loss for visual co-occurrence alignment learning; and CRM suppresses mismatched sentence-moment pairs during training by expanding the scope to the paragraph level, which further considers the temporal ordering between sentences.

DiDeMo evaluation results with particular metrics. As aforementioned, MCN [38] reports results on the DiDeMo dataset under the IoU threshold m=1.0, and some works [11, 63, 128] follow MCN in adopting this metric to assess their models. We additionally list the evaluation results (i.e., R@\(\lbrace 1,5\rbrace\),IoU=1.0 and mIoU) on DiDeMo in Table 4, grouped by supervision manner. Specifically, LoGAN, a WS method, achieves the best performance among both fully and weakly supervised methods, owing to its effective visual-semantic representation learning via a latent graph co-attention network. Another observation is that the top-1 recall values of all fully supervised methods are confined to a small range (22%–27%). This indicates that DiDeMo cannot greatly differentiate the performance of methods, which may result from its limitation of taking predefined segments as groundtruth.
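For reference, the recall metrics used throughout Tables 3 and 4 can be sketched in a few lines of Python; the function names are ours for illustration, and segments are (start, end) pairs in seconds:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(ranked_preds, gt, n, m):
    """R@n,IoU=m: 1 if any of the top-n ranked predictions has IoU >= m with gt."""
    return float(any(temporal_iou(p, gt) >= m for p in ranked_preds[:n]))

def mean_iou(top1_preds, gts):
    """mIoU: average top-1 IoU over all query samples."""
    return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)
```

Under DiDeMo's strict protocol (m = 1.0), a top-ranked prediction counts only if it exactly covers the groundtruth segment, which partly explains the compressed top-1 score range across fully supervised methods.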

Skip 4DISCUSSIONS Section

4 DISCUSSIONS

In this section, we discuss the limitations of current benchmarks and point out several promising research directions for TSGV. Firstly, we divide these limitations into three categories, i.e., temporal annotation biases in public datasets, ambiguous groundtruth annotations, and problematic evaluation metrics. These limitations may heavily mislead TSGV research, since every proposed method is evaluated against these benchmarks. Meanwhile, we also present a couple of recent efforts to address these issues by proposing new datasets/metrics or new methods. Then, we point out some promising research directions for TSGV, covering four typical tasks, i.e., large-scale video corpus moment retrieval, spatio-temporal localization, audio-enhanced localization, and video-language pre-training. We hope these research advances can provide more insights for future TSGV explorations and thus further promote the development of this area.

4.1 Limitations of Current Benchmarks

Despite the promising progress made in TSGV, some recent works [70, 121] question the quality of current datasets and metrics: (1) The joint distributions of the starting and ending timestamps of target video segments are strongly biased and almost identical in the training and test splits of current datasets. Without truly modeling the video and sentence data, and by merely fitting such distribution biases in the training set, some baselines can still achieve good results and even outperform well-designed methods. (2) The annotation of groundtruth segment locations for TSGV is ambiguous and subjective, which may distort model evaluation. (3) Current evaluation metrics are easily deceived by the above annotation biases and cannot measure model performance effectively. Since TSGV research is heavily driven by these datasets and evaluation metrics, such problematic benchmarks can hinder progress and mislead the research direction. In the following, we detail the limitations of existing datasets and evaluation metrics, and present some recent solutions to address these issues.

Annotation distribution biases in datasets. Some recent studies [70, 121] visualize the temporal location distribution of groundtruth segments, finding that the joint distributions of the starting and ending timestamps are nearly identical between the training and test sets, with obvious distribution biases. They design simple model-free methods, for example, a bias-based method [121] that samples locations from the observed training distribution and takes them as the predicted locations of target segments at inference. This bias-based method achieves good performance, even surpassing some well-designed deep models, without any valid visual or textual inputs. Furthermore, Yuan et al. [121] re-organize two benchmark datasets for out-of-distribution testing. They create two different test sets: one following the same temporal location distribution as the training set, namely test-iid, and another with a quite different distribution from the training set, namely test-ood. Comparing the experimental results of various baseline methods on these two test sets, they find that for almost all methods the performance on test-ood drops significantly (c.f., Figure 15), indicating that existing methods are heavily influenced by temporal annotation biases and do not truly model the semantic matching relationship between videos and texts. Thus, it is crucial for future work to construct de-biased datasets and build robust models unaffected by such biases. Recently, there have been some attempts to address this issue. For example, Yang et al. [112] design a causal-inspired framework based on CTRL and 2D-TAN, which attempts to eliminate the spurious correlation between the input and prediction caused by hidden confounders (i.e., the temporal location of moments).
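To make the bias-based baseline concrete, a minimal sketch of the idea could look as follows; the exact sampling scheme in [121] may differ, and the function names here are illustrative:

```python
import random

def fit_location_prior(train_annotations):
    """Collect normalized (start, end) locations of training groundtruth moments.

    train_annotations: list of (start_sec, end_sec, video_duration_sec).
    """
    return [(s / d, e / d) for (s, e, d) in train_annotations]

def biased_predict(prior, video_duration, rng=random):
    """Predict a moment by sampling from the training location prior,
    ignoring both the video content and the sentence query entirely."""
    s, e = rng.choice(prior)
    return s * video_duration, e * video_duration
```

If such a query-blind, content-blind model matches well-designed methods, the test set shares the training set's location bias rather than requiring real cross-modal reasoning.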

Fig. 15.

Fig. 15. Performance of SOTA TSGV methods on re-organized data splits, figure adapted from [121].

Moreover, it is worth noting that some de-biasing works [69, 142] concentrate on other kinds of biases in TSGV rather than the moment annotation distribution biases. Zhou et al. [142] deal with the biases caused by a single style of annotations; their proposed DeNet, equipped with a debiasing mechanism, can produce diverse yet plausible predictions. Nan et al. [69] propose an approach that approximates the latent confounder set distribution based on the theory of causal inference to deconfound the selection biases introduced by datasets (e.g., in these datasets, a person holding a vacuum cleaner appears far more often than a person repairing one).

Ambiguity of groundtruth annotation. One recent study [70] also notes the ambiguous and inconsistent annotations in current TSGV datasets. Annotating the target segment location for a given sentence query is quite subjective: in some cases, one query matches multiple segments in the video, or different annotators make different decisions on the grounded location of the query. Therefore, using only a single groundtruth to evaluate temporal grounding results is problematic. Otani et al. [70] suggest re-annotating the benchmark datasets with multiple groundtruth moments for each sentence query where they exist. As shown in Figure 16, they ask five annotators to re-annotate a video from ActivityNet Captions given the query "a woman is doing somersaults and big jumps alone"; the five re-annotated segments are all different and do not overlap with the original groundtruth segment, confirming the ambiguity and subjectivity of groundtruth annotations. They further present two alternative evaluation metrics that take multiple annotated groundtruth moments into consideration.
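One simple way to account for multiple plausible annotations, offered here as an illustration rather than the exact metrics proposed in [70], is to score each prediction against its best-matching annotation:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def best_match_hit(pred, annotations, m):
    """Count a hit if the prediction reaches IoU >= m with ANY annotator's segment,
    instead of penalizing it for missing one arbitrarily chosen groundtruth."""
    return float(max(temporal_iou(pred, gt) for gt in annotations) >= m)
```

Under this scheme, each of the five re-annotated segments in Figure 16 would be an acceptable answer for the query, rather than only the single original groundtruth.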

Fig. 16.

Fig. 16. The re-annotation example for ActivityNet Captions. The annotators annotate five different positive segments (shown as blue bars), all of which match the given query. While the original groundtruth segment is represented as a grey bar, figure from [70].

Limitation of evaluation metrics. Besides the temporal annotation biases in current datasets, Yuan et al. [121] also find that some characteristics of the datasets may have negative effects on model evaluation. Most previous TSGV methods [11, 58, 108, 124, 134] report their scores at small IoU thresholds like \(m\in \lbrace 0.1,0.3,0.5\rbrace\). However, for ActivityNet Captions, a substantial proportion of groundtruth moments are quite long: statistically, 40%, 20%, and 10% of sentence queries refer to a moment occupying over 30%, 50%, and 70% of the whole video length, respectively. Such annotation biases obviously increase the chance of a correct prediction under small IoU thresholds. As an extreme example, if the groundtruth moment is the whole video, any prediction whose duration exceeds 30% of the video length achieves R@1,IoU=0.3 = 1. Thus, the metric \(\text{R@}n,\text{IoU=}m\) with small \(m\) is unreliable on current biased datasets. To alleviate these effects, they present a new metric, namely \(\text{discounted-R@}n,\text{IoU=}m\). This metric no longer limits the hit score (i.e., \(r(n,m,q_i)\)) of each positive sample \(i\) to \(\lbrace 0,1\rbrace\); instead, it is a real number \(\in [0,1]\) depending on the relative distances between the predicted and groundtruth boundaries. The formal definition for each sample \(i\) is as follows: (14) \(\begin{equation} r(n,m,q_i) = (1-\text{nDis}(g^s_i,p^s_i)) \times (1-\text{nDis}(g^e_i,p^e_i)), \end{equation}\) where the nDis operation calculates the distance between the groundtruth and predicted boundaries normalized to \([0,1]\) by the video length, and \((g_i^s, g_i^e)\)/\((p_i^s, p_i^e)\) indicates the (start, end) timestamps of the groundtruth/predicted segment for sample \(i\).
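The per-sample score in Equation (14) can be implemented directly; this sketch assumes timestamps in seconds and clamps the normalized boundary distance to [0, 1]:

```python
def discounted_hit(gt_start, gt_end, pred_start, pred_end, video_len):
    """r(n, m, q_i) from Eq. (14): the product of (1 - normalized distance)
    between the groundtruth and predicted start/end boundaries."""
    n_dis_s = min(abs(gt_start - pred_start) / video_len, 1.0)
    n_dis_e = min(abs(gt_end - pred_end) / video_len, 1.0)
    return (1.0 - n_dis_s) * (1.0 - n_dis_e)
```

An exact prediction scores 1, and the score decays smoothly as either boundary drifts, so a long prediction that merely overlaps a long groundtruth no longer receives full credit as it would under a small IoU threshold.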

4.2 Promising Research Directions

We point out some promising research directions, covering four typical tasks extended from TSGV.

4.2.1 Large-scale Video Corpus Moment Retrieval.

Large-scale video corpus moment retrieval (VCMR) is a research direction extended from TSGV that has been explored over the past few years [27, 49, 80, 127, 129]. It has greater practical value since it retrieves the target segment semantically corresponding to a given text query from a large-scale video corpus (i.e., a collection of untrimmed and unsegmented videos) rather than from a single video. Compared with TSGV, VCMR imposes higher efficiency requirements since it must not only retrieve a specific segment from a single video but also locate the target video within the corpus.

Escorcia et al. [27] first extend TSGV to VCMR, introducing a model named Clip Alignment with Language (CAL) to align the query feature with a sequence of uniformly partitioned clips for moment composing. Lei et al. [49] introduce a new dataset for VCMR called TVR, which is comprised of videos and their associated subtitle texts. A Cross-modal Moment Localization (XML) network with a novel convolutional start-end detector module is also proposed to produce moment predictions in a late fusion manner. Zhang et al. [127] present a hierarchical multi-modal encoder (HAMMER) to capture both coarse- and fine-grained semantic information from the videos and train the model with three sub-tasks (i.e., video retrieval, segment temporal localization, and masked language modeling). Zhang et al. [129] introduce contrastive learning for VCMR, designing a retrieval and localization network with contrastive learning (ReLoCLNet).

4.2.2 Spatio-temporal Localization.

Spatio-temporal sentence grounding in videos (STSGV) is another extension of TSGV, which localizes the referred object/instance as a continuous spatio-temporal tube (i.e., a sequence of bounding boxes) in an untrimmed video given a natural language description. Since the fine-grained labeling process of localizing a tube (i.e., annotating a spatial region for every frame in a video) for STSGV is labor-intensive and complicated, Chen et al. [21] propose to solve this task in a weakly supervised manner that only needs video-level descriptions, with a newly constructed VID-sentence dataset. Besides, VOGNet [79] addresses the task of video object grounding, which grounds objects in videos referred to by natural language descriptions, and constructs a new dataset called ActivityNet-SRL. Zhang et al. [139] propose a spatio-temporal graph reasoning network (STGRN) for grounding multi-form sentences that depict an object, and construct a new dataset, VidSTG. Tang et al. [92] employ a visual transformer for a similar task, which localizes a spatio-temporal tube of the target person from an untrimmed video based on a given textual description, with a newly constructed HC-STVG dataset. Su et al. [89] further present a new STVGBert framework based on a visual-linguistic transformer to predict object tubes without any pre-trained object detectors.

4.2.3 Audio-enhanced Localization.

The current inputs for TSGV contain only the given sentence and the untrimmed video. However, audio signals are not yet effectively exploited, even though they may provide extra guidance for localization, e.g., the loud noise of electronics used in a kitchen or the cheers of the audience when a football player scores a goal. Such various forms of sound offer auxiliary yet essential clues for more precise localization of the target moments, which has not been explored so far. Moreover, what people say in videos can be converted into text with Automatic Speech Recognition (ASR); the converted text also provides relevant information for the cross-modal alignment between the video and the text query. There have already been many works [39, 111] in the vision-and-language area proving the effectiveness of auxiliary audio for performance improvements. Thus, embedding audio information into the TSGV task is a promising future direction.

4.2.4 Video-language Pre-training.

Video-language pre-training [62, 71] has been proven to improve many downstream text-based video understanding tasks, e.g., video captioning, video question answering, and video retrieval. Therefore, some pioneering works attempt to leverage video-language pre-training to benefit the TSGV task. Xu et al. [110] design boundary-aware proxy tasks to obtain boundary-sensitive video features for downstream localization, which benefits many temporal localization tasks including TAL, TSGV, and step localization. Zeng et al. [126] introduce graph pre-training on their multi-modal relational graph to enhance the visual features with explicit relations, designing node-level and graph-level self-supervised pre-training tasks (i.e., attribute masking and cross-modal context prediction).

Skip 5CONCLUSION Section

5 CONCLUSION

TSGV is a fundamental and challenging task connecting computer vision and natural language processing communities. It is also worth exploring since it can be seen as an intermediate task for some downstream video understanding applications such as video question answering, video summarization and video content retrieval.

In this survey, we take a systematic and insightful overview of the current research progress on the TSGV task by categorizing existing approaches, benchmark datasets, and evaluation metrics. We also provide researchers with the identified limitations of current benchmarks as well as our careful thoughts on promising research directions, aiming at further promoting the development of TSGV. For future work, we suggest that (i) more effort should be devoted to proposing unbiased datasets and reliable metrics to better evaluate new TSGV methods, and (ii) more attention should be paid to models that are robust and generalize well in dynamic scenarios.

REFERENCES

  1. [1] Anderson Peter, He Xiaodong, Buehler Chris, Teney Damien, Johnson Mark, Gould Stephen, and Zhang Lei. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 60776086.Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 24252433.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Bao Peijun, Zheng Qian, and Mu Yadong. 2021. Dense events grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence. 920928.Google ScholarGoogle ScholarCross RefCross Ref
  4. [4] Buch Shyamal, Escorcia Victor, Ghanem Bernard, Fei-Fei Li, and Niebles Juan Carlos. 2017. End-to-end, single-stream temporal action detection in untrimmed videos. In Proceedings of the British Machine Vision Conference 2017.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Heilbron Fabian Caba, Escorcia Victor, Ghanem Bernard, and Niebles Juan Carlos. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961970.Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Cao Da, Zeng Yawen, Liu Meng, He Xiangnan, Wang Meng, and Qin Zheng. 2020. STRONG: Spatio-temporal reinforcement learning for cross-modal video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. [7] Cao Da, Zeng Yawen, Wei Xiaochi, Nie Liqiang, Hong Richang, and Qin Zheng. 2020. Adversarial video moment retrieval by jointly modeling ranking and localization. In Proceedings of the 28th ACM International Conference on Multimedia.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Cha Meeyoung, Kwak Haewoon, Rodriguez Pablo, Ahn Yong-Yeol, and Moon Sue. 2007. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. 114.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Chen Cheng and Gu Xiaodong. 2020. Semantic modulation based residual network for temporal language queries grounding in video. In Proceedings of the International Symposium on Neural Networks. Springer, 119129.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. [10] Chen Danqi, Fisch Adam, Weston Jason, and Bordes Antoine. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 18701879.Google ScholarGoogle ScholarCross RefCross Ref
  11. [11] Chen Jingyuan, Chen Xinpeng, Ma Lin, Jie Zequn, and Chua Tat-Seng. 2018. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 162171.Google ScholarGoogle ScholarCross RefCross Ref
  12. [12] Chen Jingyuan, Ma Lin, Chen Xinpeng, Jie Zequn, and Luo Jiebo. 2019. Localizing natural language in videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 81758182.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. [13] Chen Long, Lu Chujie, Tang Siliang, Xiao Jun, Zhang Dong, Tan Chilie, and Li Xiaolin. 2020. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 1055110558.Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Chen Shaoxiang, Jiang Wenhao, Liu Wei, and Jiang Yu-Gang. 2020. Learning modality interaction for temporal sentence localization and event captioning in videos. In Proceedings of the European Conference on Computer Vision. Springer, 333351.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. [15] Chen Shaoxiang and Jiang Yu-Gang. 2019. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence. 81998206.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Chen Shaoxiang and Jiang Yu-Gang. 2020. Hierarchical visual-textual graph for temporal activity localization via language. In Proceedings of the European Conference on Computer Vision. Springer, 601618.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Chen Shaoxiang and Jiang Yu-Gang. 2021. Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 84258435.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Chen Xinlei, Fang Hao, Lin Tsung-Yi, Vedantam Ramakrishna, Gupta Saurabh, Dollár Piotr, and Zitnick C Lawrence. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325. Retrieved from https://arxiv.org/abs/1504.00325.Google ScholarGoogle Scholar
  19. [19] Chen Yen-Chun, Li Linjie, Yu Licheng, Kholy Ahmed El, Ahmed Faisal, Gan Zhe, Cheng Yu, and Liu Jingjing. 2020. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision. Springer, 104120.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. [20] Chen Zhenfang, Ma Lin, Luo Wenhan, Tang Peng, and Wong Kwan-Yee K.. 2020. Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv:2001.09308. Retrieved from https://arxiv.org/abs/2001.09308.Google ScholarGoogle Scholar
  21. [21] Chen Zhenfang, Ma Lin, Luo Wenhan, and Wong Kwan-Yee Kenneth. 2019. Weakly-supervised spatio-temporally grounding natural sentence in video. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 18841894.Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Cheng Howard and Li Xiaobo. 2000. Partial encryption of compressed images and videos. IEEE Transactions on Signal Processing 48, 8 (2000), 24392451.Google ScholarGoogle ScholarDigital LibraryDigital Library
  [23] Chu Wen-Sheng, Song Yale, and Jaimes Alejandro. 2015. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3584–3592.
  [24] Cui Yuhao, Yu Zhou, Wang Chunqi, Zhao Zhongzhou, Zhang Ji, Wang Meng, and Yu Jun. 2021. ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In Proceedings of the 29th ACM International Conference on Multimedia. 797–806.
  [25] Ding Xinpeng, Wang Nannan, Zhang Shiwei, Cheng De, Li Xiaomeng, Huang Ziyuan, Tang Mingqian, and Gao Xinbo. 2021. Support-set based cross-supervision for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11573–11582.
  [26] Duan Xuguang, Huang Wen-bing, Gan Chuang, Wang Jingdong, Zhu Wenwu, and Huang Junzhou. 2018. Weakly supervised dense event captioning in videos. In Proceedings of the Advances in Neural Information Processing Systems. 3063–3073.
  [27] Escorcia Victor, Soldan Mattia, Sivic Josef, Ghanem Bernard, and Russell Bryan. 2019. Temporal localization of moments in video collections with natural language. arXiv:1907.12763. Retrieved from https://arxiv.org/abs/1907.12763.
  [28] Fan Chenyou, Zhang Xiaofan, Zhang Shu, Wang Wensheng, Zhang Chi, and Huang Heng. 2019. Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1999–2007.
  [29] Gao Jiyang, Sun Chen, Yang Zhenheng, and Nevatia Ram. 2017. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision. 5277–5285.
  [30] Gao Junyu and Xu Changsheng. 2021. Fast video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1523–1532.
  [31] Gao Mingfei, Davis Larry, Socher Richard, and Xiong Caiming. 2019. WSLLN: Weakly supervised natural language localization networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 1481–1487.
  [32] Ge Runzhou, Gao Jiyang, Chen Kan, and Nevatia Ram. 2019. MAC: Mining activity concepts for language-based temporal localization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision. IEEE, 245–253.
  [33] Ghosh Soham, Agarwal Anuva, Parekh Zarana, and Hauptmann Alexander. 2019. ExCL: Extractive clip localization using natural language descriptions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1984–1990.
  [34] Girshick Ross B., Donahue Jeff, Darrell Trevor, and Malik Jitendra. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
  [35] Hahn Meera, Kadav Asim, Rehg James M., and Graf Hans Peter. 2020. Tripping through time: Efficient localization of activities in videos. In Proceedings of the 31st British Machine Vision Conference.
  [36] He Dongliang, Zhao Xiang, Huang Jizhou, Li Fu, Liu Xiao, and Wen Shilei. 2019. Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 8393–8400.
  [37] Hendricks Lisa Anne, Wang Oliver, Shechtman Eli, Sivic Josef, Darrell Trevor, and Russell Bryan. 2018. Localizing moments in video with temporal language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1380–1390.
  [38] Hendricks Lisa Anne, Wang Oliver, Shechtman Eli, Sivic Josef, Darrell Trevor, and Russell Bryan C.. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision. 5804–5813.
  [39] Hori Chiori, AlAmri Huda, Wang Jue, Wichern Gordon, Hori Takaaki, Cherian Anoop, Marks Tim K., Cartillier Vincent, Lopes Raphael Gontijo, Das Abhishek, Essa Irfan, Batra Dhruv, and Parikh Devi. 2019. End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2352–2356.
  [40] Hu Ronghang, Xu Huazhe, Rohrbach Marcus, Feng Jiashi, Saenko Kate, and Darrell Trevor. 2016. Natural language object retrieval. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4555–4564.
  [41] Huang Jiabo, Liu Yang, Gong Shaogang, and Jin Hailin. 2021. Cross-sentence temporal and semantic relations in video activity localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7199–7208.
  [42] Jiang Bin, Huang Xin, Yang Chao, and Yuan Junsong. 2019. Cross-modal video moment retrieval with spatial and language-temporal attention. In Proceedings of the 2019 International Conference on Multimedia Retrieval. 217–225.
  [43] Jiao Yifan, Li Zhetao, Huang Shucheng, Yang Xiaoshan, Liu Bin, and Zhang Tianzhu. 2018. Three-dimensional attention-based deep ranking model for video highlight detection. IEEE Transactions on Multimedia 20, 10 (2018), 2693–2705.
  [44] Karaman Svebor, Seidenari Lorenzo, and Bimbo Alberto Del. 2014. Fast saliency based pooling of fisher encoded dense trajectories. In Proceedings of the ECCV THUMOS Workshop.
  [45] Kazemzadeh Sahar, Ordonez Vicente, Matten Mark, and Berg Tamara. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 787–798.
  [46] Kipf Thomas N. and Welling Max. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations.
  [47] Krishna Ranjay, Hata Kenji, Ren Frederic, Fei-Fei Li, and Niebles Juan Carlos. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706–715.
  [48] Lei Jie, Yu Licheng, Bansal Mohit, and Berg Tamara L.. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  [49] Lei Jie, Yu Licheng, Berg Tamara L., and Bansal Mohit. 2020. TVR: A large-scale dataset for video-subtitle moment retrieval. In Proceedings of the 16th European Conference on Computer Vision. Springer, 447–463.
  [50] Li Xiangpeng, Song Jingkuan, Gao Lianli, Liu Xianglong, Huang Wenbing, He Xiangnan, and Gan Chuang. 2019. Beyond RNNs: Positional self-attention with co-attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. 8658–8665.
  [51] Lin Tianwei, Zhao Xu, and Shou Zheng. 2017. Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference. 988–996.
  [52] Lin Zhijie, Zhao Zhou, Zhang Zhu, Wang Qi, and Liu Huasheng. 2020. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the AAAI Conference on Artificial Intelligence. 11539–11546.
  [53] Liu Bingbin, Yeung Serena, Chou Edward, Huang De-An, Fei-Fei Li, and Niebles Juan Carlos. 2018. Temporal modular networks for retrieving complex compositional activities in videos. In Proceedings of the European Conference on Computer Vision. 552–568.
  [54] Liu Daizong, Qu Xiaoye, Dong Jianfeng, and Zhou Pan. 2020. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In Proceedings of the 28th International Conference on Computational Linguistics. 1841–1851.
  [55] Liu Daizong, Qu Xiaoye, Dong Jianfeng, Zhou Pan, Cheng Yu, Wei Wei, Xu Zichuan, and Xie Yulai. 2021. Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11235–11244.
  [56] Liu Daizong, Qu Xiaoye, Liu Xiao-Yang, Dong Jianfeng, Zhou Pan, and Xu Zichuan. 2020. Jointly cross- and self-modal graph attention network for query-based moment localization. In Proceedings of the 28th ACM International Conference on Multimedia. 4070–4078.
  [57] Liu Meng, Wang Xiang, Nie Liqiang, He Xiangnan, Chen Baoquan, and Chua Tat-Seng. 2018. Attentive moment retrieval in videos. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 15–24.
  [58] Liu Meng, Wang Xiang, Nie Liqiang, Tian Qi, Chen Baoquan, and Chua Tat-Seng. 2018. Cross-modal moment localization in videos. In Proceedings of the 2018 ACM Multimedia Conference. 843–851.
  [59] Liu Xinfang, Nie Xiushan, Tan Zhifang, Guo Jie, and Yin Yilong. 2021. A survey on natural language video localization. arXiv:2104.00234. Retrieved from https://arxiv.org/abs/2104.00234.
  [60] Lu Chujie, Chen Long, Tan Chilie, Li Xiaolin, and Xiao Jun. 2019. DEBUG: A dense bottom-up grounding approach for natural language video localization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 5144–5153.
  [61] Lu Jiasen, Batra Dhruv, Parikh Devi, and Lee Stefan. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019), 13–23.
  [62] Luo Jianjie, Li Yehao, Pan Yingwei, Yao Ting, Chao Hongyang, and Mei Tao. 2021. CoCo-BERT: Improving video-language pre-training with contrastive cross-modal matching and denoising. In Proceedings of the 29th ACM International Conference on Multimedia. 5600–5608.
  [63] Ma Minuk, Yoon Sunjae, Kim Junyeong, Lee Youngjoon, Kang Sunghun, and Yoo Chang D.. 2020. VLANet: Video-language alignment network for weakly-supervised video moment retrieval. In Proceedings of the European Conference on Computer Vision. Springer, 156–171.
  [64] Ma Shugao, Sigal Leonid, and Sclaroff Stan. 2016. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1942–1950.
  [65] Ma Yu-Fei, Lu Lie, Zhang Hong-Jiang, and Li Mingjing. 2002. A user attention model for video summarization. In Proceedings of the 10th ACM International Conference on Multimedia. 533–542.
  [66] Mahasseni Behrooz, Lam Michael, and Todorovic Sinisa. 2017. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 202–211.
  [67] Mithun Niluthpol Chowdhury, Paul Sujoy, and Roy-Chowdhury Amit K.. 2019. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11592–11601.
  [68] Mun Jonghwan, Cho Minsu, and Han Bohyung. 2020. Local-global video-text interactions for temporal grounding. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10807–10816.
  [69] Nan Guoshun, Qiao Rui, Xiao Yao, Liu Jun, Leng Sicong, Zhang Hao, and Lu Wei. 2021. Interventional video grounding with dual contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2765–2775.
  [70] Otani Mayu, Nakashima Yuta, Rahtu Esa, and Heikkilä Janne. 2020. Uncovering hidden challenges in query-based video moment retrieval. In Proceedings of the 31st British Machine Vision Conference.
  [71] Pan Yingwei, Li Yehao, Luo Jianjie, Xu Jun, Yao Ting, and Mei Tao. 2020. Auto-captions on GIF: A large-scale video-sentence dataset for vision-language pre-training. arXiv:2007.02375. Retrieved from https://arxiv.org/abs/2007.02375.
  [72] Pan Yingwei, Mei Tao, Yao Ting, Li Houqiang, and Rui Yong. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4594–4602.
  [73] Pan Yingwei, Yao Ting, Li Houqiang, and Mei Tao. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6504–6512.
  [74] Qu Xiaoye, Tang Pengwei, Zou Zhikang, Cheng Yu, Dong Jianfeng, Zhou Pan, and Xu Zichuan. 2020. Fine-grained iterative attention network for temporal language localization in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 4280–4288.
  [75] Regneri Michaela, Rohrbach Marcus, Wetzel Dominikus, Thater Stefan, Schiele Bernt, and Pinkal Manfred. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics 1 (2013), 25–36.
  [76] Rodriguez Cristian, Marrese-Taylor Edison, Saleh Fatemeh Sadat, Li Hongdong, and Gould Stephen. 2020. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2464–2473.
  [77] Rodriguez-Opazo Cristian, Marrese-Taylor Edison, Fernando Basura, Li Hongdong, and Gould Stephen. 2021. DORi: Discovering object relationships for moment localization of a natural language query in a video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1079–1088.
  [78] Rohrbach Marcus, Regneri Michaela, Andriluka Mykhaylo, Amin Sikandar, Pinkal Manfred, and Schiele Bernt. 2012. Script data for attribute-based recognition of composite activities. In Proceedings of the European Conference on Computer Vision. Springer, 144–157.
  [79] Sadhu Arka, Chen Kan, and Nevatia Ram. 2020. Video object grounding using semantic roles in language description. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10414–10424.
  [80] Shao Dian, Xiong Yu, Zhao Yue, Huang Qingqiu, Qiao Yu, and Lin Dahua. 2018. Find and focus: Retrieve and localize video events with natural language queries. In Proceedings of the European Conference on Computer Vision. 200–216.
  [81] Sharghi Aidean, Laurel Jacob S., and Gong Boqing. 2017. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4788–4797.
  [82] Shou Zheng, Wang Dongang, and Chang Shih-Fu. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
  [83] Sigurdsson Gunnar A., Varol Gül, Wang Xiaolong, Farhadi Ali, Laptev Ivan, and Gupta Abhinav. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision. Springer, 510–526.
  [84] Singh Bharat, Marks Tim K., Jones Michael J., Tuzel Oncel, and Shao Ming. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 1961–1970.
  [85] Soldan Mattia, Xu Mengmeng, Qu Sisi, Tegner Jesper, and Ghanem Bernard. 2021. VLG-Net: Video-language graph matching network for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3224–3234.
  [86] Song Xiaomeng and Han Yahong. 2018. VAL: Visual-attention action localizer. In Proceedings of the Pacific Rim Conference on Multimedia. Springer, 340–350.
  [87] Song Yijun, Wang Jingwen, Ma Lin, Yu Zhou, and Yu Jun. 2020. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv:2003.07048. Retrieved from https://arxiv.org/abs/2003.07048.
  [88] Stroud Jonathan C., McCaffrey Ryan, Mihalcea Rada, Deng Jia, and Russakovsky Olga. 2019. Compositional temporal visual grounding of natural language event descriptions. arXiv:1912.02256. Retrieved from https://arxiv.org/abs/1912.02256.
  [89] Su Rui, Yu Qian, and Xu Dong. 2021. STVGBert: A visual-linguistic transformer based framework for spatio-temporal video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1533–1542.
  [90] Tan Hao and Bansal Mohit. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
  [91] Tan Reuben, Xu Huijuan, Saenko Kate, and Plummer Bryan A.. 2021. LoGAN: Latent graph co-attention network for weakly-supervised video moment retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2083–2092.
  [92] Tang Zongheng, Liao Yue, Liu Si, Li Guanbin, Jin Xiaojie, Jiang Hongxu, Yu Qian, and Xu Dong. 2021. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology (2021).
  [93] Tellex Stefanie and Roy Deb. 2009. Towards surveillance video search by natural language query. In Proceedings of the ACM International Conference on Image and Video Retrieval. 1–8.
  [94] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998–6008.
  [95] Wang Hao, Zha Zheng-Jun, Chen Xuejin, Xiong Zhiwei, and Luo Jiebo. 2020. Dual path interaction network for video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia. 4116–4124.
  [96] Wang Hao, Zha Zheng-Jun, Li Liang, Liu Dong, and Luo Jiebo. 2021. Structured multi-level interaction network for video moment localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7026–7035.
  [97] Wang Jingwen, Ma Lin, and Jiang Wenhao. 2020. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence. 12168–12175.
  [98] Wang Liwei, Huang Jing, Li Yin, Xu Kun, Yang Zhengyuan, and Yu Dong. 2021. Improving weakly supervised visual grounding by contrastive knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14090–14100.
  [99] Wang Limin, Qiao Yu, and Tang Xiaoou. 2014. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge 1, 2 (2014), 2.
  [100] Wang Weining, Huang Yan, and Wang Liang. 2019. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 334–343.
  [101] Wang Yuechen, Zhou Wengang, and Li Houqiang. 2021. Fine-grained semantic alignment network for weakly supervised temporal language grounding. In Proceedings of the Findings of the Association for Computational Linguistics. 89–99.
  [102] Wang Zheng, Chen Jingjing, and Jiang Yu-Gang. 2021. Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 1459–1468.
  [103] Wu Aming and Han Yahong. 2018. Multi-modal circulant fusion for video-to-language and backward. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 1029–1035.
  [104] Wu Jie, Li Guanbin, Han Xiaoguang, and Lin Liang. 2020. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia. 1283–1291.
  [105] Wu Jie, Li Guanbin, Liu Si, and Lin Liang. 2020. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence. 12386–12393.
  [106] Xiao Shaoning, Chen Long, Zhang Songyang, Ji Wei, Shao Jian, Ye Lu, and Xiao Jun. 2021. Boundary proposal network for two-stage natural language video localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 2986–2994.
  [107] Xu Dejing, Zhao Zhou, Xiao Jun, Wu Fei, Zhang Hanwang, He Xiangnan, and Zhuang Yueting. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia. 1645–1653.
  [108] Xu Huijuan, He Kun, Plummer Bryan A., Sigal Leonid, Sclaroff Stan, and Saenko Kate. 2019. Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence. 9062–9069.
  [109] Xu Jun, Mei Tao, Yao Ting, and Rui Yong. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
  [110] Xu Mengmeng, Pérez-Rúa Juan-Manuel, Escorcia Victor, Martinez Brais, Zhu Xiatian, Zhang Li, Ghanem Bernard, and Xiang Tao. 2021. Boundary-sensitive pre-training for temporal localization in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7220–7230.
  [111] Xu Yuecong, Yang Jianfei, and Mao Kezhi. 2019. Semantic-filtered soft-split-aware video captioning with audio-augmented feature. Neurocomputing 357, 12 (2019), 24–35.
  [112] Yang Xun, Feng Fuli, Ji Wei, Wang Meng, and Chua Tat-Seng. 2021. Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1–10.
  [113] Yang Yulan, Li Zhaohui, and Zeng Gangyan. 2020. A survey of temporal activity localization via language in untrimmed videos. In Proceedings of the 2020 International Conference on Culture-oriented Science & Technology. IEEE, 596–601.
  [114] Yao Ting, Mei Tao, and Rui Yong. 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 982–990.
  [115] Yao Ting, Pan Yingwei, Li Yehao, and Mei Tao. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. 684–699.
  [116] Yao Ting, Pan Yingwei, Li Yehao, Qiu Zhaofan, and Mei Tao. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. 4894–4902.
  [117] Yeung Serena, Russakovsky Olga, Mori Greg, and Fei-Fei Li. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2678–2687.
  [118] Yu Licheng, Lin Zhe, Shen Xiaohui, Yang Jimei, Lu Xin, Bansal Mohit, and Berg Tamara L.. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1307–1315.
  [119] Yu Xinli, Malmir Mohsen, He Xin, Chen Jiangning, Wang Tong, Wu Yue, Liu Yue, and Liu Yang. 2021. Cross interaction network for natural language guided video moment retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1860–1864.
  [120] Yu Zhou, Yu Jun, Cui Yuhao, Tao Dacheng, and Tian Qi. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6281–6290.
  [121] Yuan Yitian, Lan Xiaohan, Wang Xin, Chen Long, Wang Zhi, and Zhu Wenwu. 2021. A closer look at temporal sentence grounding in videos: Dataset and metric. In Proceedings of the 2nd International Workshop on Human-centric Multimedia Analysis. 13–21.
  [122] Yuan Yitian, Ma Lin, Wang Jingwen, Liu Wei, and Zhu Wenwu. 2019. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In Proceedings of the Advances in Neural Information Processing Systems. 534–544.
  [123] Yuan Yitian, Mei Tao, Cui Peng, and Zhu Wenwu. 2017. Video summarization by learning deep side semantic embedding. IEEE Transactions on Circuits and Systems for Video Technology 29, 1 (2017), 226–237.
  [124] Yuan Yitian, Mei Tao, and Zhu Wenwu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence. 9159–9166.
  [125] Zeng Runhao, Xu Haoming, Huang Wenbing, Chen Peihao, Tan Mingkui, and Gan Chuang. 2020. Dense regression network for video grounding. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10284–10293.
  [126] Zeng Yawen, Cao Da, Wei Xiaochi, Liu Meng, Zhao Zhou, and Qin Zheng. 2021. Multi-modal relational graph for cross-modal video moment retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2215–2224.
  [127] Zhang Bowen, Hu Hexiang, Lee Joonseok, Zhao Ming, Chammas Sheide, Jain Vihan, Ie Eugene, and Sha Fei. 2020. A hierarchical multi-modal encoder for moment localization in video corpus. arXiv:2011.09046. Retrieved from https://arxiv.org/abs/2011.09046.
  [128] Zhang Da, Dai Xiyang, Wang Xin, Wang Yuan-Fang, and Davis Larry S.. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1247–1257.
  [129] Zhang Hao, Sun Aixin, Jing Wei, Nan Guoshun, Zhen Liangli, Zhou Joey Tianyi, and Goh Rick Siow Mong. 2021. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  [130] Zhang Hao, Sun Aixin, Jing Wei, and Zhou Joey Tianyi. 2020. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6543–6554.
  [131] Zhang Ke, Grauman Kristen, and Sha Fei. 2018. Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision. 383–399.
  [132] Zhang Lingyu and Radke Richard J.. 2022. Natural language video moment localization through query-controlled temporal convolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 682–690.
  [133] Zhang Mingxing, Yang Yang, Chen Xinghan, Ji Yanli, Xu Xing, Li Jingjing, and Shen Heng Tao. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12669–12678.
  [134] Zhang Songyang, Peng Houwen, Fu Jianlong, and Luo Jiebo. 2020. Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence. 12870–12877.
  [135] Zhang Songyang, Su Jinsong, and Luo Jiebo. 2019. Exploiting temporal relationships in video moment localization with natural language. In Proceedings of the 27th ACM International Conference on Multimedia. 1230–1238.
  [136] Zhang Zhu, Lin Zhijie, Zhao Zhou, and Xiao Zhenxin. 2019. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 655–664.
  [137] Zhang Zhu, Lin Zhijie, Zhao Zhou, Zhu Jieming, and He Xiuqiang. 2020. Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 4098–4106.
  [138] Zhang Zhu, Zhao Zhou, Lin Zhijie, He Xiuqiang, and Zhu Jieming. 2020. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Advances in Neural Information Processing Systems 33 (2020), 18123–18134.
  [139] Zhang Zhu, Zhao Zhou, Zhao Yang, Wang Qi, Liu Huasheng, and Gao Lianli. 2020. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10665–10674.
  [140] Zhao Bin, Li Xuelong, and Lu Xiaoqiang. 2018. HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7405–7414.
  [141] Zhao Yang, Zhao Zhou, Zhang Zhu, and Lin Zhijie. 2021. Cascaded prediction network via segment tree for temporal video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4197–4206.
  [142] Zhou Hao, Zhang Chongyang, Luo Yan, Chen Yanjun, and Hu Chuanping. 2021. Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8445–8454.

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2, March 2023, 540 pages.
ISSN: 1551-6857 | EISSN: 1551-6865 | DOI: 10.1145/3572860
Editor: Abdulmotaleb El Saddik

Copyright © 2023 Copyright held by the owner/author(s).

Publisher
Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 16 September 2021
• Revised: 7 March 2022
• Accepted: 17 April 2022
• Online AM: 20 May 2022
• Published: 6 February 2023
