Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement

Video Corpus Moment Retrieval (VCMR) is a new video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a text query. The relevance between the video and the query is partial, mainly evident in two aspects: (1) Scope: the untrimmed video contains many frames, but not all are relevant to the query; strong relevance is typically observed only within the relevant moment. (2) Modality: the relevance of the query varies with different modalities; action descriptions align more with visual elements, while character conversations are more related to textual information. Existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement specialized partial relevance enhancement strategies. For video retrieval, we introduce a multi-modal collaborative video retriever, which generates different query representations for the two modalities by modality-specific pooling, ensuring a more effective match. For moment localization, we propose a focus-then-fuse moment localizer, which utilizes modality-specific gates to capture essential content. We also introduce relevant content-enhanced training methods for both the retriever and the localizer to strengthen the model's ability to capture relevant content. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art for VCMR. The code is available at \url{https://github.com/hdy007007/PREM}.


INTRODUCTION
The rise of video-sharing applications has led to a dramatic increase in videos on the Internet. With such a large corpus of videos, it is vital to help users find the information they need efficiently. Many video search engines accept text queries from users to retrieve videos, but the retrieved videos are usually the original full videos in the corpus. If users are looking for fine-grained information, such as a classic shot from a movie or a key step in a cooking video, these search engines can hardly meet the demand. We have the opportunity to address this challenge thanks to the recently proposed Video Corpus Moment Retrieval (VCMR) [13, 23] task, which requires retrieving a video moment via a natural language query from a collection of untrimmed videos, where the moment is a temporal segment of a video. The task can be decomposed into two sub-tasks: Video Retrieval (VR), which aims to retrieve the relevant video that potentially contains the target moment using a text query, and Single Video Moment Retrieval (SVMR), which requires localizing the target moment semantically related to the query in the retrieved video.
The uniqueness of the VCMR task, compared to typical text-video retrieval [15], stems from the information richness inherent in video. Unlike typical text-video retrieval, where the video is precisely trimmed to align entirely with the text query, VCMR involves untrimmed video that contains a wealth of information, with only a small fraction being directly relevant to the query. The partial relevance between the query and the video is evident in two key aspects: (1) Scope of the relevant content: not all video content is related to the query, and explicitly relevant content is often limited to the context of the target moment, as illustrated in Figure 1. In the TVR [23] dataset, the average duration of the relevant moment for a query constitutes only 11% of the average total video duration. Therefore, it is essential to enhance the retrieval model's capability to capture this limited yet significant relevant content in the video. (2) Discrepancy in relevance across modalities: the video contains multi-modal information, such as visual information and textual information (subtitles), and each modality exhibits distinct query relevance. For instance, "House shows his team a picture of the patient" in the query of Figure 1 predominantly pertains to visual information, depicting an action. Conversely, "they have concluded that maybe the two are not related by blood" leans more towards the textual modality, involving a conversation between characters. As Wang et al. [40] emphasize, recognizing the varying query relevance across different modalities is crucial for better multi-modal alignment in multi-modal retrieval.

For these reasons, we argue that capturing the content of both the relevant scope and the relevant modality in the video can help moment retrieval. However, existing methods for VCMR treat content across all positions and modalities in the video equally. This treatment yields a model incapable of prioritizing relevant content, resulting in sub-optimal moment retrieval. To address this, we propose a Partial Relevance Enhanced Model (PREM) for the VCMR task. We employ distinct partial relevance enhancement strategies for the two sub-tasks to align with their respective objectives. Specifically, for VR, we introduce a multi-modal collaborative video retriever that contains a modality-specific pooling component. This component generates different query representations for each modality, ensuring a comprehensive match between the query and the content of the two modalities in the video. For SVMR, we propose a focus-then-fuse moment localizer. In this localizer, features from the two modalities are fed to modality-specific gates to capture query-relevant elements. Following this, a fine-grained multi-modal fusion is employed for accurate moment localization. To strengthen the model's emphasis on content within the scope relevant to the query, we introduce relevant content-enhanced contrastive learning and adversarial training for the two modules, respectively.
We evaluate the proposed model on two benchmarks, TVR [23] and DiDeMo [1]. The results show the effectiveness of PREM, achieving new state-of-the-art results.
Our contributions are as follows:
• We propose a partial relevance enhanced model for the VCMR task, which encourages the model to capture query-related content within the untrimmed video.
• We introduce a multi-modal collaborative video retriever for VR and a focus-then-fuse moment localizer for SVMR, ensuring a thorough match between the query and the two modalities of the video. Two relevant content-enhanced training objectives are employed to emphasize content within the query-relevant scope of the video.
• Experimental results on TVR [23] and DiDeMo [1] show that the proposed model outperforms the baselines, achieving new state-of-the-art results.

RELATED WORK
We first provide a brief overview of works related to the two sub-tasks of VCMR, followed by a detailed introduction to recent works for VCMR.
Text-video retrieval is a typical cross-modal retrieval task that aims to retrieve related videos from a corpus based on a text query. While similar to video retrieval in the VCMR task, the distinction lies in the fact that in text-video retrieval, most of the video content is relevant to the query, whereas in VCMR, only a small segment is relevant. Works on text-video retrieval generally fall into two categories based on the interaction mode between the query and the video, namely late fusion and early fusion. Late-fusion methods [12, 15, 32, 34, 45] employ separate encoders for text and video to map them into a shared semantic space. These models exhibit high efficiency when video representations are pre-computed and indexed offline, as only the query representation and the query-video similarities need to be computed during inference. On the other hand, early-fusion methods [9, 17, 20, 33, 41, 42] adopt fine-grained cross-modal interactions using attention mechanisms [2, 38], enhancing retrieval accuracy. However, these methods face a trade-off between retrieval efficiency and accuracy, as efficiency is constrained by the need to perform all query-video relevance computations online.
Temporal language grounding is a task similar to SVMR, which requires localizing a moment relevant to a given text query in a video. Temporal language grounding can be seen as a special case of VCMR with only one video in the corpus for each query. According to the way the moment is predicted, existing works on temporal language grounding can be divided into proposal-based and proposal-free methods. Proposal-based methods [5, 8, 26, 27, 43, 44, 51, 54] begin by generating several candidate proposals. These candidates are then scored based on their alignment with the query, and the proposal with the strongest alignment is chosen as the final answer. Proposal-free methods [6, 7, 16, 24, 48, 49, 53] predict the start and end times of the moment directly, bypassing the need to pre-extract proposals as candidates. Recently, several studies [3, 22, 28, 31] have introduced DETR [4]-based grounding models, which streamline the complex post-processing steps of earlier methods. It is impractical to apply a temporal language grounding model to predict moments across all videos for VCMR, given the immense computation involved. A video retrieval module is needed to narrow the videos down to a very small set.
Video corpus moment retrieval was proposed by Escorcia et al. [13], and Lei et al. [23] then introduced TVR [23], a new dataset specifically for VCMR. Existing methods for VCMR fall into two categories based on how they address the learning of the two sub-tasks, namely one-stage [23, 25, 46, 50, 52] and two-stage methods [10, 19, 55]. One-stage methods treat VCMR as a multi-task learning problem, employing a shared backbone with two distinct heads to simultaneously learn the VR and SVMR tasks. One-stage methods can be further categorized into late-fusion and early-fusion methods, similar to the taxonomy in the text-video retrieval task. XML [23], ReLoCLNet [52], and HERO [25] are late-fusion models. XML is the first model proposed for the VCMR task, and ReLoCLNet enhances the performance of the late-fusion model with contrastive learning. HERO is a video-language pre-trained model, which significantly improves retrieval accuracy. HAMMER [50] and SQuiDNet [46] are early-fusion methods. HAMMER employs a hierarchical attention mechanism for deep interactions between the query and the video. SQuiDNet leverages causal reasoning to reduce harmful retrieval bias in video retrieval. On the other hand, two-stage approaches leverage specialized modules for each sub-task, combining the benefits of both late-fusion and early-fusion methods. They employ a late-fusion model as the video retrieval module for fast video retrieval and an early-fusion model as the moment localization module for accurate moment localization. CONQUER [19], DF-MAT [55], and CKCN [10] are two-stage models. The three models use the video retrieval head of the trained HERO as the video retriever and propose early-fusion moment localizers. CONQUER proposes a moment localizer based on context-query attention (CQA) [47]. DF-MAT introduces a moment localizer with a multi-scale deformable attention module for multi-granularity feature fusion, while CKCN proposes a calibration network to improve cross-modal interaction. Our proposed PREM falls into the two-stage category. Unlike other works for VCMR that treat all locations and modalities in the untrimmed video equally, our model prioritizes content related to the query through modality-specific modules and relevant content-enhanced training.

METHOD
In this section, we introduce our proposed model, beginning with the formulation of the VCMR task. We then describe feature extraction. Subsequently, we provide details of the video retrieval module and the moment localization module. Finally, we describe the training and inference processes of the model.

Task Formulation
Given a corpus of videos $\mathcal{V} = \{v_1, v_2, \ldots, v_{|\mathcal{V}|}\}$, where $v_i$ is the $i$-th video containing $|v_i|$ non-overlapping clips, the goal of the VCMR task is to retrieve the most relevant moment $m^*$ from $\mathcal{V}$ using a text query $q = \{q_1, q_2, \ldots, q_{|q|}\}$:
$$m^* = \operatorname*{arg\,max}_{m} P(m \mid q, \mathcal{V}),$$
where the moment is a temporal segment $[\tau_{st}, \tau_{ed}]$ in a video $v^*$. VCMR can be decomposed into two sub-tasks, VR and SVMR. The goal of VR is to retrieve the video that contains the target moment:
$$v^* = \operatorname*{arg\,max}_{v \in \mathcal{V}} P(v \mid q, \mathcal{V}),$$
and SVMR aims to localize the moment in the retrieved video:
$$m^* = \operatorname*{arg\,max}_{m} P(m \mid v^*, q).$$
The prediction of the target moment depends on the probabilities of the start and end positions:
$$P(m \mid v^*, q) = P(\tau_{st} \mid v^*, q) \, P(\tau_{ed} \mid v^*, q).$$
We use a video retriever to model $P(v \mid q, \mathcal{V})$ and a moment localizer to model $P(m \mid v^*, q)$, shown in Figure 2 and Figure 4, respectively.

Feature Extraction
We leverage pre-trained networks to extract the initial input features for our model. For the query, we utilize a pre-trained RoBERTa [29] to extract the token feature $\mathbf{w}_i$ of each word. For the video, we extract clip-level features of images and subtitles. In particular, the image feature of a clip is obtained by max-pooling the image features within a short duration (1.5 seconds); each image feature is extracted by pre-trained SlowFast [14] and ResNet [18] (the two features are concatenated). The subtitle feature of a clip is obtained by max-pooling the token features extracted by RoBERTa within the clip. Thus, a clip consists of an image feature and a subtitle feature, $c_i = \{I_i, s_i\}$; if the clip contains no subtitle, $s_i$ is set to a vector of zeros. The features are mapped to a semantic space $\mathbb{R}^d$ by fully-connected layers. In the remainder of the paper, we use bold notation to denote vectors.
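As a concrete illustration, the clip-level max-pooling described above can be sketched in a few lines of PyTorch (a minimal sketch; the feature dimensionality, clip duration, and FPS follow the implementation details below, while the padding strategy and function name are our own):

```python
import torch

def clip_features(frame_feats: torch.Tensor, frames_per_clip: int = 4) -> torch.Tensor:
    """Max-pool per-frame features into clip-level features.

    frame_feats: (num_frames, d) image features, e.g., 4352-D SlowFast+ResNet.
    frames_per_clip: frames covered by one 1.5-second clip (4-5 frames at 3 FPS).
    """
    num_frames, d = frame_feats.shape
    # pad with -inf so the sequence splits evenly and padding never wins the max
    pad = (-num_frames) % frames_per_clip
    if pad:
        filler = frame_feats.new_full((pad, d), float("-inf"))
        frame_feats = torch.cat([frame_feats, filler])
    clips = frame_feats.view(-1, frames_per_clip, d)
    return clips.max(dim=1).values  # (num_clips, d)
```

Subtitle clip features are obtained analogously by max-pooling the RoBERTa token features of the tokens falling within each clip.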

Multi-modal Collaborative Video Retriever
The goal of the video retriever is to select the top-$K$ videos $\mathcal{V}^*$ from the corpus $\mathcal{V}$ given the query $q$ based on relevance, where $K$ is much smaller than $|\mathcal{V}|$. The retriever in our proposed model is a late-fusion architecture comprising two encoders: one for the query and another for the video. To capture the relevance between the query and the different modalities within the video, we introduce modality-specific pooling, which produces a distinct query representation for each modality. Furthermore, we utilize relevant content-enhanced contrastive learning to improve the retriever's capability to capture content within the query-related scope of the video.

Query Encoder. The token features of a query with positional embeddings are fed to a one-layer Transformer to output contextual word representations $\tilde{\mathbf{w}}_i$. Because the words that semantically match each modality may differ, as shown in Figure 1, we employ modality-specific pooling to generate two modality-specific query representations for query-image relevance and query-subtitle relevance, denoted as $\mathbf{q}_v$ and $\mathbf{q}_t$. Specifically, we first calculate the likelihood of each word belonging to a modality, then determine weights based on the likelihood, and finally conduct a weighted summation of the word representations as the modality-specific representation:
$$o_i^m = \mathrm{FC}_m(\tilde{\mathbf{w}}_i), \quad a_i^m = \frac{\exp(o_i^m)}{\sum_{j=1}^{|q|} \exp(o_j^m)}, \quad \mathbf{q}_m = \sum_{i=1}^{|q|} a_i^m \tilde{\mathbf{w}}_i,$$
where $\mathrm{FC}_m \in \mathbb{R}^{d \times 1}$ is a fully-connected layer that outputs a scalar $o_i^m$ representing the likelihood, $m \in \{v, t\}$ denotes the visual or textual modality, $a_i^m$ is the softmax-normalized weight of the $i$-th word, and $\mathbf{q}_m$ is the modality-specific representation for vision or text.
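A minimal PyTorch sketch of this modality-specific pooling follows; the module name and the use of a ModuleDict are illustrative, but the scoring FC layer, softmax weighting, and weighted sum mirror the formulation above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySpecificPooling(nn.Module):
    """Pool contextual word representations into one query vector per modality."""

    def __init__(self, d: int, modalities=("visual", "textual")):
        super().__init__()
        # FC_m in R^{d x 1}: one scalar scorer per modality
        self.scorers = nn.ModuleDict({m: nn.Linear(d, 1) for m in modalities})

    def forward(self, words: torch.Tensor) -> dict:
        # words: (num_words, d) outputs of the one-layer query Transformer
        reps = {}
        for m, scorer in self.scorers.items():
            logits = scorer(words).squeeze(-1)                 # o_i^m
            weights = F.softmax(logits, dim=0)                 # a_i^m
            reps[m] = (weights.unsqueeze(-1) * words).sum(0)   # q_m
        return reps
```

For example, `ModalitySpecificPooling(384)(torch.randn(12, 384))` returns the two query vectors $\mathbf{q}_v$ and $\mathbf{q}_t$ for a 12-word query.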
Video Encoder. For the $i$-th video, we feed the subtitle features and image features, with positional embeddings and modality embeddings, to a one-layer multi-modal Transformer that simultaneously captures intra-modal and inter-modal dependencies, outputting contextual image representations $\bar{\mathbf{I}}_{i,j}$ and subtitle representations $\bar{\mathbf{s}}_{i,j}$. Since only part of the content in the video is related to the query, the similarity score $S^{vr}_i$ between the query and the $i$-th video is the average of the max-pooled query-image similarities and the max-pooled query-subtitle similarities, using cosine similarity $\mathrm{sim}(\cdot,\cdot)$:
$$S^{vr}_i = \frac{1}{2}\Big(\max_j \mathrm{sim}(\mathbf{q}_v, \bar{\mathbf{I}}_{i,j}) + \max_j \mathrm{sim}(\mathbf{q}_t, \bar{\mathbf{s}}_{i,j})\Big).$$
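The score computation can be sketched as follows (a direct transcription of the equation above; names are illustrative):

```python
import torch
import torch.nn.functional as F

def video_similarity(q_v, q_t, img_reps, sub_reps):
    """S_i^vr: average of the best query-image and best query-subtitle
    cosine similarities within one video.

    q_v, q_t:  (d,) modality-specific query representations
    img_reps:  (num_clips, d) contextual image representations
    sub_reps:  (num_clips, d) contextual subtitle representations
    """
    sim_img = F.cosine_similarity(q_v.unsqueeze(0), img_reps, dim=-1)  # per clip
    sim_sub = F.cosine_similarity(q_t.unsqueeze(0), sub_reps, dim=-1)
    return 0.5 * (sim_img.max() + sim_sub.max())
```

Because the two encoders never attend across the query-video boundary, all video representations can be pre-computed and indexed offline, preserving the efficiency of late fusion.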
Relevant Content-enhanced Contrastive Learning. Contrastive learning [36] is widely used to train late-fusion retrieval models, such as [21]. However, there are few studies on retrieval where the query is only partially related to the target. To this end, we propose a relevant content-enhanced contrastive learning method for training our video retriever to focus more on query-relevant content. The essence of our approach lies in selecting content relevant to the query as positive samples and increasing the similarity between the query representation and the representations of query-related content computed by the retriever. As depicted in Figure 1, the content explicitly related to the query always lies within the relevant moment. Therefore, we select the image and subtitle with the highest similarity to the query within the moment, as shown in Figure 3, as strong relevant samples; the overall similarity is denoted as $S_{++}$. For negative samples, we choose the image and subtitle with the highest similarity to the query in each negative video; the similarity is denoted as $S_-$. We employ the InfoNCE [37] loss to train the retriever:
$$\mathcal{L}_{++} = -\log \frac{\exp(S_{++}/t)}{\exp(S_{++}/t) + \sum_{k=1}^{N} \exp(S_-^k/t)}, \qquad (7)$$
where $t$ is a temperature set to 0.01 and $N$ is the number of negatives. We adopt an in-batch negative sampling strategy in which all videos in the batch except the positive video can be used for negative sampling.
Beyond relevant content within the moment, we also consider relevant content in the video that falls outside the moment, as it may still exhibit relevance to the query, such as images or subtitles adjacent to the moment; we call this weak relevant content. The weak relevant content is the image and subtitle with the highest similarity to the query outside the moment, as shown in Figure 3; the similarity is denoted as $S_+$. The loss for weak relevant content, $\mathcal{L}_+$, takes the same form as Equation (7). The query-to-video loss is:
$$\mathcal{L}_{q2v} = \mathcal{L}_{++} + \beta\, \mathcal{L}_+, \qquad (8)$$
where $\beta$ is a hyper-parameter set to 0.5. We also incorporate a video-to-query loss $\mathcal{L}_{v2q}$ (using only the strong positive sample), as in most works on cross-modal retrieval. The sum of $\mathcal{L}_{q2v}$ and $\mathcal{L}_{v2q}$ is the overall loss for the video retriever.
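The following sketch shows the query-to-video loss under the assumption, implied by the text, that the strong and weak positive terms share the same in-batch negatives:

```python
import torch

def infonce(pos_sim: torch.Tensor, neg_sims: torch.Tensor, t: float = 0.01):
    """InfoNCE with one positive and N negative similarity scores."""
    logits = torch.cat([pos_sim.view(1), neg_sims]) / t
    return -torch.log_softmax(logits, dim=0)[0]

def query_to_video_loss(s_strong, s_weak, s_negs, beta: float = 0.5, t: float = 0.01):
    """L_q2v = L_++ + beta * L_+.

    s_strong: S_++, best in-moment image/subtitle similarity (strong positive)
    s_weak:   S_+, best out-of-moment similarity (weak positive)
    s_negs:   (N,) best similarities from the N in-batch negative videos
    """
    return infonce(s_strong, s_negs, t) + beta * infonce(s_weak, s_negs, t)
```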

Focus-then-fuse Moment Localizer
The moment localizer, shown in Figure 4, uses the query to localize the target moment $m^*$ in the top-$K$ retrieved videos $\mathcal{V}^*$. The proposed localizer is based on an early-fusion architecture to explore deeper interactions between the query and the video for accurate moment localization. Because the retrieved videos are narrowed down to a small set, the amount of computation is acceptable. The localizer comprises encoders, modality-specific gates that focus on significant content from each modality, and a multi-modal Transformer for query-video fusion. In addition to the regular loss for training the localizer (optimizing the prediction of moment boundaries), we propose an additional relevant content-enhanced adversarial loss to enhance partial relevance.

Encoders. Similar to the video retriever, we obtain multi-modal contextual representations of images $\bar{\mathbf{I}}$ and subtitles $\bar{\mathbf{s}}$ in the video using a multi-modal Transformer, and use modality-specific pooling to generate two query representations $\mathbf{q}_v$ and $\mathbf{q}_t$ for the two modalities. We also retain word representations for token-to-token interactions between the query and the video.

Modality-specific Gates. Because the query-related parts of the modalities differ, as shown in Figure 1, we follow [39] in designing modality-specific gates to capture the query-related content of images and subtitles: each image or subtitle representation is transformed by a fully-connected layer $\hat{W} \in \mathbb{R}^{d \times d}$ and modulated by the element-wise product ($\odot$) with the L2-normalized ($\mathrm{norm}$) modality-specific query representation $\mathbf{q}_m$, yielding the $i$-th gated image or subtitle representation $\mathbf{d}_i$.
We then fuse the gated representations of the two modalities in a clip with a fully-connected layer: $\hat{\mathbf{c}}_i = \mathrm{FC}([\hat{\mathbf{I}}_i; \hat{\mathbf{s}}_i])$, where $[;]$ denotes concatenation and $\hat{\mathbf{c}}_i$ is the gated representation of the $i$-th clip. The gated video representation is $\mathbf{v} = \{\hat{\mathbf{c}}_1, \ldots, \hat{\mathbf{c}}_{|v|}\}$.
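A sketch of the focusing and fusion steps is shown below. The exact gate design follows the cited reference; here we assume the simplest reading of the text, in which each clip feature is passed through the fully-connected layer and modulated element-wise by the L2-normalized query representation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateAndFuse(nn.Module):
    """Modality-specific gating followed by clip-level fusion (illustrative)."""

    def __init__(self, d: int):
        super().__init__()
        self.w_img = nn.Linear(d, d)     # W-hat for image features
        self.w_sub = nn.Linear(d, d)     # W-hat for subtitle features
        self.fuse = nn.Linear(2 * d, d)  # FC over [I-hat_i ; s-hat_i]

    @staticmethod
    def gate(feats, q_m, w):
        # feats: (num_clips, d); q_m: (d,) modality-specific query representation
        return F.normalize(q_m, dim=-1) * w(feats)  # element-wise product

    def forward(self, img_feats, sub_feats, q_v, q_t):
        gated_img = self.gate(img_feats, q_v, self.w_img)
        gated_sub = self.gate(sub_feats, q_t, self.w_sub)
        return self.fuse(torch.cat([gated_img, gated_sub], dim=-1))  # c-hat_i
```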

Multi-modal Fusion
We leverage a multi-modal Transformer to fuse the token representations of the query with the clip representations for deep cross-modal interaction. Specifically, we input the gated clip representations to a two-layer Transformer. In each layer, alongside the self-attention layer for clip interactions, there is an additional cross-attention layer capturing fine-grained cross-modal dependencies between the video and the query.
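One fusion layer can be sketched as follows (hidden size 384, 4 heads, and intermediate size 1536 follow the implementation details below; the residual and normalization placement is a standard choice we assume):

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Self-attention over clips, then cross-attention from clips to query tokens."""

    def __init__(self, d: int = 384, heads: int = 4, ffn: int = 1536):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, ffn), nn.ReLU(), nn.Linear(ffn, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, clips: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # clips: (B, num_clips, d) gated clip reps; words: (B, num_words, d)
        x = self.n1(clips + self.self_attn(clips, clips, clips)[0])
        x = self.n2(x + self.cross_attn(x, words, words)[0])  # video attends to query
        return self.n3(x + self.ffn(x))
```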
To predict the start and end times of the target moment, we employ two 1D-convolution networks to capture dependencies between adjacent clips and output boundary scores $S^{st}_i$ and $S^{ed}_i$, indicating whether the $i$-th clip serves as the start or end boundary of the moment. The training objective is to make the score of the clip that is the correct boundary of the moment higher than those of the other clips. To ensure the comparability of predicted moments across all retrieved videos, we adopt shared normalization (Shared-Norm) [11], a common practice in open-domain QA systems. In addition to the clip scores predicted from the query and the relevant video, we input the query and irrelevant videos to the localizer to predict clip scores that serve as additional negative samples for training. The training is based on a cross-entropy loss:
$$\mathcal{L}_{st} = -\log \frac{\exp(S^{st}_+)}{\sum_j \sum_i \exp(S^{st}_{j,i})},$$
where $S^{st}_+$ is the start score of the ground-truth moment, $S^{st}_{j,i}$ is the start score of the $i$-th clip in video $v_j$, and the normalization runs over the clips of all sampled videos. The end-boundary loss $\mathcal{L}_{ed}$ is defined symmetrically.

Relevant Content-enhanced Adversarial Training. Like the relevant content-enhanced contrastive learning for the video retriever, we use an adversarial training method, formulated as a classification task, to improve the localizer's focus on query-relevant content. The positive samples for classification are obtained by sampling segments of the video whose IoU with the ground-truth moment is greater than 0.7, as shown in Figure 3. The negatives consist of sampled segments that do not match the ground-truth moment, also shown in Figure 3.
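For concreteness, the Shared-Norm boundary loss described above can be sketched as follows: the softmax runs jointly over the clips of the positive video and the sampled negative videos, which is what makes boundary scores comparable across videos at inference time (function and variable names are illustrative):

```python
import torch

def shared_norm_boundary_loss(scores, gt_video: int, gt_clip: int) -> torch.Tensor:
    """Cross-entropy for one boundary (start or end) under Shared-Norm.

    scores:   list of (num_clips_j,) boundary score tensors, one per sampled video
              (the positive video plus the negative videos)
    gt_video: index of the ground-truth video within `scores`
    gt_clip:  index of the ground-truth boundary clip in that video
    """
    flat = torch.cat(scores)                         # all clips of all videos
    offset = sum(s.numel() for s in scores[:gt_video])
    return -torch.log_softmax(flat, dim=0)[offset + gt_clip]
```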
The overall training loss for the moment localizer is the weighted sum of the boundary loss and the adversarial classification loss:
$$\mathcal{L}_{loc} = \mathcal{L}_{bd} + \lambda\, \mathcal{L}_{adv},$$
where $\lambda$ is a hyper-parameter set to 0.8.

Training and Inference
We implement a stage-wise training strategy, initially training the video retriever using text-video pairs. Subsequently, the trained video retriever is employed to sample negative videos as hard negatives for the Shared-Norm training of the moment localizer.
In the inference phase, we begin by retrieving the top-10 videos for a given query using the video retriever. Subsequently, we employ the moment localizer to localize the moment within the selected 10 videos. Notably, the moment classification used in adversarial training does not contribute to moment prediction. The score of a predicted moment relies on both the video retrieval score and the boundary scores:
$$S_m = \frac{S^{vr}}{t} + S^{st}_{\tau_{st}} + S^{ed}_{\tau_{ed}},$$
where $t$ is the temperature in contrastive learning. We use $S_m$ to rank the predicted moments in the retrieved videos.
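Ranking across the retrieved videos can then be sketched as below; the aggregation formula mirrors the score above, which is itself our reading of the text:

```python
import heapq

def rank_moments(candidates, k: int = 100, t: float = 0.01):
    """Rank candidate moments from the top retrieved videos.

    candidates: iterable of dicts with keys 's_vr' (video retrieval similarity),
    's_st' and 's_ed' (boundary scores), 'video_id', and 'span' in seconds.
    Returns the k highest-scoring (score, video_id, span) triples.
    """
    scored = [
        (c["s_vr"] / t + c["s_st"] + c["s_ed"], c["video_id"], c["span"])
        for c in candidates
    ]
    return heapq.nlargest(k, scored, key=lambda x: x[0])
```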

EXPERIMENTS
In this section, we begin by introducing the datasets and metrics, followed by a discussion of implementation details. Subsequently, we present experimental results and compare them with the baselines. Next, we conduct ablation studies to analyze the components of the model. Finally, we provide visualizations and case studies.
Table 3: VR results on the TVR validation set using different pooling methods.

Datasets and Evaluation Metrics
We conduct experiments on two datasets: TVR [23] and DiDeMo [1]. TVR is built on TV shows, whose videos consist of images and subtitles; it contains 17,435, 2,179, and 1,089 videos in the training, validation, and testing sets, respectively. The average length of the videos is 76.2 seconds, while the average length of the moments is 9.1 seconds. Each video has 5 queries, with an average query length of 13.4 words. DiDeMo is built on YFCC100M [35], a dataset of real-world videos, with only images and no subtitles in the videos. DiDeMo contains 8,395, 1,065, and 1,004 videos for training, validation, and testing, respectively. Most of the videos are about 30 seconds long, and the average duration of the moments is 6.5 seconds. Each video has 4 queries, with an average query length of 8 words.
We follow [23] for the evaluation metrics. For the VCMR task, the metric is R@$K$, IoU=$\mu$, which represents the percentage of queries for which at least one of the top-$K$ retrieved moments has an Intersection over Union (IoU) with the ground truth exceeding $\mu$. The two sub-tasks are also evaluated. The metric for the SVMR task is the same as that for the VCMR task, but the evaluation is conducted only on the ground-truth video of each query. For the VR task, the metric is R@$K$, which denotes the percentage of queries for which the correct video is among the top-$K$ ranked videos.
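For clarity, the VCMR metric can be computed as in the following sketch (function names are ours):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k, mu):
    """R@K, IoU=mu over a set of queries.

    predictions:   per query, a ranked list of (video_id, start, end)
    ground_truths: per query, one (video_id, start, end)
    """
    hits = 0
    for preds, (gt_vid, gt_s, gt_e) in zip(predictions, ground_truths):
        if any(vid == gt_vid and temporal_iou((s, e), (gt_s, gt_e)) >= mu
               for vid, s, e in preds[:k]):
            hits += 1
    return hits / len(predictions)
```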

Implementation Details
Feature Extraction. For the query, we use the 768-D RoBERTa features provided by [23]. For the video, the duration of a clip is 1.5 seconds, and images are sampled at 3 FPS. We use the 4352-D SlowFast+ResNet features provided by [25] as video features and the 768-D RoBERTa features provided by [23] as subtitle features.
Architecture. For the Transformers, we use the vanilla Transformer architecture [38] with hidden size 384 and intermediate size 1536. The number of heads for multi-head attention is 4.
Training. We train the video retriever for 100 epochs with batch size 256. For the moment localizer, we sample 4 and 2 negative videos per query from the top-100 ranked videos on TVR and DiDeMo, respectively, and train it for 15 epochs with batch size 32. Both the retriever and the localizer are trained with AdamW, with learning rate 0.0001 and weight decay 0.01, on a single NVIDIA Tesla V100 GPU. More details can be found in our released code.
VR. As reported in Table 1, our proposed PREM outperforms the other methods on both benchmarks, with the exception of HERO. HERO benefits from pre-training on two large text-video datasets, TVR [23] and HowTo100M [30], to acquire additional knowledge, whereas our model is trained exclusively on the TVR training set. HERO without pre-training achieves sub-optimal results. HERO fine-tuned with our proposed relevant content-enhanced contrastive learning substantially improves retrieval accuracy, showing the effectiveness of partial relevance enhancement for the VR task. ReLoCLNet also employs contrastive learning to train its retrieval model; however, it underperforms our model on both datasets because it neglects query-relevant content when sampling positive samples. Instead, our sampling incorporates both strong and weak relevant clips to encourage the model to prioritize query-relevant content in the video.

SVMR and VCMR. As shown in Table 4 and Table 5, our proposed model outperforms the other baseline methods on both tasks and both datasets, underscoring the effectiveness of our focus-then-fuse moment localizer. The stronger performance of early-fusion and two-stage models compared to late-fusion models can be attributed to the integration of cross-attention in their localizers, facilitating comprehensive interaction of fine-grained information across modalities. Noteworthy is the significant improvement in moment retrieval accuracy on TVR when utilizing HERO fine-tuned by our proposed relevant content-enhanced contrastive learning as the video retriever.

Ablation Study
Video Retriever. As reported in Table 2, each component of our video retriever contributes to the performance. Both images and subtitles in the video play important roles in video retrieval. When sampling positive examples for contrastive learning, it is effective to draw strong relevant samples from the query-related moment, as the content within that scope is explicitly related to the query. Selecting weak relevant samples from outside the moment is also effective. However, retrieval accuracy decreases when we exclude relevant content sampling and instead employ the sampling method of previous work [52], i.e., selecting the clip with the highest similarity to the query from all content of the correct video as the positive sample. This approach makes it challenging to learn the relevance between the query and the query-related moment.
We further investigate the effect of the weak relevant loss weight $\beta$ in Equation (8) on VR. As shown in Figure 5, retrieval accuracy is highest when the weight is set to 0.5. Accuracy decreases for weights greater than 0.5, because weak relevant samples are not always explicitly related to the query, and a smaller weight mitigates the negative impact of their uncertain relevance.
In Table 3, we report the results of various pooling methods in the retriever. Modality-specific pooling outperforms the two other commonly used pooling methods, confirming the discrepancy in relevance across modalities to the query that motivated our approach. A visualization is provided in Figure 6a, showing the weights of the individual words that compose the query representation; words with higher weights are shaded in darker red. For instance, "Penny enters the apartment holding a mug" has higher weights for the image query, reflecting its visual nature, whereas "saying that she needs coffee" has higher weights for the subtitle query, suggesting a stronger association with textual information.

Moment Localizer. Table 6 reports the contributions of the key components of the moment localizer to SVMR and VCMR performance. Unlike the video retriever, the moment localizer places greater emphasis on visual information than on textual information, due to the nature of the moment localization task, which requires accurately matching an action to the query. Both the modality-specific gates and adversarial training contribute to moment localization performance, confirming the effectiveness of our partial relevance enhancement strategies. As illustrated in Figure 6b, the localizer utilizes key information from both modalities when predicting boundaries: the start boundary prediction primarily relies on the image, while the end boundary prediction depends on the subtitle, as the subtitle "Need coffee" is related to "saying that she needs coffee" in the query. Additionally, Shared-Norm plays a crucial role in VCMR by enabling the localizer to localize the target moment across multiple videos.

Case Study
We present examples of VR and SVMR in Figure 7. In VR, our model ranks the correct video first, and the images most similar to the query are indeed relevant to it. In contrast, the query and the most similar image in the video retrieved by ReLoCLNet lack semantic alignment: the character is not House, and the action in the video involves putting on a breathing mask rather than taking it off. In SVMR, the moment predicted by our model is closer to the ground truth because the model captures the images related to "they walk into the room" to help localize the moment. These two cases validate the effectiveness of our partial relevance enhancement for both the VR and SVMR tasks.

CONCLUSION
In this paper, we propose a Partial Relevance Enhanced Model (PREM) to improve the VCMR task. For the two sub-tasks of VCMR, we introduce two modules with different partial relevance enhancement strategies: a multi-modal collaborative video retriever for VR and a focus-then-fuse moment localizer for SVMR. To further encourage the model to capture the partial relevance between the query and the video, we propose relevant content-enhanced contrastive learning and adversarial training for the training of the two modules. Extensive experiments on two datasets, TVR and DiDeMo, demonstrate that our proposed model achieves new state-of-the-art results on the VCMR task. The ablation studies and visualizations confirm the effectiveness of the partial relevance enhancement in our model. In the future, we plan to explore additional modalities within the video, such as speech, to further enhance the retrieval model's ability to capture partial relevance.

Figure 1: In the VCMR task, only a small part of the untrimmed video is explicitly related to the query, i.e., the content within the target moment, and the relevance of the query to information from different modalities in the video varies.

Figure 2: The video retriever consists of two encoders, a video encoder and a query encoder. 'ME' and 'PE' represent modality and positional embeddings, respectively.

Figure 3: Relevant and negative image sampling for contrastive learning of the video retriever, and positive and negative moment sampling for adversarial learning of the moment localizer. The segment of the video with a red border is the query-related moment. The query is "Ted turns around and throws away a poster into a trash can."

Figure 4: The moment localizer contains two key components, modality-specific gates and a multi-modal fusion network, to align the query with the multi-modal content of the video. Squares represent vectors.
(a) Heatmap of the weights with which words compose the modality-specific queries. (b) Confidence scores of the start and end boundaries for moment prediction.

Figure 6: Visualizations of modality-specific pooling in the video retriever and of moment prediction in the moment localizer. The query is "Penny enters the apartment holding a mug saying that she needs coffee."

(a) Video retrieval: query "House takes the breathing mask off of the patient.", with the videos retrieved by PREM and by ReLoCLNet. (b) Single video moment retrieval: query "The bandleader announces Chandler and Monica and they walk into the room."

Figure 7: Case study. (a) VR: the images with yellow borders in the video retrieved by PREM are the images most similar to the query within and outside the correct moment, while the image with a blue border in the video retrieved by ReLoCLNet is the one most similar to the query. (b) SVMR: query-related subtitles and images are distinguished by different colors.

Table 1: VR results on the TVR validation set and DiDeMo testing set. †: HERO without pre-training on a large dataset. *: HERO fine-tuned using our relevant content-enhanced contrastive learning. SumR is the sum of R@K (K = 1, 5, 10, 100).

Table 2: Ablation of the video retriever on the TVR validation set. 'Text': subtitles provided by the TVR dataset. 'Vision': images in the video. 'SR': strong relevant positive sample. 'WR': weak relevant positive sample.

Table 6: Ablation of the moment localizer on the TVR validation set.