FTAN: Exploring Frame-Text Attention for Lightweight Video Captioning

Traditional video captioning approaches employ LSTM as a lightweight decoder. However, these methods focus on fully extracting visual features but pay less attention to textual information, resulting in relatively poor caption quality. Recent transformer-based methods achieve more accurate results, but at the cost of excessive computational resources. In this paper, we propose a lightweight model for video captioning named Frame-Text Attention Network (FTAN), which aims to make full use of both visual and textual features to generate more accurate captions. We develop a novel text attention module in FTAN, which uses the hidden state of the LSTM as a query to generate attentive text features. The attentive text features are then merged with visual features, and the merged features are fed into the LSTM to generate more accurate captions. To the best of our knowledge, we are the first to introduce an attention mechanism to extract the textual information hidden in the LSTM architecture for video captioning. Extensive experiments demonstrate the effectiveness of FTAN: it outperforms the state-of-the-art LSTM-based method on the MSVD dataset by 0.8 in CIDEr-D while using only about one-fourth of the parameters of transformer-based methods.


INTRODUCTION
As a task bridging computer vision and natural language processing, video captioning aims to generate natural-language descriptions of the visual content of given videos. Because video content, unlike static images, contains motion information across frames, video captioning remains one of the most challenging tasks in deep learning. Recent studies [15, 20] adopt the encoder-decoder framework, which has been successfully used in neural machine translation [13] and image captioning [16], to extract visual and textual features. Mainstream video captioning methods [10, 11, 17] use a CNN as the encoder to extract visual features and an LSTM as the decoder to generate captions. More recently, with the development of the transformer in natural language processing [14] as well as computer vision [8], a number of methods [7, 21] have adopted the transformer in the encoder-decoder framework and achieved better results. However, these methods require substantial computational resources and are therefore unsuitable for many application environments, such as CPU-only machines or limited GPU resources. Compared to transformer-based methods, our method has significant advantages in training speed and computational cost, and strikes a good balance between efficiency and accuracy.
Considerable existing work has concentrated on extracting visual features effectively, although the specific approaches vary. SA-LSTM [20] introduces a temporal attention mechanism and a 3D CNN to acquire motion features across frames. PickNet [4] selects critical frames to reduce redundancy. OA-BTG [23] and ORG-TRL [24] add detectors to extract object-level information and achieve satisfactory performance at the expense of time consumption. Lately, several works have attempted to optimize the extraction of textual information. RecNet [17] devises an LSTM layer called the reconstructor to obtain text-to-video information. SGN [11] aggregates words into phrases and then combines phrases and frames as the input of the LSTM. However, SGN [11] needs to extract negative videos, whose content differs considerably from the input videos, before training, and aggregating words into phrases, in our view, discards some textual information.
Our model is based on the following observations. First, existing methods focus on optimizing the extraction of video features, while little research has been devoted to obtaining more textual information. We also regard video captioning as a cross-modal task involving both video and text as a whole, so optimizing the extraction of textual information can boost performance. The attention mechanism has achieved satisfactory results in extracting visual content in many methods such as SA-LSTM [20], RecNet [17] and SGN [11]. Inspired by this, we devise a novel text attention module that uses the hidden state of the LSTM as the query. Previous works simply concatenate word embeddings with frame features, which exploits only the information contained in the last generated word and ignores the information contained in the previously generated words. Our text attention module, in contrast, is able to combine the information of all generated words.
In this paper, we propose FTAN, a lightweight LSTM-based model for video captioning. The contributions of our work are summarized as follows: (1) We devise a novel text attention module that utilizes the information of all generated words. To the best of our knowledge, we are the first to introduce an attention mechanism to extract the textual information hidden in the LSTM architecture for video captioning. (2) Our model utilizes an LSTM as the decoder, which is lightweight compared to transformer-based methods and has potential in practical applications. We employ a frame attention module, which merges the visual features extracted from frames, together with the text attention module to improve the understanding of video content. (3) Extensive experiments demonstrate that our method achieves state-of-the-art results among LSTM-based methods on the MSVD dataset and outperforms the runner-up method by 0.8 in CIDEr-D.

PROPOSED APPROACH
We first use 2D and 3D CNNs as the encoder to extract visual features. The visual features and the word embeddings are then sent to the frame attention module and the text attention module, respectively. Finally, we concatenate the outputs of the frame attention module and the text attention module as the input of the LSTM.

Visual Encoder
For an input video $V$, we uniformly sample $N$ frames $\{f_i\}_{i=1}^{N}$ and clips $\{c_i\}_{i=1}^{N}$, where each clip $c_i$ consists of consecutive frames around frame $f_i$. We use CNNs as the encoder to extract visual information because of their success on diverse computer vision tasks ranging from video object detection [3] to video classification [5]. The visual representation of the $i$-th frame is
$$v_i = [\Phi_{2D}(f_i); \Phi_{3D}(c_i)],$$
where $\Phi_{2D}$ and $\Phi_{3D}$ denote the 2D and 3D CNN backbones and $[\,;\,]$ denotes concatenation. In this way, the encoder captures both static and motion information and extracts the visual features of the video more comprehensively.
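A minimal PyTorch sketch of this encoding step under our reading of the equation above; `cnn2d` and `cnn3d` are placeholder backbone networks (the specific 2D and 3D CNNs are not fixed by this section), and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Sketch of the 2D+3D CNN visual encoder (illustrative, not the official code).

    cnn2d and cnn3d are assumed pre-trained backbones that map a frame
    (or a short clip) to a pooled feature vector.
    """
    def __init__(self, cnn2d, cnn3d):
        super().__init__()
        self.cnn2d = cnn2d   # frame f_i -> static appearance feature
        self.cnn3d = cnn3d   # clip  c_i -> motion feature

    def forward(self, frames, clips):
        # frames: (N, 3, H, W), clips: (N, 3, T, H, W) for one video of N sampled frames
        static = self.cnn2d(frames)                  # (N, d_2d)
        motion = self.cnn3d(clips)                   # (N, d_3d)
        return torch.cat([static, motion], dim=-1)   # v_i = [Phi_2D(f_i); Phi_3D(c_i)]
```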

Frame Attention Module
Previous methods have applied attention mechanisms to better exploit visual features; SA-LSTM [20] first used the hidden state as the query to compute attention scores. Inspired by this, we also adopt an attention mechanism. For the visual representations $\{v_i\}_{i=1}^{N}$ of an input video $V$, we use the hidden state of the previous step $h_{t-1}$ as the query to calculate the weight score $\alpha_{t,i}$ of each visual representation at time step $t$:
$$\alpha_{t,i} = w_a^{\top}\,\phi\!\left(W_h h_{t-1} + W_v v_i + b_a\right),$$
where $W_h$, $W_v$, $w_a$ and $b_a$ are learnable parameters and $\phi$ is the activation function. We choose the hyperbolic tangent because its derivative lies between 0 and 1, which alleviates the problem of vanishing gradients.
The weight scores are normalized with a softmax function, and the attentive visual representation $\tilde{v}_t$ is obtained as the weighted sum of the visual features $v_i$:
$$\hat{\alpha}_{t,i} = \mathrm{softmax}_i(\alpha_{t,i}), \qquad \tilde{v}_t = \sum_{i=1}^{N} \hat{\alpha}_{t,i}\, v_i.$$
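The frame attention described above can be read as standard additive (Bahdanau-style) attention. A hedged PyTorch sketch follows; the class name, dimensions, and the single score vector are our assumptions rather than the paper's exact hyper-parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Additive attention with the previous hidden state as query (sketch)."""
    def __init__(self, query_dim, feat_dim, hidden_dim):
        super().__init__()
        self.W_h = nn.Linear(query_dim, hidden_dim, bias=False)  # projects h_{t-1}
        self.W_v = nn.Linear(feat_dim, hidden_dim, bias=True)    # projects each v_i (bias plays the role of b_a)
        self.w_a = nn.Linear(hidden_dim, 1, bias=False)          # score vector

    def forward(self, h_prev, feats):
        # h_prev: (B, query_dim) previous decoder hidden state
        # feats:  (B, N, feat_dim) per-frame visual representations v_i
        scores = self.w_a(torch.tanh(self.W_h(h_prev).unsqueeze(1) + self.W_v(feats)))  # (B, N, 1)
        alpha = F.softmax(scores, dim=1)              # normalize over the N frames
        attended = (alpha * feats).sum(dim=1)         # weighted sum, (B, feat_dim)
        return attended, alpha.squeeze(-1)
```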

Text Attention Module
Although the attention mechanism has been widely used to extract visual information across frames, there is still little research on exploiting it for textual information. We implement an attention mechanism over the generated caption, called the text attention module. For the words generated before step $t$, $\{s_j\}_{j=1}^{t-1}$, we first use a learnable embedding matrix $E$ to obtain the word embeddings $\{E[s_j]\}_{j=1}^{t-1}$. We then use the hidden state of the previous step $h_{t-1}$ as the query to calculate the weight score $\beta_{t,j}$ of each word embedding at time step $t$:
$$\beta_{t,j} = w_b^{\top}\,\phi\!\left(W'_h h_{t-1} + W_e E[s_j] + b_b\right),$$
where $W'_h$, $W_e$, $w_b$, $b_b$ and $E$ are learnable parameters and $\phi$ is again the hyperbolic tangent. The weight scores are normalized with a softmax function, and the attentive word embedding $\tilde{e}_t$ is obtained as:
$$\hat{\beta}_{t,j} = \mathrm{softmax}_j(\beta_{t,j}), \qquad \tilde{e}_t = \sum_{j=1}^{t-1} \hat{\beta}_{t,j}\, E[s_j].$$
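Under the sketch above, the text attention module can reuse the same additive attention, with the embeddings of the already generated words in place of the frame features. The snippet below is illustrative; `hidden_size`, `emb_dim`, `attn_dim`, `h_prev` and `generated` are assumed variables, not names from the paper:

```python
# emb: nn.Embedding(vocab_size, emb_dim), the learnable matrix E
text_attn = AdditiveAttention(query_dim=hidden_size, feat_dim=emb_dim, hidden_dim=attn_dim)
word_embs = emb(generated)                 # (B, t-1, emb_dim): E[s_j] for the generated words
e_t, beta = text_attn(h_prev, word_embs)   # attentive word embedding e~_t and weights beta_{t,j}
```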

Decoder
The LSTM has been widely successful at modeling sequential information, so we adopt a one-layer bidirectional LSTM as the decoder. At time step $t$, we first concatenate the attentive visual representation $\tilde{v}_t$ and the attentive word embedding $\tilde{e}_t$:
$$x_t = [\tilde{v}_t; \tilde{e}_t],$$
where $[\,;\,]$ denotes concatenation.
Then $x_t$ is fed into the LSTM, followed by a fully-connected layer and a softmax layer that predict the probability distribution over words at step $t$:
$$P(s_t \mid V, s_1, \dots, s_{t-1}) = \mathrm{softmax}(W_o h_t + b_o),$$
where $h_t$ is the hidden state at step $t$ and $W_o$, $b_o$ are the parameters of the fully-connected layer.
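Putting the pieces together, one decoding step could look like the sketch below, which reuses the `AdditiveAttention` module from the frame attention sketch. For simplicity it uses a unidirectional `nn.LSTMCell` rather than the paper's one-layer bidirectional LSTM, and all dimension names are assumptions:

```python
class FTANDecoderStep(nn.Module):
    """One decoding step: frame attention + text attention -> LSTM -> word logits (sketch)."""
    def __init__(self, feat_dim, emb_dim, hidden_size, vocab_size, attn_dim):
        super().__init__()
        self.frame_attn = AdditiveAttention(hidden_size, feat_dim, attn_dim)
        self.text_attn = AdditiveAttention(hidden_size, emb_dim, attn_dim)
        self.lstm = nn.LSTMCell(feat_dim + emb_dim, hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, feats, word_embs, state):
        # feats: (B, N, feat_dim) visual features; word_embs: (B, t-1, emb_dim) generated-word embeddings
        h_prev, c_prev = state
        v_t, _ = self.frame_attn(h_prev, feats)       # attentive visual representation v~_t
        e_t, _ = self.text_attn(h_prev, word_embs)    # attentive word embedding e~_t
        x_t = torch.cat([v_t, e_t], dim=-1)           # x_t = [v~_t ; e~_t]
        h_t, c_t = self.lstm(x_t, (h_prev, c_prev))
        logits = self.fc(h_t)                         # softmax is applied when computing probabilities/loss
        return logits, (h_t, c_t)
```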

Training
During training, for a video $V$ and its ground-truth caption $S = [s_1, \dots, s_T]$ from the training set $\mathcal{D}$, we use the cross-entropy loss, defined as the negative log-likelihood of the ground-truth words:
$$\ell = -\sum_{t=1}^{T} \log P(s_t \mid V, s_1, \dots, s_{t-1}).$$
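A short sketch of how this loss could be computed with teacher forcing, reusing the decoder step sketched above; the function and variable names are ours, not the official implementation:

```python
def caption_loss(decoder, feats, caption, emb, sos_idx, hidden_size):
    """Teacher-forced cross-entropy loss over one batch of captions (illustrative sketch)."""
    B, device = feats.size(0), feats.device
    h = feats.new_zeros(B, hidden_size)
    c = feats.new_zeros(B, hidden_size)
    generated = torch.full((B, 1), sos_idx, dtype=torch.long, device=device)  # start with <SOS>
    loss = 0.0
    for t in range(caption.size(1)):
        word_embs = emb(generated)                     # embeddings of all words generated so far
        logits, (h, c) = decoder(feats, word_embs, (h, c))
        loss = loss + F.cross_entropy(logits, caption[:, t])          # -log P(s_t | V, s_<t)
        generated = torch.cat([generated, caption[:, t:t+1]], dim=1)  # teacher forcing with ground truth
    return loss
```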

EXPERIMENTS
Experimental Setup
We conduct extensive experiments on two benchmark datasets to show the effectiveness of our method. MSVD: The Microsoft Video Description (MSVD) dataset [2] includes 1,970 videos with lengths from 10 to 15 seconds. Each video has 40 English captions collected via Amazon Mechanical Turk.
MSVD is divided into 1,200 videos for training, 100 videos for validation, and 670 videos for testing.For a fair comparison, we follow the same division setting.
MSR-VTT: The MSR Video to Text dataset [19] is a relatively large dataset which includes 10,000 videos. The length of videos ranges from 10 seconds to 30 seconds. Each video contains 20 English captions and a category tag. The dataset is divided into a training set of 6,513 videos, a validation set of 497 videos, and a test set of 2,990 videos. We follow the same division setting.
Implementation details: We sample 30 frames from each video, and the size of the word embedding matrix, which is trained jointly with the model, is set to 468. Before the first word $s_1$ is decoded, we use <SOS> as the initial input word $s_0$. We also use <UNK>, which stands for unknown words, to represent low-frequency words. We set the maximum caption length to 30 words; captions longer than 30 words are clipped. During testing, we adopt a beam search strategy with a beam width of 5 to obtain more accurate captions. BLEU@4, CIDEr-D, METEOR and ROUGE_L are used as evaluation metrics, and we use the official code from the Microsoft COCO evaluation server to calculate the scores.
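A minimal sketch of this caption preprocessing; the tokenizer, the vocabulary lookup, and the <EOS> end token are our assumptions (the text above only names <SOS> and <UNK>):

```python
SOS, EOS, UNK = "<SOS>", "<EOS>", "<UNK>"
MAX_LEN = 30  # captions longer than 30 words are clipped

def encode_caption(caption, word2idx):
    """Map a caption string to word indices, clipped to MAX_LEN (illustrative)."""
    words = caption.lower().strip().split()[:MAX_LEN]
    tokens = [SOS] + words + [EOS]
    # unknown / low-frequency words fall back to <UNK>
    return [word2idx.get(w, word2idx[UNK]) for w in tokens]
```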

Quantitative Results
We compare our model with previous methods on the MSVD and MSR-VTT datasets. SA-LSTM [20] is a commonly used baseline that applies a soft attention mechanism along the temporal dimension. HRNE [9], BAE [1] and PickNet [4] explore ways to better exploit visual features. M3 [18] and MARN [10] introduce memory structures to improve long-term dependency modeling. h-RNN [22] and MAM [6] employ both temporal and spatial attention. For a fair comparison, we mainly compare against LSTM-based methods that use neither object detectors nor video categories as auxiliary data. Our method achieves state-of-the-art performance on MSVD and competitive results on MSR-VTT. Table 1 compares our method with previous LSTM-based methods on MSVD and MSR-VTT, and Table 2 gives the ablation study of our method. Without either module, we achieve 71.8 and 35.8 in CIDEr-D on MSVD and MSR-VTT, respectively; with only the frame attention module, 90.0 and 45.2; with only the text attention module, 80.3 and 43.3; and with both modules, 93.0 and 47.1. The ablation study demonstrates the effectiveness of both modules. With both modules, our method achieves state-of-the-art performance on MSVD and outperforms the runner-up method by 0.8 in CIDEr-D. On MSR-VTT, our method scores slightly lower than SGN [11]; however, SGN [11] requires a preprocessing phase in which negative videos, whose content differs considerably from the input videos, are extracted, whereas our method needs no such preprocessing. Table 3 compares FTAN with transformer-based methods. The training time and parameters of FTAN are only 1% and 26% of those of SWINBERT [7], while the performance gap remains acceptable. Our method strikes a good balance between efficiency and accuracy.

Qualitative Results
Figure 2 shows some examples of captions generated by our method. In the first and second examples, our model predicts the correct objects, "spongebob" and "squidward", and in the third and fourth examples it predicts the correct actions. These examples demonstrate that our method makes relatively accurate predictions of details and actions in the video.

CONCLUSION
In this paper, we propose a lightweight model named FTAN that consists of a frame attention module and a text attention module, which together aim to make full use of both visual and textual information. The frame attention module aggregates the visual features of all frames, and the text attention module exploits the textual information in all generated words, which previous methods ignore. To the best of our knowledge, we are the first to introduce an attention mechanism to extract the textual information hidden in the LSTM architecture for video captioning. We conduct experiments on popular benchmark datasets, and the results show the effectiveness and superiority of our method.

Figure 1: Architecture of FTAN. We use 2D and 3D CNNs as backbone networks to extract static and motion visual features. The visual features and the textual features are then sent to the frame attention module and the text attention module, respectively. Finally, the outputs of the frame attention module and the text attention module are concatenated as the input of the LSTM.

Figure 2: Examples of captions predicted by our method. All sample videos are from the test set of the MSR-VTT dataset, with the video ID shown at the top. GT denotes the ground-truth captions and Predicted denotes the captions predicted by our model.

Table 1: Comparison of our method and previous LSTM-based methods on the MSVD and MSR-VTT datasets. B@4, C, M and R are abbreviations of BLEU@4, CIDEr-D, METEOR and ROUGE_L, respectively. (*) Results obtained by running the open-access code provided by the authors; they are slightly lower than the results reported by the authors.

Table 2: Ablation study. FAM and TAM are abbreviations of the frame attention module and the text attention module, respectively.

Table 3: Comparison of FTAN and transformer-based methods on the MSR-VTT dataset. The training time and parameters of FTAN are only 1% and 26% of those of SWINBERT [7], while the performance gap is acceptable. (†) Training time recorded for 30 epochs. (*) Measured on one 14-second video (30 FPS, 640x480 pixels) on one TITAN XP.