When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Abstract
We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual temporal localization annotations for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method that localizes narrated actions based on their expected duration. Through several experiments and analyses, we show that our method provides information complementary to previous methods and improves over prior work on temporal action localization.
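The duration-informed idea can be illustrated with a minimal sketch. The interface, the per-verb duration table, and the "center a window on the narration timestamp" heuristic below are illustrative assumptions, not the paper's exact formulation:

```python
from dataclasses import dataclass

@dataclass
class NarratedAction:
    """A narrated action mention and its transcript timestamp (seconds)."""
    verb: str
    narration_time: float  # when the action is mentioned in the narration

# Hypothetical expected durations per action verb (seconds),
# e.g. estimated from annotated training data.
EXPECTED_DURATION = {"chop": 12.0, "pour": 4.0, "open": 2.0}
DEFAULT_DURATION = 6.0

def localize(action: NarratedAction) -> tuple[float, float]:
    """Predict a (start, end) interval for a narrated action.

    Sketch of a duration-informed baseline: place a window of the
    action's expected duration around the narration timestamp.
    """
    duration = EXPECTED_DURATION.get(action.verb, DEFAULT_DURATION)
    start = max(0.0, action.narration_time - duration / 2)
    return start, start + duration

print(localize(NarratedAction("chop", narration_time=35.0)))  # (29.0, 41.0)
```

Such a baseline uses only the narration timing and a prior over how long each action typically takes; richer models would combine it with visual and textual evidence.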