
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs

Published: 02 November 2022

Abstract

We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual temporal localization annotations for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method that localizes narrated actions based on their expected duration. Through several experiments and analyses, we show that our method provides information complementary to previous methods and leads to improvements over previous work on the task of temporal action localization.
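
To make the duration-based idea concrete, below is a minimal, hypothetical Python sketch of how an expected-duration prior could be turned into a predicted interval around the time an action is narrated. The per-verb duration table, the function name, and the centering/clipping heuristic are illustrative assumptions, not the method described in the article.

    # Hypothetical sketch (not the article's method): predict an interval for a
    # narrated action by centering a window of the action's expected duration
    # on the time the action is mentioned in the transcript.

    # Assumed lookup of expected durations (in seconds) per action verb.
    EXPECTED_DURATION = {"chop": 20.0, "stir": 15.0, "pour": 5.0}
    DEFAULT_DURATION = 10.0  # fallback for unseen action verbs

    def localize_by_duration(narration_time, action_verb, clip_length):
        """Return a (start, end) interval, clipped to the clip boundaries."""
        duration = EXPECTED_DURATION.get(action_verb, DEFAULT_DURATION)
        start = max(0.0, narration_time - duration / 2)
        end = min(clip_length, narration_time + duration / 2)
        return start, end

    # Example: "chop" is narrated 42 s into a 60-second clip -> (32.0, 52.0)
    print(localize_by_duration(42.0, "chop", 60.0))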



    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3s
      October 2022, 381 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3567476
      • Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 November 2022
      • Online AM: 18 February 2022
      • Accepted: 1 November 2021
      • Revised: 29 September 2021
      • Received: 15 February 2021
      Published in TOMM Volume 18, Issue 3s

      Qualifiers

      • research-article
      • Refereed
