Research Article

Bidirectional Transformer GAN for Long-term Human Motion Prediction

Published: 15 April 2023

Abstract

Mainstream motion prediction methods usually focus on short-term prediction, and their predicted long-term motions often collapse to an average pose, i.e., the freezing forecasting problem [27]. To mitigate this problem, we propose a novel Bidirectional Transformer-based Generative Adversarial Network (BiTGAN) for long-term human motion prediction. The bidirectional setup leads to consistent and smooth generation in both the forward and backward directions. In addition, to make full use of the historical motions, we split them into two parts: the first part is fed to the Transformer encoder in our BiTGAN, while the second part is used as the decoder input. This strategy alleviates the exposure bias problem [37]. Furthermore, to better maintain both the local (i.e., frame-level pose) and global (i.e., video-level semantic) similarities between the predicted motion sequence and the real one, the soft dynamic time warping (Soft-DTW) loss is introduced into the generator. Finally, we utilize a dual discriminator to distinguish the predicted sequence from the real one at both the frame and sequence levels. Extensive experiments on the public Human3.6M dataset demonstrate that our proposed BiTGAN achieves state-of-the-art performance on long-term (4s) human motion prediction and reduces the average error over all actions by 4%.
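The Soft-DTW loss mentioned above follows Cuturi and Blondel [6]: it replaces the hard minimum in the classic dynamic-time-warping recursion with a smoothed soft minimum, making the alignment cost differentiable and therefore usable as a training loss over whole motion sequences. The paper's generator, discriminators, and exact loss weighting are not reproduced here; as a rough illustration only (not the authors' code), the NumPy sketch below implements the standard Soft-DTW forward recursion. The function name, the gamma value, and the pose dimensionality are illustrative assumptions.

import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two pose sequences (after Cuturi & Blondel [6]).

    x: (n, d) array, one d-dimensional pose per frame.
    y: (m, d) array.
    gamma: smoothing strength; as gamma -> 0 this approaches classic DTW.
    """
    n, m = len(x), len(y)
    # Pairwise squared Euclidean cost between every pair of frames.
    D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    # R[i, j] holds the soft-minimal alignment cost of the prefixes x[:i], y[:j].
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft minimum over the three predecessor cells,
            # computed as a numerically stable log-sum-exp.
            r = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            rmin = r.min()
            softmin = rmin - gamma * np.log(np.exp(-(r - rmin) / gamma).sum())
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]

# Hypothetical usage: score a predicted clip against the ground truth.
pred = np.random.randn(100, 48)  # 100 predicted frames, 48-D poses (assumed shape)
real = np.random.randn(100, 48)
print(soft_dtw(pred, real, gamma=0.1))

In practice one would use a differentiable (e.g., PyTorch) implementation so that gradients flow back into the generator. Because the soft minimum attends to all plausible alignments, the loss penalizes both frame-level pose errors and sequence-level timing drift, which matches the local/global similarity motivation stated in the abstract.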

REFERENCES

  [1] Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations.
  [2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3722–3731.
  [3] Judith Butepage, Michael J. Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.
  [4] Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, and Nadia Magnenat-Thalmann. 2020. Learning progressive joint propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision. Springer, 226–242.
  [5] Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. 2020. Long-term human motion prediction with scene context. In Proceedings of the European Conference on Computer Vision. Springer, 387–404.
  [6] Marco Cuturi and Mathieu Blondel. 2017. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning. PMLR, 894–903.
  [7] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR abs/1901.02860 (2019).
  [8] Lingwei Dang, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2021. MSR-GCN: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11467–11476.
  [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
  [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations.
  [11] Yinglin Duan, Tianyang Shi, Zhengxia Zou, Yenan Lin, Zhehui Qian, Bohan Zhang, and Yi Yuan. 2021. Single-shot motion completion with Transformer. CoRR abs/2103.00776 (2021).
  [12] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
  [13] Kaifeng Gao, Long Chen, Yifeng Huang, and Jun Xiao. 2021. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM International Conference on Multimedia. 4833–4837.
  [14] Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G. Ororbia. 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12116–12125.
  [15] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José M. F. Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision. 786–803.
  [16] James N. Ingram, Konrad P. Körding, Ian S. Howard, and Daniel M. Wolpert. 2008. The statistics of natural hand movements. Experimental Brain Research 188, 2 (2008), 223–236.
  [17] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
  [18] Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
  [19] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. ACM Computing Surveys 54, 10s (2021), 200:1–200:41.
  [20] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980.
  [21] Hema S. Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2015), 14–29.
  [22] Jogendra Nath Kundu, Maharshi Gor, and R. Venkatesh Babu. 2019. BiHMP-GAN: Bidirectional 3D human motion prediction GAN. In Proceedings of the AAAI Conference on Artificial Intelligence. 8553–8560.
  [23] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
  [24] Jiachen Li, Fan Yang, Hengbo Ma, Srikanth Malla, Masayoshi Tomizuka, and Chiho Choi. 2021. RAIN: Reinforced hybrid attention inference network for motion forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16096–16106.
  [25] Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, and Hao Li. 2020. Learning to generate diverse dance motions with transformer. CoRR abs/2008.08171 (2020).
  [26] Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 214–223.
  [27] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. 2021. Learn to dance with AIST++: Music conditioned 3D dance generation. arXiv:2101.08779. Retrieved from https://arxiv.org/abs/2101.08779.
  [28] Hongyi Liu and Lihui Wang. 2017. Human motion prediction for human-robot collaboration. Journal of Manufacturing Systems 44 (2017), 287–294. https://www.sciencedirect.com/science/article/pii/S0278612517300481.
  [29] Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. 2021. Multimodal motion prediction with stacked transformers. CoRR abs/2103.11624 (2021).
  [30] Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10004–10012.
  [31] Kedi Lyu, Zhenguang Liu, Shuang Wu, Haipeng Chen, Xuhong Zhang, and Yuyu Yin. 2021. Learning human motion prediction via stochastic differential equations. In Proceedings of the 29th ACM International Conference on Multimedia. 4976–4984.
  [32] Xin Man, Deqiang Ouyang, Xiangpeng Li, Jingkuan Song, and Jie Shao. 2022. Scenario-aware recurrent transformer for goal-directed video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4 (2022), 1–17.
  [33] Wei Mao, Miaomiao Liu, and Mathieu Salzmann. 2020. History repeats itself: Human motion prediction via motion attention. In Proceedings of the European Conference on Computer Vision. Springer, 474–489.
  [34] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9489–9497.
  [35] Julieta Martinez, Michael J. Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
  [36] Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and Stéphane Marchand-Maillet. 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–23.
  [37] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
  [38] Xuanchi Ren, Haoran Li, Zijian Huang, and Qifeng Chen. 2020. Self-supervised dance video synthesis conditioned on music. In Proceedings of the 28th ACM International Conference on Multimedia. 46–54.
  [39] Theodoros Sofianos, Alessio Sampieri, Luca Franco, and Fabio Galasso. 2021. Space-time-separable graph convolutional network for pose forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11209–11218.
  [40] Pengxiang Su, Zhenguang Liu, Shuang Wu, Lei Zhu, Yifang Yin, and Xuanjing Shen. 2021. Motion prediction via joint dependency modeling in phase space. In Proceedings of the 29th ACM International Conference on Multimedia. 713–721.
  [41] Hao Tang, Song Bai, Li Zhang, Philip H. S. Torr, and Nicu Sebe. 2020. XingGAN for person image generation. In Proceedings of the European Conference on Computer Vision. Springer, 717–734.
  [42] Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. 2018. Dual generator generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Asian Conference on Computer Vision. Springer, 3–21.
  [43] Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 935–941.
  [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30. 5998–6008.
  [45] Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. 2018. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision. 601–617.
  [46] Borui Wang, Ehsan Adeli, Hsu-kuang Chiu, De-An Huang, and Juan Carlos Niebles. 2019. Imitation learning for human pose prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7124–7133.
  [47] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237 (2019).
  [48] Bo Zhang, Rui Zhang, Niccolo Bisagno, Nicola Conci, Francesco G. B. De Natale, and Hongbo Liu. 2021. Where are they going? Predicting human behaviors in crowded scenes. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–19.
  [49] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5
  September 2023, 262 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3585398
  Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 April 2023
      • Online AM: 10 January 2023
      • Accepted: 24 December 2022
      • Revised: 20 October 2022
      • Received: 8 April 2022
Published in TOMM Volume 19, Issue 5

