Abstract
Mainstream motion prediction methods usually focus on short-term prediction, and their long-term predictions often collapse into an average pose, i.e., the freezing forecasting problem [27]. To mitigate this problem, we propose a novel Bidirectional Transformer-based Generative Adversarial Network (BiTGAN) for long-term human motion prediction. The bidirectional setup leads to consistent and smooth generation in both the forward and backward directions. In addition, to make full use of the historical motion, we split it into two parts: the first part is fed to the Transformer encoder in BiTGAN, while the second part is used as the decoder input. This strategy alleviates the exposure bias problem [37]. To better maintain both the local (i.e., frame-level pose) and global (i.e., video-level semantic) similarities between the predicted motion sequence and the real one, we introduce the soft dynamic time warping (Soft-DTW) loss into the generator. Finally, we use a dual discriminator to distinguish the predicted sequence from the real one at both the frame and sequence levels. Extensive experiments on the public Human3.6M dataset demonstrate that BiTGAN achieves state-of-the-art performance on long-term (4 s) human motion prediction and reduces the average error across all actions by 4%.
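The abstract does not spell out the Soft-DTW formulation, so the following is only a minimal sketch of the standard Soft-DTW recurrence from [6], not the BiTGAN implementation. The squared Euclidean frame cost, the smoothing parameter `gamma`, and the function names `soft_min` and `soft_dtw` are assumptions made for illustration. Only the forward value is computed; a training loss would also need gradients, typically from an autodiff framework.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = -np.asarray(values, dtype=float) / gamma
    peak = scaled.max()
    return -gamma * (peak + np.log(np.exp(scaled - peak).sum()))


def soft_dtw(pred, target, gamma=1.0):
    """Soft-DTW value between two pose sequences of shape (frames, features).

    Assumption: the per-frame cost is the squared Euclidean distance.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(pred), len(target)

    # Pairwise frame-to-frame costs (local, frame-level similarity).
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)

    # r[i, j] holds the soft alignment cost of the first i and first j frames
    # (global, sequence-level similarity accumulates along the table).
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]
```

As `gamma` approaches zero the smoothed minimum approaches the hard minimum, so the value approaches the classical DTW cost; larger values of `gamma` give a smoother objective that is always at most the hard DTW cost.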
References
- [1] . 2017. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations.
- [2] . 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3722–3731.
- [3] . 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.
- [4] . 2020. Learning progressive joint propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision. Springer, 226–242.
- [5] . 2020. Long-term human motion prediction with scene context. In Proceedings of the European Conference on Computer Vision. Springer, 387–404.
- [6] . 2017. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning. PMLR, 894–903.
- [7] . 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR abs/1901.02860 (2019).
- [8] . 2021. MSR-GCN: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11467–11476.
- [9] . 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
- [10] . 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations.
- [11] Yinglin Duan, Tianyang Shi, Zhengxia Zou, Yenan Lin, Zhehui Qian, Bohan Zhang, and Yi Yuan. 2021. Single-Shot Motion Completion with Transformer. CoRR abs/2103.00776 (2021).
- [12] . 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
- [13] . 2021. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM International Conference on Multimedia. 4833–4837.
- [14] . 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12116–12125.
- [15] . 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision. 786–803.
- [16] . 2008. The statistics of natural hand movements. Experimental Brain Research 188, 2 (2008), 223–236.
- [17] . 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
- [18] . 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
- [19] . 2021. Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54, 10s (2021), 200:1–200:41.
- [20] . 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980.
- [21] . 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2015), 14–29.
- [22] . 2019. BiHMP-GAN: Bidirectional 3D human motion prediction GAN. In Proceedings of the AAAI Conference on Artificial Intelligence. 8553–8560.
- [23] . 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
- [24] . 2021. RAIN: Reinforced hybrid attention inference network for motion forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16096–16106.
- [25] . 2020. Learning to generate diverse dance motions with transformer. CoRR abs/2008.08171 (2020).
- [26] . 2020. Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 214–223.
- [27] . 2021. Learn to dance with AIST++: Music conditioned 3D dance generation. arXiv:2101.08779. Retrieved from https://arxiv.org/abs/2101.08779.
- [28] . 2017. Human motion prediction for human-robot collaboration. Journal of Manufacturing Systems 44 (2017), 287–294. https://www.sciencedirect.com/science/article/pii/S0278612517300481.
- [29] . 2021. Multimodal motion prediction with stacked transformers. CoRR abs/2103.11624 (2021).
- [30] . 2019. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10004–10012.
- [31] . 2021. Learning human motion prediction via stochastic differential equations. In Proceedings of the 29th ACM International Conference on Multimedia. 4976–4984.
- [32] . 2022. Scenario-aware recurrent transformer for goal-directed video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4 (2022), 1–17.
- [33] . 2020. History repeats itself: Human motion prediction via motion attention. In Proceedings of the European Conference on Computer Vision. Springer, 474–489.
- [34] . 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9489–9497.
- [35] . 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
- [36] . 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–23.
- [37] . 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
- [38] . 2020. Self-supervised dance video synthesis conditioned on music. In Proceedings of the 28th ACM International Conference on Multimedia. 46–54.
- [39] . 2021. Space-time-separable graph convolutional network for pose forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11209–11218.
- [40] . 2021. Motion prediction via joint dependency modeling in phase space. In Proceedings of the 29th ACM International Conference on Multimedia. 713–721.
- [41] . 2020. XingGAN for person image generation. In Proceedings of the European Conference on Computer Vision. Springer, 717–734.
- [42] . 2018. Dual generator generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Asian Conference on Computer Vision. Springer, 3–21.
- [43] . 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 935–941.
- [44] . 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 5998–6008.
- [45] . 2018. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision. 601–617.
- [46] . 2019. Imitation learning for human pose prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7124–7133.
- [47] . 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237 (2019).
- [48] . 2021. Where are they going? Predicting human behaviors in crowded scenes. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–19.
- [49] . 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
Bidirectional Transformer GAN for Long-term Human Motion Prediction