Abstract
Egocentric early action prediction aims to recognize an action from the first-person view after observing only a partial video segment, a task made challenging by the limited context such a segment provides. In this article, we propose a novel multi-modal adversarial knowledge distillation framework to tackle this problem. In particular, our approach comprises a teacher network, which learns an enhanced representation of the partial video by additionally considering the unobserved future segment, and a student network, which mimics the teacher to produce a powerful representation of the partial video and predicts the action label from it. To promote knowledge distillation between the teacher and the student, we seamlessly integrate adversarial learning with latent and discriminative knowledge regularizations, encouraging the learned representations of the partial video to be more informative and discriminative for action prediction. Finally, we devise a multi-modal fusion module to comprehensively predict the action label. Extensive experiments on two public egocentric datasets validate the superiority of our method over state-of-the-art methods. We have released the code and parameters to benefit other researchers.1
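The student's training objective described above can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation; the function names, toy feature vectors, and loss weights below are hypothetical. It shows the three ingredients the abstract names: a latent (mimicry) term pulling student features toward teacher features, a discriminative (classification) term on the predicted action, and a generator-style adversarial term that rewards the student for fooling a teacher-vs-student discriminator.

```python
import math

def mse(student_feat, teacher_feat):
    # Latent knowledge regularization: distance between the student's and
    # the teacher's representations of the same partial video.
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

def cross_entropy(action_probs, label):
    # Discriminative knowledge regularization: classification loss on the
    # action predicted from the student's representation.
    return -math.log(action_probs[label])

def adversarial_term(disc_score_on_student):
    # Non-saturating generator loss: the student wants the discriminator to
    # score its features as "teacher-like" (close to 1).
    return -math.log(disc_score_on_student)

def student_objective(s_feat, t_feat, action_probs, label, disc_score,
                      w_mimic=1.0, w_cls=1.0, w_adv=0.1):
    # Weighted sum of the three terms; the weights here are illustrative.
    return (w_mimic * mse(s_feat, t_feat)
            + w_cls * cross_entropy(action_probs, label)
            + w_adv * adversarial_term(disc_score))

# Toy example with made-up numbers.
loss = student_objective(
    s_feat=[0.2, 0.4], t_feat=[0.25, 0.35],
    action_probs=[0.7, 0.2, 0.1], label=0,
    disc_score=0.8)
```

In practice the discriminator would be trained in alternation with the student (maximizing its ability to separate teacher features from student features), which is the adversarial game the framework relies on.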
Egocentric Early Action Prediction via Adversarial Knowledge Distillation