Abstract
Multimodal sequence analysis aims to draw inferences from visual, language, and acoustic sequences. Most existing work focuses on fusing the three modalities after alignment to explore inter-modal interactions, which is impractical in real-world scenarios. To overcome this issue, we focus on analyzing unaligned sequences, a setting that is still relatively underexplored and more challenging. We propose Multimodal Graph, whose novelty mainly lies in transforming the sequential learning problem into a graph learning problem. The graph-based structure enables parallel computation along the time dimension (unlike recurrent neural networks) and can effectively learn longer intra- and inter-modal temporal dependencies in unaligned sequences. First, we propose multiple ways to construct the adjacency matrix of a sequence, thereby performing the sequence-to-graph transformation. To learn intra-modal dynamics, a graph convolutional network is employed for each modality based on the defined adjacency matrix. To learn inter-modal dynamics, since the unimodal sequences are unaligned, the commonly considered word-level fusion does not apply. We therefore devise graph pooling algorithms that automatically explore the associations between time slices from different modalities and hierarchically learn high-level graph representations. Multimodal Graph outperforms state-of-the-art models on three datasets under the same experimental setting.
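The pipeline sketched in the abstract (sequence-to-graph transformation, per-modality graph convolution, then pooling to a graph-level representation) can be illustrated in miniature. The sketch below is not the paper's implementation: the temporal-window adjacency, the symmetric normalization, and the mean pooling (a crude stand-in for the learned graph pooling) are all illustrative assumptions.

```python
import numpy as np

def temporal_adjacency(T, window=2):
    """Adjacency for a length-T sequence: connect each time step to
    neighbors within `window` steps, including a self-loop.
    The window size is an illustrative choice, not the paper's."""
    A = np.zeros((T, T))
    for i in range(T):
        for j in range(max(0, i - window), min(T, i + window + 1)):
            A[i, j] = 1.0
    return A

def gcn_layer(A, X, W):
    """One graph-convolution step with symmetric normalization:
    D^{-1/2} A D^{-1/2} X W followed by ReLU."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)

def mean_pool(H):
    """Collapse the node (time) dimension to a single graph-level
    vector; a placeholder for the paper's learned pooling."""
    return H.mean(axis=0)

rng = np.random.default_rng(0)
T, d_in, d_out = 6, 4, 3
A = temporal_adjacency(T)                 # sequence -> graph
X = rng.standard_normal((T, d_in))        # one modality's features over time
W = rng.standard_normal((d_in, d_out))    # learnable weights (random here)
g = mean_pool(gcn_layer(A, X, W))         # graph-level representation
print(g.shape)  # (3,)
```

Because every time step only depends on the fixed adjacency, all nodes are processed in one matrix product, which is the parallelism-over-time advantage the abstract contrasts with recurrent networks.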
Multimodal Graph for Unaligned Multimodal Sequence Analysis via Graph Convolution and Graph Pooling