Abstract
In recent years, 3D shape recognition has attracted wide attention in the multimedia and computer vision communities. With the rapid development of deep learning, deep models built on various representations have achieved state-of-the-art performance. A 3D model can be represented in many modalities, such as point clouds, multiple views, and panorama views; deep models based on each modality emphasize different characteristics of the shape, and all have achieved strong recognition performance. However, these methods ignore the complementary information available when the same 3D model is represented in several modalities, and a better descriptor can be obtained by guiding training with these multiple representations. In this article, we propose MMFN, a novel multimodal fusion network for 3D shape recognition that exploits correlations between the different modalities to generate a more robust fused descriptor. In particular, we design two novel loss functions that help the model learn cross-modal correlation information during training. The first is a correlation loss, which focuses on the correlations among the descriptors generated by the different network branches; it reduces training time and improves the robustness of the fused descriptor. The second is an instance loss, which preserves the independence of each modality and uses feature differentiation to guide model learning during training. Finally, we apply a weighted fusion method that uses statistical weighting to obtain robust descriptors that maximize the advantages of the information from the different modalities. We evaluated the proposed method on the ModelNet40 and ShapeNetCore55 datasets for 3D shape classification and retrieval tasks. The experimental results and comparisons with state-of-the-art methods demonstrate the superiority of our approach.
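To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of (a) a correlation-style loss that encourages per-modality descriptors of the same shape to agree, and (b) a statistically weighted fusion of those descriptors. The function names, the cosine-similarity form of the loss, and the softmax weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def correlation_loss(descriptors):
    """Illustrative correlation loss: 1 minus the mean pairwise cosine
    similarity among per-modality descriptors of the same 3D model.
    Lower values mean the modality branches agree more closely."""
    normed = [d / np.linalg.norm(d) for d in descriptors]
    sims = [normed[i] @ normed[j]
            for i in range(len(normed))
            for j in range(i + 1, len(normed))]
    return 1.0 - float(np.mean(sims))

def weighted_fusion(descriptors, scores):
    """Fuse per-modality descriptors with softmax weights derived from
    per-modality confidence scores (a simple statistical weighting;
    the paper's exact scheme may differ)."""
    w = np.exp(scores - np.max(scores))  # numerically stable softmax
    w /= w.sum()
    return sum(wi * di for wi, di in zip(w, descriptors))
```

For example, two identical descriptors give a correlation loss of 0, and equal confidence scores reduce the fusion to a plain average of the modality descriptors.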
MMFN: Multimodal Information Fusion Networks for 3D Model Classification and Retrieval