MMFN: Multimodal Information Fusion Networks for 3D Model Classification and Retrieval

Published: 17 December 2020

Abstract

In recent years, 3D shape recognition has attracted wide attention in the multimedia and computer vision communities. With the rapid development of deep learning, various deep models have achieved state-of-the-art performance on different representations. A 3D model can be represented in many modalities, such as point cloud, multiview, and panorama view. Deep models built on these modalities emphasize different aspects of the shape, and each has achieved high performance for 3D shape recognition. However, all of these methods ignore the complementary information available when the same 3D model is represented by multiple modalities; a better descriptor can be obtained by guiding training to consider these multiple representations jointly. In this article, we propose MMFN, a novel multimodal fusion network for 3D shape recognition that employs correlations between the different modalities to generate a more robust fused descriptor. In particular, we design two novel loss functions that help the model learn correlation information during training. The first, correlation loss, focuses on the correlations among descriptors generated from the different structures; it reduces training time and improves the robustness of the fused descriptor of the 3D model. The second, instance loss, preserves the independence of each modality and uses feature differentiation to guide model learning during training. Finally, we apply a weighted fusion method that uses statistical information to obtain a robust descriptor that maximizes the advantages of the information from the different modalities. We evaluated the proposed method on the ModelNet40 and ShapeNetCore55 datasets for 3D shape classification and retrieval tasks. The experimental results and comparisons with state-of-the-art methods demonstrate the superiority of our approach.
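The two ingredients named in the abstract, a correlation term that pulls per-modality descriptors of the same model together and a weighted fusion of those descriptors, can be sketched as follows. This is a minimal illustration under assumed definitions (pairwise cosine agreement as the correlation measure, a convex combination as the fusion), not the paper's actual formulation; the function names and weighting scheme are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def correlation_loss(descriptors):
    """Mean (1 - cosine similarity) over all modality pairs.

    Drives descriptors of the *same* 3D model produced from different
    modalities (e.g. point cloud, multiview, panorama) toward agreement.
    """
    losses = []
    n = len(descriptors)
    for i in range(n):
        for j in range(i + 1, n):
            losses.append(1.0 - cosine(descriptors[i], descriptors[j]))
    return float(np.mean(losses))


def weighted_fusion(descriptors, weights):
    """Fuse per-modality descriptors as a convex combination.

    Weights are normalized to sum to 1, so the fused descriptor stays in
    the span of the inputs regardless of the raw weight scale.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.sum([wi * d for wi, d in zip(w, descriptors)], axis=0)


# Toy example: three modality descriptors for one shape.
pc, mv, pano = (np.array([1.0, 0.2, 0.0]),
                np.array([0.9, 0.3, 0.1]),
                np.array([1.1, 0.1, 0.0]))
loss = correlation_loss([pc, mv, pano])       # near 0 when modalities agree
fused = weighted_fusion([pc, mv, pano], [1, 1, 1])
```

In practice such a term would be combined with a classification loss and the instance loss described above, and the fusion weights could themselves be learned rather than fixed.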



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 4
November 2020, 372 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3444749
Copyright © 2020 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 January 2020
• Revised: 1 July 2020
• Accepted: 1 July 2020
• Published: 17 December 2020

            Qualifiers

            • research-article
            • Research
            • Refereed
