Abstract
Reconstructing three-dimensional (3D) objects from images has attracted increasing attention because of its wide applications in computer vision and robotics. Recent deep learning–based approaches have made promising progress, but they directly reconstruct the full 3D shape without considering conceptual knowledge of the object category; as a result, existing models have limited applicability and often produce unrealistic shapes. 3D objects can be represented in multiple forms, such as 3D volumes and conceptual knowledge. In this work, we show that conceptual knowledge for a category of objects, which represents objects as prototype volumes and is structured as a graph, can enhance the 3D reconstruction pipeline. We propose a novel multimodal framework that explicitly combines graph-based conceptual knowledge with deep neural networks for 3D shape reconstruction from a single RGB image. Our approach represents the conceptual knowledge of a specific category as a structure-based knowledge graph. Specifically, conceptual knowledge provides visual priors and spatial relationships that help the 3D reconstruction framework create realistic 3D shapes with enhanced details. Our 3D reconstruction framework takes an image as input. It first predicts the conceptual knowledge of the object in the image, then generates a 3D object based on the input image and the predicted conceptual knowledge. The generated 3D object satisfies two requirements: (1) it is conceptually consistent with the predicted graph, and (2) it is geometrically consistent with the input image. Extensive experiments on public datasets (i.e., ShapeNet, Pix3D, and Pascal3D+) with 13 object categories show that (1) our method outperforms the state-of-the-art methods, (2) our prototype volume-based conceptual knowledge representation is more effective, and (3) our pipeline-agnostic approach can enhance the reconstruction quality of various 3D shape reconstruction pipelines.
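The two-stage flow described in the abstract (predict a concept graph, then generate a shape consistent with both the graph and the image) can be illustrated with a toy sketch. This is not the authors' implementation: the names `ConceptGraph`, `predict_concept_graph`, and `generate_shape` are hypothetical, the learned networks are replaced by hand-written stand-ins, the "image" is reduced to a set of foreground pixels, and the part names and sizes are invented for illustration.

```python
from dataclasses import dataclass, field

Voxel = tuple[int, int, int]  # (x, y, z) index of an occupied cell
Pixel = tuple[int, int]       # (x, z) index of a foreground pixel


@dataclass
class ConceptGraph:
    """Hypothetical category-level knowledge graph (illustration only)."""
    # Node name -> prototype volume, stored as a set of occupied voxels.
    prototypes: dict[str, set[Voxel]] = field(default_factory=dict)
    # (node_a, relation, node_b) edges encoding spatial relationships.
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def box(x0, x1, y0, y1, z0, z1) -> set[Voxel]:
    """Axis-aligned block of voxels with half-open index ranges."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}


def predict_concept_graph(silhouette: set[Pixel]) -> ConceptGraph:
    """Stage 1 stand-in: a real system would use a learned predictor.

    Here the input is ignored and a fixed chair-like graph is returned.
    """
    graph = ConceptGraph()
    graph.prototypes["seat"] = box(0, 4, 0, 4, 2, 3)
    graph.prototypes["back"] = box(0, 4, 3, 4, 3, 6)
    graph.relations.append(("back", "above", "seat"))
    return graph


def generate_shape(silhouette: set[Pixel], graph: ConceptGraph) -> set[Voxel]:
    """Stage 2 stand-in: combine the concept prior with image evidence.

    The union of prototype volumes plays the role of the conceptual prior;
    voxels whose (x, z) projection falls outside the silhouette are dropped,
    which stands in for geometric consistency with the input image.
    """
    prior = set().union(*graph.prototypes.values())
    return {(x, y, z) for (x, y, z) in prior if (x, z) in silhouette}


# Toy input: the silhouette covers the seat and the lower part of the back.
silhouette = {(x, z) for x in range(0, 4) for z in range(2, 5)}
graph = predict_concept_graph(silhouette)
shape = generate_shape(silhouette, graph)
```

In this sketch every generated voxel comes from a prototype volume and projects inside the silhouette, which loosely mirrors the two consistency requirements stated in the abstract; the actual framework enforces them with learned components rather than set operations.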
Index Terms
Enhanced 3D Shape Reconstruction With Knowledge Graph of Category Concept