Abstract
Embedded three-dimensional (3D) Computer Vision (CV) is considered a technology enabler for future consumer applications and is attracting wide interest in academia and industry. However, 3D CV processing is computation-intensive. Its high computational cost stems directly from the processing of 3D point clouds, with 3D descriptor computation representing one of the main bottlenecks. Understanding the main computational challenges of 3D CV applications, as well as the key characteristics, enabling features, and limitations of current computing platforms, is essential for identifying the directions in which future embedded processing systems targeting 3D CV should evolve.
In this work, an innovative and complex 3D descriptor, called SHOT, has been ported to both a high-end and an embedded computing platform. The high-end system is composed of a high-performance Intel CPU coupled with an NVIDIA GPU. The embedded platform is instead composed of an ARM-based processor coupled with the STHORM accelerator. STHORM is a many-core low-power accelerator developed by STMicroelectronics, featuring up to 64 computational units. The SHOT descriptor has been parallelized using the OpenCL programming model on both platforms.
Finally, we have performed an in-depth performance comparison and analysis of general-purpose processors and accelerators in both the high-end and embedded domains, highlighting the main differences in Hardware/Software (HW/SW) design methodologies and approaches between high-end and embedded systems targeting 3D CV applications.
Index Terms
3D CV Descriptor on Parallel Heterogeneous Platforms