Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

Published: 07 March 2016

Abstract

The performance and efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs), which are often employed to support the tasks of general-purpose processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm, which is generally a very challenging task.

This difficulty is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms, such as Chambolle, a well-known algorithm employed for optical flow estimation. State-of-the-art implementations of this algorithm are generally based on GPUs but barely improve on the performance that can be obtained with a powerful GPP. In this article, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. We exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices. Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. Finally, we show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient and high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.
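To make the notion of data dependencies and locality concrete, the following is a minimal sketch of the fixed-point iteration from Chambolle (2004) for total variation minimization, written in plain Python. It is not the architecture proposed in the article: the function names (`gradient`, `divergence`, `chambolle_denoise`), the parameters `lam`, `tau`, and `iters`, and the list-of-lists image layout are all assumptions made for illustration. The point of the sketch is that each update of the dual variable at a pixel reads only that pixel's immediate neighbours, which is the kind of locality a parallel mapping can exploit.

```python
import math


def gradient(u):
    """Forward differences; set to zero on the last row and column."""
    h, w = len(u), len(u[0])
    gx = [[u[i + 1][j] - u[i][j] if i < h - 1 else 0.0 for j in range(w)]
          for i in range(h)]
    gy = [[u[i][j + 1] - u[i][j] if j < w - 1 else 0.0 for j in range(w)]
          for i in range(h)]
    return gx, gy


def divergence(px, py):
    """Discrete divergence, the negative adjoint of `gradient`."""
    h, w = len(px), len(px[0])
    d = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if i == 0:
                dx = px[i][j]
            elif i == h - 1:
                dx = -px[i - 1][j]
            else:
                dx = px[i][j] - px[i - 1][j]
            if j == 0:
                dy = py[i][j]
            elif j == w - 1:
                dy = -py[i][j - 1]
            else:
                dy = py[i][j] - py[i][j - 1]
            d[i][j] = dx + dy
    return d


def chambolle_denoise(g, lam, tau=0.125, iters=50):
    """Illustrative TV denoising of image g (list of rows of floats).

    Iterates p <- (p + tau * grad(div p - g / lam))
                  / (1 + tau * |grad(div p - g / lam)|)
    and returns u = g - lam * div p. tau = 1/8 is the step bound
    under which Chambolle proves convergence.
    """
    h, w = len(g), len(g[0])
    px = [[0.0] * w for _ in range(h)]
    py = [[0.0] * w for _ in range(h)]
    for _ in range(iters):
        d = divergence(px, py)
        v = [[d[i][j] - g[i][j] / lam for j in range(w)] for i in range(h)]
        gx, gy = gradient(v)
        # Each pixel's update depends only on values computed from its
        # direct neighbours, so all pixels can be updated independently
        # within one iteration.
        for i in range(h):
            for j in range(w):
                norm = math.sqrt(gx[i][j] ** 2 + gy[i][j] ** 2)
                px[i][j] = (px[i][j] + tau * gx[i][j]) / (1.0 + tau * norm)
                py[i][j] = (py[i][j] + tau * gy[i][j]) / (1.0 + tau * norm)
    d = divergence(px, py)
    return [[g[i][j] - lam * d[i][j] for j in range(w)] for i in range(h)]
```

Within one iteration every pixel can be processed independently, but iteration n+1 at a pixel needs the results of iteration n at its neighbours. This sketch only exposes that dependency pattern; it does not attempt to reproduce the template or the expansion rate metric described in the abstract.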

