Abstract
The performance and efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs), which are often employed to support general-purpose processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts is their ability to perform massively parallel computation. However, exploiting this advantage requires extracting the parallelism from the target algorithm, which is generally a very challenging task.
This difficulty is illustrated by the poor performance achieved on relevant multimedia algorithms such as Chambolle, a well-known algorithm employed for optical flow estimation. State-of-the-art implementations of this algorithm are generally based on GPUs but barely improve on the performance obtainable with a powerful GPP. In this article, we propose a novel approach to extracting parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. We exploit the conclusions of this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices. Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. Finally, we show by means of experimental results how the proposed analysis and parallelization approach leads to efficient, high-performance FPGA-based implementations that are orders of magnitude faster than state-of-the-art ones.
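To make the dependency and locality discussion concrete, the following is a minimal software sketch of the standard Chambolle (2004) fixed-point iteration for total-variation denoising. It is not the architectural template proposed in the article; the function names (`divergence`, `chambolle_step`, `denoise`) and the parameters `lam` and `tau` are illustrative assumptions. The point it shows is that each updated dual value at pixel (i, j) reads only a small neighbourhood of the previous iterate, which is the kind of local dependency an FPGA design can exploit.

```python
import math


def divergence(px, py):
    """Discrete divergence with the boundary rules used by Chambolle (2004)."""
    h, w = len(px), len(px[0])
    div = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Backward difference along the row index.
            if i == 0:
                dx = px[i][j]
            elif i == h - 1:
                dx = -px[i - 1][j]
            else:
                dx = px[i][j] - px[i - 1][j]
            # Backward difference along the column index.
            if j == 0:
                dy = py[i][j]
            elif j == w - 1:
                dy = -py[i][j - 1]
            else:
                dy = py[i][j] - py[i][j - 1]
            div[i][j] = dx + dy
    return div


def chambolle_step(px, py, g, lam, tau=0.125):
    """One iteration: p <- (p + tau*grad(v)) / (1 + tau*|grad(v)|), v = div p - g/lam."""
    h, w = len(g), len(g[0])
    div = divergence(px, py)
    v = [[div[i][j] - g[i][j] / lam for j in range(w)] for i in range(h)]
    new_px = [[0.0] * w for _ in range(h)]
    new_py = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Forward differences; zero on the last row and last column.
            gx = v[i + 1][j] - v[i][j] if i < h - 1 else 0.0
            gy = v[i][j + 1] - v[i][j] if j < w - 1 else 0.0
            denom = 1.0 + tau * math.hypot(gx, gy)
            new_px[i][j] = (px[i][j] + tau * gx) / denom
            new_py[i][j] = (py[i][j] + tau * gy) / denom
    return new_px, new_py


def denoise(g, lam, iterations=50, tau=0.125):
    """Return u = g - lam * div(p) after a fixed number of iterations."""
    h, w = len(g), len(g[0])
    px = [[0.0] * w for _ in range(h)]
    py = [[0.0] * w for _ in range(h)]
    for _ in range(iterations):
        px, py = chambolle_step(px, py, g, lam, tau)
    div = divergence(px, py)
    return [[g[i][j] - lam * div[i][j] for j in range(w)] for i in range(h)]
```

Because every output pixel of `chambolle_step` depends only on nearby entries of the previous `px`, `py`, and `g`, all pixels of one iteration can in principle be computed concurrently; the sequential dependency is between iterations, not between pixels.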
Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices