Abstract
Big data processing on hardware has gained immense interest in the hardware research community because of its fast processing and reconfigurability. Although hardware can reduce computation latency, the cost of big data processing is dominated by data transfers. In this article, we propose a low-overhead framework based on compressive sensing (CS) that reduces data transfers by up to 67% without affecting signal quality. CS has two important kernels: “sensing” and “reconstruction.” We focus on CS reconstruction using the orthogonal matching pursuit (OMP) algorithm. We implement the OMP CS reconstruction algorithm on a domain-specific PENC many-core platform and on a low-power Jetson TK1 platform consisting of an ARM CPU and a K1 GPU. Detailed performance analysis of the OMP algorithm on each platform suggests that the PENC many-core platform consumes 15× and 18× less energy and reconstructs 16× and 8× faster than the low-power ARM CPU and K1 GPU, respectively. Furthermore, we implement the proposed CS-based framework on a heterogeneous architecture in which the PENC many-core architecture is used as an “accelerator” and processing is performed on the ARM CPU platform. For demonstration, we integrate the proposed CS-based framework with a Hadoop MapReduce platform for a face detection application. The results show that the proposed CS-based framework with the PENC many-core as an accelerator achieves a 26.15% data storage/transfer reduction, with an execution time and energy consumption overhead of 3.7% and 0.002%, respectively, for 5,000 image transfers. Compared to the CS-based framework implementation on the low-power Jetson TK1 ARM CPU+GPU platform, the PENC many-core implementation is 2.3× faster for the image reconstruction part, while achieving 29% higher performance and 34% better energy efficiency for the complete face detection application on the Hadoop MapReduce platform.
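The two CS kernels named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' PENC or Jetson TK1 implementation; it only shows the general shape of OMP under stated assumptions. The function name `omp`, the Gaussian sensing matrix `Phi`, and the sizes `n`, `m`, and `k` are hypothetical choices made for illustration.

```python
import numpy as np

def omp(Phi, y, sparsity, tol=1e-10):
    """Estimate a sparse x such that y ≈ Phi @ x (illustrative OMP sketch)."""
    n = Phi.shape[1]
    residual = np.asarray(y, dtype=float).copy()
    support = []            # indices of the columns selected so far
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Select the column most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0   # never pick a column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit of y on the selected columns.
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coeffs
        if np.linalg.norm(residual) < tol:
            break
    x_hat = np.zeros(n)
    x_hat[support] = coeffs
    return x_hat

# Assumed toy problem: a length-256 signal with 4 nonzeros, 128 measurements.
rng = np.random.default_rng(0)
n, m, k = 256, 128, 4
Phi = rng.standard_normal((m, n))
Phi /= np.linalg.norm(Phi, axis=0)        # unit-norm columns
x_true = np.zeros(n)
idx = rng.choice(n, size=k, replace=False)
x_true[idx] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))

y = Phi @ x_true                          # "sensing": only m values are kept
x_hat = omp(Phi, y, sparsity=k)           # "reconstruction"
print(np.linalg.norm(x_hat - x_true))     # reconstruction error
```

In this sketch only the `m` measurements in `y` would need to be stored or transferred, and the least-squares step inside the loop is the part that hardware implementations typically spend the most effort accelerating.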