Abstract
Application requirements, such as real-time response, are pushing wearable devices to use more powerful processors inside the SoC (system on chip). However, existing wearable devices perform too poorly for such demanding applications, and conventional high-performance many-core architectures are not suitable either because of the stringent power budget in this domain. We propose LOCUS, a low-power, customizable, many-core processor for next-generation wearable devices. LOCUS combines customizable processor cores with a customizable network on a message-passing architecture to deliver highly competitive performance/watt: on average, a 3.1× improvement over the quad-core ARM processors used in state-of-the-art wearable devices. A combination of full-system simulation with representative applications from the wearable domain and RTL synthesis of the architecture shows that a 16-core LOCUS achieves an average 1.52× performance/watt improvement over a conventional 16-core shared-memory many-core architecture. We also propose a dynamic power management mechanism that further reduces power consumption in both computation and communication, improving the performance/watt of LOCUS by 1.17×.
LOCUS: Low-Power Customizable Many-Core Architecture for Wearables