Abstract
Cloud services are becoming increasingly globalized and datacenter workloads are expanding exponentially. GPU- and FPGA-based clouds have demonstrated improvements in power and performance by accelerating compute-intensive workloads. ASIC-based clouds are a promising way to optimize the Total Cost of Ownership (TCO) of a given datacenter computation (e.g. YouTube transcoding) by reducing both energy consumption and marginal computation cost.
The feasibility of an ASIC Cloud for a particular application is directly gated by the ability to manage the Non-Recurring Engineering (NRE) costs of designing and fabricating the ASIC, so that the NRE is significantly lower (e.g. 2X) than the TCO of the best available alternative.
In this paper, we show that technology node selection is a major tool for managing ASIC Cloud NRE, allowing the designer to trade off an accelerator's excess energy efficiency and cost-performance for lower total cost.
We explore NRE and cross-technology optimization of ASIC Clouds for four different applications: Bitcoin mining, YouTube-style video transcoding, Litecoin, and Deep Learning. We show large reductions in NRE, potentially enabling ASIC Clouds to address a wider variety of datacenter workloads. Our results suggest that advanced nodes like 16nm will lead to sub-optimal TCO for many workloads, and that the use of older nodes like 65nm can enable a greater diversity of ASIC Clouds.
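The trade-off described above can be illustrated with a minimal sketch. It is not the paper's model: the `NodeOption` type, the function names, and every number below are hypothetical placeholders, chosen only to show how a node with lower NRE but worse per-unit TCO can win at small workload scale, and how the "NRE significantly lower (e.g. 2X) than the alternative's TCO" gate might be checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    """A candidate technology node (all values are hypothetical units)."""
    name: str
    nre: float           # one-time design and fabrication setup cost
    tco_per_unit: float  # TCO per unit of workload served


def total_cost(node: NodeOption, workload_units: float) -> float:
    """One-time NRE plus the TCO that scales with the workload."""
    return node.nre + node.tco_per_unit * workload_units


def best_node(nodes: list[NodeOption], workload_units: float) -> NodeOption:
    """Pick the node with the lowest NRE + TCO for a given workload size."""
    return min(nodes, key=lambda n: total_cost(n, workload_units))


def nre_is_manageable(node: NodeOption, baseline_tco: float,
                      margin: float = 2.0) -> bool:
    """Feasibility gate: NRE must be at least `margin` times lower than
    the TCO of the best available alternative (assumed known)."""
    return node.nre * margin <= baseline_tco


# Hypothetical placeholder values, not measurements from the paper.
older = NodeOption("older_node", nre=1.0, tco_per_unit=4.0)
newer = NodeOption("newer_node", nre=10.0, tco_per_unit=1.0)
candidates = [older, newer]

small = best_node(candidates, workload_units=1.0)   # NRE dominates
large = best_node(candidates, workload_units=10.0)  # per-unit TCO dominates
print(small.name, large.name)
```

With these placeholder values the older node is the cheaper choice for the small workload and the newer node for the large one, which is the qualitative behavior the abstract describes; real crossover points depend on measured NRE and TCO figures.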
Moonwalk: NRE Optimization in ASIC Clouds
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems