Abstract
Specialized hardware accelerators have performance and energy-efficiency advantages over general-purpose processors. To fully realize these benefits and aid programmability, accelerators may share a physical and virtual address space and full cache coherence with the host system. However, allowing accelerators -- particularly those designed by third parties -- to communicate directly with host coherence protocols poses several problems. Host coherence protocols are complex, vary between companies, and may be proprietary, which increases the burden on accelerator designers. Bugs in the accelerator implementation may crash the host system or have other serious consequences for it.
We propose Crossing Guard, a coherence interface between the host coherence system and accelerators. The Crossing Guard interface provides the accelerator designer with a standardized set of coherence messages that are simple enough to aid in the design of bug-free coherent caches. At the same time, the messages are sufficiently expressive to allow customized and optimized accelerator caches with performance comparable to using the host protocol. The Crossing Guard hardware is implemented as part of the trusted host and provides complete safety to the host coherence system, even in the presence of a pathologically buggy accelerator cache.
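As a rough illustration of the idea of a trusted interface that checks accelerator coherence traffic before it reaches the host, the toy model below tracks, per block, the state the guard believes the accelerator holds and drops any request that is not legal in that state. Everything here is an assumption made for illustration: the state names, the message names (`GET_SHARED`, `GET_MODIFIED`, `PUT`), the transition table, and the `ToyGuard` class are hypothetical and are not the message set or hardware design defined by Crossing Guard.

```python
from enum import Enum, auto


class AccelState(Enum):
    """State the guard believes the accelerator holds for a block (hypothetical)."""
    INVALID = auto()
    SHARED = auto()
    MODIFIED = auto()


class AccelMsg(Enum):
    """Hypothetical simplified request set; not the paper's actual messages."""
    GET_SHARED = auto()
    GET_MODIFIED = auto()
    PUT = auto()


# (current state, request) -> next state. Any pair that is absent is treated
# as a protocol violation by the accelerator.
LEGAL = {
    (AccelState.INVALID, AccelMsg.GET_SHARED): AccelState.SHARED,
    (AccelState.INVALID, AccelMsg.GET_MODIFIED): AccelState.MODIFIED,
    (AccelState.SHARED, AccelMsg.GET_MODIFIED): AccelState.MODIFIED,
    (AccelState.SHARED, AccelMsg.PUT): AccelState.INVALID,
    (AccelState.MODIFIED, AccelMsg.PUT): AccelState.INVALID,
}


class ToyGuard:
    """Toy stand-in for a trusted checker at the host/accelerator boundary."""

    def __init__(self):
        self.state = {}       # block address -> AccelState
        self.forwarded = []   # requests passed on to the (imaginary) host protocol
        self.violations = []  # requests dropped at the boundary

    def handle(self, addr, msg):
        """Forward a legal request and return True; drop an illegal one and return False."""
        current = self.state.get(addr, AccelState.INVALID)
        next_state = LEGAL.get((current, msg))
        if next_state is None:
            # The host never sees this request, so a buggy accelerator
            # cannot push the host protocol into an unexpected state.
            self.violations.append((addr, msg))
            return False
        self.state[addr] = next_state
        self.forwarded.append((addr, msg))
        return True
```

In this sketch, a well-behaved sequence such as `GET_SHARED` followed by `PUT` on one address is forwarded, while a second `PUT` on the same address is recorded as a violation and never reaches the host side. A real design would also have to handle host-initiated requests, timeouts, and data transfer, which this model omits.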
Crossing Guard: Mediating Host-Accelerator Coherence Interactions
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems