research-article

Crossing Guard: Mediating Host-Accelerator Coherence Interactions

Published: 04 April 2017

Abstract

Specialized hardware accelerators have performance and energy-efficiency advantages over general-purpose processors. To fully realize these benefits and aid programmability, accelerators may share a physical and virtual address space and full cache coherence with the host system. However, allowing accelerators, particularly those designed by third parties, to communicate directly with host coherence protocols poses several problems. Host coherence protocols are complex, vary between companies, and may be proprietary, increasing the burden on accelerator designers. Bugs in the accelerator implementation may cause crashes and other serious consequences for the host system.

We propose Crossing Guard, a coherence interface between the host coherence system and accelerators. The Crossing Guard interface provides the accelerator designer with a standardized set of coherence messages that are simple enough to aid in the design of bug-free coherent caches. At the same time, the messages are sufficiently complex to allow customized and optimized accelerator caches with performance comparable to using the host protocol. The Crossing Guard hardware is implemented as part of the trusted host and provides complete safety to the host coherence system, even in the presence of a pathologically buggy accelerator cache.


        • Published in

          ACM SIGPLAN Notices, Volume 52, Issue 4
          ASPLOS '17
          April 2017
          811 pages
          ISSN:0362-1340
          EISSN:1558-1160
          DOI:10.1145/3093336
          • ACM Conferences
            ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
            April 2017
            856 pages
            ISBN:9781450344654
            DOI:10.1145/3037697

          Copyright © 2017 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 April 2017
