Understanding I/O Direct Cache Access Performance for End Host Networking

Published: 28 February 2022

Abstract

Direct Cache Access (DCA) enables a network interface card (NIC) to load and store data directly on the processor cache, as conventional Direct Memory Access (DMA) is no longer suitable as the bridge between NIC and CPU in the era of 100 Gigabit Ethernet. As numerous I/O devices and cores compete for scarce cache resources, making the most of DCA for networking applications with varied objectives and constraints is a challenge, especially given the increasing complexity of modern cache hardware and I/O stacks. In this paper, we reverse engineer details of one commercial implementation of DCA, Intel's Data Direct I/O (DDIO), to explicate the importance of hardware-level investigation into DCA. Based on the learned knowledge of DCA and network I/O stacks, we (1) develop an analytical framework to predict the effectiveness of DCA (i.e., its hit rate) under certain hardware specifications, system configurations, and application properties; (2) measure penalties of the ineffective use of DCA (i.e., its miss penalty) to characterize its benefits; and (3) show that our reverse engineering, measurement, and model contribute to a deeper understanding of DCA, which in turn helps diagnose, optimize, and design end-host networking.

