Abstract
Direct Cache Access (DCA) enables a network interface card (NIC) to load and store data directly on the processor cache, as conventional Direct Memory Access (DMA) is no longer suitable as the bridge between NIC and CPU in the era of 100 Gigabit Ethernet. As numerous I/O devices and cores compete for scarce cache resources, making the most of DCA for networking applications with varied objectives and constraints is a challenge, especially given the increasing complexity of modern cache hardware and I/O stacks. In this paper, we reverse engineer details of one commercial implementation of DCA, Intel's Data Direct I/O (DDIO), to explicate the importance of hardware-level investigation into DCA. Based on the learned knowledge of DCA and network I/O stacks, we (1) develop an analytical framework to predict the effectiveness of DCA (i.e., its hit rate) under certain hardware specifications, system configurations, and application properties; (2) measure penalties of the ineffective use of DCA (i.e., its miss penalty) to characterize its benefits; and (3) show that our reverse engineering, measurement, and model contribute to a deeper understanding of DCA, which in turn helps diagnose, optimize, and design end-host networking.
Understanding I/O Direct Cache Access Performance for End Host Networking
SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems