Abstract
Applications requiring high-speed TCP/IP processing can easily saturate a modern server. We and others have previously suggested alleviating this problem in multiprocessor environments by dedicating a subset of the processors to perform network packet processing. The remaining processors perform only application computation, thus eliminating contention between these functions for processor resources. Applications interact with packet processing engines (PPEs) using an asynchronous I/O (AIO) programming interface which bypasses the operating system. A key attraction of this overall approach is that it exploits the architectural trend toward greater thread-level parallelism in future systems based on multi-core processors. In this paper, we conduct a detailed experimental performance analysis comparing this approach to a baseline Linux system configured according to best practice.
We have built a prototype system implementing this architecture, ETA+AIO (Embedded Transport Acceleration with Asynchronous I/O), and ported a high-performance web server to the AIO interface. Although the prototype uses modern single-core CPUs instead of future multi-core CPUs, an analysis of its performance can reveal important properties of this approach. Our experiments show that the ETA+AIO prototype has a modest advantage over the baseline Linux system in packet processing efficiency, consuming fewer CPU cycles to sustain the same throughput. This efficiency advantage enables the ETA+AIO prototype to achieve higher peak throughput than the baseline system, but only for workloads where the mix of packet processing and application processing approximately matches the allocation of CPUs in the ETA+AIO system, thereby enabling high utilization of all the CPUs.
Detailed analysis shows that the efficiency advantage of the ETA+AIO prototype, which uses one PPE CPU, comes from three sources: the avoidance of multiprocessing overheads in packet processing, the lower overhead of our AIO interface compared to standard sockets, and reduced cache misses due to processor partitioning.
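The condition stated above, that the partitioned system reaches higher peak throughput only when the workload mix matches the CPU allocation, can be illustrated with a simple capacity model. The sketch below is not taken from the paper's analysis. It assumes that peak throughput is limited purely by CPU cycles, and every function name and cycle count in it is hypothetical and chosen only for illustration.

```python
def baseline_peak(n_cpus, pkt_cycles, app_cycles, cpu_hz):
    """Peak requests/sec when every CPU performs both packet and application work."""
    return n_cpus * cpu_hz / (pkt_cycles + app_cycles)


def partitioned_peak(n_ppe, n_app, pkt_cycles_ppe, app_cycles, cpu_hz):
    """Peak requests/sec when n_ppe CPUs only process packets and n_app CPUs
    only run the application. The more heavily loaded partition is the bottleneck."""
    ppe_limit = n_ppe * cpu_hz / pkt_cycles_ppe
    app_limit = n_app * cpu_hz / app_cycles
    return min(ppe_limit, app_limit)


# Hypothetical numbers (assumptions, not measurements from the paper):
CPU_HZ = 2_000_000_000       # cycles per second per CPU
PKT_BASELINE = 100_000       # packet-processing cycles per request, baseline stack
PKT_PPE = 80_000             # fewer cycles per request on a dedicated PPE

# Balanced workload: application cost is close to the PPE packet cost,
# so one PPE CPU and one application CPU are both kept busy.
balanced_app = 80_000
balanced_base = baseline_peak(2, PKT_BASELINE, balanced_app, CPU_HZ)
balanced_part = partitioned_peak(1, 1, PKT_PPE, balanced_app, CPU_HZ)

# Application-heavy workload: the single application CPU saturates
# while the PPE CPU sits mostly idle.
heavy_app = 300_000
heavy_base = baseline_peak(2, PKT_BASELINE, heavy_app, CPU_HZ)
heavy_part = partitioned_peak(1, 1, PKT_PPE, heavy_app, CPU_HZ)

print(f"balanced:  baseline={balanced_base:.0f}/s partitioned={balanced_part:.0f}/s")
print(f"app-heavy: baseline={heavy_base:.0f}/s partitioned={heavy_part:.0f}/s")
```

Under these assumed costs the partitioned configuration comes out ahead for the balanced mix and behind for the application-heavy mix. In the second case the idle PPE capacity cannot be lent to the application, which is the utilization effect the abstract describes.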
Evaluating network processing efficiency with processor partitioning and asynchronous I/O
EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006