Abstract
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors, or when instrumentation is introduced. We propose the ProtoFlex simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware either become too complex when scaled to large configurations or require significant effort to provide full-system support. In contrast, ProtoFlex virtualizes the execution of many logical processors onto a smaller number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be scaled as needed to deliver the necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting implements only the frequently encountered behaviors in the FPGA, while a software simulator preserves the abstraction of a complete system.
We have created a first instance of the ProtoFlex simulation architecture: an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server, hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 38× speedup (and as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server. We also demonstrate the advantages of minimal-overhead FPGA-accelerated instrumentation through a CMP cache simulation technique that runs orders of magnitude faster than software.
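The two ideas named in the abstract, virtualizing many logical processors onto a few multiple-context engines and transplanting rare behaviors to a software simulator, can be illustrated with a small software toy model. This is only a sketch under stated assumptions and not the ProtoFlex implementation: the names `Context`, `Engine`, `FAST_PATH_OPS`, `software_fallback`, and `simulate`, the round-robin interleaving policy, and the symbolic instruction names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical: the "frequently encountered" operations an engine handles itself.
FAST_PATH_OPS = {"add", "load", "store"}


@dataclass
class Context:
    """Architectural state of one logical (simulated) processor."""
    cpu_id: int
    program: list  # symbolic instruction names, for illustration only
    pc: int = 0


@dataclass
class Engine:
    """One multiple-context execution engine hosting several logical CPUs."""
    contexts: deque = field(default_factory=deque)
    fast_ops: int = 0      # instructions executed on the engine itself
    transplants: int = 0   # instructions handed to the software fallback


def software_fallback(ctx: Context, op: str) -> None:
    """Stand-in for a full software simulator completing a rare operation."""
    ctx.pc += 1


def step(engine: Engine) -> bool:
    """Issue one instruction from the next context in round-robin order."""
    if not engine.contexts:
        return False
    ctx = engine.contexts.popleft()
    op = ctx.program[ctx.pc]
    if op in FAST_PATH_OPS:
        ctx.pc += 1
        engine.fast_ops += 1
    else:
        # "Transplant": the rare operation is completed in software,
        # then the context returns to the engine.
        software_fallback(ctx, op)
        engine.transplants += 1
    if ctx.pc < len(ctx.program):
        engine.contexts.append(ctx)  # context still has work; requeue it
    return True


def simulate(num_cpus: int, num_engines: int, program: list) -> list:
    """Map num_cpus logical processors onto num_engines engines and run them."""
    engines = [Engine() for _ in range(num_engines)]
    for cpu in range(num_cpus):
        engines[cpu % num_engines].contexts.append(Context(cpu, list(program)))
    for engine in engines:
        while step(engine):
            pass
    return engines


if __name__ == "__main__":
    result = simulate(num_cpus=16, num_engines=2,
                      program=["add", "load", "io", "store"])
    for i, engine in enumerate(result):
        print(i, engine.fast_ops, engine.transplants)
```

In this toy model, changing `num_engines` changes how many logical processors share each engine without changing the simulated system, which mirrors the abstract's point that the number of engines can be scaled independently of the number of simulated processors.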
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs