Abstract
Mambo [4] is IBM's full-system simulator for PowerPC systems. It provides a complete set of simulation tools that help IBM and its partners with pre-hardware development and performance evaluation of future systems. Mambo currently simulates a target system on a single host thread, so per-core simulation performance drops as the number of cores in the target system grows. As the multi-core era approaches, both target and host systems will have more and more cores, and Mambo must simulate multi-core target systems efficiently on multi-core hosts. Parallelization is a natural way to speed up Mambo in this situation.
Parallel Mambo (P-Mambo) is a multi-threaded implementation of Mambo. Mambo's simulation engine is implemented as a user-level thread scheduler, and we propose a multi-scheduler method to adapt it to multi-threaded execution. Based on this method, we propose a core-based module partition that achieves both high inter-scheduler parallelism and low inter-scheduler dependency. Protecting shared resources is crucial to both the correctness and the performance of P-Mambo. Because P-Mambo has two tiers of threads (OS-level host threads and the user-level threads scheduled on them), protecting shared resources with OS-level locks alone can introduce deadlocks when a user-level context switch occurs. We propose a new lock mechanism to handle this problem. Because Mambo is an ongoing project with many modules under development, P-Mambo must also coexist with new modules. We propose a global-lock-based method to guarantee that P-Mambo remains compatible with future Mambo modules.
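The abstract does not describe the new lock mechanism itself. As a minimal sketch of the underlying problem only, the Python below models two tiers of threads: each host thread runs a round-robin scheduler over generator-based user-level tasks. The names `YieldingLock`, `scheduler`, `core_task`, and `run` are hypothetical and are not taken from Mambo. The lock shown here is one common way to avoid this class of deadlock (try to take the OS-level lock without blocking and, on failure, yield to the user-level scheduler); it is an assumption for illustration, not P-Mambo's actual design.

```python
import threading
from collections import deque


class YieldingLock:
    """Hypothetical lock that never blocks the host (OS) thread.

    On contention the caller yields to its user-level scheduler and
    retries later, so the task holding the lock can still be run.
    """

    def __init__(self):
        self._os_lock = threading.Lock()

    def acquire(self):
        # Generator: call as `yield from lock.acquire()` inside a task.
        while not self._os_lock.acquire(blocking=False):
            yield  # hand control back to the user-level scheduler

    def release(self):
        self._os_lock.release()


def scheduler(tasks):
    """Round-robin user-level scheduler running on one host thread."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until its next yield
            ready.append(task)  # still alive: put it back in the queue
        except StopIteration:
            pass                # task finished


def core_task(lock, counter, iterations):
    """User-level task that updates a shared counter under the lock."""
    for _ in range(iterations):
        yield from lock.acquire()
        value = counter["n"]
        yield  # user-level context switch while the lock is held
        counter["n"] = value + 1
        lock.release()
        yield


def run(num_schedulers=2, tasks_per_scheduler=2, iterations=50):
    """Start one host thread per scheduler and return the final count."""
    lock = YieldingLock()
    counter = {"n": 0}
    threads = []
    for _ in range(num_schedulers):
        tasks = [core_task(lock, counter, iterations)
                 for _ in range(tasks_per_scheduler)]
        threads.append(threading.Thread(target=scheduler, args=(tasks,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["n"]
```

If `acquire` instead made a blocking call on the OS-level lock, a task that found the lock held by a sibling task on the same host thread would block that host thread, the holder could never be rescheduled to release the lock, and the scheduler would deadlock. The sketch trades that risk for busy retrying, which a production design would likely want to bound.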
We have implemented the first version of P-Mambo in functional modes and evaluated its performance on the OpenMP implementation of the NAS Parallel Benchmarks (NPB) 3.2 [12]. Preliminary experimental results show that P-Mambo achieves an average speedup of 3.4 on a 4-core host machine.
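To put the reported figure in context, a speedup of 3.4 on 4 cores corresponds to a parallel efficiency of 3.4 / 4 = 85%. The short sketch below computes that value and, as a further assumption not made in the abstract, the serial fraction that Amdahl's law would imply for the same numbers (roughly 6%). The function names are hypothetical, and the serial fraction is a model-derived estimate, not a measurement reported by the authors.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Serial fraction f implied by Amdahl's law.

    Solves S = 1 / (f + (1 - f) / N) for f, which gives
    f = (N / S - 1) / (N - 1). Assumes N > 1.
    """
    return (cores / speedup - 1) / (cores - 1)


if __name__ == "__main__":
    # Figures from the abstract: average speedup 3.4 on a 4-core host.
    print(parallel_efficiency(3.4, 4))
    print(amdahl_serial_fraction(3.4, 4))
```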
References
- L. R. Bachega, J. R. Brunheroto, L. DeRose, P. Mindlin, and J. E. Moreira. The BlueGene/L Pseudo Cycle-accurate Simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2004.
- F. Bellard. QEMU, a Fast and Portable Dynamic Translator. USENIX 2005 Annual Technical Conference, FREENIX Track, 2005.
- N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. Computer Architecture News (CAN), September 2005.
- P. Bohrer, M. Elnozahy, A. Gheith, C. Lefurgy, T. Nakra, J. Peterson, R. Rajamony, R. Rockhold, H. Shafi, R. Simpson, E. Speight, K. Sudeep, E. V. Hensbergen, and L. Zhang. Mambo: A Full System Simulator for the PowerPC Architecture. ACM SIGMETRICS Performance Evaluation Review, 31(4):8–12, March 2004.
- D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Technical Report CS-TR-1996-1308, 1996.
- L. Ceze, K. Strauss, G. Almasi, P. J. Bohrer, J. R. Brunheroto, C. Cascaval, J. G. Castanos, D. Lieber, X. Martorell, J. E. Moreira, A. Sanomiya, and E. Schenfeld. Full Circle: Simulating Linux Clusters on Linux Clusters. In Proceedings of the Fourth LCI International Conference on Linux Clusters: The HPC Revolution 2003, June 2003.
- D. Chiou, D. Sunwoo, J. Kim, N. Patil, W. Reinhart, E. Johnson, J. Keefe, and H. Angepat. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.
- IBM Corporation. The PowerPC Architecture: A Specification for a New Family of Processors. Morgan Kaufmann Publishers, Inc., 1994.
- K. Ebcioglu and E. R. Altman. DAISY: Dynamic Compilation for 100% Architectural Compatibility. In Proceedings of 24th Annual International Symposium on Computer Architecture, pages 26–37, 1997.
- P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.
- N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis, and K. Olukotun. ATLAS: A Chip-Multiprocessor with Transactional Memory Support. In Proceedings of the Conference on Design Automation and Test in Europe (DATE), 2007.
- NPB. NAS Parallel Benchmarks. http://www.nas.nasa.gov/Resources/Software/npb.html
- M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology: Systems and Applications, 3(4):34–43, Winter 1995.
- H. Shafi, P. J. Bohrer, J. Phelan, C. A. Rusu, and J. L. Peterson. Design and Validation of a Performance and Power Simulator for PowerPC Systems. IBM Journal of Research and Development, 47(5–6):641–651, September 2003.
- The BlueGene/L Team. An Overview of the BlueGene/L Supercomputer. In Proceedings of ACM/IEEE Conference on Supercomputing, November 2002.
- S. Wee, J. Casper, N. Njoroge, Y. Teslyar, D. Ge, C. Kozyrakis, and K. Olukotun. A Practical FPGA-based Framework for Novel CMP Research.
- E. Witchel and M. Rosenblum. Embra: Fast and Flexible Machine Simulation. In Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1996.
Index Terms
Parallelization of IBM mambo system simulator in functional modes