Abstract
In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources, causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS-intensive server applications: separating application-level (user) computation from the OS calls it makes.

When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.
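As a rough illustration of the idea (not code from the paper), the contrast between the canonical per-thread core assignment and a CSP-style policy that routes fragments by type can be sketched as follows. The toy workload, the "user"/"os" fragment labels, and the half-and-half core partition are all hypothetical; the point is only that grouping similar fragments shrinks the total per-core code footprint:

```python
# Hypothetical sketch: canonical thread pinning vs. CSP-style fragment
# routing. Each fragment is (thread_id, kind, code_block); a core's
# "footprint" is the set of distinct code blocks it executes.
from collections import defaultdict

def canonical_assignment(fragments, num_cores):
    """Baseline: each thread is pinned to one core (thread_id mod num_cores),
    so every core executes its thread's OS code, replicating it."""
    footprint = defaultdict(set)
    for thread_id, kind, code_block in fragments:
        footprint[thread_id % num_cores].add(code_block)
    return footprint

def csp_assignment(fragments, num_cores):
    """CSP-style (illustrative policy): dedicate half the cores to OS
    fragments and half to user fragments, so similar computation from
    different threads lands on the same cores."""
    footprint = defaultdict(set)
    half = num_cores // 2
    for thread_id, kind, code_block in fragments:
        if kind == "os":
            core = thread_id % half           # OS cores: 0 .. half-1
        else:
            core = half + thread_id % half    # user cores: half .. num_cores-1
        footprint[core].add(code_block)
    return footprint

# Toy workload: 4 threads, each executing the same OS code path
# (mimicking a shared system-call path) plus its own user code.
fragments = []
for t in range(4):
    for block in ("syscall_entry", "fs_read", "syscall_exit"):
        fragments.append((t, "os", block))
    fragments.append((t, "user", f"user_fn_{t}"))

base = canonical_assignment(fragments, 4)
csp = csp_assignment(fragments, 4)

# Canonical: every core caches all three OS blocks (redundant replication).
# CSP: the OS blocks live only on the two OS cores.
total_base = sum(len(s) for s in base.values())
total_csp = sum(len(s) for s in csp.values())
print(total_base, total_csp)  # 16 vs. 10 distinct core/block pairs
```

Under the canonical assignment, the three shared OS blocks are replicated across all four private caches; under the CSP-style split they occupy only the OS cores, which is the footprint reduction the paper's measured L2 instruction-miss improvements stem from.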
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly. In ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems.