Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

Published: 20 October 2006

Abstract

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources, causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS-intensive server applications: separating application-level (user) computation from the OS calls it makes.

When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.
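The user/OS separation described above can be illustrated with a toy model. The sketch below is not the paper's mechanism or either of its two assignment policies, which the abstract does not specify. It assumes a hypothetical static split in which half of the cores run only user fragments and the other half run only OS fragments, and it uses the number of distinct code blocks touched per core as a rough stand-in for instruction footprint. All names (`make_fragments`, `baseline_core`, `split_core`, `footprint_per_core`) and the block counts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical code footprints: 10 user-level blocks and 10 OS blocks.
USER_BLOCKS = frozenset(f"u{i}" for i in range(10))
OS_BLOCKS = frozenset(f"k{i}" for i in range(10))


def make_fragments(num_threads, rounds):
    """Each thread alternates a user fragment and an OS call, `rounds` times."""
    frags = []
    for _ in range(rounds):
        for t in range(num_threads):
            frags.append((t, "user", USER_BLOCKS))
            frags.append((t, "os", OS_BLOCKS))
    return frags


def baseline_core(thread_id, kind, num_cores):
    """Conventional assignment: a thread runs all of its fragments on one core."""
    return thread_id % num_cores


def split_core(thread_id, kind, num_cores):
    """Assumed static split: lower half of the cores run user fragments,
    upper half run OS fragments. Requires an even num_cores >= 2."""
    half = num_cores // 2
    offset = 0 if kind == "user" else half
    return offset + thread_id % half


def footprint_per_core(frags, assign, num_cores):
    """Count the distinct code blocks each core touches under `assign`."""
    seen = defaultdict(set)
    for thread_id, kind, blocks in frags:
        seen[assign(thread_id, kind, num_cores)].update(blocks)
    return {core: len(blocks) for core, blocks in sorted(seen.items())}


frags = make_fragments(num_threads=4, rounds=3)
baseline = footprint_per_core(frags, baseline_core, num_cores=4)
split = footprint_per_core(frags, split_core, num_cores=4)
print("baseline:", baseline)
print("split:   ", split)
```

In this toy workload every core touches all 20 blocks under the conventional assignment, while under the assumed split each core touches only the 10 blocks of its own fragment type. The model ignores migration cost, data footprint, and load balance, so it shows only the direction of the effect, not its size.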

Published in

ACM SIGARCH Computer Architecture News, Volume 34, Issue 5
Proceedings of the 2006 ASPLOS Conference
December 2006
425 pages
ISSN: 0163-5964
DOI: 10.1145/1168919

ASPLOS XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems
October 2006
440 pages
ISBN: 1595934510
DOI: 10.1145/1168857

Copyright © 2006 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
