
Composing parallel software efficiently with Lithe

Published: 05 June 2010

Abstract

Applications composed of multiple parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. This paper presents the design and implementation of Lithe, a low-level substrate that provides the basic primitives and a standard interface for composing parallel codes efficiently. Lithe can be inserted underneath the runtimes of legacy parallel libraries to provide bolt-on composability without needing to change existing application code. Lithe can also serve as the foundation for building new parallel abstractions and libraries that automatically interoperate with one another.
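The oversubscription problem described above can be illustrated with a minimal sketch. It does not use Lithe or any real parallel library: `peak_inner_threads`, `outer_task`, and `inner_task` are hypothetical names, and the "libraries" are plain Python threads. The assumption is that an outer library starts one worker per core and each worker calls an inner library that, unaware of the outer one, also starts one worker per core.

```python
import threading


def peak_inner_threads(outer_workers: int, inner_workers: int) -> int:
    """Run a nested fork-join and return the peak number of live inner threads.

    Hypothetical illustration only: both "libraries" are plain Python threads.
    """
    total = outer_workers * inner_workers
    # The barrier releases only once every inner thread has arrived,
    # so all of them are guaranteed to be alive at the same moment.
    barrier = threading.Barrier(total)
    lock = threading.Lock()
    live = 0
    peak = 0

    def inner_task() -> None:
        nonlocal live, peak
        with lock:
            live += 1
            peak = max(peak, live)
        barrier.wait()
        with lock:
            live -= 1

    def outer_task() -> None:
        # The inner "library" obliviously creates its own full set of threads.
        threads = [threading.Thread(target=inner_task) for _ in range(inner_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    outers = [threading.Thread(target=outer_task) for _ in range(outer_workers)]
    for t in outers:
        t.start()
    for t in outers:
        t.join()
    return peak


if __name__ == "__main__":
    cores = 4  # assumed core count, for illustration
    print(peak_inner_threads(cores, cores), "inner threads competing for", cores, "cores")
```

With both levels sized to the core count, the number of runnable threads grows as the product of the two, which is the kind of interference a shared substrate for dividing up cores is meant to avoid.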

In this paper, we show that versions of Threading Building Blocks (TBB) and OpenMP ported to Lithe perform competitively with their original implementations. Furthermore, for two applications composed of multiple parallel libraries, we show that leveraging our substrate outperforms their original, even expertly tuned, implementations.

