research-article

SuperMalloc: a super fast multithreaded malloc for 64-bit machines

Published: 14 June 2015
Abstract

SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size. SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using $O(1)$ instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.
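The chunk-number lookup mentioned in the abstract can be illustrated with a short sketch. This is not SuperMalloc's code: the 48-bit address width, the one-byte size-class entry, and every identifier below are assumptions chosen for illustration. The idea is that an address shifted right by 21 bits (2 MiB chunks) indexes directly into a flat table. The table is large in virtual terms, but on a typical demand-paged system only the entries that are actually written get backed by physical memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: 2 MiB chunks and an assumed 48-bit address space. */
#define CHUNK_SHIFT  21
#define CHUNK_SIZE   ((uintptr_t)1 << CHUNK_SHIFT)
#define ADDRESS_BITS 48
#define CHUNK_COUNT  ((size_t)1 << (ADDRESS_BITS - CHUNK_SHIFT))

/* One byte of metadata per possible chunk (here, an assumed size-class id).
 * The array is zero-initialized static storage, so untouched parts of it
 * normally consume address space but no physical pages. */
static uint8_t chunk_size_class[CHUNK_COUNT];

/* Map any address inside a chunk to that chunk's number. */
size_t chunk_number(uintptr_t addr) {
    return (size_t)(addr >> CHUNK_SHIFT);
}

/* Record the size class of the chunk that starts at chunk_base. */
void set_chunk_size_class(uintptr_t chunk_base, uint8_t size_class) {
    assert(chunk_base % CHUNK_SIZE == 0);   /* chunks are 2 MiB aligned */
    size_t n = chunk_number(chunk_base);
    assert(n < CHUNK_COUNT);
    chunk_size_class[n] = size_class;
}

/* Look up the size class for an arbitrary object address: one shift
 * and one array load, with no search and no per-object header. */
uint8_t size_class_of(uintptr_t addr) {
    size_t n = chunk_number(addr);
    assert(n < CHUNK_COUNT);
    return chunk_size_class[n];
}
```

A caller would register a chunk once with `set_chunk_size_class` when the chunk is created, and then call `size_class_of` on any pointer passed to a free-like routine. A real allocator would store richer per-chunk metadata than a single byte.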

Published in

ACM SIGPLAN Notices, Volume 50, Issue 11 (ISMM '15), November 2015, 156 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2887746

ISMM '15: Proceedings of the 2015 International Symposium on Memory Management, June 2015, 156 pages
ISBN: 9781450335898
DOI: 10.1145/2754169

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
