Abstract
SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size. SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks, each of which contains objects that are all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2 cache's worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using $O(1)$ instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast and, for HTM, improves the odds that the transaction will commit.
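The $O(1)$ movement of many objects between a per-CPU cache and the global cache can be illustrated with a small sketch. This is not SuperMalloc's code: the class names (`Batch`, `GlobalCache`, `PerCpuCache`), the `batch_limit` parameter, and the use of Python integers as stand-ins for object addresses are all assumptions made for illustration. The idea shown is only that, if free objects are kept in linked lists, a whole list can change owners by handing over a single reference, so the cost does not depend on how many objects the list holds.

```python
class Node:
    """One free object. A real allocator would store the link inside the freed memory."""
    __slots__ = ("addr", "next")

    def __init__(self, addr, next=None):
        self.addr = addr
        self.next = next


class Batch:
    """A singly linked list of free objects of one size class (hypothetical name)."""

    def __init__(self):
        self.head = None
        self.count = 0

    def push(self, addr):
        self.head = Node(addr, self.head)
        self.count += 1

    def pop(self):
        node = self.head
        if node is None:
            return None
        self.head = node.next
        self.count -= 1
        return node.addr


class GlobalCache:
    """Holds whole batches; every transfer moves one reference, never single objects."""

    def __init__(self):
        self.batches = []

    def put_batch(self, batch):
        self.batches.append(batch)  # constant time, independent of batch.count

    def take_batch(self):
        return self.batches.pop() if self.batches else None


class PerCpuCache:
    """Per-CPU cache for one size class; batch_limit is an assumed tuning knob."""

    def __init__(self, batch_limit):
        self.batch_limit = batch_limit
        self.current = Batch()

    def free(self, addr, global_cache):
        self.current.push(addr)
        if self.current.count >= self.batch_limit:
            # Hand the entire list to the global cache in one step.
            global_cache.put_batch(self.current)
            self.current = Batch()

    def alloc(self, global_cache):
        if self.current.count == 0:
            refill = global_cache.take_batch()
            if refill is not None:
                self.current = refill  # adopt the entire list in one step
        return self.current.pop()
```

In this sketch, a cache that fills up on one CPU publishes its whole list at once, and an empty cache on another CPU adopts a whole list at once. A real allocator would additionally need locking or transactions around the shared structure, which the sketch omits.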
SuperMalloc: a super fast multithreaded malloc for 64-bit machines
ISMM '15: Proceedings of the 2015 International Symposium on Memory Management