You can read more on the design of _mimalloc_ in the [technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action)
specific details and further benchmarks we refer to the [technical report](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action).
Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (tc) used in Chrome,
[_jemalloc_](https://github.com/jemalloc/jemalloc) (je) by Jason Evans used in Firefox and FreeBSD,
[_snmalloc_](https://github.com/microsoft/snmalloc) (sn) by Liétar et al. \[8], [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) (rp) by Mattias Jansson at Rampant Pixels,
The _redis_ benchmark shows more differences between the allocators where
_mimalloc_ is 14\% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
not do well and are over 40\% slower.
The _larson_ server workload allocates and frees objects between
many threads. Larson and Krishnan \[2] observe this
behavior (which they call _bleeding_) in actual server applications, and the
benchmark simulates this.
Here, _mimalloc_ is more than 2.5× faster than _tcmalloc_ and _jemalloc_
due to the object migration between different threads. This is a difficult
benchmark for other allocators too where _mimalloc_ is still 48% faster than the next
fastest (_snmalloc_).
The second benchmark set tests specific aspects of the allocators and
shows even more extreme differences between them.
The _alloc-test_, by
[OLogN Technologies AG](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/), is a very allocation intensive benchmark doing millions of
allocations in various size classes. The test is scaled such that when an
allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it
means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
_Hoard_ seem to scale less well and do more than 10% worse on the
multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
more than 10% slower as _mimalloc_ here.
The _sh6bench_ and _sh8bench_ benchmarks are
developed by [MicroQuill](http://www.microquill.com/) as part of SmartHeap.
In _sh6bench__mimalloc_ does much
better than the others (more than 2× faster than _jemalloc_).
We cannot explain this well but believe it is
caused in part by the "reverse" free-ing pattern in _sh6bench_.
Again in _sh8bench_ the _mimalloc_ allocator handles object migration
between threads much better and is over 36% faster than the next best
allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
addition of object migration caused it to be almost 3 times slower
than before.
The _xmalloc-testN_ benchmark by Lever and Boreham \[5] and Christian Eder,
simulates an asymmetric workload where
some threads only allocate, and others only free. The _snmalloc_
allocator was especially developed to handle this case well as it
often occurs in concurrent message passing systems (like the [Pony] language
for which _snmalloc_ was initially developed). Here we see that
the _mimalloc_ technique of having non-contended sharded thread free
lists pays off as it even outperforms _snmalloc_ here.
Only _jemalloc_ also handles this reasonably well, while the
others underperform by a large margin.
The _cache-scratch_ benchmark by Emery Berger \[1], and introduced with the Hoard
allocator to test for _passive-false_ sharing of cache lines. With a single thread they all
perform the same, but when running with multiple threads the potential allocator
induced false sharing of the cache lines causes large run-time
differences, where _mimalloc_ is more than 18× faster than _jemalloc_ and
_tcmalloc_! Crundal \[6] describes in detail why the false cache line
sharing occurs in the _tcmalloc_ design, and also discusses how this
can be avoided with some small implementation changes.
Only _snmalloc_ and _tbb_ also avoid the
cache line sharing like _mimalloc_. Kukanov and Voss \[7] describe in detail
how the design of _tbb_ avoids the false cache line sharing.
- \[5] C. Lever, and D. Boreham. _Malloc() Performance in a Multithreaded Linux Environment._
In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000.
Available at <https://github.com/kuszmaul/SuperMalloc/tree/master/tests>
- \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc._
2016.<http://courses.cecs.anu.edu.au/courses/CSPROJECTS/16S1/Reports/Timothy*Crundal*Report.pdf>. CS16S1 project at the Australian National University.
- \[7] Alexey Kukanov, and Michael J Voss.
_The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks._
Intel Technology Journal 11 (4). 2007
- \[8] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson,
Alex Shamis, Christoph M Wintersteiger, and David Chisnall.
_Snmalloc: A Message Passing Allocator._
In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.