C++14 multiple-producer-multiple-consumer lock-free queues based on circular buffers and std::atomic.
Designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue.
It has been developed, tested and benchmarked on Linux. Yet, any C++14 platform implementing std::atomic is expected to compile the unit-tests and run them without failures just as well.
Continuous integrations running the unit-tests on GitHub are set up for x86_64 and arm64 platforms, Ubuntu-22.04, Ubuntu-24.04 and Windows. Pull requests to extend the continuous integrations to run the unit-tests on other architectures/platforms are most welcome.
When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify.
Minimizing latency naturally maximizes throughput. Low latency reciprocal is high throughput, in ideal mathematical and practical engineering sense. Low latency is incompatible with any delays and/or batching, which destroy original (hardware) global time order of events pushed into one queue by different threads. Maximizing throughput, on the other hand, can be done at expense of latency by delaying and batching multiple updates.
The main design principle these queues follow is minimalism, which results in such design choices as:
push/pop and keep no references/pointers to its function arguments after returning, and that no reference/pointer to elements in the queue ring-buffer can be obtained. Simplest to use, hard to misuse, best machine code due to no pointer aliasing possible.The impact of each of these small design choices on their own is barely measurable, but their total impact is much greater than a simple sum of the constituents’ impacts, aka super-scalar compounding or synergy. The synergy emerging from combining multiple of these small design choices together is what allows CPUs to perform at their peak capacities least impeded.
These design choices are also limitations:
Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the throughput and latency benchmarks.
Several other well established and popular thread-safe containers are used for reference in the benchmarks:
| Queue | Type | Description |
|---|---|---|
std::mutex |
MPMC | A fixed size ring-buffer with std::mutex. |
pthread_spinlock |
MPMC | A fixed size ring-buffer with pthread_spinlock_t. |
boost::lockfree::spsc_queue |
SPSC | A wait-free queue from Boost library. |
boost::lockfree::queue |
MPMC | A lock-free queue from Boost library. |
moodycamel::ConcurrentQueue |
quasi-MPMC | A lock-free queue used in non-blocking mode. Designed to maximize throughput at the expense of latency, eschewing global time order, by emulating a MPMC queue with a bunch of SPSC queues under the hood. It is not equivalent to other queues benchmarked here in this respect. |
moodycamel::ReaderWriterQueue |
SPSC | A lock-free queue used in non-blocking mode. |
xenium::michael_scott_queue |
MPMC | A lock-free queue proposed by Michael and Scott (similar to boost::lockfree::queue which is also based on the same proposal). |
xenium::ramalhete_queue |
MPMC | A lock-free queue proposed by Ramalhete and Correia. |
xenium::vyukov_bounded_queue |
MPMC | A bounded queue based on the version proposed by Vyukov. |
tbb::spin_mutex |
MPMC | A locked fixed size ring-buffer with tbb::spin_mutex from Intel Threading Building Blocks. |
tbb::concurrent_bounded_queue |
MPMC | Eponymous queue used in non-blocking mode from Intel Threading Building Blocks. |
The containers provided are header-only class templates, no building/installing is necessary.
git clone https://github.com/max0x7ba/atomic_queue.git
atomic_queue/include directory (use full path) to the include paths of your build system.#include <atomic_queue/atomic_queue.h> in your C++ source.If you use CMake, these can be simplified as follows:
add_subdirectory(atomic_queue)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)
You can also use CMake’s FetchContent.
include(FetchContent)
FetchContent_Declare(
atomic_queue
GIT_REPOSITORY https://github.com/max0x7ba/atomic_queue.git
GIT_TAG v1.9.1
)
FetchContent_MakeAvailable(atomic_queue)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)
vcpkg install atomic-queue
It provides CMake targets:
find_package(atomic_queue CONFIG REQUIRED)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)
Follow the official tutorial on how to consume conan packages. Details specific to this library are available in ConanCenter.
Building is necessary to run the unit-tests and benchmarks.
GNU Make Makefile is the original/reference build system this project has been developed, unit-tested and benchmarked with. It is feature-complete and most efficient, but the least portable to anything else than Linux. Linux tools, however, are the ultimate best for development, benchmarking, deep performance analyses and profiling at CPU instruction level.
CMake and Meson build systems provide the greatest portability and easy usage/consumption of the library, build and run the unit-tests on their supported platforms. They provide the best library user experience, as opposed to best library developer experience.
Building and running the unit-tests require Boost.Test library (e.g. libboost-test-dev on Debian/Ubuntu). Installing the complete set of Boost development libraries is the easiest (e.g. libboost-all-dev on Debian/Ubuntu).
git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue
Then:
# Build and run the unit-tests with gcc (default).
make -R -j$(($(nproc)/2)) BUILD=debug run_tests
# Build and run the unit-tests with gcc,gcc-14,clang,clang-20 in parallel.
make -R -j$(($(nproc)/2)) BUILD=debug TOOLSET=gcc,gcc-14,clang,clang-20 run_tests
Building and running the benchmarks require additional third-party libraries:
libboost-all-dev on Debian/Ubuntu).libtbb-dev on Debian/Ubuntu). When Intel TBB library is installed elsewhere, you may like to specify that location in cppflags.tbb and ldlibs.tbb in Makefile.git clone https://github.com/cameron314/concurrentqueue.git
git clone https://github.com/cameron314/readerwriterqueue.git
git clone https://github.com/mpoeter/xenium.git
git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue
Then:
make -R -j$(($(nproc)/2)) run_benchmarks_n # Build and run the benchmarks once.
make -R -j$(($(nproc)/2)) run_benchmarks_n N=3 # Build and run the benchmarks 3 times.
make -R -j$(($(nproc)/2)) TOOLSET=gcc-14 run_benchmarks_n # Build with gcc-14 and run the benchmarks.
make -R -j$(($(nproc)/2)) TOOLSET=clang-20 run_benchmarks_n # Build with clang-20 and run the benchmarks.
taskset --cpu-list 0,1,14,15 make -R -j$(($(nproc)/2)) TOOLSET=gcc-14 run_benchmarks_n # Use only cpus [0,1,14,15] to build with gcc-14 and run the benchmarks 3 times.
AtomicQueue - a fixed size ring-buffer for atomic elements.OptimistAtomicQueue - a faster fixed size ring-buffer for atomic elements which busy-waits when empty or full. It is AtomicQueue used with push/pop instead of try_push/try_pop.AtomicQueue2 - a fixed size ring-buffer for non-atomic elements.OptimistAtomicQueue2 - a faster fixed size ring-buffer for non-atomic elements which busy-waits when empty or full. It is AtomicQueue2 used with push/pop instead of try_push/try_pop.These containers maintain their ring-buffers as array data members with size specified at compile-time and have no pointer data members. That makes them position-independent, allows allocating them into process-shared memory with a plain C++ placement new statement, and mapping at arbitrary addresses in different processes using the same queue objects in shared memory. The queue elements must be position-independent too to support this particular use-case (unlike classes with process-position-dependent pointers such as std::unique_ptr, std::string and all the C++ standard containers with default allocators).
There are corresponding B variants (AtomicQueueB, OptimistAtomicQueueB, AtomicQueueB2, OptimistAtomicQueueB2) that use std::allocator or user-specified (stateful) allocator for allocating the ring-buffers, where the buffer size is specified as an argument to the constructor at run-time.
Totally ordered mode is supported. In this mode consumers receive messages in the same FIFO order the messages were posted. This mode is supported for push and pop functions, but not for the try_ versions. On Intel x86 the totally ordered mode has 0 cost, as of 2019.
Single-producer-single-consumer mode is supported. In this mode, no expensive atomic read-modify-write CPU instructions are necessary, only the cheapest atomic loads and stores. That improves queue throughput significantly.
Move-only queue element types are fully supported. For example, a queue of std::unique_ptr<T> elements would be AtomicQueueB2<std::unique_ptr<T>> or AtomicQueue2<std::unique_ptr<T>, CAPACITY>.
queue-end queue-front
[newest-element, ..., oldest-element]
push() pop()
The queue class templates provide the following member functions:
try_push - Appends an element to the end of the queue. Returns false when the queue is full.try_pop - Removes an element from the front of the queue. Returns false when the queue is empty.push (optimist) - Appends an element to the end of the queue. Busy waits when the queue is full. Faster than try_push when the queue is not full. Optional FIFO producer queuing and total order.pop (optimist) - Removes an element from the front of the queue. Busy waits when the queue is empty. Faster than try_pop when the queue is not empty. Optional FIFO consumer queuing and total order.was_size - Returns the number of unconsumed elements during the call. The state may have changed by the time the return value is examined.was_empty - Returns true if the container was empty during the call. The state may have changed by the time the return value is examined.was_full - Returns true if the container was full during the call. The state may have changed by the time the return value is examined.capacity - Returns the maximum number of elements the queue can possibly hold.Atomic elements are those, for which std::atomic<T>{T{}}.is_lock_free() returns true, and, when C++17 features are available, std::atomic<T>::is_always_lock_free evaluates to true at compile time. In other words, the CPU can load, store and compare-and-exchange such elements atomically natively. On x86-64 such elements are all the C++ standard arithmetic and pointer types.
The queues for atomic elements reserve one value to serve as an empty element marker NIL, its default value is 0. NIL value must not be pushed into a queue and there is an assert statement in push functions to guard against that in debug mode builds. Pushing NIL element into a queue in release mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.
Note that optimism is a choice of a queue modification operation control flow, rather than a queue type. An optimist push is fastest when the queue is not full most of the time, an optimistic pop - when the queue is not empty most of the time. Optimistic and not so operations can be mixed with no restrictions. The OptimistAtomicQueues in the benchmarks use only optimist push and pop.
See example.cc for a usage example.
push and try_push operations synchronize-with (as defined in std::memory_order) with any subsequent pop or try_pop operation of the same queue object. Meaning that:
push/try_push, which is a memory_order::release operation. Same memory order as that of std::mutex::unlock.pop/try_pop, which is a memory_order::acquire operation. Same memory order as that of std::mutex::lock.push/try_push of an element into a queue become visible in the consumer’s thread which pop/try_pop that particular element.The available queues here use a ring-buffer array for storing elements. The capacity of the queue is fixed at compile time or construction time.
In a production multiple-producer-multiple-consumer scenario the ring-buffer capacity should be set to the maximum expected queue size. When the ring-buffer gets full it means that the consumers cannot consume the elements fast enough. A fix for that is any of:
push and pop calls always incur some expensive CPU cycles to maintain the integrity of queue state in atomic/consistent/isolated fashion with respect to other threads and these costs increase super-linearly as queue contention grows. Producer batching of multiple small elements or elements resulting from one event into one queue message is often a reasonable solution.Using a power-of-2 ring-buffer array size allows a couple of important optimizations:
% SIZE. Remainder binary operator % normally generates a division CPU instruction which isn’t cheap, but using a power-of-2 size turns that remainder operator into one cheap binary and CPU instruction and that is as fast as it gets.N producers together with M consumers competing on subsequent elements in the same ring-buffer cache line in the worst case, it is only one producer competing with one consumer (pedantically, when the number of CPUs is not greater than the number of elements that can fit in one cache line). This optimisation scales better with the number of producers and consumers, and element size. With low number of producers and consumers (up to about 2 of each in these benchmarks) disabling this optimisation may yield better throughput (but higher variance across runs).The containers use unsigned type for size and internal indexes. On x86-64 platform unsigned is 32-bit wide, whereas size_t is 64-bit wide. 64-bit instructions utilise an extra byte instruction prefix resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit unsigned indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.
While the atomic queues can be used with any moveable element types (including std::unique_ptr), for best throughput and latency the queue elements should be cheap to copy and lock-free (e.g. unsigned or T*), so that push and pop operations complete fastest.
Conceptually, a push or pop operation does two atomic steps:
head index, consumers incrementing tail index. Each slot is accessed by one producer and one consumer threads only.NIL, consumer loading from a slot changes its state to be NIL. The slot is a spinlock for its one producer and one consumer threads.These queues anticipate that a thread doing push or pop may complete step 1 and then be preempted before completing step 2.
When a thread completes step 1 and terminates (for any reason) without completing step 2, the queue slot remains locked and deadlocks the next thread attempting to try_pop/try_push/pop/push from/into that slot. A thread can be terminated by the OS (e.g., oomkiller), or throw/crash in the user-defined copy/move constructor/assignment of queue element (if any). Should that happen, the game is over, and the best course of action is to terminate the process as soon as possible and address the root cause of one’s threads crashing.
Once constructed/allocated, queue objects maintain their invariants and never throw exceptions, provided that:
push/pop.push/pop is not great, not terrible. Real-time FIFO threads with real-time thread throttling disabled are required for best results.An algorithm is lock-free if there is guaranteed system-wide progress. These queues guarantee system-wide progress by the following properties:
push is independent of any preceding push. An incomplete (preempted) push by one producer thread doesn’t affect push of any other thread.pop is independent of any preceding pop. An incomplete (preempted) pop by one consumer thread doesn’t affect pop of any other thread.push from one producer thread affects only one consumer thread poping an element from this particular queue slot. All other threads pops are unaffected.pop from one consumer thread affects only one producer thread pushing an element into this particular queue slot while expecting it to have been consumed long time ago, in the rather unlikely scenario that producers have wrapped around the entire ring-buffer while this consumer hasn’t completed its pop. All other threads pushs and pops are unaffected.Linux task scheduler thread preemption is something no user-space process should be able to affect or escape, otherwise any/every malicious application would exploit that.
Still, there are a few things one can do to minimize preemption of one’s mission critical application threads:
SCHED_FIFO scheduling class for your threads, e.g. chrt --fifo 50 <app>. A higher priority SCHED_FIFO thread or kernel interrupt handler can still preempt your SCHED_FIFO threads.SCHED_OTHER with its dynamically adjusted priorities defeats the purpose of using these queues.SCHED_FIFO real-time threads from being throttled.taskset.People often propose limiting busy-waiting with a subsequent call to std::this_thread::yield()/sched_yield/pthread_yield. However, sched_yield is a wrong tool for locking because it doesn’t communicate to the OS kernel what the thread is waiting for, so that the OS thread scheduler can never schedule the calling thread to resume at the right time when the shared state has changed (unless there are no other threads that can run on this CPU core, so that the caller resumes immediately). See notes section in man sched_yield and a Linux kernel thread about sched_yield and spinlocks for more details.
In Linux, there is mutex type PTHREAD_MUTEX_ADAPTIVE_NP which busy-waits a locked mutex for a number of iterations and then makes a blocking syscall into the kernel to deschedule the waiting thread. In the benchmarks it was the worst performer and I couldn’t find a way to make it perform better, and that’s the reason it is not included in the benchmarks.
C++20 introduced blocking std::atomic::wait which uses Linux futex for atomic compare-and-block operation, similar to PTHREAD_MUTEX_ADAPTIVE_NP, with hard-coded spin-count limits. And thread_yield calls, which Linus Torvalds above explains is only applicable for real-time threads with a single run-queue for them. A queue implementation with std::atomic::wait is due to be benchmarked, with its performance expected to be similar to that of PTHREAD_MUTEX_ADAPTIVE_NP, but I’d love to be pleasantly surprised.
On Intel CPUs one could use the 4 debug control registers to monitor the spinlock memory region for write access and wait on it using select (and its friends) or sigwait (see perf_event_open and uapi/linux/hw_breakpoint.h for more details). A spinlock waiter could suspend itself with select or sigwait until the spinlock state has been updated. But there are only 4 of these registers, so that such a solution wouldn’t scale.
Using huge pages improves performance of memory intensive applications dramatically. The benchmark tries allocating 1GB or 2MB huge pages first, and falls back to allocating the default tiny pages (4kB on x86_64). The benchmark results are reproducible only when it succeeds allocating one 1GB huge page.
Using smaller pages cripple CPU performance with TLB cache misses.
View latency and throughput benchmarks charts.
Two threads ping-pong one 4-byte integer counter via two queue objects. The counter is decremented and sent as a reply, until reaching 0. There are 2 queues and 2 threads, the counter is initialized with 1,000,000 (each of the two threads pushes/pops 500,000 messages).
Contention is minimal here: each queue has 1 producer 1 consumer, up to 1 element in the queue – the best case ideal scenario for a queue to demonstrate its lowest possible latency. The dependency chains on popped counter exist to prevent CPUs from pipe-lining out-of-order execution in order to measure the true round-trip latency.
This benchmark measures the total time taken for the 2 threads to exchange the 1,000,000 messages. A benchmark run reports the best sec/round-trip latency (time taken to push a message and pop its reply) out of 3 tries for each queue. The charts report mean, stdev, min and max of sec/round-trip latency across 33 benchmark runs.
N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue.
With SMT threads, the benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of different queues. Without using SMT threads (cross-core communication only) – up to (total-number-of-cpus / 4) producers/consumers.
There are no dependency chains on messages in producer and consumer threads in this benchmark in order to let the queues demonstrate their highest possible throughputs. The reported throughputs are higher than the inverse of latencies because of no dependency chains stalling CPU pipe-lining and out-of-order execution (as intended).
This benchmark measures the total time taken to send and receive a total of 1,000,000 messages through one queue. A benchmark run reports the best msg/sec throughput out of 3 tries for each queue. The charts report mean, stdev, min and max of msg/sec throughput across 33 benchmark runs.
moodycamel::ConcurrentQueue is the notable exception here, with its 1-producer-1-consumer throughput being the worst, yet scaling up in almost perfect linear fashion when the benchmark adds another pair of producer-consumer threads. It tries to emulate a MPMC queue with a bunch of SPSC queues under the hood. moodycamel::ConcurrentQueue is not a drop-in replacement for the other true MPMC queues benchmarked here. Whereas all the other MPMC queues are drop-in replacements for another (interface adaptors may be required), including moodycamel::ConcurrentQueue too.
There are a few OS behaviours that complicate benchmarking, make benchmark runs irreproducible, and introduce otherwise unnecessary timing noise into benchmark timings:
SCHED_FIFO priority 50 is used to make benchmark threads non-preemptable by lower priority processes/threads.SCHED_FIFO threads every second to protect against the risk of real-time threads hogging all CPUs and making the system unavailable for anything else. Disabled during benchmarks runs2.To further minimise the noise in timings:
benchmarks executable runs the same workload 10 times with each queue and reports the shortest (best) time measured. Everyone needs a break every little while and benchmarks executable gives each queue 10 breaks every few fractions of a second.{mean, stdev, min, max} descriptive statistics of the best throughput msg/sec and latency sec/round-trip measurements obtained from at least 33 runs of benchmarks executable. (33 runs of benchmarks take ~1h30s on Ryzen 5950X).The goal of the benchmarks is to time things as accurately as possible. For the purpose of viewing the reality clearly without any distortions of perception, prejudice or judgement.
When huge pages are available the benchmarks use 1x1GB or 16x2MB huge pages for the queues to minimise TLB misses. To enable huge pages do one of:
sudo hugeadm --pool-pages-min 1GB:1
sudo hugeadm --pool-pages-min 2MB:32
Alternatively, you may like to enable transparent hugepages in your system and/or use a hugepage-aware allocator. thp-usage provides more information.
By default, Linux scheduler throttles real-time threads from consuming 100% of CPU and that is detrimental to benchmarking. Full details can be found in Real-Time group scheduling. To disable real-time thread throttling do:
echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us >/dev/null
Some books on the subject of multi-threaded programming I found quite instructive:
Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file LICENSE.
“Completely Fair Scheduler” was never a good idea nor a right solution for anything: make -j freezes a Linux system because the process scheduler strives to allocate the CPU/memory resources to all processes equally fairly. The scheduler automatic task group creation Linux feature, now enabled by default, breaks nice and creates more problem trying to put a band-aid on the ill-conceived “Completely Fair Scheduler” idea. Desirable robust process scheduling is the opposite of fair. Solaris implemented the desirable robust process scheduling in 2000s, make -j never destabilised Solaris. make -j was the right way to build fast on any Unix. Switching to Linux, people got most surprised with make -j completely freezing the highest-end Linux systems in 2009, and had to learn that -j option takes an integer argument (my experience). GNU Make implemented -l band-aid option to work-around this Linux-only CPU scheduler problem, and -l is worthless for 99% of use-cases (people still hope it solves 1% of use-cases, which have been elusive, so far). And here we are now, with fundamentally completely broken “Completely Fair Scheduler”, with its scheduler automatic task group creation band-aid completely breaking nice. And a worthless make -l workaround for make -j completely destabilising Linux with its “Completely Fair Scheduler”. While the deluge of Linux security mitigations severely crippling CPU performance for protection against threat vectors non-existent for 99% of Linux users. ↩
A security feature that cripples CPU performance to protect against someone else’s threat vectors. Always disabled in my workstations. ↩ ↩2 ↩3
Always enabled in my workstations for best performance in memory/compute-intensive workloads, such as linear algebra computations with numpy and Pandas, training ANNs on GPUs with PyTorch. ↩ ↩2