atomic_queue

C++14 MIT license Latest release Conan Center Vcpkg Version
Makefile Continuous Integrations CMake Continuous Integrations Meson Continuous Integrations
platform Linux x86_64 platform Linux ARM platform Linux RISC-V platform Linux PowerPC platform Linux IBM System/390 platform Linux LoongArch platform Windows x86_64

atomic_queue

C++14 multiple-producer-multiple-consumer lock-free queues based on circular buffers and std::atomic.

Designed with a goal to minimize the latency between one thread pushing an element into a queue and another thread popping it from the queue.

It has been developed, tested and benchmarked on Linux. Yet, any C++14 platform implementing std::atomic is expected to compile the unit-tests and run them without failures just as well.

Continuous integrations running the unit-tests on GitHub are set up for x86_64 and arm64 platforms, Ubuntu-22.04, Ubuntu-24.04 and Windows. Pull requests to extend the continuous integrations to run the unit-tests on other architectures/platforms are most welcome.

Design Principles

When minimizing latency a good design is not when there is nothing left to add, but rather when there is nothing left to remove, as these queues exemplify.

Minimizing latency naturally maximizes throughput. Low latency reciprocal is high throughput, in ideal mathematical and practical engineering sense. Low latency is incompatible with any delays and/or batching, which destroy original (hardware) global time order of events pushed into one queue by different threads. Maximizing throughput, on the other hand, can be done at expense of latency by delaying and batching multiple updates.

The main design principle these queues follow is minimalism, which results in such design choices as:

The impact of each of these small design choices on their own is barely measurable, but their total impact is much greater than a simple sum of the constituents’ impacts, aka super-scalar compounding or synergy. The synergy emerging from combining multiple of these small design choices together is what allows CPUs to perform at their peak capacities least impeded.

These design choices are also limitations:

Ultra-low-latency applications need just that and nothing more. The minimalism pays off, see the throughput and latency benchmarks.

Role Models

Several other well established and popular thread-safe containers are used for reference in the benchmarks:

Queue Type Description
std::mutex MPMC A fixed size ring-buffer with std::mutex.
pthread_spinlock MPMC A fixed size ring-buffer with pthread_spinlock_t.
boost::lockfree::spsc_queue SPSC A wait-free queue from Boost library.
boost::lockfree::queue MPMC A lock-free queue from Boost library.
moodycamel::ConcurrentQueue quasi-MPMC A lock-free queue used in non-blocking mode. Designed to maximize throughput at the expense of latency, eschewing global time order, by emulating a MPMC queue with a bunch of SPSC queues under the hood. It is not equivalent to other queues benchmarked here in this respect.
moodycamel::ReaderWriterQueue SPSC A lock-free queue used in non-blocking mode.
xenium::michael_scott_queue MPMC A lock-free queue proposed by Michael and Scott (similar to boost::lockfree::queue which is also based on the same proposal).
xenium::ramalhete_queue MPMC A lock-free queue proposed by Ramalhete and Correia.
xenium::vyukov_bounded_queue MPMC A bounded queue based on the version proposed by Vyukov.
tbb::spin_mutex MPMC A locked fixed size ring-buffer with tbb::spin_mutex from Intel Threading Building Blocks.
tbb::concurrent_bounded_queue MPMC Eponymous queue used in non-blocking mode from Intel Threading Building Blocks.

Using the library

The containers provided are header-only class templates, no building/installing is necessary.

Installing

From GitHub

  1. Clone the project:
    git clone https://github.com/max0x7ba/atomic_queue.git
    
  2. Add atomic_queue/include directory (use full path) to the include paths of your build system.
  3. #include <atomic_queue/atomic_queue.h> in your C++ source.

If you use CMake, these can be simplified as follows:

add_subdirectory(atomic_queue)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)

From GitHub using cmake FetchContent

You can also use CMake’s FetchContent.

include(FetchContent)
FetchContent_Declare(
        atomic_queue
        GIT_REPOSITORY https://github.com/max0x7ba/atomic_queue.git
        GIT_TAG v1.9.1
)
FetchContent_MakeAvailable(atomic_queue)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)

From vcpkg

vcpkg install atomic-queue

It provides CMake targets:

find_package(atomic_queue CONFIG REQUIRED)
target_link_libraries(main PRIVATE atomic_queue::atomic_queue)

Install using conan

Follow the official tutorial on how to consume conan packages. Details specific to this library are available in ConanCenter.

Build and run

Building is necessary to run the unit-tests and benchmarks.

GNU Make Makefile is the original/reference build system this project has been developed, unit-tested and benchmarked with. It is feature-complete and most efficient, but the least portable to anything else than Linux. Linux tools, however, are the ultimate best for development, benchmarking, deep performance analyses and profiling at CPU instruction level.

CMake and Meson build systems provide the greatest portability and easy usage/consumption of the library, build and run the unit-tests on their supported platforms. They provide the best library user experience, as opposed to best library developer experience.

Build and run unit-tests

Building and running the unit-tests require Boost.Test library (e.g. libboost-test-dev on Debian/Ubuntu). Installing the complete set of Boost development libraries is the easiest (e.g. libboost-all-dev on Debian/Ubuntu).

git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue

Then:

# Build and run the unit-tests with gcc (default).
make -R -j$(($(nproc)/2)) BUILD=debug run_tests

# Build and run the unit-tests with gcc,gcc-14,clang,clang-20 in parallel.
make -R -j$(($(nproc)/2)) BUILD=debug TOOLSET=gcc,gcc-14,clang,clang-20 run_tests

Build and run benchmarks

Building and running the benchmarks require additional third-party libraries:

git clone https://github.com/cameron314/concurrentqueue.git
git clone https://github.com/cameron314/readerwriterqueue.git
git clone https://github.com/mpoeter/xenium.git

git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue

Then:

make -R -j$(($(nproc)/2)) run_benchmarks_n       # Build and run the benchmarks once.
make -R -j$(($(nproc)/2)) run_benchmarks_n N=3   # Build and run the benchmarks 3 times.

make -R -j$(($(nproc)/2)) TOOLSET=gcc-14 run_benchmarks_n    # Build with gcc-14 and run the benchmarks.
make -R -j$(($(nproc)/2)) TOOLSET=clang-20 run_benchmarks_n  # Build with clang-20 and run the benchmarks.

taskset --cpu-list 0,1,14,15 make -R -j$(($(nproc)/2)) TOOLSET=gcc-14 run_benchmarks_n  # Use only cpus [0,1,14,15] to build with gcc-14 and run the benchmarks 3 times.

Library contents

Available queues

These containers maintain their ring-buffers as array data members with size specified at compile-time and have no pointer data members. That makes them position-independent, allows allocating them into process-shared memory with a plain C++ placement new statement, and mapping at arbitrary addresses in different processes using the same queue objects in shared memory. The queue elements must be position-independent too to support this particular use-case (unlike classes with process-position-dependent pointers such as std::unique_ptr, std::string and all the C++ standard containers with default allocators).

There are corresponding B variants (AtomicQueueB, OptimistAtomicQueueB, AtomicQueueB2, OptimistAtomicQueueB2) that use std::allocator or user-specified (stateful) allocator for allocating the ring-buffers, where the buffer size is specified as an argument to the constructor at run-time.

Totally ordered mode is supported. In this mode consumers receive messages in the same FIFO order the messages were posted. This mode is supported for push and pop functions, but not for the try_ versions. On Intel x86 the totally ordered mode has 0 cost, as of 2019.

Single-producer-single-consumer mode is supported. In this mode, no expensive atomic read-modify-write CPU instructions are necessary, only the cheapest atomic loads and stores. That improves queue throughput significantly.

Move-only queue element types are fully supported. For example, a queue of std::unique_ptr<T> elements would be AtomicQueueB2<std::unique_ptr<T>> or AtomicQueue2<std::unique_ptr<T>, CAPACITY>.

Queue schematics

queue-end                 queue-front
[newest-element, ..., oldest-element]
push()                          pop()

Queue API

The queue class templates provide the following member functions:

Atomic elements are those, for which std::atomic<T>{T{}}.is_lock_free() returns true, and, when C++17 features are available, std::atomic<T>::is_always_lock_free evaluates to true at compile time. In other words, the CPU can load, store and compare-and-exchange such elements atomically natively. On x86-64 such elements are all the C++ standard arithmetic and pointer types.

The queues for atomic elements reserve one value to serve as an empty element marker NIL, its default value is 0. NIL value must not be pushed into a queue and there is an assert statement in push functions to guard against that in debug mode builds. Pushing NIL element into a queue in release mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.

Note that optimism is a choice of a queue modification operation control flow, rather than a queue type. An optimist push is fastest when the queue is not full most of the time, an optimistic pop - when the queue is not empty most of the time. Optimistic and not so operations can be mixed with no restrictions. The OptimistAtomicQueues in the benchmarks use only optimist push and pop.

See example.cc for a usage example.

Implementation Notes

Memory order of non-atomic loads and stores

push and try_push operations synchronize-with (as defined in std::memory_order) with any subsequent pop or try_pop operation of the same queue object. Meaning that:

Ring-buffer capacity

The available queues here use a ring-buffer array for storing elements. The capacity of the queue is fixed at compile time or construction time.

In a production multiple-producer-multiple-consumer scenario the ring-buffer capacity should be set to the maximum expected queue size. When the ring-buffer gets full it means that the consumers cannot consume the elements fast enough. A fix for that is any of:

Using a power-of-2 ring-buffer array size allows a couple of important optimizations:

The containers use unsigned type for size and internal indexes. On x86-64 platform unsigned is 32-bit wide, whereas size_t is 64-bit wide. 64-bit instructions utilise an extra byte instruction prefix resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit unsigned indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.

While the atomic queues can be used with any moveable element types (including std::unique_ptr), for best throughput and latency the queue elements should be cheap to copy and lock-free (e.g. unsigned or T*), so that push and pop operations complete fastest.

Lock-free guarantees

Conceptually, a push or pop operation does two atomic steps:

  1. Atomically and exclusively claims the queue slot index to store/load an element to/from. That’s producers incrementing head index, consumers incrementing tail index. Each slot is accessed by one producer and one consumer threads only.
  2. Atomically store/load the element into/from the slot. Producer storing into a slot changes its state to be non-NIL, consumer loading from a slot changes its state to be NIL. The slot is a spinlock for its one producer and one consumer threads.

These queues anticipate that a thread doing push or pop may complete step 1 and then be preempted before completing step 2.

When a thread completes step 1 and terminates (for any reason) without completing step 2, the queue slot remains locked and deadlocks the next thread attempting to try_pop/try_push/pop/push from/into that slot. A thread can be terminated by the OS (e.g., oomkiller), or throw/crash in the user-defined copy/move constructor/assignment of queue element (if any). Should that happen, the game is over, and the best course of action is to terminate the process as soon as possible and address the root cause of one’s threads crashing.

Once constructed/allocated, queue objects maintain their invariants and never throw exceptions, provided that:

An algorithm is lock-free if there is guaranteed system-wide progress. These queues guarantee system-wide progress by the following properties:

Preemption

Linux task scheduler thread preemption is something no user-space process should be able to affect or escape, otherwise any/every malicious application would exploit that.

Still, there are a few things one can do to minimize preemption of one’s mission critical application threads:

People often propose limiting busy-waiting with a subsequent call to std::this_thread::yield()/sched_yield/pthread_yield. However, sched_yield is a wrong tool for locking because it doesn’t communicate to the OS kernel what the thread is waiting for, so that the OS thread scheduler can never schedule the calling thread to resume at the right time when the shared state has changed (unless there are no other threads that can run on this CPU core, so that the caller resumes immediately). See notes section in man sched_yield and a Linux kernel thread about sched_yield and spinlocks for more details.

In Linux, there is mutex type PTHREAD_MUTEX_ADAPTIVE_NP which busy-waits a locked mutex for a number of iterations and then makes a blocking syscall into the kernel to deschedule the waiting thread. In the benchmarks it was the worst performer and I couldn’t find a way to make it perform better, and that’s the reason it is not included in the benchmarks.

C++20 introduced blocking std::atomic::wait which uses Linux futex for atomic compare-and-block operation, similar to PTHREAD_MUTEX_ADAPTIVE_NP, with hard-coded spin-count limits. And thread_yield calls, which Linus Torvalds above explains is only applicable for real-time threads with a single run-queue for them. A queue implementation with std::atomic::wait is due to be benchmarked, with its performance expected to be similar to that of PTHREAD_MUTEX_ADAPTIVE_NP, but I’d love to be pleasantly surprised.

On Intel CPUs one could use the 4 debug control registers to monitor the spinlock memory region for write access and wait on it using select (and its friends) or sigwait (see perf_event_open and uapi/linux/hw_breakpoint.h for more details). A spinlock waiter could suspend itself with select or sigwait until the spinlock state has been updated. But there are only 4 of these registers, so that such a solution wouldn’t scale.

Huge pages

Using huge pages improves performance of memory intensive applications dramatically. The benchmark tries allocating 1GB or 2MB huge pages first, and falls back to allocating the default tiny pages (4kB on x86_64). The benchmark results are reproducible only when it succeeds allocating one 1GB huge page.

Using smaller pages cripple CPU performance with TLB cache misses.

Benchmarks

View latency and throughput benchmarks charts.

Latency benchmark

Two threads ping-pong one 4-byte integer counter via two queue objects. The counter is decremented and sent as a reply, until reaching 0. There are 2 queues and 2 threads, the counter is initialized with 1,000,000 (each of the two threads pushes/pops 500,000 messages).

Contention is minimal here: each queue has 1 producer 1 consumer, up to 1 element in the queue – the best case ideal scenario for a queue to demonstrate its lowest possible latency. The dependency chains on popped counter exist to prevent CPUs from pipe-lining out-of-order execution in order to measure the true round-trip latency.

This benchmark measures the total time taken for the 2 threads to exchange the 1,000,000 messages. A benchmark run reports the best sec/round-trip latency (time taken to push a message and pop its reply) out of 3 tries for each queue. The charts report mean, stdev, min and max of sec/round-trip latency across 33 benchmark runs.

Throughput and scalability benchmark

N producer threads push a 4-byte integer into one same queue, N consumer threads pop the integers from the queue.

With SMT threads, the benchmark is run for from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers/consumers to measure the scalability of different queues. Without using SMT threads (cross-core communication only) – up to (total-number-of-cpus / 4) producers/consumers.

There are no dependency chains on messages in producer and consumer threads in this benchmark in order to let the queues demonstrate their highest possible throughputs. The reported throughputs are higher than the inverse of latencies because of no dependency chains stalling CPU pipe-lining and out-of-order execution (as intended).

This benchmark measures the total time taken to send and receive a total of 1,000,000 messages through one queue. A benchmark run reports the best msg/sec throughput out of 3 tries for each queue. The charts report mean, stdev, min and max of msg/sec throughput across 33 benchmark runs.

Results Notes

Notable exceptions

moodycamel::ConcurrentQueue is the notable exception here, with its 1-producer-1-consumer throughput being the worst, yet scaling up in almost perfect linear fashion when the benchmark adds another pair of producer-consumer threads. It tries to emulate a MPMC queue with a bunch of SPSC queues under the hood. moodycamel::ConcurrentQueue is not a drop-in replacement for the other true MPMC queues benchmarked here. Whereas all the other MPMC queues are drop-in replacements for another (interface adaptors may be required), including moodycamel::ConcurrentQueue too.

Methodology

There are a few OS behaviours that complicate benchmarking, make benchmark runs irreproducible, and introduce otherwise unnecessary timing noise into benchmark timings:

To further minimise the noise in timings:

The goal of the benchmarks is to time things as accurately as possible. For the purpose of viewing the reality clearly without any distortions of perception, prejudice or judgement.

Huge pages

When huge pages are available the benchmarks use 1x1GB or 16x2MB huge pages for the queues to minimise TLB misses. To enable huge pages do one of:

sudo hugeadm --pool-pages-min 1GB:1
sudo hugeadm --pool-pages-min 2MB:32

Alternatively, you may like to enable transparent hugepages in your system and/or use a hugepage-aware allocator. thp-usage provides more information.

Real-time thread throttling

By default, Linux scheduler throttles real-time threads from consuming 100% of CPU and that is detrimental to benchmarking. Full details can be found in Real-Time group scheduling. To disable real-time thread throttling do:

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us >/dev/null

Reading material

Some books on the subject of multi-threaded programming I found quite instructive:


Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file LICENSE.

  1. “Completely Fair Scheduler” was never a good idea nor a right solution for anything: make -j freezes a Linux system because the process scheduler strives to allocate the CPU/memory resources to all processes equally fairly. The scheduler automatic task group creation Linux feature, now enabled by default, breaks nice and creates more problem trying to put a band-aid on the ill-conceived “Completely Fair Scheduler” idea. Desirable robust process scheduling is the opposite of fair. Solaris implemented the desirable robust process scheduling in 2000s, make -j never destabilised Solaris. make -j was the right way to build fast on any Unix. Switching to Linux, people got most surprised with make -j completely freezing the highest-end Linux systems in 2009, and had to learn that -j option takes an integer argument (my experience). GNU Make implemented -l band-aid option to work-around this Linux-only CPU scheduler problem, and -l is worthless for 99% of use-cases (people still hope it solves 1% of use-cases, which have been elusive, so far). And here we are now, with fundamentally completely broken “Completely Fair Scheduler”, with its scheduler automatic task group creation band-aid completely breaking nice. And a worthless make -l workaround for make -j completely destabilising Linux with its “Completely Fair Scheduler”. While the deluge of Linux security mitigations severely crippling CPU performance for protection against threat vectors non-existent for 99% of Linux users. 

  2. A security feature that cripples CPU performance to protect against someone else’s threat vectors. Always disabled in my workstations.  2 3

  3. Always enabled in my workstations for best performance in memory/compute-intensive workloads, such as linear algebra computations with numpy and Pandas, training ANNs on GPUs with PyTorch 2