Two threads ping-pong a single 4-byte integer counter through two queue objects. Each thread pops the counter, decrements it and sends it back as a reply, until the counter reaches 0. There are 2 queues and 2 threads, and the counter is initialized to 1,000,000, so each of the two threads pushes and pops 500,000 messages.
Contention is minimal here: each queue has 1 producer and 1 consumer and holds at most 1 element at any time, which is the best-case scenario for a queue to demonstrate its lowest possible latency. The dependency chain on the popped counter exists to prevent CPU pipelining and out-of-order execution from overlapping successive round-trips, so that the benchmark measures the true round-trip latency.
This benchmark measures the total time taken for the 2 threads to exchange the 1,000,000 messages. The charts report mean, stdev, min and max of sec/round-trip latency across 33 benchmark runs.
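The ping-pong exchange described above can be sketched roughly as follows. This is a minimal illustration, not the project's benchmark code: `OneSlotQueue`, `run_ping_pong` and `seconds_per_round_trip` are hypothetical names, the single-slot queue is a stand-in for the queues under test, the waiting loops yield instead of busy-spinning, and counting one round-trip as two messages is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical single-slot queue for 1 producer and 1 consumer.
// The value 0 marks an empty slot, so only non-zero values can be pushed.
class OneSlotQueue {
    std::atomic<std::uint32_t> slot_{0};
public:
    void push(std::uint32_t value) {
        assert(value != 0);
        while (slot_.load(std::memory_order_acquire) != 0)
            std::this_thread::yield(); // wait until the slot is empty
        slot_.store(value, std::memory_order_release);
    }
    std::uint32_t pop() {
        std::uint32_t value;
        while ((value = slot_.load(std::memory_order_acquire)) == 0)
            std::this_thread::yield(); // wait until the slot is full
        slot_.store(0, std::memory_order_release);
        return value;
    }
};

struct PingPongResult {
    double seconds;         // total time for the whole exchange
    std::uint32_t pushes_a; // messages sent by thread A
    std::uint32_t pushes_b; // messages sent by thread B
};

// Thread A is the calling thread, thread B is spawned here.
// `initial` must be even and positive so that the value 0 is never sent.
PingPongResult run_ping_pong(std::uint32_t initial) {
    assert(initial >= 2 && initial % 2 == 0);
    OneSlotQueue a_to_b, b_to_a;
    std::uint32_t pushes_a = 0, pushes_b = 0;

    auto const start = std::chrono::steady_clock::now();
    std::thread b([&] {
        for (;;) {
            // The reply depends on the popped value: this is the dependency chain.
            std::uint32_t const n = a_to_b.pop() - 1;
            b_to_a.push(n);
            ++pushes_b;
            if (n == 1) break; // A decrements this last reply to 0 and stops
        }
    });
    std::uint32_t n = initial;
    while (n != 0) {
        a_to_b.push(n);
        ++pushes_a;
        n = b_to_a.pop() - 1;
    }
    b.join();
    auto const stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(stop - start).count(), pushes_a, pushes_b};
}

// Assumption: one round-trip is one message from A plus one reply from B.
double seconds_per_round_trip(PingPongResult const& r) {
    return r.seconds / r.pushes_a;
}
```

Calling `run_ping_pong(1000000)` would mirror the counter value described above, with each thread pushing 500,000 messages. A real measurement would likely busy-spin rather than yield and would pin the threads to specific CPUs, both of which this sketch omits.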
N producer threads push 4-byte integers into one shared queue, and N consumer threads pop the integers from that queue. With SMT threads, the benchmark is run with from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers and consumers, to measure the scalability of different queues. Without SMT threads (cross-core communication only), it is run with up to (total-number-of-cpus / 4) producers and consumers.
There are no dependency chains on messages in the producer and consumer threads in this benchmark, in order to let the queues demonstrate their highest possible throughputs. The reported throughputs are higher than the inverse of the latencies because no dependency chains stall CPU pipelining and out-of-order execution (as intended).
This benchmark measures the total time taken to send and receive a total of 1,000,000 messages through one queue. The charts report mean, stdev, min and max of msg/sec throughput across 33 benchmark runs.
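The N-producer, N-consumer setup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's benchmark code: `LockedQueue`, `run_throughput` and `messages_per_second` are hypothetical names, a mutex-guarded `std::queue` stands in for the queues under test, the value 0 is used here as a stop marker, and the timed interval includes thread start-up.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in queue: a mutex-guarded FIFO shared by all threads.
class LockedQueue {
    std::mutex mutex_;
    std::queue<std::uint32_t> items_;
public:
    void push(std::uint32_t value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(value);
    }
    bool try_pop(std::uint32_t& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        value = items_.front();
        items_.pop();
        return true;
    }
};

struct ThroughputResult {
    double seconds;         // total time to send and receive all messages
    std::uint64_t received; // messages popped, excluding stop markers
    std::uint64_t checksum; // sum of all popped message values
};

// Runs n producers and n consumers on one queue.
// Each producer pushes the values 1..per_producer, then one stop marker (0).
ThroughputResult run_throughput(unsigned n, std::uint32_t per_producer) {
    assert(n > 0);
    LockedQueue queue;
    std::vector<std::uint64_t> received(n, 0), checksums(n, 0);
    std::vector<std::thread> threads;
    threads.reserve(2 * n);

    auto const start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < n; ++i) {
        threads.emplace_back([&queue, per_producer] {
            // No pushed value depends on an earlier pop: no dependency chain.
            for (std::uint32_t v = 1; v <= per_producer; ++v) queue.push(v);
            queue.push(0); // n stop markers in total, one per consumer
        });
        threads.emplace_back([&queue, &received, &checksums, i] {
            std::uint64_t count = 0, sum = 0;
            std::uint32_t value = 0;
            for (;;) {
                if (!queue.try_pop(value)) { std::this_thread::yield(); continue; }
                if (value == 0) break; // each consumer exits on its first stop marker
                ++count;
                sum += value;
            }
            received[i] = count;
            checksums[i] = sum;
        });
    }
    for (auto& t : threads) t.join();
    auto const stop = std::chrono::steady_clock::now();

    ThroughputResult result{std::chrono::duration<double>(stop - start).count(), 0, 0};
    for (unsigned i = 0; i < n; ++i) {
        result.received += received[i];
        result.checksum += checksums[i];
    }
    return result;
}

double messages_per_second(ThroughputResult const& r) {
    return r.received / r.seconds;
}
```

To approximate the 1,000,000-message total described above with N producers, `per_producer` would be set to about 1,000,000 / N. Because the queue is FIFO and the last stop marker is pushed after every message, all messages are consumed before the last consumer exits.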
github.com/max0x7ba/atomic_queue
Copyright (c) Maxim Egorushkin. MIT License. See the full licence in file LICENSE.