How fast are atomics?
With a simple micro-benchmark on my Ivy Bridge laptop (i7-3610QM), I get roughly this many atomics/second to a single address. AtomicAdd (lock xadd) and AtomicCAS (lock cmpxchg) produced similar results:
- ~140 million [single-threaded]
- ~37 million [under contention - 4 or more threads accessing the same location]
Restating in Hertz: that's roughly 30-150MHz for atomics to a single address on modern CPUs. Considering that modern CPUs are clocked at 2000-3000MHz, there is a 1-2 order-of-magnitude performance gap between non-atomic and atomic ops.
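For concreteness, here is a minimal sketch of such a micro-benchmark (illustrative, not my exact harness; on x86, std::atomic's fetch_add compiles to lock xadd):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        std::atomic<long> counter{0};
        const long kIters = 100000000;
        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < kIters; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // lock xadd on x86
        auto stop = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(stop - start).count();
        std::printf("%.0f atomic adds/sec\n", kIters / secs);
    }

Run it single-threaded for the first number; spawn 4+ threads hammering the same counter for the second.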
This paper corroborates these numbers, and explains them in terms of the (modified) MESI cache-coherency protocols. An atomic is performed by first acquiring exclusive ownership of the cacheline in the core's L1. Uncontended accesses are faster because that exclusive acquisition happens only once, after which all operations stay local to the core. When multiple cores contend for the same cacheline, additional write-back-invalidate signals are sent as part of the transfer-of-ownership protocol, which adds latency. That latency acts as a direct throughput limiter, since atomics block the core's memory pipeline.
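Note that ownership is tracked per cacheline, not per address, so two unrelated counters that happen to share a line contend with each other. A minimal sketch of the usual mitigation, assuming a 64-byte line (typical for x86, but an assumption here):

    #include <atomic>

    // Two counters that would otherwise share a cacheline and contend.
    struct Counters {
        alignas(64) std::atomic<long> a{0};  // 64 = assumed cacheline size
        alignas(64) std::atomic<long> b{0};  // separate line: no false sharing
    };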
ARM atomics are built from a sequence of instructions, an LDREX/STREX (load-exclusive/store-exclusive) loop, which makes the MESI interaction explicit in the instruction stream.
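In portable C++ the same shape appears as a compare-exchange retry loop; on pre-ARMv8.1 targets compilers lower it to exactly such an LDREX/STREX loop. A sketch:

    #include <atomic>

    // Atomic-add written as a CAS loop. On older ARM this compiles to an
    // LDREX/STREX loop; on x86 it becomes lock cmpxchg in a loop.
    long AddAtomically(std::atomic<long>& x, long delta) {
        long old = x.load(std::memory_order_relaxed);
        // compare_exchange_weak may fail spuriously (e.g. STREX losing
        // exclusivity), so we retry until the store succeeds.
        while (!x.compare_exchange_weak(old, old + delta,
                                        std::memory_order_relaxed))
            ;  // on failure, 'old' is refreshed with the current value
        return old + delta;
    }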
A method of analysis
Say you write this simple conversion function:

    std::string ToString(bool x) { return x ? "true" : "false"; }

std::string must call new to allocate memory. With the default thread-safe allocator, which uses a lightweight mutex, this function costs one atomic instruction. So it is at best a 30-150MHz function: if you called it in a tight loop, it couldn't run faster than that. Considering the only other work is a strcpy of 4-5 bytes (a few clock cycles at most), the atomic clearly dominates by at least an order of magnitude. But we're still being too charitable; for every new there will be a corresponding delete. So really we have a 15-75MHz function. Just to return a string!
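If the analysis is right, the fix is to not allocate at all. A minimal sketch of one alternative (returning a pointer to a string literal; this changes the return type, so it's illustrative rather than a drop-in replacement):

    // Allocation-free variant: string literals have static storage, so no
    // new/delete and therefore no atomics are involved.
    const char* ToCString(bool x) { return x ? "true" : "false"; }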
The irony is that if you ran that same code on 10-year-old hardware, but with a single-threaded C runtime, it would certainly run faster.
This insane overhead of new/delete is why custom allocators, and malloc-replacements like TCMalloc, are so important.
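The core idea behind TCMalloc-style allocators is a per-thread cache: the common case is served from thread-local state and costs zero atomics. A toy sketch of that idea, assuming a single fixed block size (real allocators are vastly more sophisticated):

    #include <cstddef>
    #include <cstdlib>

    // Toy per-thread free list for one block size (size must be at least
    // sizeof(Node)). A cache hit costs zero atomics; only the fallback
    // path touches the shared, lock/atomic-protected global allocator.
    struct FreeList {
        struct Node { Node* next; };
        Node* head = nullptr;

        void* Alloc(std::size_t size) {
            if (head) {                   // fast path: purely thread-local
                Node* n = head;
                head = n->next;
                return n;
            }
            return std::malloc(size);     // slow path: shared allocator
        }
        void Free(void* p) {              // cache the block locally
            Node* n = static_cast<Node*>(p);
            n->next = head;
            head = n;
        }
    };

    thread_local FreeList tls_list;       // one list per thread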
Other Anti-Patterns
All of the following have something in common: they drag down the performance of single-threaded code in avoidable ways.
- Passing a refcounted smart pointer (like std::shared_ptr or boost::intrusive_ptr) by-value. This incurs extra refcount operations, and each one is an atomic add of +1 or -1 (see the sketch after this list).
- Passing objects like std::string and std::vector by-value.
- Using a std::shared_ptr to manage the lifetime of an object that is only accessed by one thread at a time.
- Fine-grained locking in general. The previous item about shared_ptr is somewhat an example of this.
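To make the first item concrete, here is a sketch with hypothetical names; the by-value signature pays two atomics per call (refcount +1 on copy, -1 on destruction), the by-reference one pays none:

    #include <memory>

    struct Widget { int id = 0; };

    // Anti-pattern: copying the shared_ptr bumps the atomic refcount on
    // entry and drops it on exit - two atomics per call, for nothing.
    int IdByValue(std::shared_ptr<Widget> w) { return w->id; }

    // Fix: pass by const reference (or pass Widget* / Widget&). No refcount
    // traffic; the caller's reference keeps the object alive for the call.
    int IdByRef(const std::shared_ptr<Widget>& w) { return w->id; }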