How fast are atomics?
With a simple micro-benchmark on my Ivy Bridge laptop (i7-3610QM), I get roughly this many atomics/second to a single address. AtomicAdd (lock xadd) and AtomicCAS (lock cmpxchg) produced similar results:
- ~140 million [single-threaded]
- ~37 million [under contention - 4 or more threads accessing the same location]
Restating in Hertz: that's roughly 30-150MHz for atomics to a single address on modern CPUs. Considering that modern CPUs are clocked at 2000-3000MHz, there is a 1-2 order-of-magnitude performance gap between non-atomic and atomic ops.
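For concreteness, here is a minimal sketch of such a micro-benchmark (illustrative, not my exact harness; on x86, std::atomic's fetch_add compiles to lock xadd):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        std::atomic<long> counter{0};
        const long kIters = 100000000;
        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < kIters; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);  // lock xadd on x86
        auto stop = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(stop - start).count();
        std::printf("%.0f atomic adds/sec\n", kIters / secs);
    }

Run it single-threaded for the first number; spawn 4+ threads hammering the same counter for the second.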
This paper corroborates these numbers, and explains them in terms of the (modified) MESI cache-coherency protocols. An atomic is performed by first acquiring exclusive ownership of the cacheline in the core's L1. Uncontended accesses are faster because that exclusive acquisition happens only once, after which all operations stay local to the core. When multiple cores contend for the same cacheline, additional write-back-invalidate signals are sent as part of the transfer-of-ownership protocol, which adds latency. That latency acts as a direct throughput limiter, since atomics block the core's memory pipeline.
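Note that ownership is tracked per cacheline, not per address, so two unrelated counters that happen to share a line contend with each other. A minimal sketch of the usual mitigation, assuming a 64-byte line (typical for x86, but an assumption here):

    #include <atomic>

    // Two counters that would otherwise share a cacheline and contend.
    struct Counters {
        alignas(64) std::atomic<long> a{0};  // 64 = assumed cacheline size
        alignas(64) std::atomic<long> b{0};  // separate line: no false sharing
    };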
ARM atomics are built from a sequence of instructions, an LDREX/STREX (load-exclusive/store-exclusive) loop, which makes the MESI interaction explicit in the instruction stream.
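In portable C++ the same shape appears as a compare-exchange retry loop; on pre-ARMv8.1 targets compilers lower it to exactly such an LDREX/STREX loop. A sketch:

    #include <atomic>

    // Atomic-add written as a CAS loop. On older ARM this compiles to an
    // LDREX/STREX loop; on x86 it becomes lock cmpxchg in a loop.
    long AddAtomically(std::atomic<long>& x, long delta) {
        long old = x.load(std::memory_order_relaxed);
        // compare_exchange_weak may fail spuriously (e.g. STREX losing
        // exclusivity), so we retry until the store succeeds.
        while (!x.compare_exchange_weak(old, old + delta,
                                        std::memory_order_relaxed))
            ;  // on failure, 'old' is refreshed with the current value
        return old + delta;
    }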
A method of analysis
Say you write this simple conversion function:

    std::string ToString(bool x) { return x ? "true" : "false"; }

std::string must call new to allocate memory. With the default thread-safe allocator, which uses a lightweight mutex, this function costs one atomic instruction. So it is at best a 30-150MHz function: if you called it in a tight loop, it couldn't run faster than that. Considering the only other work is a strcpy of 4-5 bytes (a few clock cycles at most), the atomic clearly dominates by at least an order of magnitude. But we're still being too charitable; for every new there will be a corresponding delete. So really we have a 15-75MHz function. Just to return a string!
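If the analysis is right, the fix is to not allocate at all. A minimal sketch of one alternative (returning a pointer to a string literal; this changes the return type, so it's illustrative rather than a drop-in replacement):

    // Allocation-free variant: string literals have static storage, so no
    // new/delete and therefore no atomics are involved.
    const char* ToCString(bool x) { return x ? "true" : "false"; }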
The irony is that if you ran that same code on 10-year-old hardware, but with a single-threaded C runtime, it would certainly run faster.
This insane overhead of new/delete is why custom allocators, and malloc-replacements like TCMalloc, are so important.
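The core idea behind TCMalloc-style allocators is a per-thread cache: the common case is served from thread-local state and costs zero atomics. A toy sketch of that idea, assuming a single fixed block size (real allocators are vastly more sophisticated):

    #include <cstddef>
    #include <cstdlib>

    // Toy per-thread free list for one block size (size must be at least
    // sizeof(Node)). A cache hit costs zero atomics; only the fallback
    // path touches the shared, lock/atomic-protected global allocator.
    struct FreeList {
        struct Node { Node* next; };
        Node* head = nullptr;

        void* Alloc(std::size_t size) {
            if (head) {                   // fast path: purely thread-local
                Node* n = head;
                head = n->next;
                return n;
            }
            return std::malloc(size);     // slow path: shared allocator
        }
        void Free(void* p) {              // cache the block locally
            Node* n = static_cast<Node*>(p);
            n->next = head;
            head = n;
        }
    };

    thread_local FreeList tls_list;       // one list per thread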
Other Anti-Patterns
All of the following have something in common: they drag down the performance of single-threaded code in avoidable ways.
- Passing a refcounted smart pointer (like std::shared_ptr or boost::intrusive_ptr) by-value. This incurs extra refcount operations, and each one is an atomic add of +1 or -1 (see the sketch after this list).
- Passing objects like std::string and std::vector by-value.
- Using a std::shared_ptr to manage the lifetime of an object that is only accessed by one thread at a time.
- Fine-grained locking in general. The previous item about shared_ptr is somewhat an example of this.
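To make the first item concrete, here is a sketch with hypothetical names; the by-value signature pays two atomics per call (refcount +1 on copy, -1 on destruction), the by-reference one pays none:

    #include <memory>

    struct Widget { int id = 0; };

    // Anti-pattern: copying the shared_ptr bumps the atomic refcount on
    // entry and drops it on exit - two atomics per call, for nothing.
    int IdByValue(std::shared_ptr<Widget> w) { return w->id; }

    // Fix: pass by const reference (or pass Widget* / Widget&). No refcount
    // traffic; the caller's reference keeps the object alive for the call.
    int IdByRef(const std::shared_ptr<Widget>& w) { return w->id; }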