Sunday, January 17, 2016

a critique of shared_ptr

std::shared_ptr made it into the C++ standard library, is popular, and is now in widespread use.  Before it, everyone used boost::shared_ptr.  So, what's the problem?  In a nutshell: weak_ptr support and mandatory thread-safe refcounting.  The resulting performance is far from optimal for many common use cases.

If you need a refresher, this Stack Overflow post explains how shared_ptr works.

The major off-the-shelf alternatives are std::unique_ptr and boost::intrusive_ptr.

The weak_ptr requirement always adds storage cost to shared_ptr ...

... even if you're not using weak_ptr in your own code.  The common approach is to maintain separate refcounts for strong references and weak references, so every control block carries both counters.  Both MSVC and libstdc++ do this.
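To make the storage cost concrete, here is a minimal sketch of the two control-block layouts. The struct names and the choice of `long` counters are assumptions for illustration; they are not copied from MSVC or libstdc++, whose real control blocks also carry a vtable pointer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical control block for a strong-only pointer system.
struct StrongOnlyBlock {
    std::atomic<long> strong_count{1};
};

// Hypothetical control block with weak_ptr support: a second counter
// is present whether or not any weak_ptr is ever created.
struct StrongWeakBlock {
    std::atomic<long> strong_count{1};
    std::atomic<long> weak_count{1};
};

// Extra bytes paid per managed object for weak support in this sketch.
constexpr std::size_t extra_bytes_for_weak_support() {
    return sizeof(StrongWeakBlock) - sizeof(StrongOnlyBlock);
}
```

On a typical 64-bit platform the difference is one extra machine word per managed object, though the exact figure depends on the platform and the counter type.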

Refcounting is always thread-safe ...

... even when you don't need it to be.  And as we all know, atomic read-modify-write operations can be 1-2 orders of magnitude slower than their plain, non-atomic counterparts.

If you are trying to build a DAG that is only accessed by one thread at a time, then shared_ptr is the wrong solution.

The weak_ptr requirement always adds extra performance overhead.

Minimizing the number of refcount updates per operation involves a trick: all strong references collectively hold one weak reference.  This allows all but the last strong reference to avoid touching the weak_count.  Both MSVC and libstdc++ do this.  Here are all the places atomics occur:
  • shared_ptr creation: set strong_count=1, weak_count=1; no atomics
  • weak acquire: atomically increment weak_count
  • weak release: atomically decrement weak_count
  • strong acquire: atomically increment strong_count
  • strong release: atomically decrement strong_count, and if it was the last one, also atomically decrement weak_count.
Sadly, this means that every object created through shared_ptr incurs a minimum cost of two atomics: one to release the last strong reference, and one to release the collective weak reference.  In contrast, a strong-only pointer system would incur at minimum one atomic.
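The operations listed above can be sketched as a small control block. This is an illustration under assumptions, not the code of any real standard library: the struct, its member names, the `atomic_ops` tally, and the two boolean flags standing in for "destroy the object" and "free the control block" are all hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical control block; real implementations differ in detail.
struct ControlBlock {
    std::atomic<long> strong_count{1};  // creation: plain initialization
    std::atomic<long> weak_count{1};    // the one weak ref held by all strong refs
    int  atomic_ops   = 0;     // illustration only: tallies atomic read-modify-writes
    bool object_alive = true;  // stand-in for the managed object
    bool block_alive  = true;  // stand-in for the control block's memory

    void strong_acquire() {
        ++atomic_ops;
        strong_count.fetch_add(1, std::memory_order_relaxed);
    }
    void weak_acquire() {
        ++atomic_ops;
        weak_count.fetch_add(1, std::memory_order_relaxed);
    }
    void weak_release() {
        ++atomic_ops;
        if (weak_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
            block_alive = false;  // last weak reference: free the control block
    }
    void strong_release() {
        ++atomic_ops;
        if (strong_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            object_alive = false;  // last strong reference: destroy the object
            weak_release();        // then drop the collective weak reference
        }
    }
};

// Cheapest possible lifetime: one strong reference, created and released.
int atomics_for_one_object_lifetime() {
    ControlBlock cb;
    cb.strong_release();
    assert(!cb.object_alive && !cb.block_alive);
    return cb.atomic_ops;
}
```

In this sketch, `atomics_for_one_object_lifetime()` counts two atomic operations even though no weak reference was ever taken, which is the minimum cost described above.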

Let's also remember that at least one new/delete pair is required as well.  Assuming new and delete each cost roughly one atomic, we're up to four atomics per shared_ptr-mediated object.  By the performance-analysis method presented here, single-threaded shared_ptr object-management is limited to roughly 150MHz / 4 ≈ 37MHz, i.e. about 37 million objects per second.

If you're using C++, you probably expect performance.  37MHz object creation is a far cry from peak performance.

The make_shared optimization undermines weak_ptr

The make_shared optimization is to allocate both the object and control-block in a single allocation (a single call to "new").  Herb Sutter describes it well in GotW#89.  This effectively makes all weak_ptrs hold a strong reference to the object's raw memory -- a shallow strong-reference: the object's destructor runs when the last shared_ptr goes away, but its memory cannot be freed until the last weak_ptr is gone too.  The irony!  Especially considering that the only reason you'd use shared_ptr now is if you needed weak_ptr as well.

Admittedly the extended object-memory lifetime isn't a big deal for small objects like vector or string.  Just be wary of using shared_ptr+weak_ptr with large-footprint objects.
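The extended memory lifetime can be observed with a counting allocator. The sketch below uses std::allocate_shared as a stand-in for make_shared, because it lets us watch the allocator; `CountingAllocator`, `stats()`, `Big`, and `Observation` are hypothetical names made up for this illustration. It assumes the implementation combines object and control block into one allocation, which mainstream standard libraries do but the standard only recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

struct AllocStats { int allocations = 0; int deallocations = 0; int destroyed = 0; };
inline AllocStats& stats() { static AllocStats s; return s; }

// Minimal allocator that only counts calls (illustration only).
template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U> CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++stats().allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ++stats().deallocations;
        ::operator delete(p);
    }
};
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// A large-footprint object: 1 MiB of payload.
struct Big {
    char payload[1 << 20];
    ~Big() { ++stats().destroyed; }
};

struct Observation {
    int allocations;
    int destroyed_after_strong_reset;
    int frees_after_strong_reset;
    int frees_after_weak_reset;
};

Observation observe_weak_ptr_pinning_memory() {
    stats() = AllocStats{};
    auto sp = std::allocate_shared<Big>(CountingAllocator<Big>{});
    std::weak_ptr<Big> wp = sp;

    sp.reset();  // last strong reference gone: ~Big() runs here
    assert(wp.expired());
    Observation o{};
    o.destroyed_after_strong_reset = stats().destroyed;
    o.frees_after_strong_reset = stats().deallocations;  // memory still held

    wp.reset();  // last weak reference gone: the combined block is freed
    o.frees_after_weak_reset = stats().deallocations;
    o.allocations = stats().allocations;
    return o;
}
```

With a combined allocation, the destructor count goes to one as soon as the shared_ptr is reset, while the single deallocation only happens once the weak_ptr is reset as well.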

shared_ptr implementation requires a virtual release() ...

... even if you're only using a concrete type with no inheritance.  shared_ptr must account for all possible usage scenarios, such as multiple inheritance.  It's analogous to how all COM objects inherit from IUnknown and have a virtual Release() method.
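One visible consequence of this design: the control block remembers how to destroy the object it was created with, regardless of the static type of the pointer that later releases it. The small sketch below shows that behavior; `Base`, `Derived`, and the counter are hypothetical names for illustration. The release has to go through an indirect call because the pointer's static type no longer says what to destroy.

```cpp
#include <cassert>
#include <memory>

struct Base {
    // Deliberately no virtual destructor.
};

struct Derived : Base {
    explicit Derived(int* counter) : counter_(counter) {}
    ~Derived() { ++*counter_; }  // records that the derived destructor ran
    int* counter_;
};

// Returns how many times ~Derived() ran when released via shared_ptr<Base>.
int derived_destructor_runs_through_base_pointer() {
    int destroyed = 0;
    {
        std::shared_ptr<Base> p = std::make_shared<Derived>(&destroyed);
        assert(destroyed == 0);
    }  // p released here; the control block destroys the Derived object
    return destroyed;
}
```

The derived destructor runs even though `Base` has no virtual destructor, because the control block, not the pointer type, decides how to dispose of the object. That flexibility is paid for on every release, including for concrete types that never need it.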

The optimal alternative is to use intrusive_ptr.  There you have the freedom of defining an inlinable intrusive_ptr_release() on your concrete type.

This may sound like a minor micro-optimization, but the effect of a non-inlinable call on surrounding code-generation can be profound.
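As a rough sketch of the intrusive approach, the code below follows the same protocol as boost::intrusive_ptr (free functions intrusive_ptr_add_ref() and intrusive_ptr_release() found through argument-dependent lookup), but uses a hand-rolled stand-in class so it has no Boost dependency. `IntrusivePtr`, `Node`, and the live-object counter are hypothetical, and the plain `long` refcount assumes the objects are only ever touched by one thread at a time.

```cpp
#include <cassert>
#include <utility>

struct Node {
    explicit Node(int v) : value(v) { ++live(); }
    ~Node() { --live(); }
    static int& live() { static int n = 0; return n; }  // illustration only
    int  value;
    long refcount = 0;  // plain integer: single-threaded use assumed
};

// Non-virtual and visible to the compiler, so both calls can be inlined.
inline void intrusive_ptr_add_ref(Node* n) noexcept { ++n->refcount; }
inline void intrusive_ptr_release(Node* n) noexcept {
    if (--n->refcount == 0) delete n;
}

// Minimal stand-in for boost::intrusive_ptr (not the real class).
template <class T>
class IntrusivePtr {
public:
    IntrusivePtr() noexcept = default;
    explicit IntrusivePtr(T* p) noexcept : p_(p) { if (p_) intrusive_ptr_add_ref(p_); }
    IntrusivePtr(const IntrusivePtr& o) noexcept : p_(o.p_) { if (p_) intrusive_ptr_add_ref(p_); }
    IntrusivePtr(IntrusivePtr&& o) noexcept : p_(o.p_) { o.p_ = nullptr; }
    IntrusivePtr& operator=(IntrusivePtr o) noexcept { std::swap(p_, o.p_); return *this; }
    ~IntrusivePtr() { if (p_) intrusive_ptr_release(p_); }

    T* get() const noexcept { return p_; }
    T* operator->() const noexcept { return p_; }
    T& operator*() const noexcept { return *p_; }

private:
    T* p_ = nullptr;
};

// Returns the number of live nodes while one reference is still held.
int live_nodes_after_sharing() {
    IntrusivePtr<Node> a(new Node(42));
    IntrusivePtr<Node> b = a;   // refcount is now 2
    a = IntrusivePtr<Node>();   // refcount back to 1, node still alive
    assert(b->value == 42);
    return Node::live();
}
```

There is no separate control block, no weak count, and no virtual call: the count lives inside the object, and the release function is an ordinary inline function on the concrete type.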

Concluding Remarks

shared_ptr is at best a convenient low-performance class, to be used sparingly and in code that is called at low frequency.  Prefer unique_ptr and intrusive_ptr, in that order.
