Monday, May 30, 2016

C++ Pass-By-Value a Passing Fad?

Argument passing seems basic, but is surprisingly nuanced in C++.  Conventional C++ wisdom dictates that pass-by-reference is the best default way to pass parameters.  For C++03, Scott Meyers' book "Effective C++" has an item "Prefer pass-by-reference-to-const over pass-by-value", and that was the gold standard.  C++03 only had one kind of reference (const Foo&) so that just made sense.

C++11 introduced rvalue-references (Foo&&) to support move-semantics, and now there are (at least) 2 kinds of references.  With that came a lot of confusing and conflicting advice on how to best take advantage of the performance benefits of move-semantics.  One counter-intuitive piece of advice is to use pass-by-value to gain performance.  Examples of such information:
In general my default recommendations have always been:
  • pass built-in types by value (char, int, float, double, etc.)
  • pass read-only arguments as reference-to-const
  • pass "sink" arguments as rvalue-references
  • pass mutable arguments as lvalue-references (otherwise known as "out params" and "inout params")
  • never pass by value (it's a code smell and most functions don't need to sink values)
Passing references is always a safe default: in the optimal case they collapse to nothing, and in the pessimal case they incur one extra indirection.  Passing values, by contrast, may optimally collapse to nothing (if special optimizations kick in), but in the pessimal case incurs a huge penalty (copy construction).  Even simple types like std::vector and std::string have large copy penalties due to the extra new & delete, and unnecessary copying is a major anti-pattern.  Pass-by-reference also avoids the famous "slicing problem" (pass-by-value into a base-class parameter copies only the base-class portion of a derived object).
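A sketch of what those default recommendations look like as signatures (Widget is a hypothetical type used only for illustration):

```cpp
#include <utility>
#include <vector>

// Hypothetical type used only for illustration.
struct Widget { std::vector<int> data; };

// Built-in types: pass by value.
double scale(double factor, int count) { return factor * count; }

// Read-only arguments: reference-to-const.
int total(const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

// "Sink" arguments: rvalue-reference (callee takes ownership).
void store(Widget& w, std::vector<int>&& values) {
    w.data = std::move(values);
}

// Mutable "out/inout" arguments: lvalue-reference.
void append(std::vector<int>& values, int v) { values.push_back(v); }
```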

However, I've learned to be skeptical of "always" and "never" arguments -- and thus am obliged to examine and prove my own views.  So, I'm breaking down the various cases and seeing where pass-by-value might make sense.

Separate Compilation

This is an intrinsically suboptimal case (compared to inlining), and not particularly worth being clever about.  The compiler cannot optimize across the call boundary, so restrictive ABI rules are followed (on both Windows/MSVC and Linux) -- which share this common behavior:
  • built-in-type values are passed in registers (integers, pointers, floats)
  • C++ references are implemented as pointers
  • non-POD struct values are forced to caller stack-memory, and passed by invisible reference (note: on MSVC even POD obey this rule)
  • [linux only] POD struct values are efficiently broken down into scalar components, and each scalar is passed in the next available register
Since non-POD structs at the ABI level require pass-by-reference, they are forced to memory, and we cannot gain performance with pass-by-value.  The only remaining concerns are code design at the language-level, and exception guarantees (which I'll get to).

Passing by value across a non-inline boundary guarantees either a copy or a move.  Some (like Abrahams) have argued that if the function needs to make a copy anyways, then pass by value and let the copy occur in the caller.  For extremely simple cases, this might collapse 2 overloads into one:
    void Foo::make_copy(Bar bar);
// versus
    void Foo::make_copy(const Bar& bar);
    void Foo::make_copy(Bar&& bar);

But the solution doesn't generalize to multi-argument functions.

Pass-by-Value doesn't always scale

Consider operator+, which is typically a free-function that builds on operator+=.  The prevailing answer on StackOverflow's C++ FAQ is pass-by-value on the first arg.  Quoting the code directly:
class Foo {
  Foo& operator+=(const Foo& rhs)
  {
    // actual addition of rhs to *this
    return *this;
  }
};
inline Foo operator+(Foo lhs, const Foo& rhs)
{
  lhs += rhs;
  return lhs;
}
Unfortunately, if you follow this advice and then happen to write code like this ...
const Foo a;
Foo c = a + Foo(3, 4);
... you'll end up with 2 copy constructor invocations and no moves -- which is clearly pessimal!  Why?  Because the second argument has no provision for handling temporaries as inputs. Since operator+ is commutative, we can try to patch over this with an additional overload:
inline Foo operator+(Foo lhs, const Foo& rhs);  // see above
inline Foo operator+(const Foo& lhs, Foo rhs)
{
    rhs += lhs;
    return rhs;
}
But whoops, now we'll get a compile-error "ambiguous call".  The only way to resolve this?  Abandon pass-by-value and write every possible pass-by-reference overload:
inline Foo op_plus(Foo&& lhs, const Foo& rhs)
{
    Foo result = std::move(lhs);
    result += rhs;
    return result;
}
inline Foo operator+(      Foo&& lhs, const Foo&  rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(      Foo&& lhs,       Foo&& rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(const Foo&  lhs,       Foo&& rhs) { return op_plus(std::move(rhs), lhs); }
inline Foo operator+(const Foo&  lhs, const Foo&  rhs) { Foo tmp(lhs); return op_plus(std::move(tmp), rhs); }
This solution performs the optimal number of copies and moves in every case.

Exception-safety and noexcept

Herb Sutter provided this "example #3", which I'm pasting from this SO post:
class employee {
  std::string name_;
public:
  void set_name(std::string name) noexcept { name_ = std::move(name); }
};
Sutter points out (here in the video) that the function is marked noexcept, yet a call to it may throw!  Why?  Because value arguments are constructed in the caller's context before the function is called, so a (throwing) copy-constructor will throw before the "noexcept block" has been reached.  It's especially insidious because the calling code doesn't make it obvious:
std::string str = ...;
emp.set_name(std::move(str));    // cannot throw
emp.set_name(str);               // may throw
Sutter's "example #2" has both const-ref and rvalue-ref overloads:
  void set_name(const std::string&  name)          { name_ = name; }
  void set_name(      std::string&& name) noexcept { name_ = std::move(name); }
That's a better option: the const-ref overload is not marked noexcept, so the same calling code compiles, but you can at least see in the code (and in a debugger backtrace) that a throwing function was called.  I could live with that.
Can we do better? If we required ourselves to only pass "sink" arguments as rvalue-ref, we would only retain this overload:
  void set_name(      std::string&& name) noexcept { name_ = std::move(name); }
Now all callers must be explicit, and the risk-of-exception is much clearer in the calling context.
std::string str = ...;
emp.set_name(std::move(str));    // cannot throw
emp.set_name(std::string(str));  // copy is explicit in the code, and may throw
emp.set_name(str);               // doesn't compile
By following this convention, we reserve the simplest syntax "set_name(str)" for arguments that are passed by const-ref; moves and copies are always explicit.  And that makes scanning the code for performance issues or exception-throwers that much easier.

Inline-able Code

Now we're talking about actual optimization.  The compiler can see across the function-call boundary, and collapse a lot of code down into nothing.  The inliner and backend optimizer are especially good at collapsing reference arguments back to the original variable.  Consider the following code:
void Direct()
{
  int aa = 1;
  int bb = 2;
  int cc = aa + bb;
}
// split into 2 calls
inline int Add(const int& lhs, const int& rhs)
{
  return lhs + rhs;
}
void CallAdd()
{
  int aa = 1;
  int bb = 2;
  int cc = Add(aa, bb);
}
Both "Direct" and "CallAdd" will result in the same object code on every major compiler (gcc, MSVC, clang).  Which means you don't need to sweat the small stuff.  Feel free to use references to pass trivial and non-trivial types ... they won't end up in memory.  Note also that perfect-forwarding (using T&& arguments and std::forward) relies on these same optimizations.

How does this work?  Clang + LLVM handles this in an elegant way.  The clang frontend faithfully outputs naive LLVM-IR for each function.  When a variable must be passed by reference, the caller defines the variable with an "alloca" and the callee takes in a pointer.  See a simplified example here.  Now LLVM (the backend) takes over, and starts trying to inline and then simplify.  One of the most important optimization passes is mem2reg, which transforms a memory-oriented alloca+store+load into a value (virtual register).

mem2reg is such a critical transformation that this paper indicates it alone provides on average an 81% speedup of their benchmark, whereas enabling all remaining optimizations yields 102% (a comparatively small difference).  Why do I mention this in particular?  Because when you write your code in a straightforward manner, mem2reg is an optimization you can rely upon.  This is generally not the case for many other optimizations, especially across compilers/versions/vendors.

Aliases mess up inlining

The inliner works well with pure code and values on the stack.  Mix impure logic, like necessary side-effects into heap-allocated memory or global variables, and all those optimizations can disappear.  Consider the following tweak to the inline-able code:
// same Add as before
inline Foo Add(const Foo& lhs, const Foo& rhs)
{
  return lhs + rhs;
}

Foo* g_pFoo = nullptr;
void CallAddWithGlobal()
{
  Foo aa = 1;
  Foo bb = 2;
  g_pFoo = &aa;
  Foo cc = Add(aa, bb);
}
Now that a global variable is pointing at our local variable, aa must be stored into memory.  There's also a chance that the value will be loaded from memory when this invocation of Add() is inlined, since aliases can disable trivial mem2reg collapsing.

Does this mean the const-ref argument to Add() was a mistake?  Definitely not!  It just means that CallAddWithGlobal() needs to be written better.  One easy workaround is to take a separate snapshot of variable aa before assigning g_pFoo -- that way Add() can go back to being inlined.
void CallAddWithGlobal()
{
  Foo aa = 1;
  Foo aa_copy = aa;
  Foo bb = 2;
  g_pFoo = &aa;
  Foo cc = Add(aa_copy, bb);
}

Other ways to mess up Inlining

This is kind of a tangent, but I found that classes with destructors completely suppress inlining in MSVC!  Passing a const Foo& into a function would always keep that function non-inlined in the assembly code.  This was true on 2013 and 2015 compilers.  Clang+LLVM did not have this issue.

Conclusions

I have examined a large number of cases, only to arrive at the same conclusion -- pass-by-value is an anti-pattern.  For every case where pass-by-value seems reasonable, there is an equal or better solution that only uses pass-by-reference.  In a few cases, pass-by-value may result in fewer lines of code, but it doesn't generalize.

Tuesday, April 12, 2016

Code Consistency and Refactoring

Until your code is perfect and done, there's a good chance you will be refactoring or rewriting some portion of it.  When that time comes, will you dread the coming changes, or embrace them with open arms?  The answer may depend on how consistent your code is.

The most maintainable code uses the same lexical names and code patterns in similar contexts, to allow for seamless code refactoring.  In other words, in a good codebase it's trivial to copy/paste from any function to another, to either compose or decompose a function.  I encourage you to try refactoring each of the examples below into the goal, given only the starting code, and either time yourself or imagine the steps involved.

Goal

    void func(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        pThing->doSecond();
        pThing->progress += pOther->progress2;
        pThing->doThird();
        pThing->progress += pOther->progress3;
        pThing->doFourth();
        pThing->progress += pOther->progress4;
    }

Example 1

    void funcA(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        funcB(*pThing, pOther);
    }
    void funcB(Thing& thing, Other* pOther)
    {
        thing.doSecond();
        thing.progress += pOther->progress2;
        funcC(*pOther, thing);
        funcD(&thing, *pOther);
    }
    void funcC(Other& b, Thing& a)
    {
        a.doThird();
        a.progress += b.progress3;
    }
    void funcD(Thing* thing, Other& other)
    {
        thing->doFourth();
        thing->progress += other.progress4;
    }

Example 2

    void funcA(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        funcB(pThing, pOther);
    }
    void funcB(Thing* pThing, Other* pOther)
    {
        pThing->doSecond();
        pThing->progress += pOther->progress2;
        funcC(pThing, pOther);
        funcD(pThing, pOther);
    }
    void funcC(Thing* pThing, Other* pOther)
    {
        pThing->doThird();
        pThing->progress += pOther->progress3;
    }
    void funcD(Thing* pThing, Other* pOther)
    {
        pThing->doFourth();
        pThing->progress += pOther->progress4;
    }

Which one was easier?

I'm gonna go out on a limb and say Example 2.  It's a near-zero effort set of line-level copy&pastes.  On the other hand, Example 1 requires tons of manual edits, changes in indirection, recognizing the same variable name used for both a pointer and reference (ugh, even harder to search & replace) ... the list goes on.

Pointers vs References

On a related note, this code should provide a guideline for when to prefer pointers and references.  Prefer pointers if any other code operating on the type also uses pointers.  Otherwise, use references.  Loosely speaking, this means using pointers for any heap-allocated "objects with identity", while using references for "value-type" instances defined on the stack.

This is not a rant against references -- they are in fact extremely useful for perfect-forwarding, move-semantics, etc.  This is a rant against mixing the use of references and pointers for the semantically same variable in different contexts.

My advice flies in the face of advice online or in the ISO C++ FAQ that heavily advocate for references wherever possible (at the expense of consistency).  What I say here is -- those guys don't have to maintain your code, but YOU do.  So make the smart choice!

Sunday, March 27, 2016

Reference Counting != Garbage Collection

Garbage collection (aka GC) allows for the non-deterministic destruction of resources.  Reference-counting (aka refcounting) always provides guaranteed deterministic destruction of resources.  These are fundamentally different strategies (each with their uses), yet I still see Wikipedia and plenty of literature and blogs conflating the two.

Non-deterministic destruction via GC is good for managing memory, and little else.  It's especially good for cyclic object graphs, which refcounting can't automatically handle.  It is notably very poor at managing I/O resources like file handles -- that's where determinism is an absolute requirement.

Refcounting is a natural extension of object destructors (like in C++), and works naturally with RAII.

Ultimately, if you don't have both features available, things get awkward:

  • In C#, you must manually inherit from IDisposable, and write a Dispose() function at every level of your object graph.  This is something the compiler writes for you in C++!  (in C++, every object's destructor automatically calls every member's destructor)
  • In C++, you must manually break reference cycles, either with the performance-killing weak_ptr, or by manually iterating the object graph to break all cycles from a well-defined moment.
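To make the C++ bullet concrete, here is the classic back-pointer cycle that strong references alone would leak, broken by making one edge a weak_ptr (a minimal sketch):

```cpp
#include <cassert>
#include <memory>

struct Node {
    std::shared_ptr<Node> next;   // strong edge
    std::weak_ptr<Node>   prev;   // weak back-edge breaks the cycle
};

// Returns true if the nodes were destroyed once external owners let go.
bool cycle_is_collected() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;                  // a owns b
    b->prev = a;                  // weak: does not keep a alive
    std::weak_ptr<Node> watch = a;
    a.reset();                    // drops a's only strong ref -> a dies -> b dies
    b.reset();
    return watch.expired();       // deterministic destruction, no leak
}
```

Had `prev` been a shared_ptr, both nodes would keep each other alive forever; that is the manual cycle-breaking burden referred to above.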

Sunday, January 31, 2016

C++: When is extern not extern?

The extern storage class specifier in C++ makes a symbol visible outside the file where it's defined, or helps to declare a symbol that is defined elsewhere.  Such symbols have "external linkage".
extern int foo();       // declaration
extern int bar = 3;     // definition
int baz()               // definition, which is extern by default
{
    return 4;
}

So what is the exception to the rule?  Anything declared or defined in an anonymous namespace (aka unnamed namespace) has "internal linkage", like a namespace-level static variable.  In other words, it cannot be accessed outside the current translation unit, as mentioned here.   You can, however, still use extern in an anonymous namespace, and there are actually useful things you can do with it!

Breaking Circular Dependencies

Let's say you need to declare a global variable before using it, but you want to keep the symbols private to the file to avoid name-collisions.  Let's try to use the 'static' keyword:
struct Funcs { int(*fn)(void); };

static Funcs global_var;
int UseGlobalVar()
{
    return global_var.fn();
}
static Funcs global_var = { &UseGlobalVar };
This might have been fine in C, but it doesn't compile in C++ -- the compiler reports error: redefinition of 'global_var', because there's no way to forward-declare a static variable in C++.  We could change 'static' to 'extern', but then it wouldn't be file-private.  The solution?  Anonymous namespaces.
namespace {
    extern Funcs global_var;  // fwd-decl with extern, but has internal linkage
}
int UseGlobalVar()
{
    return global_var.fn();
}
namespace {
    Funcs global_var = { &UseGlobalVar };
}
Even though global_var is marked extern, the anonymous namespace has precedence and gives it internal linkage, and everything works out.

Non-Type Template Arguments

You're probably familiar with non-type template parameters.  For example, in std::array, the "size_t N" parameter is a value, not a type.
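A trivial illustration (capacity_of is a hypothetical helper, not a standard function):

```cpp
#include <array>
#include <cstddef>

// N is a value (std::size_t), not a type, and is deduced from the argument.
template <std::size_t N>
std::size_t capacity_of(const std::array<int, N>&) { return N; }
```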

You probably also know that object-pointers/references and function-pointers/references are valid non-type template parameters.  But did you know that in C++03, any argument you plug into those must have external linkage?!  Even worse, many compilers didn't get the memo that this rule was relaxed in C++11.  As a contrived example, let's say we want to pre-bake a constant double by reference:
template <const double& Addend>
double add(double x)
{
    return x + Addend;
}

double one = 1.0;
static double two = 2.0;
const double three = 3.0;
namespace {
    double four = 4.0;
    const double five = 5.0;
    extern double six = 6.0;
}

int main() {
    printf("%f\n", add<one>(0));     // #1 ok
    printf("%f\n", add<two>(0));     // #2 error
    printf("%f\n", add<three>(0));   // #3 error
    printf("%f\n", add<four>(0));    // #4 ok
    printf("%f\n", add<five>(0));    // #5 error
    printf("%f\n", add<six>(0));     // #6 ok
}
From MSVC 2015 we get errors like:
program.cpp(36): error C2970: 'add': template parameter 'Addend': 'two': an expression involving objects with internal linkage cannot be used as a non-type argument
Let's go through each case:
  1. namespace-level objects have external linkage by default ==> ok
  2. static at namespace level implies internal linkage ==> error
  3. const at namespace level implies internal linkage ==> error
  4. surprise!  same as #1
  5. same as #3
  6. surprise!  same as #1 
Yes, even though "four" and "six" are in an anonymous namespace (and therefore have internal linkage), they are "just extern enough" to be usable in a non-type template argument.  And this is truly when extern is not extern.

Sunday, January 17, 2016

a critique of shared_ptr

std::shared_ptr made it into the C++ standard library, is popular, and now in widespread use.  And before it, everyone used boost::shared_ptr.  So, what's the problem?  In a nutshell: weak_ptr support, and mandatory threadsafe refcounting.  The resulting performance is far less than optimal, for many common use-cases. 

If you need a refresher, this stackoverflow post explains how shared_ptr works.

The major off-the-shelf alternatives are std::unique_ptr and boost::intrusive_ptr.

The weak_ptr requirement always adds storage cost to shared_ptr ...

... even if you're not using weak_ptr in your own code.  The common algorithm is to maintain a separate refcount for strong-references and weak-references.  Both MSVC and libstdc++ do this.

Refcounting is always thread-safe ...

... even when you don't need it to be.  And as we all know, atomics are 1-2 orders of magnitude slower than their single-threaded counterparts.

If you are trying to build a DAG that is only accessed by one thread at a time, then shared_ptr is the wrong solution.

The weak_ptr requirement always adds extra performance overhead.

The optimal number of refcount operations involves a trick: all strong references collectively hold one weak reference.  This allows all but the last strong reference to avoid touching the weak_count.  Both MSVC and libstdc++ do this.  Here are all the places atomics occur:
  • shared_ptr creation: strong_count = 1, weak_count = 1; no atomics
  • weak acquire: atomically increment weak_count
  • weak release: atomically decrement weak_count
  • strong acquire: atomically increment strong_count
  • strong release: atomically decrement strong_count, and if it was the last one, also atomically decrement weak_count.
Sadly, this means that every object created through shared_ptr incurs a minimum cost of two atomics.  In contrast, a strong-only pointer system would incur at minimum one atomic.
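That bookkeeping can be modeled directly.  Below is a simplified, single-threaded sketch of the control block (counts are plain ints here; a real implementation performs atomic read-modify-write operations, which this model just tallies in an `atomics` field):

```cpp
// Simplified model of the "strong refs collectively hold one weak ref"
// trick. NOT a real control block: plain ints stand in for atomics,
// and 'atomics' counts how many atomic RMW ops a real one would issue.
struct ControlBlock {
    int strong  = 1;   // creation: strong_count = 1 ...
    int weak    = 1;   // ... plus the strong refs' shared weak_count = 1
    int atomics = 0;   // atomic ops issued so far (zero at creation)

    void strong_acquire() { ++strong; ++atomics; }
    void strong_release() {
        --strong; ++atomics;
        if (strong == 0) { --weak; ++atomics; }  // last strong drops shared weak
    }
    void weak_acquire()   { ++weak; ++atomics; }
    void weak_release()   { --weak; ++atomics; }
};
```

Create an object and immediately release its only strong reference, and the model issues exactly two atomics -- the minimum cost quoted above.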

Let's also remember that at least one new/delete is required as well, so we're up to four atomics per shared_ptr-mediated object.  By the performance-analysis method presented here, single-threaded shared_ptr object-management is limited to ~(150MHz / 4) = 37MHz.

If you're using C++, you probably expect performance.  37MHz object creation is a far cry from peak performance.

the make_shared optimization undermines weak_ptr

The make_shared optimization is to allocate both the object and control-block in a single allocation (a single call to "new").  Herb Sutter describes it well in GotW#89.  This effectively makes all weak_ptrs now hold a strong reference to the object's raw memory -- a shallow strong-reference.  The irony!  Especially considering the only reason you'd use shared_ptr now, is if you needed weak_ptr as well.

Admittedly the extended object-memory lifetime isn't a big deal for small objects like vector or string.  Just be wary of using shared_ptr+weak_ptr with large-footprint objects.

shared_ptr implementation requires a virtual release() ...

... even if you're only using a concrete type with no inheritance.  shared_ptr must account for all possible usage scenarios, including multiple inheritance and type-erased custom deleters.  It's analogous to how all COM objects inherit from IUnknown and have a virtual Release() method.

The alternative optimal solution, is to use intrusive_ptr.  There you have the freedom of defining an inlinable intrusive_ptr_release() on your concrete type.

This may sound like a minor micro-optimization, but the effect of a non-inlinable call on surrounding code-generation can be profound.
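A hand-rolled sketch of the intrusive alternative (simplified for self-containment; with boost::intrusive_ptr you would instead define free functions intrusive_ptr_add_ref and intrusive_ptr_release for your type):

```cpp
// Minimal intrusive-refcount sketch: release() is a plain, non-virtual
// member on the concrete type -- trivially inlinable, no separate
// control block, no vtable dispatch.
struct RefcountedFoo {
    int refs = 1;                            // creator holds the first ref
    void acquire() { ++refs; }
    bool release() { return --refs == 0; }   // true => caller should delete
};
```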

Concluding Remarks

shared_ptr is at best a convenient low-performance class, to be used sparingly and in code that is called at low frequency.  Prefer unique_ptr and intrusive_ptr, in that order.

Friday, January 15, 2016

Atomics aren't free

Modern code is filled with the use of atomic-instructions, which serve to speed up multi-threaded or threadsafe code.  These are hidden behind every new/delete, shared_ptr, lightweight mutexes like Win32 CRITICAL_SECTION and linux futex, and lock-free data-structures like those provided by Boost.LockFree.  It's easy to be excited about their performance gains as compared with syscall-based synchronization variants.  However, what's often ignored is their performance versus unsynchronized single-threaded code.

How fast are atomics?

With a simple micro-benchmark on my IvyBridge laptop (i7-3610QM), I get roughly this many atomics/second to a single address.  AtomicAdd (lock xadd) and AtomicCAS (lock cmpxchg) produced similar results:
  • ~140 million [single-threaded]
  • ~37 million [under contention - 4 or more threads accessing the same location]
The "under contention" case hasn't changed much in the last 10 years, at least for systems I tested.  A desktop Core2 Duo and Nehalem i7 produced nearly the same results.
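For reference, the single-threaded side of such a micro-benchmark can be sketched like this (not the exact harness used above; absolute numbers are machine-dependent):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Single-threaded throughput of atomic adds to one address.
// On the IvyBridge laptop described above this lands around 140M/s,
// but the result varies widely by machine.
double atomics_per_second(std::int64_t iters)
{
    std::atomic<std::int64_t> counter{0};
    auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(iters) / elapsed.count();
}
```

The contended variant runs the same loop on several threads against the same counter.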

Restating in Hertz, we have between 30MHz and 150MHz of atomics to a single address on modern CPUs.  Considering that modern CPUs run at 2000-3000MHz, that's a 1-2 order-of-magnitude difference in performance between non-atomic and atomic ops.

This paper corroborates these numbers, and explains them in terms of the modified MESI cache-coherency protocols.  Atomics are performed by first acquiring exclusive ownership of a cacheline into a core's L1.  Uncontended accesses are faster because the exclusive acquisition happens only once, and all operations stay local to the core after that.  When multiple cores contend for the same cacheline, additional write-back-invalidate signals are sent as part of the transfer-of-ownership protocol, which adds latency.  That latency acts as a direct throughput limiter, since atomics block the core's memory pipeline.

ARM atomics require a sequence of instructions in an LDREX/STREX loop, which is an explicit use of MESI.

A method of analysis

Say you write this simple conversion function.
std::string ToString(bool x)
{
    return x ? "true" : "false";
}
std::string must call "new" to allocate memory.  With the default thread-safe allocator, which uses a lightweight mutex, this function costs one atomic instruction.  So this function is at best a 30-150MHz function -- if you called it in a tight loop, it couldn't run faster than that.  Considering the only other operation is a strcpy of 4-5 bytes (a few clock-cycles), it's clear the atomic dominates by at least an order of magnitude.  But we're still being too charitable; for every new we'll have a corresponding delete.  So really we have a 15-75MHz function.  Just to return a string!

The irony is that if you ran that same code on 10-year-old hardware, but with a single-threaded C runtime, it would certainly run faster.

This insane overhead of new/delete is why custom allocators, and malloc-replacements like TCMalloc, are so important.

Other Anti-Patterns

All of the following have something in common: they drag down the performance of single-threaded code in avoidable ways.

Passing a refcounted smart-pointer (like std::shared_ptr or boost::intrusive_ptr)  by-value.  This incurs extra refcount operations, and each one is an atomic add by +1 or -1.

Passing objects like std::string and std::vector by-value.

Using a std::shared_ptr to manage the lifetime of an object that is only accessed by one thread at a time.

Fine-grained locking in general.  The previous item about shared_ptr is somewhat an example of this.
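The first item is easy to demonstrate at the signature level with use_count (a minimal sketch):

```cpp
#include <memory>
#include <string>

// Anti-pattern: by-value smart pointer. The copy bumps the refcount on
// entry and drops it on exit -- two atomic ops per call.
long count_seen_byvalue(std::shared_ptr<std::string> p) {
    return p.use_count();          // caller's ref + this copy
}

// Fix: pass by reference-to-const -- no refcount traffic at all.
long count_seen_byref(const std::shared_ptr<std::string>& p) {
    return p.use_count();          // just the caller's ref
}
```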

Take-aways

Treat atomics as "slow" operations, just as you would any other synchronization primitive.

Tuesday, January 12, 2016

C++ Assignment Operators Considered Optional

Summary

Conventional C++11 wisdom states that if a class needs a custom copy-constructor, it probably needs a custom copy-assignment-operator as well.  The same goes for the move-constructor and move-assignment-operator.  This is widely known as the rule-of-3/5/0.  The rule is onerous, as you are effectively required to implement the same logic repeatedly.

As it turns out, any class with a noexcept move-constructor is assignable purely in terms of its constructors, via a function I dub auto_assign<T>.  The impact is not limited to simplifying class design (though that alone is significant).  Reference-members and const-members of a class can only be initialized by constructors, which precludes such classes from use in container-types like std::vector.  auto_assign-aware containers and libraries will afford a vast new degree of flexibility, with literally no down-side.

Why do we need this?

Quite simply, there are things that just don't work in C++ today. For example, you cannot stick this class in a std::vector.
struct ConstructOnly
{
    int& value;

    ConstructOnly(int& value_)
        : value(value_)
    {}
};
Seriously, you have to change that int& value to a pointer int* pValue just to make it copy- and move-assignable.  Why?  Because reference member-variables can only be bound by a constructor.  You'll have similar grief with const members.

And of course, who doesn't like doing half the work?  We can centralize all our copying and moving logic into constructors, rather than worrying about how to factor the code between constructors and operator=.

The Basic Idea

The basic idea behind auto_assign, is that C++ allows us to in-place destruct, then construct, any object.  Let's look at it for a concrete Foo, and copy-construction, to simplify matters.
void Usage()
{
    Foo foo;                // assume succeeded
    Foo rhs = ...;          // assume succeeded
    foo_assign(foo, rhs);
}                           // <-- foo.~Foo() invoked by compiler
We can naively define foo_assign as follows:
Foo& foo_assign(Foo& lhs, Foo& rhs)
{
    lhs.~Foo();             // <-- explicit destructor call
    new (&lhs) Foo(rhs);    // <-- placement-new into lhs, with copy-constructor
    return lhs;
}
Let's call this version "destruct-then-construct".  The big problem here is that foo_assign is not exception-safe ... unless Foo's copy-constructor is noexcept.  If Foo's copy-constructor throws an exception, then we won't have a valid foo object at the closing brace of Usage(), resulting in two destructor calls in a row on the same object.  And that is undefined behavior.

noexcept move-constructors come to the rescue!  We can rewrite foo_assign to be both correct and exception-safe:
Foo& foo_assign(Foo& lhs, Foo& rhs)
{
    Foo tmp(rhs);           // <-- copy-construct into temporary [may throw]
    lhs.~Foo();             // <-- explicit destructor call
    new (&lhs) Foo(std::move(tmp));  // <-- placement-new into lhs, with noexcept move-constructor
    return lhs;
}
Let's call this version "copy-destruct-move", which is a flavor of the well-known "copy-and-swap idiom".  It is slightly less efficient than destruct-then-construct, since an extra constructor and destructor are invoked, so we should only use it with noexcept(false) constructors.

Definition of auto_assign<T>

The general purpose auto_assign<T> must handle the following cases:
  1. Copy, where type T is assignable via operator=.
  2. Move, where type T is assignable via operator=.
  3. Copy, where type T is not assignable via operator=.
  4. Move, where type T is not assignable via operator=.
We want to handle these cases as follows, keeping "the basic idea" in mind:
  1. lhs = rhs;  // use operator if it exists
  2. lhs = std::move(rhs);  // use operator if it exists
  3. If the copy-constructor is noexcept, destruct-then-construct.  Otherwise, use copy-destruct-move.
  4. Destruct-then-construct.
A full working version is available as part of my toy project here.  Getting the noexcept specifiers and template deduction right was tricky, but ultimately the code is pretty straightforward.
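For flavor, here is a simplified C++17 sketch of the copy-side dispatch only (the full version handles moves as well, and predates if constexpr, so its plumbing differs):

```cpp
#include <new>
#include <type_traits>
#include <utility>

// Simplified auto_assign (copy cases only): use operator= when it exists,
// otherwise destruct-then-construct, falling back to copy-destruct-move
// when the copy-constructor may throw.
template <class T>
T& auto_assign(T& lhs, const T& rhs) {
    if constexpr (std::is_copy_assignable_v<T>) {
        return lhs = rhs;                        // case 1: has operator=
    } else if constexpr (std::is_nothrow_copy_constructible_v<T>) {
        lhs.~T();                                // case 3a: destruct...
        return *::new (&lhs) T(rhs);             // ...then construct in place
    } else {
        T tmp(rhs);                              // case 3b: copy may throw; lhs untouched
        lhs.~T();
        return *::new (&lhs) T(std::move(tmp));  // noexcept move-construct
    }
}
```

With this in hand, the ConstructOnly class from earlier (reference member, no operator=) becomes assignable.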

auto_assign<T>-aware swap()

This one's easy -- just replace "=" calls with auto_assign().  As a bonus, we'll also delegate to a member-swap function if it exists.
template <class T>
void swap(T& lhs, T& rhs, typename std::enable_if<!has_swap<T>::value>::type* = nullptr)
{
    T tmp(std::move(lhs));
    auto_assign(lhs, std::move(rhs));
    auto_assign(rhs, std::move(tmp));
}
template <class T>
void swap(T& lhs, T& rhs, typename std::enable_if<has_swap<T>::value>::type* = nullptr)
{
    lhs.swap(rhs);
}
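The has_swap<T> trait used above isn't standard; one possible expression-SFINAE sketch (hypothetical, C++11) is:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical detection trait: true when T has a member swap(T&).
template <class T>
class has_swap
{
    // Preferred overload: well-formed only if declval<U&>().swap(...) compiles.
    template <class U>
    static auto test(int) -> decltype(std::declval<U&>().swap(std::declval<U&>()),
                                      std::true_type());
    // Fallback overload: chosen when the expression above is ill-formed.
    template <class>
    static std::false_type test(...);
public:
    static const bool value = decltype(test<T>(0))::value;
};

// Illustrative type with a member swap.
struct Gadget {
    int v;
    void swap(Gadget& other) { int t = v; v = other.v; other.v = t; }
};
```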

auto_assign<T>-aware containers

The real power will come when containers support construct-only types.  This is no trivial exercise.  I hope to have some of these working "soon".


When is it safe to use auto_assign<T>?

It's safe to use in any context where you would legitimately use operator=.  For example, any time you have a fully-formed object that you wish to copy or move.  If you have a virtual operator= on your class, it'll work just fine too.

As an example of what not to do:  Do not use auto_assign to implement your own operator=.  When implementing operator=, follow conventional wisdom.

In an auto_assign<T>-aware world, should I ever implement operator= again?

IMO, the answer for most classes is "no, you do not need a custom operator=, but you should spend your effort on writing noexcept move-constructors instead".

operator= ends up becoming a mere optimization, to be applied where needed.  Performance-wise:

  1. auto_assign-moves imply the same number of destructor and constructor calls as a hand-written move operator=.
  2. auto_assign-copies invoke one additional move-constructor and (trivial) destructor compared to a hand-written copy operator=.
  3. BUT the compiler may have certain in-built optimizations for operator= that it cannot apply with destruct-then-construct sequences.  This one's hard to quantify.
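To make point 2 concrete, here's a hypothetical instrumented type (Probe is an assumed name, not from the post) that counts the calls implied by a hand-rolled copy-destruct-move:

```cpp
#include <new>
#include <utility>

// Hypothetical instrumented type: counts constructor and destructor calls.
struct Probe {
    static int ctors, dtors;
    Probe() { ++ctors; }
    Probe(const Probe&) { ++ctors; }
    Probe(Probe&&) noexcept { ++ctors; }
    ~Probe() { ++dtors; }
};
int Probe::ctors = 0;
int Probe::dtors = 0;

int copy_destruct_move_cost()
{
    Probe lhs, rhs;
    Probe::ctors = Probe::dtors = 0;
    {
        Probe tmp(rhs);                    // +1 copy-constructor
        lhs.~Probe();                      // +1 destructor
        new (&lhs) Probe(std::move(tmp));  // +1 move-constructor
    }                                      // tmp destroyed: +1 destructor
    // Total of 4 calls: one ctor/dtor pair more than plain
    // destruct-then-construct (which implies only 2).
    return Probe::ctors + Probe::dtors;
}
```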

Concluding Remarks


Without wide-ranging library support, it's too early to start switching all your code over.  But, if this catches on with boost and/or STL, we'll be living in a brave new world of C++.

Typed Exceptions are Pointless

Summary

Major languages like C++, Java, C#, and ML allow you to throw exceptions of different types.  This is a needless complication; programs only need a type-less "signal" that indicates whether a computation succeeded or failed.

In this simpler model of type-less exceptions, we can do the following in C++:
  • Continue to use RAII to protect resources.
  • When you must use try/catch, the only legitimate catch-handler is catch(...).
  • Decorate functions with noexcept wherever possible.

Why do exceptions exist?

Before discussing the topic at hand, we must first understand: why do programming languages need exceptions?
  1. In C++, the only way to fail an object's constructor, such that the caller never sees a fully formed object, is by throwing an exception.
  2. Copy/move-constructors and assignment-operators can only fail via exception.
  3. We want to write "expressive" code, with use of math operators, method-chaining.
  4. STL and boost types, including popular classes like std::vector and std::function, use exceptions.
Notice that we can avoid the use of exceptions in all cases:
  1. write trivial noexcept constructors + Init functions
  2. manually write clone functions
  3. use error-codes, nullables/optionals, or other alternative error-handling strategies
  4. roll our own container types, or use something like EASTL
So technically, exceptions aren't necessary.   But, if we want to write code a certain way, we must use them with all their associated baggage.  It's a classic engineering trade-off.
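As a sketch of alternative 1, consider a hypothetical resource wrapper (File and Init are illustrative names, not from the post) with a trivial noexcept constructor plus a fallible Init function:

```cpp
#include <cstdio>

// Sketch of alternative 1: the constructor cannot fail, and Init reports
// failure via its return value instead of throwing.
class File {
public:
    File() noexcept : fp_(nullptr) {}
    ~File() { if (fp_) std::fclose(fp_); }
    File(const File&) = delete;
    File& operator=(const File&) = delete;

    bool Init(const char* path) noexcept   // may fail; never throws
    {
        fp_ = std::fopen(path, "r");
        return fp_ != nullptr;
    }

private:
    std::FILE* fp_;
};
```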

Getting to the point.

Let's say we have to calculate the sum of numbers in a file.  The requester says "just give me a number, that's all we want to store in the database".  We start by writing this function:
Number CalculateSum(const char* pDirName)
{
    try {
        boost::filesystem::path filePath(pDirName);
        filePath /= "numbers.txt";
        std::vector<Number> numbers = ReadNumbers(filePath);
        Number sum = std::accumulate(numbers.begin(), numbers.end(), Number(0));
        return sum;
    }
    catch (const std::bad_alloc& e) {
        // ... handle out-of-memory? which allocation? ...
    }
    catch (const std::ifstream::failure& e) {
        // ... handle filestream failure? what filestream? ...
    }
    catch (const std::overflow_error& e) {
        // ... handle overflow? which subexpression? ...
    }
    // "they" said catch(...) is evil, but what if another type were thrown?
    return Number(std::nan(""));
}

We're doing some file I/O (in ReadNumbers), allocating memory (inside boost::filesystem::path and std::vector), doing some math ... many different kinds of actions are going on here.  Semantically, a failure in any of those steps is a total failure in CalculateSum.  Our only recourse is to take appropriate action at our current layer.

To reiterate: At this scope, we cannot know which step failed.  Even if we did, we couldn't do things any differently.  This is why I say that typed-exceptions are pointless.  The type and contents of the exception are completely useless; we only care about whether an exception was thrown at all.

Let's rewrite that function.
Number CalculateSum(const char* pDirName) noexcept
{
    try {
        boost::filesystem::path filePath(pDirName);
        filePath /= "numbers.txt";
        std::vector<Number> numbers = ReadNumbers(filePath);
        Number sum = std::accumulate(numbers.begin(), numbers.end(), Number(0));
        return sum;
    }
    catch (...) {
        return Number(std::nan(""));
    }
}

Notice that this time, our function is marked "noexcept", and indeed we are sure that no exceptions will leak out.  catch(...) is the way of transforming a throwing function into a noexcept function.

If you're still not convinced that catch(...) is the proper approach, consider what the introduction of noexcept in C++11 implies.  noexcept is a boolean concept, supplanting the multi-typed throw()-specifier.  The noexcept boolean argument can be calculated on template functions, whereas no such facility exists for throw()-specifiers.  noexcept in C++17 is also becoming part of a function's type, paving the way for compile-time checking of exception propagation.  Overall the language now considers "whether an exception was thrown" to be more important than the type of the exception.

I will take this opportunity to point out that if you're with me so far, you'll agree that this section of the C++ FAQ about exceptions is dispensing really bad advice.  For extra entertainment value, consult the corresponding section of the C++FQA.

Finally, note that RAII is still the preferred way of protecting resources.  In the example, we still use a std::vector to avoid leaking memory, and so on.

Compile-time errors are still missing?

That's right.  Unfortunately, the standards committee chose the runtime-checked approach, which was (and still is) highly contentious.  This SO post sheds some light, although ultimately it was a hasty decision that we simply have to live with.

IMO we could live with a stop-gap of compile-time checked warnings whenever a possibly-throwing expression appears in an unguarded section of a noexcept function.  This shouldn't be difficult to implement, perhaps even as a clang tool or plugin.

What types of exceptions should I throw?

Any type you want!  Just don't catch them by type.

I suggest sticking to std::exception or a type derived from it.  Since the type will be ignored by the catch-handler, only the std::exception::what() method will be relevant for diagnostics.

e.what() about std::exception?

std::exception has a virtual const char* what() const method, which can be used to print a diagnostic message in a catch-handler.  Since catch(...) doesn't provide access to the exception object, what can we do?  One option is to repeat the error-handling in both a catch-std::exception and catch(...) block:
    try {
        return foo();
    }
    catch (const std::exception& e) {
        std::cerr << "caught exception: " << e.what() << std::endl;
        return Error();
    }
    catch (...) {
        std::cerr << "caught unknown exception" << std::endl;
        return Error();
    }
But that's way more verbose, and it's the exception to the rule of "the only legitimate catch-handler is catch(...)".  One alternative is to hide the try/catch inside a function, like so:
template <class TCatch, class TTry>
auto try_(TTry&& try_block, TCatch&& catch_block) -> decltype(try_block())
{
    try {
        return try_block();
    }
    catch (const std::exception& e) {
        return catch_block(e);
    }
    catch (...) {
        return catch_block(std::exception());
    }
}

// and use it like so
    return try_(
        [&]() {
            return foo();
        },
        [&](const std::exception& e) {
            std::cerr << "caught exception: " << e.what() << std::endl;
            return Error();
        });

I'm not particularly advocating this technique, but it has two interesting properties:
  1. It guarantees your catch-handler sees a std::exception.
  2. It makes try/catch usable in expressions, whereas the keywords are limited to statements.

Additional Resources

The Go language uses a non-typed exception mechanism, described in Defer, Panic, and Recover.

The Java language has the option of checked exceptions, the moral equivalent of C++ throw()-specifiers, but checked at compile-time.

C++ exception specifications are evil.