Sunday, January 31, 2016

C++: When is extern not extern?

The extern storage class specifier in C++ makes a symbol visible outside the file where it's defined, or declares a symbol that is defined elsewhere.  These symbols have "external linkage".
extern int foo();       // declaration
extern int bar = 3;     // definition
int baz()               // definition, which is extern by default
{
    return 4;
}

So what is the exception to the rule?  Anything declared or defined in an anonymous namespace (aka unnamed namespace) has "internal linkage", like a namespace-level static variable.  In other words, it cannot be accessed outside the current translation unit, as mentioned here.   You can, however, still use extern in an anonymous namespace, and there are actually useful things you can do with it!

Breaking Circular Dependencies

Let's say you need to declare a global variable before using it, but you want to keep the symbols private to the file to avoid name-collisions.  Let's try to use the 'static' keyword:
struct Funcs { int(*fn)(void); };

static Funcs global_var;
int UseGlobalVar()
{
    return global_var.fn();
}
static Funcs global_var = { &UseGlobalVar };
This might have been fine in C, but it doesn't compile in C++ -- we get error: redefinition of 'global_var' from the compiler, because there's no way to forward-declare a static variable in C++.  We could change 'static' to 'extern', but then it wouldn't be file-private.  The solution?  Anonymous namespaces.
namespace {
    extern Funcs global_var;  // fwd-decl with extern, but has internal linkage
}
int UseGlobalVar()
{
    return global_var.fn();
}
namespace {
    Funcs global_var = { &UseGlobalVar };
}
Even though global_var is marked extern, the anonymous namespace has precedence and gives it internal linkage, and everything works out.

Non-Type Template Arguments

You're probably familiar with non-type template parameters.  For example, in std::array, the "size_t N" parameter is a value, not a type.

You probably also know that object-pointers/references and function-pointers/references are valid non-type template parameters.  But did you know that in C++03, any argument you plug into those must have external linkage?!  Even worse, many compilers didn't get the memo that this rule was relaxed in C++11.  As a contrived example, let's say we want to pre-bake a constant double by reference:
template <const double& Addend>
double add(double x)
{
    return x + Addend;
}

double one = 1.0;
static double two = 2.0;
const double three = 3.0;
namespace {
    double four = 4.0;
    const double five = 5.0;
    extern double six = 6.0;
}

int main() {
    printf("%f\n", add<one>(0));     // #1 ok
    printf("%f\n", add<two>(0));     // #2 error
    printf("%f\n", add<three>(0));   // #3 error
    printf("%f\n", add<four>(0));    // #4 ok
    printf("%f\n", add<five>(0));    // #5 error
    printf("%f\n", add<six>(0));     // #6 ok
}
From MSVC 2015 we get errors like:
program.cpp(36): error C2970: 'add': template parameter 'Addend': 'two': an expression involving objects with internal linkage cannot be used as a non-type argument
Let's go through each case:
  1. namespace-level objects have external linkage by default ==> ok
  2. static at namespace level implies internal linkage ==> error
  3. const at namespace level implies internal linkage ==> error
  4. surprise!  same as #1
  5. same as #3
  6. surprise!  same as #1 
Yes, even though "four" and "six" are in an anonymous namespace (and therefore have internal linkage), they are "just extern enough" to be usable in a non-type template argument.  And this is truly when extern is not extern.

Sunday, January 17, 2016

a critique of shared_ptr

std::shared_ptr made it into the C++ standard library, is popular, and now in widespread use.  And before it, everyone used boost::shared_ptr.  So, what's the problem?  In a nutshell: weak_ptr support, and mandatory threadsafe refcounting.  The resulting performance is far less than optimal, for many common use-cases. 

If you need a refresher, this stackoverflow post explains how shared_ptr works.

The major off-the-shelf alternatives are std::unique_ptr and boost::intrusive_ptr.

The weak_ptr requirement always adds storage cost to shared_ptr ...

... even if you're not using weak_ptr in your own code.  The common algorithm is to maintain a separate refcount for strong-references and weak-references.  Both MSVC and libstdc++ do this.
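A simplified control block might look like this (the names and layout are illustrative only, not the actual MSVC or libstdc++ internals):

```cpp
#include <atomic>
#include <cassert>

// Illustrative only: both counts are present in every control block,
// even if your program never creates a single weak_ptr.
struct ControlBlock {
    std::atomic<long> strong_count{1};  // shared_ptr references
    std::atomic<long> weak_count{1};    // weak_ptr refs (+1 held collectively by strong refs)
};
```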

Refcounting is always thread-safe ...

... even when you don't need it to be.  And as we all know, atomics are 1-2 orders of magnitude slower than their non-atomic counterparts.

If you are trying to build a DAG that is only accessed by one thread at a time, then shared_ptr is the wrong solution.

The weak_ptr requirement always adds extra performance overhead.

Minimizing the number of refcount operations involves a trick, where all strong references collectively hold 1 weak reference.  This allows all but the last strong reference to avoid touching the weak_count.  Both MSVC and libstdc++ do this.  Here are all the places atomics occur:
  • shared_ptr creation: set strong_count = 1, weak_count = 1; no atomics
  • weak acquire: atomically increment weak_count
  • weak release: atomically decrement weak_count
  • strong acquire: atomically increment strong_count
  • strong release: atomically decrement strong_count, and if it was the last one, also atomically decrement weak_count.
Sadly, this means that every object created through shared_ptr incurs a minimum cost of two atomics.  In contrast, a strong-only pointer system would incur at minimum one atomic.
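The strong-release path above can be sketched as follows (a hypothetical helper, not library code; in a real implementation the object is destroyed when strong hits zero and the control block is freed when weak hits zero):

```cpp
#include <atomic>
#include <cassert>

struct Control {
    std::atomic<long> strong{1};
    std::atomic<long> weak{1};  // held collectively by all strong refs
};

// Sketch of the "last strong ref also drops the shared weak ref" trick.
// Returns true when the control block itself should be freed.
inline bool strong_release(Control& c) {
    if (c.strong.fetch_sub(1) == 1) {      // atomic #1: was this the last strong ref?
        return c.weak.fetch_sub(1) == 1;   // atomic #2: drop the collectively-held weak ref
    }
    return false;  // other strong refs remain: one atomic total
}
```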

Let's also remember that at least one new/delete is required as well, so we're up to four atomics per shared_ptr-mediated object.  By the performance-analysis method presented here, single-threaded shared_ptr object-management is limited to ~(150MHz / 4) = 37MHz.

If you're using C++, you probably expect performance.  37MHz object creation is a far cry from peak performance.

the make_shared optimization undermines weak_ptr

The make_shared optimization is to allocate both the object and control-block in a single allocation (a single call to "new").  Herb Sutter describes it well in GotW#89.  This effectively makes all weak_ptrs now hold a strong reference to the object's raw memory -- a shallow strong-reference.  The irony!  Especially considering the only reason you'd use shared_ptr now, is if you needed weak_ptr as well.
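We can at least observe half of this behavior directly: the object is destroyed on the last strong release even while a weak_ptr is outstanding.  What the sketch below can't observe is the ironic part -- with make_shared, the single allocation backing that destroyed object stays alive until the weak_ptr dies too.  (Tracked is a made-up demo type.)

```cpp
#include <cassert>
#include <memory>

// Demo type that reports its own destruction through a flag.
struct Tracked {
    bool* destroyed;
    explicit Tracked(bool* d) : destroyed(d) {}
    ~Tracked() { *destroyed = true; }
};

inline bool destroyed_before_weak_released() {
    bool destroyed = false;
    std::weak_ptr<Tracked> wp;
    {
        auto sp = std::make_shared<Tracked>(&destroyed);
        wp = sp;
    }  // last strong ref gone: ~Tracked runs, but wp still pins the allocation
    return destroyed && wp.expired();
}
```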

Admittedly the extended object-memory lifetime isn't a big deal for small objects like vector or string.  Just be wary of using shared_ptr+weak_ptr with large-footprint objects.

shared_ptr implementation requires a virtual release() ...

... even if you're only using a concrete type with no inheritance.  shared_ptr must account for all possible usage scenarios, including multiple-inheritance.  It's analogous to how all COM objects inherit from IUnknown and have a virtual Release() method.

The alternative optimal solution, is to use intrusive_ptr.  There you have the freedom of defining an inlinable intrusive_ptr_release() on your concrete type.
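A minimal intrusive-counted type might look like this (a sketch; Widget is a made-up type, and boost::intrusive_ptr would find these two hooks by argument-dependent lookup):

```cpp
#include <atomic>
#include <cassert>

// A concrete type with an embedded refcount.  The release hook is a plain,
// inlinable free function -- no virtual dispatch required.
struct Widget {
    std::atomic<long> refcount{0};
};

inline void intrusive_ptr_add_ref(Widget* w) {
    w->refcount.fetch_add(1, std::memory_order_relaxed);
}

inline void intrusive_ptr_release(Widget* w) {
    if (w->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete w;  // last reference gone
}
```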

This may sound like a minor micro-optimization, but the effect of a non-inlinable call on surrounding code-generation can be profound.

Concluding Remarks

shared_ptr is at best a convenient low-performance class, to be used sparingly and in code that is called at low frequency.  Prefer unique_ptr and intrusive_ptr, in that order.

Friday, January 15, 2016

Atomics aren't free

Modern code is filled with the use of atomic-instructions, which serve to speed up multi-threaded or threadsafe code.  These are hidden behind every new/delete, shared_ptr, lightweight mutexes like Win32 CRITICAL_SECTION and linux futex, and lock-free data-structures like those provided by Boost.LockFree.  It's easy to be excited about their performance gains as compared with syscall-based synchronization variants.  However, what's often ignored is their performance versus unsynchronized single-threaded code.

How fast are atomics?

With a simple micro-benchmark on my IvyBridge laptop (i7-3610QM), I get roughly this many atomics/second to a single address.  AtomicAdd (lock xadd) and AtomicCAS (lock cmpxchg) produced similar results:
  • ~140 million [single-threaded]
  • ~37 million [under contention - 4 or more threads accessing the same location]
The "under contention" case hasn't changed much in the last 10 years, at least for systems I tested.  A desktop Core2 Duo and Nehalem i7 produced nearly the same results.
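The single-threaded case can be measured with a sketch along these lines (results will of course vary by CPU; atomics_per_second is a made-up helper):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Time N atomic adds to a single address on one thread.
inline double atomics_per_second(long iterations) {
    std::atomic<long> counter{0};
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        counter.fetch_add(1);  // compiles to lock xadd on x86
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return iterations / secs;
}
```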

Restating in Hertz, we have between 30MHz and 150MHz atomics to a single address on modern CPUs.  Considering that modern CPUs run at 2000-3000MHz, we have 1-2 orders of magnitude difference in performance between non-atomic and atomic ops.

This paper corroborates these numbers, and explains them in terms of the modified MESI cache-coherency protocols.  Atomics are performed by first acquiring exclusive ownership of a cacheline into a core's L1.  Uncontended accesses are faster because the exclusive acquisition happens only once, and all operations stay local to the core after that.  When multiple cores contend for the same cacheline, additional write-back-invalidate signals are sent as part of the transfer-of-ownership protocol, which adds latency.  That latency acts as a direct throughput limiter, since atomics block the core's memory pipeline.

ARM atomics require a sequence of instructions in an LDREX/STREX loop, which is an explicit use of MESI.

A method of analysis

Say you write this simple conversion function.
std::string ToString(bool x)
{
    return x ? "true" : "false";
}
std::string must call "new" to allocate memory.  With the default thread-safe allocator, which uses a lightweight mutex, we see this function costs one atomic instruction.  So, this function is at best a 30-150MHz function -- if you called it in a tight loop, it couldn't run faster than that.  Considering the only other operations are a strcpy of 4-5 bytes (maybe a few clock-cycles), it's clear the atomic dominates by at least an order of magnitude.  But we're still being too charitable; for every new we'll have a corresponding delete.  So really we have a 15-75MHz function.  Just to return a string!

The irony is that if you ran that same code on 10-year-old hardware, but with a single-threaded C runtime, it would certainly run faster.

This insane overhead of new/delete is why custom allocators, and malloc-replacements like TCMalloc, are so important.

Other Anti-Patterns

All of the following have something in common: they drag down the performance of single-threaded code in avoidable ways.

Passing a refcounted smart-pointer (like std::shared_ptr or boost::intrusive_ptr)  by-value.  This incurs extra refcount operations, and each one is an atomic add by +1 or -1.
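A sketch of the difference (Node, by_value, and by_ref are made-up names):

```cpp
#include <cassert>
#include <memory>

struct Node { int value = 0; };

// By-value: copying the shared_ptr costs an atomic increment on entry
// and an atomic decrement on exit.
inline int by_value(std::shared_ptr<Node> p) { return p->value; }

// By-const-reference: no refcount traffic at all.
inline int by_ref(const std::shared_ptr<Node>& p) { return p->value; }
```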

Passing objects like std::string and std::vector by-value.

Using a std::shared_ptr to manage the lifetime of an object that is only accessed by one thread at a time.

Fine-grained locking in general.  The previous item about shared_ptr is somewhat an example of this.

Take-aways

Treat atomics as "slow" operations, just as you would any other synchronization primitive.

Tuesday, January 12, 2016

C++ Assignment Operators Considered Optional

Summary

Conventional C++11 wisdom states that if a class needs a custom copy-constructor, it probably needs a custom copy-assignment-operator as well.  The same goes for move-constructors and move-assignment-operators.  This is widely known as the rule-of-3/5/0.  The rule is onerous, as you are effectively required to implement the same logic repeatedly.

As it turns out, any class with a noexcept move-constructor is assignable in terms of its constructors alone, via a function I dub auto_assign<T>.  The impact is not limited to simplifying class design (though that alone would be significant).  Reference-members and const-members of a class are only assignable by class-constructors, which precludes such classes from use in container-types like std::vector.  auto_assign-aware containers and libraries will afford a vast new degree of flexibility, with literally no down-side.

Why do we need this?

Quite simply, there are things that just don't work in C++ today. For example, you cannot stick this class in a std::vector.
struct ConstructOnly
{
    int& value;

    ConstructOnly(int& value_)
        : value(value_)
    {}
};
Seriously, you have to change that int& value to a pointer int* pValue just to make it copyable and movable.  Why?  Because reference member-variables are only assignable by the constructor.  You'll have similar grief with const members.
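We can verify what the compiler does here with a few type-traits checks (a sketch; the reference member deletes the implicit assignment operators, while copy-construction remains available):

```cpp
#include <cassert>
#include <type_traits>

struct ConstructOnly {
    int& value;

    explicit ConstructOnly(int& value_)
        : value(value_)
    {}
};

// The reference member suppresses the implicit assignment operators...
static_assert(!std::is_copy_assignable<ConstructOnly>::value, "no copy-assign");
static_assert(!std::is_move_assignable<ConstructOnly>::value, "no move-assign");
// ...but copy-construction still works: it just rebinds nothing, copying the reference.
static_assert(std::is_copy_constructible<ConstructOnly>::value, "copy-ctor is fine");
```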

And of course, who doesn't like doing half the work?  We can centralize all our copying and moving logic into constructors, rather than worrying about how to factor the code between constructors and operator=.

The Basic Idea

The basic idea behind auto_assign, is that C++ allows us to in-place destruct, then construct, any object.  Let's look at it for a concrete Foo, and copy-construction, to simplify matters.
void Usage()
{
    Foo lhs;                // assume succeeded
    Foo rhs = ...;          // assume succeeded
    foo_assign(lhs, rhs);
}                           // <-- lhs.~Foo() invoked by compiler
We can naively define foo_assign as follows:
Foo& foo_assign(Foo& lhs, Foo& rhs)
{
    lhs.~Foo();             // <-- explicit destructor call
    new (&lhs) Foo(rhs);    // <-- placement-new into lhs, with copy-constructor
    return lhs;
}
Let's call this version "destruct-then-construct".  The big problem here is that foo_assign is not exception-safe ... unless Foo's copy-constructor is noexcept.  If Foo's copy-constructor throws an exception, then we won't have a valid lhs object at the closing brace of Usage(), resulting in two destructor calls in a row on the same object.  And that is undefined behavior.

noexcept move-constructors come to the rescue!  We can rewrite foo_assign to be both correct and exception-safe:
Foo& foo_assign(Foo& lhs, Foo& rhs)
{
    Foo tmp(rhs);           // <-- copy-construct into temporary [may throw]
    lhs.~Foo();             // <-- explicit destructor call
    new (&lhs) Foo(std::move(tmp));  // <-- placement-new into lhs, with noexcept move-constructor
    return lhs;
}
Let's call this version "copy-destruct-move", which is a flavor of the well-known "copy-and-swap idiom".  It is slightly less efficient than destruct-then-construct, since an extra constructor and destructor are invoked, so we should only fall back to it when the constructor involved is noexcept(false).

Definition of auto_assign<T>

The general purpose auto_assign<T> must handle the following cases:
  1. Copy, where type T is assignable via operator=.
  2. Move, where type T is assignable via operator=.
  3. Copy, where type T is not assignable via operator=.
  4. Move, where type T is not assignable via operator=.
We want to handle these cases as follows, keeping "the basic idea" in mind:
  1. lhs = rhs;  // use operator if it exists
  2. lhs = std::move(rhs);  // use operator if it exists
  3. If the copy-constructor is noexcept, destruct-then-construct.  Otherwise, use copy-destruct-move.
  4. Destruct-then-construct.
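A cut-down sketch of the dispatch, covering the two move cases (the full auto_assign also handles the copy cases and the noexcept fallback; RefHolder is a made-up demo type):

```cpp
#include <cassert>
#include <new>
#include <type_traits>
#include <utility>

// Case 2: the type has a usable operator=, so just use it.
template <class T>
typename std::enable_if<std::is_move_assignable<T>::value, T&>::type
auto_assign(T& lhs, T&& rhs) {
    return lhs = std::move(rhs);
}

// Case 4: no operator=, so destruct-then-construct in place.
template <class T>
typename std::enable_if<!std::is_move_assignable<T>::value, T&>::type
auto_assign(T& lhs, T&& rhs) {
    static_assert(std::is_nothrow_move_constructible<T>::value,
                  "destruct-then-construct requires a noexcept move-ctor");
    lhs.~T();                              // explicit destructor call
    return *new (&lhs) T(std::move(rhs));  // placement-new with move-constructor
}

// Demo type: the reference member deletes operator=, but the move-ctor is noexcept.
struct RefHolder {
    int& ref;
    explicit RefHolder(int& r) : ref(r) {}
    RefHolder(RefHolder&&) noexcept = default;
};
```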
A full working version is available as part of my toy project here.  Getting the noexcept specifiers and template deduction right was tricky, but ultimately the code is pretty straightforward.

auto_assign<T>-aware swap()

This one's easy -- just replace "=" calls with auto_assign().  As a bonus, we'll also delegate to a member-swap function if it exists.
        template <class T>
        void swap(T& lhs, T& rhs, typename std::enable_if<!has_swap<T>::value>::type* = nullptr)
        {
            T tmp(std::move(lhs));
            auto_assign(lhs, std::move(rhs));
            auto_assign(rhs, std::move(tmp));
        }
        template <class T>
        void swap(T& lhs, T& rhs, typename std::enable_if<has_swap<T>::value>::type* = nullptr)
        {
            lhs.swap(rhs);
        }

auto_assign<T>-aware containers

The real power will come when containers support construct-only types.  This is no trivial exercise.  I hope to have some of these working "soon".


When is it safe to use auto_assign<T>?

It's safe to use in any context where you would legitimately use operator=.  For example, any time you have a fully-formed object that you wish to copy or move.  If you have a virtual operator= on your class, it'll work just fine too.

As an example of what not to do:  Do not use auto_assign to implement your own operator=.  When implementing operator=, follow conventional wisdom.

In an auto_assign<T>-aware world, should I ever implement operator= again?

IMO, the answer for most classes is "no, you do not need a custom operator=, but you should spend your effort on writing noexcept move-constructors instead".

operator= ends up becoming a mere optimization, to be applied where needed.  Performance-wise:

  1. auto_assign-moves imply the same number of destructor and constructor calls as a hand-written move operator=.
  2. auto_assign-copies invoke one additional move-constructor and trivial-destructor.
  3. BUT the compiler may have certain in-built optimizations for operator= that it cannot apply with destruct-then-construct sequences.  This one's hard to quantify.

Concluding Remarks


Without wide-ranging library support, it's too early to start switching all your code over.  But, if this catches on with boost and/or STL, we'll be living in a brave new world of C++.

Typed Exceptions are Pointless

Summary

Major languages like C++, Java, C#, and ML allow you to throw exceptions of different types.  This is a needless complication; programs only need a type-less "signal" that indicates whether a computation succeeded or failed.

In this simpler model of type-less exceptions, we can do the following in C++:
  • Continue to use RAII to protect resources.
  • When you must use try/catch, the only legitimate catch-handler is catch(...) .
  • Decorate functions with noexcept wherever possible.

Why do exceptions exist?

Before discussing the topic at hand, we must first understand: why do programming languages need exceptions?
  1. In C++, the only way to fail an object's constructor, such that the caller never sees a fully formed object, is by throwing an exception.
  2. Copy/move-constructors and assignment-operators can only fail via exception.
  3. We want to write "expressive" code, making use of math operators, method-chaining, and the like.
  4. STL and boost types, including popular classes like std::vector and std::function, use exceptions.
Notice that we can avoid the use of exceptions in all cases:
  1. write trivial noexcept constructors + Init functions
  2. manually write clone functions
  3. use error-codes, nullables/optionals, or other alternative error-handling strategies
  4. roll our own container types, or use something like EASTL
So technically, exceptions aren't necessary.   But, if we want to write code a certain way, we must use them with all their associated baggage.  It's a classic engineering trade-off.

Getting to the point.

Let's say we have to calculate the sum of numbers in a file.  The requester says "just give me a number, that's all we want to store in the database".  We start by writing this function:
Number CalculateSum(const char* pDirName)
{
    try {
        boost::filesystem::path filePath(pDirName);
        filePath /= "numbers.txt";
        std::vector<Number> numbers = ReadNumbers(filePath);
        Number sum = std::accumulate(numbers.begin(), numbers.end(), Number(0));
        return sum;
    }
    catch (const std::bad_alloc& e) {
        // ... handle out-of-memory? which allocation? ...
    }
    catch (const std::ifstream::failure& e) {
        // ... handle filestream failure? what filestream? ...
    }
    catch (const std::overflow_error& e) {
        // ... handle overflow? which subexpression? ...
    }
    // "they" said catch(...) is evil, but what if another type were thrown?
    return Number(std::nan(""));
}

We're doing some file I/O (in ReadNumbers), allocating memory (inside boost::filesystem::path and std::vector), doing some math ... many different kinds of actions are going on here.  Semantically, a failure in any of those steps is a total failure in CalculateSum.  Our only recourse is to take appropriate action at our current layer.

To reiterate: At this scope, we cannot know which step failed.  Even if we did, we couldn't do things any differently.  This is why I say that typed-exceptions are pointless.  The type and contents of the exception are completely useless; we only care about whether an exception was thrown at all.

Let's rewrite that function.
Number CalculateSum(const char* pDirName) noexcept
{
    try {
        boost::filesystem::path filePath(pDirName);
        filePath /= "numbers.txt";
        std::vector<Number> numbers = ReadNumbers(filePath);
        Number sum = std::accumulate(numbers.begin(), numbers.end(), Number(0));
        return sum;
    }
    catch (...) {
        return Number(std::nan(""));
    }
}

Notice that this time, our function is marked "noexcept", and indeed we are sure that no exceptions will leak out.  catch(...) is the way of transforming a throwing function into a noexcept function.

If you're still not convinced that catch(...) is the proper approach, consider what the introduction of noexcept in C++11 implies.  noexcept is a boolean concept, supplanting the multi-typed throw()-specifier.  The noexcept boolean argument can be calculated on template functions, whereas no such facility exists for throw()-specifiers.  noexcept in C++17 is also becoming part of a function's type, paving the way for compile-time checking of exception propagation.  Overall the language now considers "whether an exception was thrown" to be more important than the type of the exception.
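The noexcept operator makes this concrete: it queries the boolean "can this throw?" property at compile time, with no mention of exception types (a sketch; relay is a made-up template):

```cpp
#include <cassert>

void may_throw();
void no_throw() noexcept;

// noexcept(expr) is a compile-time boolean -- the language now tracks
// *whether* a call can throw, not what type it would throw.
static_assert(!noexcept(may_throw()), "plain functions may throw");
static_assert(noexcept(no_throw()), "noexcept functions may not");

// The boolean can be computed for templates, which throw()-specifiers never allowed.
template <class T>
void relay(T& x) noexcept(noexcept(x.f())) { x.f(); }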

I will take this opportunity to point out that if you're with me so far, you'll agree that this section of the C++ FAQ about exceptions  is dispensing really bad advice.  For extra entertainment value, consult the corresponding section of the C++FQA.

Finally, note that RAII is still the preferred way of protecting resources.  In the example, we still use a std::vector to avoid leaking memory, and so on.

Compile-time errors are still missing?

That's right.  Unfortunately, the standards committee chose the runtime-checked approach, which was (and still is) highly contentious.  This SO post sheds some light, although ultimately it was a hasty decision that simply is.

IMO we could live with a stop-gap of compile-time checked warnings whenever a possibly-throwing expression appears in an unguarded section of a noexcept function.  This shouldn't be difficult to implement, perhaps even as a clang tool or plugin.

What types of exceptions should I throw?

Any type you want!  Just don't catch them by type.

I suggest sticking to std::exception or a type derived from it.  Since the type will be ignored by the catch-handler, only the std::exception::what() method will be relevant for diagnostics.

e.what() about std::exception?

std::exception has a const char* what() const method, which can be used to print a diagnostic message in a catch-handler.  Since catch(...) doesn't provide access to the exception object, what can we do?  One option is to repeat the error-handling in both a catch-std::exception and catch(...) block:
    try {
        return foo();
    }
    catch (const std::exception& e) {
        std::cerr << "caught exception: " << e.what() << std::endl;
        return Error();
    }
    catch (...) {
        std::cerr << "caught unknown exception" << std::endl;
        return Error();
    }
But that's way more verbose, and it's the exception to the rule of "the only legitimate catch-handler is catch(...)".  One alternative is to hide the try/catch inside a function, like so:
template <class TCatch, class TTry>
auto try_(TTry&& try_block, TCatch&& catch_block) -> decltype(try_block())
{
    try {
        return try_block();
    }
    catch (const std::exception& e) {
        return catch_block(e);
    }
    catch (...) {
        return catch_block(std::exception());
    }
}

// and use it like so
    return try_(
        [&]() {
            return foo();
        },
        [&](const std::exception& e) {
            std::cerr << "caught exception: " << e.what() << std::endl;
            return Error();
        });

I'm not particularly advocating this technique, but it has two interesting properties:
  1. It guarantees your catch-handler sees a std::exception.
  2. It makes try/catch usable in expressions, whereas the keywords are limited to statements.

Additional Resources

The Go language uses a non-typed exception mechanism, described in Defer, Panic, and Recover.

The Java language has the option of checked exceptions, the moral equivalent of C++ throw()-specifiers, but checked at compile-time.

C++ exception specifications are evil.