Monday, January 2, 2017

Design for Debug: An Overview

Software is never perfect, and it's never done.  Bugs always creep in, and tend to reveal themselves towards the end of the development cycle when multiple modules are integrated together.  In the inevitable debugging sessions that follow, do you expect a long hellish grind of guesswork, tweaks, and recompiles?  Or, are there fast debugging methods, and ways to preemptively ease the troubleshooting?

This is where "Design for Debug" (DFD) comes in.  The idea is to follow a set of coding practices that preemptively makes code easier to understand and debug.  The analogous practice in the chip design world is Design for Testing (DFT).  I strongly believe that DFD is a key component that separates 10x (or 100x) programmers from the pack.  Not only does it make them more productive; it makes their colleagues more productive since anyone on the team can more easily debug issues in their code.

State Observation

A bug must first be observable.  Given some inputs, the program must produce an incorrect or unexpected output (including crashing).  The primary observation methods are:
  • inspecting program-visible inputs and outputs
    • includes: keyboard/mouse/microphone input, input/output files, graphical and audio output
    • Pros: no changes to the program needed
    • Cons: no intermediate state available
  • trace logging
    • Pros: shows the evolution of a narrow slice of intermediate state over time
    • Cons: requires modification to multiple locations in code; observation limited to state that was selected ahead of time
  • interactive debugger
    • Pros: shows the full program state at one instant in time
    • Cons: time consuming to observe evolution of state over time
Notice how logging and debugging complement each other, in a type of space-time tradeoff.  It's up to us to choose the right tool for the job, depending on the situation.
  • Do you need to compare data values across 1000s of iterations?
    • add a logging statement
    • why?  it would take forever to single-step 1000s of iterations
  • Are you trying to determine which codepath was taken through some extremely branchy code?
    • step through it with a debugger
    • why?  it would be tedious to manually add a logging statement to 10s of code branches
  • Want to know why a function or call-chain is failing?
    • set a breakpoint on every suspected "return" and "throw" statement, see which one hits
    • why?  it's faster to set a handful of breakpoints than to manually add logging and wait for a recompile
  • Do you need to observe control flow over large swaths of code over time?
    • set breakpoints and "binary search" the error, or fall back to adding logging statements
    • Breakpoints can save you recompiles, but if you keep missing the error location, then thorough logging may be the only recourse.
Of course, all of the above presumes that the code is structured to permit these techniques easily.
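To make the trace-logging half of the tradeoff concrete, here's a minimal sketch; the TRACE macro, its format, and the ENABLE_TRACE switch are all invented for illustration, not taken from any particular library:

```cpp
#include <cstdio>

// Illustrative trace macro: logs file, line, and a printf-style message.
// Compiles away entirely when tracing is disabled.
#ifdef ENABLE_TRACE
#define TRACE(fmt, ...) \
    std::fprintf(stderr, "[%s:%d] " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define TRACE(fmt, ...) ((void)0)
#endif

int ComputeChecksum(const int* data, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += data[i];
        TRACE("i=%d value=%d sum=%d", i, data[i], sum);  // narrow slice of state, over time
    }
    return sum;
}
```

Note how the trace statement selects the slice of state ahead of time (i, the element, the running sum), which is exactly the limitation and the strength called out above.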

Coding Guidelines

Each of the following deserves its own article with code samples, so this is more of a braindump.

  • Place intermediate values in named variables.
    • Avoid long chained calls and excessive function composition within a statement.
    • You can easily see variables in a debugger.  You can't see compiler generated temporaries or return values.
    • If you want to log a value, make sure it's in a variable so that the computation and log statement don't go out-of-sync.
  • Calculate the return value into a local variable before returning it.
    • To set a breakpoint, you need a line of code to set it on.
    • To see the return value, you need a line where it's visible.
  • Clearly separate error and success paths.
    • Write separate return or throw statements for errors.
    • Try to end the success path at the bottom of the function.
    • Avoid single exit point style; in C++ it's an anti-pattern.
      • Historically the "Single Entry Single Exit" style was introduced to add structure to assembly, FORTRAN, and C, focused on resource cleanup.  It's obsolete in C++, which has exceptions and generally requires RAII for cleanup.  See here.
      • Herb Sutter would agree.
      • RVO is not a defensible argument for single-exit style.  There are two cases:
        • The function is perf-critical, and thus made inline-able.  Inline context renders RVO moot since the backend compiler can do far better.
        • The function is not perf-critical, and thus placed in a separate compilation unit.  Now RVO is in play, but since we already assumed that micro-optimization is a non-goal, we can prioritize maintainability and debuggability over RVO.
  • Use consistent naming across the codebase.
    • If all variables of type Foo are written "Foo* pFoo" then a single watch expression of "pFoo" will reveal that state in every scope.  Any sub-expressions you want to see in every context (like "pFoo->pBar->thing") will also re-evaluate cleanly in every context.
    • On the flip side, the use of different variable names in every context requires retyping every expression/sub-expression in every context.  Wastefully inefficient with zero benefit.  (the speed of thought reduced to repeated manual operations)
  • Make it easy to copy/paste sub-expressions from source code into the debugger's expression evaluator.
    • Avoid chain-calling accessor functions.  Many debuggers cannot evaluate those functions, either as an inherent limitation, or because inline accessors have no out-of-line instance in the executable that is callable at debug time.
    • Pass raw pointers and references to functions whenever possible.  (as opposed to smart pointer references)
  • Avoid function composition at callsites, to make it easier to step into any function call.
    • "Step in" must first go into every argument-producing call before finally going into the "actual function" of interest.
    • Anyone who's used a debugger has done the tedious "step in dance" -- "step in", "step out", "step in", "step out", "step in" ... until the function-of-interest is reached.  Of course this is very error prone since one extra "step out" skips over the function-of-interest, which usually requires a program restart (i.e. a complete waste of time).
    • In Visual Studio there is a workaround that lets you mark functions as "NoStepInto".  It's useful for leaf functions like getters/setters and conversion operators, but little else.
    • Strongly related to placing intermediate values in named variables.
  • Avoid excessive wrapper and forwarding functions, to make it easier to step into the "actual logic".
    • Basically, if you can eliminate wrapper layers, then do so.  Kind of obvious.
  • Use data types that show up cleanly and easily in the debugger.
    • Prefer structs with named fields over std::pair and std::tuple.  std::pair and std::tuple implementations use deep inheritance, either to process a type-list, or to implement the empty base class optimization.  In a debugger, base classes are displayed similar to member variables by introducing a nesting level; to see all elements of a tuple requires excessive child expansion (i.e. many mouse-clicks).
    • Consider using virtual functions instead of "bags of function pointers" or std::function.  Debuggers can deduce the derived-most type of any class/struct that has a vtable, and show the full contents automatically.  Code browsing works better on virtuals too (e.g. "find all references" to jump between pure virtual declaration and its derived implementations).  Note that superior implementations of std::function would obviate this choice, by storing internal state in a debuggable fashion (at least in debug builds!).
    • Other container types suffer this problem; even with custom debugger visualizers the result isn't always workable.
    • The problems aren't unique to Visual Studio; gdb has similar issues.
    • Debugger visualizers are not a substitute for debuggable data structures.  Visualizers show values but not addresses, making it impossible to compare local pointers/references to elements, and impossible to plug into a memory viewer.  Child elements of visualized data trees cannot be evaluated as watch expressions.
  • When iterating arrays and std::vectors, prefer integer indexing over alternatives.
    • The debugger shows array and vector contents by integer index.
    • Typical debug sessions involve setting a breakpoint in a loop, looking for a bad input element.  Upon discovery of the bad element, you need to know its index to "work backwards" towards the code that generated it.
    • In the debugger, if you already know the bad array element (by inspecting array contents after a loop), you can then set a conditional breakpoint in the loop body on "index == BAD_INDEX" and rerun the program.  Of course this requires an "index" variable in the code.
    • Iterators are not debuggable!  You cannot trivially determine the array index from a vector iterator.  Conditional breakpoints cannot perform iterator arithmetic like "itr - vec._Head" in most debuggers either.
    • Range-based for-loops use compiler-generated iterators, which reduces to the same problem.  Only use range-based for-loops for trivially correct code.
    • Avoid std::for_each.  In almost every case, it's a worse way of writing a range-based for-loop.
    • Cons: This approach results in slightly more verbose code.
    • Minor pro: index-based loops are friendlier to the optimizer by easing alias analysis. Index-based loops are auto-vectorizable, whereas iterator-based loops are not.  This includes the use of pointers as iterators.  See here and here.
  • Give functions unique names whenever possible.
    • Don't excessively overload terse names, or reuse the same function name across namespaces.
      • It's difficult to set breakpoints when you don't know which function is going to be called.
      • Reused names are not grep-friendly.
      • Even sophisticated code-browsing tools often fail to find the right overload for a given callsite.  They may resort to providing tons of options, causing trial & error navigation or missed breakpoints.
    • Of course the guideline doesn't apply when required by interface constraints (compile time, virtual, and meta-programming name requirements are all "interface requirements").  It only behooves us to choose semi-unique names per interface, to avoid collisions across different interfaces.
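Several of the guidelines above can be pulled together in one short sketch (the Sample type and CountValidSamples are invented for illustration):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Named-field struct instead of std::pair: shows up cleanly in a watch window.
struct Sample {
    int         id;
    std::string label;
};

// Error path returns early and separately; success path ends at the bottom;
// intermediates live in named variables so they are visible and breakpoint-able.
int CountValidSamples(const std::vector<Sample>& samples)
{
    if (samples.empty()) {
        return 0;                        // separate, breakpoint-able early exit
    }
    int validCount = 0;
    for (std::size_t index = 0; index < samples.size(); ++index) {  // integer indexing
        const Sample& sample = samples[index];                      // named intermediate
        const bool isValid = (sample.id >= 0) && !sample.label.empty();
        if (isValid) {
            ++validCount;
        }
    }
    const int result = validCount;       // result in a local before returning
    return result;
}
```

A conditional breakpoint on "index == BAD_INDEX" or a watch on "sample" works in any debugger here, precisely because every intermediate has a name.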

Concluding Remarks

If you've ever debugged through someone else's code and gotten upset, it's very likely because they violated some of the guidelines listed above.  You may not have even realized why you were upset, other than "this code is sh*t and so difficult to work with".  Hopefully by building up the list of patterns and anti-patterns, we can all consciously recognize and promote a software style that makes everyone more productive.  Remember, software development is both art and science.

Monday, May 30, 2016

C++ Pass-By-Value a Passing Fad?

Argument passing seems basic, but is surprisingly nuanced in C++.  Conventional C++ wisdom dictates that pass-by-reference is the best default way to pass parameters.  For C++03, Scott Meyers' book "Effective C++" has an item "Prefer pass-by-reference-to-const over pass-by-value", and that was the gold standard.  C++03 only had one kind of reference (const Foo&) so that just made sense.

C++11 introduced rvalue-references (Foo&&) to support move-semantics, and now there are (at least) 2 kinds of references.  With that came a lot of confusing and conflicting advice on how to best take advantage of the performance benefits of move-semantics.  One counter-intuitive piece of advice is to use pass-by-value to gain performance.  Examples of such information:
In general my default recommendations have always been:
  • pass built-in types by value (char, int, float, double, etc.)
  • pass read-only arguments as reference-to-const
  • pass "sink" arguments as rvalue-references
  • pass mutable arguments as lvalue-references (otherwise known as "out params" and "inout params")
  • never pass by value (it's a code smell and most functions don't need to sink values)
Passing references is always a safe default, because in the optimal case they collapse to nothing, and in the pessimal case they incur one extra indirection.  Whereas passing values may optimally collapse to nothing (if special optimizations kick in), but in the pessimal case will incur a huge penalty (copy construction).  Even simple types like std::vector and std::string have huge copy penalties due to the extra new & delete, and unnecessary copying is a major anti-pattern.  Pass-by-reference also avoids the famous "slicing problem" (passing a derived object by value as a base class, which copies only the base portion).
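The default recommendations above, sketched as code (Widget, WidgetStore, and the function names are invented for illustration):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Widget {
    std::vector<std::string> names;
};

// built-in types: pass by value
double scale(double x, double factor) { return x * factor; }

// read-only argument: reference-to-const
std::size_t nameCount(const Widget& w) { return w.names.size(); }

// mutable argument (an "inout param"): lvalue-reference
void addName(Widget& w, std::string&& name) { w.names.push_back(std::move(name)); }

// "sink" argument: rvalue-reference -- callers must write std::move explicitly
struct WidgetStore {
    std::vector<Widget> widgets;
    void store(Widget&& w) { widgets.push_back(std::move(w)); }
};
```

Note that store() cannot be called with a plain lvalue; the caller has to spell out std::move, which keeps copies and moves visible at every callsite.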

However, I've learned to be skeptical of "always" and "never" arguments -- and thus am obliged to examine and prove my own views.  So, I'm breaking down the various cases and seeing where pass-by-value might make sense.

Separate Compilation

This is an intrinsically suboptimal case (compared to inlining), and not particularly worth being clever about.  The compiler cannot optimize across the call boundary, so restrictive ABI rules are followed (Windows MSVC and Linux) -- which have this common behavior:
  • built-in-type values are passed in registers (integers, pointers, floats)
  • C++ references are implemented as pointers
  • non-POD struct values are forced to caller stack-memory, and passed by invisible reference (note: on MSVC even POD obey this rule)
  • [linux only] POD struct values are efficiently broken down into scalar components, and each scalar is passed in the next available register
Since non-POD structs at the ABI level require pass-by-reference, they are forced to memory, and we cannot gain performance with pass-by-value.  The only remaining concerns are code design at the language-level, and exception guarantees (which I'll get to).

Passing by value across a non-inline boundary guarantees either a copy or a move.  Some (like Abrahams) have argued that if the function needs to make a copy anyway, then pass by value and let the copy occur in the caller.  For extremely simple cases, this might collapse 2 overloads into one:
    void Foo::make_copy(Bar bar);
// versus
    void Foo::make_copy(const Bar& bar);
    void Foo::make_copy(Bar&& bar);

But the solution doesn't generalize to multi-argument functions.

Pass-by-Value doesn't always scale

Consider operator+, which is typically a free-function that builds on operator+=.  The prevailing answer on StackOverflow's C++ FAQ is pass-by-value on the first arg.  Quoting the code directly:
class Foo {
  Foo& operator+=(const Foo& rhs)
  {
    // actual addition of rhs to *this
    return *this;
  }
};
inline Foo operator+(Foo lhs, const Foo& rhs)
{
  lhs += rhs;
  return lhs;
}
Unfortunately, if you follow this advice and then happen to write code like this ...
const Foo a;
Foo c = a + Foo(3, 4);
... you'll end up with a copy (a into lhs) plus a move on return, while the temporary Foo(3, 4) is never moved from -- clearly pessimal!  Why?  Because the second argument has no provision for handling temporaries as inputs. Since operator+ is commutative, we can try to patch over this with an additional overload:
inline Foo operator+(Foo lhs, const Foo& rhs);  // see above
inline Foo operator+(const Foo& lhs, Foo rhs)
{
    rhs += lhs;
    return rhs;
}
But whoops, now we'll get a compile-error "ambiguous call".  The only way to resolve this?  Abandon pass-by-value and write every possible pass-by-reference overload:
inline Foo op_plus(Foo&& lhs, const Foo& rhs)
{
    Foo result = std::move(lhs);
    result += rhs;
    return result;
}
inline Foo operator+(      Foo&& lhs, const Foo&  rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(      Foo&& lhs,       Foo&& rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(const Foo&  lhs,       Foo&& rhs) { return op_plus(std::move(rhs), lhs); }
inline Foo operator+(const Foo&  lhs, const Foo&  rhs) { return op_plus(Foo(lhs), rhs); }
This solution performs the optimal number of copies and moves in every case.
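We can sanity-check that claim with a small copy/move-counting type.  This is a self-contained version of the overload set (note that named rvalue references are lvalues, so std::move -- not unqualified std::forward, which wouldn't compile -- is used to pass them along; the counting Foo is invented for the test):

```cpp
#include <utility>

// Counting type to verify copy/move behavior of the overload set.
struct Foo {
    static int copies;
    static int moves;
    int value = 0;

    explicit Foo(int v) : value(v) {}
    Foo(const Foo& o) : value(o.value) { ++copies; }
    Foo(Foo&& o) noexcept : value(o.value) { ++moves; }
    Foo& operator+=(const Foo& rhs) { value += rhs.value; return *this; }
};
int Foo::copies = 0;
int Foo::moves = 0;

inline Foo op_plus(Foo&& lhs, const Foo& rhs)
{
    Foo result = std::move(lhs);   // one move; no copy of the temporary
    result += rhs;
    return result;
}
inline Foo operator+(      Foo&& lhs, const Foo&  rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(      Foo&& lhs,       Foo&& rhs) { return op_plus(std::move(lhs), rhs); }
inline Foo operator+(const Foo&  lhs,       Foo&& rhs) { return op_plus(std::move(rhs), lhs); }
inline Foo operator+(const Foo&  lhs, const Foo&  rhs) { return op_plus(Foo(lhs), rhs); }
```

With an lvalue plus a temporary, zero copies occur; only the unavoidable lvalue-plus-lvalue case pays a single copy.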

Exception-safety and noexcept

Herb Sutter provided this "example #3", which I'm pasting from this SO post:
class employee {
  std::string name_;
public:
  void set_name(std::string name) noexcept { name_ = std::move(name); }
};
Sutter points out (here in the video) that the function is marked noexcept, yet a call to it may throw!  Why?  Because value arguments are constructed in the caller's context before the function is called, so a (throwing) copy-constructor will throw before the "noexcept block" has been reached.  It's especially insidious because the calling code doesn't make it obvious:
std::string str = ...;
emp.set_name(std::move(str));    // cannot throw
emp.set_name(str);               // may throw
Sutter's "example #2" has both const-ref and rvalue-ref overloads:
  void set_name(const std::string&  name)          { name_ = name; }
  void set_name(      std::string&& name) noexcept { name_ = std::move(name); }
That's a better option, since at least the const-ref overload is not marked noexcept. I could live with that. The same code above compiles, but at least you can see in the code (and backtrace in the debugger) that a throwing function was called.
Can we do better? If we required ourselves to only pass "sink" arguments as rvalue-ref, we would only retain this overload:
  void set_name(      std::string&& name) noexcept { name_ = std::move(name); }
Now all callers must be explicit, and the risk-of-exception is much clearer in the calling context.
std::string str = ...;
emp.set_name(std::move(str));    // cannot throw
emp.set_name(std::string(str));  // the copy is explicit in the source, and it may throw
emp.set_name(str);               // doesn't compile
By following this convention, we reserve the simplest syntax "set_name(str)" for arguments that are passed by const-ref; moves and copies are always explicit.  And that makes scanning the code for performance issues or exception-throwers that much easier.

Inline-able Code

Now we're talking about actual optimization.  The compiler can see across the function-call boundary, and collapse a lot of code down into nothing.  The inliner and backend optimizer is especially good at collapsing reference arguments back to the original variable. Consider the following code:
void Direct()
{
  int aa = 1;
  int bb = 2;
  int cc = aa + bb;
}
// split into 2 calls
inline int Add(const int& lhs, const int& rhs)
{
  return lhs + rhs;
}
void CallAdd()
{
  int aa = 1;
  int bb = 2;
  int cc = Add(aa, bb);
}
Both "Direct" and "CallAdd" will result in the same object code on every major compiler (gcc, MSVC, clang).  Which means, you don't need to sweat the small stuff.  Feel free to use references to pass trivial and non-trivial types ... they won't end up in memory.  Note also that perfect-forwarding (using T&& arguments and std::forward) relies on these same optimizations.

How does this work?  Clang + LLVM handles this in an elegant way.  The clang frontend faithfully outputs naive LLVM-IR for each function.  When a variable must be passed by reference, the caller defines the variable with an "alloca" and the callee takes in a pointer.  See a simplified example here.  Now LLVM (the backend) takes over, and starts trying to inline and then simplify.  One of the most important optimization passes is mem2reg, which transforms a memory-oriented alloca+store+load into a value (virtual register).

mem2reg is such a critical transformation that this paper indicates it alone provides on average an 81% speedup of their benchmark, whereas enabling all remaining optimizations yields 102% (a comparatively small difference).  Why do I mention this in particular?  Because when you write your code in a straightforward manner, mem2reg is an optimization you can rely upon.  This is generally not the case for many other optimizations, especially across compilers/versions/vendors.

Aliases mess up inlining

The inliner works well with pure code and values on the stack.  Mix impure logic, like necessary side-effects into heap-allocated memory or global variables, and all those optimizations can disappear.  Consider the following tweak to the inline-able code:
// same shape as Add before, now taking Foo
inline Foo Add(const Foo& lhs, const Foo& rhs)
{
  return lhs + rhs;
}

Foo* g_pFoo = nullptr;
void CallAddWithGlobal()
{
  Foo aa = 1;
  Foo bb = 2;
  g_pFoo = &aa;
  Foo cc = Add(aa, bb);
}
Now that a global variable is pointing at our local variable, aa must be stored into memory.  There's also a chance that the value will be loaded from memory when this invocation of Add() is inlined, since aliases can disable trivial mem2reg collapsing.

Does this mean the const-ref argument to Add() was a mistake?  Definitely not!  It just means that CallAddWithGlobal() needs to be written better.  One easy workaround is to take a separate snapshot of variable aa before assigning g_pFoo -- that way Add() can go back to being inlined.
void CallAddWithGlobal()
{
  Foo aa = 1;
  Foo aa_copy = aa;
  Foo bb = 2;
  g_pFoo = &aa;
  Foo cc = Add(aa_copy, bb);
}

Other ways to mess up Inlining

This is kind of a tangent, but I found that classes with destructors completely suppress inlining in MSVC!  Passing a const Foo& into a function would always keep that function non-inlined in the assembly code.  This was true on 2013 and 2015 compilers.  Clang+LLVM did not have this issue.

Conclusions

I have examined a large number of cases, only to arrive at the same conclusion -- pass-by-value is an anti-pattern.  For every case where pass-by-value seems reasonable, there is an equal or better solution that only uses pass-by-reference.  In a few cases, pass-by-value may result in fewer lines of code, but it doesn't generalize.

Tuesday, April 12, 2016

Code Consistency and Refactoring

Until your code is perfect and done, there's a good chance you will be refactoring or rewriting some portion of it.  When that time comes, will you dread the coming changes, or embrace them with open arms?  The answer may depend on how consistent your code is.

The most maintainable code uses the same lexical names and code patterns in similar contexts, to allow for seamless code refactoring.  In other words, in a good codebase it's trivial to copy/paste from any function to another, to either compose or decompose a function.  I encourage you to try refactoring each of the examples below into the goal, given only the starting code, and either time yourself or imagine the steps involved.

Goal

    void func(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        pThing->doSecond();
        pThing->progress += pOther->progress2;
        pThing->doThird();
        pThing->progress += pOther->progress3;
        pThing->doFourth();
        pThing->progress += pOther->progress4;
    }

Example 1

    void funcA(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        funcB(*pThing, pOther);
    }
    void funcB(Thing& thing, Other* pOther)
    {
        thing.doSecond();
        thing.progress += pOther->progress2;
        funcC(*pOther, thing);
        funcD(&thing, *pOther);
    }
    void funcC(Other& b, Thing& a)
    {
        a.doThird();
        a.progress += b.progress3;
    }
    void funcD(Thing* thing, Other& other)
    {
        thing->doFourth();
        thing->progress += other.progress4;
    }

Example 2

    void funcA(Thing* pThing, Other* pOther)
    {
        if (!pThing || !pOther) {
            return;
        }
        pThing->doFirst();
        pThing->progress += pOther->progress1;
        funcB(pThing, pOther);
    }
    void funcB(Thing* pThing, Other* pOther)
    {
        pThing->doSecond();
        pThing->progress += pOther->progress2;
        funcC(pThing, pOther);
        funcD(pThing, pOther);
    }
    void funcC(Thing* pThing, Other* pOther)
    {
        pThing->doThird();
        pThing->progress += pOther->progress3;
    }
    void funcD(Thing* pThing, Other* pOther)
    {
        pThing->doFourth();
        pThing->progress += pOther->progress4;
    }

Which one was easier?

I'm gonna go out on a limb and say Example 2.  It's a near-zero effort set of line-level copy&pastes.  On the other hand, Example 1 requires tons of manual edits, changes in indirection, recognizing the same variable name used for both a pointer and reference (ugh, even harder to search & replace) ... the list goes on.

Pointers vs References

On a related note, this code should provide a guideline for when to prefer pointers and references.  Prefer pointers if any other code operating on the type also uses pointers.  Otherwise, use references.  Loosely speaking, this means using pointers for any heap-allocated "objects with identity", while using references for "value-type" instances defined on the stack.

This is not a rant against references -- they are in fact extremely useful for perfect-forwarding, move-semantics, etc.  This is a rant against mixing the use of references and pointers for the semantically same variable in different contexts.

My advice flies in the face of advice online or in the ISO C++ FAQ that heavily advocate for references wherever possible (at the expense of consistency).  What I say here is -- those guys don't have to maintain your code, but YOU do.  So make the smart choice!

Sunday, March 27, 2016

Reference Counting != Garbage Collection

Garbage collection (aka GC) allows for the non-deterministic destruction of resources.  Reference-counting (aka refcounting) provides guaranteed deterministic destruction of resources.  These are fundamentally different strategies (each with their uses), yet I still see Wikipedia and plenty of literature and blogs conflating the two.

Non-deterministic destruction via GC is good for managing memory, and little else.  It's especially good for cyclic object graphs, which refcounting can't automatically handle.  It is notably very poor at managing I/O resources like file handles -- that's where determinism is an absolute requirement.

Refcounting is a natural extension of object destructors (like in C++), and works naturally with RAII.

Ultimately, if you don't have both features available, things get awkward:

  • In C#, you must manually inherit from IDisposable, and write a Dispose() function at every level of your object graph.  This is something the compiler writes for you in C++!  (in C++, every object's destructor automatically calls every member's destructor)
  • In C++, you must manually break reference cycles, either with the performance-killing weak_ptr, or by manually iterating the object graph to break all cycles from a well-defined moment.
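A sketch of the C++ side of that first bullet: destructors compose automatically, so deterministic cleanup needs no hand-written Dispose() chain.  FileHandle here is an invented stand-in, not a real OS wrapper; it counts opens and closes so the determinism is checkable.

```cpp
// Stand-in RAII resource: records open/close to demonstrate determinism.
struct FileHandle {
    static int openCount;
    FileHandle()  { ++openCount; }
    ~FileHandle() { --openCount; }   // deterministic release, no Dispose() needed
};
int FileHandle::openCount = 0;

// The compiler-generated destructor of Session destroys both members
// automatically -- the per-level Dispose() chain C# requires is implicit here.
struct Session {
    FileHandle log;
    FileHandle data;
};
```

When a Session goes out of scope, both handles are released at that exact moment, not whenever a collector gets around to it.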

Sunday, January 31, 2016

C++: When is extern not extern?

The extern storage class specifier in C++ makes a symbol visible outside the file where it's defined, or helps to declare a symbol that is defined elsewhere.  These symbols have "external linkage".
extern int foo();       // declaration
extern int bar = 3;     // definition
int baz()               // definition, which is extern by default
{
    return 4;
}

So what is the exception to the rule?  Anything declared or defined in an anonymous namespace (aka unnamed namespace) has "internal linkage", like a namespace-level static variable.  In other words, it cannot be accessed outside the current translation unit, as mentioned here.   You can, however, still use extern in an anonymous namespace, and there are actually useful things you can do with it!

Breaking Circular Dependencies

Let's say you need to declare a global variable before using it, but you want to keep the symbols private to the file to avoid name-collisions.  Let's try to use the 'static' keyword:
struct Funcs { int(*fn)(void); };

static Funcs global_var;
int UseGlobalVar()
{
    return global_var.fn();
}
static Funcs global_var = { &UseGlobalVar };
This might have been fine in C, but it doesn't compile in C++ -- we get error: redefinition of 'global_var' from the compiler, because there's no way to forward-declare a static variable in C++.  We could change 'static' to 'extern', but then it wouldn't be file-private.  The solution?  Anonymous namespaces.
namespace {
    extern Funcs global_var;  // fwd-decl with extern, but has internal linkage
}
int UseGlobalVar()
{
    return global_var.fn();
}
namespace {
    Funcs global_var = { &UseGlobalVar };
}
Even though global_var is marked extern, the anonymous namespace has precedence and gives it internal linkage, and everything works out.

Non-Type Template Arguments

You're probably familiar with non-type template parameters.  For example, in std::array, the "size_t N" parameter is a value, not a type.

You probably also know that object-pointers/references and function-pointers/references are valid non-type template parameters.  But did you know that in C++03, any argument you plug into those must have external linkage?!  Even worse, many compilers didn't get the memo that this rule was relaxed in C++11.  As a contrived example, let's say we want to pre-bake a constant double by reference:
template <const double& Addend>
double add(double x)
{
    return x + Addend;
}

double one = 1.0;
static double two = 2.0;
const double three = 3.0;
namespace {
    double four = 4.0;
    const double five = 5.0;
    extern double six = 6.0;
}

int main() {
    printf("%f\n", add<one>(0));     // #1: ok
    printf("%f\n", add<two>(0));     // #2: error
    printf("%f\n", add<three>(0));   // #3: error
    printf("%f\n", add<four>(0));    // #4: ok
    printf("%f\n", add<five>(0));    // #5: error
    printf("%f\n", add<six>(0));     // #6: ok
}
From MSVC 2015 we get errors like:
program.cpp(36): error C2970: 'add': template parameter 'Addend': 'two': an expression involving objects with internal linkage cannot be used as a non-type argument
Let's go through each case:
  1. namespace-level objects have external linkage by default ==> ok
  2. static at namespace level implies internal linkage ==> error
  3. const at namespace level implies internal linkage ==> error
  4. surprise!  same as #1
  5. same as #3
  6. surprise!  same as #1 
Yes, even though "four" and "six" are in an anonymous namespace (and therefore have internal linkage), they are "just extern enough" to be usable in a non-type template argument.  And this is truly when extern is not extern.

Sunday, January 17, 2016

a critique of shared_ptr

std::shared_ptr made it into the C++ standard library, is popular, and now in widespread use.  And before it, everyone used boost::shared_ptr.  So, what's the problem?  In a nutshell: weak_ptr support, and mandatory threadsafe refcounting.  The resulting performance is far less than optimal, for many common use-cases. 

If you need a refresher, this stackoverflow post explains how shared_ptr works.

The major off-the-shelf alternatives are std::unique_ptr and boost::intrusive_ptr.

The weak_ptr requirement always adds storage cost to shared_ptr ...

... even if you're not using weak_ptr in your own code.  The common algorithm is to maintain a separate refcount for strong-references and weak-references.  Both MSVC and libstdc++ do this.

Refcounting is always thread-safe ...

... even when you don't need it to be.  And as we all know, atomics are 1-2 orders of magnitude slower than their single-threaded counterparts.

If you are trying to build a DAG that is only accessed by one thread at a time, then shared_ptr is the wrong solution.

The weak_ptr requirement always adds extra performance overhead.

Achieving the optimal number of atomic operations involves a trick, where all strong references collectively hold 1 weak reference.  This allows all but the last strong reference to avoid touching the weak_count.  Both MSVC and libstdc++ do this.  Here are all the places atomics occur:
  • shared_ptr creation: set strong_count = 1, weak_count = 1; no atomics
  • weak acquire: atomically increment weak_count
  • weak release: atomically decrement weak_count
  • strong acquire: atomically increment strong_count
  • strong release: atomically decrement strong_count, and if it was the last one, also atomically decrement weak_count.
Sadly, this means that every object created through shared_ptr incurs a minimum cost of two atomics.  In contrast, a strong-only pointer system would incur at minimum one atomic.

Let's also remember that at least one new/delete is required as well, so we're up to four atomics per shared_ptr-mediated object.  By the performance-analysis method presented here, single-threaded shared_ptr object-management is limited to ~(150MHz / 4) = 37MHz.

If you're using C++, you probably expect performance.  37MHz object creation is a far cry from peak performance.

the make_shared optimization undermines weak_ptr

The make_shared optimization allocates the object and the control block in a single allocation (one call to "new").  Herb Sutter describes it well in GotW #89.  But this effectively makes every weak_ptr hold a strong reference to the object's raw memory -- a shallow strong-reference.  The irony!  Especially considering that the only reason left to choose shared_ptr over the alternatives is that you need weak_ptr as well.

Admittedly the extended object-memory lifetime isn't a big deal for small objects like vector or string.  Just be wary of using shared_ptr+weak_ptr with large-footprint objects.

shared_ptr implementation requires a virtual release() ...

... even if you're only using a concrete type with no inheritance.  shared_ptr must account for all possible usage scenarios, including multiple-inheritance and custom deleters.  It's analogous to how all COM objects inherit from IUnknown and have a virtual Release() method.

The optimal alternative is intrusive_ptr, which gives you the freedom to define an inlinable intrusive_ptr_release() on your concrete type.

This may sound like a minor micro-optimization, but the effect of a non-inlinable call on the surrounding code-generation can be profound.

Concluding Remarks

shared_ptr is at best a convenient low-performance class, to be used sparingly and in code that is called at low frequency.  Prefer unique_ptr and intrusive_ptr, in that order.

Friday, January 15, 2016

Atomics aren't free

Modern code is filled with atomic instructions, which serve to speed up multi-threaded or threadsafe code.  They are hidden behind every new/delete, shared_ptr, lightweight mutexes like the Win32 CRITICAL_SECTION and the Linux futex, and lock-free data-structures like those provided by Boost.Lockfree.  It's easy to be excited about their performance gains compared with syscall-based synchronization.  However, what's often ignored is their performance versus unsynchronized single-threaded code.

How fast are atomics?

With a simple micro-benchmark on my Ivy Bridge laptop (i7-3610QM), I measured roughly this many atomics/second to a single address.  AtomicAdd (lock xadd) and AtomicCAS (lock cmpxchg) produced similar results:
  • ~140 million [single-threaded]
  • ~37 million [under contention - 4 or more threads accessing the same location]
The "under contention" number hasn't changed much in the last 10 years, at least on the systems I tested.  A desktop Core 2 Duo and a Nehalem i7 produced nearly the same results.

Restated in Hertz, we have 30-150MHz atomics to a single address on modern CPUs.  Considering that modern CPUs run at 2000-3000MHz, that's a 1-2 order-of-magnitude difference in performance between non-atomic and atomic ops.

This paper corroborates these numbers, and explains them in terms of the MESI cache-coherency protocol and its variants.  An atomic is performed by first acquiring exclusive ownership of the cacheline in the core's L1.  Uncontended accesses are fast because the exclusive acquisition happens only once, and all subsequent operations stay local to the core.  When multiple cores contend for the same cacheline, additional write-back-invalidate signals are sent as part of the transfer-of-ownership protocol, which adds latency.  That latency acts as a direct throughput limiter, since atomics block the core's memory pipeline.

ARM atomics are built from a sequence of instructions in an LDREX/STREX (load-exclusive/store-exclusive) retry loop, which makes this cacheline-ownership protocol explicit in the instruction stream.

A method of analysis

Say you write this simple conversion function.
std::string ToString(bool x)
{
    return x ? "true" : "false";
}
std::string must call "new" to allocate memory.  With the default thread-safe allocator, which uses a lightweight mutex, this function costs at least one atomic instruction.  So it is at best a 30-150MHz function -- called in a tight loop, it couldn't run faster than that.  Considering the only other work is a strcpy of 4-5 bytes (a few clock-cycles), it's clear the atomic dominates by at least an order of magnitude.  But we're still being too charitable: every new has a corresponding delete, so really we have a 15-75MHz function.  Just to return a string!

The irony is that the same code, run on 10-year-old hardware but with a single-threaded C runtime, would almost certainly be faster.

This insane overhead of new/delete is why custom allocators, and malloc-replacements like TCMalloc, are so important.

Other Anti-Patterns

All of the following have something in common: they drag down the performance of single-threaded code in avoidable ways.

Passing a refcounted smart-pointer (like std::shared_ptr or boost::intrusive_ptr) by-value.  Every copy incurs extra refcount operations, and each one is an atomic add of +1 or -1.

Passing objects like std::string and std::vector by-value, where every copy can trigger an allocation (and its atomics).

Using a std::shared_ptr to manage the lifetime of an object that is only accessed by one thread at a time.

Fine-grained locking in general.  The previous item about shared_ptr is somewhat an example of this.

Take-aways

Treat atomics as "slow" operations, just as you would any other synchronization primitive.