C++11 introduced rvalue-references (Foo&&) to support move-semantics, and now there are (at least) 2 kinds of references. With that came a lot of confusing and conflicting advice on how to best take advantage of the performance benefits of move-semantics. One counter-intuitive piece of advice is to use pass-by-value to gain performance. Examples of such information:
- Dave Abrahams' - Want Speed? Pass by Value
- juanchopanza - Want Speed? Don't (always) pass by value.
- isocpp.org - Why is "want speed? pass by value" not recommended?
- This StackOverflow topic
- StackOverflow C++ FAQ on Arithmetic Operators
- Chandler Carruth - (video) Optimizing the Emergent Structures of C++
- Scott Meyers - Should move-only types ever be passed as values?
- Herb Sutter - (2013) GotW #91 Solution : Smart Pointer Parameters - suggesting use of value parameters
- Herb Sutter - (2014) Back to the Basics! Modern C++ and related SO post - suggesting use of const-ref and rvalue-ref overloads
My own long-standing rules-of-thumb have been:
- pass built-in types by value (char, int, float, double, etc.)
- pass read-only arguments as reference-to-const
- pass "sink" arguments as rvalue-references
- pass mutable arguments as lvalue-references (otherwise known as "out params" and "inout params")
- never pass by value (it's a code smell and most functions don't need to sink values)
Passing references is always a safe default, because in the optimal case they collapse to nothing, and in the pessimal case they incur one extra indirection. Whereas passing values may optimally collapse to nothing (if special optimizations kick in), but in the pessimal case will incur a huge penalty (copy construction). Even simple types like std::vector and std::string have huge copy penalties due to the extra new & delete, and unnecessary copying is a major anti-pattern. Pass-by-reference also avoids the famous "slicing problem" (pass-by-value copying only the base-class portion of a derived object).
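Slicing is easy to demonstrate; a minimal sketch (Base, Derived, and the two free functions are made-up names):

```cpp
#include <string>

struct Base {
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    std::string name() const override { return "Derived"; }
};

// Pass-by-value: only the Base portion of a Derived argument is copied
// into b, so virtual dispatch sees a plain Base.
std::string sliced_name(Base b) { return b.name(); }

// Pass-by-reference: the original object is intact, dispatch works.
std::string ref_name(const Base& b) { return b.name(); }
```

Calling both with a `Derived` makes the difference visible: the by-value version reports "Base", the by-reference version "Derived".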
However, I've learned to be skeptical of "always" and "never" arguments -- and thus am obliged to examine and prove my own views. So, I'm breaking down the various cases and seeing where pass-by-value might make sense.
Separate Compilation
This is an intrinsically suboptimal case (compared to inlining), and not particularly worth being clever about. The compiler cannot optimize across the call boundary, so restrictive ABI rules are followed (the MSVC x64 ABI on Windows, the System V ABI on Linux) -- which share this common behavior:
- built-in-type values are passed in registers (integers, pointers, floats)
- C++ references are implemented as pointers
- non-POD struct values are forced into caller stack-memory, and passed by invisible reference (note: on MSVC even POD structs obey this rule)
- [linux only] POD struct values are efficiently broken down into scalar components, and each scalar is passed in the next available register
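There is no portable way to observe register passing from within C++, but the triviality traits are a rough compile-time proxy for which category a type falls into; a sketch (Vec2 and Named are made-up types, and the real ABI rules also consider size and alignment):

```cpp
#include <string>
#include <type_traits>

// A small POD-style aggregate: the System V ABI on Linux breaks it into
// two scalar ints and passes each in the next available register.
struct Vec2 { int x, y; };

// A non-trivial type: the std::string member gives it non-trivial
// copy/destroy semantics, so it is passed by invisible reference.
struct Named { std::string name; };

static_assert(std::is_trivially_copyable<Vec2>::value,
              "Vec2 is a candidate for register passing");
static_assert(!std::is_trivially_copyable<Named>::value,
              "Named goes to caller stack-memory");
```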
Passing by value across a non-inline boundary guarantees either a copy or a move. Some (like Abrahams) have argued that if the function needs to make a copy anyway, then pass by value and let the copy occur in the caller. For extremely simple cases, this might collapse 2 overloads into one:
```cpp
void Foo::make_copy(Bar bar);
// versus
void Foo::make_copy(const Bar& bar);
void Foo::make_copy(Bar&& bar);
```
But the solution doesn't generalize to multi-argument functions.
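As a concrete sketch of Abrahams' argument (Registry and add are made-up names): a single by-value sink serves both lvalue and rvalue callers, and the copy, when one is needed at all, happens in the caller:

```cpp
#include <string>
#include <utility>
#include <vector>

struct Registry {
    std::vector<std::string> names;

    // One overload instead of const-ref + rvalue-ref pair:
    // lvalue callers pay a copy (made at the call site),
    // rvalue callers pay only moves.
    void add(std::string name) {
        names.push_back(std::move(name));  // never copies again
    }
};
```

The cost relative to the two-overload version is at most one extra move per call, which is usually cheap for movable types.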
Pass-by-Value doesn't always scale
Consider operator+, which is typically a free-function that builds on operator+=. The prevailing answer on StackOverflow's C++ FAQ is pass-by-value for the first argument. Quoting the code directly:

```cpp
class Foo {
    Foo& operator+=(const Foo& rhs) {
        // actual addition of rhs to *this
        return *this;
    }
};

inline Foo operator+(Foo lhs, const Foo& rhs) {
    lhs += rhs;
    return lhs;
}
```

Unfortunately, if you follow this advice and then happen to write code like this ...
```cpp
const Foo a;
Foo c = a + Foo(3, 4);
```

... you'll end up with an unnecessary copy (a is copied into lhs), while the temporary Foo(3, 4) is never moved from -- which is clearly pessimal! Why? Because the second argument has no provision for handling temporaries as inputs. Since operator+ is commutative, we can try to patch over this with an additional overload:
```cpp
inline Foo operator+(Foo lhs, const Foo& rhs); // see above
inline Foo operator+(const Foo& lhs, Foo rhs) {
    rhs += lhs;
    return rhs;
}
```

But whoops, now we'll get a compile-error: "ambiguous call". The only way to resolve this? Abandon pass-by-value and write every possible pass-by-reference overload:
```cpp
inline Foo op_plus(Foo&& lhs, const Foo& rhs) {
    Foo result = std::move(lhs);
    result += rhs;
    return result;
}

inline Foo operator+(Foo&& lhs, const Foo& rhs)      { return op_plus(std::forward<Foo>(lhs), rhs); }
inline Foo operator+(Foo&& lhs, Foo&& rhs)           { return op_plus(std::forward<Foo>(lhs), rhs); }
inline Foo operator+(const Foo& lhs, Foo&& rhs)      { return op_plus(std::forward<Foo>(rhs), lhs); }
inline Foo operator+(const Foo& lhs, const Foo& rhs) { return op_plus(Foo(lhs), rhs); }
```
This solution performs the optimal number of copies and moves in every case.
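One way to sanity-check these counts is an instrumented type that tallies its own copies and moves; a sketch (Counted is a made-up stand-in for Foo). Measuring the FAQ's pass-by-value recipe on a modern compiler shows the lvalue operand costing a copy while the rvalue operand is never moved from:

```cpp
#include <utility>

// Instrumented value type: counts copy- and move-constructions.
struct Counted {
    int value;
    static int copies;
    static int moves;
    explicit Counted(int v) : value(v) {}
    Counted(const Counted& o) : value(o.value) { ++copies; }
    Counted(Counted&& o) noexcept : value(o.value) { ++moves; }
    Counted& operator+=(const Counted& rhs) { value += rhs.value; return *this; }
};
int Counted::copies = 0;
int Counted::moves = 0;

// The FAQ's pass-by-value recipe, applied to Counted.
inline Counted operator+(Counted lhs, const Counted& rhs) {
    lhs += rhs;      // the temporary on the right is only read, never moved from
    return lhs;      // implicit move of the parameter into the return value
}
```

Running `Counted c = a + Counted(2);` with an lvalue `a` records exactly one copy (a into lhs) plus at least one move for the return path.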
Exception-safety and noexcept
Herb Sutter provided this "example #3", which I'm pasting from this SO post:
```cpp
class employee {
    std::string name_;
public:
    void set_name(std::string name) noexcept {
        name_ = std::move(name);
    }
};
```
Sutter points out (here in the video) that the function is marked noexcept, yet a call to it may throw! Why? Because value arguments are constructed in the caller's context before the function is called, so a (throwing) copy-constructor will throw before the "noexcept block" has been reached. It's especially insidious because the calling code doesn't make it obvious:
```cpp
std::string str = ...;
emp.set_name(std::move(str)); // cannot throw
emp.set_name(str);            // may throw
```

Sutter's "example #2" has both const-ref and rvalue-ref overloads:
```cpp
void set_name(const std::string& name)           { name_ = name; }
void set_name(      std::string&& name) noexcept { name_ = std::move(name); }
```

That's a better option, since at least the const-ref overload is not marked noexcept. I could live with that. The same calling code above compiles, but at least you can see in the code (and in a debugger backtrace) that a throwing function was called.
Can we do better? If we required ourselves to only pass "sink" arguments as rvalue-ref, we would only retain this overload:
```cpp
void set_name(std::string&& name) noexcept { name_ = std::move(name); }
```

Now all callers must be explicit, and the risk of an exception is much clearer in the calling context.
```cpp
std::string str = ...;
emp.set_name(std::move(str));   // cannot throw
emp.set_name(std::string(str)); // the copy is explicit, and it can throw
emp.set_name(str);              // doesn't compile
```

By following this convention, we reserve the simplest syntax "set_name(str)" for arguments that are passed by const-ref; moves and copies are always explicit. And that makes scanning the code for performance issues or exception-throwers that much easier.
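The hazard in example #3 can be demonstrated directly with a type whose copy constructor throws; a sketch (Payload and Holder are made-up names):

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical payload whose copy constructor always throws
// (standing in for, say, a failed allocation).
struct Payload {
    Payload() = default;
    Payload(const Payload&) { throw std::runtime_error("copy failed"); }
    Payload(Payload&&) noexcept = default;
    Payload& operator=(Payload&&) noexcept = default;
};

struct Holder {
    Payload p_;
    // Pass-by-value sink marked noexcept, just like Sutter's example #3.
    void set(Payload p) noexcept { p_ = std::move(p); }
};

// The copy that initializes the parameter runs in the *caller's* context,
// so the exception escapes even though set() itself is noexcept.
bool lvalue_call_threw(Holder& h, const Payload& p) {
    try {
        h.set(p);        // parameter copy-constructed here -- throws
        return false;
    } catch (const std::runtime_error&) {
        return true;     // caught in the caller; no std::terminate
    }
}
```

An rvalue call like `h.set(Payload{});` takes the noexcept move path and genuinely cannot throw.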
Inline-able Code
Now we're talking about actual optimization. The compiler can see across the function-call boundary, and collapse a lot of code down into nothing. The inliner and backend optimizer are especially good at collapsing reference arguments back to the original variable. Consider the following code:
```cpp
void Direct() {
    int aa = 1;
    int bb = 2;
    int cc = aa + bb;
}

// split into 2 calls
inline int Add(const int& lhs, const int& rhs) { return lhs + rhs; }

void CallAdd() {
    int aa = 1;
    int bb = 2;
    int cc = Add(aa, bb);
}
```

Both "Direct" and "CallAdd" will result in the same object code on every major compiler (gcc, MSVC, clang). Which means you don't need to sweat the small stuff: feel free to use references to pass trivial and non-trivial types ... they won't end up in memory. Note also that perfect-forwarding (using T&& arguments and std::forward) relies on these same optimizations.
How does this work? Clang + LLVM handles this in an elegant way. The clang frontend faithfully outputs naive LLVM-IR for each function. When a variable must be passed by reference, the caller defines the variable with an "alloca" and the callee takes in a pointer. See a simplified example here. Now LLVM (the backend) takes over, and starts trying to inline and then simplify. One of the most important optimization passes is mem2reg, which transforms a memory-oriented alloca+store+load into a value (virtual register).
mem2reg is such a critical transformation that this paper indicates it alone provides on average an 81% speedup on their benchmarks, whereas enabling all remaining optimizations yields 102% (a comparatively small difference). Why do I mention this in particular? Because when you write your code in a straightforward manner, mem2reg is an optimization you can rely upon. This is generally not the case for many other optimizations, especially across compilers/versions/vendors.
Aliases mess up inlining
The inliner works well with pure code and values on the stack. Mix in impure logic, like side-effects on heap-allocated memory or global variables, and all those optimizations can disappear. Consider the following tweak to the inline-able code:

```cpp
// same Add as before
inline Foo Add(const Foo& lhs, const Foo& rhs) { return lhs + rhs; }

Foo* g_pFoo = nullptr;

void CallAddWithGlobal() {
    Foo aa = 1;
    Foo bb = 2;
    g_pFoo = &aa;
    Foo cc = Add(aa, bb);
}
```

Now that a global variable is pointing at our local variable, aa must be stored into memory. There's also a chance that the value will be re-loaded from memory when this invocation of Add() is inlined, since aliases can disable trivial mem2reg collapsing.
Does this mean the const-ref argument to Add() was a mistake? Definitely not! It just means that CallAddWithGlobal() needs to be written better. One easy workaround is to take a separate snapshot of variable aa before assigning g_pFoo -- that way the inlined Add() can collapse down to nothing again.
```cpp
void CallAddWithGlobal() {
    Foo aa = 1;
    Foo aa_copy = aa;
    Foo bb = 2;
    g_pFoo = &aa;
    Foo cc = Add(aa_copy, bb);
}
```
Other ways to mess up Inlining
This is kind of a tangent, but I found that classes with destructors completely suppress inlining in MSVC! Passing a const Foo& into a function would always keep that function non-inlined in the assembly code. This was true on the 2013 and 2015 compilers. Clang+LLVM did not have this issue.