Software is never perfect, and it's never done. Bugs always creep in, and tend to reveal themselves towards the end of the development cycle when multiple modules are integrated together. In the inevitable debugging sessions that follow, do you expect a long hellish grind of guesswork, tweaks, and recompiles? Or, are there fast debugging methods, and ways to preemptively ease the troubleshooting?
State Observation
A bug must first be observable. Given some inputs, the program must produce an incorrect or unexpected output (including crashing). The primary observation methods are:
- inspecting program-visible inputs and outputs
- includes: keyboard/mouse/microphone input, input/output files, graphical and audio output
- Pros: no changes to the program needed
- Cons: no intermediate state available
- trace logging
- Pros: shows the evolution of a narrow slice of intermediate state over time
- Cons: requires modification to multiple locations in code; observation limited to state that was selected ahead of time
- interactive debugger
- Pros: shows the full program state at one instant in time
- Cons: time consuming to observe evolution of state over time
Notice how logging and debugging complement each other, in a type of space-time tradeoff. It's up to us to choose the right tool for the job, depending on the situation.
- Do you need to compare data values across 1000s of iterations?
- add a logging statement
- why? it would take forever to single-step 1000s of iterations
- Are you trying to determine which codepath was taken through some extremely branchy code?
- step through it with a debugger
- why? it would be tedious to manually add a logging statement to 10s of code branches
- Want to know why a function or call-chain is failing?
- set a breakpoint on every suspected "return" and "throw" statement, see which one hits
- why? it's faster to set a handful of breakpoints than to manually add logging and wait for a recompile
- Do you need to observe control flow over large swaths of code over time?
- set breakpoints and "binary search" the error, or fall back to adding logging statements
- Breakpoints can save you recompiles, but if you keep missing the error location, then thorough logging may be the only recourse.
Of course all of the above presumes that the code is structured to permit all of the above easily.
Coding Guidelines
Each of the following deserve their own article with code samples, so this is more of a braindump.
- Place intermediate values in named variables.
- Avoid long chained calls and excessive function composition within a statement.
- You can easily see variables in a debugger. You can't see compiler generated temporaries or return values.
- If you want to log a value, make sure it's in a variable so that the computation and log statement don't go out-of-sync.
- Calculate the return value into a local variable before returning it.
- To set a breakpoint, you need a line of code to set it on.
- To see the return value, you need a line where it's visible.
- Clearly separate error and success paths.
- Write separate return or throw statements for errors.
- Try to end the success path at the bottom of the function.
- Avoid single exit point style; in C++ it's an anti-pattern.
- Historically the "Single Entry Single Exit" style was introduced to add structure to assembly, FORTRAN, and C, focused on resource cleanup. It's obsolete in C++, which has exceptions and generally requires RAII for cleanup. See here.
- Herb Sutter would agree.
- RVO is not a defensible argument for single-exit sytle. There are two cases:
- The function is perf-critical, and thus made inline-able. Inline context renders RVO moot since the backend compiler can do far better.
- The function is not perf-critical, and thus placed in a separate compilation unit. Now RVO is in play, but since we already assumed that micro-optimization is a non-goal, we can prioritize maintainability and debuggability over RVO.
- Use consistent naming across the codebase.
- If all variables of type Foo are written "Foo* pFoo" then a single watch expression of "pFoo" will reveal that state in every scope. Any sub-expressions you want to see in every context (like "pFoo->pBar->thing") will also re-evaluate cleanly in every context.
- On the flip side, the use of different variable names in every context requires retyping every expression/sub-expression in every context. Wastefully inefficient with zero benefit. (the speed of thought reduced to repeated manual operations)
- Make it easy to copy/paste sub-expressions from source code into the debugger's expression evaluator.
- Avoid chain-calling accessor functions. Many debuggers cannot evaluate those functions, either as an inherent limitation, or because inline accessors have no out-of-line instance in the executable that is callable at debug time.
- Pass raw pointers and references to functions whenever possible. (as opposed to smart pointer references)
- Avoid function composition at callsites. to make it easier to step into any function call.
- "Step in" must first go into every argument-producing call before finally going into the "actual function" of interest.
- Anyone who's used a debugger has done the tedious "step in dance" -- "step in", "step out", "step in", "step out", "step in" ... until the function-of-interest is reached. Of course this is very error prone since one extra "step out" skips over the function-of-interest, which usually requires a program restart (i.e. a complete waste of time).
- In Visual Studio there is a workaround that lets you mark functions as "NoStepInto". It's useful for leaf functions like getters/setters and conversion operators, but little else.
- Strongly related to placing intermediate values in named variables.
- Avoid excessive wrapper and forwarding functions, to make it easier to step into the "actual logic".
- Basically, if you can eliminate wrapper layers, then do so. Kind of obvious.
- Use data types that show up cleanly and easily in the debugger.
- Prefer structs with named fields over std::pair and std::tuple. std::pair and std::tuple implementations use deep inheritance, either to process a type-list, or to implement the empty base class optimization. In a debugger, base classes are displayed similar to member variables by introducing a nesting level; to see all elements of a tuple requires excessive child expansion (i.e. many mouse-clicks).
- Consider using virtual functions instead of "bags of function pointers" or std::function. Debuggers can deduce the derived-most type of any class/struct that has a vtable, and show the full contents automatically. Code browsing works better on virtuals too (e.g. "find all references" to jump between pure virtual declaration and its derived implementations). Note that superior implementations of std::function would obviate this choice, by storing internal state in a debuggable fashion (at least in debug builds!).
- Other container types suffer this problem; even with custom debugger visualizers the result isn't always workable.
- boost::optional and boost::flat_map
- The problems aren't unique to Visual Studio; gdb has similar issues.
- Debugger visualizers are not a substitute for debuggable data structures. Visualizers show values but not addresses, making it impossible to compare local pointers/references to elements, and impossible to plug into a memory viewer. Child elements of visualized data trees cannot be evaluated as watch expressions.
- When iterating arrays and std::vectors, prefer integer indexing over alternatives.
- The debugger shows array and vector contents by integer index.
- Typical debug sessions involve setting a breakpoint in a loop, looking for a bad input element. Upon discovery of the bad element, you need to know its index to "work backwards" towards the code that generated it.
- In the debugger, if you already know the bad array element (by inspecting array contents after a loop), you can then set a conditional breakpoint in the loop body on "index == BAD_INDEX" and rerun the program. Of course this requires an "index" variable in the code.
- Iterators are not debuggable! You cannot trivially determine the array index from a vector iterator. Conditional breakpoints cannot perform iterator arithmetic like "itr - vec._Head" in most debuggers either.
- Range-based for-loops use compiler-generated iterators, which reduces to the same problem. Only use range-based for-loops for trivially correct code.
- Avoid std::for_each. In almost every case, it's a worse way of writing a range-based for-loop.
- Cons: This approach results in slightly more verbose code.
- Minor pro: index-based loops are friendlier to the optimizer by easing alias analysis. Index-based loops are auto-vectorizable, whereas iterator-based loops are not. This includes the use of pointers as iterators. See here and here.
- Give functions unique names whenever possible.
- Don't excessively overload on terse names. Or even reuse function names across namespaces.
- It's difficult to set breakpoints when you don't know which function is going to be called.
- Reused names are not grep-friendly.
- Even sophisticated code-browsing tools often fail to find the right overload for a given callsite. They may resort to providing tons of options, causing trial & error navigation or missed breakpoints.
- Of course the guideline doesn't apply when required by interface constraints (compile time, virtual, and meta-programming name requirements are all "interface requirements"). It only behooves us to choose semi-unique names per interface, to avoid collisions across different interfaces.
Concluding Remarks
If you've ever debugged through someone else's code and gotten upset, it's very likely because they violated any of the guidelines listed above. You may not have even realized why you were upset, other than "this code is sh*t and so difficult to work with". Hopefully by building up the list of patterns and anti-patterns, we can all consciously recognize and promote a software style that makes everyone more productive. Remember, software development is both art and science.
No comments:
Post a Comment