When Debuggers Fail
I recently traded in my car for a new one, only to take the new car home and have it completely dead in my garage the next day; I couldn’t crank the engine or anything. I had it towed back to the dealer. The dealer replaced the battery which they said had “gone bad” and wouldn’t hold a charge. Okay, fine. I got it back and started noticing another weird problem: the radio would randomly turn on when the car was off. Well, I thought, maybe this is what’s running the battery down. Took it back to the dealer and told them the radio was possessed, and, thankfully, they were able to reproduce the problem. So then they call the engineering team that worked on the car to get them involved.
The engineers have seen these kinds of issues before, they said, and said to replace the radio. The dealer did so, and all the diagnostics came up clear. I get the car back a few days later with a new radio and take it home. Guess what happened the next morning? Dead again. At this point, everyone involved has realized that this problem is substantially more serious than a dead battery. The technicians are stumped, the engineers are stumped, and I’m left with a crappy rental car. Finally, I just ask for a new vehicle, which I’ll be picking up on Monday. Last I heard, the technicians were ripping all the wiring out of the car looking for a short somewhere that’s causing the problem.
As a software developer, this kind of “unsolvable” problem is all too familiar. In fact, it seems like there’s a whole class of bugs that occur in software that can be damn near impossible for the developer to track down and fix. Among these (in roughly decreasing order of likelihood):
- Synchronization problems.
- Security exploits (especially if the security was tacked onto the system later).
- Bugs in closed-source third-party libraries.
- Problems that show up in a release build but not a debug build.
- Bugs in the compiler.
- Bugs in hardware.
Nearly all of these kinds of bugs are not the kind that can be carefully reproduced and debugged using traditional debugging tools. Eliminating these issues requires a much more holistic approach to system design.
First and foremost, these issues result from incorrect assumptions. You might assume that a variable is initialized to a certain value, but it may be the case that the debug runtime initializes the variable but the release runtime does not. Or you might assume that a certain routine you’re calling is thread-safe when it turns out not to be, which introduces synchronization problems into your system.
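To make the thread-safety assumption concrete, here is a minimal Python sketch. The `Counter` class and the `run` helper are hypothetical names invented for this illustration, not part of any library. The unsafe method looks harmless, but it is a read-modify-write that another thread can interrupt; the safe version states the assumption explicitly by taking a lock.

```python
import threading


class Counter:
    """Hypothetical shared counter used only for illustration."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read, add, write: another thread may run between these steps,
        # so two threads can read the same value and one update is lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self):
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            current = self.value
            self.value = current + 1


def run(n_threads, n_increments):
    """Increment one shared counter from several threads using the lock."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment_safe()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Swapping `increment_safe` for `increment_unsafe` in `worker` may or may not lose updates on any given run, which is exactly why this class of bug is so hard to catch in a debugger.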
In the same vein, it’s a virtually unchallenged assumption at this point that compilers and processors are “correct,” inasmuch as each high-level language expression or machine instruction does exactly what it’s supposed to do semantically, regardless of any optimizations done at a lower level. You’d probably be laughed out of the room if you suggested (without a lot of proof) that a bug in your software was due to an error in the compiler. That assumption doesn’t always hold, however, as was proven rather dramatically in the case of the original Pentium floating-point bug. Although Intel already knew about the bug before it was discovered in the wild, Thomas Nicely (who discovered the bug) spent considerable time trying to figure out the problem before tracking it to a single machine instruction. In the meantime, he pored over his own code and the compiler source code (and found a bug there, as well).
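One cheap way to test the “the hardware is correct” assumption rather than trust it is a round-trip check: divide, multiply back, and see how far you land from where you started. The sketch below is only an illustration. The function names and the tolerance are my own choices, and the default operands are the pair widely cited in discussions of the Pentium division bug.

```python
def division_round_trip_error(x, y):
    """Return |x - (x / y) * y|, which should be tiny on correct hardware."""
    return abs(x - (x / y) * y)


def division_looks_sane(x=4195835.0, y=3145727.0, tolerance=1e-6):
    # The tolerance is an assumption: ordinary double-precision rounding
    # for operands of this size is many orders of magnitude smaller.
    return division_round_trip_error(x, y) <= tolerance
```

A check like this proves nothing about the rest of the instruction set, of course. It just turns one silent assumption into something that can fail loudly.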
Researchers have developed formal systems for verifying that a multi-threaded program is thread-safe (for example, the π-calculus), or for verifying that hardware is correct. There’s a whole subset of computer science devoted to formal correctness proofs: verification. Virtually no business software would (or could) undergo a formal proof of correctness, however, so while these techniques may be useful in planning your system, they probably won’t be able to help you actually verify your completed system.
The key to “solving” these “unsolvable” problems is to eliminate all your assumptions and start from scratch. Just because you set a variable to a value doesn’t mean that it has the same value you assigned to it the next time you look at it. In multi-threaded code, statements can be executed in nearly any order (outside of a critical section), so don’t assume that anything will happen before or after anything else unless you’ve set up the proper locks and signals to ensure that it will.
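As a small illustration of “proper locks and signals,” the following Python sketch uses `threading.Event` from the standard library to force one thread’s write to happen before another thread’s read. The function name `run_ordered` and the `"config"` key are made up for the example.

```python
import threading


def run_ordered():
    """Start the reader first, yet guarantee it sees the writer's value."""
    shared = {}
    observed = []
    ready = threading.Event()

    def writer():
        shared["config"] = "loaded"
        # Signal only after the write is complete.
        ready.set()

    def reader():
        # Block until the writer signals; without this wait, the read
        # could run before the write and raise KeyError.
        ready.wait()
        observed.append(shared["config"])

    reader_thread = threading.Thread(target=reader)
    writer_thread = threading.Thread(target=writer)
    reader_thread.start()  # deliberately started before the writer
    writer_thread.start()
    reader_thread.join()
    writer_thread.join()
    return observed
```

The reader is started first on purpose: the ordering comes from the signal, not from the order in which the threads happened to be launched.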
The holistic design process I’m suggesting requires you to design a thread-safe system (or a system with proper security) from the ground up. Tear the wires out and start over. Or, to paraphrase Fred Brooks, build one and throw it away, because it’s definitely broken.




