Ekinoderm
 

When Debuggers Fail

I recently traded in my car for a new one, only to take the new car home and have it completely dead in my garage the next day; I couldn’t crank the engine or anything.  I had it towed back to the dealer.  The dealer replaced the battery, which they said had “gone bad” and wouldn’t hold a charge.  Okay, fine.  I got it back and started noticing another weird problem: the radio would randomly turn on when the car was off.  Well, I thought, maybe this is what’s running the battery down.  I took it back to the dealer and told them the radio was possessed, and, thankfully, they were able to reproduce the problem.  So then they called the engineering team that worked on the car to get them involved.

The engineers said they had seen these kinds of issues before and recommended replacing the radio.  The dealer did so, and all the diagnostics came up clear.  I got the car back a few days later with a new radio and took it home.  Guess what happened the next morning?  Dead again.  At this point, everyone involved had realized that this problem was substantially more serious than a dead battery.  The technicians were stumped, the engineers were stumped, and I was left with a crappy rental car.  Finally, I just asked for a new vehicle, which I’ll be picking up on Monday.  Last I heard, the technicians were ripping all the wiring out of the car looking for a short somewhere that’s causing the problem.

As a software developer, this kind of “unsolvable” problem is all too familiar.  In fact, it seems like there’s a whole class of bugs that occur in software that can be damn near impossible for the developer to track down and fix.  Among these (in roughly decreasing order of likelihood):

  • Synchronization problems.
  • Security exploits (especially if the security was tacked onto the system later).
  • Bugs in closed-source third-party libraries.
  • Problems that show up in a release build but not a debug build.
  • Bugs in the compiler.
  • Bugs in hardware.

Nearly all of these bugs are not the kind that can be carefully reproduced and debugged using traditional debugging tools.  Eliminating these issues requires a much more holistic approach to system design.

First and foremost, these issues result from incorrect assumptions.  You might assume that a variable is initialized to a certain value, but it may be the case that the debug runtime initializes the variable but the release runtime does not.  Or you might assume that a certain routine you’re calling is thread-safe when it turns out not to be, which introduces synchronization problems into your system.

In the same vein, it’s a virtually unchallenged assumption at this point that compilers and processors are “correct,” inasmuch as each high-level language expression or machine instruction does exactly what it’s supposed to do semantically, regardless of any optimizations done at a lower level.  You’d probably be laughed out of the room if you suggested (without a lot of proof) that a bug in your software was due to an error in the compiler.  This assumption obviously doesn’t always hold, however, as was proven rather dramatically in the case of the original Pentium floating-point bug.  Although Intel already knew about the bug before it was discovered in the wild, Thomas Nicely (who discovered the bug) spent considerable time trying to figure out the problem before tracking it to a single machine instruction.  In the meantime, he pored over his own code and the compiler source code (and found a bug there, as well).

Researchers have developed formal systems for verifying that a multi-threaded program is thread-safe (for example, the π-calculus), or for verifying that hardware is correct.  There’s a whole subset of computer science devoted to formal correctness proofs: verification.  Virtually no business software would (or could) undergo a formal proof of correctness, however, so while these techniques may be useful in planning your system, they probably won’t be able to help you actually verify your completed system.

The key to “solving” these “unsolvable” problems is to eliminate all your assumptions and start from scratch.  Just because you set a variable to a value doesn’t mean that it has the same value you assigned to it the next time you look at it.  In multi-threaded code, statements can be executed in nearly any order (outside of a critical section), so don’t assume that anything will happen before or after anything else unless you’ve set up the proper locks and signals to ensure that it will.
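Here’s a minimal sketch of that last point, in Python.  The function name `count_with_threads` and the thread and iteration counts are made up for illustration; the only real API used is the standard `threading` module.  Several threads bump a shared counter, once inside a critical section and once without one.

```python
import threading


def count_with_threads(num_threads: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads and return the final value.

    Hypothetical helper for illustration only.
    """
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                # Critical section: the read-modify-write cannot interleave
                # with another thread's read-modify-write.
                with lock:
                    counter += 1
            else:
                # No lock: two threads may read the same old value,
                # and one of the two updates is then lost.
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    expected = 4 * 100_000
    print("expected:", expected)
    print("with lock:", count_with_threads(4, 100_000, use_lock=True))
    print("without lock:", count_with_threads(4, 100_000, use_lock=False))
```

The locked version always matches the expected total.  The unlocked version may or may not, depending on the interpreter version and on timing, and that’s exactly the trouble: it can pass every test on your machine and still be wrong.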

The holistic design process I’m suggesting requires you to design a thread-safe system (or a system with proper security) from the ground up.  Tear the wires out and start over.  Or, to paraphrase Fred Brooks, build one and throw it away, because it’s definitely broken.

Predict the Future For Me

My horoscope for today says:

Use today’s influences to free up tomorrow. This will be an excellent day for tying up any loose ends or clearing up any outstanding chores or duties, whether you’re at work or school, or at home. A helpful surge of energy will get you motivated early on and will keep you going!

As a software developer, one has to sort of admire the generic flexibility of the daily horoscope.  No matter how you read it, it comes out as sort of “truthy.”  It deals with safe generalities, never being too specific or making any predictions that could actually be falsified.  Also, predicting that I would experience a caffeinated “surge” of energy on a weekday morning was particularly accurate.

Ideally, we’d take a lesson from the horoscope writers when we’re developing software components and make them laughably generic and flexible.  Because, what we’re really being asked to do when we design software components is to predict the future, although we usually don’t have the luxury of playing it as safe as an astrologer.  That’s why some of the most important design questions are:

  • Which requirements are most likely to change?
  • How hard will it be to make the most likely changes?
  • Will it be possible to make unlikely changes?

As far as writing code goes, I think a good rule of thumb is to make the internals of each component one “level” more generic than the interface they present to their users. It’s a good substitute for predicting the future, and you might actually finish the code on schedule, too, which I can’t guarantee if you go for the super-generic ideal.

Software Engineering Principles in Disguise

In other words, if your calculator application only allows for 5 numbers to be added together, the Adder component should internally support more numbers, even if its method-level interface only takes 5 numbers.  I picked 5 numbers here, because it’s an unwieldy number of parameters to pass around all over the place, just begging to be put into an array or some other structure.  It would seem unnatural, almost unconscionable, to not group those together into a common structure internally.
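As a rough sketch of what “one level more generic” might look like for this contrived calculator (the class and method names here are invented for illustration, not taken from any real codebase):

```python
from typing import Iterable


class Adder:
    """Hypothetical Adder component for the calculator example."""

    def add(self, a: float, b: float, c: float, d: float, e: float) -> float:
        # Public interface: exactly five numbers, matching today's requirement.
        return self._sum_all((a, b, c, d, e))

    def _sum_all(self, values: Iterable[float]) -> float:
        # Internals: one level more generic. Any number of values works,
        # but nothing fancier than addition is supported.
        total = 0
        for value in values:
            total += value
        return total


if __name__ == "__main__":
    print(Adder().add(1, 2, 3, 4, 5))
```

If the requirement changes to six numbers, or to a variable-length list, only the thin public method has to change.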

Although my example is obviously contrived, we really do face similar kinds of design decisions during the software development process.  In this process, we are forced to balance between two (apparently) opposing goals: creating flexible and generic components vs. finishing the project and delivering the software.  It seems to be a commonly-held belief that it takes longer to develop generic components than to develop specialized components, and, although I’m certainly not convinced this is correct 100% of the time, I’ll go ahead and allow for it for now.

This is why I’m advocating degrees of generality in the software.  Don’t make the Adder module internally capable of doing dot-products on vectors or complex tensor manipulations, unless these are features that your software has to support now or will need to support in the near future.  And, don’t add unused code. There’s no point in adding tons of additional methods and structures that are never called or used.  Your teammates will either think that this currently-unused functionality has been well-tested and begin to use it prematurely, or they’ll ignore it when they’re doing maintenance and it will get out-of-date.   I don’t know about you, but I get wary when I see code that appears to be unused or that appears to do nothing.  Save everyone the trouble and leave it out for now.

But do design your modules to be just a bit more generic than what is strictly required to implement the required functionality today, because tomorrow more functionality will almost certainly be required.  People will think you can see into the future.

Going for the Easy Win

A big part of software development is making hard problems into easy problems, but sometimes it’s essential for your sanity to work on easy problems that you know how to solve.  I call this “going for the easy win” or sometimes “picking the low-hanging fruit.”  While it might give you a lot of bragging rights to solve a tough synchronization bug that’s existed in your system for years, it might do a lot more for your sanity (and self-esteem) to fix an easier bug with output formatting that looks a bit more manageable.

Mmm...low-hanging fruit...

The way I try to manage this in my own workflow is to designate a specific day and time (I use Monday mornings) to try to fix as many simple bugs as I can in a row.  I find that this practice has a few definite benefits:

  • It fixes bugs which might otherwise not be given the proper attention, since they are “easy” or not as critical as other fixes.
  • It provides a good ramp-up to other development tasks for the week.  It’s kind of like stretching your brain out for the real exercises to come.
  • Because I do it on Monday morning, it doesn’t interrupt me from anything else I was working on.  I tend to leave one obvious task unfinished at the end of each weekday so I have a good place to jump back in the next morning.  However, over the weekend, I find that I tend to forget where I was anyways, so my continuity isn’t broken by fixing bugs on Monday morning.
  • It makes me feel like super-developer to knock out a bunch of bugs quickly, which helps keep me from burning out (along with my side projects), and provides a positive start to the week.

There are a few things to keep in mind when going for the easy wins, most of which are fairly obvious, but I’ll list them anyway so you don’t think I’m a complete idiot:

  • Don’t ignore high-priority issues.  Duh.
  • Don’t substitute lots of bad fixes done quickly for a few good fixes done slowly. Duh.
  • Don’t miss deadlines and screw over your teammates to do this.  Duh.

Finally, don’t think of this as a form of structured procrastination.  Think of this as an important part of the maintenance process that can actually benefit you a lot, in addition to getting a bunch of bugs fixed.

It’s OK to be Ignorant

Programmers have big egos.  It’s not like it’s a big secret or something; we’re used to being right, and it can be very difficult for us to admit when we’re wrong.  And, God help you if you ever have to say:

I don’t know.

If you say that, no one will ever respect you again, and you’ll never get a job offer or a promotion, right?

Sarcasm aside, I think, on some level, a lot of programmers may really believe that not knowing something is an unforgivable sin.  But, in reality, uttering these three words should be a call to action.  If you don’t know something, then you should find it out, right?  Sounds simple.  But we’ve been conditioned (by crappy teachers or well-meaning, but misguided, interviewers) to believe that we are supposed to be able to recall everything we’ve ever learned, even though this is clearly impossible.

When you’re working on a project, sure, you can remember a lot of details about the various classes and modules that your project uses, but these are things that are passively memorized through repeated use.  And when you see something novel that you don’t recognize, you look it up. I hope you’ll forgive me if I’m over-generalizing, but I think this is frankly obvious to anyone who’s ever held a programming job for more than a few months.

And yet, as programmers, we’re bombarded with offers to be “certified,” by Microsoft or Sun or whoever else.  A large portion of a certification exam consists of memorizing the names of framework classes and methods or memorizing the location of configuration options in your web server or other software.  I honestly can’t think of anything less indicative of actual programming and development skill, except maybe having a lot of experience playing video games (something that I seem to see rather frequently on resumes, actually).

On the other hand, I will say that certification is a good alternative to a college education if you just need to get your foot in the door somewhere, though it’s certainly no substitute for real experience.  The college-educated shouldn’t think that a diploma is an alternative to real experience either, because it’s not.

But what’s the problem with rote memorization?  It probably won’t hurt you, and it might even help you, right?

Bloom's Cognitive Domain

You might recall from your primary education something about Bloom’s Taxonomy of Educational Objectives; I can remember a poster of the skills in his “cognitive domain” hanging in my elementary school English classroom.  The whole idea behind his taxonomy is that there’s a hierarchy of reasoning skills.  At the lowest level is “remember,” i.e. the ability to recall specific facts.  For example, perhaps you recall that the Magna Carta was issued in the year 1215 [1].  Moving up the tree, we have “understand” and “apply.”  These levels might lead us to ask, “What did the Magna Carta say?” and “How did it change life in England?”  At the top of the tree, we have “analyze,” “evaluate” and “create.” Here the questions could be very open-ended: “What are some problems with the Magna Carta?”  or “How would you write your own version of the Magna Carta?”

So, here’s the problem with rote memorization as it pertains to software development (sadly, we don’t have time to discuss rote memorization as it pertains to medieval English legal documents).  Rote memorization is at the very bottom of the cognitive tree, and nearly all of software development is done at the higher levels. The kinds of skills that can be assessed by testing if someone knows the name of .Net’s buffered string class [2] are absolutely the most basic programming skills you could assess.  You’d be just as well asking someone what the syntax is for an if statement.  Moreover, the context-aware tab-completion in most IDEs has pretty much eliminated the need to recall exact class or method names while you’re writing code.

Unfortunately for the certification exam writers, the higher cognitive levels are very subjective.  We’re talking here about asking questions like, “Is this function good?” or “What are some problems with this code?”  In trivial cases, there may be “correct” answers to these questions, but most of the time, there’s room for differences of opinion and discussion.  And, a lot of times, it’s more important that you approach the problem in an intelligent way than that you get the “right” answer.  I know there’s a (deserved) backlash against the whole Interview 2.0 thing, but I do think that the idea of asking questions with no clear answer during interviews definitely has merit.

I guess the bottom line is that we are all ignorant of a great many things and it’s OK to be ignorant, because ignorance is an easily correctable condition.  When you think about it, ignorance is the default state for everyone.  Perhaps you know nothing about functional programming.  Perhaps you never learned assembly language.  In my case, I’m a total noob at networking; I understand the basics, but that’s about it.  But you don’t remedy ignorance with memorization; you remedy it with clear understanding and application of concepts.

Unless you specifically go out of your way to correct your ignorance, you’ll always stay ignorant.  And the only kind of ignorance that’s unforgivable is willful ignorance: actively refusing to find out things you don’t know.



  1. I had to look that up, actually.
  2. It’s called StringBuilder, and I didn’t have to look that one up.  See?  Passive memorization!

Magical Thinking, Magical Code

Natives who beat drums to drive off evil spirits are objects of scorn to smart Americans who blow horns to break up traffic jams. — Mary Ellen Kelly

Wikipedia briefly defines “magical thinking” as “nonscientific causal reasoning,” and it can be broadly applied to all kinds of superstitious behavior.  We all know someone who believes that they have a good luck charm.  There are two basic rules that most kinds of magical thinking have: the law of similarity (an effect resembles its cause), and the law of contagion (things which were once in physical contact maintain a connection even after physical contact has been broken).

While magical thinking can be understood academically as a prescientific way of reasoning about cause and effect, it’s remarkable how prevalent forms of magical thought are in software development.

First, here’s an example from my own experience.  In .Net 1.1 there was an object called DataGrid (it’s still around, but it has been supplanted by DataGridView).  Every so often, we’d be doing development on a Windows application and we couldn’t get updated rows from the grid to store back in the database, or process, or whatever.  The rows would appear in the grid, but the code couldn’t see them.  At some point, someone tried the following call:

DataSet.AcceptChanges();

on the underlying data set.  And it fixed the bug. Nobody really knew why.  To be perfectly honest, the documentation for the call didn’t really seem like it applied to our problem at all.  However, adding this code fixed our bug, so, as a result, AcceptChanges became the first thing that anyone would try when they were having trouble with a DataSet and the solution wasn’t obvious.  A lot of times, the call didn’t fix the problem, but it didn’t do anything bad, so it would stay in the code, even after the real bug had been fixed.

Automatically trying the code that fixed the last similar bug leads to a phenomenon that I call magical code (sometimes called “cargo-cult programming”).  Code structures and function calls are repeated all over the place, not because their effect is well-understood and required in that instance, but because of a random guess at what might fix a problem (based on experience with past fixes), along with a sort of prayer to the compiler that the change will fix the problem.  Despite being sloppy as hell, I’m fairly sure most developers do this at one time or another (I know I have).

A Cargo Cult in the South Pacific

Logically, there are two problems here:

  1. Not understanding real causes. Just as the cargo cults in the South Pacific thought that building landing strips and control towers would cause planes to land with cargo, the developer who inserts code without a specific reason is expecting a certain effect without truly knowing the thing that will cause it.  Just because a piece of code appears in working code and is missing in broken code does not mean that the code should be copied from the good code to the broken code to try to fix it.  It sounds silly that anyone would do this, but this is exactly what people are doing when they add magical code.
  2. Not noticing the real effects. Every piece of code does something.  Even a NOP instruction has an effect (the processor spends at least a clock cycle doing nothing).  Even code optimized out by the compiler may cause compilation to take longer.  By inserting code without understanding what it does, you’re potentially introducing hard-to-find bugs into the system.  These kinds of problems are even worse in an unmanaged language like C++, where magical code can introduce undetected memory leaks.

Obviously, it’s best to understand exactly what is going on in every piece of code that we work on.  Some would go so far as to say that if you don’t understand why your bug fix works, then it’s not really a fix at all and the bug is still open.

Unfortunately, in the real world, everyone has to work on stuff that they don’t fully understand.  The bigger your system is, the more likely that you’ll find yourself completely in the dark about certain parts of it.  However, there are some ways to mitigate the “magical code” effect in your system:

  • Code reviews. This is mainly to get rid of code that doesn’t do anything.  If 3 or 4 people can sit down and independently verify that a piece of code has no useful effect, then you should probably get rid of it.  If you ever need it back, hey, it’s in source control, right?  Regardless of the magical nature  of the code, just having code around that has no effect is pretty bad by itself, nearly as bad as incorrect comments.
  • Review bug fixes. The person who is most experienced with a module should review all the fixes that are made inside that module.  In a perfect world, this person would do all the fixes as well, but, generally, the most experienced people are the ones who have the most demand on their time and they often don’t have time to debug and fix problems.
  • Get educated. Spend some time learning about the way the built-in pieces of your framework function.  In larger systems, the same problem can occur with some of your own internal classes.  For example, developers may call your in-house Reset() or Clear() methods much more frequently than they need to, in an attempt to exorcise demons from their code.  Or, in a multi-threaded app, they may use locks much more frequently than required.  Develop little test programs or unit tests to prove to yourself (and others) that certain classes behave in a certain way.
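To make the “little test programs” idea concrete, here’s a hedged sketch using Python’s standard `unittest` module.  `RowBuffer` is a made-up stand-in for whatever in-house class has the `Reset()` or `Clear()` method that people keep calling superstitiously; the tests pin down what the method actually does, so nobody has to guess.

```python
import unittest


class RowBuffer:
    """Hypothetical in-house class with a clear() method (illustration only)."""

    def __init__(self) -> None:
        self._rows = []

    def add(self, row: dict) -> None:
        self._rows.append(row)

    def clear(self) -> None:
        self._rows.clear()

    def __len__(self) -> int:
        return len(self._rows)


class RowBufferBehavior(unittest.TestCase):
    def test_new_buffer_is_already_empty(self):
        # No defensive clear() is needed before first use.
        self.assertEqual(len(RowBuffer()), 0)

    def test_clear_is_idempotent(self):
        buf = RowBuffer()
        buf.add({"id": 1})
        buf.clear()
        buf.clear()  # the second call changes nothing
        self.assertEqual(len(buf), 0)

    def test_rows_added_after_clear_are_kept(self):
        buf = RowBuffer()
        buf.add({"id": 1})
        buf.clear()
        buf.add({"id": 2})
        self.assertEqual(len(buf), 1)


def run_behavior_tests() -> bool:
    """Run the tests above and report whether they all passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RowBufferBehavior)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("all passed:", run_behavior_tests())
```

Once tests like these exist, “call clear() just in case” stops being a matter of faith: anyone can read (or run) the proof that the extra call is redundant.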

I’m not saying that everyone needs to understand the whole application stack from top to bottom to be able to modify code.  In fact, one of the whole points of abstraction and encapsulation is to hide how things work.  The black-box presentation of many system calls encourages developers to view them as being “magic” in the sense that they do what you want them to do, even if you have no clue how they actually accomplish that effect.

The kind of magical code I’m talking about is about not even knowing what certain methods do, but just superstitiously adding them to try to make things work.