Refactoring and the Bishonen Line

August 19, 2008 – 9:06 am

If sci-fi and fantasy have taught us anything, it’s that we should never judge anything by its size.  In fact, there’s even a geeky principle called The Bishonen Line which is:

…the tendency of monster creatures (especially evil ones) to become big and disfigured as they increase their power, then suddenly shrink back down to human proportions. This can be used to indicate that the character has reached a point where he has “full control” of his powers, and therefore can access them selectively — taking the ass-kicking abilities while leaving the giant-scary-monster abilities.

In development, we call taking huge, disfigured and monstrous code and shrinking it back down to human size “refactoring.”  “Refactoring” is a fancy-sounding term that often means “redoing,” and the point of it is to make big, ugly code into smaller, readable code.

So, if we take the Bishonen Line as our principle, our goal when refactoring should be to give the code “full control” of its powers, allowing it to do the ass-kicking without the whole scary-monster thing.

This runs counter to the assumption made by inexperienced development managers (and some inexperienced coders, as well) that writing lots of lines of code is the sign of a proficient and/or hard-working developer.  If you’re spending time reading a blog about software development, you probably already know that lines-of-code is a crap metric, so I won’t belabor the point, except to say that I’m the most proud of the changes I’ve made where I’ve eliminated 1000 lines of code in a single check-in.

Don't be unwise, judge me not by my size.

Realistically, a method or class will start with one well-defined purpose, then have some additional, kind of related functionality added into it, have a bunch of assumptions made that later turn out to not always be true, be forced to call into methods only to jump back out because they aren’t relevant for this particular processing, allocate some memory that never gets used (or freed), throw an exception and catch it in the same method, and then return an error code.

Good developers end up with horrifically mutated code that looks like this all the time, and they refactor it when they have the chance.  Bad developers end up with horrifically mutated code that looks like this all the time, and don’t even recognize this is a problem to be fixed.

There are many different methods (some links to which I’ve listed below) that can be used to refactor code to make it more readable, maintainable and smaller (but not usually to make it more efficient, although this can be a goal of refactoring), but there are some things that one should specifically be careful of:

  1. Resist “clever” hacks. Making the code smaller is only one of the goals of refactoring.  If to refactor the code you have to introduce something like Duff’s Device (outside the context of an embedded system or a really hot loop), you’re probably not really improving things. Situations will vary, of course, but learn to sanity-check yourself if you start to feel yourself grinning uncontrollably at your own cleverness.
  2. Use automated refactoring tools carefully. The right-click action “Extract Method…” in Microsoft Visual Studio is very adept at creating methods with tons of parameters.  It can analyze the block you’re asking it to extract and determine which are the free variables that would need to be passed as parameters and often what the return value(s) of the method should be, but it can easily be used to create methods that no sane developer would write from scratch.  Use it and similar automated methods carefully.
  3. Avoid adding lots of parameters to new methods. You want to minimize the amount of data that each method has to operate on, but if each new method you introduce during refactoring takes everything as a parameter to it, you won’t have any guarantee that the method won’t use or manipulate data inside of one of the values passed to it.  One of the goals of refactoring is to reduce the number of parameters that a method takes, and give access to data only inside of the methods that manipulate it.
  4. Don’t break conventions lightly. If you have a convention that’s followed throughout your code, don’t break it so you can refactor some code in one module or class.  Maybe the convention is bad and needs to be changed everywhere.  Honestly, if you have conventions that people actually follow consistently, you should consider yourself to be quite lucky.
  5. Avoid confusing people. If something is a part of the language that’s rarely used or that many people have trouble understanding, be sure to comment the code well.  For example, lots of folks have trouble understanding the new functional programming constructs that are in C# 3.0 (the version that shipped with .NET 3.5), not because they’re slow or stupid, but just because they haven’t been exposed to functional programming before.  These new features are great for making code more compact and possibly even more efficient, but you need to make sure that the people you’re working with understand what the hell you’re doing.
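
To make that last point concrete, here’s a rough sketch of the same logic written twice: once as a plain loop, and once with the LINQ extension methods and lambdas, commented for a reader who hasn’t seen them before.  The class and method names (NameFilters, ShortNamesLoop, ShortNamesLinq) are hypothetical and only there for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical example class; not taken from any real code base.
public static class NameFilters
{
    // Loop version: longer, but readable by anyone who knows C# 2.0.
    public static List<string> ShortNamesLoop(IEnumerable<string> names, int maxLength)
    {
        List<string> result = new List<string>();
        foreach (string name in names)
        {
            if (name.Length <= maxLength)
            {
                result.Add(name.ToUpperInvariant());
            }
        }
        return result;
    }

    // LINQ version: the same logic in a single expression.
    // Where(...) keeps only the names for which the lambda returns true.
    // Select(...) transforms each name that survived the filter.
    // ToList() runs the query right away and copies the results into a list.
    public static List<string> ShortNamesLinq(IEnumerable<string> names, int maxLength)
    {
        return names
            .Where(name => name.Length <= maxLength)
            .Select(name => name.ToUpperInvariant())
            .ToList();
    }
}
```

The second version is about half the size, but the three comment lines are what make it safe to hand to a teammate who has never seen a lambda.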

If you’re interested in some techniques for refactoring (things you should try to do when refactoring), here are some links:

Refactoring is something that takes time, to be sure, but it pays dividends down the road, and you shouldn’t let the opportunity to spend some time cleaning up your code pass you by.

P.S. When used by itself, the word “bishonen” means, uh, something slightly different.


Why Blog About Software?

August 14, 2008 – 9:01 am

So, I’ve recently made the commitment to myself to update this blog twice a week, and, thus far, I’ve been able to stick to my commitment.  But, I do have to ask myself: why does the world need yet another software blog?

So, these are the reasons I’m blogging, both as a note to myself and so readers can check me if I lose my way in the future:

  1. I want to better my own software engineering skills. For every post that I write, I do research and I learn all kinds of things I didn’t know before.  By writing about it afterward, I think I retain the things I learn much better.  Additionally, there’s always the (slim) possibility that someone else will benefit from the information I’ve gathered.
  2. I love reading and writing about software. I still find programming and software development as interesting as I did when I first got into it.  In fact, it’s possible that I find it more interesting now than I used to.  Blogging is a way of dealing with this (hopefully healthy) obsession.
  3. I want to hear opinions other than my own. I really do.  Comments are the big thing that distinguishes a blog from a book or a lecture; readers are allowed to discuss, criticize and add on to the content of the original post.  And this can go on for months after the post is originally published.

Additionally, the following are NOT reasons that I’m blogging, so if I ever catch myself doing these things I can stop myself:

  1. To get into a pissing match with other bloggers/writers. I’m all about interesting and constructive discussion and debate, but I’m not interested in getting into a contest of wills or a “who’s smarter” kind of contest.  With this in mind, I pledge not to ever link to another blog post just to trash the post or its author, and I reserve the right to ignore such links to any of my posts.
  2. To promote/disparage any specific technology or platform. I have technologies that I use on a daily basis, and I will tend to write about the things that I know.  This means that posts will tend to lean towards Windows development, Web development and .Net.  However, this should certainly not be seen as an endorsement of this platform over any other.  Generally speaking, developers don’t usually have the luxury of choosing all the technologies they will work with, and I think that a lot of software engineering is about dealing with the constraints that are placed on the development process from outside.
  3. To promote myself. I’m a mediocre developer at best.  I don’t have a lot of experience, I’m quite ignorant in certain technical areas, and overall I try to keep my ego in check as best I can.  I never want this site to turn into “Nathan on Software.”  No offense intended to Joel; his site is wonderful, and he has the knowledge and experience to make a site which actually is about his opinion on software.  The site is about Software, not about Me.

GUIDs are Great

August 12, 2008 – 9:02 am

Whenever someone says they’re going to use a GUID for something, I make it a point to always respond, “No!  Don’t use a GUID there!  If you use one there, eventually you’ll use them all up and we won’t have any left!”

Of course, GUIDs don’t get “used up,” and there are plenty to go around (enough for every star in the observable universe to have about 6.8 quadrillion GUIDs to itself), but the point of this post is to tell you how to use them.

A Globally-Unique Identifier (GUID) is generated algorithmically, usually using your network card’s MAC address (edit: this is no longer the case, please see comments below) as well as the number of 100 nanosecond periods elapsed since 12:00 AM on October 15, 1582 (naturally).  Using these two pieces of data ensures that your computer won’t generate the same GUID twice and two different computers won’t generate the same GUID at the same time.  It’s not a “100% guarantee” but it’s very, very reliable.  What you get is something that looks like this (as a string):

3F2504E0-4F89-11D3-9A0C-0305E82C3301
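
If you’re curious what that timestamp looks like, here’s a small sketch that counts the 100 nanosecond periods between October 15, 1582 and a given moment.  It leans on the fact that a .NET DateTime tick is also exactly 100 nanoseconds.  GuidTimestamp and its method are hypothetical names, and this only computes the number described above; it is not a claim about what Guid.NewGuid() does internally.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GuidTimestamp
{
    // Midnight on October 15, 1582: the epoch used by time-based GUIDs.
    private static readonly DateTime GregorianEpoch =
        new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc);

    // A .NET tick is 100 nanoseconds, so the tick difference between
    // two DateTime values is already in the unit we want.
    public static long HundredNanosecondPeriodsSinceEpoch(DateTime utcMoment)
    {
        return (utcMoment - GregorianEpoch).Ticks;
    }
}
```

For midnight UTC on January 1, 1970, this returns 122,192,928,000,000,000.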

Despite being a bit unwieldy to type and look at, GUIDs are really good for uniquely identifying logs, transactions, database records and all kinds of things in your code.  On modern hardware, they can be generated quickly and the standard algorithm allows for 10 million GUIDs to be generated per second without conflict.

Often developers will use a timestamp, a random number or both to try to unique-ify some filename or result.  This is called “reinventing the wheel.”  Use a GUID instead, any time you want to tag anything that can’t have a conflict.  Save yourself future headaches.
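
For the filename case, a minimal sketch might look like the following.  Guid.NewGuid() and the "D" format string are part of the .NET Framework; UniqueNames and NewFileName are hypothetical names.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class UniqueNames
{
    // Builds a file name of the form "<prefix>-<guid><extension>".
    // The prefix and extension are whatever the caller wants; the GUID
    // in the middle is what keeps two files from colliding.
    public static string NewFileName(string prefix, string extension)
    {
        // "D" is the familiar format: 32 hex digits separated by hyphens.
        string id = Guid.NewGuid().ToString("D");
        return prefix + "-" + id + extension;
    }
}
```

Calling UniqueNames.NewFileName("report", ".log") twice gives two different names, with no timestamp or random number juggling required.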

If you want to see the classic example of a bug resulting from the practice of using a timestamp as a unique id, look no further than Lotus Notes.  In older versions of Notes, “unique” document ids were generated using a timestamp with a resolution of 1/100th of a second.  To avoid an id conflict when new documents were created very quickly, the implementation would increment an internal counter when it saw a conflict in ids.  The end result was that if you managed to create, say, 100,000 documents in a single second, you would have documents that appeared to be created 1,000 seconds (about 16 minutes) in the future (since the unique id and the create date were actually the same field).  This disconnect with reality then could cause databases to fail to replicate properly.  The problem actually still exists in current versions of Notes, although it has been alleviated by some other architectural changes (but they still use creation date as the unique id).
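
To see how the drift adds up, here’s a toy model of the “bump on conflict” idea.  This is not the actual Notes implementation, just a sketch under the assumptions described above: ids are timestamps in ticks of 1/100th of a second, and a conflicting id is pushed to the next free tick.  TimestampIdGenerator and its members are hypothetical names.

```csharp
// Toy model only; the real Notes code is not shown in this post.
public class TimestampIdGenerator
{
    // Ids are timestamps measured in ticks of 1/100th of a second.
    public const int TicksPerSecond = 100;

    // The most recent id handed out (-1 means none yet).
    private long lastId = -1;

    // Returns a unique id for a document created at the given clock tick.
    // If that tick is already taken, the id is bumped to the next free tick,
    // which also pushes the apparent creation time into the future.
    public long NextId(long currentTick)
    {
        long id = currentTick > lastId ? currentTick : lastId + 1;
        lastId = id;
        return id;
    }

    // How many seconds an id is running ahead of the real clock.
    public static double SecondsAhead(long id, long currentTick)
    {
        return (id - currentTick) / (double)TicksPerSecond;
    }
}
```

If you ask this generator for 100,000 ids without the clock advancing, the last id is 99,999 ticks ahead of the clock, which is just under 1,000 seconds.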

In defense of the Lotus Notes developers, when it was designed, there was no hardware that could generate records that quickly, and I’m fairly certain that the GUID standard wasn’t around at the time.

Nowadays, there’s no excuse to roll your own unique identifier system.  Use a GUID!

If you’re interested in the internals of how GUID generation is implemented, I’d recommend looking at this document (PDF).


The Money Class

August 8, 2008 – 9:30 am

Something which is conspicuously missing from the .Net Framework is a Money class for runtime calculations on currency amounts. Such a class’s primary purpose would be to eliminate the round-off errors that you see when using float or double for money amounts, as well as providing runtime checks against adding or subtracting two money amounts with different currencies.  (There actually is a type called SqlMoney in the framework, in System.Data.SqlTypes, but it’s meant for SQL Server money type columns, not for general-purpose calculations on currency amounts.)  What I’m talking about is the ability to do something like:

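// Hypothetical API: MoneyAmount and CurrencyInfo don't exist in the framework.
MoneyAmount m1 = new MoneyAmount();
m1.CurrencyInfo = CurrencyInfo.USD;
m1.Cents = 9995;    // 99.95 US dollars

MoneyAmount m2 = new MoneyAmount();
m2.CurrencyInfo = CurrencyInfo.CAD;
m2.Cents = 1120;    // 11.20 Canadian dollars

// Convert explicitly so both operands are in the same currency.
MoneyAmount m2InUSD = m2.ConvertTo(CurrencyInfo.USD);

MoneyAmount m3 = m1 + m2InUSD;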

m3 would then contain an amount in USD cents that could be formatted for display or stored in a database or whatever.
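
As for the round-off errors mentioned earlier, here’s a tiny sketch of why storing whole cents helps.  It adds ten cents to a total ten times, once as a double and once as an integer number of cents.  RoundOffDemo and its methods are hypothetical names.

```csharp
// Hypothetical demo class, for illustration only.
public static class RoundOffDemo
{
    // Adds ten cents to a running total ten times, using double.
    // 0.10 has no exact binary representation, so a small error creeps in.
    public static double TenDimesAsDouble()
    {
        double total = 0.0;
        for (int i = 0; i < 10; i++)
        {
            total += 0.10;
        }
        return total;
    }

    // The same sum kept as a whole number of cents is exact.
    public static long TenDimesAsCents()
    {
        long totalCents = 0;
        for (int i = 0; i < 10; i++)
        {
            totalCents += 10;
        }
        return totalCents;
    }
}
```

On a typical 64-bit runtime, TenDimesAsDouble() == 1.0 comes out false because the total lands a hair under one dollar, while TenDimesAsCents() is exactly 100.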

There are lots of third-party implementations of a Money class for most languages and frameworks, but all of them face a similar set of challenges:

  1. How to deal with currency conversions. Should the conversion be explicit or implicit?  Should developers have to check for currency match before every arithmetic operation they do on two money objects? Or should they expect the class to do the conversion for them?  If the class does the conversion for them, which currency should it convert to?  The one that comes first?  That could be confusing.
    Also, in my example above, I used the property Cents to express the smallest integral value of the currency type, but not every currency has something analogous to “Dollars” and “Cents” (the most obvious example being Japanese Yen, but it’s also the case with Turkish Lira, Lebanese Lira … and, well, pretty much anything called Lira, actually).  How to deal with these inconsistencies is also an important design decision (especially since the conversion rate is usually given between dollars and yen, but the amounts stored might be cents and yen, changing the conversion rate by a factor of 100).
  2. Where to store conversion rates. A more flexible framework would allow the developer to pass the conversion rate when they ask for the currency to be converted.  However, this would introduce some annoying overhead to writing code that does lots of conversions.  But, any common structure that stored conversion rates inside the class would need to be made thread-safe.  Also, conversion rates need to be updated regularly, perhaps even multiple times a day, depending on the application.
  3. Which arithmetic operations to allow. Adding and subtracting money amounts is something which makes sense, but multiplying or dividing two money amounts probably doesn’t.  So addition and subtraction would be defined between money amounts, but multiplication and division are only defined between a money amount and a scalar numeric value (yielding another money amount). I’m not really sure if modulus makes any sense at all, but it might in some applications.  Again, this is something which would vary from one application to the next.
  4. Round off. If you’re calculating interest or any type of operation where you take a percentage of the money amount, you’re going to have fractional cents lying around.  What happens to them is either a matter of business process or legal requirement, although I suppose you could always transfer them to a personal account.
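
To show how these decisions end up in code, here’s a rough sketch that picks one answer for each question: conversions are explicit and the caller supplies the rate, only amounts in the same currency can be added or subtracted, multiplication is by a scalar only, and the rounding rule is passed in rather than assumed.  It is one possible design, not the right one.  The members shown here (MinorUnits, MinorUnitsPerMajor, MultiplyBy, the three-argument ConvertTo) are hypothetical and differ from the snippet above, and the exchange rates used with it are made up.

```csharp
using System;

// Hypothetical types: one possible set of answers to the questions above.
public sealed class CurrencyInfo
{
    public readonly string Code;
    // Smallest units per major unit: 100 cents per dollar, 1 for yen.
    public readonly int MinorUnitsPerMajor;

    private CurrencyInfo(string code, int minorUnitsPerMajor)
    {
        Code = code;
        MinorUnitsPerMajor = minorUnitsPerMajor;
    }

    public static readonly CurrencyInfo USD = new CurrencyInfo("USD", 100);
    public static readonly CurrencyInfo CAD = new CurrencyInfo("CAD", 100);
    public static readonly CurrencyInfo JPY = new CurrencyInfo("JPY", 1);
}

public sealed class MoneyAmount
{
    public readonly CurrencyInfo Currency;
    // Cents for USD and CAD, whole yen for JPY.
    public readonly long MinorUnits;

    public MoneyAmount(CurrencyInfo currency, long minorUnits)
    {
        if (currency == null) throw new ArgumentNullException("currency");
        Currency = currency;
        MinorUnits = minorUnits;
    }

    // Addition and subtraction are only defined within one currency;
    // there is no implicit conversion.
    public static MoneyAmount operator +(MoneyAmount a, MoneyAmount b)
    {
        RequireSameCurrency(a, b);
        return new MoneyAmount(a.Currency, a.MinorUnits + b.MinorUnits);
    }

    public static MoneyAmount operator -(MoneyAmount a, MoneyAmount b)
    {
        RequireSameCurrency(a, b);
        return new MoneyAmount(a.Currency, a.MinorUnits - b.MinorUnits);
    }

    // Money times a scalar; the caller decides what happens to fractions.
    public MoneyAmount MultiplyBy(decimal factor, MidpointRounding rounding)
    {
        decimal exact = MinorUnits * factor;
        return new MoneyAmount(Currency, (long)Math.Round(exact, rounding));
    }

    // The rate is quoted in major units (for example, USD per one CAD),
    // so the minor-unit factors are applied here and nowhere else.
    public MoneyAmount ConvertTo(CurrencyInfo target, decimal rate, MidpointRounding rounding)
    {
        decimal sourceMajor = (decimal)MinorUnits / Currency.MinorUnitsPerMajor;
        decimal targetMinor = sourceMajor * rate * target.MinorUnitsPerMajor;
        return new MoneyAmount(target, (long)Math.Round(targetMinor, rounding));
    }

    private static void RequireSameCurrency(MoneyAmount a, MoneyAmount b)
    {
        if (a.Currency != b.Currency)
        {
            throw new InvalidOperationException(
                "Currency mismatch: " + a.Currency.Code + " vs " + b.Currency.Code);
        }
    }
}
```

With a made-up rate of 0.95 USD per CAD, converting 1120 Canadian cents gives 1064 US cents, and 1,000 yen at a made-up 0.0093 USD per yen gives 930 US cents, so the factor-of-100 trap is handled inside ConvertTo.  Taking 1% of 250 cents gives 2 cents with MidpointRounding.ToEven and 3 cents with MidpointRounding.AwayFromZero, which is exactly the kind of choice a framework can’t make for you.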

This list really illuminates the reason why these classes aren’t built into major frameworks.  There are a lot of design decisions that can only be made with information about the context in which the class would be used.  Frameworks should try to stick to doing things that they can do well, and things that they can do right for more than 90% of the expected uses.  There’s just too much customization that has to go on with a money amount class.  So, roll your own.


Using Else-If Responsibly

August 5, 2008 – 10:51 am

One of the compound constructs that exists in virtually all programming languages is the if-elseif-elseif-...-else block.  Nearly every language in use today (except for the most esoteric) has this kind of statement.  Some have a fancier, souped-up version like switch in C/C++/C#/Java, match in OCaml or case in Haskell, but the basic idea is the same: a chain of conditional statements, each dependent on all of the previous conditionals.  It’s one of the first things that any programmer learns.

However, using the else-if construct can introduce subtle, hard to find bugs, and it’s easy to see why.  Imagine this code:

if (A)
{
     // 2 pages of code
}
else if (B)
{
     // Another 2 pages of code
}
else
{
     // Do something
}

Now, obviously, any if statement with blocks that big is a candidate for refactoring, but bear with me for a moment.   The code above can be rewritten as follows, while maintaining the same semantics (assuming there are no side-effects in A or B, or anything in the if body which modifies anything that A or B refer to (thanks, Phil)):

if (A)
{
     // 2 pages of code
}

if (!A && B)
{
     // Another 2 pages of code
}

if (!A && !B)
{
     // Do something
}

The reason this construct can introduce bugs is that the conditions under which the else-if block will be executed depend upon the if condition above, which probably isn’t even visible in your editor when you’re looking at the else-if condition.  To figure out when the else-if block will be executed, you need to take the boolean inverse of A (which can be a little confusing if it’s a compound boolean statement and you need to apply DeMorgan’s laws) and AND it with the B conditional. And to find when the else clause will be executed, you need to take the inverse of both A and B and AND them together. Each additional else-if clause makes subsequent else clause conditions harder to derive.
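
Here’s a small worked example of that derivation, with made-up conditions: A is “x and y are both positive” and B is “x is even.”  The Conditions class and its method names are hypothetical; the point is only to show what the hidden conditions look like once they’re written out.

```csharp
// Hypothetical conditions, for illustration only.
public static class Conditions
{
    // A: the compound condition guarding the "if" block.
    public static bool A(int x, int y)
    {
        return x > 0 && y > 0;
    }

    // The inverse of A, written out with DeMorgan's law:
    // !(p && q) is the same as (!p || !q).
    public static bool NotA(int x, int y)
    {
        return x <= 0 || y <= 0;
    }

    // B: the condition written on the "else if" line.
    public static bool B(int x)
    {
        return x % 2 == 0;
    }

    // When the else-if block really runs: !A && B.
    public static bool ElseIfRuns(int x, int y)
    {
        return NotA(x, y) && B(x);
    }

    // When the else block really runs: !A && !B.
    public static bool ElseRuns(int x, int y)
    {
        return NotA(x, y) && !B(x);
    }
}
```

So the innocent-looking “else if (x % 2 == 0)” actually means “(x <= 0 || y <= 0) && x % 2 == 0,” and that’s with only one clause above it.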

I know what you’re thinking, “I learned this stuff in first year computer science, it’s easy.”  Fine, but one of the best things you can do to improve the quality of your software is to manage complexity in your code.  Else-if is something that we have to use, obviously, but I’d like to make the following humble suggestions to use it more responsibly:

  1. Don’t use more than 3 total clauses in a chain. That means if-elseif-else is the limit.  If your logic is more complicated than this, I suggest adding a function in the middle to dispatch to the various sub-cases, except in very simple situations.  It’s also a good opportunity to add some “self-documenting” code if you give your dispatch methods good names.
  2. Refactor the conditional expressions into separate methods. Any conditional expression with more than 2 sub-expressions should probably be refactored into a separate method with a name that indicates what case the compound check is trying to handle.  Hey, it’s “self-documenting”, too!
  3. Refactor the bodies of each case into separate methods. That way, it’s easy to see all of the different conditions on one screen of text.
  4. Use switch instead. Or whatever compound conditional statement your language supports, keeping in mind that switch and the like generally only support looking at equality between a variable and another variable or literal, rather than any arbitrary boolean expression.
  5. Always have an else or default case. Even if it doesn’t do anything, it’s a great place for a comment about when/why all of the tests will fail, or you can put Debug.Assert(false); there if it should never be reached, to catch cases (during testing, not after release) when your callers are potentially passing bad values.
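
Put together, the suggestions above might look something like this sketch.  Everything in it (the Orders class, the CustomerKind enum, the prices and thresholds) is made up for illustration.  One deliberate swap: in the default case I throw an exception instead of calling Debug.Assert(false), so that the method still compiles without a return value there and the check also fires in release builds.

```csharp
using System;

// Hypothetical example types, for illustration only.
public enum CustomerKind { Regular, Premium, Employee }

public static class Orders
{
    // Suggestions 4 and 5: a switch on a simple value, with an explicit
    // default case for values that should never show up.
    public static int DiscountPercent(CustomerKind kind)
    {
        switch (kind)
        {
            case CustomerKind.Regular: return 0;
            case CustomerKind.Premium: return 10;
            case CustomerKind.Employee: return 25;
            default:
                // Only reachable if a new kind is added without updating this switch.
                throw new ArgumentOutOfRangeException("kind");
        }
    }

    // Suggestion 2: a compound condition gets a name that says what it means.
    public static bool QualifiesForFreeShipping(int totalCents, bool isDomestic, bool hasCoupon)
    {
        return isDomestic && (totalCents >= 5000 || hasCoupon);
    }

    // Suggestions 1 and 3: three clauses at most, and each body is a single
    // call, so the whole decision fits on one screen.
    public static int ShippingCents(int totalCents, bool isDomestic, bool hasCoupon)
    {
        if (QualifiesForFreeShipping(totalCents, isDomestic, hasCoupon))
        {
            return FreeShipping();
        }
        else if (isDomestic)
        {
            return DomesticShipping();
        }
        else
        {
            // Everything that is not domestic ends up here, coupon or not.
            return InternationalShipping();
        }
    }

    private static int FreeShipping() { return 0; }
    private static int DomesticShipping() { return 599; }
    private static int InternationalShipping() { return 2499; }
}
```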

Some C and C++ programmers may be preoccupied with maximizing the efficiency of the compiler-generated assembly code for multiple conditional branches, and may write their code accordingly.  These people are verifiably insane, and their ideas should be dismissed with extreme prejudice.