Writing Good Code

From AIRWiki
Revision as of 17:44, 27 April 2009 by BernardoDalSeno (Talk | contribs) (C and C++ code)

Jump to: navigation, search

Why this page

When tutoring students for their theses or projects, I often find many problems with the code they write. I'm not referring to bugs; bugs happen, as flu happens, although there are some things you can do to make them more unlikely. The problem discussed here is Bad codeTM, i.e., code that nobody can read or understand, not even its author. So I've decided to write this page with some advice on how to write Good codeTM. Please note that everything you find in this page should not be considered as strict rules, as it expresses the point of view of its author(s); but a reasonable piece of advice needs a good reason to counter it, in order not to follow it; so at least think about it. Contributions are welcome.

--Bernardo

General

The source of the program is not only a way to obtain a binary that your computer can run, it is a way to express ideas. They should be clear to the compiler, so it compiles and you have your nice executable, but they should be clear also to anyone that reads your code, including your fellow students, your supervisor, and, of course, you — even after a couple of months. Write for your fellows, not only for the computer!

I Think Therefore I Program

And the inverse is true: you cannot program without thinking. So, before rushing to the keyboard, take a piece of paper and try to lay down your ideas. A clear idea of the structure of data and the algorithm you want to apply is important to write a properly working program. Think about how you're going to use the data and shape the data structures accordingly; think about how to factorize your algorithm, what functions (methods) and what parameters you need.

Names

Names should be meaningful, clear, short (at least, not too long), and to the point. That applies to pretty everything: classes, variables, functions, modules, members. Remember: the compiler doesn't care about names, but humans do; your colleagues are humans, and probably they want to understand your code when they happen to read it. If the project you are working on or the language you are using come with a naming convention just follow it; separate words in names by using underscores '_' or mixedCaseLikeThis.

In general, you want to use nouns or adjectives as names for variable and attributes, while verbs are appropriate for functions and methods. Names with a wider scope (classes, public methods, global variables, global functions...) require more care: They cannot be too simple to avoid clashing with local names, and both the meaning and the context should be clear. 'tmp' can be a good name for a local variable with a life of a couple of lines of code, or 'k' is good for a loop counter, but they are really bad names for anything with a scope wider than a loop. For example, if you have to store the name of a temporary file where you save the result of the application of a fast Fourier transform, something like 'fft_file_name' is more appropriate; if you have a counter that keeps track of the number of time you called the 'evaluate_fitness()' method (e.g., for statistical purposes), you could chose something like 'fitness_call_count'. Completely uninformative and generic names like 'var', 'flag', 'foo', 'pippo', 'number'... are always bad.

Comments

A big problem is the use of comments. You can encounter projects with hundreds of lines of code and not a single line of comment, so that it is hard to guess even the general purpose of the program, or projects with ultra-verbose comments, like this:

k = k + 1; // increment k

Yeah, thanks, I thought it was extracting the square root of k.

Comments are important, and their correct use improves the quality of the code, even when there are no comments. Really! Often comments are written as a patch to badly-written code, but a tangled piece of code is not a way to clearly convey your ideas; and a comment added to a tangled piece of code doesn't make it much clearer. Besides, if you can't understand a piece of code, does some comment raise your confidence that the code is really working? Tangled code is more likely to be bugged, and more difficult to modify, so please write your algorithm in a way that it can be understood just be reading the instructions, not the comments. There are good places where to write obfuscated code, and a thesis is not one of them.

Just an example (taken from real code! names have been changed to protect the innocent):

#define NUM_IT 3
for (j = 0; j < NUM_IT; ++j) {
    if (j == 0) { // First iteration
        // Do something
        ...
    } else if (j == 1) { // Second iteration
        // Do something else
        ...
    } else if (j == 2) { // Third iteration
        // Do things
        ...       
    }
}

This is a rather confusing way to write a simple sequence:

// Do something
...
// Do something else
...
// Do things
...       

So, what to comment? Good places for comments are class declarations, functions or methods (you tell what the function is about, what the parameters are, and describe the expected results and side effects), class attribute declarations, global or important variables, the beginning of a file, and a few others. Also, when you take a tricky decision, please document it!

Debugging

There is no program without a bug (or was it "There is no rose without a bug"?). Your programs are no exception, so you'll have to remove bugs. Many books could be written about debugging (and they are), but here just a few ideas are hinted.

When you experience a failure, don't rush hacking at the code until it disappears. We are in an engineering school, so use the engineering method: build a model. In other words, try to pinpoint the error that caused the failure, before trying to remove it. Do tests and explore the functioning of your program in order to find the problem; try to find a way to make the failure reproducible, so after you correct the error you can check that everything is working properly. Debuggers are useful to inspect the state of the program, but you can use also assertions and printfs (or the output function of your favorite language) when you can't or don't want to run a debugger.

Sometimes you have to write some debugging code to check conditions, to validate data structures, or print out the value of variables. This code is not needed when you've finished debugging, but this no reason for removing it. If you spent time writing code, don't waste your effort; you or someone else may need it in the future to debug a similar problem. Leave it in place; surround it with conditionals: #ifdef DEBUG if you are using C, or if (debug) in Java or other languages that have no preprocessor (debug is a global variable in this case); in this way, it is easy to activate it. Unless the code is inside a deeply-nested loop, the performance hit of a run-time if is negligible; if (and only if) your debugging code is an a deeply-nested loop, wrap it in a comment.

If you write test cases (or better yet test scripts), don't throw them away. Rerun your tests periodically, so you can avoid regressions. There are even tools that help in this, like JUnit (for Java; there are ports for C++ and other languages).

Optimization

You want your code to be as fast as possible, right? Wrong! Really, do you care if your favorite word processor saves a microsecond every time you type a character, or would you trade that microsecond for a greater reliability, so that it doesn't crash losing half a chapter of your precious thesis?

Optimizing code require (your) time, and sometimes the code becomes less readable and maintainable. So, concentrate your efforts on the parts of the program that really require it; measure the speed of your program, and use a profiler to identify the bottlenecks. A simple rule of thumb can be given: anything that is not inside a doubly-nested loop is not worth optimizing. Always measure your progress, as bottlenecks may change.

And never forget that a better algorithm beats any tweaking of your code. For example, if you are looking for the maximum in an array, sorting the array and taking the first element is a bad solution (it has at least O(n log n) complexity); if you don't need the sorted array for other purposes, a linear sweep of the array is faster (it's O(n)) and requires less memory.

Indentation and spaces

In most languages, spaces and indentation are not part of the syntax, and they are mostly ignored by the compiler (Python is a significant exception), yet indentation and spaces are very useful to format your code and make it more readable. There are many way to use them, but the first rule is 'consistency'. Choose the style you like the most, and stick to it; particularly, choose if you want to use spaces or tabs for indentation, and be consistent, otherwise when someone else opens your project with a different editor with a different idea of tab length, your nice (you made it nice, didn't you?) indentation will be screwed up.

Warnings

When compiled, your program should not raise any warnings. Any warnings, which includes harmless warnings. Compilers generate warnings for a reason: you have written a code that does something suspicious and you should look into it. So, if someone else compiles your program, should they also check that all the warnings of your code are harmless? Or how can they be sure that you effectively checked all warnings and you're not just a sloppy coder? :-) And what happens if they (or you) change your code that raises a gazillion of warnings? How can anyone spot any new warning? So, please make sure that your code raises no warnings, and whoever will use it in the future will thank you for that.

C and C++ code

The Linux kernel coding style is an interesting reading, particularly chapters 3 (Placing Braces and Spaces), 4 (Naming), 6 (Functions), 8 (Commenting), 12 (Macros).

C (and C++) have a fair number of operators, with strictly defined precedences. Even if you know all the precedence rules by heart, don't assume others do; so, please use parentheses when writing complex expressions.

Header files

Apparently, header (include) files can be problematic, probably because of the Java background common to many people nowadasys. If you any doubt about them, Wikipedia does a good job in briefly explaining what header files are.

Major misconceptions regards what goes inside a header file. This a topic better described in a programming manual (and you've already read it, haven't you?), but a good rule of thumb is: only declarations and definitions that neither reserve memory nor produce code.

There can be problems with header files, due to multiple or recursive inclusions. To avoid them, you can wrap their content in a #ifndef...#endif construct (inside the header file!), like this:

#ifndef MY_PROJECT_MY_HEADER_H 
#define MY_PROJECT_MY_HEADER_H
/* Real content of the file */
...
#endif

The name of the macro should be something that is unlikely to clash with other macros; e.g., prepend the name of your project.

Matlab

Matlab is a rather slow interpreted language, but it shines — as its name implies — at matrix manipulation. So try to avoid loops, and do operations in parallel on arrays. Many vectorized functions are built-in, so look in the help when you need simple operations like sums, means, maximum...

Matlab has a tool, mlint, that checks the code for easily detectable problems. It is active by default in the Matlab editor, and it highlight problems with jagged orange lines under the troubled code (and corresponding orange indicators on the right-hand bar). Don't ignore them, unless you are sure they are harmless; look up in the help or on the Web if don't understand a problem.

One of the problems spotted by mlint is the dynamic growth of arrays. Avoid that; it slows Matlab, as it turns O(n) algorithms into O(n²). For example,

a = []; 
for k = 1:50000
    a(k) = k;
end

takes more than 30 seconds on a machine, while

a = zeros(50000,1);
for k = 1:50000
    a(k) = k;
end

takes only 0.1 seconds on the same machine. See the full explanation on the Mathworks Web site.

Remember not to trust mlint too much; there is no perfect static analyzer (you remember Rice's theorem, right?). If you have no warnings, it doesn't mean that your program is perfect.