Monday, June 18, 2007

Testing for Zero bugs

Testing for Zero bugs


The Software Quality Myth

A widely accepted premise on software quality is that software is so complex (in combinatorial terms) that it is impossible to have defect free software. We often hear maxims like "there's always one more bug", and "software testing can reveal the existence of bugs, but never prove their absence".

While the above maxims may be true, I'm afraid they tend to lead us to a state of mind where we accept them as an inevitable evil. If there's no way to make software clean - why bother.

Having some prior experience with software testing, I'm certain that we can do much better than we do today. Once we do this right, the results would pleasantly surprise us.


Conventional testing: How we test software?

Looking at the ways we test software, we see the following methods:

  • Known scenario replay (test suites, regression tests)
  • Random early exposure (Campus alphas, selected customer betas)

The known scenario replay is the single most commonly used method to test software. Unfortunately, it is the least effective method to uncover bugs as proven by the large number of bugs uncovered in later stages.

Regression tests and test suites are necessary but insufficient. They're a good way for conducting sanity checking and ensuring that popular and commonly compiled code runs correctly. However, they have two striking flaws:

  • Coverage is extremely limited. When you run the same suite so many times, obviously your bugs tend to hide elsewhere.




  • It is difficult to track bugs in case of failure. Test suites tend to be big pieces of code. When something goes wrong, we apply some further trial runs and search techniques until the problem is narrowed down to a single routine or line of source code.
Early exposure like alpha (and beta) testing has the advantage that it is much more random than the known-scenario replay. It provides more "real world" coverage, but it has its own flaws:
  • Since it is an informal testing method, reproducibility is a problem (e.g asking a beta customer: what exactly did you do to experience the problem?")
  • It relies on people's good will to investigate problems as they occur, and report bugs accurately and with enough detail when they are found.
  • It suffers for a small scope (time, number of machines, and people employed) compared to the real installed base and its usage patterns.
  • At least the Alpha part doesn't really emulate our customer environment: our environment is far from heterogenic: almost no other vendors' machines (Sun, HP, IBM, Mac, PCs) exist on campus.

Fortunately, there is a complementary testing method that covers the above flaws well.


Monkeys at work: Random Testing

It is said that if you give a zillion monkeys keyboards and let them type long enough, one of them would eventually produce (insert your favorite magnum opus here). Finding bugs is probably much easier than writing magnum opii. If you think this is absurd, just substitute the word "monkeys" with "customers".

For years we have been sending our software to much less than a zillion customers, and yet, without exception, they eventually hit hundreds of undiscovered bugs.

Optimization imperative #1 calls for applying zillions of computer cycles before shipment to try and beat the customers in the race to uncover the bugs. This is what random testing is all about. This is also obvious. The magic is in the details.


Constrained Random Testing: a zillion monkeys, optimized

Since the combinatorial space of software is so huge we would like the proverbial monkeys to be constrained in some way and not purely random. To give an example: if we had our monkeys test UNIX for us, we would want them to type stuff like ls, make, cc etc. rather than having them type stuff like %^3gkl'$#*(&% (*).

``Elementary,'' you may say, but read on.


A proof by example: It just works!

Before going into the details of Constrained Random Testing, let me state that from my experience applying this technique, to one problem space, it works far better than any other conventional testing method I've seen.

I used to work at the National Semiconductor microprocessor design center in Tel Aviv (NSTA) where I was a member of the Compilers team who wrote the compilers for all the 32000 Series of microprocessors.

For several years, our compilers were buggy as hell. Having worked closely with hardware engineers, we were urged to test the compilers "like hardware is tested", using random "vectors" of input. Once we were on the right track in our tought process things started to improve fast. It took one engineer about 4 months to come with a prototype for random compiler testing. He was generating random IR (Intermediate Representation) because it was easier to do than generating high-level code and we just needed a proof of concept. As a result we were testing only the back-end (Optimizer, Code generator, assembler, linker).

The introduction of random testing practically eliminated user bug reports on released back ends. At a certain point, we couldn't believe it ourselves, so we conducted an experiment by running the random program generator (it used to be called RIG, for Random IR Generator) and was implemented by Avi Bloch and Amit Matatia (see acknowledgments) on a previous release of the compiler. To our amazement, RIG was able to find over half of all the bugs reported by customers on the code generator in just one night of runtime.

Thanks to RIG, the GNX Compilers for the 32000 processor family have matured in a very short period to become one of the most reliable in the industry; I remember compiling an early version of perl (I think it was perl 2.0) with full optimizations and passing the whole test suite on a Sequent Balance (a big 32032-CPU SMP machine), while the SunOS and the HP/UX compilers I used for comparison weren't anywhere near this quality at that time (mind you, that was way back, in 1990 or so).

Based on this personal experience, I have no doubt this method is well worth pursuing here.


Testing a Compiler: CRT Outline

Here's the skeleton of RIG:

Loop:
  • Generate a random program subject to constraints
  • Compile it
  • Run it
  • Check that the results make sense:
    • if they do, discard of the program
    • if not, save the program and results for later human inspection

Generating a random program is done using recursive descent on the grammar of the language, while applying constraints at every node on the random generator (explained below). To give a simple example: pick an operand at random, check its arity, generate N random operands to operate on. Each operand can be either a constant or an arbitrary randomly generated new expression. etc.


Constraint Magic: making a random program behave

The outline sounds simple enough, but a few problems immediately come to mind:

  • How can you ensure that a randomly generated program terminates?
  • How can you ensure that a randomly generated program doesn't generate an exception?
  • How can you ensure that a perfectly "safe" optimization won't change the semantics of the program (e.g. use of an undefined variable, the behavior of which is undefined)

This is where the constraints come into play. The embedded constraints are what turns the method from a pure theoretical play into a practical and effective testing method.

  • To ensure termination, we artificially inject an independent counter into every randomly generated loop or recursive function call, if a configurable limit is exceeded, we just break out of the loop. Not perfect, indeed, but simple, straightforward, and does the job.
  • As for exceptions, there are two approaches: one is to simply allow them, if a randomly picked divisor happens to be zero, so be it. The output may be partial, but still deterministic. The other approach (if exceptions happen too often) is to regenerate those expressions that cause the exceptions, or to ensure by construction that they always fall within a certain range of values.

    For example, we ensure that constant divisors are never zero, and that variable divisors are checked at run time (if they are zero don't divide). Likewise we may add a check before array accesses for legal indexes into a pre-generated array. From experience, the second approach is what you want in 99% of the cases, and again, it works well.

  • To make sure the program has outputs (so we can inspect its runtime results) we simply put a hook to print all the randomly generated variables and constants of the program from a customized exit() routine that is linked with the program.
  • Likewise, to ensure there's no use of undefined variables, we simply initialize all of them (with random values of course) just before we jump to main().


This is in essence what the constraints are all about.


Closing the loop: proving correctness

But then there's another more fundamental question:

  • If the program is randomly generated, (i.e. not known in advance) how can you predict its output, in order to verify that it ran correctly?

It turns out that even though the answer to the third question is "You can't", still, from a practical point of view, there's a nice solution to this. Not only it is simple, but it was also empirically proven to work very well.

Solution: Generate a reference output using another compiler or interpreter for the same language. E.g. for C, the reference may be a most vanilla compilation (without any optimizations) by a widely deployed and stable compiler like GNU cc (gcc). This can be even done on a different vendor's system.

Actually, for most cases, you don't even need an additional compiler: You may generate multiple outputs using the same compiler under test, each time with different compilation options. For example: If the result with optimization is different than the one without it, bingo: you've found a bug in the optimizer.


Practice & experience: The importance of space/time tuning

The constraints mentioned above are just part of the story. To be effective and efficient, the random program generator should be space/time tuned, for example: Testing a small array for an unsigned index wrap-around case is effective as checking a big array and is far less consuming in time and space. Thus: elements like the maximum size of a randomly generated array, maximum number of elements in a structure, or the maximum depth of an expression should all be configurable via a configuration file.

Results (in terms of detected bugs per unit of testing time) can change dramatically by fine tuning of these space/time constraint parameters. The general guideline is: strive to use constraints that will create the smallest/simplest case to exhibit the problem.

Some golden rules:

  • Don't generate big data structures.
  • Don't generate big programs, loops, or routines
  • For every possible range of values, use some bias towards extremum and "interesting" values (MAXINT, 1, 0, -1, last index of an array) as opposed to a uniform distribution of values.
  • Think hard about the constraints, let the random generator do the rest
  • For every generated program, run it against as many cases as possible (e.g. many possible compiler options) to amortize the generation overhead, and leverage the reference results over many test cases.

From my experience, all compiler bugs can be narrowed down and reproduced in a very small example. Come to think of it, This is exactly what makes random testing of compilers work so well. Also, the smaller the example, the easier it is to investigate and fix. This is where tuning pays big: if you can generate 100,000 tiny programs per night you'll be much more effective covering the compiler and fixing the bugs than if you generate 1000 larger programs per night.


More Advice

Start with simple expressions, including all types and all possible cast and type conversions, these are easy to test and are a first step in a CRT implementation that is natural to start with and build upon.

Pick a very good random number generator. If you don't and if you don't have much variability in the types of nodes that you randomly generate, your programs will tend to find the same bug multiple times. There are practical ways to minimize this: like changing the configuration parameters once a bug is found (and until the bug is fixed) but this requires further resources and effort.

Generate random C rather than random IR. You cannot compile proprietary formats with a reference compiler (not the one you're testing). You also get to test the language front ends in addition to the back end.

It may be easier to generate IR and upwardly generate high-level constructs from it. This is certainly a sensible strategy, especially in case the mapping between IR and C is one to one. even if not, this will enable testing Fortran with almost no additional effort (since we already have these "reverse compilers").

Generating test cases, running them, and comparing results normally take less time than many compilation with many options, so pick compilation cases that are as different as possible of each other to get better coverage. Also: try to combine several compilation options with each run (i.e. many optimizations together, vs. a vanilla compilation) to achieve good coverage of the compiler in as few compilations as possible.

Jumbo tip: An excellent way of tuning the constraints and the random program generator is to basic-block profile the compiler itself and see what parts of source code are not being exercised by the randomly generated programs. Then, tune the random generator a bit more. In other words, inject some white-box style testing into the random black-box approach.

0 comments: