The Probability of a Correct Program

By Alex Beal
December 8, 2013

What is the probability that a given program is correct? In “Programming as a discipline of mathematical nature,” Dijkstra models this with a simple formula:

A large sophisticated program can only be made by a careful application of the rule “Divide and Conquer” and as such consists of many components, say N; if, however, p is the probability of an individual component being correct, then the probability P of the whole aggregate being correct satisfies something like \(P ≤ p^N\). In other words, unless p is indistinguishable from 1, for large N the value P will be indistinguishable from zero! [1]

In other words, a complex program must be broken down into parts to be manageable (whether these “parts” are modules, functions, or objects). If each of these parts has probability p of being correct and there are N parts, then the probability of the entire program being correct is at most \(p^N\). If N is large, then p must be very close to one for \(p^N\) to be close to one. More succinctly, a programmer must be very confident in the correctness of each component of a system if she wants to be confident in the system as a whole. This, Dijkstra argues, is why formal verification is important: it’s the only technique that can bring p close to one.
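To get a feel for the numbers, here’s a quick back-of-the-envelope sketch in Haskell. The example and the numbers are mine, not Dijkstra’s, and they assume each component is correct independently, which is of course a simplification:

```haskell
-- Dijkstra's bound: with N components, each independently correct with
-- probability p, the whole program is correct with probability at most p^N.
wholeProgram :: Double -> Int -> Double
wholeProgram p n = p ^ n

-- Solving p^N = target for p gives the per-component reliability needed
-- to reach a target probability for the whole program.
requiredP :: Double -> Int -> Double
requiredP target n = target ** (1 / fromIntegral n)

main :: IO ()
main = do
  -- Components that are each correct 99% of the time get you almost nowhere:
  print (wholeProgram 0.99 1000)  -- ~4.3e-5
  -- For a 90% chance that a 1000-component program is correct, each
  -- component needs to be correct about 99.99% of the time:
  print (requiredP 0.9 1000)      -- ~0.99989
```

In other words, a merely “pretty good” p collapses to a nearly useless P once N gets into the hundreds or thousands.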

It’s important to note that Dijkstra is talking about correctness in the formal sense, where a program is correct if it produces the right output for every input. But maybe this is setting the bar too high. Instead, most people want programs that usually produce the correct output and usually don’t crash, because, I suppose, perfection is an unattainable ideal. Personally, I take this as a reflection of how sad the state of programming really is.

But assuming we accept this lower standard, how do most programs measure up? As a friend of mine [2] pointed out, there are many programs that produce the correct output almost all of the time, and many of these programs have been written in languages with poor safeguards and weak type systems. So does this mean that formal methods are unneeded, and that Dijkstra was mistaken about their importance? I think what this observation shows is that current practices can get us most of the way to a correct program, a level of correctness that many businesses are satisfied with. What it misses is the level of effort needed to get us there. Many large programs take years to reach the point where they are considered stable and production ready. I think the promise of formal methods is that we can get there more quickly.

One example of this is Google’s evaluation of the FindBugs static analysis tool for Java, in which thousands of Google engineers reviewed the tool’s output after it was run against the Google code base. Here was their conclusion:

We observed that most reviews recommended fixing the underlying issue, but few issues caused serious problems in practice. Those issues that might have been problematic were caught during development, testing and deployment. The value of static analysis is that it could have found these problems more cheaply, and we illustrated this with anecdotal examples from student code. [3]

So although many of the bugs found were either unimportant or would have been caught anyway during testing, FindBugs allowed Google’s engineers to find issues earlier and more cheaply, and all this with a tool that uses rather simple analysis techniques.

Simon Peyton Jones (SPJ) echoed this point in a talk about types at Microsoft Research, after an audience member argued that runtime debugging is easier than compile-time debugging and that many experienced developers prefer it. He responded:

I just can’t imagine that if you have somebody who’s produced the value that’s the wrong shape for its consumer, but this value that is being produced is being stuck in some data structure and being passed around, and it’s been through some immense amount of plumbing, and finally it pops out the end and somebody says, “Oh I didn’t expect to see that,” then it’s very far away from the producer and finding your way back to it is difficult. A type system just nails it right there! Immediately! [4]
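In other words, the further a badly shaped value travels from the code that produced it, the harder the bug is to trace back. Here’s a toy Haskell illustration of my own (nothing from the talk, and deliberately simplistic):

```haskell
-- A toy producer/consumer pair. In the "untyped" style everything is a
-- String, so a producer that emits the wrong shape compiles fine and only
-- fails at runtime, at the consumer, far from the real source of the bug.
-- In the typed style the same mistake wouldn't compile at all, and the
-- error would point at the producer itself.

newtype UserId = UserId Int deriving Show

-- Untyped style: the wrong shape slips through and travels downstream.
produceUntyped :: String
produceUntyped = "forty-two"   -- oops: should have been "42"

consumeUntyped :: String -> Int
consumeUntyped s = read s      -- blows up here: Prelude.read: no parse

-- Typed style: returning a String here would be rejected by the compiler,
-- with an error located at the producer.
produceTyped :: UserId
produceTyped = UserId 42

consumeTyped :: UserId -> Int
consumeTyped (UserId n) = n

main :: IO ()
main = do
  print (consumeTyped produceTyped)      -- 42
  print (consumeUntyped produceUntyped)  -- crashes at the consumer
```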

And so although many developers will be able to recognize that something is wrong without analysis tools, analysis tools make finding the source of these bugs easier. As SPJ put it, “A type system just nails it right there!” So even if we have given up on the idea of program correctness (which I think is a mistake, though perhaps more a business decision than anything else), formal methods can help us reach our desired level of correctness more quickly, and a flexible, expressive language lets the developer decide what level of checking he’s comfortable with. I think this is incredibly powerful. Even if we don’t agree with Dijkstra that p must be close to one, I think one day we’ll look back, see dynamic languages as a failed experiment, and wonder why so many people wrote so many large code bases with them.


  1. Programming as a discipline of mathematical nature. Edsger W. Dijkstra. EWD361. http://www.cs.utexas.edu/…/EWD03xx/EWD361.html

  2. Nick Vanderweit. You should all follow him. https://twitter.com/nvanderw

  3. The Google FindBugs Fixit. Nathaniel Ayewah and William Pugh. ISSTA 2010. http://www.cs.umd.edu/~ayewah/web/pubs/Google-ISSTA2010.pdf

  4. Simon Peyton Jones at Sexy Types–Are We Done Yet? https://research.microsoft.com/apps/video/dl.aspx?id=150045 00:34:00