
In an article in a recent issue of Computer entitled “Really Rethinking Formal Methods”, David Parnas questions the current direction of formal methods research. His basic claim is that (stop me if this sounds familiar) formal methods have too low an ROI, and that researchers, rather than proclaiming their successes, need to recognize this and adjust their direction. As he so eloquently puts it:

if [formal methods] were ready, their use would be widespread

I haven’t spent a lot of time trying to figure out whether his prescriptions make sense, but one thing stood out to me. He talks about a gap between software development and older engineering disciplines. This is not a new insight. As far back as the 1960s, the “software crisis” was a concern, as the first large, complex software systems being built started experiencing acute schedule and quality problems. This was attributed to the fact that programming was a new profession and did not have the rigor or level of professionalism of engineering disciplines that had been around for much longer. Some of the criticisms heard were:

  • programmers are not required to have any degree, far less an engineering degree.
  • programmers are not required to be certified.
  • traditional engineering emphasizes using tried and true techniques, while programmers often invent new solutions for every problem.
  • traditional engineering often follows a rigorous design process, while programming allows hacking.

These explanations are often used as the excuse when software (usually Microsoft software) is found to have obvious and annoying bugs. But is this really the truth? Let’s look at an example of traditional engineering to see if this holds up.

Bridge building is a technology that is thousands of years old. There are still Roman bridges built two thousand years ago that are in use today. Bridges are designed by civil engineers who are required to be degreed, certified engineers. Bridge design follows a very rigorous process and is done very conservatively using tried and true principles. Given that humanity has been designing bridges for thousands of years, you would think that we would have gotten it right by now.

You would be wrong.

Even today, bridges are built with design flaws that result in accidents and loss of life. One could argue that, even so, the incidence of design flaws is far lower in bridges than in software. But this is not really an apples-to-apples comparison. The consequences of a bug in, say, a web browser are far less severe than those of a design flaw in a bridge. In non-safety-critical software, economics is a more important factor in determining the level of quality. The fact is, most of the time, getting a product out before the competition does is economically more important than producing a quality product.

However, there are safety-critical software systems, such as those in airplanes, medical therapy machines, spacecraft, etc. It is fair to compare these systems to bridges in terms of catastrophic defect rates. Let’s look at one area in particular, commercial aircraft. All commercial aircraft designed in the last 20 years rely heavily on software and, in fact, would be impossible to fly if massive software failures were to occur. Over the past 20 years, there have been roughly 50 incidents of computer-related malfunctions, but the number of fatal accidents directly attributed to software design faults is maybe two or three. This is about the same as the rate of fatal bridge accidents attributable to design faults. This seems to indicate that the gap between software design and traditional engineering is not so real.

The basic question seems to boil down to: are bridges complex systems? I define a complex system as one that has bugs in it when shipped. It is clear that bridges still have that characteristic and, therefore, must be considered complex systems from a design standpoint. The intriguing question is, given that they are complex systems, do they obey the laws of designing complex systems? I believe they do and will illustrate this by comparing two bugs, one a bridge design fault and the other a well-known software bug.

The London Millennium Footbridge was completed in 2000 as part of the millennium celebration. It was closed two days after it opened due to excessive sway when large numbers of people crossed the bridge. It took two years and millions of pounds to fix. The bridge design used the latest design techniques, including software simulation to verify the design. Sway is a normal characteristic of bridges. However, the designers failed to anticipate how people walking on the bridge would interact with the sway in a way that magnified it. The root cause of this problem is that, while the simulation model of the bridge itself was probably sufficiently accurate, the model of the environment, in this case people walking on the bridge, was not.
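To make the feedback idea concrete, here is a toy sketch of my own, not the designers' model. It assumes the deck behaves like a single damped spring and that each walker, by adjusting their gait to the sway, pushes the deck with a force proportional to its sideways velocity. Every number and name in it (`peak_sway`, `feedback`, the damping value) is a made-up illustration, not data from the real bridge.

```python
import math


def peak_sway(n_walkers, steps=20000, dt=0.001):
    """Largest sideways displacement seen in the final 10% of a 20 s run."""
    # All values below are illustrative assumptions, not Millennium Bridge data.
    mass = 1.0                       # modal mass of the deck
    stiffness = (2 * math.pi) ** 2   # gives a 1 Hz natural frequency
    damping = 0.2                    # structural damping coefficient
    feedback = 0.01                  # force per walker per unit deck velocity

    x, v = 0.001, 0.0                # small initial disturbance
    peak = 0.0
    for i in range(steps):
        # Walkers push in phase with the deck velocity, so they act
        # like negative damping that grows with the size of the crowd.
        net_damping = damping - feedback * n_walkers
        a = (-net_damping * v - stiffness * x) / mass
        v += a * dt                  # semi-implicit Euler: velocity first,
        x += v * dt                  # then position using the new velocity
        if i >= steps * 0.9:
            peak = max(peak, abs(x))
    return peak


if __name__ == "__main__":
    for crowd in (0, 10, 40):
        print(crowd, peak_sway(crowd))
```

In this sketch the sway dies out as long as the crowd is small enough that `net_damping` stays positive, and it grows once the crowd pushes it negative. A simulation that leaves the walkers out, or treats them as a fixed load, never sees the second case, which is the kind of environment-model gap described above.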

This is a very common syndrome in designing complex hardware systems. You simulate the chip thoroughly, and then, when you power it up in the lab, it doesn’t work in the real environment. I describe an example of this exact scenario in this post.

In conclusion, it does seem that bridges obey the laws of designing complex systems. The bad news is that the catastrophic failure rate of safety-critical software is of roughly the same magnitude as that of bridges. This means that we cannot expect significant improvements in the quality of software over the next thousand years or so. On the plus side, we no longer need to buy the excuse that software development is not as rigorous as “traditional” disciplines such as building bridges.

