Tag Archives: methodology

In an article in a recent issue of Computer entitled “Really Rethinking Formal Methods”, David Parnas questions the current direction of formal methods research. His basic claim is that (stop me if this sounds familiar) formal methods deliver too low an ROI and that researchers, rather than proclaiming the successes, need to recognize this and adjust their direction. As he so eloquently puts it:

if [formal methods] were ready, their use would be widespread

I haven’t spent a lot of time trying to figure out whether his prescriptions make sense, but one thing stood out to me. He talks about a gap between software development and older engineering disciplines. This is not a new insight. As far back as the 1960s, the “software crisis” was a concern, as the first large, complex software systems being built started experiencing acute schedule and quality problems. This was attributed to the fact that programming was a new profession and did not have the rigor or level of professionalism of engineering disciplines that had been around for much longer. Some of the criticisms heard were:

  • programmers are not required to have any degree, far less an engineering degree.
  • programmers are not required to be certified.
  • traditional engineering emphasizes using tried and true techniques, while programmers often invent new solutions for every problem.
  • traditional engineering often follows a rigorous design process, while programming allows hacking.

These explanations are often used as the excuse when software (usually Microsoft software) is found to have obvious and annoying bugs. But is this really the truth? Let’s look at an example of traditional engineering to see if this holds up.

Bridge building is a technology that is thousands of years old. There are still Roman bridges built two thousand years ago that are in use today. Bridges are designed by civil engineers who are required to be degreed, certified engineers. Bridge design follows a very rigorous process and is done very conservatively using tried and true principles. Given that humanity has been designing bridges for thousands of years, you would think that we would have gotten it right by now.

You would be wrong.

Even today, bridges are built with design flaws that result in accidents and loss of life. One could argue that, even so, the incidence of design flaws is far less in bridges than in software. But this is not really an apples to apples comparison. The consequences of a bug in, say, a web browser are far less than a design flaw in a bridge. In non-safety critical software, economics is a more important factor in determining the level of quality of software. The fact is, most of the time, getting a product out before the competition does is economically more important than producing a quality product.

However, there are safety-critical systems that depend on software, such as airplanes, medical therapy machines, and spacecraft. It is fair to compare these systems to bridges in terms of catastrophic defect rates. Let’s look at one area in particular: commercial aircraft. All commercial aircraft designed in the last 20 years rely heavily on software and, in fact, would be impossible to fly if massive software failures were to occur. Over the past 20 years, there have been roughly 50 incidents of computer-related malfunctions, but the number of fatal accidents directly attributed to software design faults is maybe two or three. This is about the same as the rate of fatal bridge accidents attributable to design faults, which suggests that the gap between software design and traditional engineering is not so real.

The basic question seems to boil down to: are bridges complex systems? I define a complex system as one that has bugs in it when shipped. It is clear that bridges still have that characteristic and, therefore, must be considered complex systems from a design standpoint. The intriguing question is, given that they are complex systems, do they obey the laws of designing complex systems? I believe they do and will illustrate this by comparing two bugs: one a bridge design fault, the other a well-known software bug.

The London Millennium Footbridge was completed in 2000 as part of the millennium celebration. It was closed two days after it opened due to excessive sway when large numbers of people crossed the bridge. It took two years and millions of pounds to fix. The bridge design used the latest design techniques, including software simulation to verify the design. Sway is a normal characteristic of bridges. However, the designers failed to anticipate how people walking on the bridge would interact with the sway in a way that magnified it. The root cause of this problem is that, while the simulation model of the bridge itself was probably sufficiently accurate, the model of its environment, in this case people walking on the bridge, was not.

This is a very common syndrome in designing complex hardware systems. You simulate the chip thoroughly and then when you power it up in the lab, it doesn’t work in the real environment. I describe an example of this exact scenario in this post.

In conclusion, it does seem that bridges obey the laws of designing complex systems. The bad news is that the catastrophic failure rate of safety-critical software is of roughly the same magnitude as that of bridges. This means that we cannot expect significant improvements in the quality of software over the next thousand years or so. On the plus side, we no longer need to buy the excuse that software development is not as rigorous as “traditional” disciplines such as building bridges.

Well, there isn’t one. But, understanding the three laws of verification can help you understand how to optimize your current verification methodology.

(See The First Law of Verification, The Second Law, The Third Law)

The first law says that no matter what methodology is used, verification will be the bottleneck in getting the project completed. The third law tells us that different methods may find the same bug at different points in the verification process. The second law tells us that the same bug may be easier or harder to find depending on the methodology used and the person doing the verification. Based on these observations, we can synthesize a methodology that optimizes verification efficiency:

  1. Parallelize the verification effort.
    • have multiple people working independently.
    • have multiple simulations or other automated verification tools running simultaneously.
  2. Minimize overlap between parallel efforts.
    • have different people verify different aspects of the design.
    • orchestrate verification automation tools to minimize duplicated effort.
  3. Use as many different methodologies as possible.
    • get as many different points of view as possible.
    • minimize the possibility that a bug that is hard to find with one methodology will slip through.
  4. Evaluate the efficiency of each methodology used.
    • efficiency = bugs found/effort put in.
    • put more effort into those methodologies with highest efficiency.
  5. Start as soon as possible.
    • the first law says you will find bugs at the beginning no matter what you do.
    • the sooner you start, the sooner you will finish.

Most projects naturally follow the first two guidelines. The next three guidelines come from insight gained in understanding the three laws of verification. The subjective nature of bug hardness means that using a single methodology, even if parallelized, increases the chances of relatively simple bugs slipping through the verification process. The way to overcome this is to use as many different methodologies as possible. The downside to this is that each methodology has development cost associated with it. Therefore, there is a tradeoff between development effort and the number of different methodologies that can be deployed. Generally, smaller teams will end up using fewer different methodologies due to resource constraints and larger teams will be able to use more.

It is important to evaluate methodologies used in order to improve the overall verification process for future projects. You need to be careful here. Don’t throw away all methodologies except for the one that is most efficient. You will have lost the advantage of using different methodologies. If you do decide to remove a methodology because it is not finding bugs, try to replace it with another unique methodology rather than just putting more effort on fewer methodologies. Putting more effort on fewer methodologies just increases the risks of bugs slipping through.
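One way to picture this tradeoff is an effort plan that favors efficient methodologies without ever dropping one. The sketch below is only an illustration: the function name `allocate_effort`, the `min_share` floor, and the proportional split are assumptions made for this example, not a rule taken from the three laws.

```python
def allocate_effort(efficiencies, total_effort, min_share=0.05):
    """Split total_effort across methodologies (hypothetical policy).

    Every methodology first receives min_share of the total, so none is
    dropped. The remaining effort is divided in proportion to each
    methodology's measured efficiency (bugs found / effort put in).
    """
    count = len(efficiencies)
    if count == 0:
        return {}
    if min_share * count > 1:
        raise ValueError("min_share is too large for this many methodologies")

    floor = min_share * total_effort
    remainder = total_effort - floor * count
    efficiency_sum = sum(efficiencies.values())

    plan = {}
    for name, eff in efficiencies.items():
        # With no efficiency data at all, fall back to an even split.
        share = eff / efficiency_sum if efficiency_sum > 0 else 1 / count
        plan[name] = floor + remainder * share
    return plan


if __name__ == "__main__":
    # Illustrative efficiencies only, not measurements from a real project.
    measured = {"method A": 3.0, "method B": 1.0, "method C": 0.0}
    for name, effort in allocate_effort(measured, 100, min_share=0.1).items():
        print(f"{name}: {effort:.1f} man-months")
```

The floor is what keeps a methodology that found nothing last time in the plan, which is the point of the warning above: its bugs may simply be the ones the other methodologies find hard.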

As an example, here are efficiencies of the different methodologies used on the MCU project:

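| Method                             | Bugs Found | Effort (Man-Months) | Bugs/Man-Month |
|------------------------------------|------------|---------------------|----------------|
| code review                        | 44         | 2                   | 21.0           |
| static checks+unit testing         | 31         | 13                  | 2.4            |
| full chip directed testing         | 70         | 38                  | 1.8            |
| regression+perturbation            | 12         | 9                   | 1.3            |
| coverage improvement               | 3          | 4                   | 0.8            |
| assertions                         | 3          | 4                   | 0.7            |
| full chip random+algorithmic tests | 5          | 8                   | 0.6            |
| emulation                          | 3          | 28                  | 0.1            |
| overall                            | 171        | 106                 | 1.6            |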

Based on this data, I would probably look seriously at dropping emulation in the next project.
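The efficiency figures can be recomputed from the bug and effort columns with a few lines of Python. This is a minimal sketch: the names `MCU_RESULTS`, `efficiency`, `rank_by_efficiency`, and `overall_efficiency` are made up for the example. The effort column appears to be rounded to whole man-months (an assumption), so recomputed ratios can differ slightly from the published column; code review, for instance, comes out at 22.0 rather than 21.0.

```python
# Rows copied from the MCU project table: (method, bugs found, man-months).
MCU_RESULTS = [
    ("code review", 44, 2),
    ("static checks+unit testing", 31, 13),
    ("full chip directed testing", 70, 38),
    ("regression+perturbation", 12, 9),
    ("coverage improvement", 3, 4),
    ("assertions", 3, 4),
    ("full chip random+algorithmic tests", 5, 8),
    ("emulation", 3, 28),
]


def efficiency(bugs_found, effort):
    """Efficiency as defined in guideline 4: bugs found / effort put in."""
    return bugs_found / effort


def rank_by_efficiency(results):
    """Return (method, efficiency) pairs, most efficient first."""
    ranked = [(name, efficiency(bugs, effort)) for name, bugs, effort in results]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def overall_efficiency(results):
    """Total bugs found divided by total effort across all methods."""
    total_bugs = sum(bugs for _, bugs, _ in results)
    total_effort = sum(effort for _, _, effort in results)
    return total_bugs / total_effort


if __name__ == "__main__":
    for name, eff in rank_by_efficiency(MCU_RESULTS):
        print(f"{name:36s} {eff:5.1f}")
    print(f"{'overall':36s} {overall_efficiency(MCU_RESULTS):5.1f}")
```

Ranking this way puts code review first and emulation last, which is the basis for the comment about emulation above.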

But, these numbers can be deceiving. Unlike bug hardness, bug finding efficiency does vary with time. This is a big topic, so I will save it for its own post.