
My previous post on the second law of verification left open the question of why the bug rate curve starts high and goes low. If it is not due to bug hardness increasing with time, then what is it?

Towards the end of the MCU project, I decided to analyze all bug reports, over 100 total at the time, to try to get some ideas about the hardness of bugs that occurred. The first problem was how to synthesize a useful metric that captures the difficulty of finding bugs. We have seen that hardness is subjective, so rather than trying to measure absolute hardness, I chose to measure the relative hardness of one bug against another. Let’s call this measure “Bug Finding Probability” (BFP).

One way of measuring BFP is to count the number of events required to trigger a particular bug. For example, bug reports often give a description such as: “A remote read followed by a local write to the same address causes the wrong data to be written.” This describes two independent events that must happen: 1) a remote read, and 2) a local write. Assuming we are verifying a memory, verification is usually done by injecting a series of reads and/or writes to exercise its various functions. Tests may inject a read with some probability, followed by a write with a different probability. The product of the two probabilities is the BFP for the bug.
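As a minimal sketch of that calculation, the injection probabilities below are made up for illustration and are not taken from any real testbench:

```python
# Hypothetical sketch: BFP of the two-event bug described above.
# The injection probabilities are illustrative, not real testbench settings.
p_remote_read = 0.25   # probability that an injected request is a remote read
p_local_write = 0.25   # probability that the following request is a local write

# BFP is the product of the probabilities of the independent triggering events.
bfp = p_remote_read * p_local_write
print(f"BFP = {bfp}")  # 0.0625 with these example values
```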

Assuming equal probabilities for injecting requests, the BFP is inversely correlated with the number of events required to trigger the bug. In other words, the more events required to trigger the bug, the lower the BFP. When reading these bug reports, determining the relative hardness of a bug was often as simple as counting the occurrences of the word “and” in its description (a rough sketch of this counting heuristic follows the plot below). Here is a plot of bug hardness vs. time along with the bug rate vs. time for the MCU project. Bug hardness in this plot is measured as the number of independent events required to trigger the bug.

[Figure: bug rate and bug hardness vs. time for the MCU project]
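The counting heuristic itself is trivial to automate. Here is a rough sketch, assuming plain-text bug descriptions; the example reports are invented, not taken from the MCU bug database:

```python
import re

# Rough sketch of the "count the ands" heuristic: use the number of "and"s
# in a bug description as a proxy for the number of independent triggering
# events, i.e. the bug's relative hardness. Example reports are invented.
bug_reports = [
    "A remote read and a local write to the same address corrupt the data.",
    "A refresh and a read and a write to the same bank drop the write.",
]

for report in bug_reports:
    ands = len(re.findall(r"\band\b", report, flags=re.IGNORECASE))
    events = ands + 1  # N conjoined events are usually joined by N-1 "and"s
    print(f"~{events} events: {report}")
```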

My initial impressions were that 1) bug hardness is quite “noisy” vs. time, which doesn’t necessarily invalidate the “bug hardness rises with time” theory if you average it out, and 2) on average, bug hardness did not rise over time. At the time, I was not really concentrating on understanding this. I casually mentioned to some colleagues at Stanford that I had found this result, which contradicted expectations. Kanna Shimizu, another researcher in our group, and Dave Dill, my advisor, were the first to arrive at an explanation. Dave, being a professor, naturally had to cast the problem into the classical balls-and-bins analysis used in probability textbooks.

Assume that you have a bin with an unknown number of black and white balls, in some ratio of black to white. Randomly draw one ball from the bin, with every ball having an equal probability of being chosen. If the chosen ball is black, paint it white; then put it back in the bin. Continue drawing balls, painting each black one white and returning every ball to the bin. If you then plot the rate at which black balls are drawn, you get a curve that starts high and decreases exponentially to a low value.
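A minimal simulation makes the point concrete. This is a sketch of the balls-and-bins model as just described, with parameters I picked arbitrarily for illustration; black balls stand in for undiscovered bugs:

```python
import random

# Sketch of the balls-and-bins model above (parameters are illustrative).
# True = black ball (an undiscovered bug), False = white ball.
NUM_BALLS = 10_000
NUM_BLACK = 200        # how many "bugs" we seed; unknown in a real design
WINDOW = 500           # number of draws per measurement window

balls = [True] * NUM_BLACK + [False] * (NUM_BALLS - NUM_BLACK)

rates = []
for _ in range(40):                        # 40 measurement windows
    hits = 0
    for _ in range(WINDOW):
        i = random.randrange(NUM_BALLS)    # every ball equally likely
        if balls[i]:
            hits += 1
            balls[i] = False               # paint it white, i.e. fix the bug
    rates.append(hits / WINDOW)

# The rate of black-ball draws starts high and decays roughly exponentially,
# even though every ball had exactly the same probability of being drawn.
print(rates)
```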

The analogy should be obvious. Balls represent points in the possible input space of a design, and black balls represent points that hit a bug. Pulling out a ball corresponds to running a test. Painting a ball white corresponds to fixing a bug. Therefore, the rate at which black balls are pulled out of the bin corresponds to the bug rate of the design. The bug rate started high and went low, exactly as is observed in real designs. But we assumed that all balls were equally probable, which corresponds to all bugs having equal hardness!

This was quite a revelation. It meant that the purgatory period was not caused by hard bugs at all, but simply due to the fact that there weren’t many bugs left to be found. We can now state the third law of verification:

third law of verification: The hardness of a bug is independent of when it was found.

Based on data from the MCU, I think it is fair to say that, to a first-order approximation, all bugs are equally hard. Since, in almost all cases, the verification effort finds 90+% of the bugs in a design, we can, to a first-order approximation, restate the third law as:

third law of verification (simplified): Bugs are easy.


2 Comments

  1. Unless we exercise a part of a design functionally to check that it works, we will not be able to validate it. So the bugs will not be found until the design is exercised by giving all functionally valid values to the DUT. So, as the 3rd law says, whether bugs are hard or easy is independent of when they are found.

  2. Thanks for sharing this graph, which has the merit of giving a concrete approach to this. Unfortunately – for now – I cannot confirm this based on the metrics I have collected.

    I would, however, tend to disagree with this third law. There is a saying that goes something like: when you fix a bug, you put in another one that is harder to find. We’ve seen this quite a number of times.

    The metaphor of black and white balls is a nice one, and it is valid for bugs that have equal hardness. A bug that is harder to find is like a ball that is harder to catch.
    When I look at your graph, I observe that bug hardness doesn’t drop below a value of about 3 once day 225 is passed. So it looks like single-event and double-event bugs have mostly been found.
    Further, I think we have to be careful with how bug hardness is described. In my opinion, bug descriptions are localized to the block where they occur. Even if I find a bug during a DMA transfer to the memory, I would still write that it happens during memory write bursts, forgetting to mention all the conditions that are needed to get that kind of burst to that memory. This is where separating IP verification and integration verification helps. It is more efficient to verify smaller blocks exhaustively before integrating them into a larger system that is verified too. And that might mean that one needs to split an IP into smaller pieces for verification. This is not often done, and it makes certain bugs just harder to find.

    To conclude: an interesting idea that I’ll think about.


One Trackback/Pingback

  1. […] the framework of the three laws of verification, it is possible to synthesize a consistent explanation for all the conflicting data. First, to […]
