Tuesday, July 28, 2009

Measuring Value of Automation Tests

Value and purpose of test automation

The value of test automation is often described in terms of the cost savings from reduced manual testing effort (and the resources it requires) and the ability of automated tests to give fast feedback. However, this rests on a key assumption: that the automated tests serve their primary purpose, which is to repeatedly, consistently, and quickly validate that the application is within the threshold of acceptable defects.

Since it is impossible to know most of the defects in an application without using it over a period of time (either by a manual testing team or by users in production), we will need statistical concepts and models to help us design and confirm that the automated tests are indeed serving their primary purpose.

| Automation Test Results | Manual Confirmation of Defects: Is a defect | Manual Confirmation of Defects: Is not a defect | Measure |
| --- | --- | --- | --- |
| Failure / Positive | Defective code correctly identified as defective - Caught Defects (CD) | Good code wrongly identified as defective - Not A Defect (NAD) (aka Type I Error / False Positive) | Positive Predictive Value - CD/(CD+NAD) |
| Pass / Negative | Defective code wrongly identified as good - Missed Defects (MD) (aka Type II Error / False Negative) | Good code correctly identified as good - Eureka! (E) | |
| Measure | Sensitivity - CD/(CD+MD) | | |

The sensitivity of a test is the probability that it will identify a defect when used on a defective component. A sensitivity of 100% means that the tests recognize all defects as such. Thus, with a high-sensitivity test, a pass result can be used to rule out defects.

The positive predictive value of a test is the probability that a component is indeed defective when the test fails. Predictive values are inherently dependent upon the prevalence of defects.
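The two measures above, and the dependence of predictive value on prevalence, can be sketched in a few lines of Python. This is only an illustration: the counts, the 5% false positive rate and the function names are assumptions made for the example, not data from any project.

```python
def sensitivity(caught, missed):
    """CD / (CD + MD): share of real defects that the tests flag."""
    return caught / (caught + missed)


def positive_predictive_value(caught, not_a_defect):
    """CD / (CD + NAD): share of test failures that are real defects."""
    return caught / (caught + not_a_defect)


def ppv_from_rates(sens, false_positive_rate, prevalence):
    """PPV via Bayes' rule, to show the dependence on defect prevalence."""
    true_pos = sens * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts, for illustration only.
cd, nad, md = 45, 15, 5
print(sensitivity(cd, md))                 # 0.9
print(positive_predictive_value(cd, nad))  # 0.75

# Same tests (assumed 90% sensitivity, 5% false positive rate),
# applied to code bases with different prevalence of defects.
for prevalence in (0.01, 0.10, 0.50):
    print(prevalence, round(ppv_from_rates(0.9, 0.05, prevalence), 3))
```

With the tests held constant, the predictive value of a failure rises as defects become more prevalent, which is why it cannot be quoted without stating the prevalence it was measured at.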

The threshold for the acceptable number of defects has to be traded off against the cost of achieving it: test development costs, test maintenance costs, longer test run times, etc.

Tests will involve a trade-off between the acceptable number of defects missed (false negatives) and the acceptable number of "Not a Defect" (false positives).

E.g. In order to prevent hijacking, airport security has to screen all baggage for arms being carried onto the airplane. This can be done by manually checking all the cabin baggage, as was briefly done for domestic flights in India. However, manual checking is prone to human error, which increases the probability of Missed Defects / false negatives. Note that NAD / false positives would be low in this case. How would this change if the manual check is replaced with metal detectors?


The efficacy of automated tests should be measured by their sensitivity and the probability of Missed Defects / false negatives when the application is subjected to these tests.

Data from a project



[Table: Automation Test Results (Failure / Positive, Pass / Negative) against Manual Confirmation of Defects (Is a defect, Is not a defect). Only the cell label "Good code correctly identified as good - Eureka! (E)" survives; the project's figures are not present.]

Monday, July 13, 2009

Re-defining Agile concepts in a non-agile context

The metrics I suggested for use in an agile project are equally valuable for a non-agile project. The terms / concepts used therein have to be re-defined, though.

0. Story - A work component; Could be a use case, a functional requirement, etc.
1. Value estimates - Value of the work component (story / use case, etc.) towards enhancing the product. If this is not defined for the work components, it could be temporarily substituted with their effort estimates
2. Complexity estimates - Relative estimate of the complexity of the work component, relates to the effort needed for delivering the work component. This could be the effort estimates for the work component
3. Iteration - Time between 2 successive status reports (in projects that have a fortnightly status report, iteration will be a fortnight)
4. Status - status of the story. E.g. Analysis complete, Coding Complete, Testing Complete, etc.
5. Done Status for stories - This is the last tracked status in the life cycle of the work component. In agile projects, this is often "Showcase Complete" / "Customer Accepted".
6. Velocity - Sum of the Value (or Complexity) estimates of all "Done" stories in an iteration
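A minimal sketch of the velocity definition above, assuming each work component is recorded as an (iteration, status, estimate) tuple and that "Customer Accepted" is the Done status. The story data and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical work components: (iteration, status, estimate).
stories = [
    (1, "Customer Accepted", 5),
    (1, "Testing Complete", 3),
    (2, "Customer Accepted", 8),
    (2, "Customer Accepted", 2),
    (3, "Coding Complete", 5),
]

DONE_STATUS = "Customer Accepted"  # assumed last tracked status


def velocity_by_iteration(stories, done_status=DONE_STATUS):
    """Sum the estimates of stories that reached the Done status, per iteration."""
    totals = defaultdict(int)
    for iteration, status, estimate in stories:
        if status == done_status:
            totals[iteration] += estimate
    return dict(totals)


print(velocity_by_iteration(stories))  # {1: 5, 2: 10}
```

Stories that have not reached the Done status contribute nothing, so iteration 3 does not appear in the result.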

Thursday, July 9, 2009

Metrics for an Agile project

Q. How are we doing on delivering agreed scope of the current release?
A. Burn-up chart by iteration for the release. Below is a burn-up chart Manju created for reporting status on one of our large programs.

Among other things, this graph shows:
1. Scope changes (demonstrated by fluctuations on the "Total Scope" line)
2. In-process / wait stories, reflected by the gaps between succeeding status lines. Larger than normally accepted gaps indicate bottlenecks. E.g. Dev is a bottleneck due to the huge gap between Analysis Complete and Dev Complete
3. Inventory of stories that are ready to go live (demonstrated by the "Showcase Passed" line)
4. Actual completion status (demonstrated by the "Showcase Passed" status line)

Q. How are we doing on throughput? How much value are we delivering? What is the trend - running faster, slowing down?
A. Velocity graph by iteration for the project. Only "Done" stories considered for velocity calculations. Below is a velocity graph Manju created for tracking velocity on one of our large programs. The 3 iteration average was first brought to my notice by Santosh, who was using it in one of his projects. I find this extremely valuable, as it balances the ups and downs into something like a trend line.
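The 3 iteration average can be sketched as a trailing average over the velocity series. The velocities below are made up, and averaging over fewer iterations at the start of the project is an assumption of this sketch rather than something the original graph is known to do.

```python
def rolling_average(velocities, window=3):
    """Average each iteration's velocity with up to window - 1 preceding ones."""
    averages = []
    for i in range(len(velocities)):
        recent = velocities[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


velocities = [12, 20, 10, 18, 26, 16]  # hypothetical velocity per iteration
print(rolling_average(velocities))
# [12.0, 16.0, 14.0, 16.0, 18.0, 20.0]
```

The raw series swings between 10 and 26, while the averaged series moves within 12 to 20, which is the smoothing effect described above.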

Why iterative development?

"Until you have seen some of the rest, you can't make sense of any part" - Marvin Minsky.

Minsky says this in the context of describing complex systems. This applies as much to software systems as to intelligence. How can we help users describe a complex system? Wouldn't building some of the rest help them make sense of the parts?

Monday, July 6, 2009

Bottleneck - Cont.

Below is some data from my previous project:

Wait stages:
Ready for Dev 78
Ready for BA Acceptance 25
Ready for QA 14
Ready for Showcase 12
Ready for SAT 50

In-process stages:
In Analysis 71
In Dev 92
In QA 10

It's clear that Development is the bottleneck. Development takes the longest time among all the stages. Things just don't move as fast here. So, we push more work into this stage. That is the reason for the high in-process work. And more work means multi-tasking for the developers and, consequently, diluted focus. That further adds to the time stories take to move out of this stage. And of course, you just can't push enough through the Development stage, so the inventory piles up. This could be the situation in most software development projects. The symptoms of this bottleneck sometimes showed up as high inventory in other stages. But they could be traced back to the Development stage in most cases.
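As a small sketch, the figures listed above can be ranked to see which stages stand out. The numbers are copied from the lists; their unit is not stated there, so the code only compares them with each other. The helper name is made up.

```python
# Figures as listed above (unit not stated in the source data).
wait = {
    "Ready for Dev": 78,
    "Ready for BA Acceptance": 25,
    "Ready for QA": 14,
    "Ready for Showcase": 12,
    "Ready for SAT": 50,
}
in_process = {"In Analysis": 71, "In Dev": 92, "In QA": 10}


def largest(stages):
    """Return the stage with the highest figure."""
    return max(stages, key=stages.get)


print(largest(in_process))  # In Dev
print(largest(wait))        # Ready for Dev
```

Both the largest in-process figure and the largest wait figure sit at Development, which matches the reading above.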

I wonder why we didn't look into the Development stage itself and see what was happening WITHIN the stage. That could have helped us understand how to speed up the Development process.


My last project had a huge bottleneck at system acceptance test (SAT) stage - the SAT team was not able to sign-off stories at the same pace as the dev teams. Though I don't readily have the data, I am certain that the in-process time at the SAT stage was not high. Between the dev team completing the story and the SAT team picking it up were 2 steps - deployment into SAT Servers and showcasing these stories to SMEs and SAT team. Though we were deploying into SAT on a weekly basis, the showcases were done only once in 2 weeks.

There are 2 questions that bother me:
1. Is SAT the bottleneck? Is bottleneck identified purely by the inventory before that stage?
2. How can we ensure we manage the SAT process better?

My thoughts:
1. SAT is not the bottleneck here. If we identify a bottleneck by the number of stories a stage can process when its people are able to work on them full-time, SAT does not qualify, as the SAT team's ability to sign off stories was high (demonstrated during the later iterations of the release). Even if we consider inventory as the measure, showcase is a mandatory step before SAT and is done only once in 2 weeks, so showcase could be the bottleneck. SAT is like the final assembly: what consumes time is not the assembly itself, but the wait for all the parts to come through before it can start.
2. I reckon one thing that can be done is: reduce the batch size. Do more showcases. Once a week maybe.
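A rough sketch of why a smaller batch helps, under a deliberately simple assumption: one story becomes ready for showcase every day, and showcases fall on days that are multiples of the interval. The function name and the 28-day horizon are made up for the illustration.

```python
def average_wait(interval_days, horizon_days=28):
    """Mean number of days a story waits for the next showcase.

    Assumes one story becomes ready each day and showcases are held
    on days that are multiples of interval_days.
    """
    waits = []
    for ready_day in range(horizon_days):
        # Round ready_day up to the next multiple of interval_days.
        next_showcase = -(-ready_day // interval_days) * interval_days
        waits.append(next_showcase - ready_day)
    return sum(waits) / len(waits)


print(average_wait(14))  # 6.5 days with a showcase every 2 weeks
print(average_wait(7))   # 3.0 days with a weekly showcase
```

Under these assumptions, moving from a showcase every 2 weeks to a weekly one roughly halves the time a finished story sits waiting before SAT can pick it up.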

Wednesday, July 1, 2009

Prioritizing stories based on relative affordability

We use value delivered by the story as a primary measure of priority. Would it be more meaningful to consider Relative Value / Cost (relative affordability) as a formula for determining the priority of stories? Cost can be a relative measure of the complexity or effort for implementing and testing the story, determined by the development team (BAs and QAs) - in other words relative size of the story.

This relative affordability of a story gives us a measure of the worthiness of investing in a story. Judging purely based on value is inappropriate because the cost may be prohibitive.
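A minimal sketch of ranking by relative affordability, assuming each story carries a relative value and a relative cost estimate. The story names and numbers are hypothetical.

```python
# Hypothetical stories: (name, relative value, relative cost).
stories = [
    ("Export to PDF", 8, 5),
    ("Single sign-on", 13, 13),
    ("Saved searches", 5, 2),
]


def by_value(stories):
    """Order stories by relative value alone, highest first."""
    return sorted(stories, key=lambda s: s[1], reverse=True)


def by_affordability(stories):
    """Order stories by relative value / relative cost, highest first."""
    return sorted(stories, key=lambda s: s[1] / s[2], reverse=True)


print([name for name, _, _ in by_value(stories)])
# ['Single sign-on', 'Export to PDF', 'Saved searches']
print([name for name, _, _ in by_affordability(stories)])
# ['Saved searches', 'Export to PDF', 'Single sign-on']
```

In this made-up example the highest-value story drops to last place once its cost is taken into account, which is the situation the ratio is meant to expose.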

Estimating value of technical stories

How can we determine the value of a "technical story"?

As discussed below in the case for relative value: The value of a story derives from making the system more attractive for end customers. And that could either be developing a new feature or making the system scale better.

For technical stories that fall under non functional requirements (requirements being the key word here), it should be possible to establish their value based on the above definition. An extreme position would be to de-prioritize any NFR / story that cannot establish how it would make the system more attractive for end customers. There is a possibility that inadequate articulation of the value may lower the perceived value of a tech story. But hey, I would ask the tech guys to articulate the value in a different and better way rather than let it get in the back door.

Note: Value need not necessarily be limited to returns in the short term. Long term benefits should also be considered to be of value, in some cases, more valuable than short term gains.

The concept of Relative Value

One of the interesting things that Chaman brought to my notice was details around TOC and that Throughput should be measured as sales (Rs / dollars).

Now, most people see Throughput in terms of the story points that the dev teams estimate for stories. This is natural but inconsistent with the spirit of TOC. Story points, no matter that they are nebulous and a relative measure, are estimated by the people doing the job (devs, QAs, etc.). And that firmly lands them on the cost side. While it is meaningful to get this estimate, for the purpose of measuring throughput in the context of TOC, we may find a "value" estimate by the business more appropriate.

Now, given that the stories are not independent units but assimilate into a larger system which is then sold to end customers or used by end customers, the value of a story can't be defined with precision. The value of a story derives from making the system more attractive for end customers. As this is an abstraction of "value", its estimation is subject to interpretation and nailing it down accurately is difficult. Hence, relative value. My suggestion would be to use the standard sizing / estimation concepts for this, with some changes. Get the business to answer 2 questions:

1. By implementing this story, how much more direct (and indirect) revenue will the system generate? If this cannot be determined quantitatively, then ask the second question:
2. By implementing this story, how qualitatively useful is it making the system for end users / customers?

Use triangulation to ensure the relative accuracy of these numbers. This should enable us to measure relative value being delivered every iteration and hence throughput.