
Statistical Testing Using Automated Search - Crest


POULDING AND CLARK: EFFICIENT SOFTWARE VERIFICATION: STATISTICAL TESTING USING AUTOMATED SEARCH

greater fault-detecting ability for most test sizes. The effect size is medium or large for many test sizes.

These results support Hypothesis 2 that test sets generated using statistical testing are more efficient than test sets generated by uniform random testing: For the same test size, the former test sets typically have greater fault-detecting ability.

The exception is for larger test sets (size > 100) applied to nsichneu. For these larger test sets, uniform random testing is significantly more effective at detecting faults. We suspect that this is due to a lack of diversity in the probability distribution, and this motivates the use of a diversity constraint during search.

The horizontal dotted lines in the upper row of graphs in Fig. 4 are the mean mutation score for each statistical test set at the size at which it first exercises all of the coverage elements in the SUT at least once. (The vertical dotted lines are the mean of these sizes.) At these sizes, the test sets satisfy the adequacy criterion of deterministic structural testing. However, due to the stochastic manner in which they are generated, they will usually contain more test cases than a minimally sized test set. The additional test cases can only improve the mutation score of the test set; therefore, their mutation scores (the horizontal dotted lines) represent an upper bound for the mutation scores achievable by deterministic structural testing.

For each SUT, the mutation scores obtained by statistical testing exceed this upper bound at sufficiently large test sizes.
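The claim that additional test cases can only improve the mutation score follows directly from the score's definition: a mutant killed by any test in the set remains killed when tests are added, so the set of killed mutants grows monotonically. A minimal sketch of this (the function name and data are our illustrative placeholders, not the authors' tooling):

```python
# Mutation score = fraction of mutants killed by at least one test case.
# killed_by[t] is the set of mutant IDs that test case t kills.

def mutation_score(test_set, killed_by, num_mutants):
    """Score of a test set: |union of per-test killed sets| / total mutants."""
    killed = set()
    for t in test_set:
        killed |= killed_by.get(t, set())
    return len(killed) / num_mutants

# Hypothetical data: 10 mutants, three test cases.
killed_by = {"t1": {0, 1, 2}, "t2": {2, 3}, "t3": {7}}

base = mutation_score(["t1", "t2"], killed_by, 10)
extended = mutation_score(["t1", "t2", "t3"], killed_by, 10)
assert extended >= base  # adding test cases never lowers the score
print(base, extended)    # 0.4 0.5
```

Because the union of killed sets can only grow, the score at the adequacy size is indeed a lower limit on what the full stochastic test set achieves, which is what makes it usable as an upper bound for minimal deterministic test sets.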
This is evidence in support of Hypothesis 3, that test sets derived using statistical testing, where each coverage element is exercised multiple times, detect more faults than test sets typical of deterministic structural testing that exercise each element only once.

7.4 Cost-Benefit Analysis

The results of Experiments A and B taken together enable a simple cost-benefit analysis that compares statistical testing with uniform random testing. This can be performed in a number of ways: for example, by considering the difference in fault-detecting ability of test sets of the same size for the two methods, or the difference in the cost of testing for test sets of different sizes that have the same fault-detecting ability. We take the second of these two approaches in order to avoid speculation on the costs of later rectifying faults that are not discovered during testing, as these costs are highly dependent on the context.

If we choose a fault-detecting ability equivalent to a mutation score of 0.55 for bestMove, the results of Experiment B show that this occurs at a test size of 4 for statistical testing and at a test size of 61 for uniform random testing. (These sizes are obtained from the data used to plot the graphs of Fig. 4b.) Statistical testing incurs a cost in finding a suitable probability distribution, and from Table 3, this search takes, on average, 129 minutes. If executing, and, often more significantly, checking the results of the tests against a specification or an oracle takes longer than 129 minutes for the additional 57 test cases required by random testing, then statistical testing will be more efficient in terms of time. We argue that when checking test results involves a manual comparison against a specification, the superior time efficiency of statistical testing is likely to be realized.
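The break-even point in this comparison can be made explicit: statistical testing wins on total time whenever the per-test-case cost of executing and checking results exceeds the search cost divided by the number of extra test cases random testing needs. A small sketch using the bestMove figures above (the function name is ours, not the authors'):

```python
def break_even_minutes_per_test(search_minutes, stat_size, random_size):
    """Per-test-case execute-and-check time above which statistical
    testing is cheaper in total time than uniform random testing."""
    extra_tests = random_size - stat_size
    return search_minutes / extra_tests

# bestMove at mutation score 0.55: test sizes 4 (statistical) vs 61
# (random); the search for a distribution averages 129 minutes (Table 3).
threshold = break_even_minutes_per_test(129, 4, 61)
print(round(threshold, 2))  # 2.26 minutes per test case
```

So for bestMove, statistical testing is faster overall as soon as executing and checking one test case takes more than about two and a quarter minutes, which is plausible whenever an oracle comparison is manual.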
We also note that the search for a probability distribution is automated, requiring little or no manual effort, and so we speculate that statistical testing is likely to be superior in terms of monetary cost (even in situations where it is not superior in terms of time) if the execution and checking of the additional 57 test cases is a predominantly manual, and therefore expensive, process.

We may repeat this analysis for nsichneu, again choosing a fault-detecting ability equivalent to a mutation score of 0.55. In this case, random testing requires 14 more test cases than statistical testing: The test sizes are 66 and 52, respectively. The search for a probability distribution suitable for statistical testing takes 592 minutes. The superiority of statistical testing in terms of time is not as convincing for nsichneu: Random testing is quicker overall if executing and checking the extra 14 test cases takes no longer than 10 hours. However, we again note that the search for the probability distribution is automated, and so, considering monetary cost, statistical testing may be cheaper than random testing if checking the results of the extra 14 test cases is a manual process.

The simple analyses of this section are dependent on the mutation score we choose. Nevertheless, for bestMove, the large difference in the fault-detecting ability of the two methods shown by Fig. 4b suggests that the general conclusion remains the same for most choices of mutation score. For nsichneu, however, any benefit of statistical testing is lost for mutation scores above approximately 0.6. As can be seen in Fig. 4c, mutation scores above this level are achieved by random testing with test sizes that are smaller than those required by statistical testing. This underlines the importance of investigating whether the use of a diversity constraint improves the fault-detecting ability of statistical testing.

7.5 Experiment C

The results of Experiment C are summarized in Fig. 5.
Each graph shows the mutation scores for a sequence of probability distributions taken from the trajectory of one search run. The probability lower bound for each distribution (evaluated accurately using a large sample size) is plotted on the x-axis, and the mutation score (with error bar) on the y-axis. The lines connect mean mutation scores calculated at the same test size. Note that the left-hand point in each graph is the uniform probability distribution that is used to initialize the search: This distribution is equivalent to that used for uniform random testing.

At small values of the lower bound, at the left of the graphs, mutation scores improve as the probability lower bound increases. However, for larger values of the lower bound at the right of Fig. 5c, the mutation score begins to decrease. There is also some evidence for this effect at the highest lower bound values in Fig. 5b.

We suspect that this is because diversity can be lost as the search proceeds: the probability distribution is "overfitted" to the coverage constraint, and so, despite a high lower bound, the distribution generates test sets that are relatively poor at detecting faults. This again motivates the use of a diversity constraint in order to avoid this loss of efficiency as the search proceeds.

The results of Experiment C provide some evidence in support of Hypothesis 4 that distributions with higher lower bounds generate test sets with a greater ability to detect faults. However, this is not true in all cases: Searching for near-optimal lower bounds can be counter-productive for some SUTs, and we suspect this is a result of losing diversity.
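The lower bound plotted on the x-axis can be estimated in the manner the text describes, by sampling: draw many test inputs from the distribution, record which coverage elements each input exercises, and take the minimum empirical exercise probability over all elements. A hedged sketch of such an estimator (the input distribution, coverage function, and all names here are illustrative placeholders, not the authors' implementation):

```python
import random
from collections import Counter

def estimate_lower_bound(sample_input, covered_elements, elements, n=100_000):
    """Monte Carlo estimate of the minimum, over coverage elements, of
    the probability that a sampled test input exercises that element."""
    counts = Counter()
    for _ in range(n):
        x = sample_input()
        for e in covered_elements(x):
            counts[e] += 1
    return min(counts.get(e, 0) / n for e in elements)

# Illustrative SUT with two branches: x < 0.3 exercises "b1", else "b2".
elements = ["b1", "b2"]
random.seed(0)
lb = estimate_lower_bound(
    lambda: random.random(),
    lambda x: ["b1"] if x < 0.3 else ["b2"],
    elements,
)
print(lb)  # close to 0.3 under this uniform input distribution
```

A large sample size `n` is needed because the bound is a minimum over elements: the rarest element dominates, and its exercise probability is exactly the quantity estimated least precisely.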
