
Statistical Testing Using Automated Search - Crest



772 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 36, NO. 6, NOVEMBER/DECEMBER 2010

Fig. 4. Fault-detecting ability of statistical testing compared to uniform random testing (Experiment B). (a) simpleFunc, (b) bestMove, (c) nsichneu.

a probability lower bound greater than or equal to the target value, $t_{cov}$ given in Table 2). $T_+$ is the mean time taken by these successful runs (measured as CPU user time), and $T_-$ the mean time taken by unsuccessful runs.

We may consider each run of the algorithm to be equivalent to a flip of a coin where the probability of the coin landing "heads up", or of the algorithm finding a suitable distribution, is $\pi_+$. Trials such as these, where there are two possible outcomes and they occur with fixed probabilities, are termed Bernoulli trials [40]. In practice, after an unsuccessful search, we would run another search with a new, different seed to the PRNG; in other words, we would perform a series of Bernoulli trials until one is successful. The number of failures that occur before a successful outcome in a series of Bernoulli trials follows a geometric distribution [40], with the mean number of failures being given by $(1 - \pi_+)/\pi_+$ [41]. Therefore, we may calculate the mean time taken until a successful search occurs (including the time taken for the successful search itself) as:

$$C_+ = \frac{1 - \pi_+}{\pi_+} T_- + T_+ \qquad (11)$$

The calculated values for $C_+$ are given in Table 3 and are a prediction of how long, on average, it would take to find a suitable probability distribution.

It is a conservative prediction in that it assumes the availability of only one CPU core on which searches can be run. Software engineers are likely to have access to significantly more computing resources than this, even on their desktop PC, which would allow searches to be performed in parallel.

We argue that the predicted time until success is indicative of practicality for all three SUTs.
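As a rough illustration of Eq. (11), the following sketch computes the predicted time until success and compares it with a simple simulation of repeated searches on a single core. The function names and the input values are hypothetical, chosen only for illustration; they are not taken from Table 3.

```python
import random


def expected_time_to_success(pi_plus, t_plus, t_minus):
    """Eq. (11): mean time until the first successful search.

    pi_plus -- probability that a single search succeeds (assumed fixed)
    t_plus  -- mean time of a successful search
    t_minus -- mean time of an unsuccessful search
    """
    if not 0.0 < pi_plus <= 1.0:
        raise ValueError("pi_plus must be in (0, 1]")
    # Mean number of failures before the first success (geometric distribution).
    mean_failures = (1.0 - pi_plus) / pi_plus
    return mean_failures * t_minus + t_plus


def simulate_time_to_success(pi_plus, t_plus, t_minus, trials=20000, seed=0):
    """Monte Carlo check: repeat Bernoulli trials until one succeeds.

    Simplifying assumption: every failed search costs exactly t_minus and
    the successful one exactly t_plus (the means, not full distributions).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        while rng.random() >= pi_plus:  # unsuccessful search, try a new seed
            total += t_minus
        total += t_plus  # the successful search itself
    return total / trials
```

With the hypothetical inputs `pi_plus=0.5`, `t_plus=10.0`, `t_minus=4.0`, the closed form gives 14.0, since on average one failed search (4.0) precedes the successful one (10.0); the simulated mean should lie close to that value.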
The longest time, the value of $C_+$ for nsichneu, is approximately 10 hours, rising to 22 hours at the extreme end of its confidence interval. These results provide strong evidence for Hypothesis 1, that automated search can be a practical method of deriving probability distributions for statistical testing.

7.3 Experiment B

The results of Experiment B are summarized in Fig. 4.

The upper row of graphs compares the mutation score of test sets derived from statistical and uniform random testing over a range of test sizes. The mutation scores were assessed at each integer test size in the range, but for clarity, error bars are shown at regular intervals only.

The lower row of graphs shows the p-value (left-hand axis, using a log scale) for the rank-sum test applied at each test size, and the effect size A-statistic (right-hand axis). Dotted lines indicate a 5 percent p-value and the lower boundaries of the medium effect size regions. p-values below the 5 percent line indicate that the differences in the mutation scores are statistically significant. A-statistic values above the 0.64 line, or below the 0.36 line, indicate a medium or large effect size.

The graphs demonstrate statistically significant differences in the mutation scores of test sets derived by statistical and uniform random testing, with the former having the
