Statistical Testing Using Automated Search - Crest

coverage target that is close to the best value found during preliminary experimentation. This experiment is designed to provide evidence for Hypothesis 1, that it is practical to use automated search to derive probability distributions for statistical testing.

6.4 Experiment B

Experiment B measures the fault-detecting ability of test sets generated by 10 of the probability distributions found in Experiment A. For each of the 10 distributions, 10 test sets were generated by random sampling without replacement, using a different PRNG seed in each case. Each time a test case was added to the test set, the mutation score was assessed using the method described in Section 6.2.

For comparison with random testing, the same experiment was performed using a uniform distribution in place of that derived by automated search. One hundred test sets were generated from the uniform distribution so that the two samples had the same size.

This experiment is designed to provide data to demonstrate Hypotheses 2 and 3 that, when using automated search, the superior efficiency of statistical testing compared to uniform random and deterministic structural testing is maintained.

6.5 Experiment C

Experiment C compares the fault-detecting ability of probability distributions that have different lower bounds. It is designed to provide evidence for Hypothesis 4, that probability distributions with higher lower bounds generate more efficient test sets.

For each SUT, a sequence of probability distributions was taken at points along the trajectory of one search run. In general, such a sequence consists of distributions with increasing probability lower bounds. The search run used the same parameters as in Experiment A. Twenty test sets were generated from each distribution and their mutation scores assessed over a range of test sizes.

6.6 Experiment D

Experiment D compares the fault-detecting ability of test sets generated from distributions found by automated search with and without a diversity constraint. It provides data to support Hypothesis 5, that the use of a diversity constraint results in more efficient test sets.

A sample of 10 distributions was found using the parameters specified in Table 2. The parameters have a nonzero value for w_div and so add a diversity constraint to the search objectives. The coverage target parameters (t_cov) for bestMove and nsichneu were less than the near-optimal values of Experiment A in order to demonstrate that a lack of diversity has a significant effect even when using distributions with only moderate lower bounds.

For comparison, a further sample of distributions was found using the same parameters, but with w_div set to zero in order to disable the diversity constraint. More than 10 such distributions were found, and a subset of size 10 was selected in a principled manner so that the distributions of the coverage fitnesses across the two samples were as similar as possible. This was designed to minimize the effect of probability lower bounds on this experiment.

Ten test sets were generated from each distribution in the two samples, and their mutation scores assessed over a range of test sizes.
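Experiments B, C, and D share a common procedure: test sets are built by random sampling without replacement from a probability distribution over the input domain, and the mutation score is assessed each time a test case is added. The sketch below illustrates that loop only; it is not the authors' tooling, and the input domain, the distribution weights, and the mutation-score stub are hypothetical stand-ins.

```python
import random

INPUT_DOMAIN = list(range(100))               # hypothetical discrete input domain

def sample_input(weights, rng):
    """Draw one test input according to the distribution's probabilities."""
    return rng.choices(INPUT_DOMAIN, weights=weights, k=1)[0]

def mutation_score(test_set):
    """Stand-in for the mutation analysis of Section 6.2 (fraction of mutants killed)."""
    return min(1.0, len(test_set) / 50.0)     # placeholder only, not a real measure

def generate_test_set(weights, max_size, seed):
    """Sample max_size distinct test cases, recording the score after each addition."""
    rng = random.Random(seed)                 # a different PRNG seed for each test set
    test_set, scores = [], []
    while len(test_set) < max_size:
        candidate = sample_input(weights, rng)
        if candidate in test_set:             # sampling without replacement:
            continue                          # skip inputs already in the test set
        test_set.append(candidate)
        scores.append(mutation_score(test_set))
    return test_set, scores

# Experiment B style comparison: test sets from a derived distribution vs. a uniform one.
derived_weights = [2.0 if x < 20 else 0.5 for x in INPUT_DOMAIN]   # hypothetical weights
uniform_weights = [1.0] * len(INPUT_DOMAIN)
derived_runs = [generate_test_set(derived_weights, 30, seed)[1] for seed in range(10)]
uniform_runs = [generate_test_set(uniform_weights, 30, seed)[1] for seed in range(100)]
```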
7 RESULTS AND ANALYSIS

7.1 Statistical Analysis

In this section, results are summarized using the mean. Although the median is potentially a more robust statistic for the skewed distributions that the results could exhibit, it was found to be misleading when, on occasion, the data formed similarly sized clusters around two (or more) widely separated values. In this case, the median returned a value from one of the clusters, while the mean gave a more meaningful statistic located between the two clusters. Confidence intervals quoted for the mean values, and the error bars shown in the graphs, are at the 95 percent confidence level. They are calculated using bootstrap resampling, specifically the bias-corrected and accelerated percentile method [36].

Nonparametric statistical tests are used to analyze the data. Since parametric statistical tests can be invalidated by small deviations from the assumptions that the tests make [37], the use of nonparametric tests ensures the validity of the analysis, while avoiding the need to perform additional analysis to verify that the data conform to the test assumptions.

To compare samples, the nonparametric Mann-Whitney-Wilcoxon or rank-sum test is applied [38]. The null hypothesis for the rank-sum test is that the samples are from the same distribution; the alternative hypothesis is that the distributions are different. We apply the test at a 5 percent significance level.

Given a sufficiently large sample, hypothesis tests such as the rank-sum test can demonstrate statistically significant results even when the underlying differences are extremely small. Therefore, we use an additional test to show that the effect size (in this case, a difference in the ability to detect faults) is large enough to be meaningful, given the variability in the results. The nonparametric Vargha-Delaney A-test [39] is used here since its value can be calculated from the statistic used by the rank-sum test. We use the guidelines presented in [39] that an A-statistic of greater than 0.64 (or less than 0.36) is indicative of a "medium" effect size, and of greater than 0.71 (or less than 0.29), of a "large" effect size.

7.2 Experiment A

The results of Experiment A are summarized² in Table 3.

TABLE 3
Search Times (Experiment A)

In this table, þ is the proportion of search algorithm runs that found a suitable probability distribution (i.e., with

2. The unsummarized data for all four experiments is available from: http://www.cs.york.ac.uk/~smp/supplemental.
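As a concrete illustration of the analysis described in Section 7.1, the sketch below applies the rank-sum test, derives the Vargha-Delaney A-statistic from the Mann-Whitney U statistic, and computes a BCa bootstrap confidence interval for the mean. It is not code from the paper; it assumes NumPy and SciPy (1.7 or later), and the two mutation-score arrays are illustrative placeholders, not results reported here.

```python
import numpy as np
from scipy import stats

def vargha_delaney_a(x, y):
    """Vargha-Delaney A-statistic, derived from the Mann-Whitney U statistic:
    A = U1 / (n1 * n2), the probability that a value drawn from x exceeds one
    drawn from y (ties counted as 0.5)."""
    u1, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
    return u1 / (len(x) * len(y))

def effect_size_label(a):
    """Guideline thresholds from Vargha and Delaney [39], as used in the paper."""
    d = abs(a - 0.5)
    if d > 0.21:                 # A > 0.71 or A < 0.29
        return "large"
    if d > 0.14:                 # A > 0.64 or A < 0.36
        return "medium"
    return "small or negligible"

# Illustrative mutation scores for two samples of test sets (placeholder values).
statistical = np.array([0.91, 0.88, 0.93, 0.90, 0.92, 0.89, 0.94, 0.90, 0.91, 0.92])
uniform     = np.array([0.78, 0.81, 0.75, 0.80, 0.79, 0.77, 0.82, 0.76, 0.80, 0.78])

# Rank-sum test at the 5 percent significance level, plus effect size.
_, p_value = stats.mannwhitneyu(statistical, uniform, alternative="two-sided")
a = vargha_delaney_a(statistical, uniform)
print(f"p = {p_value:.4g}, A = {a:.2f} ({effect_size_label(a)})")

# 95 percent confidence interval for the mean via BCa bootstrap resampling.
ci = stats.bootstrap((statistical,), np.mean, confidence_level=0.95, method="BCa")
print("95% CI for the mean:", ci.confidence_interval)
```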
