Statistical Testing Using Automated Search

Finding near-optimal probability distributions for this SUT is particularly difficult. For many coverage elements, only a few input vectors exercise the element (e.g., code reached only when the player has a winning position), and those vectors belong to disconnected regions of the input domain.

6.1.2 nsichneu

This function consists of automatically generated C code that implements a Petri net. It forms part of a widely used suite of benchmark programs for worst-case execution time research [34]. The complexity of this example arises from the large number (more than 250) of if statements and the data flow between them. The function was modified to restore the number of iterations of the main loop to its original (larger) value in order to reduce the amount of unreachable code.

Not only does this function have a large number of coverage elements, but the code itself also takes significantly longer to execute than the other two SUTs, making fitness evaluation particularly time consuming.

6.2 Assessing Fault-Detecting Ability

The fault-detecting ability of a test set was assessed by its ability to detect mutant versions of the SUT.

The mutants were generated using the Proteum/IM tool [35]. Sets of mutants were created by making single-point mutations, using each of the large number of mutation operators built into the tool. Since the mutants were used only to compare test sets, there was no need to identify and remove equivalent mutants. For nsichneu, a random selection consisting of 10 percent of all possible mutants was used so that the mutant set was of a practical size; for the other two SUTs, all of the possible mutants were used (see Table 1).

The mutation score (the proportion of mutants "killed" by a test set) is used as the measure of a test set's ability to detect faults. A mutant was considered "killed" if, for one or more test cases, it produced a different output from the unmutated SUT or its process was terminated by a different operating system signal. The testing process was configured so that mutant executables were terminated by the operating system if they used more than 1 s of CPU time. This limit is significantly longer than the CPU time required by the unmutated versions of the SUTs, and so it identifies and terminates any mutant that enters an infinite loop.
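The paper gives no harness code for this kill check, but the criterion translates directly into a small script. The sketch below is a minimal illustration, assuming a POSIX system and byte-string test inputs; the function names (run_sut, mutation_score) and harness details are ours, not Proteum/IM's.

```python
import resource
import subprocess

CPU_LIMIT_S = 1  # mutants using more than 1 s of CPU time are assumed to loop

def _cap_cpu_time():
    # Runs in the child before exec: ask the OS to terminate the process
    # once it has consumed CPU_LIMIT_S seconds of CPU time.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_LIMIT_S, CPU_LIMIT_S))

def run_sut(executable, test_input):
    """Run one test case (test_input as bytes); return (stdout, returncode).

    A negative return code means the process was terminated by that
    operating system signal, so signal differences are detected too.
    """
    proc = subprocess.run([executable], input=test_input,
                          capture_output=True, preexec_fn=_cap_cpu_time)
    return proc.stdout, proc.returncode

def mutation_score(original, mutants, test_set):
    """Proportion of mutants killed: a mutant is killed if any test case
    yields a different output or terminating signal than the original."""
    killed = sum(
        any(run_sut(m, t) != run_sut(original, t) for t in test_set)
        for m in mutants
    )
    return killed / len(mutants)
```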
Test sets were generated by sampling input vectors from the probability distribution without replacement: if an input vector was already in the test set, it was rejected and a further input vector was sampled. We believe this is consistent with how test sets would be generated in practice: using more than one test case with the same input vector usually offers little benefit in terms of the number of faults detected. (Although not described here, initial experimentation showed that a with-replacement mechanism produced results that were qualitatively no different.)
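As a concrete illustration of this without-replacement mechanism (a sketch under our own naming, not code from the paper), the sampler below draws from the evolved distribution and rejects any vector already in the test set. It assumes input vectors are hashable and that size does not exceed the number of vectors with nonzero probability.

```python
import random

def sample_test_set(vectors, weights, size, seed=0):
    """Build a test set by weighted sampling without replacement:
    a vector already in the test set is rejected and a further
    vector is sampled, as described above."""
    rng = random.Random(seed)
    test_set, seen = [], set()
    while len(test_set) < size:
        vector = rng.choices(vectors, weights=weights, k=1)[0]
        if vector not in seen:  # reject duplicates and resample
            seen.add(vector)
            test_set.append(vector)
    return test_set
```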

6.3 Experiment A

Thirty-two runs of the search algorithm were performed, and the proportion of searches that found distributions satisfying the coverage constraint (the statistical testing adequacy criterion) was measured. A diversity constraint was not applied in this experiment.

TABLE 2. Search Parameters

The searches were run on a server-class machine running a customized version of Slackware Linux. Each search was a single-threaded process and used the equivalent of one core of a CPU running at 3 GHz with a 4 MB cache. The CPU user times taken by both successful searches (those finding a suitable distribution) and unsuccessful searches were recorded. The CPU user time is approximately equivalent to the wall-clock time when running on an otherwise unloaded server.

The parameters used for the search runs are listed in Table 2. Since the objective of the experiment is to show the practicality of the technique, little time was spent tuning the parameters and, where possible, the same parameters were used for all three SUTs. The only difference between the 32 runs was the seed provided to the pseudorandom number generator (PRNG).

Using multiple runs in this way allows us to estimate the expected proportion of successful searches and the expected time taken across all possible seeds to the PRNG. The choice of 32 runs is a compromise between the accuracy of these estimates and the resources (computing power and time) that were available for experimentation. During the analysis of the results in Section 7.2, we provide confidence intervals for these estimates as a measure of the accuracy that was achieved.
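The paper does not state here which interval construction Section 7.2 uses; as one plausible illustration for a proportion estimated from only 32 runs, a Wilson score interval (which behaves better than the normal approximation at small sample sizes) can be computed as follows. The success count is hypothetical.

```python
from math import sqrt

def wilson_interval(successes, runs, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives an approximate 95 percent interval)."""
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = z * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half_width, centre + half_width

# e.g., if 28 of the 32 searches satisfied the coverage constraint
low, high = wilson_interval(28, 32)
print(f"95% CI for success proportion: ({low:.3f}, {high:.3f})")
```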

The coverage target, which is equivalent to the lower bound of the coverage element probabilities, is near the optimal values for simpleFunc (0.25) and bestMove (0.1667). These optimal values are straightforward to determine by manual examination of the control flow graphs of these two SUTs. The optimal lower bound itself is not used as the coverage target since it is difficult to obtain this exact fitness even when the distribution itself is optimal: doing so requires a random sample in which each input vector occurs with a frequency that is exactly proportional to its probability in the distribution, and this is unlikely for the relatively small, finite samples used to evaluate the fitness.
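To make this sampling effect concrete, here is a minimal sketch under our reading of the above: the fitness of a candidate distribution is estimated as the minimum, over coverage elements, of each element's hit frequency in a finite sample of executions. The function covered_elements is an assumption for illustration, not a detail from the paper.

```python
def estimated_fitness(sample, covered_elements, all_elements):
    """Estimate the lower bound of the coverage element probabilities
    from a finite sample of input vectors.

    covered_elements(vector) returns the set of coverage elements
    exercised when the SUT runs on that input vector.
    """
    hits = {element: 0 for element in all_elements}
    for vector in sample:
        for element in covered_elements(vector):
            hits[element] += 1
    # Each element's probability is estimated by its hit frequency;
    # sampling noise means the estimate rarely equals the true optimum
    # exactly, even when the distribution itself is optimal.
    return min(count / len(sample) for count in hits.values())
```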

The optimal lower bound for nsichneu is difficult to determine from its control flow graph owing to the complexity of the data dependencies. Instead, we use a