Source Code Analysis Laboratory (SCALe) for Energy ... - CERT
• individual consistency (for example, how consistent is the individual in rendering the same judgment across time for the same or virtually the same flagged nonconformity; often referred to as repeatability)
• group accuracy (for example, what percentage of the time does a specific group of SCALe analysts render the correct judgment)
• group consistency (for example, what percentage of the time does a specific group of SCALe analysts render the same judgment for a given flagged nonconformity across time; often referred to as reproducibility)
Any modern statistical package can easily determine the attribute agreement measures, which are interpreted as follows [Landis 1977]:
• Less than 0 (no agreement)
• 0–0.20 (slight agreement)
• 0.21–0.40 (fair agreement)
• 0.41–0.60 (moderate agreement)
• 0.61–0.80 (substantial agreement)
• 0.81–1 (almost perfect agreement)
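For the binary (correct/incorrect) case, the agreement measure in question is Cohen's Kappa, which any statistical package computes. As a sketch only, the coefficient can also be computed directly; the judgment lists below are invented for illustration and are not from the report.

```python
# Cohen's Kappa for two analysts' binary judgments on the same set of
# flagged nonconformities ("T" = true positive, "F" = false positive).
# Hypothetical sketch; a statistical package provides the same result.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of identical judgments.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: assume the two raters judge independently.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two analysts judging ten flagged nonconformities.
analyst_1 = ["T", "T", "F", "T", "F", "T", "T", "F", "T", "T"]
analyst_2 = ["T", "F", "F", "T", "F", "T", "T", "T", "T", "T"]
print(round(cohen_kappa(analyst_1, analyst_2), 2))  # about 0.52: moderate agreement
```

On the Landis scale above, a Kappa of roughly 0.52 falls in the 0.41–0.60 band, that is, moderate agreement between the two analysts.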
A need may arise to assess both the accuracy and consistency of SCALe analysts' judgments with measures that extend beyond the binary situation (correct or incorrect) to situations in which a judgment is a gradual measure of closeness to the right answer. In this case, analysts should use an alternative attribute agreement measure and interpret the output much like the Kappa coefficient, with results possible on the dimensions listed above. Kendall coefficients serve well for judgments on an ordinal scale, in which incorrect answers are closer to or farther from the correct answer. A hypothetical example of a judgment on an ordinal scale would be a SCALe analyst asked to rate the severity of a flagged nonconformity, say on a 10-point scale. If the true severity is, for example, 8, and two SCALe analysts provided answers of 1 and 7, respectively, then the severity judgment of 7 would yield a much higher Kendall coefficient than the severity judgment of 1.
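To make the ordinal case concrete, one of the Kendall coefficients (Kendall's tau) can be computed between each analyst's severity ratings and the true severities across several findings. The severity values below are invented for illustration; a single finding, as in the example above, does not suffice for a rank correlation, so the sketch rates five findings.

```python
# Kendall's tau-a between an analyst's ordinal ratings and the true values.
# Hypothetical sketch; statistical packages offer equivalent routines.
from itertools import combinations

def kendall_tau(x, y):
    """Tau-a = (concordant pairs - discordant pairs) / total pairs."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1       # both sequences order the pair the same way
        elif s < 0:
            discordant += 1       # the sequences order the pair oppositely
    return (concordant - discordant) / len(pairs)

true_severity = [8, 3, 6, 9, 2]   # hypothetical "correct" severities
analyst_near  = [7, 4, 6, 8, 1]   # tracks the true ordering closely
analyst_far   = [1, 9, 2, 3, 8]   # largely inverts the true ordering

print(kendall_tau(true_severity, analyst_near))  # high: near-perfect ranking
print(kendall_tau(true_severity, analyst_far))   # negative: poor ranking
```

The analyst whose ratings stay close to the truth earns a Kendall coefficient near 1, while the analyst who is far off scores near or below 0, mirroring the severity-8 example above.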
In conclusion, attribute agreement analysis may be conducted via small exercises in which SCALe analysts render judgments on a reasonably sized list of different types of flagged nonconformities mapped to the set of rules within the scope of a given code conformance test.
2.7.2.2 Formative Evaluation Assessment Using Attribute Agreement Analysis
SCALe analysts participate in a formative evaluation assessment as part of their training and certification. Certification of a candidate as a SCALe analyst requires attribute agreement scores of 80 percent or higher. In addition, acceptable thresholds for accuracy may be imposed separately for each rule.
The formative evaluation assessment implements a simple attribute agreement analysis as follows. First, the SCALe assessment administrator identifies a preliminary set of 20 to 35 different flagged nonconformities (from the diagnostic output of a software system or code base) for the evaluation assessment. The administrator ensures that the preliminary set includes a variety of
CMU/SEI-2010-TR-021 | 22