
Hands-on managers used these reports extensively. Differences among TAs were discussed, and ambiguities in the interpretation of scoring rules were resolved through discussion and, if required, additional training. Individual and group trends could be detected. Individual TA counselling focused on adherence to the original training standards and the definition and interpretation of scoreable steps. Hands-on managers avoided overemphasis on consistency to prevent artificially high levels of agreement.

Interrater Reliability Results

Interrater reliability, or TA agreement, can indicate the presence of several possible error sources: test design, time, environmental, or other effects. Interrater reliability is the percentage agreement between primary and shadow scorers on individual task steps. It is computed by dividing the number of steps on which the primary and shadow scorer agreed by the total number they both graded, summed across all examinees and all tasks. It was calculated using all observations where both primary and shadow step scores were available.
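As a rough illustration, the computation can be sketched in Python as follows. The data layout (pooled pairs of primary and shadow step scores across examinees and tasks) and the function name are assumptions for illustration, not taken from the paper.

    # Minimal sketch of the percentage-agreement computation described above.
    # Each element of step_scores is a (primary, shadow) pair of step scores,
    # pooled across all examinees and tasks; None marks a missing score.

    def percentage_agreement(step_scores):
        # Keep only observations where both primary and shadow scores exist,
        # mirroring the paper's use of jointly scored steps.
        both_scored = [(p, s) for p, s in step_scores
                       if p is not None and s is not None]
        if not both_scored:
            raise ValueError("no jointly scored steps")
        # Agreement = steps scored identically / steps both scorers graded.
        agreed = sum(1 for p, s in both_scored if p == s)
        return agreed / len(both_scored)

    # Hypothetical example: 7 of 8 jointly scored steps agree -> 0.875
    scores = [("GO", "GO"), ("GO", "GO"), ("NO GO", "NO GO"), ("GO", "NO GO"),
              ("GO", "GO"), ("NO GO", "NO GO"), ("GO", "GO"), ("GO", "GO")]
    print(percentage_agreement(scores))  # 0.875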

Fig. 1: Agreement by task

Fig. 2: Agreement by time period (agreement between primary and shadow test administrators, by time interval)

Figure 1 shows scorer agreement across tasks for automotive mechanics. Agreement ranged from .873 to .971, indicating that TAs could reliably differentiate "Go" and "No Go" performance. The lowest reliabilities at both sites were on troubleshooting tasks, indicating some ambiguity in scoring the steps on those tasks. Three of the lowest four reliabilities occurred on tasks which were hard to observe because of confined spaces. The fact that the relative reliabilities among tasks were the same between sites also indicates a good training program and suggests that reliability differences were due to test effects.

