
performance domains at the same level of specificity (e.g., Sackett et al., 1988). Each of these two items was rated on a five-point Likert scale (1 to 5), with higher scores reflecting greater efficiency or quality of team actions. Because these two scores were highly correlated (r = .72), we averaged them to form a composite team performance score. Given that ICC(1) was .35 (p < .05) and ICC(2) was .85, we averaged the scores across raters to form an overall typical team performance rating for each team.
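
To make the aggregation step concrete, the following is a minimal sketch (not the authors' code) of how ICC(1) and ICC(2) can be estimated from a teams-by-raters matrix using the one-way random-effects ANOVA formulas (e.g., Bliese, 2000), after which the team mean is used as the typical performance score. The ratings matrix shown is a hypothetical placeholder, not the study data.

```python
import numpy as np

ratings = np.array([   # rows = teams, columns = raters (hypothetical values)
    [3.5, 4.0, 3.0],
    [2.5, 2.0, 3.0],
    [4.5, 4.0, 4.5],
    [3.0, 3.5, 2.5],
])

n_teams, k_raters = ratings.shape
grand_mean = ratings.mean()
team_means = ratings.mean(axis=1)

# One-way ANOVA mean squares (teams as the grouping factor)
ss_between = k_raters * np.sum((team_means - grand_mean) ** 2)
ss_within = np.sum((ratings - team_means[:, None]) ** 2)
ms_between = ss_between / (n_teams - 1)
ms_within = ss_within / (n_teams * (k_raters - 1))

# ICC(1): proportion of variance attributable to team membership
icc1 = (ms_between - ms_within) / (ms_between + (k_raters - 1) * ms_within)
# ICC(2): reliability of the team mean rating
icc2 = (ms_between - ms_within) / ms_between

# With an adequate ICC(2), ratings are averaged across raters per team
overall_typical_rating = team_means
print(icc1, icc2)
```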

Team Performance in Maximum Contexts. These ratings were collected during a one-day assessment center conducted at the end of the team training to evaluate the combat proficiency of the team. One external assessor was randomly assigned to evaluate the performance of each team over a series of six military tasks (e.g., the team may be tasked by HQ to evacuate a casualty from one place to another); these tasks comprehensively summarized the types of tasks performed as part of the team training. The assessor used a five-point Likert scale to evaluate the efficiency and the quality of team actions on each of the tasks. Because there was only one assessor per team, inter-rater reliability was not available; however, the inter-task reliability was .90 for the efficiency measure and .87 for the quality of team actions. As with the ratings collected under the typical performance context, these two measures were highly correlated (r = .67, p < .01), so a composite team performance measure was created.
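
As an illustration of this step, the sketch below computes an internal-consistency estimate across the six task ratings (assuming inter-task reliability refers to Cronbach's alpha) and then forms the composite maximum performance score. The generated data are hypothetical placeholders, not the study data, so the resulting coefficients will not match the reported values.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency of a teams-by-tasks matrix of ratings."""
    n_tasks = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (n_tasks / (n_tasks - 1)) * (1 - item_variances / total_variance)

# Hypothetical 1-5 ratings for 39 teams on six tasks (placeholders only)
rng = np.random.default_rng(0)
efficiency = rng.integers(1, 6, size=(39, 6)).astype(float)
quality = rng.integers(1, 6, size=(39, 6)).astype(float)

alpha_efficiency = cronbach_alpha(efficiency)  # study reports .90
alpha_quality = cronbach_alpha(quality)        # study reports .87

# Composite maximum performance score: mean of the two task-averaged ratings
maximum_composite = (efficiency.mean(axis=1) + quality.mean(axis=1)) / 2
```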

Content Equivalence of the Performance Ratings. To ensure that the content and specificity of the performance measures were sufficiently similar across the two performance contexts, we asked 10 subject matter experts (SMEs) with extensive experience in training and evaluating this type of combat team to respond to a 10-item survey. The survey sought their opinions about the overlap of the performance domains being assessed by the supervisors and the assessors under the two performance conditions. As shown in Table 1, the responses from these SMEs indicate that the performance domains assessed by the supervisors and the assessors were highly similar in content. The mean response from these SMEs on the 10-item survey was 5.2 on a six-point scale, indicating that these raters believed the overlap between the two performance measures was present to at least “a great extent.” This information, coupled with the fact that raters were instructed to consider only team behaviors associated with the content of the training, and that the same performance dimensions and response scales were used for both conditions, suggests that the performance ratings assessed the same performance domain with the same degree of specificity. Thus, the ratings obtained under the two contexts differ in the performance demands placed on the team (typical or maximum), and perhaps also in the knowledge raters had about the teams and leaders (a point we return to in the Limitations section).

Results

Power Analysis and Data

In contrast to research at the individual level of analysis, the difficulty of collecting data from large samples of intact teams usually results in smaller sample sizes. For instance, Liden, Wayne, Judge, Sparrowe, Kraimer, and Franz (1999) analyzed 41 workgroups, while Marks, Sabella, Burke, and Zaccaro (2002) analyzed 45 teams. Indeed, Cohen and Bailey (1997) report that the number of teams in project team research (the type of teams most similar to those in the present study) averages only 45 per study. Such is the case in this study, making careful consideration of power, p-values, and effect size important.

A power analysis shows that, given a sample size of 39 teams, there is only a 59% chance of detecting moderate effects at p < .05 (one-tailed) (Cohen, 1988). However, with a one-tailed test and p < .10, power becomes 73%. Hence, because of the low statistical power, we considered values with p < .10 (one-tailed), rather than p < .05 (one-tailed), to be statistically significant.
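
These figures can be reproduced approximately with a standard Fisher-z power calculation, assuming the "moderate effect" refers to a medium correlation (r = .30 in Cohen's terms) tested one-tailed. The sketch below illustrates that assumed calculation; it is not the authors' original analysis.

```python
from math import atanh, sqrt
from scipy.stats import norm

def power_correlation(r: float, n: int, alpha: float) -> float:
    z_effect = atanh(r)            # Fisher z of the population correlation
    se = 1.0 / sqrt(n - 3)         # standard error of Fisher z
    z_crit = norm.ppf(1 - alpha)   # one-tailed critical value
    return 1.0 - norm.cdf(z_crit - z_effect / se)

print(power_correlation(0.30, 39, 0.05))  # approximately .58 (reported: 59%)
print(power_correlation(0.30, 39, 0.10))  # approximately .72 (reported: 73%)
```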
