13.07.2015 Views

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Kasper <strong>and</strong> ÜnlüAssumptions of factor analytic approachesdistribution, <strong>and</strong> <strong>the</strong> discrepancy between <strong>the</strong> estimated <strong>and</strong> <strong>the</strong>true loading matrix. The latter two criteria are computed using<strong>the</strong> true number of factors. Fur<strong>the</strong>rmore, Shapiro-Wilk tests forassessing normality of <strong>the</strong> ability estimates are presented <strong>and</strong>distributions of <strong>the</strong> estimated <strong>and</strong> true factor scores are compared.For <strong>the</strong> skewness criterion, under a factor model <strong>and</strong> a simulationcondition, for any <strong>data</strong> set <strong>the</strong> factor scores on a factor werecomputed <strong>and</strong> <strong>the</strong>ir empirical skewness was <strong>the</strong> value for this <strong>data</strong>set that was used <strong>and</strong> plotted. For <strong>the</strong> discrepancy criterion, undera factor model <strong>and</strong> a simulation condition, for any <strong>data</strong> set i = 1,. . ., 100 a discrepancy measure D i was calculated,D i =∑ px=1∑ ky=1| ˆl i;xy −l xy |,kpwhere ˆl i;xy <strong>and</strong> l xy represent <strong>the</strong> entries of <strong>the</strong> estimated (varimaxrotated, for <strong>data</strong> set i = 1, . . ., 100) <strong>and</strong> true loading matrices,respectively. It gives <strong>the</strong> averaged sum of <strong>the</strong> absolute differencesbetween <strong>the</strong> estimated <strong>and</strong> true factor loadings. We also report <strong>the</strong>average <strong>and</strong> variance (or st<strong>and</strong>ard deviation) of <strong>the</strong>se discrepancymeasures, over all simulated <strong>data</strong> sets,D = 1 ∑100D i <strong>and</strong> s 2 1 ∑100=(D i − D) 2 .100100 − 1i=1i=1In addition to calculating estimated factor score skewness values,we also tested for univariate normality of <strong>the</strong> estimated factorscores. We used <strong>the</strong> Shapiro-Wilk test statistic W (Shapiro <strong>and</strong>Wilk, 1965). In comparison to o<strong>the</strong>r univariate normality tests,<strong>the</strong> Shapiro-Wilk test seems to have relatively high power (Seier,2002). In our study, under a factor model <strong>and</strong> a simulation condition,for any <strong>data</strong> set <strong>the</strong> Shapiro-Wilk test statistic’s p-value wascalculated for <strong>the</strong> estimated factor scores on a factor <strong>and</strong> <strong>the</strong> distributionof <strong>the</strong> p-values obtained from 100 simulated <strong>data</strong> setswas plotted.5. RESULTSWe present <strong>the</strong> results of our simulation study.5.1. NUMBER OF EXTRACTED FACTORSFigure 2 shows <strong>the</strong> relative frequencies of <strong>the</strong> numbers of extractedfactors for sample size n = 200 <strong>and</strong> k = 4 as true number of factors.If <strong>the</strong> Kaiser-Guttman criterion is used, <strong>the</strong> number of extractedfactors is overestimated (for PCA) or tends to be underestimated(for EFA <strong>and</strong> PAA). With <strong>the</strong> scree test, four dimensions whereextracted in <strong>the</strong> majority of cases, but variation of <strong>the</strong> numbersof extracted factors over <strong>the</strong> different <strong>data</strong> sets is high. High variationin this case can be explained by <strong>the</strong> ambiguous <strong>and</strong> hencedifficult to interpret eigenvalue graphics that one needs to visuallyinspect for <strong>the</strong> scree test. Applying <strong>the</strong> parallel analysis method,variation of <strong>the</strong> numbers of extracted factors can be reduced <strong>and</strong><strong>the</strong> true number of factors is estimated very well (e.g., for PCA).There does not seem to be a relationship between <strong>the</strong> number ofextracted factors <strong>and</strong> <strong>the</strong> underlying distribution (normal, slightlyskewed, strongly skewed) of <strong>the</strong> latent ability values.When sample size is increased to n = 600, variation of <strong>the</strong> estimatednumbers of factors decreases substantially under manyconditions (see Figure 3). Compared to small sample sizes, <strong>the</strong>scree test <strong>and</strong> <strong>the</strong> parallel analysis method perform very well. TheKaiser-Guttman criterion still leads to a biased estimation of <strong>the</strong>true number of factors. Once again, <strong>the</strong>re seems to be no relationshipbetween <strong>the</strong> distribution of <strong>the</strong> latent ability values <strong>and</strong> <strong>the</strong>number of extracted factors.Figure 4, for a sample size of n = 200, shows <strong>the</strong> case when<strong>the</strong>re are k = 8 factors underlying <strong>the</strong> <strong>data</strong>. The Kaiser-Guttmancriterion again leads to overestimation or underestimation of <strong>the</strong>true number of factors. The extraction results for <strong>the</strong> scree testhave very high variation, <strong>and</strong> estimation of <strong>the</strong> true number offactors is least biased when <strong>the</strong> parallel analysis method is used.Increasing sample size from n = 200 to 600 results in a significantreduction of variation (Figure 5). However, <strong>the</strong> true numberof factors can be estimated without bias only when <strong>the</strong> parallelanalysis method is used as extraction criterion. A possible relationshipbetween <strong>the</strong> distribution of <strong>the</strong> latent ability values <strong>and</strong><strong>the</strong> number of extracted factors once again does not seem to beapparent.To sum up, we suppose that <strong>the</strong>“number of factors extracted” isrelatively robust against <strong>the</strong> extent <strong>the</strong> latent ability values may beskewed. Ano<strong>the</strong>r observation is that <strong>the</strong> parallel analysis methodseems to outperform <strong>the</strong> scree test <strong>and</strong> <strong>the</strong> Kaiser-Guttman criterionwhen it comes to detecting <strong>the</strong> number of underlyingfactors.5.2. SKEWNESS OF THE ESTIMATED LATENT ABILITY DISTRIBUTIONFigure 6A shows <strong>the</strong> distributions of <strong>the</strong> estimated factor scoreskewness values, for n = 200, k = 4, <strong>and</strong> µ 3m = 0. The majority of<strong>the</strong> skewness values lies in close vicinity of 0. In o<strong>the</strong>r words, for atrue normal latent ability distribution with skewness µ 3 = 0, under<strong>the</strong> classical factor models <strong>the</strong> estimated latent ability scores mostlikely seem to have skewness values of approximately 0. An impactof <strong>the</strong> factor model used for <strong>the</strong> analysis of <strong>the</strong> <strong>data</strong> on <strong>the</strong> skewnessof <strong>the</strong> estimated latent ability values cannot be seen under this simulationcondition. However, <strong>the</strong> st<strong>and</strong>ard deviations of <strong>the</strong> skewnessvalues clearly decrease from <strong>the</strong> first to <strong>the</strong> fourth factor. Ino<strong>the</strong>r words,<strong>the</strong> true skewness of <strong>the</strong> latent ability distribution maybe more precisely estimated for <strong>the</strong> fourth factor than for <strong>the</strong> first.When true latent ability values are slightly negative skewed,µ 3 = −0.20, in our simulation study this skewness may onlybe properly estimated for <strong>the</strong> first <strong>and</strong> second extracted factors(Figure 6B). The estimated latent ability values of <strong>the</strong> third <strong>and</strong>fourth extracted factors more give skewness values of approximately0. The true value of skewness for <strong>the</strong>se factors hence maylikely to be overestimated.If true latent ability values are strongly negative skewed,µ 3 = −2, unbiased estimation of true skewness may not be possible(Figure 6C). Even in <strong>the</strong> case of <strong>the</strong> first <strong>and</strong> second factors,<strong>the</strong> estimation is biased now. True skewness of <strong>the</strong> latent abilitydistribution may be overestimated regardless of <strong>the</strong> used factormodel or factor position.To sum up, under <strong>the</strong> classical factor models, <strong>the</strong> concept of“skewness of <strong>the</strong> estimated latent ability distribution” seems to besensitive with respect to <strong>the</strong> extent <strong>the</strong> latent ability values may beskewed. It seems that, <strong>the</strong> more <strong>the</strong> true latent ability values areskewed, <strong>the</strong> greater is overestimation of true skewness. In o<strong>the</strong>rwww.frontiersin.org March 2013 | Volume 4 | Article 109 | 130

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!