Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers
Flora et al. | Factor analysis assumptions

product-moment correlations were attenuated, which led to negatively biased factor loading estimates (especially with fewer response categories), whereas polychoric correlations remained accurate and produced accurate factor loadings. When the observed variable set contained a mix of positively and negatively skewed items (Cases 3 and 4), product-moment correlations were strongly affected by the direction of skewness (especially with fewer response categories), which can lead to dramatically misleading factor analysis results. Unfortunately, strongly skewed items can also be problematic for factor analyses of polychoric R, in part because they produce sparse observed frequency tables, which leads to higher rates of improper solutions and inaccurate results (e.g., Flora and Curran, 2004; Forero et al., 2009). Yet this difficulty is alleviated with larger sample size and more item-response categories; if a proper solution is obtained, then the results of a factor analysis of polychoric R are more trustworthy than those obtained with product-moment R.

GENERAL DISCUSSION AND CONCLUSION

Factor analysis is traditionally and most commonly an analysis of the correlations (or covariances) among a set of observed variables. A unifying theme of this paper is that if the correlations being analyzed are misrepresentative or inappropriate summaries of the relationships among the variables, then the factor analysis is compromised.
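The attenuation of product-moment correlations under coarse categorization is easy to reproduce by simulation. The sketch below (plain Python; the correlation, sample size, and cut point are arbitrary illustrative choices) dichotomizes a correlated bivariate normal pair and shows the Pearson correlation of the resulting two-category items falling well below the correlation of the underlying continuous variables — the quantity a polychoric correlation would instead estimate.

```python
import math
import random

def pearson(xs, ys):
    """Plain product-moment (Pearson) correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
rho = 0.6      # correlation of the underlying continuous variables
n = 20000

# Bivariate normal pairs with correlation rho.
x = [random.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho**2) * random.gauss(0, 1) for xi in x]

# Dichotomize both variables, as with two response categories;
# this discards the metric information in the latent responses.
x2 = [1 if xi > 0 else 0 for xi in x]
y2 = [1 if yi > 0 else 0 for yi in y]

r_cont = pearson(x, y)    # close to the true 0.6
r_disc = pearson(x2, y2)  # attenuated: roughly (2/pi)*arcsin(0.6), near 0.41
print(r_cont, r_disc)
```

With more response categories the attenuation shrinks but does not vanish, which matches the pattern described above for five-category items.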
Thus, the process of data screening and assumption testing for factor analysis should begin with a focus on the adequacy of the correlation matrix among the observed variables. In particular, analysis of product-moment correlations or covariances implies that the bivariate association between two observed variables can be adequately modeled with a straight line, which in turn leads to the expression of the common factor model as a linear regression model. This crucial linearity assumption takes precedence over any concerns about having normally distributed variables, although normality is important for certain model fit statistics and estimating parameter SEs. Other concerns about the appropriateness of the correlation matrix involve collinearity and potential influence of unusual cases. Once one accepts the sufficiency of a given correlation matrix as a representation of the observed data, the actual formal assumptions of the common factor model are relatively mild. These assumptions are that the unique factors (i.e., the residuals, ε, in Eq. 1) are uncorrelated with each other and are uncorrelated with the common factors (i.e., η in Eq. 1).
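Under these mild assumptions, a key implication of the model can be verified numerically: with uncorrelated unique factors, the model-implied correlation between two indicators of one common factor is simply the product of their loadings. A minimal one-factor simulation (plain Python; the loadings and sample size are arbitrary illustrative values):

```python
import math
import random

def pearson(xs, ys):
    """Plain product-moment (Pearson) correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(3)
n = 50000
lam1, lam2 = 0.8, 0.6   # factor loadings (illustrative values)

# One-factor model in the form of Eq. 1: each observed variable is a
# linear regression on the common factor eta plus a unique factor
# (residual) that is independent of eta and of the other residual.
eta = [random.gauss(0, 1) for _ in range(n)]
x1 = [lam1 * f + math.sqrt(1 - lam1**2) * random.gauss(0, 1) for f in eta]
x2 = [lam2 * f + math.sqrt(1 - lam2**2) * random.gauss(0, 1) for f in eta]

r12 = pearson(x1, x2)   # should approximate lam1 * lam2 = 0.48
print(r12)
```

If the residuals were instead correlated with each other or with η, the observed correlation would depart from λ1λ2, which is exactly the kind of misfit described next.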
Substantial violation of these assumptions typically manifests as poor model-data fit, and is otherwise difficult to assess with a priori data-screening procedures based on descriptive statistics or graphs.

After reviewing the common factor model, we gave separate presentations of issues concerning factor analysis of continuous observed variables and issues concerning factor analysis of categorical, item-level observed variables. For the former, we showed how concepts from regression diagnostics apply to factor analysis, given that the common factor model is itself a multivariate multiple regression model with unobserved explanatory variables. An important point was that cases that appear as outliers in univariate or bivariate plots are not necessarily influential, and conversely that influential cases may not appear as outliers in univariate or bivariate plots (though they often do). If one can determine that unusual observations are not a result of researcher or participant error, then we recommend the use of robust estimation procedures instead of deleting the unusual observation. Likewise, we also recommend the use of robust procedures for calculating model fit statistics and SEs when observed, continuous variables are non-normal.

Next, a crucial message was that the linearity assumption is necessarily violated when the common factor model is fitted to product-moment correlations among categorical, ordinally scaled items, including the ubiquitous Likert-type items. At worst (e.g., with a mix of positively and negatively skewed dichotomous items), this assumption violation has severe consequences for factor analysis results.
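The distinction between outlyingness and influence can be illustrated with a toy regression (a deliberately simplified sketch — the common factor model involves latent predictors, but the case-deletion logic is the same). One case is extreme on the predictor yet consistent with the trend; another sits inside the predictor's range but far from the trend line; all values are invented for illustration:

```python
def slope(points):
    """OLS slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

# Ten well-behaved cases lying near the line y = 2x.
errs = [0.1, -0.2, 0.0, 0.2, -0.1, 0.1, 0.0, -0.1, 0.2, -0.2]
base = [(x, 2 * x + e) for x, e in zip(range(1, 11), errs)]

outlier_on_trend = (30, 60)  # univariate outlier on x, but consistent
inlier_off_trend = (9, 30)   # inside the x-range, far from the trend

b_base = slope(base)
shift_outlier = abs(slope(base + [outlier_on_trend]) - b_base)
shift_inlier = abs(slope(base + [inlier_off_trend]) - b_base)

# The conspicuous-but-consistent case barely moves the slope; the
# less conspicuous discrepant case moves it substantially.
print(shift_outlier, shift_inlier)
```

This is why the paper recommends case-deletion diagnostics (or robust estimation) over simply screening univariate distributions for extreme values.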
At best (e.g., with symmetric items with five response categories), this assumption violation still produces biased factor-loading estimates. Alternatively, factor analysis of polychoric R among item-level variables explicitly specifies a nonlinear link between the common factors and the observed variables, and as such is theoretically well-suited to the analysis of item-level variables. However, this method is also vulnerable to certain data characteristics, particularly sparseness in the bivariate frequency tables for item pairs, which occurs when strongly skewed items are analyzed with a relatively small sample. Yet, factor analysis of polychoric R among items generally produces superior results compared to those obtained with product-moment R, especially if there are five or fewer item-response categories.

We have not yet directly addressed the role of sample size. In short, no simple rule-of-thumb regarding sample size is reasonably generalizable across factor analysis applications. Instead, adequate sample size depends on many features of the research, such as the major substantive goals of the analysis, the number of observed variables per factor, closeness to simple structure, and the strength of the factor loadings (MacCallum et al., 1999, 2001). Beyond these considerations, having a larger sample size can guard against some of the harmful consequences of unusual cases and assumption violation. For example, unusual cases are less likely to exert strong influence on model estimates as overall sample size increases.
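The way a larger sample protects against sparse bivariate tables for strongly skewed items can be sketched directly (illustrative Python; the threshold, correlation, and sample sizes are arbitrary choices):

```python
import math
import random

def crosstab(a, b):
    """2x2 frequency table for two dichotomous items."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for pair in zip(a, b):
        table[pair] += 1
    return table

def simulate(n, threshold, rho=0.5, seed=7):
    """Two correlated latent normals, each dichotomized at `threshold`;
    a high threshold yields strongly (positively) skewed items."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rho * z + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for z in z1]
    item1 = [1 if z > threshold else 0 for z in z1]
    item2 = [1 if z > threshold else 0 for z in z2]
    return crosstab(item1, item2)

small = simulate(n=60, threshold=1.8)    # rare endorsement, small sample
large = simulate(n=5000, threshold=1.8)

# With a small sample the (1, 1) cell is typically empty or near-empty,
# so the polychoric correlation for this item pair is poorly estimated;
# a larger sample fills in the table.
print(small, large)
```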
Conversely, removing unusual cases decreases the sample size, which reduces the precision of parameter estimation and statistical power for hypothesis tests about model fit or parameter estimates. Yet, having a larger sample size does not protect against the negative consequences of treating categorical item-level variables as continuous by factor analyzing product-moment R. But we did illustrate that larger sample size produces better results for factor analysis of polychoric R among strongly skewed items, in part because larger sample size reduces the occurrence of sparseness.

In closing, we emphasize that factor analysis, whether EFA or CFA, is a method for modeling relationships among observed variables. It is important for researchers to recognize that it is impossible for a statistical model to be perfect; assumptions will always be violated to some extent in that no model can ever exactly capture the intricacies of nature. Instead, researchers should strive to find models that have an approximate fit to data such that the inevitable assumption violations are trivial, but the models can still provide useful results that help answer important substantive research questions (see MacCallum, 2003, and Rodgers, 2010, for discussions of this principle). We recommend extensive use of sensitivity analyses and cross-validation to aid in this endeavor. For example, researchers should compare

www.frontiersin.org | March 2012 | Volume 3 | Article 55 | 119
