Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

More documents

Recommendations

Info

REVIEW ARTICLEpublished: 01 March 2012doi: 10.3389/fpsyg.2012.00055Old and new ideas for data screening and assumptiontesting for exploratory and confirmatory factor analysisDavid B. Flora*, Cathy LaBrish and R. Philip ChalmersDepartment of Psychology, York University, Toronto, ON, CanadaEdited by:Jason W. Osborne, Old DominionUniversity, USAReviewed by:Stanislav Kolenikov, University ofMissouri, USACam L. Huynh, University ofManitoba, Canada*Correspondence:David B. Flora, Department ofPsychology, York University, 101 BSB,4700 Keele Street, Toronto, ON M3J1P3, Canada.e-mail: dflora@yorku.caWe provide a basic review of the data screening and assumption testing issues relevantto exploratory and confirmatory factor analysis along with practical advice for conductinganalyses that are sensitive to these concerns. Historically, factor analysis was developedfor explaining the relationships among many continuous test scores, which led to theexpression of the common factor model as a multivariate linear regression model withobserved, continuous variables serving as dependent variables, and unobserved factorsas the independent, explanatory variables. Thus, we begin our paper with a review of theassumptions for the common factor model and data screening issues as they pertain tothe factor analysis of continuous observed variables. In particular, we describe how principlesfrom regression diagnostics also apply to factor analysis. Next, because modernapplications of factor analysis frequently involve the analysis of the individual items froma single test or questionnaire, an important focus of this paper is the factor analysis ofitems. Although the traditional linear factor model is well-suited to the analysis of continuouslydistributed variables, commonly used item types, including Likert-type items,almost always produce dichotomous or ordered categorical variables. We describe howrelationships among such items are often not well described by product-moment correlations,which has clear ramifications for the traditional linear factor analysis. An alternative,non-linear factor analysis using polychoric correlations has become more readily available toapplied researchers and thus more popular. Consequently, we also review the assumptionsand data-screening issues involved in this method.Throughout the paper, we demonstratethese procedures using an historic data set of nine cognitive ability variables.Keywords: exploratory factor analysis, confirmatory factor analysis, item factor analysis, structural equationmodeling, regression diagnostics, data screening, assumption testingLike any statistical modeling procedure, factor analysis carries a setof assumptions and the accuracy of results is vulnerable not only toviolation of these assumptions but also to disproportionate influencefrom unusual observations. Nonetheless, the importance ofdata screening and assumption testing is often ignored or misconstruedin empirical research articles utilizing factor analysis.Perhaps some researchers have an overly indiscriminate impressionthat, as a large sample procedure, factor analysis is “robust” toassumption violation and the influence of unusual observations.Or, researchers may simply be unaware of these issues. Thus, withthe applied psychology researcher in mind, our primary goal forthis paper is to provide a general review of the data screening andassumption testing issues relevant to factor analysis along withpractical advice for conducting analyses that are sensitive to theseconcerns. Although presentation of some matrix-based formulasis necessary, we aim to keep the paper relatively non-technical anddidactic. To make the statistical concepts concrete, we provide dataanalytic demonstrations using different factor analyses based onan historic data set of nine cognitive ability variables.First, focusing on factor analyses of continuous observed variables,wereview the common factor model and its assumptions andshow how principles from regression diagnostics can be applied todetermine the presence of influential observations. Next, we moveto the analysis of categorical observed variables, because treatingordered, categorical variables as continuous variables is anextremely common form of assumption violation involving factoranalysis in the substantive research literature. Thus, a key aspectof the paper focuses on how the linear common factor modelis not well-suited to the analysis of categorical, ordinally scaleditem-level variables, such as Likert-type items. We then describean alternative approach to item factor analysis based on polychoriccorrelations along with its assumptions and limitations. We beginwith a review the linear common factor model which forms thebasis for both exploratory factor analysis (EFA) and confirmatoryfactor analysis (CFA).THE COMMON FACTOR MODELThe major goal of both EFA and CFA is to model the relationshipsamong a potentially large number of observed variablesusing a smaller number of unobserved, or latent, variables. Thelatent variables are the factors. In EFA, a researcher does not havea strong prior theory about the number of factors or how eachobserved variable relates to the factors. In CFA, the number offactors is hypothesized a priori along with hypothesized relationshipsbetween factors and observed variables. With both EFA andCFA, the factors influence the observed variables to account forwww.frontiersin.org March 2012 | Volume 3 | Article 55 | 101
Flora et al.Factor analysis assumptionstheir variation and covariation; that is, covariation between anytwo observed variables is due to them being influenced by thesame factor. This idea was introduced by Spearman (1904) and,largely due to Thurstone (1947), evolved into the common factormodel, which remains the dominant paradigm for factor analysistoday. Factor analysis is traditionally a method for fitting modelsto the bivariate associations among a set of variables, with EFAmost commonly using Pearson product-moment correlations andCFA most commonly using covariances. Use of product-momentcorrelations or covariances follows from the fact that the commonfactor model specifies a linear relationship between the factors andthe observed variables.Lawley and Maxwell (1963) showed that the common factormodel can be formally expressed as a linear model with observedvariables as dependent variables and factors as explanatory orindependent variables:y j = λ 1 η 1 + λ 2 η 2 + λ k η k + ε jwhere y j is the jth observed variable from a battery of p observedvariables, η k is the kth of m common factors, λ k is the regressioncoefficient, or factor loading, relating each factor to y j , and ε j is theresidual, or unique factor, for y j . (Often there are only one or twofactors, in which case the right hand side of the equation includesonly λ 1 η 1 + ε j or only λ 1 η 1 + λ 2 η 2 + ε j .) It is convenient to workwith the model in matrix form:y = Λη + ε, (1)where y isavectorofthep observed variables, Λ is a p × m matrixof factor loadings, η isavectorofm common factors, and ε isavectorofp unique factors 1 . Thus, each common factor mayinfluence more than one observed variable while each unique factor(i.e., residual) influences only one observed variable. As withthe standard regression model, the residuals are assumed to beindependent of the explanatory variables; that is, all unique factorsare uncorrelated with the common factors. Additionally, theunique factors are usually assumed uncorrelated with each other(although this assumption may be tested and relaxed in CFA).Given Eq. 1, it is straightforward to show that the covariancesamong the observed variables can be written as a function of modelparameters (factor loadings, common factor variances and covariances,and unique factor variances). Thus, in CFA, the parametersare typically estimated from their covariance structure:Σ = ΛΨΛ ′ + Θ, (2)where Σ is the p × p population covariance matrix for the observedvariables, Ψ is the m × m interfactor covariance matrix, and Θ isthe p × p matrix unique factor covariance matrix that often containsonly diagonal elements, i.e., the unique factor variances. Thecovariance structure model shows that the observed covariances1 For traditional factor analysis models, the means of the observed variables arearbitrary and unstructured by the model, which allows omission of an interceptterm in Eq. 1 by assuming the observed variables are mean-deviated, or centered(MacCallum, 2009).are a function of the parameters but not the unobservable scoreson the common or unique factors; hence, it is not necessary toobserve scores on the latent variables to estimate the model parameters.In EFA, the parameters are most commonly estimatedfrom the correlation structureP = Λ ∗ Ψ ∗ Λ ∗′ + Θ ∗ (3)where P is the population correlation matrix and, in that P issimply a re-scaled version of Σ, we can view Λ ∗ , Ψ ∗ , and Θ ∗as re-scaled versions of Λ, Ψ, and Θ, respectively. This tendencyto conduct EFA using correlations is mainly a result of historicaltradition, and it is possible to conduct EFA using covariances orCFA using correlations. For simplicity, we focus on the analysisof correlations from this point forward, noting that the principleswe discuss apply equivalently to the analysis of both correlationand covariance matrices (MacCallum, 2009; but see Cudeck, 1989;Bentler, 2007; Bentler and Savalei, 2010 for discussions of theanalysis of correlations vs. covariances). We also drop the asteriskswhen referring to the parameter matrices in Eq. 3.Jöreskog (1969) showed how the traditional EFA model, oran “unrestricted solution” for the general factor model describedabove, can be constrained to produce the “restricted solution”that is commonly understood as today’s CFA model and is wellintegratedin the structural equation modeling (SEM) literature.Specifically, in the EFA model, the elements of Λ are all freely estimated;that is, each of the m factors has an estimated relationship(i.e., factor loading) with every observed variable; factor rotationis then used to aid interpretation by making some values in Λlarge and others small. But in the CFA model, depending on theresearcher’s hypothesized model, many of the elements of Λ arerestricted,or constrained,to equal zero,often so that each observedvariable is determined by one and only one factor (i.e., so that thereare no “cross-loadings”). Because the common factors are unobservedvariables and thus have an arbitrary scale, it is conventionalto define them as standardized (i.e., with variance equal to one);thus Ψ is the interfactor correlation matrix 2 . This convention is nota testable assumption of the model, but rather imposes necessaryidentification restrictions that allow the model parameters to beestimated (although alternative identification constraints are possible,such as the marker variable approach often used with CFA).In addition to constraining the factor variances, EFA requires adiagonal matrix for Θ, with the unique factor variances along thediagonal.Exploratory factor analysis and CFA therefore share the goalof using the common factor model to represent the relationshipsamong a set of observed variables using a small number of factors.Hence, EFA and CFA should not be viewed as disparate methods,despite that their implementation with conventional softwaremight seem quite different. Instead, they are two approaches to2 In EFA, the model is typically estimated by first setting Ψ to be an identity matrix,which implies that the factors are uncorrelated, or orthogonal, leading to the initialunrotated factor loadings in Λ. Applying an oblique factor rotation obtains a new setof factor loadings along with non-zero interfactor correlations. Although rotationis not a focus of the current paper, we recommend that researchers always use anoblique rotation.Frontiers in Psychology | Quantitative Psychology and Measurement March 2012 | Volume 3 | Article 55 | 102
Page 2 and 3:
FRONTIERS COPYRIGHTSTATEMENT© Copy
Page 4 and 5:
Table of Contents05 Is Data Cleanin
Page 7 and 8:
OsborneAssumptions and data cleanin
Page 9 and 10:
ORIGINAL RESEARCH ARTICLEpublished:
Page 12 and 13:
Hoekstra et al.Why assumptions are
Page 14 and 15:
Page 16 and 17:
Page 18 and 19:
ORIGINAL RESEARCH ARTICLEpublished:
Page 20 and 21:
García-PérezStatistical conclusio
Page 22 and 23:
Page 24 and 25:
Page 26 and 27:
Page 28 and 29:
Page 30 and 31:
Sheng and ShengEffect of non-normal
Page 32 and 33:
Page 34 and 35:
Page 36 and 37:
Page 38 and 39:
Page 40 and 41:
Page 42 and 43:
REVIEW ARTICLEpublished: 12 April 2
Page 44 and 45:
Nimon et al.The assumption of relia
Page 46 and 47:
Page 48 and 49:
Page 50 and 51:
Page 52 and 53: Nimon et al.The assumption of relia
Page 54 and 55: Nimon et al.The assumption of relia
Page 56 and 57: TressoldiPower replication unreliab
Page 58 and 59: TressoldiPower replication unreliab
Page 60 and 61: ORIGINAL RESEARCH ARTICLEpublished:
Page 62 and 63: FinchModern methods for the detecti
Page 72 and 73: MINI REVIEW ARTICLEpublished: 28 Au
Page 74 and 75: NimonStatistical assumptionsand Del
Page 76 and 77: NimonStatistical assumptionsFor exa
Page 78 and 79: Kraha et al.Interpreting multiple r
Page 94 and 95: SmithsonComparing moderation of slo
Page 104 and 105: Flora et al.Factor analysis assumpt
Page 124 and 125: Kasper and ÜnlüAssumptions of fac
Page 144 and 145: Lages and JaworskaHow predictable a
Page 152 and 153:
Cummiskey et al.Testing assumptions
Page 154 and 155:
Page 156 and 157:
Page 158:
show all

Sweating the Small Stuff: Does data cleaning and testing ... - Frontiers

Create successful ePaper yourself

Delete template?

Save as template?