REVIEW ARTICLEpublished: 01 March 2012doi: 10.3389/fpsyg.2012.00055Old <strong>and</strong> new ideas for <strong>data</strong> screening <strong>and</strong> assumption<strong>testing</strong> for exploratory <strong>and</strong> confirmatory factor analysisDavid B. Flora*, Cathy LaBrish <strong>and</strong> R. Philip ChalmersDepartment of Psychology, York University, Toronto, ON, CanadaEdited by:Jason W. Osborne, Old DominionUniversity, USAReviewed by:Stanislav Kolenikov, University ofMissouri, USACam L. Huynh, University ofManitoba, Canada*Correspondence:David B. Flora, Department ofPsychology, York University, 101 BSB,4700 Keele Street, Toronto, ON M3J1P3, Canada.e-mail: dflora@yorku.caWe provide a basic review of <strong>the</strong> <strong>data</strong> screening <strong>and</strong> assumption <strong>testing</strong> issues relevantto exploratory <strong>and</strong> confirmatory factor analysis along with practical advice for conductinganalyses that are sensitive to <strong>the</strong>se concerns. Historically, factor analysis was developedfor explaining <strong>the</strong> relationships among many continuous test scores, which led to <strong>the</strong>expression of <strong>the</strong> common factor model as a multivariate linear regression model withobserved, continuous variables serving as dependent variables, <strong>and</strong> unobserved factorsas <strong>the</strong> independent, explanatory variables. Thus, we begin our paper with a review of <strong>the</strong>assumptions for <strong>the</strong> common factor model <strong>and</strong> <strong>data</strong> screening issues as <strong>the</strong>y pertain to<strong>the</strong> factor analysis of continuous observed variables. In particular, we describe how principlesfrom regression diagnostics also apply to factor analysis. Next, because modernapplications of factor analysis frequently involve <strong>the</strong> analysis of <strong>the</strong> individual items froma single test or questionnaire, an important focus of this paper is <strong>the</strong> factor analysis ofitems. Although <strong>the</strong> traditional linear factor model is well-suited to <strong>the</strong> analysis of continuouslydistributed variables, commonly used item types, including Likert-type items,almost always produce dichotomous or ordered categorical variables. We describe howrelationships among such items are often not well described by product-moment correlations,which has clear ramifications for <strong>the</strong> traditional linear factor analysis. An alternative,non-linear factor analysis using polychoric correlations has become more readily available toapplied researchers <strong>and</strong> thus more popular. Consequently, we also review <strong>the</strong> assumptions<strong>and</strong> <strong>data</strong>-screening issues involved in this method.Throughout <strong>the</strong> paper, we demonstrate<strong>the</strong>se procedures using an historic <strong>data</strong> set of nine cognitive ability variables.Keywords: exploratory factor analysis, confirmatory factor analysis, item factor analysis, structural equationmodeling, regression diagnostics, <strong>data</strong> screening, assumption <strong>testing</strong>Like any statistical modeling procedure, factor analysis carries a setof assumptions <strong>and</strong> <strong>the</strong> accuracy of results is vulnerable not only toviolation of <strong>the</strong>se assumptions but also to disproportionate influencefrom unusual observations. None<strong>the</strong>less, <strong>the</strong> importance of<strong>data</strong> screening <strong>and</strong> assumption <strong>testing</strong> is often ignored or misconstruedin empirical research articles utilizing factor analysis.Perhaps some researchers have an overly indiscriminate impressionthat, as a large sample procedure, factor analysis is “robust” toassumption violation <strong>and</strong> <strong>the</strong> influence of unusual observations.Or, researchers may simply be unaware of <strong>the</strong>se issues. Thus, with<strong>the</strong> applied psychology researcher in mind, our primary goal forthis paper is to provide a general review of <strong>the</strong> <strong>data</strong> screening <strong>and</strong>assumption <strong>testing</strong> issues relevant to factor analysis along withpractical advice for conducting analyses that are sensitive to <strong>the</strong>seconcerns. Although presentation of some matrix-based formulasis necessary, we aim to keep <strong>the</strong> paper relatively non-technical <strong>and</strong>didactic. To make <strong>the</strong> statistical concepts concrete, we provide <strong>data</strong>analytic demonstrations using different factor analyses based onan historic <strong>data</strong> set of nine cognitive ability variables.First, focusing on factor analyses of continuous observed variables,wereview <strong>the</strong> common factor model <strong>and</strong> its assumptions <strong>and</strong>show how principles from regression diagnostics can be applied todetermine <strong>the</strong> presence of influential observations. Next, we moveto <strong>the</strong> analysis of categorical observed variables, because treatingordered, categorical variables as continuous variables is anextremely common form of assumption violation involving factoranalysis in <strong>the</strong> substantive research literature. Thus, a key aspectof <strong>the</strong> paper focuses on how <strong>the</strong> linear common factor modelis not well-suited to <strong>the</strong> analysis of categorical, ordinally scaleditem-level variables, such as Likert-type items. We <strong>the</strong>n describean alternative approach to item factor analysis based on polychoriccorrelations along with its assumptions <strong>and</strong> limitations. We beginwith a review <strong>the</strong> linear common factor model which forms <strong>the</strong>basis for both exploratory factor analysis (EFA) <strong>and</strong> confirmatoryfactor analysis (CFA).THE COMMON FACTOR MODELThe major goal of both EFA <strong>and</strong> CFA is to model <strong>the</strong> relationshipsamong a potentially large number of observed variablesusing a smaller number of unobserved, or latent, variables. Thelatent variables are <strong>the</strong> factors. In EFA, a researcher does not havea strong prior <strong>the</strong>ory about <strong>the</strong> number of factors or how eachobserved variable relates to <strong>the</strong> factors. In CFA, <strong>the</strong> number offactors is hypo<strong>the</strong>sized a priori along with hypo<strong>the</strong>sized relationshipsbetween factors <strong>and</strong> observed variables. With both EFA <strong>and</strong>CFA, <strong>the</strong> factors influence <strong>the</strong> observed variables to account forwww.frontiersin.org March 2012 | Volume 3 | Article 55 | 101
Flora et al.Factor analysis assumptions<strong>the</strong>ir variation <strong>and</strong> covariation; that is, covariation between anytwo observed variables is due to <strong>the</strong>m being influenced by <strong>the</strong>same factor. This idea was introduced by Spearman (1904) <strong>and</strong>,largely due to Thurstone (1947), evolved into <strong>the</strong> common factormodel, which remains <strong>the</strong> dominant paradigm for factor analysistoday. Factor analysis is traditionally a method for fitting modelsto <strong>the</strong> bivariate associations among a set of variables, with EFAmost commonly using Pearson product-moment correlations <strong>and</strong>CFA most commonly using covariances. Use of product-momentcorrelations or covariances follows from <strong>the</strong> fact that <strong>the</strong> commonfactor model specifies a linear relationship between <strong>the</strong> factors <strong>and</strong><strong>the</strong> observed variables.Lawley <strong>and</strong> Maxwell (1963) showed that <strong>the</strong> common factormodel can be formally expressed as a linear model with observedvariables as dependent variables <strong>and</strong> factors as explanatory orindependent variables:y j = λ 1 η 1 + λ 2 η 2 + λ k η k + ε jwhere y j is <strong>the</strong> jth observed variable from a battery of p observedvariables, η k is <strong>the</strong> kth of m common factors, λ k is <strong>the</strong> regressioncoefficient, or factor loading, relating each factor to y j , <strong>and</strong> ε j is <strong>the</strong>residual, or unique factor, for y j . (Often <strong>the</strong>re are only one or twofactors, in which case <strong>the</strong> right h<strong>and</strong> side of <strong>the</strong> equation includesonly λ 1 η 1 + ε j or only λ 1 η 1 + λ 2 η 2 + ε j .) It is convenient to workwith <strong>the</strong> model in matrix form:y = Λη + ε, (1)where y isavectorof<strong>the</strong>p observed variables, Λ is a p × m matrixof factor loadings, η isavectorofm common factors, <strong>and</strong> ε isavectorofp unique factors 1 . Thus, each common factor mayinfluence more than one observed variable while each unique factor(i.e., residual) influences only one observed variable. As with<strong>the</strong> st<strong>and</strong>ard regression model, <strong>the</strong> residuals are assumed to beindependent of <strong>the</strong> explanatory variables; that is, all unique factorsare uncorrelated with <strong>the</strong> common factors. Additionally, <strong>the</strong>unique factors are usually assumed uncorrelated with each o<strong>the</strong>r(although this assumption may be tested <strong>and</strong> relaxed in CFA).Given Eq. 1, it is straightforward to show that <strong>the</strong> covariancesamong <strong>the</strong> observed variables can be written as a function of modelparameters (factor loadings, common factor variances <strong>and</strong> covariances,<strong>and</strong> unique factor variances). Thus, in CFA, <strong>the</strong> parametersare typically estimated from <strong>the</strong>ir covariance structure:Σ = ΛΨΛ ′ + Θ, (2)where Σ is <strong>the</strong> p × p population covariance matrix for <strong>the</strong> observedvariables, Ψ is <strong>the</strong> m × m interfactor covariance matrix, <strong>and</strong> Θ is<strong>the</strong> p × p matrix unique factor covariance matrix that often containsonly diagonal elements, i.e., <strong>the</strong> unique factor variances. Thecovariance structure model shows that <strong>the</strong> observed covariances1 For traditional factor analysis models, <strong>the</strong> means of <strong>the</strong> observed variables arearbitrary <strong>and</strong> unstructured by <strong>the</strong> model, which allows omission of an interceptterm in Eq. 1 by assuming <strong>the</strong> observed variables are mean-deviated, or centered(MacCallum, 2009).are a function of <strong>the</strong> parameters but not <strong>the</strong> unobservable scoreson <strong>the</strong> common or unique factors; hence, it is not necessary toobserve scores on <strong>the</strong> latent variables to estimate <strong>the</strong> model parameters.In EFA, <strong>the</strong> parameters are most commonly estimatedfrom <strong>the</strong> correlation structureP = Λ ∗ Ψ ∗ Λ ∗′ + Θ ∗ (3)where P is <strong>the</strong> population correlation matrix <strong>and</strong>, in that P issimply a re-scaled version of Σ, we can view Λ ∗ , Ψ ∗ , <strong>and</strong> Θ ∗as re-scaled versions of Λ, Ψ, <strong>and</strong> Θ, respectively. This tendencyto conduct EFA using correlations is mainly a result of historicaltradition, <strong>and</strong> it is possible to conduct EFA using covariances orCFA using correlations. For simplicity, we focus on <strong>the</strong> analysisof correlations from this point forward, noting that <strong>the</strong> principleswe discuss apply equivalently to <strong>the</strong> analysis of both correlation<strong>and</strong> covariance matrices (MacCallum, 2009; but see Cudeck, 1989;Bentler, 2007; Bentler <strong>and</strong> Savalei, 2010 for discussions of <strong>the</strong>analysis of correlations vs. covariances). We also drop <strong>the</strong> asteriskswhen referring to <strong>the</strong> parameter matrices in Eq. 3.Jöreskog (1969) showed how <strong>the</strong> traditional EFA model, oran “unrestricted solution” for <strong>the</strong> general factor model describedabove, can be constrained to produce <strong>the</strong> “restricted solution”that is commonly understood as today’s CFA model <strong>and</strong> is wellintegratedin <strong>the</strong> structural equation modeling (SEM) literature.Specifically, in <strong>the</strong> EFA model, <strong>the</strong> elements of Λ are all freely estimated;that is, each of <strong>the</strong> m factors has an estimated relationship(i.e., factor loading) with every observed variable; factor rotationis <strong>the</strong>n used to aid interpretation by making some values in Λlarge <strong>and</strong> o<strong>the</strong>rs small. But in <strong>the</strong> CFA model, depending on <strong>the</strong>researcher’s hypo<strong>the</strong>sized model, many of <strong>the</strong> elements of Λ arerestricted,or constrained,to equal zero,often so that each observedvariable is determined by one <strong>and</strong> only one factor (i.e., so that <strong>the</strong>reare no “cross-loadings”). Because <strong>the</strong> common factors are unobservedvariables <strong>and</strong> thus have an arbitrary scale, it is conventionalto define <strong>the</strong>m as st<strong>and</strong>ardized (i.e., with variance equal to one);thus Ψ is <strong>the</strong> interfactor correlation matrix 2 . This convention is nota testable assumption of <strong>the</strong> model, but ra<strong>the</strong>r imposes necessaryidentification restrictions that allow <strong>the</strong> model parameters to beestimated (although alternative identification constraints are possible,such as <strong>the</strong> marker variable approach often used with CFA).In addition to constraining <strong>the</strong> factor variances, EFA requires adiagonal matrix for Θ, with <strong>the</strong> unique factor variances along <strong>the</strong>diagonal.Exploratory factor analysis <strong>and</strong> CFA <strong>the</strong>refore share <strong>the</strong> goalof using <strong>the</strong> common factor model to represent <strong>the</strong> relationshipsamong a set of observed variables using a small number of factors.Hence, EFA <strong>and</strong> CFA should not be viewed as disparate methods,despite that <strong>the</strong>ir implementation with conventional softwaremight seem quite different. Instead, <strong>the</strong>y are two approaches to2 In EFA, <strong>the</strong> model is typically estimated by first setting Ψ to be an identity matrix,which implies that <strong>the</strong> factors are uncorrelated, or orthogonal, leading to <strong>the</strong> initialunrotated factor loadings in Λ. Applying an oblique factor rotation obtains a new setof factor loadings along with non-zero interfactor correlations. Although rotationis not a focus of <strong>the</strong> current paper, we recommend that researchers always use anoblique rotation.<strong>Frontiers</strong> in Psychology | Quantitative Psychology <strong>and</strong> Measurement March 2012 | Volume 3 | Article 55 | 102