Flora et al. Factor analysis assumptions

investigating variants of the same general model, which differ according to the number of constraints placed on the model, where the constraints are determined by the strength of theoretical expectations (see MacCallum, 2009). Indeed, it is possible to conduct EFA in a confirmatory fashion (e.g., by using "target rotation") or to conduct CFA in an exploratory fashion (e.g., by comparing a series of models that differ in the number or nature of the factors)[3]. Given that EFA and CFA are based on the same common factor model, the principles and methods we discuss in this paper largely generalize to both procedures. The example analyses we present below follow traditional EFA procedures; where necessary, we comment on how certain issues may differ for CFA.

In practice, a sample correlation matrix R (or sample covariance matrix S) is analyzed to obtain estimates of the unconstrained parameters given some specified value m of the number of common factors[4]. The parameter estimates can be plugged into Eq. 3 to derive a model-implied population correlation matrix, P̂. The goal of model estimation is thus to find the parameter estimates that optimize the match of the model-implied correlation matrix to the observed sample correlation matrix. A historically popular method for parameter estimation in EFA is the "principal factors method with prior communality estimates," or "principal axis factor analysis," which obtains factor loading estimates from the eigenstructure of the matrix R − Θ̂ (see MacCallum, 2009).
But given modern computing capabilities, we agree with MacCallum (2009) that this method should be considered obsolete; instead, factor analysis models should be estimated using an iterative algorithm to minimize a model fitting function, such as the unweighted least-squares (ULS) or maximum likelihood (ML) functions. Although principal axis estimation continues to be used in modern applications of EFA, iterative estimation, usually ML, is almost always used with CFA. In short, ULS is preferable to principal axis because it better accounts for the observed correlations in R (specifically, it produces a smaller squared difference between a given observed correlation and the corresponding model-implied correlation), whereas ML obtains parameter estimates that give the most likely account of the observed data (MacCallum, 2009; see also Briggs and MacCallum, 2003). Below, we discuss how the assumption of normally distributed observed variables comes into play for ULS and ML.

As shown above, given that the parameters (namely, the factor loadings in Λ and the interfactor correlations in Ψ) are estimated directly from the observed correlations, R, using an estimator such as ML or ULS, factor analysis as traditionally implemented is essentially an analysis of the correlations among a set of observed variables. In this sense, the correlations are the data.
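As a concrete, if simplified, illustration of what iterative fitting means here, the sketch below minimizes the ULS discrepancy — half the sum of squared differences between R and the model-implied matrix ΛΛ′ + Θ for an unrotated orthogonal solution (so Ψ = I) — with a general-purpose optimizer. This is a toy version of what EFA software does internally, not the article's own procedure; the function name `fit_efa_uls` and all numeric values are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

def fit_efa_uls(R, m, seed=0):
    """Fit an m-factor EFA model to correlation matrix R by minimizing the
    ULS discrepancy F = 0.5 * ||R - (L L' + diag(psi))||_F^2 over the
    p x m loading matrix L and unique variances psi (unrotated, Psi = I)."""
    p = R.shape[0]

    def unpack(theta):
        L = theta[:p * m].reshape(p, m)
        raw = theta[p * m:]
        return L, raw ** 2, raw          # squaring keeps uniquenesses >= 0

    def objective(theta):
        L, psi, raw = unpack(theta)
        resid = R - (L @ L.T + np.diag(psi))
        f = 0.5 * np.sum(resid ** 2)
        grad_L = -2.0 * resid @ L                 # dF/dL (resid is symmetric)
        grad_raw = -2.0 * np.diag(resid) * raw    # chain rule through psi = raw^2
        return f, np.concatenate([grad_L.ravel(), grad_raw])

    rng = np.random.default_rng(seed)
    x0 = np.concatenate([rng.uniform(0.2, 0.6, p * m),  # asymmetric start values
                         np.full(p, np.sqrt(0.5))])     # start uniquenesses at 0.5
    res = minimize(objective, x0, jac=True, method="L-BFGS-B")
    L_hat, psi_hat, _ = unpack(res.x)
    return L_hat, psi_hat, res.fun
```

Because an unrotated solution is identified only up to rotation, the estimated loadings will generally match a generating loading matrix only after rotation (e.g., quartimin); the fitting-function value, however, approaches zero whenever an m-factor model reproduces R exactly.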
Indeed, any factor analysis software program can proceed if a sample correlation matrix is the only data given; it does not need the complete raw, person-by-variable (N × p) data set from which the correlations were calculated[5]. Conversely, when the software is given the complete (N × p) data set, it will first calculate the correlations among the p variables to be analyzed and then fit the model to those correlations using the specified estimation method.

[3] Asparouhov and Muthén (2009) further show how an unrestricted EFA solution can be used in place of a restricted CFA-type measurement model within a larger SEM.
[4] We will not extensively discuss EFA methods for determining the number of factors except to note where assumption violation or inaccurate correlations may affect decisions about the optimal number of common factors. See MacCallum (2009) for a brief review of methods for determining the number of common factors in EFA.

Thus, it is imperative that the "data" for factor analysis, the correlations, be appropriate and adequate summaries of the relationships among the observed variables, even though the common factor model makes no explicit assumptions about the correlations themselves. If the sample correlations misrepresent the complete raw data, then the parameter estimates (factor loadings and interfactor correlations) will be inaccurate, as will model fit statistics and
estimated SEs for the parameter estimates. Of course, more explicit assumption violation can also cause these problems. In these situations, information from the complete data set beyond just the correlations becomes necessary to obtain "robust" results.

Before we discuss these issues further, we first present a data example to illustrate the common factor model and to provide a context for demonstrating the main concepts of this article involving data screening and assumption testing for factor analysis. The purpose of this data example is not to provide a comprehensive simulation study for evaluating the effectiveness of different factor analytic methods; such studies have already been conducted in the literature cited throughout this paper. Rather, we use analyses of this data set to illustrate the statistical concepts discussed below so that they may be more concrete for the applied researcher.

DATA EXAMPLE

Our example is based on unpublished data reported in Harman (1960); these data and an accompanying factor analysis are described in the user's guide for the free software package CEFA (Browne et al., 2010). The variables are scores from nine cognitive ability tests. Although the data come from a sample of N = 696 individuals, for our purposes we consider the correlation matrix among the nine variables to be a population correlation matrix (see Table 1) and thus an example of P in Eq. 3.
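Treating a correlation matrix as a population P makes it straightforward to generate simulated data consistent with that population, which is the device used for the sample-based analyses in this article. A minimal sketch via the Cholesky factor, using an arbitrary 3 × 3 matrix as a stand-in for Table 1 (the matrix and sample size here are illustrative assumptions, not the article's values):

```python
import numpy as np

# Hypothetical 3 x 3 population correlation matrix P (a stand-in, not Table 1)
P = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

rng = np.random.default_rng(42)
N = 100_000                          # large N so sample correlations sit near P
Z = rng.standard_normal((N, 3))      # independent standard normal scores
X = Z @ np.linalg.cholesky(P).T      # rows now have population correlation P
R_sample = np.corrcoef(X, rowvar=False)
```

`numpy.random.Generator.multivariate_normal` accomplishes the same in one call; the Cholesky route just makes the mechanism explicit. With N = 100 rather than 100,000, sampling error on the order of 0.1 per correlation is to be expected, which is why sample-based estimates can only approximate population parameters.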
An obliquely rotated (quartimin rotation) factor pattern for a three-factor model is considered the population factor loading matrix (see Table 2) and thus an example of Λ in Eq. 3. Interpretation of the factor loadings indicates that the first factor (η1) has relatively strong effects on the variables Word Meaning, Sentence Completion, and Odd Words; thus, η1 is considered a "verbal ability" factor. The second factor (η2) has relatively strong effects on Mixed Arithmetic, Remainders, and Missing Numbers; thus, η2 is "math ability." Finally, the third factor (η3) is "spatial ability" given its strong influences on Gloves, Boots, and Hatchets. The interfactor correlation between η1 and η2 is ψ12 = 0.59, the correlation between η1 and η3 is ψ13 = 0.43, and η2 and η3 are also moderately correlated with ψ23 = 0.48.

These population parameters give a standard against which to judge sample-based results for the same factor model that we present throughout this paper. To begin, we created a random sample of N = 100 with nine variables from a multivariate standard normal population distribution with a correlation matrix matching that in Table 1. We then estimated a three-factor EFA

[5] Certain models, such as multiple-group measurement models, may also include a mean structure, in which case the means of the observed variables also serve as data for the complete model estimation.

www.frontiersin.org | March 2012 | Volume 3 | Article 55 | 103
model using ULS and applied a quartimin rotation. The estimated rotated factor loading matrix, Λ̂, is in Table 3. Not surprisingly, these factor loading estimates are similar, but not identical, to the population factor loadings.

Table 1 | Population correlation matrix.

Variable    1     2     3     4     5     6     7     8     9
WrdMean   1
SntComp   0.75  1
OddWrds   0.78  0.72  1
MxdArit   0.44  0.52  0.47  1
Remndrs   0.45  0.53  0.48  0.82  1
MissNum   0.51  0.58  0.54  0.82  0.74  1
Gloves    0.21  0.23  0.28  0.33  0.37  0.35  1
Boots     0.30  0.32  0.37  0.33  0.36  0.38  0.45  1
Hatchts   0.31  0.30  0.37  0.31  0.36  0.38  0.52  0.67  1

WrdMean, word meaning; SntComp, sentence completion; OddWrds, odd words; MxdArit, mixed arithmetic; Remndrs, remainders; MissNum, missing numbers; Hatchts, hatchets.

Table 2 | Population factor loading matrix.

Variable    η1      η2      η3
WrdMean    0.94   −0.05   −0.03
SntComp    0.77    0.14   −0.03
OddWrds    0.83    0.00    0.08
MxdArit   −0.05    1.01   −0.05
Remndrs    0.04    0.80    0.06
MissNum    0.14    0.75    0.06
Gloves    −0.06    0.13    0.56
Boots      0.05   −0.01    0.74
Hatchts    0.03   −0.09    0.91

WrdMean, word meaning; SntComp, sentence completion; OddWrds, odd words; MxdArit, mixed arithmetic; Remndrs, remainders; MissNum, missing numbers; Hatchts, hatchets. Primary loadings for each observed variable are in bold.

REGRESSION DIAGNOSTICS FOR FACTOR ANALYSIS OF CONTINUOUS VARIABLES

Because the common factor model is just a linear regression model (Eq. 1), many of the well-known concepts about regression diagnostics generalize to factor analysis.
Regression diagnostics are a set of methods that can be used to reveal aspects of the data that are problematic for a model that has been fitted to those data (see Belsley et al., 1980; Fox, 1991, 2008, for thorough treatments). Many data characteristics that are problematic for ordinary multiple regression are also problematic for factor analysis; the trick is that in the common factor model, the explanatory variables (the factors) are unobserved variables whose values cannot be determined precisely. In this section, we illustrate how regression diagnostic principles can be applied to factor analysis using the example data presented above.

Table 3 | Sample factor loading matrix (multivariate normal data, no outlying cases).

Variable    η1      η2      η3
WrdMean    0.96   −0.08    0.04
SntComp    0.81    0.15   −0.13
OddWrds    0.75    0.05    0.12
MxdArit   −0.06    1.01   −0.03
Remndrs    0.09    0.75    0.09
MissNum    0.09    0.80    0.06
Gloves    −0.03    0.20    0.56
Boots      0.07    0.00    0.68
Hatchts   −0.01   −0.01    0.85

N = 100. WrdMean, word meaning; SntComp, sentence completion; OddWrds, odd words; MxdArit, mixed arithmetic; Remndrs, remainders; MissNum, missing numbers; Hatchts, hatchets. Primary loadings for each observed variable are in bold.

But taking a step back, given that factor analysis is basically an analysis of correlations, all of the principles about correlation that are typically covered in introductory statistics courses are also relevant to factor analysis. Foremost is that a product-moment correlation measures the linear relationship between two variables.
Two variables may be strongly related to each other, but the actual correlation between them might be close to zero if their relationship is poorly approximated by a straight line (for example, a U-shaped relationship). In other situations, there may be a clear curvilinear relationship between two variables, but a researcher decides that a straight line is still a reasonable model for that relationship. Thus, some amount of subjective judgment may be necessary to decide whether a product-moment correlation is an adequate summary of a given bivariate relationship. If not, variables involved in non-linear relationships may be transformed prior to fitting the factor model, or alternative correlations, such as Spearman rank-order correlations, may be factor analyzed (see Gorsuch, 1983, pp. 297–309, for a discussion of these strategies, but see below for methods for item-level categorical variables).

Visual inspection of simple scatterplots (or a scatterplot matrix containing many bivariate scatterplots) is an effective method for assessing linearity, although formal tests of linearity are possible (e.g., the RESET test of Ramsey, 1969). If there is a large number of variables, then it may be overly tedious to inspect every bivariate relationship. In this situation, one might focus on scatterplots involving variables with odd univariate distributions (e.g., strongly skewed or bimodal) or randomly select several scatterplots to scrutinize closely. Note that it is entirely possible for two variables to be linearly related even when one or both of them is non-normal; conversely, if two variables are normally distributed, their bivariate relationship is not necessarily linear.
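Both points — that a strong non-linear relationship can yield a near-zero product-moment correlation, and that linearity rather than normality is what the correlation actually requires — can be verified in a few lines. The variables below are an illustrative sketch, not the article's data:

```python
import numpy as np

# A perfect but U-shaped dependence: Pearson's r is ~0 even though y = f(x) exactly
x = np.linspace(-1.0, 1.0, 201)   # grid symmetric around zero
y = x ** 2
r_xy = np.corrcoef(x, y)[0, 1]    # cov(x, x^2) = E[x^3] = 0 on a symmetric grid

# A strongly skewed variable in an exact linear relationship: Pearson's r is 1
u = np.exp(x)                     # right-skewed (lognormal-like) variable
v = 3.0 * u - 1.0                 # exactly linear in u despite non-normality
r_uv = np.corrcoef(u, v)[0, 1]
```

Here `r_xy` is essentially zero while `r_uv` is essentially one, mirroring the two cautions in the text.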
Returning to our data example, the scatterplot matrix in Figure 1 shows that none of the bivariate relationships among these nine variables has any clear departure from linearity.

Scatterplots can also be effective for identifying unusual cases, although relying on scatterplots alone for this purpose is not

Frontiers in Psychology | Quantitative Psychology and Measurement | March 2012 | Volume 3 | Article 55 | 104