Kasper and Ünlü | Assumptions of factor analytic approaches

A second objective of this paper is to examine the scope of these classical methods for estimating the probability distribution of latent ability values, or properties thereof, postulated in a population under investigation, especially when this distribution is skewed (and not normal). In applied educational contexts, for instance, such usage is not uncommon. A critical evaluation of this use of classical factor analytic methods for estimating distributional properties of ability is therefore important; we present such an evaluation with our simulation study in this paper, in which metric scale (i.e., at least interval scale; not dichotomous) items are used.

The results of the simulation study indicate that a non-normal distribution for the latent variables does not strongly affect the number of extracted factors or the estimation of the loading matrix. However, as shown in this paper, it clearly affects the estimation of the latent factor score distribution and properties thereof (e.g., skewness).

More precisely, the "estimation accuracy" for the factorial structure of these models is shown to be worse when the assumption of interval-scaled data is not met or item statistics are skewed. This corroborates related findings published in other works, which we briefly review later in this paper. More importantly, the empirical distribution of estimated latent ability values is biased compared to the true distribution (i.e., estimates deviate from the true values) when population abilities are skewed.
It seems therefore that classical factor analytic procedures, even when they are performed with metric (instead of non-metric) scale indicator variables, are not appropriate approaches to ability estimation when skewed population ability distributions are to be estimated.

Why should that be of interest? In large scale assessment studies such as the Programme for International Student Assessment (PISA),² latent person-related background (conditioning) variables such as sex or socioeconomic status are also obtained by principal component analysis, and that "covariate" information is part of the PISA procedure that assigns to students their literacy or plausible values (OECD, 2012; see also Section 3.1 in the present paper).

¹ (continued) PCA seeks to create composite scores of observed variables, while EFA assumes latent variables. There is no latent variable in PCA. PCA is not a model; it is simply a re-expression of variables based on the eigenstructure of their correlation matrix. A statistical model, as for EFA, is a simplification of observed data that necessarily does not perfectly reproduce the data, leading to the inclusion of an error term. This point is well-established in the methodological literature (e.g., Velicer and Jackson, 1990; Widaman, 2007). The correlation matrix is usually used in EFA, and the models for EFA and PAA are the same. There are several methods to fit EFA, such as unweighted least squares (ULS), generalized least squares (GLS), or maximum likelihood (ML). PAA is just one of the various methods to fit EFA: a method of estimating the EFA model that does not rely on a discrepancy function such as for ULS, GLS, or ML. This point is made clear, for instance, in MacCallum (2009). In fact, PAA with iterative communality estimation is asymptotically equivalent to ULS estimation. Applied researchers often use PCA in situations where factor analysis more closely matches the purpose of their analysis. This is why we want to include PCA in our present study with latent variables, to examine how well PCA results may approximate a factor analysis model. Such practice is frequently pursued, for example in empirical educational research, as we criticize for the large scale assessment PISA study (e.g., OECD, 2005, 2012). Moreover, the comparison of EFA (based on ML) with PAA in this paper seems justified and interesting, as the (manifest) normality assumption in the observed indicator variables for the ML procedure is violated in the simulation study and in the empirical large scale assessment PIRLS application.

² PISA is an international large scale assessment study funded by the Organisation for Economic Co-operation and Development (OECD), which aims to evaluate education systems worldwide by assessing 15-year-old students' competencies in reading, mathematics, and science. For comprehensive and detailed information, see www.pisa.oecd.org.
Now, if it is assumed that the distribution of latent background information collected through questionnaires at the student, school, or parent level (the true latent variable distribution) is skewed, then based on the simulation study of this paper we can expect that the empirical distribution of estimated background information (the "empirical" distribution of the calculated component scores) is biased compared to the true distribution (and is most likely skewed as well). In other words, estimated background values do deviate from the corresponding true values they ought to approximate, and so the inferred students' plausible values may be biased. Further research is necessary in order to investigate the effects and possible implications of potentially biased estimates of latent background information on students' assigned literacy values and competence levels, based on which the PISA rankings of OECD countries are reported. For an analysis of empirical large scale assessment (Progress in International Reading Literacy Study; PIRLS) data, see Section 6.

The paper is structured as follows. We introduce the considered classical factor analysis models in Section 2 and discuss the relevance of the assumptions associated with these models in Section 3. We describe the simulation study in Section 4 and present its results in Section 5. We give an empirical data analysis example in Section 6.
In Section 7, we conclude with a summary of the main findings and an outlook on possible implications and further research.

2. CLASSICAL FACTOR ANALYSIS METHODS

We consider the method of principal component analysis on the one hand, and the method of exploratory factor and principal axis analysis on the other. At this point recall Footnote 1, where we clarified that, strictly speaking, principal component analysis is not factor analysis and that principal axis analysis is a specific method for estimating the exploratory factor analysis model. Despite this, for the sake of simplicity and for our purposes and analyses, we call these approaches collectively factor analysis/analytic methods or even models. For a more detailed discussion of these methods, see Bartholomew et al. (2011).

Our study shows, among other things, that the purely computational dimensionality reduction method PCA performs surprisingly well compared to the results obtained with the latent variable models EFA and PAA. This is important, because applied researchers often use PCA in situations where factor analysis more closely matches their purpose of analysis. In general, computational procedures such as PCA are easy to use. Moreover, the comparison of EFA (based on ML) with PAA (eigenstructure of the reduced correlation matrix based on communality estimates) in this paper represents an evaluation of different estimation procedures for the classical factor analysis model.
This comparison of the two estimation procedures seems to be justified and interesting, as the (manifest) normality assumption in the observed indicators for the ML procedure is violated, both in the simulation study and in the empirical large scale assessment PIRLS application. At this point, see also Footnote 1.

Frontiers in Psychology | Quantitative Psychology and Measurement | March 2013 | Volume 4 | Article 109 | 123
2.1. PRINCIPAL COMPONENT ANALYSIS

The model of principal component analysis (PCA) is

Z = FL′,

where Z is an n × p matrix of standardized test results of n persons on p items, F is an n × p matrix of p principal components ("factors"), and L is a p × p loading matrix.³ In the estimation (computation) procedure, F and L are determined as F = ZCΛ^(−1/2) and L = CΛ^(1/2), with a p × p matrix Λ = diag{λ_1, . . ., λ_p}, where the λ_l are the eigenvalues of the empirical correlation matrix R = Z′Z, and with a p × p matrix C = (c_1, . . ., c_p) of corresponding eigenvectors c_l.

In principal component analysis we assume that Z ∈ ℝ^(n×p), F ∈ ℝ^(n×p), and L ∈ ℝ^(p×p), and that empirical moments of the manifest variables exist such that, for any manifest variable j = 1, . . ., p, its empirical variance is not zero (s_j² ≠ 0). Moreover, we assume that rk(Z) = rk(R) = p (rk, the matrix rank) and that Z, F, and L are interval-scaled (at the least).

The relevance of the assumption of interval-scaled variables for classical factor analytic approaches is the subject matter of various research works, which we briefly discuss later in this paper.
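The PCA computation above can be sketched numerically. The following is a minimal illustration, not part of the original paper (it assumes NumPy, and the simulated data and variable names are ours): it standardizes a data matrix, eigendecomposes the correlation matrix, forms F = ZCΛ^(−1/2) and L = CΛ^(1/2), and verifies that the representation Z = FL′ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4

# Simulate correlated raw data and standardize column-wise, so that
# R = Z'Z / n is the empirical correlation matrix (the paper writes
# R = Z'Z, absorbing the 1/n factor into the standardization of Z).
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = Z.T @ Z / n

# Eigendecomposition R = C Lambda C', eigenvalues sorted descending.
lam, C = np.linalg.eigh(R)
lam, C = lam[::-1], C[:, ::-1]

# Component scores and loadings: F = Z C Lambda^(-1/2), L = C Lambda^(1/2).
F = Z @ C @ np.diag(lam ** -0.5)
L = C @ np.diag(lam ** 0.5)

# The model Z = F L' holds exactly, and the component scores are
# uncorrelated with unit variance: F'F / n = I.
print(np.allclose(Z, F @ L.T))              # True
print(np.allclose(F.T @ F / n, np.eye(p)))  # True
```

This also makes visible why the full-rank assumption rk(Z) = rk(R) = p matters: all eigenvalues must be strictly positive for Λ^(−1/2) to exist.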
2.2. EXPLORATORY FACTOR ANALYSIS

The model of exploratory factor analysis (EFA) is

y = µ + Lf + e,

where y is a p × 1 vector of responses on p items, µ is the p × 1 vector of means of the p items, L is a p × k matrix of factor loadings, f is a k × 1 vector of ability values (of factor scores) on k latent continua (on factors), and e is a p × 1 vector subsuming remaining item specific effects or measurement errors.

In exploratory factor analysis, we assume that

y ∈ ℝ^(p×1), µ ∈ ℝ^(p×1), L ∈ ℝ^(p×k), f ∈ ℝ^(k×1), and e ∈ ℝ^(p×1),
y, µ, L, f, and e are interval-scaled (at the least),
E(f) = 0,
E(e) = 0,
cov(e, e) = E(ee′) = D = diag{v_1, . . ., v_p},
cov(f, e) = E(fe′) = 0,

where the v_i are the variances of the e_i (i = 1, . . ., p). If the factors are not correlated, we call this the orthogonal factor model; otherwise it is called the oblique factor model. In this paper, we investigate the sensitivity of the classical factor analysis model against violated assumptions only for the orthogonal case (with cov(f, f) = E(ff′) = I = diag{1, . . ., 1}).

Under this orthogonal factor model, the covariance matrix Σ of y can be decomposed as follows:

Σ = E[(y − µ)(y − µ)′] = E[(Lf + e)(Lf + e)′] = LL′ + D.

This decomposition is utilized by the methods of unweighted least squares (ULS), generalized least squares (GLS), or maximum likelihood (ML) for the estimation of L and D.

³ For the sake of simplicity and without ambiguity, in this paper we want to refer to component scores from PCA as "factor scores" or "ability values," albeit components conceptually may not be viewed as latent variables or factors. See also Footnote 1.
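The decomposition Σ = LL′ + D can be checked by simulation. The following sketch is ours, not the paper's (it assumes NumPy, and the loadings and unique variances are arbitrary illustrative values): generate data from the orthogonal model y = µ + Lf + e with cov(f) = I, cov(e) = D, and cov(f, e) = 0, then compare the empirical covariance of y with the implied LL′ + D.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100_000, 5, 2

# Hypothetical loading matrix L (p x k), unique variances v_i, and means mu.
L = rng.normal(size=(p, k))
v = rng.uniform(0.2, 1.0, size=p)
mu = rng.normal(size=p)

# Simulate the orthogonal model: f has cov(f) = I, e has cov(e) = diag(v),
# and f and e are drawn independently, so cov(f, e) = 0.
f = rng.normal(size=(n, k))
e = rng.normal(size=(n, p)) * np.sqrt(v)
y = mu + f @ L.T + e

Sigma = L @ L.T + np.diag(v)   # covariance implied by the model: LL' + D
S = np.cov(y, rowvar=False)    # empirical covariance of the simulated data

# S converges to Sigma as n grows (law of large numbers).
print(np.max(np.abs(S - Sigma)))  # small; shrinks as n increases
```

Note how each diagonal entry of Σ splits into the communality (the squared loadings of item i) plus the unique variance v_i, which is the identity the discrepancy-function methods exploit.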
For ULS and GLS, the corresponding discrepancy function is minimized with respect to L and D (Browne, 1974). ML estimation is performed based on the partial derivatives of the logarithm of the Wishart (W) density function of the empirical covariance matrix S, with (n − 1)S ∼ W(Σ, n − 1) (Jöreskog, 1967). After estimates for µ, k, L, and D are obtained, the vector f can be estimated by

f̂ = (L′D^(−1)L)^(−1)L′D^(−1)(y − µ).

When applying this exploratory factor analysis, y is typically assumed to be normally distributed, and hence rk(Σ) = p, where Σ is the covariance matrix of y. For instance, one condition required for ULS or GLS estimation is that the fourth cumulants of y must be zero, which is the case, for example, if y follows a multivariate normal distribution (for this and other conditions, see Browne, 1974). For ML estimation, note that (n − 1)S ∼ W(Σ, n − 1) if y ∼ N(µ, Σ).

Another possibility of estimation for the EFA model is principal axis analysis (PAA). The model of PAA is

Z = FL′ + E,

where Z is an n × p matrix of standardized test results, F is an n × p matrix of factor scores, L is a p × p matrix of factor loadings, and E is an n × p matrix of error terms. For estimation of F and L based on the representation Z′Z = R = LL′ + D, the principal components transformation is applied.
However, the eigenvalue decomposition is not based on R, but on the reduced correlation matrix R_h = R − D̂, where D̂ is an estimate for D. An estimate D̂ is derived using h_j² = 1 − v_j and estimating the communalities h_j (for methods for estimating the communalities, see Harman, 1976).

The assumptions of principal axis analysis are

Z ∈ ℝ^(n×p), L ∈ ℝ^(p×p), F ∈ ℝ^(n×p), and E ∈ ℝ^(n×p),
E(f) = 0,
E(e) = 0,
cov(e, e) = E(ee′) = D = diag{v_1, . . ., v_p},
cov(f, e) = E(fe′) = 0,
cov(f, f) = E(ff′) = I,

and empirical moments of the manifest variables are assumed to exist such that, for any manifest variable j = 1, . . ., p, its empirical variance is not zero (s_j² ≠ 0). Moreover, we assume that rk(Z) = rk(R) = p and that the matrices Z, F, L, and E are interval-scaled (at the least).

2.3. GENERAL REMARKS

Two remarks are important before we discuss the assumptions associated with the classical factor models in the next section. First, it can be shown that L is unique up to an orthogonal transformation. As different orthogonal transformations may yield different correlation patterns, a specific orthogonal transformation must be taken into account (and fixed) before the estimation