Presentation on theme: "Factor Analysis Elizabeth Garrett-Mayer, PhD Georgiana Onicescu, ScM"— Presentation transcript:
1 Factor Analysis
Elizabeth Garrett-Mayer, PhD
Georgiana Onicescu, ScM
Cancer Prevention and Control Statistics Tutorial
July 9, 2009
2 Motivating Example: Cohesion in Dragon Boat paddler cancer survivors
Dragon boat paddling is an ancient Chinese sport that offers a unique blend of factors that could potentially enhance the quality of the lives of cancer survivor participants.
Evaluating the efficacy of dragon boating to improve the overall quality of life among cancer survivors has the potential to advance our understanding of factors that influence quality-of-life among cancer survivors.
We hypothesize that physical activity conducted within the context of the social support of a dragon boat team contributes significantly to improved overall quality of life above and beyond a standard physical activity program because the collective experience of dragon boating is likely enhanced by team sport factors such as cohesion, teamwork, and the goal of competition.
Methods: 134 cancer survivors self-selected to an 8-week dragon boat paddling intervention group or to an organized walking program. Each study arm was comprised of a series of 3 groups of approximately participants, with pre- and post-testing to compare quality of life and physical performance outcomes between study arms.
3 Motivating Example: Cohesion
We have a concept of what “cohesion” is, but we can’t measure it directly.
Merriam-Webster:
- the act or state of sticking together tightly
- the quality or state of being made one
How do we measure it?
- We cannot simply say “how cohesive is your team?” or “on a scale from 1-10, how do you rate your team cohesion?”
- We think it combines several elements of “unity” and “team spirit” and perhaps other “factors”
4 Factor Analysis
Data reduction tool
- Removes redundancy or duplication from a set of correlated variables
- Represents correlated variables with a smaller set of “derived” variables.
- Factors are formed that are relatively independent of one another.
Two types of “variables”:
- latent variables: factors
- observed variables
5 Cohesion Variables:
G1 (I do not enjoy being a part of the social environment of this exercise group)
G2 (I am not going to miss the members of this exercise group when the program ends)
G3 (I am unhappy with my exercise group’s level of desire to exceed)
G4 (This exercise program does not give me enough opportunities to improve my personal performance)
G5 (For me, this exercise group has become one of the most important social groups to which I belong)
G6 (Our exercise group is united in trying to reach its goals for performance)
G7 (We all take responsibility for the performance by our exercise group)
G8 (I would like to continue interacting with some of the members of this exercise group after the program ends)
G9 (If members of our exercise group have problems in practice, everyone wants to help them)
G10 (Members of our exercise group do not freely discuss each athlete’s responsibilities during practice)
G11 (I feel like I work harder during practice than other members of this exercise group)
6 Other examples
- Diet
- Air pollution
- Personality
- Customer satisfaction
- Depression
- Quality of Life
7 Some Applications of Factor Analysis
1. Identification of Underlying Factors:
- clusters variables into homogeneous sets
- creates new variables (i.e. factors)
- allows us to gain insight to categories
2. Screening of Variables:
- identifies groupings to allow us to select one variable to represent many
- useful in regression (recall collinearity)
3. Summary:
- Allows us to describe many variables using a few factors
4. Clustering of objects:
- Helps us to put objects (people) into categories depending on their factor scores
8 “Perhaps the most widely used (and misused) multivariate [technique] is factor analysis. Few statisticians are neutral about this technique. Proponents feel that factor analysis is the greatest invention since the double bed, while its detractors feel it is a useless procedure that can be used to support nearly any desired interpretation of the data. The truth, as is usually the case, lies somewhere in between. Used properly, factor analysis can yield much useful information; when applied blindly, without regard for its limitations, it is about as useful and informative as Tarot cards. In particular, factor analysis can be used to explore the data for patterns, confirm our hypotheses, or reduce the many variables to a more manageable number.”
-- Norman and Streiner, PDQ Statistics
9 Let’s work backwards
One of the primary goals of factor analysis is often to identify a measurement model for a latent variable
This includes
- identifying the items to include in the model
- identifying how many ‘factors’ there are in the latent variable
- identifying which items are “associated” with which factors
11 How to interpret?
Loadings: represent correlations between item and factor, so they range from -1 to 1
- High loadings: define a factor
- Low loadings: item does not “load” on factor
- Easy to skim the loadings
This example:
- factor 1 is defined by G5, G6, G7, G8, G9
- factor 2 is defined by G1, G2, G3, G4, G10, G11
Other things to note:
- factors are ‘independent’ (usually)
- we need to ‘name’ factors
- important to check their face validity.
These factors can now be ‘calculated’ using this model
- Each person is assigned a factor score for each factor
[Table: loadings on Factor1 and Factor2 for notenjoy, notmiss, desireexceed, personalpe~m, importants~l, groupunited, responsibi~y, interact, problemshelp, notdiscuss, workharder. High loadings are highlighted in yellow.]
12 How to interpret?
Authors may conclude something like:
“We were able to derive two factors from the 11 items. The first factor is defined as ‘teamwork.’ The second factor is defined as ‘personal competitive nature.’ These two factors describe 72% of the variance among the items.”
[Table: the same Factor1 and Factor2 loadings for the 11 items as on the previous slide. High loadings are highlighted in yellow.]
13 Where did the results come from?
Based on the basic “Classical Test Theory Idea”:
For a case with just one factor:
Ideal:
  X1 = F + e1
  X2 = F + e2
  …
  Xm = F + em
  var(ej) = var(ek), j ≠ k
Reality:
  X1 = λ1F + e1
  X2 = λ2F + e2
  …
  Xm = λmF + em
  var(ej) ≠ var(ek), j ≠ k
  (unequal “sensitivity” to change in factor)
(Related to Item Response Theory (IRT))
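A short worked derivation may help show why this model ties factor analysis to the correlation matrix. It is a sketch under the usual assumptions, which the slide does not state: F is uncorrelated with every error term, and the error terms are uncorrelated with one another.

```latex
\begin{aligned}
X_j &= \lambda_j F + e_j, \qquad j = 1, \dots, m \\
\operatorname{var}(X_j) &= \lambda_j^{2}\,\operatorname{var}(F) + \operatorname{var}(e_j) \\
\operatorname{cov}(X_j, X_k) &= \lambda_j \lambda_k \,\operatorname{var}(F), \qquad j \neq k
\end{aligned}
```

If the items are standardized and var(F) = 1, the last line says the correlation between two items is simply the product of their loadings. The “ideal” case is the special case with every λ equal to 1.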
16 The factor analysis process
Multiple steps
“Stepwise optimal”
- many choices to be made!
- a choice at one step may impact the remaining decisions
- considerable subjectivity
Data exploration is key
Strong theoretical model is critical
17 Steps in Exploratory Factor Analysis
(1) Collect and explore data: choose relevant variables.
(2) Determine the number of factors
(3) Estimate the model using predefined number of factors
(4) Rotate and interpret
(5) (a) Decide if changes need to be made (e.g. drop item(s), include item(s))
    (b) repeat (3)-(4)
(6) Construct scales and use in further analysis
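The steps above can be sketched end to end in Stata on simulated data. This is a minimal illustration, not the authors' analysis: the latent variables f_a and f_b, the items x1-x6, the loadings, and the seed are all made up so that the block runs without the study data.

```stata
* Illustrative sketch only: simulated data with hypothetical names
clear
set seed 12345
set obs 300

* Two independent latent factors
generate f_a = rnormal()
generate f_b = rnormal()

* Six observed items: x1-x3 driven by f_a, x4-x6 by f_b, plus noise
forvalues j = 1/3 {
    generate x`j' = 0.8*f_a + rnormal(0, 0.6)
}
forvalues j = 4/6 {
    generate x`j' = 0.8*f_b + rnormal(0, 0.6)
}

* (1) Explore the correlations
pwcorr x1-x6

* (2) Number of factors: principal-component eigenvalues and scree plot
factor x1-x6, pcf
screeplot

* (3) Estimate the model with the chosen number of factors
factor x1-x6, ipf factors(2)

* (4) Rotate and interpret
rotate, varimax

* (6) Construct factor scores for further analysis
predict score1 score2
summarize score1 score2
```

Step (5), dropping or adding items, would mean editing the variable list and rerunning steps (3) and (4).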
18 Data Exploration
Histograms
- normality
- discreteness
- outliers
Covariance and correlations between variables
- very high or low correlations?
Same scale
- high = good, low = bad?
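One possible way to run these checks in Stata is sketched below. It assumes the 11 cohesion items are already in memory and stored contiguously from notenjoy to workharder, as the varlist notenjoy-workharder used elsewhere in this deck implies. The graph names h1, h2, ... are arbitrary.

```stata
* Assumes the cohesion items notenjoy ... workharder are in memory
local i = 0
foreach v of varlist notenjoy-workharder {
    local ++i
    * One bar per response category, to check discreteness and outliers
    histogram `v', discrete frequency name(h`i', replace)
}

* Pairwise Pearson correlations, to spot very high or very low values
pwcorr notenjoy-workharder
```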
22 Data Matrix
Factor analysis is totally dependent on correlations between variables.
Factor analysis summarizes correlation structure
[Diagram: Data Matrix (observations O1…On by variables v1…vk) → Correlation Matrix (v1…vk by v1…vk) → Factor Matrix (v1…vk by factors F1…Fj)]
Implications for assumptions about X’s?
23 Important implications
Correlation matrix must be valid measure of association
Likert scale? i.e. “on a scale of 1 to K?”
Consider previous set of plots
Is Pearson (linear) correlation a reasonable measure of association?
24 Correlation for categorical items
Odds ratios? Nope. On the wrong scale.
Need measures on scale of -1 to 1, with zero meaning no association
Solutions:
- tetrachoric correlation: for binary items
- polychoric correlation: for ordinal items
-’choric correlations
- assume that the observed categories are cut from underlying continuous variables
- only appropriate if the ‘continuous underlying’ assumption makes sense
Not available in many software packages for factor analysis!
26 Polychoric Correlation in Stata
. findit polychoric
. polychoric notenjoy-workharder
. matrix R = r(R)
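For binary items, the analogous workflow can be sketched with Stata's built-in tetrachoric command, which leaves its correlation matrix in r(Rho). The data below are simulated and the names z and b1-b4 are hypothetical. The example only shows the mechanics of passing a -choric correlation matrix to factormat.

```stata
* Illustrative sketch only: simulated binary items with hypothetical names
clear
set seed 2009
set obs 500

* One underlying continuous trait; each item is a dichotomized noisy copy
generate z = rnormal()
forvalues j = 1/4 {
    generate byte b`j' = (0.7*z + rnormal(0, 0.7)) > 0
}

* Tetrachoric correlations among the binary items
tetrachoric b1-b4
matrix C = r(Rho)

* Factor analysis from the correlation matrix; n() gives the sample size
factormat C, n(500) ipf factors(1)
```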
27 Choosing Number of Factors
Intuitively: The number of uncorrelated constructs that are jointly measured by the X’s.
Only useful if number of factors is less than number of X’s (recall “data reduction”).
Use “principal components” to help decide
- type of factor analysis
- number of factors is equivalent to number of variables
- each factor is a weighted combination of the input variables:
  F1 = a11X1 + a12X2 + …
Recall: For a factor analysis, generally,
  X1 = a11F1 + a12F2 + …
28 Eigenvalues
To select how many factors to use, consider eigenvalues from a principal components analysis
Two interpretations:
- eigenvalue: equivalent number of variables which the factor represents
- eigenvalue: amount of variance in the data described by the factor.
Rules to go by:
- number of eigenvalues > 1
- scree plot
- % variance explained
- comprehensibility
Note: sum of eigenvalues is equal to the number of items
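The “eigenvalues > 1” rule and the note about the sum can be checked by hand on a small correlation matrix. The sketch below uses a hypothetical 4-item matrix with made-up values, not the cohesion data, and Stata's matrix symeigen command.

```stata
* Hypothetical 4-item correlation matrix (made-up values, illustration only)
matrix R = (1, .6, .5, .1 \ .6, 1, .4, .2 \ .5, .4, 1, .1 \ .1, .2, .1, 1)

* Eigen-decomposition of a symmetric matrix:
* V holds the eigenvectors, lambda the eigenvalues (largest first)
matrix symeigen V lambda = R
matrix list lambda

* Count the eigenvalues above 1 and add all of them up
local k = 0
local total = 0
forvalues j = 1/`=colsof(lambda)' {
    local total = `total' + lambda[1, `j']
    if lambda[1, `j'] > 1 {
        local ++k
    }
}
display "eigenvalues > 1: `k'"
display "sum of eigenvalues: " %6.3f `total' "   number of items: " rowsof(R)
```

Because every diagonal entry of a correlation matrix is 1, the sum of the eigenvalues equals the number of items (here 4).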
29 Cohesion Example
. factormat R, pcf n(134)
(obs=134)
Factor analysis/correlation            Number of obs    =
Method: principal-component factors    Retained factors =
Rotation: (unrotated)                  Number of params =
[Stata output: eigenvalue, difference, proportion, and cumulative proportion for Factor1 through Factor11]
31 Choose two factors: Now fit the model
. factormat R, n(134) ipf factor(2)
(obs=134)
Factor analysis/correlation            Number of obs    =
Method: iterated principal factors     Retained factors =
Rotation: (unrotated)                  Number of params =
Factor loadings (pattern matrix) and unique variances
[Stata output: Factor1 and Factor2 loadings and uniqueness for notenjoy, notmiss, desireexceed, personalpe~m, importants~l, groupunited, responsibi~y, interact, problemshelp, notdiscuss, workharder]
32 Interpretability?
Not interpretable at this stage
In an unrotated solution, the first factor describes most of variability.
Ideally we want to
- spread variability more evenly among factors.
- make factors interpretable
To do this we “rotate” factors:
- redefine factors such that loadings on various factors tend to be very high (-1 or 1) or very low (0)
- intuitively, it makes sharper distinctions in the meanings of the factors
- We use “factor analysis” for rotation NOT principal components!
Rotation does NOT improve fit!
35 Rotation options
“Orthogonal”
- maintains independence of factors
- more commonly seen
- usually at least one option
- Stata: varimax, quartimax, equamax, parsimax, etc.
“Oblique”
- allows dependence of factors
- makes distinctions sharper (loadings closer to 0’s and 1’s)
- can be harder to interpret once you lose independence of factors
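The two families of rotation can be compared on the same fitted model. The sketch below uses simulated data in which the two latent factors are correlated by construction. The names f_a, f_b, and x1-x6 and all numeric settings are hypothetical. Promax is used here as one example of an oblique rotation.

```stata
* Illustrative sketch only: simulated data with hypothetical names
clear
set seed 42
set obs 400

generate f_a = rnormal()
* f_b has variance 1 and correlation 0.5 with f_a by construction
generate f_b = 0.5*f_a + rnormal(0, sqrt(0.75))

forvalues j = 1/3 {
    generate x`j' = 0.8*f_a + rnormal(0, 0.6)
}
forvalues j = 4/6 {
    generate x`j' = 0.8*f_b + rnormal(0, 0.6)
}

factor x1-x6, ipf factors(2)

* Orthogonal rotation: factors stay uncorrelated
rotate, varimax

* Oblique rotation: factors are allowed to correlate
rotate, promax

* Correlation matrix of the rotated common factors
estat common
```

After an oblique rotation, estat common reports how strongly the rotated factors correlate, which is the independence that the orthogonal option preserves.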
36 Uniqueness
Should all items be retained?
Uniqueness for each item describes the proportion of the item’s variance that is NOT described by the factor model
Recall an R-squared:
- proportion of variance in Y explained by X
1-Uniqueness:
- proportion of the variance in Xk explained by F1, F2, etc.
Uniqueness:
- represents what is left over that is not explained by factors
- “error” that remains
A GOOD item has a LOW uniqueness
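The link between loadings and uniqueness can be written out explicitly. This is a sketch that assumes standardized items and orthogonal factors with unit variance. The symbols u_k, h_k^2, and q (the number of retained factors) are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
h_k^{2} &= \sum_{f=1}^{q} \lambda_{kf}^{2}
  && \text{(proportion of } \operatorname{var}(X_k) \text{ explained by the factors)} \\
u_k &= 1 - h_k^{2}
  && \text{(uniqueness)}
\end{aligned}
```

For example, a hypothetical item with loadings 0.7 and 0.2 on two factors has h² = 0.49 + 0.04 = 0.53, so its uniqueness is 0.47.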
39 Methods for Estimating Model
- Principal Components (already discussed)
- Principal Factor Method
- Iterated Principal Factor / Least Squares
- Maximum Likelihood (ML)
Most common(?): ML and Least Squares
Unfortunately, default is often not the best approach!
Caution! ipf and ml may not converge to the right answer! Look for uniqueness of 0 or 1. Problem of “identifiability” or getting “stuck.”
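One way to act on this caution is to fit the same model with both methods and compare the uniquenesses side by side. The sketch assumes the cohesion items are in memory with the variable names used in this deck. U_ipf and U_ml are arbitrary matrix names.

```stata
* Assumes the cohesion items notenjoy ... workharder are in memory
factor notenjoy-workharder, ipf factors(2)
matrix U_ipf = e(Psi)

factor notenjoy-workharder, ml factors(2)
matrix U_ml = e(Psi)

* Uniquenesses at or near 0 or 1 suggest the fit may be "stuck"
matrix list U_ipf
matrix list U_ml
```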
40 Interpretation
Naming of Factors
- Wrong Interpretation: Factors represent separate groups of people.
- Right Interpretation: Each factor represents a continuum along which people vary (and the dimensions are orthogonal if an orthogonal rotation is used)
41 Factor Scores and Scales
- Each object (e.g. each cancer survivor) gets a factor score for each factor.
- Old data vs. New data
- The factors themselves are variables
- An individual’s score is a weighted combination of scores on input variables
- These weights are NOT the factor loadings!
- Loadings and weights determined simultaneously so that there is no correlation between resulting factors.
42 Factor Scoring
. predict f1 f2
(regression scoring assumed)
Scoring coefficients (method = regression; based on varimax rotated factors)
[Stata output: scoring coefficients on Factor1 and Factor2 for notenjoy, notmiss, desireexceed, personalpe~m, importants~l, groupunited, responsibi~y, interact, problemshelp, workharder]
Why different than loadings?
- Factors are generally scaled to have variance 1.
- Mean is arbitrary.*
* If based on Pearson correlation mean will be zero.
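Regression scoring is only one option. The sketch below also requests Bartlett scores so the two can be compared. It assumes a rotated two-factor model is the active estimation result and that the item variables are in memory. The new variable names are hypothetical.

```stata
* Assumes a rotated two-factor model has just been fit on data in memory
predict f1_reg f2_reg, regression
predict f1_bart f2_bart, bartlett

* Compare the location and spread of the two sets of scores
summarize f1_reg f2_reg f1_bart f2_bart

* Check how strongly the estimated scores correlate with each other
correlate f1_reg f2_reg
```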
44 Teamwork (Factor 1) by Program
[Box plots of Factor 1 scores by program: Dragon Boat, Walking]
45 Personal Competitive Nature (Factor 2) by Program
[Box plots of Factor 2 scores by program: Dragon Boat, Walking]
46 Criticisms of Factor Analysis
- Labels of factors can be arbitrary or lack scientific basis
- Derived factors often very obvious
  - defense: but we get a quantification
- “Garbage in, garbage out”
  - really a criticism of input variables
  - factor analysis reorganizes input matrix
- Too many steps that could affect results
- Too complicated
- Correlation matrix is often poor measure of association of input variables.
47 Our example?
Preliminary analysis of pilot data!
Concern: negative items “hang together”, positive items “hang together”
Is separation into two factors:
- based on two different factors (teamwork, pers. comp. nature)
- based on negative versus positive items?
Recall: the computer will always give you something!
Validity?
- boxplots of factor 1 suggest something
- additional reliability and validity needs to be considered
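48 Stata Code
pwcorr notenjoy-workharder
polychoric notenjoy-workharder
matrix R = r(R)
factormat R, pcf n(134)
screeplot
factormat R, n(134) ipf factor(2)
rotate
* drop notdiscuss and recompute the polychoric correlations
polychoric notenjoy notmiss desire personal important group respon interact problem workharder
* the next three lines are not on the slide; they are assumed here so that
* predict scores the rotated 10-item model, as in the Pearson version
matrix R = r(R)
factormat R, n(134) ipf factor(2)
rotate
predict f1 f2
scatter f1 f2
graph box f1, by(progrm)
graph box f2, by(progrm)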
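49 Stata Code for Pearson Correlation
factor notenjoy-workharder, pcf
screeplot
factor notenjoy-workharder, ipf factor(2)
rotate
* refit without notdiscuss
factor notenjoy notmiss desire personal important group respon interact problem workharder, ipf factor(2)
* rotate is not repeated on the slide; it is assumed here because the scoring
* slide reports coefficients based on varimax rotated factors
rotate
predict f1 f2
scatter f1 f2
graph box f1, by(progrm)
graph box f2, by(progrm)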
50 Stata Options
Pearson correlation
- Use factor for principal components and factor analysis
- choose estimation approach: ipf, pcf, ml, pf
- choose to retain n factors: factor(n)
Polychoric correlation
- Use factormat for principal components and factor analysis
- include n(xxx) to describe the sample size
Scree Plot: screeplot
Rotate: choose rotation type: varimax (default), promax, etc.
Create factor variables
- predict: list as many new variable names as there are retained factors.
- Example: for 3 retained factors, predict teamwork competition hardworks