1 Factor Analysis Elizabeth Garrett-Mayer, PhD Georgiana Onicescu, ScM
Cancer Prevention and Control Statistics Tutorial July 9, 2009

2 Motivating Example: Cohesion in Dragon Boat paddler cancer survivors
Dragon boat paddling is an ancient Chinese sport that offers a unique blend of factors that could potentially enhance the quality of life of cancer survivor participants. Evaluating the efficacy of dragon boating to improve overall quality of life among cancer survivors has the potential to advance our understanding of the factors that influence quality of life in this population. We hypothesize that physical activity conducted within the context of the social support of a dragon boat team contributes significantly to improved overall quality of life, above and beyond a standard physical activity program, because the collective experience of dragon boating is likely enhanced by team sport factors such as cohesion, teamwork, and the goal of competition.
Methods: 134 cancer survivors self-selected either to an 8-week dragon boat paddling intervention group or to an organized walking program. Each study arm consisted of a series of 3 groups of approximately participants, with pre- and post-testing to compare quality-of-life and physical performance outcomes between study arms.

3 Motivating Example: Cohesion
We have a concept of what "cohesion" is, but we can't measure it directly. Merriam-Webster: "the act or state of sticking together tightly"; "the quality or state of being made one." How do we measure it? We cannot simply ask "how cohesive is your team?" or "on a scale from 1-10, how do you rate your team cohesion?" We think it combines several elements of "unity" and "team spirit," and perhaps other "factors."

4 Factor Analysis Data reduction tool
Removes redundancy or duplication from a set of correlated variables. Represents correlated variables with a smaller set of "derived" variables. Factors are formed that are relatively independent of one another. Two types of "variables": latent variables (the factors) and observed variables (the items).

5 Cohesion Variables: G1 (I do not enjoy being a part of the social environment of this exercise group) G2 (I am not going to miss the members of this exercise group when the program ends) G3 (I am unhappy with my exercise group’s level of desire to exceed) G4 (This exercise program does not give me enough opportunities to improve my personal performance) G5 (For me, this exercise group has become one of the most important social groups to which I belong) G6 (Our exercise group is united in trying to reach its goals for performance) G7 (We all take responsibility for the performance by our exercise group) G8 (I would like to continue interacting with some of the members of this exercise group after the program ends) G9 (If members of our exercise group have problems in practice, everyone wants to help them) G10 (Members of our exercise group do not freely discuss each athlete’s responsibilities during practice) G11 (I feel like I work harder during practice than other members of this exercise group)

6 Other examples: Diet, Air pollution, Personality, Customer satisfaction, Depression, Quality of Life

7 Some Applications of Factor Analysis
1. Identification of underlying factors: clusters variables into homogeneous sets; creates new variables (i.e., factors); allows us to gain insight into categories
2. Screening of variables: identifies groupings so that we can select one variable to represent many; useful in regression (recall collinearity)
3. Summary: allows us to describe many variables using a few factors
4. Clustering of objects: helps us put objects (people) into categories depending on their factor scores

8 "Perhaps the most widely used (and misused) multivariate [technique] is factor analysis. Few statisticians are neutral about this technique. Proponents feel that factor analysis is the greatest invention since the double bed, while its detractors feel it is a useless procedure that can be used to support nearly any desired interpretation of the data. The truth, as is usually the case, lies somewhere in between. Used properly, factor analysis can yield much useful information; when applied blindly, without regard for its limitations, it is about as useful and informative as Tarot cards. In particular, factor analysis can be used to explore the data for patterns, confirm our hypotheses, or reduce the many variables to a more manageable number."
-- Norman and Streiner, PDQ Statistics

9 Let's work backwards. One of the primary goals of factor analysis is often to identify a measurement model for a latent variable. This includes: identifying the items to include in the model; identifying how many "factors" there are in the latent variable; and identifying which items are "associated" with which factors.

10 Standard Result
[Stata output: factor loadings (Factor1, Factor2) for the 11 cohesion items (notenjoy, notmiss, desireexceed, personalperform, importantsocial, groupunited, responsibility, interact, problemshelp, notdiscuss, workharder); numeric values not preserved.]

11 How to interpret? Loadings represent correlations between an item and a factor. High loadings define a factor; a low loading means the item does not "load" on that factor. It is easy to skim the loadings. In this example, factor 1 is defined by G5, G6, G7, G8, G9, and factor 2 is defined by G1, G2, G3, G4, G10, G11. Other things to note: the factors are "independent" (usually); we need to "name" the factors, and it is important to check their face validity. These factors can now be "calculated" using this model: each person is assigned a factor score for each factor. Range: between -1 and 1. [Stata output: the factor loadings table, with high loadings highlighted in yellow on the slide; numeric values not preserved.]

12 How to interpret? Authors may conclude something like:
"We were able to derive two factors from the 11 items. The first factor is defined as 'teamwork.' The second factor is defined as 'personal competitive nature.' These two factors describe 72% of the variance among the items." [Stata output: the same factor loadings table, with high loadings highlighted in yellow on the slide; numeric values not preserved.]

13 Where did the results come from?
Based on the basic "Classical Test Theory" idea, for a case with just one factor:
Ideal: X1 = F + e1, X2 = F + e2, ..., Xm = F + em, with var(ej) = var(ek) for j ≠ k.
Reality: X1 = λ1F + e1, X2 = λ2F + e2, ..., Xm = λmF + em, with var(ej) ≠ var(ek) for j ≠ k (unequal "sensitivity" to change in the factor).
(Related to Item Response Theory (IRT).)
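A minimal sketch (not from the slides) of simulating the one-factor "reality" model in Stata; the loadings 0.9, 0.7, 0.5, the seed, and the sample size of 500 are illustrative assumptions:
* simulate one latent factor and three items with unequal "sensitivity"
clear
set seed 12345
set obs 500
generate F = rnormal()
generate x1 = 0.9*F + rnormal()   // most sensitive to the factor
generate x2 = 0.7*F + rnormal()
generate x3 = 0.5*F + rnormal()   // least sensitive to the factor
factor x1 x2 x3, ipf factor(1)    // estimated loadings should order x1 > x2 > x3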

14 Multi-Factor Models
Two-factor orthogonal model (ORTHOGONAL = INDEPENDENT). Example: cohesion has two domains:
X1 = λ11F1 + λ12F2 + e1
X2 = λ21F1 + λ22F2 + e2
...
X11 = λ11,1F1 + λ11,2F2 + e11
More generally, with m factors and n observed variables:
X1 = λ11F1 + λ12F2 + ... + λ1mFm + e1
X2 = λ21F1 + λ22F2 + ... + λ2mFm + e2
...
Xn = λn1F1 + λn2F2 + ... + λnmFm + en

15 Loadings (estimated) in our example

16 The factor analysis process
Multiple steps; "stepwise optimal." There are many choices to be made, and a choice at one step may impact the remaining decisions, so there is considerable subjectivity. Data exploration is key. A strong theoretical model is critical.

17 Steps in Exploratory Factor Analysis
(1) Collect and explore data: choose relevant variables.
(2) Determine the number of factors.
(3) Estimate the model using the predefined number of factors.
(4) Rotate and interpret.
(5) (a) Decide if changes need to be made (e.g., drop item(s), include item(s)); (b) repeat (3)-(4).
(6) Construct scales and use them in further analysis.
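As a rough sketch, these steps map onto the Stata commands used later in these slides (the polychoric route, with the items in memory):
* (1) explore the data
pwcorr notenjoy-workharder
* (2) choose the number of factors via principal components and a scree plot
polychoric notenjoy-workharder
matrix R = r(R)
factormat R, pcf n(134)
screeplot
* (3) estimate the model with the chosen number of factors
factormat R, n(134) ipf factor(2)
* (4) rotate and interpret
rotate
* (5) drop or add items and repeat (3)-(4) as needed
* (6) construct factor scores for further analysis
predict f1 f2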

18 Data Exploration
Histograms: normality, discreteness, outliers. Covariance and correlations between variables: any very high or low correlations? Same scale (high = good, low = bad)?
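A minimal sketch of this exploration step in Stata, assuming the 11 items (notenjoy through workharder) are in memory:
* distributions: look for skewness, discreteness, outliers
summarize notenjoy-workharder
histogram notenjoy, discrete
* pairwise associations: look for very high or very low correlations
pwcorr notenjoy-workharder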

19 Data exploration

20 Correlation Matrix
. pwcorr notenjoy-workharder
[Stata output: 11 x 11 Pearson correlation matrix for notenjoy, notmiss, desireexceed, personalperform, importantsocial, groupunited, responsibility, interact, problemshelp, notdiscuss, workharder; numeric values not preserved.]

21 Valid correlations?

22 Data Matrix
Factor analysis is totally dependent on correlations between variables; it summarizes the correlation structure.
Data matrix (objects O1...On by variables v1...vk) -> Correlation matrix (v1...vk by v1...vk) -> Factor matrix (v1...vk by F1...Fj)
Implications for assumptions about the X's?

23 Important implications
The correlation matrix must be a valid measure of association. Likert scale items (i.e., "on a scale of 1 to K")? Consider the previous set of plots: is Pearson (linear) correlation a reasonable measure of association?
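One informal check (a sketch, not from the slides): compare the Pearson correlation with the rank-based Spearman correlation for a pair of Likert items; a large discrepancy suggests the linear correlation is a poor summary of the association.
* Pearson (linear) correlation for one pair of items
pwcorr notenjoy notmiss
* Spearman (rank) correlation for the same pair
spearman notenjoy notmiss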

24 Correlation for categorical items
Odds ratios? Nope: they are on the wrong scale. We need measures on a scale of -1 to 1, with zero meaning no association. Solutions: tetrachoric correlation for binary items; polychoric correlation for ordinal items. The "-choric" correlations assume that the variables are truncated versions of underlying continuous variables, so they are only appropriate if the "continuous underlying" assumption makes sense. They are not available in many software packages for factor analysis!

25 Polychoric Correlation Matrix
[Stata output: 11 x 11 polychoric correlation matrix for the same 11 items; numeric values not preserved.]

26 Polychoric Correlation in Stata
. findit polychoric
. polychoric notenjoy-workharder
. matrix R = r(R)

27 Choosing Number of Factors
Intuitively: the number of uncorrelated constructs that are jointly measured by the X's. Only useful if the number of factors is less than the number of X's (recall "data reduction"). Use "principal components" to help decide: in principal components, the number of factors equals the number of variables, and each factor (component) is a weighted combination of the input variables: F1 = a11X1 + a12X2 + .... Recall that in factor analysis, generally, X1 = a11F1 + a12F2 + ...
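A minimal sketch of the contrast in Stata, assuming the items are in memory; pca is Stata's principal components command (the slides themselves use factor with the pcf option for principal-component factoring):
* principal components: each component is a weighted combination of the X's
pca notenjoy-workharder
* factor analysis: each X is modeled as a weighted combination of the factors
factor notenjoy-workharder, ipf factor(2)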

28 Eigenvalues To select how many factors to use, consider eigenvalues from a principal components analysis Two interpretations: eigenvalue  equivalent number of variables which the factor represents eigenvalue  amount of variance in the data described by the factor. Rules to go by: number of eigenvalues > 1 scree plot % variance explained comprehensibility Note: sum of eigenvalues is equal to the number of items

29 Cohesion Example
. factormat R, pcf n(134)
(obs=134)
[Stata output: principal-component factors; table of eigenvalue, difference, proportion, and cumulative proportion for Factor1 through Factor11; numeric values not preserved.]

30 Scree Plot for Cohesion Example

31 Choose two factors: Now fit the model
. factormat R, n(134) ipf factor(2)
(obs=134)
[Stata output: iterated principal factors, 2 factors retained, unrotated; factor loadings (pattern matrix, Factor1 and Factor2) and uniqueness for the 11 items; numeric values not preserved.]

32 Interpretability? Not interpretable at this stage
In an unrotated solution, the first factor describes most of the variability. Ideally we want to spread the variability more evenly among the factors and make the factors interpretable. To do this we "rotate" the factors: we redefine the factors such that loadings tend to be either very high (near -1 or 1) or very low (near 0); intuitively, this makes sharper distinctions in the meanings of the factors. We use "factor analysis" for rotation, NOT principal components! Rotation does NOT improve fit!

33 Rotating Factors (Intuitively)
[Diagram: four items plotted against Factor 1 and Factor 2 axes, shown before and after rotation.]

34 Rotated Solution
. rotate
[Stata output: iterated principal factors, 2 factors retained, orthogonal varimax rotation (Kaiser off); variance explained by each rotated factor, LR test of independent vs. saturated model, and rotated factor loadings (pattern matrix, Factor1 and Factor2) with uniqueness for the 11 items; numeric values not preserved.]

35 Rotation options
"Orthogonal": maintains the independence of the factors; more commonly seen; usually at least one option in any package (Stata: varimax, quartimax, equamax, parsimax, etc.).
"Oblique": allows dependence (correlation) of the factors; makes distinctions sharper (loadings closer to 0's and 1's); can be harder to interpret once you lose independence of the factors.
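A small sketch of the two routes in Stata after the factor model has been fit; varimax is the default orthogonal rotation, promax is an oblique option, and estat common reports the correlations among the factors (relevant after an oblique rotation):
* orthogonal rotation: factors stay uncorrelated
rotate, varimax
* oblique rotation: factors are allowed to correlate
rotate, promax
estat common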

36 Uniqueness Should all items be retained?
Uniqueness for each item describes the proportion of that item's variance NOT described by the factor model. Recall an R-squared: the proportion of variance in Y explained by X. Here, 1 - uniqueness is the proportion of the variance in Xk explained by F1, F2, etc., and the uniqueness represents what is left over, the "error" that remains. A GOOD item has a LOW uniqueness.
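A worked illustration with hypothetical loadings (not from the slides), using the fact that for orthogonal factors the communality of a standardized item is the sum of its squared loadings: an item loading 0.8 on Factor1 and 0.1 on Factor2 has communality 0.8^2 + 0.1^2 = 0.65, so its uniqueness is
. display 1 - (0.8^2 + 0.1^2)
.35
that is, 65% of the item's variance is explained by the two factors and 35% is left over.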

37 Our current model?
[Stata output: rotated factor loadings (pattern matrix, Factor1 and Factor2) and uniqueness for the 11 items; numeric values not preserved.]

38 Revised without “notdiscuss”
[Stata output: rotated factor loadings (pattern matrix, Factor1 and Factor2) and uniqueness for the remaining 10 items after dropping notdiscuss; numeric values not preserved.]

39 Methods for Estimating Model
Principal components (already discussed); principal factor method; iterated principal factor / least squares; maximum likelihood (ML). Most common(?): ML and least squares. Unfortunately, the default is often not the best approach! Caution: ipf and ml may not converge to the right answer! Look for uniquenesses of 0 or 1: a problem of "identifiability," or of getting "stuck."
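A small sketch of switching estimation methods on the same polychoric correlation matrix (options as used elsewhere in these slides; ml requests maximum likelihood):
* iterated principal factors
factormat R, n(134) ipf factor(2)
* maximum likelihood, for comparison
factormat R, n(134) ml factor(2)
* in either case, scan the Uniqueness column for values at or very near 0 or 1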

40 Interpretation Naming of Factors
Wrong interpretation: factors represent separate groups of people. Right interpretation: each factor represents a continuum along which people vary (and the dimensions are orthogonal to one another if an orthogonal rotation was used).

41 Factor Scores and Scales
Each object (e.g., each cancer survivor) gets a factor score for each factor. Old data vs. new data: the factors themselves are new variables. An individual's score is a weighted combination of his or her scores on the input variables. These weights are NOT the factor loadings! Loadings and weights are determined simultaneously so that there is no correlation between the resulting factors.

42 Factor Scoring
. predict f1 f2
(regression scoring assumed)
[Stata output: scoring coefficients (method = regression, based on the varimax-rotated factors) for Factor1 and Factor2 on the retained items; numeric values not preserved.]
Why are these different from the loadings? The scoring coefficients are the weights used to build each person's score from the items, not the item-factor correlations. Also, factors are generally scaled to have variance 1, and the mean is arbitrary. (If based on a Pearson correlation matrix, the mean will be zero.)

43 Orthogonal (i.e., independent)?
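A quick check of this in Stata (a sketch, assuming f1 and f2 were created with predict as above); the correlation between the two sets of factor scores should be essentially zero:
scatter f1 f2
pwcorr f1 f2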

44 Teamwork (Factor 1) by Program
[Boxplots of Factor 1 scores for the Dragon Boat and Walking groups.]

45 Personal Competitive Nature (Factor 2) by Program
[Boxplots of Factor 2 scores for the Dragon Boat and Walking groups.]

46 Criticisms of Factor Analysis
Labels of factors can be arbitrary or lack scientific basis. Derived factors are often very obvious (defense: but we get a quantification). "Garbage in, garbage out": really a criticism of the input variables, since factor analysis only reorganizes the input matrix. Too many steps that could affect the results. Too complicated. The correlation matrix is often a poor measure of association among the input variables.

47 Our example? Preliminary analysis of pilot data!
Concern: the negative items "hang together" and the positive items "hang together." Is the separation into two factors based on two different constructs (teamwork, personal competitive nature), or based on negative versus positive item wording? Recall: the computer will always give you something! Validity? The boxplots of factor 1 suggest something; additional reliability and validity work needs to be considered.

48 Stata Code
pwcorr notenjoy-workharder
polychoric notenjoy-workharder
matrix R = r(R)
factormat R, pcf n(134)
screeplot
factormat R, n(134) ipf factor(2)
rotate
polychoric notenjoy notmiss desire personal important group respon interact problem workharder
matrix R = r(R)
factormat R, n(134) ipf factor(2)
rotate
predict f1 f2
scatter f1 f2
graph box f1, by(progrm)
graph box f2, by(progrm)

49 Stata Code for Pearson Correlation
factor notenjoy-workharder, pcf
screeplot
factor notenjoy-workharder, ipf factor(2)
rotate
factor notenjoy notmiss desire personal important group respon interact problem workharder, ipf factor(2)
predict f1 f2
scatter f1 f2
graph box f1, by(progrm)
graph box f2, by(progrm)

50 Stata Options
Pearson correlation: use factor for principal components and factor analysis; choose the estimation approach (ipf, pcf, ml, pf); choose to retain n factors with factor(n).
Polychoric correlation: use factormat for principal components and factor analysis; include n(xxx) to give the sample size.
Scree plot: screeplot.
Rotate: rotate; choose the rotation type: varimax (default), promax, etc.
Create factor variables: predict; list as many new variable names as there are retained factors. Example: for 3 retained factors, predict teamwork competition hardworks.

