Principal Component Analysis


Principal Component Analysis: The Olympic Heptathlon (Ch. 13)

Principal components
The principal components method summarizes data by capturing the major correlations among the observed variables in a small number of linear combinations of them.
Usually little information is lost in the process.
Major application: correlated variables are transformed into a set of uncorrelated variables.
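As a quick illustration of the "correlated in, uncorrelated out" idea, here is a minimal sketch using a made-up data frame X (not the heptathlon data introduced on the next slides), checking that the component scores returned by prcomp() are pairwise uncorrelated:

# Hypothetical example: three numeric columns, two of them strongly correlated
X <- data.frame(a = rnorm(100))
X$b <- X$a + rnorm(100, sd = 0.3)   # b is correlated with a
X$c <- rnorm(100)                   # c is independent noise
pca <- prcomp(X, scale = TRUE)
round(cor(X), 2)       # original variables: noticeable off-diagonal correlations
round(cor(pca$x), 2)   # component scores: identity matrix (off-diagonals ~ 0)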

Olympic Heptathlon Data
7 events: hurdles, highjump, shot, run200m, longjump, javelin, run800m
The scores for these events are all on different scales.
A relatively high number can be good or bad depending on the event.
25 Olympic competitors.

R Commands
Reorder the scores so that a high number always means a good score:
heptathlon$hurdles <- max(heptathlon$hurdles) - heptathlon$hurdles
hurdles, run200m, and run800m require reordering (lower times are better); a sketch recoding all three follows below.
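A minimal sketch of the full recoding step, assuming the heptathlon data frame comes from the HSAUR package (the slide itself shows only the hurdles line):

library("HSAUR")                       # assumption: the data ships with HSAUR
data("heptathlon", package = "HSAUR")
# For the running events a smaller raw time is better, so flip them
# so that a larger recoded value always means a better performance.
heptathlon$hurdles <- with(heptathlon, max(hurdles) - hurdles)
heptathlon$run200m <- with(heptathlon, max(run200m) - run200m)
heptathlon$run800m <- with(heptathlon, max(run800m) - run800m)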

Basic plot to look at the data
R Commands
score <- which(colnames(heptathlon) == "score")
which() searches the column names of the heptathlon data frame for "score" and stores the matching column index in the variable score.
plot(heptathlon[, -score])
This draws a scatterplot matrix of the events, excluding the score column.

Interpretation
The variables look correlated with one another, except for the javelin event. The book speculates that the javelin is a 'technical' event, whereas the others are all 'power' events.

To get numerical correlation values:
round(cor(heptathlon[, -score]), 2)
The cor() function computes the actual correlation values, and its output agrees with this interpretation:

         hurdles highjump  shot run200m longjump javelin run800m
hurdles     1.00     0.81  0.65    0.77     0.91    0.01    0.78
highjump    0.81     1.00  0.44    0.49     0.78    0.00    0.59
shot        0.65     0.44  1.00    0.68     0.74    0.27    0.42
run200m     0.77     0.49  0.68    1.00     0.82    0.33    0.62
longjump    0.91     0.78  0.74    0.82     1.00    0.07    0.70
javelin     0.01     0.00  0.27    0.33     0.07    1.00   -0.02
run800m     0.78     0.59  0.42    0.62     0.70   -0.02    1.00

Running a Principal Component Analysis
heptathlon_pca <- prcomp(heptathlon[, -score], scale = TRUE)
print(heptathlon_pca)

Standard deviations:
[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

Rotation:
                PC1         PC2         PC3         PC4         PC5         PC6         PC7
hurdles  -0.4528710  0.15792058 -0.04514996  0.02653873 -0.09494792 -0.78334101  0.38024707
highjump -0.3771992  0.24807386 -0.36777902  0.67999172  0.01879888  0.09939981 -0.43393114
shot     -0.3630725 -0.28940743  0.67618919  0.12431725  0.51165201 -0.05085983 -0.21762491
run200m  -0.4078950 -0.26038545  0.08359211 -0.36106580 -0.64983404  0.02495639 -0.45338483
longjump -0.4562318  0.05587394  0.13931653  0.11129249 -0.18429810  0.59020972  0.61206388
javelin  -0.0754090 -0.84169212 -0.47156016  0.12079924  0.13510669 -0.02724076  0.17294667
run800m  -0.3749594  0.22448984 -0.39585671 -0.60341130  0.50432116  0.15555520 -0.09830963

a1 <- heptathlon_pca$rotation[, 1]
This extracts the coefficients for the first principal component, Y1.
Y1 is the linear combination of the observed variables that captures the largest possible portion of the overall sample variance. Y2 is the linear combination that captures the largest portion of the remaining sample variance, with the added constraint of being uncorrelated with Y1.

> a1 <- heptathlon_pca$rotation[, 1]
> a1
   hurdles   highjump       shot    run200m   longjump    javelin    run800m
-0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594
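A small sketch (not from the slides) of two checks behind the statement on the previous slide, using properties of the prcomp() output:

sum(a1^2)                    # 1: the loading vector has unit length
var(heptathlon_pca$x[, 1])   # sample variance of the first component's scores ...
heptathlon_pca$sdev[1]^2     # ... equals the first squared standard deviation (~4.46)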

Interpretation
The 200m and long jump results are among the most important contributors to the first component; the javelin result is the least important.
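One way to see this directly (a small sketch, not from the slides) is to rank the loadings by absolute size:

sort(abs(a1), decreasing = TRUE)
#  longjump   hurdles   run200m  highjump   run800m      shot   javelin
# 0.4562318 0.4528710 0.4078950 0.3771992 0.3749594 0.3630725 0.0754090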

Data analysis using the first principal component
center <- heptathlon_pca$center
This stores the means of the variables, i.e. the values prcomp() subtracted from each column (the center argument of prcomp() controls whether this centering is done).
scale <- heptathlon_pca$scale
This stores the standard deviations prcomp() divided each column by, because prcomp() was called with scale = TRUE; here we are simply saving them for reuse.
hm <- as.matrix(heptathlon[, -score])
This coerces the data.frame heptathlon into a matrix and excludes the score column.
drop(scale(hm, center = center, scale = scale) %*% heptathlon_pca$rotation[, 1])
scale() rescales the raw heptathlon data in the same way prcomp() did; %*% then multiplies the rescaled matrix by the coefficients of the linear combination for the first principal component (Y1); drop() simplifies the resulting one-column matrix to a named vector, which is printed.

Joyner-Kersee (USA)          John (GDR)        Behmer (GDR)  Sablovskaite (URS)
       -4.121447626        -2.882185935        -2.649633766        -1.343351210
  Choubenkova (URS)        Schulz (GDR)       Fleming (AUS)       Greiner (USA)
       -1.359025696        -1.043847471        -1.100385639        -0.923173639
   Lajbnerova (CZE)       Bouraga (URS)       Wijnsma (HOL)     Dimitrova (BUL)
       -0.530250689        -0.759819024        -0.556268302        -1.186453832
     Scheider (SWI)         Braun (FRG)  Ruotsalainen (FIN)        Yuping (CHN)
        0.015461226         0.003774223         0.090747709        -0.137225440
        Hagger (GB)         Brown (USA)       Mulliner (GB)    Hautenauve (BEL)
        0.171128651         0.519252646         1.125481833         1.085697646
       Kytola (FIN)      Geremias (BRA)       Hui-Ing (TAI)      Jeong-Mi (KOR)
        1.447055499         2.014029620         2.880298635         2.970118607
        Launa (PNG)
        6.270021972

An easier way
predict(heptathlon_pca)[, 1]
This accomplishes the same thing as the previous set of commands.
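A small sanity check (not from the slides) that the manual calculation, predict(), and the scores stored in the prcomp object all agree:

manual <- drop(scale(hm, center = center, scale = scale) %*% heptathlon_pca$rotation[, 1])
all.equal(manual, predict(heptathlon_pca)[, 1])   # TRUE
all.equal(manual, heptathlon_pca$x[, 1])          # TRUE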

Principal Components: Proportion of Sample Variance
The first component contributes the largest share of the total sample variance (well over half).
Looking at just the first two (uncorrelated!) principal components already accounts for most of the overall sample variance (~81%).
plot(heptathlon_pca)
This plots the variances of the components (a scree plot).
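The exact proportions can be read off with summary(), or computed by hand from the standard deviations printed earlier (a brief sketch; the ~64% and ~81% figures below follow directly from those values):

summary(heptathlon_pca)
# By hand: each component's variance as a share of the total variance.
prop <- heptathlon_pca$sdev^2 / sum(heptathlon_pca$sdev^2)
round(prop, 3)          # PC1 alone: ~0.637
round(cumsum(prop), 3)  # PC1 + PC2: ~0.808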

First two Principal Components
biplot(heptathlon_pca, col = c("gray", "black"))

Interpretation
The Olympians with the highest scores appear toward the bottom left of the graph. The javelin event seems to give the scores a finer variation and to award some competitors a slight edge.

How well does it fit the scoring?
The correlation between Y1 and the official scoring looks very strong.
cor(heptathlon$score, heptathlon_pca$x[, 1])
[1] -0.9910978
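A quick visual check (a sketch, not shown on the slides) is to plot the official score against the first component's scores:

plot(heptathlon_pca$x[, 1], heptathlon$score,
     xlab = "First principal component (Y1)",
     ylab = "Official heptathlon score")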

Homework! (Ch. 13)
Use the "meteo" data on page 225 and create scatterplots to check for correlation (don't recode/reorder anything, and remember not to include columns in the analysis that don't belong!). Is there correlation? You don't need to have R calculate the numerical values unless you really want to.
Run the PCA, either the long way or with the shorter "predict" command (remember not to include the unnecessary column!).
Create a biplot, but use colors other than gray and black!
Create a scatterplot like the one on page 224 of the 1st principal component against the yield. What is the numerical value of the correlation?
Don't forget to copy and paste your commands into Word and print it out for me (and include the scatterplot)!