Multivariate statistics

Slides:



Advertisements
Similar presentations
Trends in Number of High School Graduates: National
Advertisements

PARTISAN CONTROL AND STATE DECISIONS ABOUT OBAMACARE FULL GO STATES (n = 22) Arkansas Michigan CALIFORNIA MINNESOTA COLORADO NEVADA CONNECTICUT New Hampshire.
Hwy Ops Div1 THE GREAT KAHUNA AWARD !!! TEA 2004 CONFERENCE, MOBILE, AL OCTOBER 09-11, 2004 OFFICE OF PROGRAM ADMINISTRATION HIPA-30.
The West` Washington Idaho 1 Montana Oregon California 3 4 Nevada Utah
Fifty Nifty United States
TOTAL CASES FILED IN MAINE PER 1,000 POPULATION CALENDAR YEARS FILINGS PER 1,000 POPULATION This chart shows bankruptcy filings relative to.
5 Year Total LIHEAP Block Grant Allotment (FY ) While LIHEAP is intended to assist low-income families with their year-round home energy needs,
BINARY CODING. Alabama Arizona California Connecticut Florida Hawaii Illinois Iowa Kentucky Maine Massachusetts Minnesota Missouri 0 Nebraska New Hampshire.
U.S. Civil War Map On a current map of the U.S. identify and label the Union States, the Confederate States, and U.S. territories. Create a map key and.
Xuhua Xia Slide 1 Principal Components Analysis Objectives: –Understand the principles of principal components analysis (PCA) –Recognize conditions under.
This chart compares the percentage of cases filed in Maine under chapter 13 with the national average between 1999 and As a percent of total filings,
Fasten your seatbelts we’re off on a cross country road trip!
Map Review. California Kentucky Alabama.
Judicial Circuits. If You Live In This State This Is Your Judicial Circuit Alabama11th Circuit Alaska 9th Circuit Arkansas 8th Circuit Arizona 9th Circuit.
1. AFL-CIO What percentage of the funds received by Alabama K-12 public schools in school year was provided by the state of Alabama? a)44% b)53%
The United States.
Directions: Label Texas, Arkansas, Louisiana, Mississippi, Tennessee, Alabama, Georgia, Florida, South Carolina, North Carolina, Virginia--- then color.
 As a group, we thought it be interesting to see how many of our peers drop out of school.  Since in the United States education is so important, we.
Warm Up Complete the Coordinate Practice #10. Content Objective: – Compare the physical and political regions. Language Objectives: – SWBAT define region.
CHAPTER 7 FILINGS IN MAINE CALENDAR YEARS 1999 – 2009 CALENDAR YEAR CHAPTER 7 FILINGS This chart shows total case filings in Maine for calendar years 1999.
By Carol Fahringer. I.The United States: Divided Into 8 Different Political Regions.
Study Cards The East (12) Study Cards The East (12) New Hampshire New York Massachusetts Delaware Connecticut New Jersey Rhode Island Rhode Island Maryland.
Hawaii Alaska (not to scale) Alaska GeoCurrents Customizable Base Map text.
US MAP TEST Practice
Education Level. STD RATE Teen Pregnancy Rates Pre-teen Pregnancy Rate.
TOTAL CASE FILINGS - MAINE CALENDAR YEARS 1999 – 2009 CALENDAR YEAR Total Filings This chart shows total case filings in Maine for calendar years 1999.
50 Nifty United States Fifty nifty United States from thirteen original colonies; Fifty nifty stars in the flag that billows so beautif’ly in the breeze.
The United States is a system that can be broken into 5 major parts or regions.
Can you locate all 50 states? Grade 4 Mrs. Kuntz.
USA ILLUSTRATIONS – US CHARACTER Go ahead and replace it with your own text. This is an example text. Go ahead and replace it with your own text Go ahead.
1st Hour2nd Hour3rd Hour Day #1 Day #2 Day #3 Day #4 Day #5 Day #2 Day #3 Day #4 Day #5.
2012 IFTA / IRP MANAGERS’AND LAW ENFORCEMENT WORKSHOP
The United States Song Wee Sing America.
Expanded State Agency Use of NMLS
Fifty nifty United States from
The United States.
Supplementary Data Tables, Utilization and Volume
Sales Tax Raw Data State Sales Tax 1 Alabama 4% 2 Alaska 0% 3 Arizona
Maps.
Physicians per 1,000 Persons
USAGE OF THE – GHz BAND IN THE USA
Content Objective: Language Objectives:
USA! E M O C L E W MAP OF USA To the Go ahead, use your tools below:
Table 3.1: Trends in Inpatient Utilization in Community Hospitals, 1992 – 2012
Name the State Flags Your group are to identify which state the flag belongs to and sign correctly to earn a point.
GLD Org Chart February 2008.
Membership Update July 13, 2016.
2008 presidential election
State Adoption of Uniform State Test
The States How many states are in the United States?
State Adoption of NMLS ESB
Supplementary Data Tables, Trends in Overall Health Care Market
Fifty nifty United States
AIDS Education & Training Center Program Regional Centers
Fifty Nifty United States
Table 2.3: Beds per 1,000 Persons by State, 2013 and 2014
Regions of the United States
DO NOW: TAKE OUT ANY FORMS OR PAPERS YOU NEED TO TURN IN
Regions of the United States
Supplementary Data Tables, Utilization and Volume
Regions How many do you know?.
Presidential Electoral College Map
2012 US Presidential Election Result
2008 presidential election
WASHINGTON MAINE MONTANA VERMONT NORTH DAKOTA MINNESOTA MICHIGAN
Expanded State Agency Use of NMLS
CBD Topical Sales Restrictions by State (as of May 23, 2019)
In 2006, approximately 46% of all AIDS cases among adults and adolescents were in the South, followed by the Northeast (26%), the West (16%), and the Midwest.
AIDS Education & Training Center Program Regional Centers
USAGE OF THE 4.4 – 4.99 GHz BAND IN THE USA
Presentation transcript:

Multivariate statistics PCA: principal component analysis Correspondence analysis Canonical correlation Discriminant function analysis Cluster analysis MANOVA Xuhua Xia

PCA Given a set of variables x1, x2, …, xn, find a set of coefficients a11, a12, …, a1n, so that PC1 = a11x1 + a12x2 + …+ a1nxn has the maximum variance (v1) subject to the constraint that a1 is a unit vector, i.e., sqrt(a112+ a122 …+ a1n2) = 1 find a 2nd set of coefficients a2 so that PC2 has the maximum variance (v2) subject to the unit vector constraint and the additional constraint that a2 is orthogonal to a1 find 3rd, 4th,… nth set of coefficients so that PC3, PC4, … have the maximum variance (v3, v4, …) subject to the unit vector constraint and that ai is orthogonal to all ai-1 vectors. It turns out that v1, v2, … are eigenvalues and a1, a2, … are eigenvectors of the variance-covariance matrix of x1, x2, …, xn PCA is to find the eigenvalues and eigenvectors.

Typical Form of Data A data set in a 8x3 matrix. The rows could be species and columns sampling sites. 100 97 99 96 90 90 80 75 60 75 85 95 62 40 28 77 80 78 92 91 80 75 85 100 X = A matrix is often referred to as a nxp matrix (n for number of rows and p for number of columns). Our matrix has 8 rows and 3 columns, and is an 8x3 matrix. A variance-covariance matrix has n = p, and is called n-dimensional square matrix. Xuhua Xia

What are Principal Components? PC = a1X1 + a2X2 + … anXn Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet four criteria What are the four criteria? Xuhua Xia

What are Principal Components? The four criteria: There are exactly p principal components (PCs), each being a linear combination of the observed variables; The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated); The components are extracted in order of decreasing variance. The components are in the form of eigenvalues and eigenvector of unit length. Xuhua Xia

A Simple Data Set X Y X 1 1 Y 1 1 X Y X 1 1.414 Y 1.414 2 X1 X2 -1.2649 -1.7889 -0.6325 -0.8944 0 0 0.6325 0.8944 1.2649 1.7889 Note that, for standardized X and Y, the denominator of r_XY is sqrt(SS_x*SS_y) = sqrt(var_x*df * var_y*df) = sqrt(df*df) = df. That is, for standardized variables, correlation matrix and var-covar matrix are the same. X Y X 1 1 Y 1 1 X Y X 1 1.414 Y 1.414 2 Correlation matrix Covariance matrix Xuhua Xia

General observations The total variance is 3 (= 1 + 2) The two variables, X and Y, are perfectly correlated, with all points fall on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. For this reason, PCA is often referred to as a dimension-reduction technique. What would happen if we apply PCA to the data? Xuhua Xia

Graphic PCA -2 -1.5 -1 -0.5 0.5 1 1.5 2 X Y Xuhua Xia

R functions Don’t use scientific notation. X1 X2 -1.2649 -1.7889 -0.6325 -0.8944 0 0 0.6325 0.8944 1.2649 1.7889 options("scipen"=100, "digits"=6) objPCA<-prcomp(~X1+X2) objPCA<-prcomp(md) objPCA<-prcomp(md,scale.=T) predict(objPCA,md) predict(objPCA,data.frame(X1=0.3,X2=0.5) screeplot(objPCA) Don’t use scientific notation. Requesting the PCA to be carried out on the covariance matrix (default) rather than the correlation matrix. Use scale.=TRUE to request PCA on correlation matrix Help decide how many PCs to keep when there are many variables Xuhua Xia

A positive definite matrix When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.”. What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all non-zero vectors z with real entries, where z’ is the transpose of z. Given our correlation matrix with all entries being 1, it is easy to find z that lead to z’Mz = 0. So the matrix is not positive definite: if the correlations (off-diagnal elements) are smaller than 1, then the solutions would be complex. Replace the correlation matrix with the covariance matrix and solve for z. Xuhua Xia

SAS Output Principal component scores What’s the variance in PC1? Standard deviations: [1] 1.7320714226 0.0000440773 Rotation: PC1 PC2 X1 0.577347 0.816499 X2 0.816499 -0.577347 PC1 PC2 [1,] -2.19092 0.0000278767 [2,] -1.09545 -0.0000557540 [3,] 0.00000 0.0000000000 [4,] 1.09545 0.0000557540 [5,] 2.19092 -0.0000278767 better to output in variance (eigenvalue) accounted for by each PC eigenvectors: PC1 = 0.57735X1+0.81650X2 Principal component scores What’s the variance in PC1? Xuhua Xia

PCA on correlation matrix (scale.=T) Standard deviations: [1] 1.4142135619 0.0000381717 Rotation: PC1 PC2 X1 0.707107 0.707107 X2 0.707107 -0.707107 PC1 PC2 [1,] -1.788850 0.0000241421 [2,] -0.894435 -0.0000482837 [3,] 0.000000 0.0000000000 [4,] 0.894435 0.0000482837 [5,] 1.788850 -0.0000241421 Xuhua Xia

The Eigenvalue Problem The covariance matrix. The Eigenvalue is the set of values that satisfy this condition. The resulting eigenvalues (There are n eigenvalues for n variables). The sum of eigenvalues is equal to the sum of variances in the covariance matrix. Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem). Xuhua Xia

Get the Eigenvectors An eigenvector is a vector (x) that satisfies the following condition: A x = x In our case A is a variance-covariance matrix of the order of 2, and a vector x is a vector specified by x1 and x2. Xuhua Xia

Get the Eigenvectors We want to find an eigenvector of unit length, i.e., x12 + x22 = 1 We therefore have From Previous Slide Solve x1 The first eigenvector is one associated with the largest eigenvalue. Xuhua Xia

Get the PC Scores First PC score Original data (x and y) Eigenvectors Second PC score The original data in a two dimensional space is reduced to one dimension.. Xuhua Xia

Crime Data in 50 States PROC PRINCOMP OUT=CRIMCOMP; STATE MURDER RAPE ROBBE ASSAU BURGLA LARCEN AUTO ALABAMA 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7 ALASKA 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3 ARIZONA 9.5 34.2 138.2 312.3 2346.1 4467.4 439.5 ARKANSAS 8.8 27.6 83.2 203.4 972.6 1862.1 183.4 CALIFORNIA 11.5 49.4 287.0 358.0 2139.4 3499.8 663.5 COLORADO 6.3 42.0 170.7 292.9 1935.2 3903.2 477.1 CONNECTICUT 4.2 16.8 129.5 131.8 1346.0 2620.7 593.2 DELAWARE 6.0 24.9 157.0 194.2 1682.6 3678.4 467.0 FLORIDA 10.2 39.6 187.9 449.1 1859.9 3840.5 351.4 GEORGIA 11.7 31.1 140.5 256.5 1351.1 2170.2 297.9 HAWAII 7.2 25.5 128.0 64.1 1911.5 3920.4 489.4 IDAHO 5.5 19.4 39.6 172.5 1050.8 2599.6 237.6 ILLINOIS 9.9 21.8 211.3 209.0 1085.0 2828.5 528.6 . . . . . . . . PROC PRINCOMP OUT=CRIMCOMP; Xuhua Xia

STATE MURDER RAPE ROBBE ASSAU BURGLA LARCEN AUTO Alabama 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7 Alaska 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3 Arizona 9.5 34.2 138.2 312.3 2346.1 4467.4 439.5 Arkansas 8.8 27.6 83.2 203.4 972.6 1862.1 183.4 California 11.5 49.4 287.0 358.0 2139.4 3499.8 663.5 Colorado 6.3 42.0 170.7 292.9 1935.2 3903.2 477.1 Connecticut 4.2 16.8 129.5 131.8 1346.0 2620.7 593.2 Delaware 6.0 24.9 157.0 194.2 1682.6 3678.4 467.0 Florida 10.2 39.6 187.9 449.1 1859.9 3840.5 351.4 Georgia 11.7 31.1 140.5 256.5 1351.1 2170.2 297.9 Hawaii 7.2 25.5 128.0 64.1 1911.5 3920.4 489.4 Idaho 5.5 19.4 39.6 172.5 1050.8 2599.6 237.6 Illinois 9.9 21.8 211.3 209.0 1085.0 2828.5 528.6 Indiana 7.4 26.5 123.2 153.5 1086.2 2498.7 377.4 Iowa 2.3 10.6 41.2 89.8 812.5 2685.1 219.9 Kansas 6.6 22.0 100.7 180.5 1270.4 2739.3 244.3 Kentucky 10.1 19.1 81.1 123.3 872.2 1662.1 245.4 Louisiana 15.5 30.9 142.9 335.5 1165.5 2469.9 337.7 Maine 2.4 13.5 38.7 170.0 1253.1 2350.7 246.9 Maryland 8.0 34.8 292.1 358.9 1400.0 3177.7 428.5 Massachusetts 3.1 20.8 169.1 231.6 1532.2 2311.31140.1 Michigan 9.3 38.9 261.9 274.6 1522.7 3159.0 545.5 Minnesota 2.7 19.5 85.9 85.8 1134.7 2559.3 343.1 Mississippi 14.3 19.6 65.7 189.1 915.6 1239.9 144.4 Missouri 9.6 28.3 189.0 233.5 1318.3 2424.2 378.4 Montana 5.4 16.7 39.2 156.8 804.9 2773.2 309.2 Nebraska 3.9 18.1 64.7 112.7 760.0 2316.1 249.1 Nevada 15.8 49.1 323.1 355.0 2453.1 4212.6 559.2 New Hampshire 3.2 10.7 23.2 76.0 1041.7 2343.9 293.4 New Jersey 5.6 21.0 180.4 185.1 1435.8 2774.5 511.5

Crime data (cont.) New Mexico 8.8 39.1 109.6 343.4 1418.7 3008.6 259.5 New York 10.7 29.4 472.6 319.1 1728.0 2782.0 745.8 North Carolina 10.6 17.0 61.3 318.3 1154.1 2037.8 192.1 North Dakota 0.9 9.0 13.3 43.8 446.1 1843.0 144.7 Ohio 7.8 27.3 190.5 181.1 1216.0 2696.8 400.4 Oklahoma 8.6 29.2 73.8 205.0 1288.2 2228.1 326.8 Oregon 4.9 39.9 124.1 286.9 1636.4 3506.1 388.9 Pennsylvania 5.6 19.0 130.3 128.0 877.5 1624.1 333.2 Rhode Island 3.6 10.5 86.5 201.0 1489.5 2844.1 791.4 South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1 South Dakota 2.0 13.5 17.9 155.7 570.5 1704.4 147.5 Tennessee 10.1 29.7 145.8 203.9 1259.7 1776.5 314.0 Texas 13.3 33.8 152.4 208.2 1603.1 2988.7 397.6 Utah 3.5 20.3 68.8 147.3 1171.6 3004.6 334.5 Vermont 1.4 15.9 30.8 101.2 1348.2 2201.0 265.2 Virginia 9.0 23.3 92.1 165.7 986.2 2521.2 226.7 Washington 4.3 39.6 106.2 224.8 1605.6 3386.9 360.3 West Virginia 6.0 13.2 42.2 90.9 597.4 1341.7 163.3 Wisconsin 2.8 12.9 52.2 63.7 846.9 2614.2 220.7 Wyoming 5.4 21.9 39.7 173.9 811.6 2772.2 282.0 md<-read.fwf("crime.txt",c(14,6,5,6,6,7,7,6),header=T) attach(md) cor(md[,2:8] objPCA<-prcomp(md[,2:8],scale.=T) objPCA summary(objPCA) PCScore<-predict(objPCA,md) If you copy the data to a text file, add a top line with a comment sign #, otherwise you need to specify the 'sep=' with read.fwf

Correlation Matrix MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO MURDER 1.0000 0.6012 0.4837 0.6486 0.3858 0.1019 0.0688 RAPE 0.6012 1.0000 0.5919 0.7403 0.7121 0.6140 0.3489 ROBBERY 0.4837 0.5919 1.0000 0.5571 0.6372 0.4467 0.5907 ASSAULT 0.6486 0.7403 0.5571 1.0000 0.6229 0.4044 0.2758 BURGLARY 0.3858 0.7121 0.6372 0.6229 1.0000 0.7921 0.5580 LARCENY 0.1019 0.6140 0.4467 0.4044 0.7921 1.0000 0.4442 AUTO 0.0688 0.3489 0.5907 0.2758 0.5580 0.4442 1.0000 If variables are not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or lower triangular matrix. Xuhua Xia

Eigenvalues screeplot(objPCA,type = "lines") > summary(objPCA) Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 Standard deviation 2.029 1.113 0.852 0.5625 0.5079 0.4712 0.3522 Proportion of Variance 0.588 0.177 0.104 0.0452 0.0369 0.0317 0.0177 Cumulative Proportion 0.588 0.765 0.869 0.9137 0.9506 0.9823 1.0000 screeplot(objPCA,type = "lines") Xuhua Xia

Eigenvectors Do these eigenvectors mean anything? PC1 PC2 PC3 PC4 PC5 PC6 PC7 MURDER -0.3003 -0.62918 0.17824 -0.23216 0.53810 0.25912 0.267589 RAPE -0.4318 -0.16944 -0.24421 0.06219 0.18848 -0.77327 -0.296490 ROBBE -0.3969 0.04224 0.49588 -0.55793 -0.52002 -0.11439 -0.003902 ASSAU -0.3966 -0.34353 -0.06953 0.62984 -0.50660 0.17235 0.191751 BURGLA -0.4402 0.20334 -0.20990 -0.05757 0.10101 0.53599 -0.648117 LARCEN -0.3574 0.40233 -0.53922 -0.23491 0.03008 0.03941 0.601688 AUTO -0.2952 0.50241 0.56837 0.41922 0.36980 -0.05729 0.147044 Do these eigenvectors mean anything? All crimes are negatively correlated with the first eigenvector, which is therefore interpreted as a measure of overall safety. The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted to measure the preponderance of property crime over violent crime…... Xuhua Xia

biplot(objPCA) Xuhua Xia

Plot PC1 and PC2 2 1 PC2 -1 -2 -4 -2 2 4 PC1 Massachusetts Rhode Island 2 Hawaii Connecticut Delaware 1 Minnesota Colorado New Jersey Utah Vermont Arizona New Hampshire Iowa Washington Wisconsin Oregon Maine New York North Dakota Montana Alaska Nebraska PC2 California Michigan Illinois Ohio Indiana Wyoming Kansas Idaho Nevada Maryland Pennsylvania South Dakota Florida Missouri Texas Oklahoma Virginia -1 West Virginia New Mexico Tennessee Kentucky Georgia Arkansas North Carolina -2 South Carolina Louisiana Alabama Mississippi -4 -2 2 4 PC1

PC Plot: Crime Data Maryland Nevada, New York, California North and South Dakota Mississippi, Alabama, Louisiana, South Carolina Xuhua Xia

Steps in a PCA Generate a correlation or variance-covariance matrix Obtain eigenvalues and eigenvectors Generate principal component (PC) scores Choose the number of PCs Plot the PC scores in the space with reduced dimensions When to use a correlation matrix: 1. When different units are used for different variables 2. When data are from species of very different mean densities Xuhua Xia