Krishna Rajan, Data Dimensionality Reduction: Introduction to Principal Component Analysis. Case Study: Multivariate Analysis of Chemistry-Property Data.

Presentation transcript:

Data Dimensionality Reduction: Introduction to Principal Component Analysis
Case Study: Multivariate Analysis of Chemistry-Property Data in Molten Salts
C. Suh (1), S. Graduciz (2), M. Gaune-Escard (2), K. Rajan (1)
Combinatorial Sciences and Materials Informatics Collaboratory
(1) Iowa State University; (2) CNRS, Marseilles, France

PRINCIPAL COMPONENT ANALYSIS: PCA
From a set of N correlated descriptors, we can derive a set of N uncorrelated descriptors (the principal components). Each principal component (PC) is a suitable linear combination of all the original descriptors. PCA reduces the dimensionality of the information that must be drawn from vast arrays of data, in such a way that there is minimal loss of information. (From J. Bajorath, Integration of Virtual and High Throughput Screening, Nature Reviews Drug Discovery 1 (2002); and K. Rajan, Materials Informatics, Materials Today, October.)

Functionality 1 = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)   (I)
Functionality 2 = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)

PC1 = A1·x1 + A2·x2 + A3·x3 + A4·x4 + ...   (II)
PC2 = B1·x1 + B2·x2 + B3·x3 + B4·x4 + ...
PC3 = C1·x1 + C2·x2 + C3·x3 + C4·x4 + ...

x1 = f(x2), x2 = g(x3), x3 = h(x4), ...   (III)
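To make the relation between blocks (II) and (III) concrete, here is a minimal NumPy sketch (synthetic data standing in for the descriptors, not the author's code): the coefficient sets (A1..., B1..., C1...) emerge as eigenvectors of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Synthetic stand-in for descriptors obeying relations like x2 = g(x1): 4 correlated columns.
X = np.column_stack([x1,
                     0.9 * x1 + 0.1 * rng.normal(size=200),
                     0.8 * x1 + 0.2 * rng.normal(size=200),
                     rng.normal(size=200)])
Xc = X - X.mean(axis=0)                          # center each descriptor

eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = V[:, np.argsort(eigvals)[::-1]]              # columns = coefficient sets (A1..A4), (B1..B4), ...

for k, name in enumerate(["A", "B", "C"]):
    print(f"PC{k+1} coefficients ({name}1..{name}4):", np.round(V[:, k], 3))
```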

DIMENSIONALITY REDUCTION: Case study
A database of molten salts tabulates numerous properties for each chemistry. What can we learn beyond a "search and retrieve" function? Can we find multivariate correlations among all chemistries and properties? The challenge is to reduce the dimensionality of the data set.

Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
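This defining property can be checked directly; a quick sketch with scikit-learn's PCA on arbitrary correlated data (an illustrative choice of library, not part of the original slides):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # arbitrary correlated data

ratios = PCA().fit(X).explained_variance_ratio_
print(ratios)  # PC1 explains the most variance, PC2 the next most, and so on
assert all(a >= b for a, b in zip(ratios, ratios[1:]))   # non-increasing by construction
```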

Dimensionality Reduction of Molten Salts Data
(Janz's Molten Salts Database: 1700 chemistries with 7 variables)
Melting point = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)
Density = F(x1, x2, x3, x4, x5, x6, x7, x8, ...)
x1 = f(x2), x2 = g(x3), x3 = h(x4), ...
where xi = molten salt compound chemistries

Mathematically, PCA relies on the fact that the descriptors are interrelated, and in some instances these correlations are high. PCA rotates the coordinate system in such a way that the new axes point along the directions of maximum variation (covariance). This description can be condensed into a so-called eigenvalue problem: the data matrix X is decomposed into two matrices, T and P (X = T Pᵀ), and the two matrices P and T are orthogonal. The matrix P is usually called the loadings matrix, and the matrix T is called the scores matrix. The eigenvectors of the covariance matrix constitute the principal components, and the corresponding eigenvalues indicate how much "information" is contained in the individual components.
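A sketch of this eigenvalue problem and the resulting T/P decomposition in NumPy, assuming a placeholder random data matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # placeholder data matrix
Xc = X - X.mean(axis=0)                                  # PCA operates on centered data

# The eigenvalue problem on the covariance matrix.
eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
idx = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[idx], P[:, idx]   # P: loadings (eigenvectors, one column per PC)

T = Xc @ P                             # T: scores (the data in the rotated coordinate system)

assert np.allclose(Xc, T @ P.T)        # the decomposition X = T P^T is exact when all PCs are kept
print(eigvals)                         # eigenvalues: variance ("information") per component
```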

The loadings can be understood as the weights of each original variable in calculating a principal component. The matrix T contains the original data in the rotated coordinate system. The mathematical analysis involves finding these new "data" matrices T and P. The number of dimensions of T (i.e., its rank) needed to capture most of the information in the entire data set (i.e., all of its variables) is far less than that of X (ideally 2 or 3). One can thus compress the N-dimensional plot of the data matrix X into a 2- or 3-dimensional plot of T and P.
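A hedged sketch of that compression, using synthetic data that is nearly two-dimensional by construction so that 2 PCs suffice:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 2)) @ rng.normal(size=(2, 6))
X += 0.05 * rng.normal(size=X.shape)    # nearly rank-2 data plus a little noise
Xc = X - X.mean(axis=0)

eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
P2 = P[:, np.argsort(eigvals)[::-1]][:, :2]   # keep only the first 2 columns of loadings

T2 = Xc @ P2                            # 100 x 2 score matrix: the compressed "plot"
X_hat = T2 @ P2.T                       # reconstruction from just 2 dimensions
rel_err = np.linalg.norm(Xc - X_hat) / np.linalg.norm(Xc)
print(f"relative reconstruction error: {rel_err:.3f}")   # small, since the data is ~rank 2
```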

The first principal component accounts for the maximum variance (eigenvalue) in the original dataset. The second, third, and higher-order principal components are orthogonal (uncorrelated) to the first and account for most of the remaining variance. A new row space is constructed in which to plot the data, where the axes represent weighted linear combinations of the variables affecting the data. Each of these linear combinations is independent of the others and hence orthogonal. The data plotted in this new space is essentially a correlation plot, where the position of each data point captures not only all the influences of the variables on that point but also its relative influence compared to the other data points.
PC1 = A1·x1 + A2·x2 + A3·x3 + A4·x4 + ...
PC2 = B1·x1 + B2·x2 + B3·x3 + B4·x4 + ...
PC3 = C1·x1 + C2·x2 + C3·x3 + C4·x4 + ...
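Both claims, orthogonality of the new axes and uncorrelatedness of the projected data, are easy to verify numerically; an illustrative check:

```python
import numpy as np

rng = np.random.default_rng(4)
Xc = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # illustrative data
Xc -= Xc.mean(axis=0)

_, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
T = Xc @ P

print(np.allclose(P.T @ P, np.eye(4)))        # True: the PC axes are orthonormal
C = np.cov(T, rowvar=False)
print(np.allclose(C, np.diag(np.diag(C))))    # True: the scores are uncorrelated
```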

[Scree plot: eigenvalue vs. principal component number (PC1, PC2, PC3, PC4, PC5, ...), showing minimal contribution to additional information content beyond the higher-order principal components.]
The "scree" plot helps to identify the number of PCs needed to capture the reduced dimensionality. NB: depending upon the nature of the data set, this can be within 2, 3, or more principal components, but still fewer than the number of variables in the original data set.
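A minimal way to draw such a scree plot (matplotlib, with random stand-in data of 7 variables to mirror the case study):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 7)) @ rng.normal(size=(7, 7))   # stand-in for a 7-variable table
Xc = X - X.mean(axis=0)

eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()   # look for the "elbow" where added PCs stop contributing information
```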

Thus the mth PC is orthogonal to all others and has the mth-largest variance in the set of PCs. Once the N PCs have been calculated using eigenvalue/eigenvector matrix operations, only PCs with variances above a critical level are retained (scree test). The resulting M-dimensional principal component space retains most of the information from the initial N-dimensional descriptor space by projecting it onto orthogonal axes of high variance. The complex tasks of prediction or classification are made easier in this compressed, reduced-dimensional space.
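One common form of that retention rule, sketched here as a cumulative-variance cutoff rather than a formal scree test; the function name and the 90% threshold are illustrative assumptions:

```python
import numpy as np

def n_components_to_keep(X, threshold=0.90):
    """Smallest M whose PCs explain at least `threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 7)) @ rng.normal(size=(7, 3)) @ rng.normal(size=(3, 7))
print(n_components_to_keep(X))   # at most 3 for this nearly rank-3 example
```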

PCA: algorithmic summary
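The algorithm figure from this slide is not reproduced in the transcript; the following NumPy sketch summarizes the standard sequence of steps such a summary typically depicts:

```python
import numpy as np

def pca(X, n_components):
    """PCA via the covariance-matrix eigenvalue problem, step by step."""
    # 1. Mean-center each variable (column).
    Xc = X - X.mean(axis=0)
    # 2. Form the covariance matrix of the variables.
    C = np.cov(Xc, rowvar=False)
    # 3. Solve the eigenvalue problem; eigh suits symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Sort components by decreasing variance (eigenvalue).
    idx = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # 5. Retain the leading loadings and project to get the scores.
    P = eigvecs[:, :n_components]
    T = Xc @ P
    return T, P, eigvals

# Example: reduce a random 5-variable table to 2 PCs.
X = np.random.default_rng(7).normal(size=(50, 3)) @ np.random.default_rng(8).normal(size=(3, 5))
T, P, ev = pca(X, n_components=2)
print(T.shape, P.shape)   # (50, 2) (5, 2)
```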

Dimensionality Reduction of Molten Salts Data
(Janz's Molten Salts Database: 1700 instances with 7 variables)
[Figures: bivariate representation of the data sets vs. multivariate (PCA) representation of the data sets.]
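The Janz database itself is not reproduced here; a hypothetical stand-in table of the same shape (1700 x 7) shows how such a multivariate PCA view is typically produced, including standardization since the properties carry different units:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 1700 x 7 molten-salts table.
rng = np.random.default_rng(9)
X = rng.normal(size=(1700, 3)) @ rng.normal(size=(3, 7)) + 0.3 * rng.normal(size=(1700, 7))

# Standardize before PCA (equivalent to working with the correlation matrix),
# since melting points, densities, etc. are on incommensurate scales.
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(Xs)
scores = pca.transform(Xs)                   # one 2-D point per chemistry
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept in the 2-D view
```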

INTERPRETATIONS OF PRINCIPAL COMPONENT PROJECTIONS
Trends in bonding are captured along the PC1 axis of the scores plot.
Correlations between variables are captured in the loadings plot.
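A generic sketch of the two companion plots described here (random data; the variable names are placeholders, not the molten-salts descriptors):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # illustrative data
Xc = X - X.mean(axis=0)

eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
P = P[:, np.argsort(eigvals)[::-1]]
T = Xc @ P

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(T[:, 0], T[:, 1], s=8)                # scores: one point per sample
ax1.set(xlabel="PC1", ylabel="PC2", title="Scores plot")
for j, name in enumerate(["x1", "x2", "x3", "x4"]):
    ax2.arrow(0, 0, P[j, 0], P[j, 1], head_width=0.02)
    ax2.annotate(name, (P[j, 0], P[j, 1]))        # loadings: one arrow per variable
ax2.set(xlabel="PC1", ylabel="PC2", title="Loadings plot")
plt.tight_layout()
plt.show()
```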

PCA: summary
To summarize, when we start with a multivariate data matrix, PCA permits us to reduce the dimensionality of that data set. This reduction in dimensionality offers better opportunities to:
Identify the strongest patterns in the data
Capture most of the variability of the data with a small fraction of the total set of dimensions
Eliminate much of the noise in the data, which benefits both data mining and other data analysis algorithms