PCA for analysis of complex multivariate data

Interpretation of large data tables by PCA
In industry, research and finance the amount of data is often very large.
Little information is available a priori.
There is a need for methods that rest on few assumptions and give a simple, easily understandable overview:
– Overall broad interpretation
– Ideas for further analyses
– Generating hypotheses
PCA is such a method!

PCA used for
Interpretation
Pre-processing for regression
Classification
SPC (statistical process control)
Noise reduction
Pre-processing for other statistical analyses

Examples of use in industry
Process monitoring
Sensory analysis (tasting etc.)
– Product development and quality control
Rheological measurements
Process prediction
Spectroscopy (NIR and other)

Examples of use outside industry
Psychology
Food science
Information retrieval systems
Consumer studies, marketing

PCA
1. Compresses the information
– Finds the directions with most variability
– Projects the information down onto these dimensions
2. Presents the information in simple plots
– Scores plot: projection of the data onto the subspace
– Loadings plot: plot of the relation between the original variables and the subspace dimensions
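The two compression steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used to produce the slides: it assumes mean-centred data and uses the singular value decomposition to find the directions; the function name `pca` and the simulated data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Return scores T, loadings P and the column means of X."""
    mean = X.mean(axis=0)
    Xc = X - mean                       # centre each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T             # loadings: directions with most variability
    T = Xc @ P                          # scores: projection onto those directions
    return T, P, mean

rng = np.random.default_rng(0)
# Hypothetical data: 50 objects, 3 variables, most variation along one direction
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[2.0, 1.0, -1.0]]) + 0.1 * rng.normal(size=(50, 3))

T, P, mean = pca(X, n_components=2)
print(T.shape, P.shape)
```

The columns of `T` are what a scores plot shows, and the columns of `P` are what a loadings plot shows.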

Data structure for PCA, data matrix
Rows are objects ("samples")
Columns are variables

Scatter plots, vectors
A vector x = (x1, x2, …, xK) can be plotted as a point.
If several vectors are plotted together, the result is called a scatter plot.

[Figure: the vector x = (x1, x2, x3) plotted against the axes x1, x2 and x3]

Principal component analysis
[Figure: the data matrix X (rows: objects, columns: variables) is passed to PCA, which yields a scores plot, a loadings plot and other results]

[Figure: data X in the space of x1, x2 and x3, with the directions PC 1 and PC 2]

PCA model
X = TP^T + E
The matrix X is modelled as components (systematic effects), TP^T, plus residuals, E (noise). T holds the scores and P the loadings.
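As a rough numerical check of this model, the sketch below splits a centred data matrix into a systematic part and residuals. The data, the choice of two components (`A = 2`) and the use of mean-centring are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))            # hypothetical data: 20 objects, 4 variables
Xc = X - X.mean(axis=0)                 # the model is applied to centred data here

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                   # number of components kept (an assumption)
T = U[:, :A] * s[:A]                    # scores
P = Vt[:A].T                            # loadings
E = Xc - T @ P.T                        # residuals

# Systematic part plus residuals gives back the centred data
print(np.allclose(T @ P.T + E, Xc))
# The residuals carry nothing along the retained loading directions
print(np.allclose(E @ P, 0.0))
```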

The main plots
Scores plot
– For interpreting relations among samples
Loadings plot
– For interpreting relations among variables
Explained variance plot

[Figure: scores plot, the projection T, with axes t1 (PC1) and t2 (PC2), labelled 70% and 25%]

[Figure: loadings plot showing the variables x1, x2 and x3 against pc1 and pc2]

Loadings plots
Usually 2-dimensional.
For spectroscopy and other continuous measurements, 1-dimensional plots are used.

Guidelines for how to interpret the plots
Variables that lie close together are highly correlated.
Samples that lie close together are similar.
Variables on opposite sides of the origin are negatively correlated.
Objects on the right are dominated by the variables on the right, and so on.

Variance per component
The sum of the variances of the original x-variables is equal to the sum of the variances of the scores.
We can therefore talk about the variance per component and the explained variance (in %) per component.
Both can be presented cumulatively or non-cumulatively.
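The equality of the two variance sums can be checked numerically. The following sketch uses simulated data and keeps all components; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 30 objects, 5 variables with different spreads
X = rng.normal(size=(30, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                               # scores for all components

var_x = Xc.var(axis=0, ddof=1)          # variance of each original variable
var_t = T.var(axis=0, ddof=1)           # variance of each score vector

explained = 100 * var_t / var_t.sum()   # explained variance (%) per component
cumulative = np.cumsum(explained)       # cumulative explained variance (%)

print(var_x.sum(), var_t.sum())
print(explained)
print(cumulative)
```

`explained` is what a non-cumulative plot shows and `cumulative` is what a cumulative plot shows.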

[Figure: cumulative plot of explained variance (in % or absolute units, rising towards 100%) against the number of components]

[Figure: non-cumulative plot of explained variance (in % or absolute units) against the number of components (1, 2, 3)]
Bar plots can also be used.

Sensory analysis of sausages
Goals of the analysis
Investigate the possibility of using dairy ingredients in sausages
– Type and concentration
– Focus on sensory properties
Investigate the interaction of dairy ingredients with other ingredients and process parameters
Characterise the differences among the dairy ingredients used in sausages

Sensory analysis of sausages
Factorial design in 4 variables
– 5 dairy ingredients: Na caseinate, Na caseinate (high viscosity), skim milk, whey protein, demineralised whey powder
– 3 concentration levels: 1%, 3% and 5%
– 2 starch levels: 2% and 4%
– 2 cooking temperatures: 76 and 82 degrees C
Published: Baardseth et al, J. Food Science.

Variables/attributes used
Graininess, Stickiness, Firmness, Juiciness, Fatness, Elasticity, Colour hue, Colour intensity, Whiteness, Meat taste, Off-taste, Rancidity, Smokiness

[Figure: plot annotated with 70%]

Loadings and scores
The scores are split up according to ingredient on the next slide.

[Figure: scores split up by ingredient (Demineralised whey powder, Na caseinate, Na caseinate (high viscosity), Skim milk, Whey protein), marked as above average or below average]
Can also be done using colours.

We have obtained information about
Which samples are similar
Which variables are similar or very different
Which samples are characterised by which variables
Which design variables are most important for the variation
Differences among the ingredients

Pre-processing
If the variables are in very different units, it may be advantageous to standardise them prior to PCA:
X_new = X_old / std(X) for each variable
Be aware of noise! This can be tested by ANOVA or replicates.
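A minimal sketch of this standardisation is shown below. The function name `standardise`, the simulated data and the use of the sample standard deviation (`ddof=1`) are assumptions for illustration.

```python
import numpy as np

def standardise(X):
    """Divide each variable (column) by its standard deviation."""
    std = X.std(axis=0, ddof=1)         # sample standard deviation per variable
    if np.any(std == 0):
        raise ValueError("a constant variable cannot be standardised")
    return X / std

rng = np.random.default_rng(4)
# Hypothetical variables measured on very different scales
scales = np.array([500.0, 0.3, 5.0, 3.0])
X_old = rng.normal(size=(40, 4)) * scales

X_new = standardise(X_old)
print(X_new.std(axis=0, ddof=1))        # every variable now has the same spread
```

After this step every variable contributes equally to the total variance, so a variable that is mostly noise is given the same weight as an informative one.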

[Figure: standard deviations of Viscosity, pH, Water content and Temp]
Variables of different types are difficult to compare.

Pre-processing
In spectroscopy, standardisation is usually not done.
It is very important if measurements from different instruments are used together.

Outlier detection
Outliers may always be present.
They influence the solution.
They may carry new information.
It is important to detect them.

Tools for outlier detection
Residuals (E in the PCA model)
– Plot residuals per object
– Compute the sum of squared residuals per object
Leverage: distance to the mean within the model space (Mahalanobis distance)
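Both tools can be computed directly from a fitted PCA model. The sketch below is one common formulation, not necessarily the one behind the slides: leverage is taken as 1/n plus the squared scores scaled by each component's sum of squares. The simulated data and the planted outlier are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))            # hypothetical data: 25 objects, 4 variables
X[0] += 8.0                             # plant one extreme object
n = X.shape[0]
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                   # number of components kept (an assumption)
T = U[:, :A] * s[:A]                    # scores
P = Vt[:A].T                            # loadings
E = Xc - T @ P.T                        # residuals

# Sum of squared residuals per object: distance from the model plane
ssr = (E ** 2).sum(axis=1)

# Leverage per object: scaled distance to the mean within the model plane
leverage = 1.0 / n + (T ** 2 / s[:A] ** 2).sum(axis=1)

print(np.argmax(leverage), leverage.max())
print(ssr)
```

Objects with a large `ssr` lie far from the PCA plane, while objects with a large `leverage` lie far from the centre within the plane.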

[Figure: the PCA plane in the space of x1, x2 and x3, showing the "normal samples", a residual e and a leverage point]

Validation
Plots: how natural is the solution? Relate it to prior knowledge and to the design.
Look for a steep increase in explained variance.
Cross-validation can also be used
– Leave out one sample, fit the model on the rest and test it on the left-out sample. Repeat for all samples. Compute the explained prediction variance.
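The leave-one-out procedure can be sketched as follows. This is a simplified version under stated assumptions: the left-out sample is projected onto loadings fitted on the remaining samples, and the explained prediction variance is taken as one minus the ratio of the prediction residual sum of squares to the total sum of squares. Because the left-out sample is used to compute its own scores, the result tends to be optimistic; the function name and data are hypothetical.

```python
import numpy as np

def loo_explained_prediction_variance(X, n_components):
    """Leave-one-out explained prediction variance (%) for a PCA model."""
    press = 0.0                         # prediction residual sum of squares
    total = 0.0                         # total sum of squares of left-out samples
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)             # leave out sample i
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        P = Vt[:n_components].T                     # loadings from the rest
        x = X[i] - mean                             # centre the left-out sample
        e = x - P @ (P.T @ x)                       # its residual after projection
        press += e @ e
        total += x @ x
    return 100.0 * (1.0 - press / total)

rng = np.random.default_rng(5)
# Hypothetical data: 20 objects, 3 variables, one dominant direction
latent = rng.normal(size=(20, 1))
X = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.2 * rng.normal(size=(20, 3))

for a in (1, 2, 3):
    print(a, loo_explained_prediction_variance(X, a))
```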