Published by Sierra Verrier. Modified over 2 years ago.

1
Pattern Recognition for the Natural Sciences
Explorative Data Analysis
Principal Component Analysis (PCA)
Lutgarde Buydens, IMM, Analytical Chemistry

2
Why Explorative Data Analysis?
Paradigm change in natural sciences
Classical science: hypothesis driven
[Diagram: "?" and "System"]

3
Why Explorative Data Analysis?
Paradigm change in natural sciences
Classical science: hypothesis driven
Science with advanced technologies: data driven, explorative analysis of data
[Diagrams: "?" and "System" for each of the two approaches]

4
Explorative Data Analysis
Advanced technology:
High throughput (high quality) analysis: NMR, HPLC, GC, MS/MS, immune assays, hybrids
Nano/sensor technology
Genomics (gene expression profiling)
Proteomics, metabolomics
Fingerprinting
Profiling in drug design
Result: an overwhelming amount of data

5
Explorative Data Analysis
Visualization (principal component analysis, projections)
Unsupervised pattern recognition (clustering)
Supervised pattern recognition (classification)
Quantitative analysis (correlations, predictions)

6
Principal Component Analysis: an Example
150 samples of Italian wines from the same region
3 different cultivars
Is it possible to characterise the cultivars?
Which variables are relevant for which cultivars?

7
Data Matrix
$X$: $n \times p$ data matrix with elements $x_{ij}$
n = 150 wine samples (objects, rows $x_i$)
p = 13 properties (variables, columns $x_j$)
Example: $x_{75,7}$ is the flavanoid concentration of sample 75

8
Principal Component Analysis
[Figure: bar plot of 1 wine sample]

9
Principal Component Analysis
[Figures: line plot and bar plot of 1 wine sample]

10
Principal Component Analysis
[Figures: line plot and bar plot of 1 wine sample]

11
Principal Component Analysis
[Figures: line plot and bar plot of 1 wine sample]

12
Data Matrix Representation
$X$ ($n \times p$) with elements $x_{ij}$
Rows $x_i$, $i = 1, \dots, n$: # samples
Columns $x_j$, $j = 1, \dots, p$: # properties

13
Data Matrix Representation
$X$ ($150 \times 13$) with elements $x_{ij}$
p (13)-dimensional variable space $S^p$: the 150 samples are points, each given by its row $x_i$ (e.g. sample 75)

14
Data Matrix Representation
$X$ ($150 \times 13$) with elements $x_{ij}$
p (13)-dimensional variable space $S^p$: the 150 samples are points, each given by its row $x_i$ (e.g. sample 75)
n (150)-dimensional object space $S^n$: the 13 variables are points, each given by its column $x_j$ (e.g. property 7, flavanoids)
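
The two views of the data matrix can be sketched in a few lines of Python. This is an illustration only: the 4 x 3 matrix is made up (the wine data are not reproduced in these slides), the variable names are hypothetical, and Python indices are zero-based whereas the slides count from 1.

```python
# Hypothetical data matrix: 4 samples (rows) x 3 properties (columns).
X = [
    [12.1, 2.4, 3.0],
    [13.0, 1.9, 2.7],
    [12.6, 2.8, 0.9],
    [14.2, 1.7, 3.4],
]
n, p = len(X), len(X[0])  # n objects, p variables

i, j = 2, 1  # zero-based: third sample, second property

# Variable space S^p: a sample is a row, i.e. a point with p coordinates.
sample_i = X[i]

# Object space S^n: a variable is a column, i.e. a point with n coordinates.
columns = [list(col) for col in zip(*X)]
variable_j = columns[j]

# The single element x_ij belongs to both views.
x_ij = X[i][j]
print(f"sample {i}: {sample_i}")
print(f"variable {j}: {variable_j}")
print(f"x_ij = {x_ij}")
```

Reading the same numbers by rows or by columns is what later gives the score plot (samples) and the loading plot (variables).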

15
Explorative Data Analysis

16
Principal Component Analysis
PCA: visualization by projection in 2 dimensions
150 samples: projected from the p (13)-dim. space of variables $S^p$ onto an r (2)-dim. space of variables $S^2$ with axes lv 1 and lv 2
13 variables: projected from the n (150)-dim. space of objects $S^n$ onto an r (2)-dim. space of objects $S^2$ with axes lv 1 and lv 2

17
Principal Component Analysis
[Figure: 12 samples in the space $S^3$ of 3 variables, axes $x_1$, $x_2$, $x_3$]

18
Principal Component Analysis
[Figure: 12 samples in the space $S^3$ of 3 variables, axes $x_1$, $x_2$, $x_3$]

19
Principal Component Analysis
$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$
[Figure: 12 samples in $S^3$ (axes $x_1$, $x_2$, $x_3$) with the PC 1 axis]

20
Principal Component Analysis
$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$
Criterion: maximum variance of the projections (x)
[Figure: 12 samples in $S^3$ (axes $x_1$, $x_2$, $x_3$) and their projections (x) onto the PC 1 axis]

21
Principal Component Analysis
$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$
$\mathrm{PC}_2 = l_{21} x_1 + l_{22} x_2 + l_{23} x_3$
Criterion: maximum variance of the projections (x)
[Figure: 12 samples in $S^3$ (axes $x_1$, $x_2$, $x_3$) with the PC 1 and PC 2 axes and the projections (x)]
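
A minimal sketch of the maximum-variance criterion, under stated assumptions. The slides do not say how the loadings are computed; power iteration on the covariance matrix is used here only because it is short. The data (6 samples, 3 variables) and all function names are hypothetical.

```python
import math

# Hypothetical data: 6 samples (rows) x 3 variables (columns x1, x2, x3).
X = [
    [2.0, 1.0, 0.5],
    [3.0, 2.1, 0.4],
    [4.0, 2.9, 0.7],
    [5.0, 4.2, 0.5],
    [6.0, 4.8, 0.6],
    [7.0, 6.1, 0.3],
]
n, p = len(X), len(X[0])

# Mean-center each column so that variance is measured around the centroid.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]

# Sample covariance matrix C (p x p).
C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)] for a in range(p)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def leading_direction(M, iters=500):
    """Power iteration: unit vector l maximising l^T M l, and that maximum."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v, dot(v, matvec(M, v))

# PC 1: loadings l1 = (l11, l12, l13) and the variance along that direction.
l1, var1 = leading_direction(C)

# PC 2: remove the variance captured by PC 1 (deflation), then repeat.
C_deflated = [[C[a][b] - var1 * l1[a] * l1[b] for b in range(p)] for a in range(p)]
l2, var2 = leading_direction(C_deflated)

# Scores: each sample's coordinate on PC 1 and PC 2 (a linear combination).
pc1_scores = [dot(row, l1) for row in Xc]
pc2_scores = [dot(row, l2) for row in Xc]
print("PC 1 loadings:", [round(x, 3) for x in l1])
print("PC 2 loadings:", [round(x, 3) for x in l2])
```

The variance of `pc1_scores` is at least as large as the variance of any single original variable, which is the criterion stated on the slide. `l2` comes out orthogonal to `l1`.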

22
Principal Components Space
[Figure: 12 samples in the 2-dim. space $S^2$, axes PC 1 and PC 2]

23
Principal Component Analysis: Score plot
150 samples projected from the p (13)-dim. space of variables $S^p$ onto the r (2)-dim. space $S^2$ with axes pc 1 and pc 2

24
Principal Component Analysis: Score plot
150 samples projected from the p (13)-dim. space of variables $S^p$ onto the r (2)-dim. space $S^2$ with axes pc 1 and pc 2
[Figure: wine data score plot, axes PC1 (38%) and PC2 (20%)]

25
Principal Component Analysis: Loading plot
13 variables projected from the n (150)-dim. space of objects $S^n$ onto the 2-dim. space $S^2$ with axes pc 1 and pc 2

26
Principal Component Analysis: Loading plot
13 variables projected from the n (150)-dim. space of objects $S^n$ onto the 2-dim. space $S^2$ with axes pc 1 and pc 2
[Figure: wine data loading plot, axes PC1 (38%) and PC2 (20%)]

27
Singular Value Decomposition (SVD)
$X_{n \times p} = U_{n \times r} D_{r \times r} V^T_{r \times p}$
Columns of $U$: left singular vectors (PC scores)
Columns of $V$: right singular vectors (PC loadings)
$U^T U = V^T V = I$
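
The decomposition can be checked numerically on a small case. The sketch below assumes a mean-centered matrix with only 2 columns, so that the eigenvectors of $X^T X$ have a closed form. The data and the function name `svd_two_columns` are hypothetical, and a general-purpose SVD routine would be used in practice. Note that some texts call $UD$, rather than $U$, the scores.

```python
import math

def svd_two_columns(X):
    """Thin SVD X = U D V^T for an n x 2 matrix of rank 2.

    Uses the closed-form eigenvectors of the symmetric 2 x 2 matrix X^T X.
    Returns U (n x 2), the singular values d (length 2) and V (2 x 2).
    """
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    c = sum(r[1] * r[1] for r in X)
    # Rotation angle of the principal axis of [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    V = [[cs, -sn], [sn, cs]]  # columns are the right singular vectors
    # Projections of the rows of X onto the columns of V (equal to U D).
    T = [[r[0] * V[0][k] + r[1] * V[1][k] for k in range(2)] for r in X]
    d = [math.sqrt(sum(t[k] ** 2 for t in T)) for k in range(2)]
    U = [[t[k] / d[k] for k in range(2)] for t in T]
    return U, d, V

# Hypothetical mean-centered data: 5 samples x 2 variables.
X = [[-2.0, -1.1], [-1.0, -0.4], [0.0, 0.2], [1.0, 0.3], [2.0, 1.0]]

U, d, V = svd_two_columns(X)

# Rebuild X from the three factors: X_ij = sum_k U_ik d_k V_jk.
X_rebuilt = [[sum(U[i][k] * d[k] * V[j][k] for k in range(2)) for j in range(2)]
             for i in range(len(X))]

# U^T U should be the 2 x 2 identity matrix.
UtU = [[sum(row[a] * row[b] for row in U) for b in range(2)] for a in range(2)]
print("singular values:", [round(x, 4) for x in d])
```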

28
Principal Component Analysis: Biplot
Score plot: 150 samples, projected from $S^p$ (p = 13) onto $S^2$ with axes pc 1 and pc 2
Loading plot: 13 variables, projected from $S^n$ (n = 150) onto $S^2$ with axes pc 1 and pc 2
BIPLOT: 150 samples + 13 variables in one plot with axes pc 1 and pc 2

29
Principal Component Analysis: an Example
[Figure: axes PC1 (38%) and PC2 (20%)]

30
Principal Component Analysis: Some Issues
How many PCs?
Scaling
Outliers

31
How many PCs?
[Figure: cumulative % of variance (up to 100%) against no. of PCs]
[Figure: scree plot, log variance against no. of PCs]
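
Both diagnostics can be computed from the variances of the PCs. The sketch below uses made-up variances, a base-10 logarithm and a 90% cut-off. All three are assumptions for illustration, since the slide fixes neither the base of the logarithm nor a threshold.

```python
import math

# Hypothetical variances of 5 PCs, sorted from largest to smallest.
variances = [50.0, 25.0, 15.0, 6.0, 4.0]

total = sum(variances)
percent = [100.0 * v / total for v in variances]

# Cumulative % of variance: the curve that rises to 100%.
cumulative = []
running = 0.0
for pct in percent:
    running += pct
    cumulative.append(running)

# Scree plot values: log variance against the PC number.
log_variance = [math.log10(v) for v in variances]

def n_components_for(cumulative_pct, threshold):
    """Smallest number of PCs whose cumulative % reaches the threshold."""
    for k, value in enumerate(cumulative_pct, start=1):
        if value >= threshold:
            return k
    return len(cumulative_pct)

n_pcs = n_components_for(cumulative, 90.0)  # assumed 90% cut-off
for k, (c, lv) in enumerate(zip(cumulative, log_variance), start=1):
    print(f"PC {k}: cumulative {c:.1f}%, log variance {lv:.2f}")
print("PCs needed:", n_pcs)
```

A fixed percentage is only one possible rule. The scree plot is usually read by eye, looking for the point where the curve levels off.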

32
How many PCs?
Wine data

33
How many PCs?

34
PCA: Scaling
For better interpretation; may obscure results
Raw data
Mean-centering (column-wise, row-wise, double)
Auto-scaling (column-wise, row-wise)
…
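
Column-wise mean-centering and auto-scaling can be sketched as follows. The data are made up, with one column in much larger units than the other, and the function names are hypothetical. Auto-scaling is taken here to mean centering followed by division by the column's sample standard deviation.

```python
import statistics

# Hypothetical data: column 0 in large units, column 1 in small units.
X = [
    [1000.0, 1.2],
    [1500.0, 0.8],
    [700.0, 1.5],
    [1300.0, 1.1],
]

def transpose(M):
    return [list(t) for t in zip(*M)]

def mean_center(M):
    """Column-wise mean-centering: subtract each column's mean."""
    centered = []
    for col in transpose(M):
        m = statistics.mean(col)
        centered.append([x - m for x in col])
    return transpose(centered)

def autoscale(M):
    """Column-wise auto-scaling: center, then divide by the column's stdev."""
    scaled = []
    for col in transpose(M):
        m, s = statistics.mean(col), statistics.stdev(col)
        scaled.append([(x - m) / s for x in col])
    return transpose(scaled)

def variance_share(M):
    """Fraction of the total variance carried by each column."""
    variances = [statistics.variance(col) for col in transpose(M)]
    total = sum(variances)
    return [v / total for v in variances]

share_centered = variance_share(mean_center(X))
share_scaled = variance_share(autoscale(X))
print("mean-centered:", [round(s, 4) for s in share_centered])
print("auto-scaled:  ", [round(s, 4) for s in share_scaled])
```

After mean-centering alone, the large-unit column still carries almost all of the variance, so a PCA would be dominated by it. After auto-scaling, every column contributes equally.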

35
PCA: Scaling
[Figures: wine data mean-centered; wine data autoscaled]

36
PCA: Scaling
[Figure: wine data raw, axes PC1 (99.79%) and PC2 (0.20%)]
[Figure: wine data mean-centered, axes PC1 (99.79%) and PC2 (0.20%)]

37
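PCA: Outliers
[Figure: 12 samples in $S^3$ (3 variables, axes $x_1$, $x_2$, $x_3$) with PC1]

38
PCA: Outliers
[Figure: 12 samples + 1 outlier in $S^3$ (3 variables, axes $x_1$, $x_2$, $x_3$) with PC1]

39
PCA: Outliers
Leverage effect
[Figure: $S^3$ (3 variables, axes $x_1$, $x_2$, $x_3$) with PC1]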
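
The leverage effect can be illustrated in two dimensions, where the direction of PC1 has a closed form. This is a sketch under assumptions: the slides use 3 variables, whereas the points below are made up in 2 variables, and `pc1_angle_degrees` is a hypothetical helper name.

```python
import math

def pc1_angle_degrees(points):
    """Angle of PC1 relative to the x-axis, for 2-variable data.

    Uses the closed-form principal axis of the 2 x 2 covariance matrix.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))

# Hypothetical samples lying exactly on the line y = 0.5 x.
clean = [(1.0, 0.5), (2.0, 1.0), (3.0, 1.5), (4.0, 2.0), (5.0, 2.5), (6.0, 3.0)]

# The same samples plus a single point far away from the line.
with_outlier = clean + [(3.5, 15.0)]

angle_clean = pc1_angle_degrees(clean)
angle_outlier = pc1_angle_degrees(with_outlier)
print(f"PC1 angle without outlier: {angle_clean:.1f} degrees")
print(f"PC1 angle with outlier:    {angle_outlier:.1f} degrees")
```

Without the outlier, PC1 follows the line through the samples (about 27 degrees). With it, PC1 swings to more than 80 degrees: a single point decides the direction of maximum variance.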

40
Principal Component Analysis: a Recent Research Example
$X$: gene expression values $x_{ij}$, 50,000 genes (rows) by 4 treatments (columns)
Organon, Department of Cell Biology

41
PCA
Interaction: Gene × Treatment
