1
Some statistical considerations in molecular methods. Al Parker, Statistician and Research Engineer, Center for Biofilm Engineering, Montana State University. BSTM, July 2009
2
Acknowledgments. Colleagues in the CBE: James Moberly, Seth D’Imperio, Brent Peyton, Markus Dieser, Marty Hamilton
3
The problem: how to extract useful information from hundreds to thousands of response variables (e.g., microarray analysis) measured from only a few replicates (experiments or environmental samples)
4
Statistical thinking. Multivariate statistics attempts to organize and summarize data sets with large numbers of response variables. “Organize and summarize” = dimension reduction. In this talk, I will focus on abundance data, estimated for example from microarray or clone analysis of PCR
5
Statistical thinking: Hierarchical Clustering, Principal Components, Canonical Correlation
6
Hierarchical Clustering (38 variables, 9 replicates)
7
Similarity or Distance. Linkage: how the similarity measure determines clusters
8
Two different ways to generate clusters with the same similarity measure
9
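A Distance or Similarity Measure. Correlation measures the strength and direction of a linear relationship between paired variables x and y:
$$\mathrm{Corr}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_x S_y}$$
where $\bar{x}$ and $\bar{y}$ are the sample means and $S_x$ and $S_y$ are the sample standard deviations. Unitless. Values between -1 and 1.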
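The correlation formula on this slide can be sketched in a few lines of plain Python. The function names and the two small data sets below are made up for illustration and are not the data from the talk. The second data set shows a perfect but nonlinear (quadratic) relationship whose correlation is zero, which is why the slide stresses the word "linear".

```python
import math


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_corr(x, y):
    """Corr(x, y) = sum((x_i - mean(x)) * (y_i - mean(y))) / ((n - 1) * S_x * S_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Note: this divides by zero if either variable is constant.
    return cross / ((n - 1) * sample_std(x) * sample_std(y))


# Made-up data: an exactly linear relationship.
linear = pearson_corr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Made-up data: y = x**2, a perfect relationship that is not linear.
quadratic = pearson_corr([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(linear, quadratic)
```

Because the numerator and denominator carry the same units, the result is unitless, as the slide notes.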
10
An example (2 variables, 9 replicates): Corr(Actinobacteria, Acidobacteria) = 0.7833
11
Another (made up) example: Corr(species 1, species 2) = 0.000
12
A matrix of scatterplots for 6 variables
13
A correlation matrix of 6 variables

                 Acidobacteria  Actinobacteria  Bacteroidetes  Chloroflexi  Proteobacteria  Verrucomicrobia
Acidobacteria    1.0000         0.7833          0.7589         0.8556       0.8444          0.7975
Actinobacteria   0.7833         1.0000          0.8993         0.8257       0.9698          0.8230
Bacteroidetes    0.7589         0.8993          1.0000         0.7901       0.9393          0.8392
Chloroflexi      0.8556         0.8257          0.7901         1.0000       0.8704          0.9699
Proteobacteria   0.8444         0.9698          0.9393         0.8704       1.0000          0.8621
Verrucomicrobia  0.7975         0.8230          0.8392         0.9699       0.8621          1.0000
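As a rough illustration of how linkage changes the clusters for a fixed similarity measure, the sketch below runs a simple agglomerative clustering on the six-variable correlation matrix from this slide. It assumes the distance is taken as 1 minus the correlation and compares single linkage (closest pair of members) with complete linkage (farthest pair of members). These choices, and the function name `agglomerate`, are assumptions made for illustration; the talk does not say which distance or linkage produced its dendrograms.

```python
NAMES = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]

# Correlation matrix copied from the slide.
CORR = [
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
]


def agglomerate(corr, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain.

    Distance between variables is assumed to be 1 - correlation.
    linkage="single" uses the smallest pairwise distance between clusters,
    linkage="complete" uses the largest.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(corr))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pick(1.0 - corr[i][j]
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return [[NAMES[i] for i in members] for members in clusters]


single = agglomerate(CORR, 2, "single")
complete = agglomerate(CORR, 2, "complete")
print("single:  ", single)
print("complete:", complete)
```

Under these assumptions, cutting at two clusters gives different groupings: single linkage leaves Acidobacteria on its own, while complete linkage groups it with Chloroflexi and Verrucomicrobia.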
14
Principal Components Analysis (PCA). PCA uses the correlation matrix formed by the original variables to optimally construct a smaller number of new variables which capture the maximum amount of variability in the original variables. PCA applied to the correlation matrix is not affected by disparate units between the different variables. The number of new variables can be no larger than the number of replicates.
15
PCA with 2 (standardized) responses. [Scatterplot; axes: Original variable #1 and Original variable #2]
16
PCA with 2 (standardized) responses. 1st PC (78%): loaded by Orig Var #1. 2nd PC (22%): loaded by Orig Var #2. [Scatterplot; axes: Original variable #1 and Original variable #2]
17
PCA terminology. The new variables are called principal components. The amount of variability of the original data captured by each component is given. The correlations between the original variables and the principal components are the principal component loadings.
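For two standardized variables the terminology above can be worked out in closed form, because the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r with eigenvectors (1, 1)/√2 and (1, -1)/√2. The sketch below uses r = 0.7833, the Actinobacteria and Acidobacteria correlation quoted earlier in the talk, purely as a convenient input. The function name and the dictionary layout are hypothetical, and this is not a reconstruction of any figure in the slides.

```python
import math


def pca_two_standardized(r):
    """Closed-form PCA of the 2x2 correlation matrix [[1, r], [r, 1]].

    Returns one dict per principal component, ordered by eigenvalue, with
    the proportion of variability captured and the loadings (correlations
    between each original variable and the component).
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = [(1.0 + r, (s, s)), (1.0 - r, (s, -s))]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    components = []
    for eigenvalue, vector in pairs:
        components.append({
            "eigenvalue": eigenvalue,
            # The eigenvalues of a 2x2 correlation matrix sum to 2.
            "proportion": eigenvalue / 2.0,
            # loading = eigenvector entry * sqrt(eigenvalue)
            "loadings": tuple(v * math.sqrt(eigenvalue) for v in vector),
        })
    return components


pc1, pc2 = pca_two_standardized(0.7833)
print(round(pc1["proportion"], 4), round(pc2["proportion"], 4))
print([round(v, 4) for v in pc1["loadings"]])
```

For each original variable, the squared loadings across both components add up to 1, which is one way to see that the two components together capture all of the variability in the two standardized variables.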
18
Reducing 7 original variables to 2 PCs. Original variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn. New variables = Principal Components. 1st PC (55%): Metals. 2nd PC (18%): Water depth and Core depth.
19
Reducing 7 original variables to 2 PCs. 1st PC: 55%. 2nd PC: 18%. Total: 73%.
20
PCA is another way to cluster
21
Canonical Correlation Analysis (CCA). CCA uses the correlation matrix to determine the (linear) relationship between input variables (e.g., environmental variables) and response variables (e.g., phylogenetic data). CCA simultaneously finds new variables from the input and response variables which have maximal correlation. The number of new variables (canonical components) can be no larger than the number of replicates.
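A minimal sketch of the computation behind CCA, assuming NumPy is available. It uses a standard QR-plus-SVD route: center both blocks, orthonormalize each with a QR decomposition, and take the singular values of the cross-product as the canonical correlations. The data are synthetic, generated only for illustration, with many more replicates than variables so that the example is well behaved; the function name `canonical_correlations` is hypothetical and this is not the software used for the talk.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X (n, p) and Y (n, q).

    Returns (rho, U, V): the canonical correlations in decreasing order and
    the matching canonical variates (new variables) for X and for Y.
    Assumes both centered blocks have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the input block
    Qy, _ = np.linalg.qr(Yc)  # orthonormal basis for the response block
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    return np.clip(s, 0.0, 1.0), Qx @ U, Qy @ Vt.T


# Synthetic data: one shared signal links the first input and first response.
rng = np.random.default_rng(0)
n = 50
shared = rng.normal(size=n)
X = np.column_stack([shared + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([shared + 0.3 * rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])

rho, U, V = canonical_correlations(X, Y)
first_pair_corr = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
print(rho, first_pair_corr)
```

The first canonical correlation equals the ordinary correlation between the first pair of new variables, which is the "maximal correlation" property described on the slide.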
22
CCA Example (7 inputs, 6 outputs, 9 replicates). Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn. Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia.
23
CCA (7 inputs, 6 outputs, 9 replicates). Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn. Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia. 1st CC: Water depth and Core depth (environmental); Acidobacteria, …, Verrucomicrobia (microbial). 2nd CC: Metals (environmental); Bacteroidetes (microbial).
24
CCA (7 inputs, 6 outputs, 9 replicates). 1st CC: Water depth and Core depth (environmental); Acidobacteria, …, Verrucomicrobia (microbial). 2nd CC: Metals (environmental); Bacteroidetes (microbial).
25
Summary
PROBLEM: Lots of variables measured from a few samples
SOME APPROACHES:
Cluster similar variables together
Principal component analysis creates a few new variables which optimally represent the data
Canonical correlation analysis describes the optimal (linear) relationship between input and output variables
26
Fin
28
Principal Component Analysis: water depth, core depth (, Mn-Total, Fe-Total, C
Eigenanalysis of the Correlation Matrix

Eigenvalue   3.8467  1.2443  1.0043  0.6628  0.1567  0.0830  0.0023
Proportion   0.550   0.178   0.143   0.095   0.022   0.012   0.000
Cumulative   0.550   0.727   0.871   0.965   0.988   1.000   1.000

Variable             PC1     PC2     PC3     PC4     PC5     PC6     PC7
water depth (cm)     0.090  -0.529  -0.732   0.338   0.131   0.201  -0.062
core depth (cm)     -0.193   0.702  -0.154   0.558   0.194   0.313   0.009
Mn-Total             0.488   0.163  -0.171  -0.016  -0.366   0.084   0.752
Fe-Total             0.477   0.228  -0.126  -0.057  -0.504   0.154  -0.651
Cu-Total             0.227  -0.358   0.608   0.633  -0.119   0.188   0.004
Zn-Total             0.463   0.019   0.147  -0.326   0.634   0.505  -0.026
Pb-Total             0.474   0.142  -0.055   0.253   0.376  -0.735  -0.080
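The Proportion and Cumulative rows follow directly from the eigenvalues: a correlation matrix of 7 variables has trace 7, so each proportion is the eigenvalue divided by 7. The short sketch below reproduces both rows from the eigenvalues printed on this slide; the variable names are illustrative only.

```python
# Eigenvalues copied from the slide (correlation matrix of 7 variables).
eigenvalues = [3.8467, 1.2443, 1.0043, 0.6628, 0.1567, 0.0830, 0.0023]

# The eigenvalues of a correlation matrix sum to the number of variables.
n_variables = len(eigenvalues)

proportions = [ev / n_variables for ev in eigenvalues]

cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

proportions_rounded = [round(p, 3) for p in proportions]
cumulative_rounded = [round(c, 3) for c in cumulative]

print(proportions_rounded)
print(cumulative_rounded)
```

The first two components together account for about 73% of the variability, matching the 55% and 18% quoted for the two-component reduction.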
29
CCA (7 input variables, 9 replicates). 1st CC: Water depth, Core depth. 2nd CC: Metals.
30
CCA (6 response variables, 9 replicates). 1st CC: Acidobacteria, …, Verrucomicrobia. 2nd CC: Bacteroidetes.
31
Hierarchical Clustering. The large number of variables is organized into a smaller number of clusters of similar variables. One can choose a representative variable from each cluster (e.g., a mean).