
Some statistical considerations in molecular methods
Al Parker, Statistician and Research Engineer
Center for Biofilm Engineering, Montana State University
BSTM, July 2009

Acknowledgments
Colleagues in the CBE:
- James Moberly, Seth D’Imperio, Brent Peyton
- Markus Dieser
- Marty Hamilton

The problem
How to extract useful information from hundreds to thousands of response variables (e.g., microarray analysis) measured from only a few replicates (experiments or environmental samples)

Statistical thinking
- Multivariate statistics attempts to organize and summarize data sets with large numbers of response variables
- “Organize and summarize” = dimension reduction
- In this talk, I will focus on abundance data, estimated, for example, from microarray or clone analysis of PCR

Statistical thinking
- Hierarchical Clustering
- Principal Components
- Canonical Correlation

Hierarchical Clustering (38 variables, 9 replicates)

Similarity or Distance
Linkage: How the similarity measure determines clusters

Two different ways to generate clusters with the same similarity measure

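A Distance or Similarity Measure
Correlation measures the strength and direction of a linear relationship between paired variables x and y:
$$\mathrm{Corr}(x, y) = \frac{1}{(n-1)\, S_x S_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where n is the number of replicates, \bar{x} and \bar{y} are the sample means, and S_x and S_y are the sample standard deviations.
- Unitless
- Values between -1 and 1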
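As a minimal sketch of the correlation formula, the following pure-Python function computes it term by term. The function name `pearson_corr` and all data values are illustrative assumptions, not data from the talk.

```python
from math import sqrt

def pearson_corr(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations S_x and S_y (divisor n - 1)
    s_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # Sum of cross-products of deviations from the means
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Made-up abundances for two hypothetical taxa across 9 replicates
taxon_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
taxon_b = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # exactly linear in taxon_a
print(pearson_corr(taxon_a, taxon_b))

# A strong but purely curved (non-linear) relationship: y = x**2
curve_x = [-2, -1, 0, 1, 2]
curve_y = [4, 1, 0, 1, 4]
print(pearson_corr(curve_x, curve_y))
```

The second data set is a reminder that correlation only measures linear association: a zero correlation does not by itself mean the two variables are unrelated.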

An example (2 variables, 9 replicates)
Corr(Actinobacteria, Acidobacteria) = 0.7833

Another (made-up) example
Corr(species 1, species 2) = 0.000

A matrix of scatterplots for 6 variables

A correlation matrix of 6 variables
                 Acidobacteria  Actinobacteria  Bacteroidetes  Chloroflexi  Proteobacteria  Verrucomicrobia
Acidobacteria    1              0.7833          0.7589         0.8556       0.8444          0.7975
Actinobacteria   0.7833         1               0.8993         0.8257       0.9698          0.8230
Bacteroidetes    0.7589         0.8993          1              0.7901       0.9393          0.8392
Chloroflexi      0.8556         0.8257          0.7901         1            0.8704          0.9699
Proteobacteria   0.8444         0.9698          0.9393         0.8704       1               0.8621
Verrucomicrobia  0.7975         0.8230          0.8392         0.9699       0.8621          1
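The correlation matrix above can drive a hierarchical clustering of the six variables. The sketch below is a minimal pure-Python agglomerative procedure. It assumes 1 - correlation as the distance and compares single and complete linkage. The talk does not state which distance or linkage was used, so both choices, and the function name `agglomerate`, are assumptions for illustration.

```python
NAMES = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]

# Correlation matrix copied from the slide
CORR = [
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
]

def agglomerate(corr, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain.

    Distance between two variables is 1 - correlation (an assumption).
    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[i] for i in range(len(corr))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage
                d = linkage(1.0 - corr[i][j]
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

single = agglomerate(CORR, 2, linkage=min)
complete = agglomerate(CORR, 2, linkage=max)
for label, result in (("single", single), ("complete", complete)):
    print(label, [[NAMES[i] for i in c] for c in result])
```

With these values and two clusters, single linkage leaves Acidobacteria on its own, while complete linkage groups Acidobacteria with Chloroflexi and Verrucomicrobia: the same similarity measure, two different sets of clusters.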

Principal Components Analysis (PCA)
- PCA uses the correlation matrix formed by the original variables to optimally construct a smaller number of new variables which capture the maximum amount of variability in the original variables
- PCA applied to the correlation matrix is not affected by disparate units between the different variables
- The number of new variables is only as large as the number of replicates

PCA with 2 (standardized) responses
[Scatterplot of Original variable #1 and Original variable #2]

PCA with 2 (standardized) responses
[Scatterplot of Original variable #1 and Original variable #2]
- 1st PC: 78%; loaded by Original variable #1
- 2nd PC: 22%; loaded by Original variable #2

PCA terminology
- The new variables are called principal components
- The amount of variability in the original data captured by each component is reported for that component
- The correlations between the original variables and the principal components are the principal component loadings
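These terms can be made concrete with a small sketch. It applies an eigen-decomposition to the 6-variable correlation matrix shown earlier in the talk, using NumPy (an assumed tool; the talk does not say which software produced its figures). The variable names `proportion` and `loadings` are illustrative.

```python
import numpy as np

names = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]

# Correlation matrix of the 6 phyla, copied from the talk
R = np.array([
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
])

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variability captured by each principal component
proportion = eigvals / eigvals.sum()

# Loadings as correlations between original variables and components:
# column k of eigvecs scaled by sqrt(eigenvalue k); the clip guards against
# tiny negative eigenvalues caused by rounding in the published matrix
loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

for k in range(2):
    print(f"PC{k + 1}: {proportion[k]:.1%} of variability")
    for name, value in zip(names, loadings[:, k]):
        print(f"  {name:16s} {value:+.3f}")
```

The sign of each component is arbitrary, so a whole column of loadings may come out negated without changing the interpretation.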

Reducing 7 original variables to 2 PCs
Original variables:
1. Water depth
2. Core depth
3. Fe
4. Mn
5. Cu
6. Pb
7. Zn
New variables = Principal Components:
- 1st PC (55%): Metals
- 2nd PC (18%): Water depth and Core depth

Reducing 7 original variables to 2 PCs
- 1st PC: 55%
- 2nd PC: 18%
- Total: 73%

PCA is another way to cluster

Canonical Correlation Analysis (CCA)
- CCA uses the correlation matrix to determine the (linear) relationship between input variables (e.g., environmental variables) and response variables (e.g., phylogenetic data)
- CCA simultaneously finds new variables from the input and response variables which have maximal correlation
- The number of new variables (canonical components) can be no larger than the number of replicates
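A minimal sketch of the idea, using NumPy on simulated data: the canonical correlations are computed from a QR decomposition of each centered block followed by a singular value decomposition. This is one standard numerical route and not necessarily what the talk's software does. The function name `canonical_correlations`, the variable names, and every data value are assumptions made up for illustration.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y.

    Rows are replicates. Each block is centered, an orthonormal basis is
    taken for each block (QR), and the singular values of Qx^T Qy are the
    canonical correlations, largest first.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)   # remove round-off just outside [0, 1]

# Simulated example: 9 replicates, 2 inputs, 2 responses (all made up)
rng = np.random.default_rng(0)
n = 9
depth = rng.normal(size=n)                 # hypothetical input 1
metal = rng.normal(size=n)                 # hypothetical input 2
X = np.column_stack([depth, metal])
Y = np.column_stack([
    2.0 * depth + 0.1 * rng.normal(size=n),  # response driven by depth
    rng.normal(size=n),                      # response unrelated to inputs
])

rho = canonical_correlations(X, Y)
print("canonical correlations:", np.round(rho, 3))
```

Because the first simulated response is almost a multiple of `depth`, the first canonical correlation should come out close to 1. The example keeps the number of variables in each block small relative to the 9 replicates.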

CCA Example (7 inputs, 6 outputs, 9 replicates)
Original environmental variables:
1. Water depth
2. Core depth
3. Fe
4. Mn
5. Cu
6. Pb
7. Zn
Original microbial variables:
1. Acidobacteria
2. Actinobacteria
3. Bacteroidetes
4. Chloroflexi
5. Proteobacteria
6. Verrucomicrobia

CCA (7 inputs, 6 outputs, 9 replicates)
Original environmental variables:
1. Water depth
2. Core depth
3. Fe
4. Mn
5. Cu
6. Pb
7. Zn
- 1st CC: Water depth and Core depth
- 2nd CC: Metals
Original microbial variables:
1. Acidobacteria
2. Actinobacteria
3. Bacteroidetes
4. Chloroflexi
5. Proteobacteria
6. Verrucomicrobia
- 1st CC: Acidobacteria, …, Verrucomicrobia
- 2nd CC: Bacteroidetes

CCA (7 inputs, 6 outputs, 9 replicates)
Inputs:
- 1st CC: Water depth and Core depth
- 2nd CC: Metals
Outputs:
- 1st CC: Acidobacteria, …, Verrucomicrobia
- 2nd CC: Bacteroidetes

Summary
PROBLEM: Lots of variables measured from a few samples
SOME APPROACHES:
- Cluster similar variables together
- Principal component analysis creates a few new variables which optimally represent the data
- Canonical correlation analysis describes the optimal (linear) relationship between input and output variables

Principal Component Analysis: water depth, core depth (, Mn-Total, Fe-Total, C
Eigenanalysis of the Correlation Matrix
Eigenvalue   3.8467  1.2443  1.0043  0.6628  0.1567  0.0830  0.0023
Proportion   0.550   0.178   0.143   0.095   0.022   0.012   0.000
Cumulative   0.550   0.727   0.871   0.965   0.988   1.000   1.000
Variable           PC1     PC2     PC3     PC4     PC5     PC6     PC7
water depth (cm)   0.090  -0.529  -0.732   0.338   0.131   0.201  -0.062
core depth (cm)   -0.193   0.702  -0.154   0.558   0.194   0.313   0.009
Mn-Total           0.488   0.163  -0.171  -0.016  -0.366   0.084   0.752
Fe-Total           0.477   0.228  -0.126  -0.057  -0.504   0.154  -0.651
Cu-Total           0.227  -0.358   0.608   0.633  -0.119   0.188   0.004
Zn-Total           0.463   0.019   0.147  -0.326   0.634   0.505  -0.026
Pb-Total           0.474   0.142  -0.055   0.253   0.376  -0.735  -0.080
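The Proportion and Cumulative rows in this output follow directly from the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues, which for a correlation matrix of 7 variables is 7 up to rounding. The short sketch below recomputes both rows from the eigenvalues listed above; the variable names are illustrative.

```python
# Eigenvalues copied from the eigenanalysis output
eigenvalues = [3.8467, 1.2443, 1.0043, 0.6628, 0.1567, 0.0830, 0.0023]

total = sum(eigenvalues)  # close to 7, the number of standardized variables
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{k}: proportion = {p:.3f}, cumulative = {c:.3f}")
```

The first two values of `cumulative` reproduce the 55% and 73% quoted for the first two principal components.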

CCA (7 input variables, 9 replicates)
- 1st CC: Water depth, Core depth
- 2nd CC: Metals

CCA (6 response variables, 9 replicates)
- 1st CC: Acidobacteria, …, Verrucomicrobia
- 2nd CC: Bacteroidetes

Hierarchical Clustering
- The large number of variables is organized into a smaller number of clusters of similar variables
- One can choose a representative variable from each cluster (e.g., a mean)
