Center for Biofilm Engineering BSTM– July 2009 Al Parker Statistician and Research Engineer Montana State University Some statistical considerations in.


1 Center for Biofilm Engineering BSTM– July 2009 Al Parker Statistician and Research Engineer Montana State University Some statistical considerations in molecular methods

2 Acknowledgments
Colleagues in the CBE:
• James Moberly, Seth D’Imperio, Brent Peyton
• Markus Dieser
• Marty Hamilton

3 The problem: How to extract useful information from hundreds to thousands of response variables (e.g., microarray analysis) measured from only a few replicates (experiments or environmental samples)

4 Statistical thinking
• Multivariate statistics attempts to organize and summarize data sets with large numbers of response variables
• “Organize and summarize” = dimension reduction
• In this talk, I will focus on abundance data, estimated for example from microarray or clone analysis of PCR

5 Statistical thinking
• Hierarchical Clustering
• Principal Components
• Canonical Correlation

6 Hierarchical Clustering (38 variables, 9 replicates)

7 Similarity or Distance Linkage: How the similarity measure determines clusters

8 Two different ways to generate clusters with the same similarity measure

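9 A Distance or Similarity Measure
Correlation measures the strength and direction of a linear relationship between paired variables x and y:
$$\mathrm{Corr}(x, y) = \frac{\sum_{i=1}^{n} \left(x_i - \mathrm{mean}(x)\right)\left(y_i - \mathrm{mean}(y)\right)}{(n-1)\, S_x S_y}$$
where S_x and S_y are the sample standard deviations of x and y, and n is the number of paired observations.
• Unitless
• Values between -1 and 1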
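The correlation formula on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pearson_corr` and the abundance values are hypothetical and are not taken from the talk's data.

```python
import statistics

def pearson_corr(x, y):
    """Sample correlation: sum of cross-deviations / ((n - 1) * S_x * S_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (divide by n - 1)
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical abundances of two taxa across 9 replicates (illustration only).
taxon_a = [12, 15, 11, 20, 18, 25, 22, 30, 27]
taxon_b = [5, 7, 6, 9, 8, 12, 10, 14, 13]
print(round(pearson_corr(taxon_a, taxon_b), 4))

# Correlation only measures linear association: a purely quadratic
# relationship on symmetric x values has correlation 0.
print(pearson_corr([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # → 0.0
```

The second call shows why a correlation of zero does not by itself mean the two variables are unrelated.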

10 An example (2 variables, 9 replicates) Corr(Actinobacteria, Acidobacteria) = 0.7833

11 Another (made up) example Corr(species 1, species 2) = 0.000

12 A matrix of scatterplots for 6 variables

13 A correlation matrix of 6 variables

                 Acidobacteria   Actinobacteria  Bacteroidetes   Chloroflexi     Proteobacteria  Verrucomicrobia
Acidobacteria    1               0.7833          0.7589          0.8556          0.8444          0.7975
Actinobacteria   0.7833          1               0.8993          0.8257          0.9698          0.8230
Bacteroidetes    0.7589          0.8993          1               0.7901          0.9393          0.8392
Chloroflexi      0.8556          0.8257          0.7901          1               0.8704          0.9699
Proteobacteria   0.8444          0.9698          0.9393          0.8704          1               0.8621
Verrucomicrobia  0.7975          0.8230          0.8392          0.9699          0.8621          1
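To connect the correlation matrix on slide 13 back to the clustering slides, here is a rough sketch of agglomerative clustering that uses 1 - correlation as the distance. The choice of distance, the function name `agglomerate`, and the two linkage rules shown (single and complete) are assumptions made for illustration; the slides do not say which distance or linkage produced the talk's dendrograms.

```python
from itertools import combinations

# Correlation matrix of the six phyla, transcribed from slide 13.
NAMES = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]
R = [
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
]

def agglomerate(names, corr, linkage="single", n_clusters=2):
    """Greedily merge the two closest clusters until n_clusters remain."""
    # Single linkage compares the closest members of two clusters,
    # complete linkage compares the farthest members.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(names))]

    def cluster_distance(a, b):
        return pick(1.0 - corr[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]

for rule in ("single", "complete"):
    print(rule, agglomerate(NAMES, R, linkage=rule, n_clusters=2))
```

With the slide-13 values, both rules give the same three clusters, but at two clusters single linkage leaves Acidobacteria on its own while complete linkage groups it with Chloroflexi and Verrucomicrobia. The same similarity measure can therefore yield different clusters depending on the linkage.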

14 Principal Components Analysis (PCA)
• PCA uses the correlation matrix formed by the original variables to optimally construct a smaller number of new variables which capture the maximum amount of variability in the original variables
• PCA applied to the correlation matrix is not affected by disparate units between the different variables
• The number of new variables can be no larger than the number of replicates
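As a rough illustration of how PCA works from a correlation matrix, the sketch below finds the first principal component of the six-phyla correlation matrix from slide 13 by power iteration. This is only an illustration under stated assumptions: the talk's own PCA example uses the seven environmental variables, not the phyla, and the helper names `mat_vec` and `first_principal_component` are hypothetical.

```python
import math

# Correlation matrix of the six phyla, transcribed from slide 13.
PHYLA = ["Acidobacteria", "Actinobacteria", "Bacteroidetes",
         "Chloroflexi", "Proteobacteria", "Verrucomicrobia"]
R = [
    [1.0000, 0.7833, 0.7589, 0.8556, 0.8444, 0.7975],
    [0.7833, 1.0000, 0.8993, 0.8257, 0.9698, 0.8230],
    [0.7589, 0.8993, 1.0000, 0.7901, 0.9393, 0.8392],
    [0.8556, 0.8257, 0.7901, 1.0000, 0.8704, 0.9699],
    [0.8444, 0.9698, 0.9393, 0.8704, 1.0000, 0.8621],
    [0.7975, 0.8230, 0.8392, 0.9699, 0.8621, 1.0000],
]

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def first_principal_component(corr, iterations=200):
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    n = len(corr)
    weights = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = mat_vec(corr, weights)
        norm = math.sqrt(sum(p * p for p in product))
        weights = [p / norm for p in product]
    # The Rayleigh quotient w'Rw is the variance captured along `weights`.
    eigenvalue = sum(w * p for w, p in zip(weights, mat_vec(corr, weights)))
    return eigenvalue, weights

eigenvalue, weights = first_principal_component(R)
# The eigenvalues of a correlation matrix sum to the number of variables.
share = eigenvalue / len(R)
print(f"PC1 captures about {share:.0%} of the total variance")
for name, w in zip(PHYLA, weights):
    print(f"{name:16s} weight {w:.3f}")
```

Because all six phyla are strongly and positively correlated, a single new variable summarizes most of their variability, which is the sense in which PCA reduces dimension.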

15 PCA with 2 (standardized) responses [Figure: scatterplot; axes: Original variable #1 and Original variable #2]

16 PCA with 2 (standardized) responses [Figure: scatterplot; axes: Original variable #1 and Original variable #2]
1st PC: 78% (loaded by Original variable #1)
2nd PC: 22% (loaded by Original variable #2)

17 PCA terminology
• The new variables are called principal components
• The amount of variability in the original data captured by each component is reported as a percentage of the total
• The correlations between the original variables and the principal components are called principal component loadings

18 Reducing 7 original variables to 2 PCs
Original variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn
New variables = Principal Components:
1st PC (55%): Metals
2nd PC (18%): Water depth and Core depth

19 Reducing 7 original variables to 2 PCs: 1st PC: 55%; 2nd PC: 18%; Total: 73%

20 PCA is another way to cluster

21 Canonical Correlation Analysis (CCA)
• CCA uses the correlation matrix to determine the (linear) relationship between input variables (e.g., environmental variables) and response variables (e.g., phylogenetic data)
• CCA simultaneously finds new variables from the input and response variables which have maximal correlation
• The number of new variables (canonical components) can be no larger than the number of replicates
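The "maximal correlation" idea on this slide can be written as an optimization problem. The notation below is an assumption introduced for illustration: x is the vector of standardized input variables, y the vector of standardized response variables, a and b are weight vectors, and R_xx, R_yy, and R_xy are the corresponding blocks of the joint correlation matrix.

```latex
\rho_1 \;=\; \max_{a,\,b}\ \mathrm{Corr}\!\left(a^{\top}x,\; b^{\top}y\right)
\;=\; \max_{a,\,b}\ \frac{a^{\top} R_{xy}\, b}
{\sqrt{a^{\top} R_{xx}\, a}\;\sqrt{b^{\top} R_{yy}\, b}}
```

The first pair of canonical components is (a^T x, b^T y) at the maximizing weights. Each later pair maximizes the same ratio subject to being uncorrelated with the pairs already found.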

22 CCA Example (7 inputs, 6 outputs, 9 replicates)
Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn
Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia

23 CCA (7 inputs, 6 outputs, 9 replicates)
Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn
Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia
Environmental side: 1st CC: Water depth and Core depth; 2nd CC: Metals
Microbial side: 1st CC: Acidobacteria, …, Verrucomicrobia; 2nd CC: Bacteroidetes

24 CCA (7 inputs, 6 outputs, 9 replicates)
Environmental side: 1st CC: Water depth and Core depth; 2nd CC: Metals
Microbial side: 1st CC: Acidobacteria, …, Verrucomicrobia; 2nd CC: Bacteroidetes

25 Summary
PROBLEM: Lots of variables measured from a few samples
SOME APPROACHES:
• Cluster similar variables together
• Principal component analysis creates a few new variables which optimally represent the data
• Canonical correlation analysis describes the optimal (linear) relationship between input and output variables

26 Fin

28 Principal Component Analysis: water depth, core depth (, Mn-Total, Fe-Total, C

Eigenanalysis of the Correlation Matrix

Eigenvalue          3.8467  1.2443  1.0043  0.6628  0.1567  0.0830  0.0023
Proportion           0.550   0.178   0.143   0.095   0.022   0.012   0.000
Cumulative           0.550   0.727   0.871   0.965   0.988   1.000   1.000

Variable               PC1     PC2     PC3     PC4     PC5     PC6     PC7
water depth (cm)     0.090  -0.529  -0.732   0.338   0.131   0.201  -0.062
core depth (cm)     -0.193   0.702  -0.154   0.558   0.194   0.313   0.009
Mn-Total             0.488   0.163  -0.171  -0.016  -0.366   0.084   0.752
Fe-Total             0.477   0.228  -0.126  -0.057  -0.504   0.154  -0.651
Cu-Total             0.227  -0.358   0.608   0.633  -0.119   0.188   0.004
Zn-Total             0.463   0.019   0.147  -0.326   0.634   0.505  -0.026
Pb-Total             0.474   0.142  -0.055   0.253   0.376  -0.735  -0.080
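The Proportion and Cumulative rows of this output follow directly from the eigenvalues, and the PC1 coefficients can be turned into correlations with the original variables. The sketch below recomputes both from the numbers printed on this slide. It assumes the printed coefficients are unit-length eigenvectors of the correlation matrix (the PC1 column has squared entries summing to roughly 1, which is consistent with that); the variable names in the code are hypothetical.

```python
import math

# Eigenvalues and PC1 coefficients as printed on slide 28.
eigenvalues = [3.8467, 1.2443, 1.0043, 0.6628, 0.1567, 0.0830, 0.0023]
pc1_coefficients = {
    "water depth (cm)": 0.090,
    "core depth (cm)": -0.193,
    "Mn-Total": 0.488,
    "Fe-Total": 0.477,
    "Cu-Total": 0.227,
    "Zn-Total": 0.463,
    "Pb-Total": 0.474,
}

# For a correlation matrix the eigenvalues sum to the number of variables (7 here).
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Assumption: coefficients are unit eigenvectors, so the correlation between
# an original variable and PC1 is coefficient * sqrt(eigenvalue of PC1).
pc1_correlations = {
    name: coeff * math.sqrt(eigenvalues[0])
    for name, coeff in pc1_coefficients.items()
}

print([round(p, 3) for p in proportions])
print([round(c, 3) for c in cumulative])
for name, corr in pc1_correlations.items():
    print(f"{name:18s} {corr:6.3f}")
```

Under that assumption, the four metals Mn, Fe, Zn, and Pb each correlate above 0.9 with PC1 while water depth and core depth correlate weakly, which matches the slide-18 reading of the first component as "Metals".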

29 CCA (7 input variables, 9 replicates) 1st CC: Water depth and Core depth; 2nd CC: Metals

30 CCA (6 response variables, 9 replicates) 1st CC: Acidobacteria, …, Verrucomicrobia; 2nd CC: Bacteroidetes

31 Hierarchical Clustering
• The large number of variables is organized into a smaller number of clusters of similar variables
• One can choose a representative variable from each cluster (e.g., a mean)

