Published by Kaitlyn Reeves. Modified over 2 years ago.

1
Discrimination Models and Variance Stabilizing Transformations of Metabolomic NMR Data
Institute on Research and Statistics, Sacramento, 04/08/04
Parul Vora Purohit

2
Biodata and omics
Genome Project
Genomics - study of genes
Proteomics - study of proteins
Metabolomics - study of metabolites
* cellomics, CHOmics, chromonoics, etc.
Analytical techniques
Microarray
Spectroscopy
Mass Spectroscopy
NMR Spectroscopy *

3
NMR Spectroscopy
Courtesy ~ Joseph Medendorp / Public Information / University of Kentucky
Intense, homogeneous magnetic field
High-powered RF transmitter capable of delivering short pulses
Pulses at ~500 MHz stimulate 1H nuclear spin transitions
Probe containing the coils used to excite and detect the signal
Plot of signal vs. shift in frequency from the original pulse
Shift measured in ppm (as a ratio to the original signal frequency)
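The ppm scale mentioned on this slide can be made concrete with a small sketch: the shift is the frequency offset from a reference, divided by the spectrometer frequency and scaled by one million. The function name and the example frequencies are illustrative assumptions, not values from the talk.

```python
def chemical_shift_ppm(freq_hz: float, ref_hz: float, spectrometer_hz: float) -> float:
    """Chemical shift in ppm of a resonance relative to a reference peak."""
    return (freq_hz - ref_hz) / spectrometer_hz * 1e6


# Illustrative numbers: a resonance 1070 Hz above the reference on a 500 MHz instrument.
shift = chemical_shift_ppm(500_001_070.0, 500_000_000.0, 500e6)
print(round(shift, 2))  # 2.14
```

Because the offset is divided by the instrument frequency, the same compound appears at the same ppm value on instruments of different field strength.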

4
NMR Data
Allows detection of compounds with H content
Shift characterizes the chemicals (metabolites)
Examples:
  2.14 ppm – glutamine – γ-CH2 group
  2.27 ppm – valine – β-CH group
  6.91 ppm – tyrosine – C3,5-H ring
~65,000 points (variables) per sample

5
Questions
Classification ~ Can we distinguish sick organisms from healthy ones?
Identification ~ Which metabolites play a role in the disease (biomarkers)?
DIFFERENCES IN THE DETAILS!

6
Abalone Data
A set of 18 abalone: 8 healthy, 5 stunted, 5 sick
Tissue from muscle
Questions:
  Can we classify the abalone accurately?
  Can we detect any metabolites that are markers?

7
Problems / Solutions
Multivariate techniques
Matrix of 65,000 (variables) x 18 (samples)
Too many variables compared to the number of samples
Dimension reduction by binning
Classification and metabolite marker identification using PCA and cluster analysis
Methods assume that the data are normally distributed with constant variance
Generalized log transformation improves results!

8
NMR Data Pre-Processing
Background subtraction
TMSP peak removed (standard at 0 ppm)
Water peak removal (ppm range removed)
Normalization: integrated intensity normalized to 1.0 to remove the effects of systematic intensity changes between abalone
Binning / bin size
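The region removal and normalization steps on this slide can be sketched as follows. This is a minimal illustration assuming evenly spaced points, so that a plain sum stands in for the integrated intensity. The function name `preprocess` and the excluded ranges in the example are placeholders; the limits used in the original analysis are not given here.

```python
def preprocess(ppm, intensity, drop_regions=()):
    """Drop points inside excluded ppm regions, then scale to unit total intensity.

    drop_regions is a sequence of (low_ppm, high_ppm) pairs. The values used
    in the example below are placeholders, not the limits from the talk.
    """
    kept = [
        (x, y)
        for x, y in zip(ppm, intensity)
        if not any(lo <= x <= hi for lo, hi in drop_regions)
    ]
    total = sum(y for _, y in kept)
    if total == 0:
        raise ValueError("spectrum has zero integrated intensity")
    return [x for x, _ in kept], [y / total for _, y in kept]


# Toy spectrum: large peaks at both ends stand in for the reference and water peaks.
ppm = [0.0, 1.0, 2.0, 3.0, 4.0]
intensity = [50.0, 2.0, 6.0, 2.0, 40.0]
new_ppm, normalized = preprocess(ppm, intensity, drop_regions=[(-0.5, 0.5), (3.5, 4.5)])
print(new_ppm, normalized)
```

Normalizing after the large reference and water peaks are removed keeps those peaks from dominating the total, so the remaining metabolite signals are comparable between animals.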

9
Binned Spectrum
Bin size range = ppm – 0.7 ppm
Intensity of bin = integrated intensity of all points in the bin
Region of interest restricted to 0.2 ppm – 10.0 ppm
Bin size = 0.04 ppm
239 bins
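A minimal sketch of the binning step, assuming fixed-width bins over the 0.2 to 10.0 ppm region and a plain sum as the integrated intensity of each bin. The function name and toy data are illustrative. The full region at 0.04 ppm gives 245 bins; the 239 on the slide presumably reflects bins dropped with the excluded water region, which this sketch does not model.

```python
def bin_spectrum(ppm, intensity, low=0.2, high=10.0, width=0.04):
    """Sum the intensities of all points falling in each fixed-width ppm bin."""
    n_bins = round((high - low) / width)
    bins = [0.0] * n_bins
    for x, y in zip(ppm, intensity):
        if low <= x < high:
            # Clamp guards against floating-point round-off at the upper edge.
            idx = min(int((x - low) / width), n_bins - 1)
            bins[idx] += y
    centers = [low + (i + 0.5) * width for i in range(n_bins)]
    return centers, bins


# Toy data: two points land in the first bin, one in the second,
# one near 3.22 ppm, and one outside the region of interest.
centers, bins = bin_spectrum([0.21, 0.23, 0.25, 3.22, 11.0], [1.0, 2.0, 3.0, 4.0, 5.0])
print(len(bins))  # 245
print(round(centers[75], 2), bins[75])
```

With this indexing, the 76th bin (index 75) is centered at 3.22 ppm, which agrees with the bin numbering quoted later in the talk for the region below the water peak.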

10
Principal Component Analysis (PCA)
Technique that explains the variance-covariance structure of the variables in terms of a few linear combinations of them
$X = t_1 p_1^T + t_2 p_2^T + \dots + t_k p_k^T + E$, where the $p_i$ are eigenvectors of the covariance matrix, the $t_i$ are the corresponding scores, and $E$ is the residual
Projections of the original data matrix on these components give the relations between the samples – Scores Plot
A plot of the eigenvectors of the covariance matrix gives the relationship between the variables – Loadings Plot
Reduces the dimension of the problem; a few components suffice to explain the variance
* Courtesy Wise, B. M. and Gallagher, N. B., PLS_Toolbox 2.1
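The decomposition on this slide can be illustrated for the first component alone. The sketch below uses power iteration on the mean-centered data as a stand-in for a full eigendecomposition; this is an illustrative choice and not the PLS_Toolbox routine. The function name and toy matrix are assumptions.

```python
import math


def first_component(X, n_iter=200):
    """Return (scores t1, loadings p1) of the leading principal component.

    X is a list of samples (rows), each a list of variables (columns).
    Power iteration converges to the top eigenvector of Xc^T Xc provided the
    starting vector is not orthogonal to it.
    """
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    p = [float(j + 1) for j in range(m)]  # arbitrary non-uniform start
    for _ in range(n_iter):
        t = [sum(r[j] * p[j] for j in range(m)) for r in Xc]            # t = Xc p
        w = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]  # w = Xc^T t
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            raise ValueError("data have no variance")
        p = [v / norm for v in w]
    t = [sum(r[j] * p[j] for j in range(m)) for r in Xc]
    return t, p, Xc


# Toy data: four samples, two strongly correlated variables.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
t, p, Xc = first_component(X)
total = sum(v * v for row in Xc for v in row)
explained = sum(v * v for v in t) / total
print(p, explained)
```

Plotting `t` for the first two components gives a scores plot (one point per sample), and plotting `p` against the variable index gives a loadings plot (one value per bin).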

11
PCA Results
[Figure panels: Scores Plot; Loadings Plot]

12
Cluster Analysis - Hierarchical
[Figure panels: Untransformed Data; Transformed Data – Groups Clearly Identified]

13
Generalized Log Transformation
It has been shown* that a transformation of the form $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$ can have a variance-stabilizing effect on the data
The parameter c can be obtained by maximum likelihood or ANOVA methods and is approximately $c \approx \sigma^2 / S^2$, where $\sigma^2$ is the variance of the noise and $S^2$ the variance of the high peaks
* Durbin, B., Hardin, J., Rocke, D. M., Bioinformatics, 2002, 18, S105-S110
* Sue Geller, Jeff Gregg, Paul Hagerman, David Rocke, Transformation and Normalization of Oligonucleotide Microarray Data, 2003
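A minimal sketch of the transformation itself. It uses the identity ln(y + sqrt(y^2 + c)) = asinh(y / sqrt(c)) + 0.5 ln(c), which is numerically safer for negative y than evaluating the logarithm directly; that rewriting is a choice made here, not something stated in the talk. The value c = 2.7e-7 is the one quoted later in the talk.

```python
import math


def glog(y: float, c: float) -> float:
    """Generalized log ln(y + sqrt(y^2 + c)), defined for all real y when c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.asinh(y / math.sqrt(c)) + 0.5 * math.log(c)


c = 2.7e-7
# Near zero the transform is roughly linear; for large y it approaches ln(2y).
for y in (-1e-4, 0.0, 1e-4, 1e-2, 1.0):
    print(y, glog(y, c))
```

Unlike the ordinary logarithm, the transform accepts zero and negative intensities, which occur in noisy baseline regions after background subtraction.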

14
Maximum Likelihood*
Need replicates to determine the error sum of squares, SSE(c), accurately
Find the c that gives the minimum SSE
Step through c using Newton's method or educated intervals
[Figure: error sum of squares vs. c]
* Box, G. and Cox, D. R. (1964). An analysis of transformations. J. Roy. Stat. Soc., Series B (Methodological), 26, 211.
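One way to read the procedure on this slide is a Box-Cox style search: scale the transformed data by the geometric mean of sqrt(y^2 + c) so that SSE values are comparable across c, compute the within-replicate SSE, and pick the c with the smallest value. The sketch below assumes that Jacobian scaling and a simple grid of candidates (the "educated intervals" option); the talk does not spell out either detail. Function names and the toy replicate groups are hypothetical.

```python
import math


def glog(y, c):
    """Generalized log ln(y + sqrt(y^2 + c)) via the equivalent asinh form."""
    return math.asinh(y / math.sqrt(c)) + 0.5 * math.log(c)


def replicate_sse(groups, c):
    """Within-group error sum of squares of the Jacobian-scaled glog data.

    groups is a list of replicate sets; each set holds repeated measurements
    of the same quantity (for example, one bin across replicate spectra).
    """
    ys = [y for g in groups for y in g]
    # Geometric mean of sqrt(y^2 + c): the inverse of the transform's derivative.
    scale = math.exp(sum(0.5 * math.log(y * y + c) for y in ys) / len(ys))
    sse = 0.0
    for g in groups:
        z = [glog(y, c) * scale for y in g]
        mean = sum(z) / len(z)
        sse += sum((v - mean) ** 2 for v in z)
    return sse


def estimate_c(groups, candidates):
    """Return the candidate c with the smallest within-replicate SSE."""
    return min(candidates, key=lambda c: replicate_sse(groups, c))


# Toy replicates: two near-baseline groups and three high-intensity groups.
groups = [
    [0.001, -0.002, 0.0015],
    [0.003, 0.0005, 0.002],
    [1.0, 1.1, 0.92],
    [5.0, 5.4, 4.7],
    [10.0, 9.1, 10.8],
]
candidates = [10.0 ** k for k in range(-8, 3)]
c_hat = estimate_c(groups, candidates)
print(c_hat, replicate_sse(groups, c_hat))
```

A coarse logarithmic grid like this can be refined around the best candidate, or replaced by Newton's method on the same objective.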

15
Transformed Spectrum
Bin size = 0.04 ppm, 239 bins, c = 2.7e-7
Calculate c from the replicate data by maximum likelihood methods
Transform the data to stabilize the variance, using $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$

16
Stabilized Variance
Bin size = 0.04 ppm, c = 2.7e-7

17
Scores Plot – Transformation Effects
[Figure panels: Untransformed Data; Transformed Data]

18
Loadings Plot – Transformation Effects
[Figure panels: Untransformed Data; Transformed Data]

19
Cluster Analysis - Hierarchical
[Figure panels: Untransformed Data; Transformed Data – Groups Clearly Identified]

20
Raw Spectra – Significant Bins
Bin 124 – 5.38 ppm
Bin 125 – 5.42 ppm
Bin 126 – 5.46 ppm
Bin 76 – 3.22 ppm
Bin 77 – 3.26 ppm
Bin 78 – 3.30 ppm
[Figure: raw spectra for Healthy, Stunted, and Sick abalone]
Glycogen, Sucrose, Fructose?

21
Conclusions
Demonstrated the use of data reduction and multivariate techniques for studying NMR and mass spectrometer data
Demonstrated the use of these techniques to identify metabolite and protein biomarkers
Showed that variance-stabilizing transformations make the data more useful for classification

22
Acknowledgements
David M. Rocke, CIPIC
David L. Woodruff, CIPIC
Mark R. Viant, U. of Birmingham, U.K.
