Presentation on theme: "Institute on Research and Statistics, Sacramento 04/08/04"— Presentation transcript:
1 Discrimination Models and Variance Stabilizing Transformations of Metabolomic NMR Data. Institute on Research and Statistics, Sacramento, 04/08/04. Parul Vora Purohit
2 Biodata and ‘omics. Genome Project. Genomics - Study of genes; Proteomics - Study of proteins; Metabolomics - Study of metabolites. *cellomics, CHOmics, chromonoics, etc. Analytical techniques: Microarray Spectroscopy; Mass Spectroscopy; NMR Spectroscopy *
3 NMR Spectroscopy. Intense, homogeneous magnetic field. High-powered RF transmitter capable of delivering short pulses (~500 MHz) that stimulate 1H nuclear spin transitions. Probe containing the coils used to excite and detect the signal. Plot of signal vs. shift in frequency from the original pulse, measured in ppm (ratio from the original signal). Courtesy ~ Joseph Medendorp / Public Information / University of Kentucky
4 NMR Data. Allows detection of compounds with H content. Shift characterizes the chemicals (metabolites). Examples: 2.14 ppm – glutamine – γ CH2 group; 2.27 ppm – valine – β CH group; 6.91 ppm – tyrosine – C3,5H ring. ~65,000 points (variables) per sample.
5 Questions. Classification ~ Can we distinguish sick organisms from healthy ones? Identification ~ Which metabolites play a role in the disease (biomarkers)? DIFFERENCES IN THE DETAILS!
6 Abalone Data. A set of 18 abalone: 8 healthy, 5 stunted, 5 sick. Tissue from muscle. Questions: Can we classify the abalone accurately? Can we detect any metabolites that are markers?
7 Problems / Solutions. Multivariate techniques. Matrix of 65,000 (variables) x 18 (samples): too many variables compared to the number of samples. Dimension reduction by binning. Classification and metabolite marker identification using PCA and cluster analysis. These methods assume that the data are normally distributed with constant variance. Generalized log transformation improves results!
8 NMR Data Pre-Processing. Background subtraction. TMSP peak (standard at 0 ppm) removed. Water peak removal (… ppm removed). Normalization: integrated intensity normalized to 1.0 to remove the effects of systematic intensity changes between abalone. Binning / Size.
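The normalization step on this slide can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the spectra sit in a NumPy array with one row per abalone on a uniform ppm grid, so that the integrated intensity can be approximated by a plain sum. The function name `normalize_total_intensity` and the demo values are hypothetical.

```python
import numpy as np

def normalize_total_intensity(spectra):
    """Scale each spectrum (one row per sample) so its intensities sum to 1.0.

    Assumption: a uniform ppm grid, so the integral is approximated by a sum.
    """
    spectra = np.asarray(spectra, dtype=float)
    totals = spectra.sum(axis=1, keepdims=True)
    return spectra / totals

# Hypothetical demo: two spectra that differ only by an overall intensity factor.
demo = np.array([[1.0, 3.0, 4.0],
                 [2.0, 6.0, 8.0]])
normalized = normalize_total_intensity(demo)
```

After normalization the two demo rows coincide, which is the sense in which a systematic intensity change between samples is removed.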
9 Binned Spectrum. Bin size range = 0.00125 ppm – 0.7 ppm. Bin size = 0.04 ppm; 239 bins. Intensity of bin = integrated intensity of all points in bin. Region of interest restricted to 0.2 ppm – 10.0 ppm.
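A rough sketch of fixed-width binning as described on this slide, assuming a uniform ppm axis and a plain sum as the "integrated intensity" of a bin. The function name `bin_spectrum` and the synthetic flat spectrum are hypothetical. The sketch does not reproduce the removal of the TMSP and water regions, so its bin count need not match the 239 bins quoted on the slide.

```python
import numpy as np

def bin_spectrum(ppm, intensity, bin_width=0.04, lo=0.2, hi=10.0):
    """Sum the intensities falling into fixed-width ppm bins on [lo, hi)."""
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    n_bins = int(round((hi - lo) / bin_width))
    edges = lo + bin_width * np.arange(n_bins + 1)
    keep = (ppm >= lo) & (ppm < hi)              # restrict to region of interest
    idx = np.floor((ppm[keep] - lo) / bin_width).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)            # guard against rounding at the top edge
    binned = np.bincount(idx, weights=intensity[keep], minlength=n_bins)
    return edges, binned

# Synthetic flat spectrum with roughly the point count mentioned in the talk.
ppm = np.linspace(0.0, 10.5, 65000)
intensity = np.ones_like(ppm)
edges, binned = bin_spectrum(ppm, intensity)
```

With these defaults the ~65,000 points per sample collapse to a few hundred variables, which is the dimension reduction the talk relies on before PCA.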
10 Principal Component Analysis (PCA). A technique that explains the variance-covariance structure of the variables through a few linear combinations of them: $X = t_1 p_1^T + t_2 p_2^T + \dots + t_k p_k^T + E$, where the $p_i$ are eigenvectors. Projections of the original data matrix onto these components give the relations between the samples (Scores Plot). A plot of the eigenvectors of the covariance matrix gives the relationships between the variables (Loadings Plot). Reduces the dimension of the problem; a few components suffice to explain the variance. * Courtesy Wise, B. M. and Gallagher, N. B., PLS_Toolbox 2.1
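The decomposition on this slide can be illustrated with a singular value decomposition. This is a sketch under stated assumptions rather than the PLS_Toolbox implementation: the data matrix is arranged as samples x variables, it is mean-centered before decomposition, and the function name `pca` and the random demo data are hypothetical.

```python
import numpy as np

def pca(X, k):
    """Return scores t_i, loadings p_i, explained-variance ratios and residual E."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each variable (assumption)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # columns are t_1 ... t_k
    loadings = Vt[:k].T                          # columns are p_1 ... p_k
    explained = s[:k] ** 2 / np.sum(s ** 2)
    residual = Xc - scores @ loadings.T          # E in the slide's formula
    return scores, loadings, explained, residual

# Hypothetical demo with the talk's dimensions: 18 samples, 239 binned variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 239))
scores, loadings, explained, E = pca(X, 3)
```

The right singular vectors of the centered matrix are the eigenvectors of its covariance matrix, so plotting `scores` gives a scores plot and plotting `loadings` gives a loadings plot.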
12 Cluster Analysis - Hierarchical. [Figure panels: Transformed Data – Groups Clearly Identified; Untransformed Data]
13 Generalized Log Transformation. It has been shown* that a transformation of the form $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$ can have a variance-stabilizing effect on the data. The parameter $c$ can be obtained by maximum likelihood or ANOVA methods and is approximately $c \approx \sigma^2 / S^2$, where $\sigma^2$ is the variance of the noise and $S^2$ the variance of the high peaks. *Durbin, B., Hardin, J., Rocke, D. M., Bioinformatics, 2002, 18, s105-s110. *Sue Geller, Jeff Gregg, Paul Hagerman, David Rocke, Transformation and Normalization of Oligonucleotide Microarray Data, 2003
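A minimal sketch of the generalized log transformation from this slide. The function name `glog` and the sample intensities are hypothetical; the value of `c` is the one quoted later in the talk for the binned abalone spectra.

```python
import numpy as np

def glog(y, c):
    """Generalized log: f(y) = ln(y + sqrt(y^2 + c)), with c > 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y + np.sqrt(y ** 2 + c))

c = 2.7e-7                                   # value quoted on the "Transformed Spectrum" slide
y = np.array([0.0, 1e-4, 1e-2])              # hypothetical normalized bin intensities
t = glog(y, c)
```

For intensities much larger than sqrt(c) the transform behaves like ln(2y), while near zero it is approximately linear, so it stays finite for zero and for small negative values where an ordinary logarithm is undefined.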
14 Maximum Likelihood*. Replicates are needed to determine SSE(c) accurately. Find the c that minimizes the SSE. Step through c using Newton’s method or educated intervals. [Plot: error sum of squares vs. c] * Box, G. and Cox, D. R. (1964) An analysis of transformations. J. Roy. Stat. Soc. Series B (Methodological), 26, 211.
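One way to read this slide is as a Box-Cox-style profile search: transform the replicates for each candidate c, rescale by the Jacobian so that SSE values are comparable across c, and keep the c with the smallest SSE. The sketch below makes several assumptions that the slide does not confirm: the SSE is taken around the per-bin replicate mean, the Jacobian rescaling follows the Box and Cox approach, and a log-spaced grid stands in for the "educated intervals". The function names and the synthetic data are hypothetical.

```python
import numpy as np

def glog(y, c):
    """Generalized log: ln(y + sqrt(y^2 + c))."""
    return np.log(y + np.sqrt(y ** 2 + c))

def normalized_sse(reps, c):
    """SSE of Jacobian-normalized glog data around each bin's replicate mean.

    reps: array of shape (n_replicates, n_bins).
    """
    # d/dy glog(y) = 1 / sqrt(y^2 + c); dividing by the geometric mean of the
    # derivative equals multiplying by the geometric mean of sqrt(y^2 + c).
    scale = np.exp(np.mean(0.5 * np.log(reps ** 2 + c)))
    z = glog(reps, c) * scale
    resid = z - z.mean(axis=0, keepdims=True)
    return float(np.sum(resid ** 2))

def fit_c(reps, candidates):
    """Grid search (an assumption; Newton's method is the slide's alternative)."""
    sses = [normalized_sse(reps, c) for c in candidates]
    return candidates[int(np.argmin(sses))], sses

# Synthetic replicates with additive plus multiplicative error (not real data).
rng = np.random.default_rng(0)
true = rng.exponential(1e-3, size=200)
reps = true * np.exp(0.1 * rng.normal(size=(5, 200))) + 1e-5 * rng.normal(size=(5, 200))
candidates = np.logspace(-12, -4, 33)
best_c, sses = fit_c(reps, candidates)
```

Plotting `sses` against `candidates` gives the kind of error-sum-of-squares-versus-c curve the slide shows; a finer grid or a Newton step around `best_c` would refine the estimate.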
15 Transformed Spectrum. Calculate c from the replicate data by maximum likelihood methods. Transform the data with $f(y) = \ln\left(y + \sqrt{y^2 + c}\right)$ to stabilize the variance. Bin size = 0.04 ppm; 239 bins; c = 2.7e-7.
21 Conclusions. Demonstrated the use of data reduction and multivariate techniques for studying NMR and Mass Spectrometer data. Demonstrated the use of these techniques to identify metabolite and protein biomarkers. Showed that transformations make the data better suited to these analyses.
22 Acknowledgements. David M. Rocke, CIPIC; David L. Woodruff, CIPIC; Mark R. Viant, U. of Birmingham, U.K.