Multivariate Data Analysis for Metabolomics Data generated by MS / NMR Spectroscopy Metabolomics Workshop Research Triangle Park / NC July 14th-15th H.


1 Multivariate Data Analysis for Metabolomics Data generated by MS / NMR Spectroscopy Metabolomics Workshop Research Triangle Park / NC July 14th-15th H. Thiele, Bruker Daltonik, Bremen

2 Why do Metabolic Profiling? Clinical Diagnostics: find metabolic markers for disease progression (e.g. cancer); diagnose inborn errors or other diseases; study of genetic differences. Toxicology: markers for drug toxicity and drug efficacy; analyze time course of toxicological response. Food Science: quality control / classification of origin; health/flavor enhancement of agrochemical products. MS + NMR together: Parallel Statistics >> more confidence; Hyphenation >> ultimate characterization tool.

3 Fundamental Issue in MVS: Dimension Reduction. Bucketing: several bucketing techniques for optimum design of variables.

4 Dynamic Peak Bucketing Scheme. Spectra are bucketed one by one. The bucket table gets a new column whenever a new peak occurs. Spectra not having peaks at new positions get a corresponding 0. [Figure: peak lists a1..a5, b1..b6 and c1..c6 merged step by step into one bucket table, with 0 entries where a spectrum has no peak.]
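
The scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `(position, intensity)` peak format and the matching tolerance `tol` are hypothetical and do not come from the slides, which do not say how two peaks are judged to sit at the same position.

```python
def dynamic_peak_bucketing(spectra, tol=0.01):
    """Build a bucket table from peak lists, one spectrum at a time.

    spectra: list of peak lists, each a list of (position, intensity) pairs.
    tol: assumed matching tolerance; peaks closer than this share a bucket.
    Returns (positions, table); table[i][j] is the intensity of spectrum i
    in bucket j, or 0.0 if that spectrum has no peak there.
    """
    positions = []  # one entry per bucket (column), in order of first appearance
    table = []      # one row per spectrum
    for peaks in spectra:
        row = [0.0] * len(positions)
        for pos, inten in peaks:
            # Look for an existing bucket within the tolerance.
            match = next((j for j, p in enumerate(positions)
                          if abs(p - pos) <= tol), None)
            if match is None:
                # New peak position: add a column and zero-fill earlier spectra.
                positions.append(pos)
                for earlier in table:
                    earlier.append(0.0)
                row.append(inten)
            else:
                row[match] += inten
        table.append(row)
    return positions, table
```

Columns appear in the order in which new peaks are first seen; sorting them by position afterwards would be a separate step.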

5 Kernel Bucketing for LC-MS Data. Bucketing turns the LC-MS chromatograms of N samples into a table of N samples: one row per sample (Sample X, Sample Y, Sample Z, ...), one column per Rt-m/z pair (1 min - 535 m/z, 1 min - 536 m/z, 1 min - 537 m/z, ...), with intensities as entries. Bucketing parameters, e.g.: m/z bucket width = 1 Da, kernel 0.3 Da; time bucket width = 60 s, kernel 10 s. [Figure: mass spectrum (intensity vs. m/z) and bucket grid over time [s] and m/z.]

6 Which Bucketing Technique to Use? Rectangular, equidistant bucketing: standard, good compromise if there is no a priori knowledge. Variable-sized bucketing: makes shifts ineffective, allows selective usage. Point-wise bucketing: often used for broad-line spectra, as a special case of rectangular, equidistant bucketing. Dynamic peak bucketing: allows very fine bucketing without getting huge tables, requires stable shifts or masses. Kernelized bucketing: variant of rectangular bucketing to reduce the effect of shifts.

7 Data Preprocessing: Spectral Background Subtraction. Measured data are contaminated by solvents and chemical noise. The intensity of contaminants may dominate the relevant data. [Figure panels: chemical noise; solvent at m=75.2; baseline and scaled noise estimate; detection of traces by dynamic grouping.]

8 Data Preprocessing: Spectral Background Subtraction. Subtraction of the spectral background makes relevant data visible. [Figure: Base Peak Chromatogram (BPC), intensity vs. time, before and after background subtraction; the traces of m=180.2 and m=208.2 are hidden in the BPC before subtraction and visible after it.]

9 Peak Picking Tasks. Find compounds defined by RT, m/z, z and area. Group isotopic peaks and charge states together.

10 Multivariate Statistics in Metabolomics. How to analyze large numbers of complex LC-MS chromatograms or NMR spectra with the target of simple discrimination or grouping (healthy / non-healthy, high / low quality)? Spectroscopy: NMR or MS = sensor; LC-MS chromatogram or NMR spectrum = fingerprint. Use Pattern Recognition Techniques!

11 Pattern Recognition (PR). Objectives of PR: statistical characterization; model building; classification. Methods of PR: Exploratory Data Analysis: statistical tests; Principal Component Analysis (PCA) >> variance analysis. Unsupervised Pattern Recognition: cluster analysis. Supervised Pattern Recognition: discriminant analysis (LDA, PCA-DA); classification of samples by various means, e.g. Genetic Algorithm, SVM, …

12 Idea of Principal Component Analysis (PCA). [Figure: data plotted in x/y coordinates with the principal axes PC 1, PC 2 and PC 3.]

13 Classification using PCA. Input: spectra list, classification model. Steps: bucketing; coordinate transformation; distance measures; comparison to critical values.

14 Coordinate Transformation. [Figure: original axes ppm1 and ppm2 transformed into PC 1 and PC 2; scores and loadings.]
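
A minimal sketch of this coordinate transformation, assuming the bucket table is a NumPy array with one row per spectrum and one column per bucket. The function name and the choice of an SVD on the mean-centered table are illustrative assumptions; the slides do not say how the software computes scores and loadings, or whether buckets are scaled first.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """PCA of a bucket table X (rows = spectra, columns = buckets) via SVD."""
    Xc = X - X.mean(axis=0)                          # mean-center each bucket
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # coordinates of each spectrum
    loadings = Vt[:n_components].T                   # weight of each bucket per PC
    explained = (s ** 2) / np.sum(s ** 2)            # variance fraction per PC
    return scores, loadings, explained[:n_components]
```

The scores are the centered data projected onto the loadings, so each spectrum becomes a point in the PC 1 / PC 2 plane while the loadings say which buckets drive each axis.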

15 Baby Urine Samples: PCA - NMR. 200 normal candidates. The pattern in the PC1/PC2 scores plot reveals candidates with inborn errors: orotic aciduria, mevalonic aciduria, maple syrup disease.

16 New Born Screening by NMR. > 400 baby urines, PCA, disease vectors indicating strength of metabolic disorders. Bucket analysis from 9 to 0.4 ppm in 0.04 ppm steps. Excluded: 6 to 4.5 ppm (residual water and urea). Results of BEST-NMR at 600 MHz, 1D spectra, NOESY presat, 64 scans. [Scores plots: PC1/PC2, PC3/PC4, PC11/PC12.]
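
The bucketing parameters quoted on this slide (9 to 0.4 ppm, 0.04 ppm steps, 6 to 4.5 ppm excluded) can be turned into a short rectangular-bucketing sketch. The function name, the use of summed intensity per bucket, and the rule of dropping every bucket that overlaps the excluded region are assumptions made for illustration.

```python
import numpy as np

def bucket_spectrum(ppm, intensity, hi=9.0, lo=0.4, width=0.04,
                    exclude=(4.5, 6.0)):
    """Equidistant bucketing of one 1D spectrum, dropping an excluded region."""
    n_buckets = int(round((hi - lo) / width))   # 215 buckets for the defaults
    edges = np.linspace(lo, hi, n_buckets + 1)
    # Sum the intensities of all data points that fall into each bucket.
    sums, _ = np.histogram(ppm, bins=edges, weights=intensity)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Keep only buckets lying entirely outside the excluded region
    # (small epsilon guards against floating-point edge effects).
    eps = 1e-9
    keep = (edges[1:] <= exclude[0] + eps) | (edges[:-1] >= exclude[1] - eps)
    return centers[keep], sums[keep]
```

Applying this to every spectrum and stacking the resulting rows gives the bucket table that PCA works on; normalization of each row is left out here.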

17 New Born Screening by NMR. Hippuric acid vector: the distance from the distribution of normals is a measure for the concentration of the molecule representing an inborn error. [Spectrum: CH2 group of hippuric acid.]

18 PCA: NMR vs. LC-MS. Fig. 1: Scores plot of NMR data from baby urines (born 2003). Fig. 2: Scores plot of LC-MS data from a subset of baby urines (born 2003). [Axes: PC 1 scores, PC 2 scores.]

19 LC-MS data of Samples 114 and 94. Generate the molecular formula of mass 263.1037 m/z @ 5.9 min. The determined formula C13H15N2O4 corresponds to phenyl-acetylglutamate. [Figure: eXpose 94 vs. 114; base peak chromatograms (BPC, -All MS) of Sample 94 and Sample 114, intensity * 10^4 vs. t [min], with a peak labeled 263.1; -MS spectrum at 5.5-6.8 min with peaks at 263.1037 and 264.1062 / 264.1068 m/z; calculated vs. measured isotopic pattern for C13H15N2O4, 263.10, over m/z 260-267.]
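
As a plausibility check on the formula quoted here, the monoisotopic m/z of C13H15N2O4 can be computed directly. The sketch assumes the formula describes a singly charged negative ion (the spectrum is labeled "-MS"); the isotope masses are standard values, and the helper names are hypothetical.

```python
# Monoisotopic masses (u) of the most abundant isotopes, and the electron mass.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
ELECTRON = 0.00054858

def ion_mz(formula, charge):
    """m/z of an ion with the given elemental composition and signed charge."""
    neutral = sum(MONO[element] * count for element, count in formula.items())
    # A negative charge means extra electrons, a positive one missing electrons.
    return (neutral - charge * ELECTRON) / abs(charge)

def ppm_error(measured, theoretical):
    """Relative mass deviation in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

mz = ion_mz({"C": 13, "H": 15, "N": 2, "O": 4}, charge=-1)
error = ppm_error(263.1037, mz)
```

Under that assumption the calculated m/z agrees with the measured 263.1037 to well below 1 ppm.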

20 Combining Spectra and Statistical Data. Interpretation of scores / loadings: Loadings in PCA indicate the importance of the original variables (buckets) in the variance space. In ideal cases a set of loadings refers to signals of a compound.

21 Analysis of Bucket Variables. [Screenshot: loadings plot, bucket table, data viewer, menu bar / options.]

22 Covariance matrix. The covariance matrix looks like a TOCSY; cross-peaks indicate correlated fluctuations. This includes multi-molecular fluctuations. Rows at the cursor position are shown on top of the 2D matrix.
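
A small sketch of how such a covariance matrix could be computed from a bucket table, and how the row at a chosen bucket can be inspected. The function names are hypothetical and the bucket table layout (rows = spectra, columns = buckets) is an assumption.

```python
import numpy as np

def bucket_covariance(X):
    """Bucket-by-bucket covariance of a table X (rows = spectra)."""
    return np.cov(X, rowvar=False)

def most_correlated(cov, j, top=3):
    """Indices of the buckets that co-vary most strongly with bucket j."""
    row = cov[j].copy()
    row[j] = -np.inf                  # ignore the diagonal (the variance itself)
    return np.argsort(row)[::-1][:top]
```

A large off-diagonal entry plays the role of a cross-peak: the two buckets rise and fall together across the sample set, whether they belong to one molecule or to several.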

23 Covariance Analysis. Interesting rows can be saved to disk as 1D NMR spectra and used for spectra base searching like any other 1D spectrum. Often, a small number of compounds from the spectra base match well while others do not. [Figure: row from covariance matrix compared with reference spectrum from spectra base.]

24 PCA analysis of 69 newborn urine LC-MS spectra. 1: Select two LC-MS runs differing in their PC1 values from the scores plot (PC1-PC2). 2: Select a bucket (spectral region) with a high PC1 value from the loadings plot (PC1-PC2). Then generate the sum formula. [Figure: LC-MS (run 1) and LC-MS (run 2) in the selected region, one showing a peak and the other no peak.]

25 Sum-Formula Generation. Inputs: experimental peak intensities & masses; isotope masses and abundances; molecular constraints (electron configuration M+*, M+; C/H ratio, elemental limits; N-rule; double bond equivalents; isotope distribution). Processing: fast formula generator using CHNO algorithm; fast, exact calculation of isotopic patterns. Output: list of hits & mass/intensity patterns.

26 Formula Scoring: Isotopic pattern as an additional decision criterion for elemental composition. Experimental intensities & masses are compared with the theoretical intensities & masses from the list of hits & mass/intensity patterns. Three independent scores: intensity ratios; intensity-weighted mean masses; intensity-weighted peak distances.

27 Calculating the elemental composition. Simulated mass spectrum of Chlorpyriphos: 12C isotope peak; 13C isotope peak (11% int.); 3 x Cl isotope peaks.

28 Clinical Proteomics. The samples are different. The experimental and mathematical techniques are similar. But the goal is the same.

29 Workflow Clinical Proteomics. Patients > Serum Samples > Isolation (Binding, Washing, Elution) > Analysis (Detection by MALDI-TOF MS) > Clinical Results (Disease vs. Normal). W. Pusch et al., Pharmacogenomics (2003) 4(4), 463-476.

30 Data Preparation. Aim: Extraction of the same set of features from each individual spectrum. These features will be used for model generation and later for classification of new spectra. As with metabonomics, the identification of the features is of great interest. Steps: quality checks for spectra; recalibration, baseline correction, noise reduction; peak detection and area calculation; normalization of peak areas.

31 Tasks for Clinical Proteomics Data Analysis. 1. Data preprocessing 2. Peak annotation 3. Statistical characterization 4. Discriminant analysis. Most of the tasks are quite similar for both kinds of applications except for the dimensionality of the original data: 1D for MALDI, 2D for LCMS-ESI MS.

32 Data Preprocessing: Recalibration. Problem: peaks are not aligned to each other, as in this example; for LC-MS it is usually the retention time that is misaligned. Solution: application of a recalibration algorithm. Result: peaks are aligned to each other.

33 Data Preprocessing: Recalibration. Selection of prominent peaks (e.g. 30% occurrence). Use this peak list with average masses as calibrants. Assignment of peaks and calibrants with a mass tolerance. Recalibration of all spectra by solving least-squares problems for a linear mass shift. [Figure: Spec 1 and Spec 2, intensity vs. m/z, and mass shift vs. m/z with the fitted linear mass shift; labels: prominent peak shifted; above mass tolerance; in tolerance but not prominent.]
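
The assignment and least-squares steps listed above might look like the following for a single spectrum. This is a sketch under assumptions: the calibrant list is taken as given (selecting prominent peaks is not shown), each calibrant is matched to its nearest peak, and the function name and default tolerance are hypothetical.

```python
import numpy as np

def recalibrate(peaks, calibrants, tol=2.0):
    """Correct peak masses with a linear mass shift fitted to the calibrants."""
    peaks = np.asarray(peaks, dtype=float)
    observed, reference = [], []
    for c in calibrants:
        j = np.argmin(np.abs(peaks - c))      # nearest peak to this calibrant
        if abs(peaks[j] - c) <= tol:          # accept only within the tolerance
            observed.append(peaks[j])
            reference.append(c)
    if len(observed) < 2:
        return peaks                          # too few matches to fit a line
    # Least-squares fit of shift = a * mass + b, with shift = reference - observed.
    a, b = np.polyfit(observed, np.subtract(reference, observed), 1)
    return peaks + a * peaks + b
```

Peaks outside the tolerance do not influence the fit but are still corrected with the fitted shift.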

34 Data Transformation: Wavelet vs. Fourier. Aim: Transformation of the spectrum from the time-amplitude (mass spectra) domain into a time-frequency (Fourier) or time-scale (Wavelet) representation. Benefit: decomposition into distinct frequency/scale bands; significant features (peaks, patterns) occur on specific frequencies/scales. Steps for wavelet decomposition: low-pass filter -> approximation coefficients; high-pass filter -> detail coefficients.
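
One level of this low-pass / high-pass split can be written out by hand. The sketch uses the Haar wavelet purely because it is the simplest case; the slides do not say which wavelet family is used, and the function names are hypothetical.

```python
import math

def haar_step(signal):
    """One level of Haar decomposition; len(signal) must be even."""
    s = 1.0 / math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * s for a, b in pairs]   # low-pass: local averages
    detail = [(a - b) * s for a, b in pairs]   # high-pass: local differences
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * s, (a - d) * s])
    return signal
```

Applying `haar_step` again to the approximation coefficients yields the next, coarser scale, which is how the distinct scale bands arise.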

35 Wavelets for Feature Extraction. Aim: Determine features from the spectra which are discriminant for class separation. Method: The wavelet transformation gives information about the signal localized in time (m/z) and frequency. Lower frequencies correspond to coarse structural information of the spectrum; higher frequencies correspond to fine/detailed information. In contrast to FFT, we learn where the feature is located in time (m/z). Feature selection: we get many features (depending on time resolution), so brute force + sophisticated feature selection is needed.

36 Peak detection. Problem: Common sets of peaks are needed for the later model stage. Different peaks vary to a different extent over all spectra. Small peaks that nevertheless give a good separation between classes might be overlooked when only single spectra are considered. [Figure: single spectrum vs. average spectrum.]

37 Peak detection. Solution - ClinProTools: Determination of peak positions by use of the average spectrum. Integration between start and end masses for detected peaks. [Figure: average spectrum; blue areas indicate picked peaks, red areas indicate picked peaks in the model (see later).]

38 Peak detection and area calculation. ClinProTools: area of peak between fixed start and end points, measured from the end point level or from the zero level. Other possibilities: peak intensity; peak range. [Figure: peak, intensity vs. m/z, with start, end, peak intensity, peak range, end point level and zero level marked.]
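
The two area definitions can be illustrated with trapezoidal integration. This is not ClinProTools code: the function name is hypothetical, and "end point level" is interpreted here as a straight baseline drawn between the intensities at the start and end points, which is an assumption.

```python
def peak_area(mz, intensity, start, end, baseline="zero"):
    """Trapezoidal area of a peak between fixed start and end masses."""
    pts = [(m, y) for m, y in zip(mz, intensity) if start <= m <= end]
    if len(pts) < 2:
        return 0.0
    # Area above the zero level.
    area = sum(0.5 * (y0 + y1) * (m1 - m0)
               for (m0, y0), (m1, y1) in zip(pts, pts[1:]))
    if baseline == "endpoint":
        # Subtract the trapezoid under the line joining the two end points.
        (m0, y0), (m1, y1) = pts[0], pts[-1]
        area -= 0.5 * (y0 + y1) * (m1 - m0)
    return area
```

Because the start and end masses come from the average spectrum and stay fixed, the same integration window yields one comparable feature per peak for every spectrum.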

39 Average spectra per class. Peak at ca. 2022 Da: Idx 24 in the model of the 15 most important peaks for GA and SVM. [Panels: average spectra of class 1, class 2, class 3 and class 4.]

40 Univariate Statistics: Getting a basic idea about the data. Descriptive / robust statistics. Welch's t-test / Wilcoxon test. Calculation based on: peak intensities / peak areas. [Table: peak area statistics sorted by p-value.]
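
For reference, Welch's t statistic and its degrees of freedom can be computed with the standard library alone. The sketch stops short of p-values, which need a t-distribution CDF from a statistics package; ranking by |t| is used as a stand-in and matches the p-value order only when the degrees of freedom are comparable. Function names and the dictionary layout are assumptions.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def rank_peaks(class_a, class_b):
    """Peak names ordered by |t|, largest first.

    class_a, class_b: dicts mapping a peak name to its list of areas
    in the spectra of that class.
    """
    return sorted(class_a,
                  key=lambda k: abs(welch_t(class_a[k], class_b[k])[0]),
                  reverse=True)
```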

41 Algorithms for Discriminant Analysis. Some alternatives to classical linear DA: Feature selection: Genetic Algorithms (GA) + Cluster Analysis; Support Vector Machines (SVM).

42 GA: Application to MS data. Solution = combination of peaks. Initial population: randomly generated solutions. Start with multiple initial populations using a migration schema. Each solution is assigned a fitness value according to its ability to separate two or more classes (using centroid or KNN clustering and by determination of between- and within-class distances). New generations of the population are formed using: Selection: the fitter a solution, the higher its chance of being selected as a parent. Crossover: parents form new solutions by exchanging some of their peaks; new solutions replace parents. Mutation: random changes in solutions. Result: combinations of peaks which separate the classes best.

43 GA: Genetic Evolution. Start set: Chromosome 1 = 1000, 1200, 1700, 2100, 2500; Chromosome 2 = 1700, 1800, 2000, 2200, 2300. Mutation: 1000, 1200, 1500, 2100, 2500 and 1700, 1800, 2000, 2150, 2300. Crossover: 1000, 1200, 1500, 2150, 2300 and 1700, 1800, 2000, 2100, 2500. Selection: fitness test using k-NN; discard disadvantages, keep advantages. 50-500 cycles.
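
The evolution cycle can be sketched as a toy genetic algorithm over lists of peak masses. Several simplifications are assumptions rather than the method on the slides: selection keeps the fitter half of the population instead of drawing parents with fitness-proportional chance, parents survive alongside their children, a single population is used without migration, and the fitness function is supplied by the caller. All names are hypothetical.

```python
import random

def crossover(p1, p2, rng):
    """Exchange the peaks after a random cut point (equal-length parents)."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, candidates, rng, rate=0.1):
    """Replace each peak by a random candidate peak with probability `rate`."""
    return [rng.choice(candidates) if rng.random() < rate else g for g in chrom]

def evolve(population, fitness, candidates, generations=50, seed=0):
    """Return the fittest chromosome found after the given number of cycles."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, len(ranked) // 2)]   # keep the fitter half
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            for child in crossover(a, b, rng):
                children.append(mutate(child, candidates, rng))
        population = (parents + children)[:len(population)]
    return max(population, key=fitness)
```

In the setting of the slides, `fitness` would score how well the selected peaks separate the classes, for example with the k-NN or centroid clustering described on the following slides. This sketch does not prevent the same peak from appearing twice in a chromosome.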

44 KNN: k-nearest neighbor clustering. Spectrum = point in R^n (e.g. areas of selected peaks). Determination of the k nearest neighbors for each spectrum. Classification of all points using the classes of neighboring points. Example: point A is classified as class 2, point B as class 1. Fitness value: percentage of correctly classified points; calculation of between/within distances. [Figure: points of class 1 and class 2 with the query points A and B.]
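
A compact sketch of this fitness value, assuming Euclidean distance, majority voting and a leave-one-out evaluation; none of these details are specified on the slide, and the function names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Class of `point` by majority vote among its k nearest training points."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], point))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def knn_fitness(points, labels, k=3):
    """Percentage of points classified correctly by their neighbors."""
    hits = 0
    for i, p in enumerate(points):
        # Leave the point itself out so it cannot vote for its own class.
        rest = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest, rest_labels, p, k) == labels[i]
    return 100.0 * hits / len(points)
```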

45 Centroid clustering. Spectrum = point in R^n (e.g. areas of selected peaks). Spectra are analyzed one by one (iterative process): if it is the first spectrum or too far away from all existing clusters, a new cluster with just this spectrum is created; otherwise it is assigned to the nearest cluster and the centroid is recalculated. Fitness value: purity & number of clusters (optimal: k clusters, each with all spectra of one class); calculation of between/within distances.
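
The iterative procedure translates almost line by line into code. The distance threshold `max_dist`, the Euclidean metric and the function name are assumptions; the slide only says "too far away".

```python
import math

def centroid_clustering(points, max_dist):
    """Assign points one by one to the nearest centroid or open a new cluster."""
    clusters = []  # each cluster: {"centroid": [...], "members": [point indices]}
    for idx, p in enumerate(points):
        best = min(clusters, key=lambda c: math.dist(c["centroid"], p),
                   default=None)
        if best is None or math.dist(best["centroid"], p) > max_dist:
            # First spectrum, or too far from every existing cluster.
            clusters.append({"centroid": list(p), "members": [idx]})
        else:
            best["members"].append(idx)
            n = len(best["members"])
            # Update the centroid as the running mean of its members.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], p)]
    return clusters
```

The result depends on the order in which spectra are presented, which follows directly from the one-by-one scheme.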

46 GA: Results. Prediction capability of GA (plot for best 2 peaks): ~91% pred.*, ~76% pred.*, ~66% pred.* (* prediction accuracy for a model with 25 peaks).

47 SVM: Support Vector Machine. SVM: calculation of the direction in R^n which separates best between two classes (supervised method). PCA (principal component analysis): calculation of the direction in R^n which best explains variability (unsupervised method, i.e. without looking at class memberships). [Figure labels: PCA, SVM, support vectors.]

48 What is SVM – Basic problem. Assume we have two classes of data points with two peaks. We look for the line which optimally separates these 2 classes. Many decision boundaries can separate these two classes. Which should be chosen? The green boundaries are valid but bad ones. [Figure: Class 1 and Class 2 with candidate boundaries.]

49 What is SVM – Basic idea. The decision boundary should be as far away from the data of both classes as possible. We should maximize the margin, m. This problem can be solved by mathematical optimization theory. [Figure: Class 1 and Class 2 separated by a boundary with margin m.]
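
For reference, the textbook hard-margin form of this optimization problem is given below. The symbols are not from the slides and are introduced here as assumptions: x_i are the spectra as points in R^n, y_i in {-1, +1} are the class labels, and w and b define the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad\text{subject to}\quad
y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1,\dots,N,
\qquad\text{with margin } m = \frac{2}{\lVert w \rVert}.
```

Minimizing the norm of w is equivalent to maximizing m, and the points for which the constraint holds with equality are the support vectors.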

50 SVM: Application to MS data. SVM: quadratic optimization problem, solved by an iterative process using Sequential Minimal Optimization. In the simplest case a hyperplane separating the classes is calculated. From this, the contribution of individual peaks is calculated. From spectra in 3 classes we get: separating hyperplanes; recognition & prediction accuracy; peak ranking highlighting potential biomarker patterns.

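51 SVM: Application to MS data - results. Accuracy: Class 1: Rec. 90, Pred. 69; Class 2: Rec. 86, Pred. 83; Class 3: Rec. 90, Pred. 76. Peak ranking: Rank 1: Index 7, Mass 1212; Rank 2: Index 17, Mass 1450; Rank 3: Index 18, Mass 1469; … Note: it is plotted in 2D but is in fact high-dimensional.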

52 SVM: Results. Prediction capability of SVM (plot for best 2 peaks): ~70% pred.*, ~93% pred.*, ~75% pred.* (* prediction accuracy for a model with 25 peaks).

