Multivariate Data Analysis for Metabolomics Data generated by

1 Multivariate Data Analysis for Metabolomics Data generated by
MS / NMR Spectroscopy
Metabolomics Workshop, Research Triangle Park / NC, July 14th-15th
H. Thiele, Bruker Daltonik, Bremen

2 Why do Metabolic Profiling ?
Clinical Diagnostics
- find metabolic markers for disease progression (e.g. cancer)
- diagnose inborn errors or other diseases
- study of genetic differences
Toxicology
- markers for drug toxicity and drug efficacy
- analyze time course of toxicological response
Food Science
- quality control / classification of origin
- health/flavor enhancement of agrochemical products
MS + NMR together
- Parallel Statistics >> more confidence
- Hyphenation >> ultimate characterization tool

3 Fundamental Issue in MVS: Dimension Reduction
The application of multivariate analysis techniques such as PCA to NMR and MS data is well established. Spectra first need to be carefully processed and calibrated and are then subjected to bucketing. The resulting bucket table is used as input for the PCA calculation. Several bucketing techniques are available for an optimum design of variables.
Figure: bucketing methods (integral): a) rectangular, equidistant bucketing, b) variable size bucketing, c) pointwise bucketing.
Some bucketing methods are based on area integration:
- Rectangular bucketing divides each spectrum into a set of equally sized buckets. This is a good compromise if no a priori knowledge is available.
- Pointwise bucketing is a special case of rectangular bucketing. It is often used for broad line spectra.
- Variable sized bucketing allows the definition of arbitrary regions. By choosing suitable sizes, unwanted signal shifts can be made ineffective. By omitting individual spectral areas, unwanted signals can be excluded from the bucket table.
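Rectangular bucketing with excluded regions can be sketched as follows. This is a minimal illustration, not the software's implementation: the function name, its parameters and the use of summed intensities as a stand-in for area integration are all assumptions made for the example.

```python
def rectangular_buckets(positions, intensities, start, stop, width, exclude=()):
    """Sum intensities into equally sized buckets covering [start, stop).

    positions, intensities: equal-length sequences describing one spectrum.
    exclude: (low, high) regions that are left out of the bucket table.
    Returns (bucket_centres, bucket_sums).
    """
    n_buckets = int(round((stop - start) / width))
    sums = [0.0] * n_buckets
    for x, y in zip(positions, intensities):
        if not (start <= x < stop):
            continue  # outside the bucketed range
        if any(low <= x < high for low, high in exclude):
            continue  # inside an excluded region
        index = min(int((x - start) / width), n_buckets - 1)
        sums[index] += y
    centres = [start + (i + 0.5) * width for i in range(n_buckets)]
    return centres, sums


# Toy spectrum: five points, buckets of width 1, region [2, 3) excluded.
centres, sums = rectangular_buckets(
    [0.2, 0.8, 1.5, 2.5, 3.9], [1.0, 2.0, 3.0, 4.0, 5.0],
    start=0.0, stop=4.0, width=1.0, exclude=[(2.0, 3.0)])
print(centres, sums)
```

One row of such sums per spectrum gives the bucket table that is passed on to PCA.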

4 Dynamic Peak Bucketing Scheme
[Figure: bucket table built up spectrum by spectrum. Peaks a1-a5 of the first spectrum define the initial columns; peaks b1-b6 and c1-c6 of later spectra add new columns, and spectra without a peak at a column position get 0.]
Dynamic peak bucketing is based on peak positions and peak intensities:
- The spectra are bucketed one by one.
- The bucket table gets a new column whenever a new peak occurs; whether a peak counts as new is decided within a delta that depends on the dimension of the spectra.
- Spectra that have no peak at a new position get a corresponding 0.
Dynamic peak bucketing allows very fine bucketing without producing huge tables. It requires stable signal positions (ppm, mass, time) in carefully calibrated spectra. It is the method of choice for LC-MS data.
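The scheme described above can be sketched in a few lines. This is an illustrative reading of the slide, assuming one-dimensional peak positions and a single fixed `delta`; the function name and the nearest-column matching rule are assumptions.

```python
def dynamic_peak_buckets(peak_lists, delta):
    """Build a bucket table from per-spectrum peak lists.

    peak_lists: one list of (position, intensity) pairs per spectrum.
    delta: maximum distance at which a peak is assigned to an existing column.
    Returns (column_positions, table) with one row per spectrum.
    """
    columns = []  # position of each column, in order of creation
    rows = []     # one {column_index: intensity} dict per spectrum
    for peaks in peak_lists:
        row = {}
        for position, intensity in peaks:
            # Find the closest existing column within delta, if any.
            match = None
            for j, column in enumerate(columns):
                distance = abs(column - position)
                if distance <= delta and (
                        match is None or distance < abs(columns[match] - position)):
                    match = j
            if match is None:  # a new peak: the table gets a new column
                columns.append(position)
                match = len(columns) - 1
            row[match] = row.get(match, 0.0) + intensity
        rows.append(row)
    # Sort columns by position and fill missing entries with 0.
    order = sorted(range(len(columns)), key=lambda j: columns[j])
    table = [[row.get(j, 0.0) for j in order] for row in rows]
    return [columns[j] for j in order], table


positions, table = dynamic_peak_buckets(
    [[(100.0, 5.0), (200.0, 3.0)], [(100.05, 4.0), (150.0, 2.0)]], delta=0.1)
print(positions)  # column positions
print(table)      # one row per spectrum
```

The table only grows where peaks actually occur, which is why the bucketing can be very fine without producing a huge table.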

5 Kernel Bucketing for LC-MS Data
[Figure: LC-MS chromatograms of N samples (time, m/z, intensity) are bucketed on a fixed time / m/z grid into an intensity table of N samples, with Rt-m/z pairs (e.g. 1 min - 535 m/z, 1 min - 536 m/z, 1 min - 537 m/z) as columns and samples X, Y, Z as rows.]
Bucketing parameters, e.g.:
- m/z bucket width = 1 Da, kernel 0.3 Da
- time bucket width = 60 s, kernel 10 s
Kernel idea: distribute the intensity of peaks near the border of the buckets.
Pro:
- minimal data manipulation
- strict bucketing (with a fixed grid as above) allows direct classification (number of variables = const). We create feature vectors of constant length -> create a model using class labels -> match new data against the model.
Contra:
- exact peak position and accurate mass information cannot be retrieved directly from the bucket
- restricted performance due to the handling of large amounts of data
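The kernel idea can be illustrated in one dimension. The slide does not specify the kernel shape, so the linear sharing rule below (a peak exactly on a border is split 50/50, a peak further than half the kernel width from any border stays in its bucket) is purely an assumption for illustration, as are the function and parameter names.

```python
import math


def kernel_bucket(positions, intensities, start, width, n_buckets, kernel):
    """Bucket peaks on a fixed grid, sharing intensity across nearby borders.

    kernel: total width of the zone around each border in which a peak's
    intensity is split between the two adjacent buckets (assumed linear).
    """
    half = kernel / 2.0
    table = [0.0] * n_buckets
    for x, y in zip(positions, intensities):
        rel = (x - start) / width
        i = math.floor(rel)
        if not (0 <= i < n_buckets):
            continue  # peak outside the grid
        d_low = (rel - i) * width  # distance to the lower border
        d_high = width - d_low     # distance to the upper border
        share, neighbour = 0.0, None
        if d_low < half and i > 0:
            share, neighbour = 0.5 * (1.0 - d_low / half), i - 1
        elif d_high < half and i < n_buckets - 1:
            share, neighbour = 0.5 * (1.0 - d_high / half), i + 1
        table[i] += y * (1.0 - share)
        if neighbour is not None:
            table[neighbour] += y * share
    return table


# A peak in the middle of bucket 0 and a peak exactly on the 0/1 border.
print(kernel_bucket([0.5, 1.0], [10.0, 8.0],
                    start=0.0, width=1.0, n_buckets=3, kernel=0.5))
```

The total intensity is preserved, and a small shift of a peak across a border changes the table gradually instead of moving the whole intensity to another column.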

6 Which Bucketing Technique to be used ?
- Rectangular, equidistant bucketing: standard, good compromise if no a priori knowledge is available
- Variable sized bucketing: makes shifts ineffective, allows selective usage
- Pointwise bucketing: often used for broad line spectra, as a special case of rectangular, equidistant bucketing
- Dynamic peak bucketing: allows very fine bucketing without getting huge tables, requires stable shifts or masses
- Kernelized bucketing: variant of rectangular bucketing to reduce the effect of shifts

7 Data Preprocessing : Spectral Background Subtraction
Measured data are contaminated by solvents and chemical noise. The intensity of contaminants may dominate the relevant data.
[Figure: chemical noise; solvent at m=75.2; baseline and scaled noise estimate; detection of traces by dynamic grouping. Green: baseline estimate. Blue: baseline plus noise.]
The baseline (chemical noise) is a term that varies slowly in time; it contains a lot of peaks without significant information. The second picture zooms in on the details and shows a variation of peaks within a certain range.

8 Data Preprocessing : Spectral Background Subtraction
Subtraction of the spectral background makes relevant data visible.
[Figure: Base Peak Chromatogram (BPC), intensity vs. time, before and after background subtraction. Before: hidden traces of m=180.2 and m=208.2 in the BPC. After: visible traces of m=180.2 and m=208.2 in the BPC.]
The blue trace is the measured signal for the BPC. The red signal is the BPC after background subtraction. The next pictures zoom in on mass ranges: the upper picture shows the masses that are correctly visible as BPC in the red trace, and the lower graph shows a zoomed region of the BPC in which these masses are clearly dominant.

9 Find compounds defined by RT, m/z, z and area
Peak Picking Tasks
- Find compounds defined by RT, m/z, z and area
- Group isotopic peaks and charge states together

10 LC-MS chromatogram or NMR spectrum= fingerprint
Multivariate Statistics in Metabolomics Spectroscopy
How to analyze large numbers of complex LC-MS chromatograms or NMR spectra with the target of simple discrimination or grouping?
- healthy / non-healthy
- high / low quality
NMR or MS = sensor
LC-MS chromatogram or NMR spectrum = fingerprint
Use Pattern Recognition Techniques!
[Figures: 1) PCA in-model plane, 2) hierarchical cluster analysis, 3) statistical analysis, 4) SVM (support vector machine)]

11 Pattern Recognition (PR)
Objectives of PR:
- Statistical characterization
- Model building
- Classification
Methods of PR:
- Exploratory Data Analysis: Statistical Tests; Principal Component Analysis (PCA) >> variance analysis
- Unsupervised Pattern Recognition: Cluster Analysis
- Supervised Pattern Recognition: Discriminant Analysis (LDA, PCA-DA); classification of samples by various means, e.g. Genetic Algorithm, SVM, …

12 e.g. Principal Component Analysis
Idea of Principal Component Analysis (PCA)
[Figure: data cloud in x/y coordinates with the principal component axes PC1, PC2 and PC3.]

13 Input spectra list classification model Bucketing coordinate
Classification using PCA
Input: spectra list, classification model
Steps:
- Bucketing
- coordinate transformation
- distance measures
- comparison to critical values

14 Coordinate Transformation
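[Figure: coordinate transformation from the original bucket axes (ppm1, ppm2) to the principal component axes (PC1, PC2); scores and loadings are indicated.]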
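The coordinate transformation can be made concrete with a small pure-Python sketch that extracts the first principal component of a bucket table. Power iteration on the covariance matrix is used here only because it needs no external library; it is an illustrative choice, not the method of any particular software, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of PC1 for a bucket table.

    rows: one list of bucket values per spectrum.
    loadings: unit vector in bucket space (weight of each original variable).
    scores: coordinate of each mean-centred spectrum along that vector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; the start vector is arbitrary (it must only not be
    # orthogonal to PC1, which this simple choice does not guarantee).
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the data
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Four spectra with two buckets (ppm1, ppm2) that vary together.
loadings, scores = first_principal_component(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
print(loadings)  # both buckets contribute equally to PC1
print(scores)    # position of each spectrum along PC1
```

The loadings say how much each original bucket contributes to the new axis, and the scores are the coordinates of the spectra on it, which is what the scores and loadings plots display.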

15 Pattern in PC1/PC2 scores plot reveals candidates
Baby Urine Samples: PCA - NMR
[Figure: PC1/PC2 scores plot. Labels: mevalonic aciduria, orotic aciduria, maple syrup disease, 200 normal, candidates.]
The pattern in the PC1/PC2 scores plot reveals candidates with inborn errors.

16 disease vectors indicating strength of metabolic disorders
New Born Screening by NMR
> 400 baby urines, PCA; disease vectors indicate the strength of metabolic disorders.
[Figure: scores plots PC1/PC2, PC3/PC4 and PC11/PC12.]
- Bucket analysis from 9 to 0.4 ppm in 0.04 ppm steps
- Excluded: 6 to 4.5 ppm (residual water and urea)
- Results of BEST-NMR at 600 MHz: 1D spectra, NOESY presat, 64 scans

17 Distance from normals distribution is a measure for
New Born Screening by NMR
The distance from the distribution of normals is a measure of the concentration of the molecule representing an inborn error.
[Figure: hippuric acid vector; CH2 group of hippuric acid.]

18 PCA : NMR vs. LC-MS
Fig. 1: Scores plot (PC1 vs. PC2) of NMR data from baby urines (born 2003).
Fig. 2: Scores plot (PC1 vs. PC2) of LC-MS data from a subset of baby urines (born 2003).
A large number of urines from normal infants and infants diagnosed with various inborn errors of metabolism were analyzed using NMR spectroscopy. A subset of these samples was also analyzed using LC/MS. Examination of the Principal Components Analysis (PCA) of both the NMR and the LC/MS data shows a clustering of subjects based on the presence or absence of inborn errors. Figures 1 and 2 show the PCA scores plots which illustrate the group clustering.
The scores plot from the LC/MS data (Figure 2) shows most subjects with inborn errors, in red, separated from the control group, in blue, with three subjects (141, 122 and 224) not separated from the control group. The scores plot from the NMR data (Figure 1) shows 141, 122 and 224 well separated from control, while subjects 123, 133 and 140 are not. Subject 114, who has been diagnosed with methylmalonic aciduria, is clearly separated from control in the LC/MS scores plot, while in the NMR data the separation from control is minimal.
The LC/MS data for subject 114 were collected and analysed five times to test the robustness of the data collection and peak extraction algorithms. The very tight clustering in the PCA plot for subject 114 clearly shows that the methods employed are robust and reproducible. These data illustrate the complementarity of LC/MS and NMR data in metabolic profiling: all subjects with inborn errors are identified in one or the other of the PCA plots.

19 The determined formula C13H15N2O4
LC-MS data of Samples 114 and 94
[Figure: base peak chromatograms (BPC, -All MS; intensity * 10^4 vs. t [min]) of sample 114 and sample 94, and an eXpose comparison of 94 vs. 114 highlighting m/z 263.1. Inset: generated molecular formula C13H15N2O4 (263.10) with calculated and measured isotope pattern, m/z 260-267.]
The determined formula C13H15N2O4 corresponds to phenyl-acetylglutamate.

20 Interpretation scores / loadings
Combining Spectra and Statistical Data
Loadings in PCA indicate the importance of the original variables (buckets) in the variance space. In ideal cases a set of loadings refers to the signals of one compound.
The figure shows a PCA analysis of 412 newborn 1D-NMR urine spectra: the scores plot (PC1-PC2, upper left corner), the loadings plot (PC1-PC2, upper right corner), and an outlier spectrum overlaid with the hippuric acid reference spectrum found in a spectra database search.
By viewing suitable low dimensional projections of the variance space it is clearly observed that the majority of spectra of this ensemble is located in one group, while a few other spectra are outlying in several directions. The loadings, which express the relation between the original bucket variables and the PCs, indicate which buckets are responsible for the outlying behavior: these are the ones that are lined up along the same directions as the outlying spectra in the scores plot.

21 Analysis of Bucket Variables
[Figure: software screenshot with menu bar / options, data viewer, loadings plot and bucket table.]
Further understanding can only be obtained if these loadings can be related to chemical compounds and biochemical processes. To do this, we translate a set of loadings into an NMR spectral pattern and use it for spectra base matching. The spectra base contains a number of pure reference compounds measured under suitable conditions. With simple mouse clicks, urine spectra can be selected from the scores plot and the identified compounds from the spectra base are visualized on top of them. Since it is not always clear which of the loadings to include, and since there may be more than one compound that causes the outlying behavior, the spectra base matching is typically done in several iterations. After individual loadings have been assigned to individual compounds, a display of the corresponding column of the bucket table shows in which of the spectra this compound occurs.

22 Covariance matrix
The covariance matrix looks like a TOCSY: cross-peaks indicate correlated fluctuations. This includes multi-molecular fluctuations. Rows at the cursor position are shown on top of the 2D matrix.

23 Covariance Analysis Interesting rows can be saved to disk
[Figure: a row from the covariance matrix overlaid with a reference spectrum from the spectra base.]
Interesting rows can be saved to disk as 1D NMR spectra and used for spectra base searching like any other 1D spectrum. Often, a small number of compounds from the spectra base match well while others do not.

24 PCA analysis of 69 newborn urine LC-MS spectra
[Figure: scores plot (PC1-PC2), loadings plot (PC1-PC2), generated sum formula, and two LC-MS spectra: run 1 (no peak) and run 2 (peak).]
1: selecting two LC-MS runs differing in their PC1 values from the scores plot
2: selecting a bucket (spectral region) with a high PC1 value from the loadings plot
The AMIX software is able to perform bucket table calculations and statistics for 1D and 2D NMR and LC-MS spectra and even to combine different results. In the case of PCA with LC-MS data the loadings analysis is even more straightforward. Here we use dynamic peak bucketing and take the resolution of the mass spectrometer as the allowed bucket width. The time axis may be neglected or not. The resulting bucket table has the maximum possible resolution and the loadings correspond to well defined masses. They can be selected directly from the loadings plot as input to the sum formula calculation.
The separation in the figure is caused by a loading at m/z. Sum formula calculation yields C16H14N3O4S (ampicillin). The children of the right group were treated with ampicillin. By moving the correlated cursor to this mass in the loadings plot, the corresponding area can be checked in all displayed LC-MS spectra. The mass of ampicillin is not contained in the left spectrum (taken from the left group in the scores plot) but is clearly visible in the right spectrum (taken from the right group).

25 Sum-Formula Generation
[Flow chart: sum-formula generation]
Inputs:
- Experimental peak intensities & masses
- Isotope masses, abundances
- Molecular constraints: electron configuration (M+*, M+), C/H ratio, elemental limits, N-rule, isotope distribution, double bond equivalents
Processing:
- Fast formula generator using the CHNO algorithm
- Fast, exact calculation of isotopic patterns
Output:
- List of hits & mass/intensity patterns

26 Mass/Intensity Patterns
Formula Scoring: isotopic pattern as an additional decision criterion for the elemental composition
Input: list of hits & mass/intensity patterns
Comparison: experimental intensities & masses { theoretical intensities & masses }
Three independent scores:
- Intensity ratios
- Intensity weighted mean masses
- Intensity weighted peak distances

27 Calculating the elemental composition
[Figure: simulated mass spectrum of Chlorpyriphos, showing the 12C isotope peak, the 13C isotope peak (11% int.) and the 3 x Cl isotope peaks.]

28 The samples are different
Clinical Proteomics
- The samples are different
- The experimental and mathematical techniques are similar
- But the goal is the same

29 Workflow Clinical Proteomics
[Workflow figure: patients' serum samples (normal, normal, disease) -> isolation (binding, washing, elution) -> detection by MALDI-TOF MS -> analysis -> clinical results]
W. Pusch et al., Pharmacogenomics (2003) 4(4)

30 Data Preparation
Aim: Extraction of the same set of features from each individual spectrum. These features will be used for model generation and later for classification of new spectra. As with metabonomics, the identification of the features is of great interest.
Steps:
- Quality checks for spectra
- Recalibration, baseline correction, noise reduction
- Peak detection and area calculation
- Normalization of peak areas
Without good data preparation no useful models can be built; e.g. when peaks are not aligned, comparison is not possible. Differences between classes need to be preserved while noise in the data is eliminated. One approach: using peak areas instead of single intensities.

31 Tasks for Clinical Proteomics Data Analysis
- Data preprocessing
- Peak annotation
- Statistical characterization
- Discriminant analysis
Most of the tasks are quite similar for both kinds of applications except for the dimensionality of the original data: 1D for MALDI MS, 2D for LC-MS (ESI).

32 Data Preprocessing : Recalibration
Problem: peaks are not aligned to each other, as in this example. For LC-MS it is usually the retention time that is not aligned.
Solution: application of a recalibration algorithm
Result: peaks are aligned to each other
The recalibration algorithm involves the following steps:
- Peak detection on all spectra
- On each spectrum: deletion of peaks which are too close to each other
- Calculation of a list of reference masses on the basis of the peak lists of all spectra
- For each spectrum: recalibration using its peak list and the list of reference masses (i.e. intensity values are assigned slightly modified mass values)

33 Data Preprocessing : Recalibration
[Figure: two spectra (Spec 1, Spec 2), intensity vs. m/z, with a linear mass shift between them. Annotations: prominent peak; mass shift; shifted above mass tolerance; in tolerance but not prominent.]
- Selection of prominent peaks (e.g. 30% occurrence)
- Use this peak list with average masses as calibrants
- Assignment of peaks to calibrants within a mass tolerance
- Recalibration of all spectra by solving least squares problems for a linear mass shift
Peaks with background color are prominent peaks. The distances between annotated peaks are increasing. The purple peak has no corresponding peak in the first spectrum but is nevertheless prominent. The last two peaks are within tolerance but not prominent (more spectra would need to be plotted to make this clear).
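The last two steps (assignment within a tolerance, then a least squares fit of a linear mass shift) can be sketched as follows. The function name, the nearest-calibrant assignment and the model `reference = a * measured + b` are assumptions chosen for illustration; the slide only states that a linear mass shift is fitted by least squares.

```python
def fit_linear_mass_shift(peaks, calibrants, tolerance):
    """Fit reference = a * measured + b from peaks matched to calibrants.

    peaks: measured peak masses of one spectrum.
    calibrants: reference masses (e.g. average masses of prominent peaks).
    tolerance: maximum |measured - reference| for a peak to be assigned.
    Returns (a, b); the identity (1.0, 0.0) if fewer than two peaks match.
    """
    pairs = []
    for mass in peaks:
        nearest = min(calibrants, key=lambda c: abs(c - mass))
        if abs(nearest - mass) <= tolerance:
            pairs.append((mass, nearest))
    if len(pairs) < 2:
        return 1.0, 0.0
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx          # ordinary least squares slope
    b = mean_y - a * mean_x
    return a, b


# Measured masses are stretched by 0.05 %; 2500.0 matches no calibrant.
a, b = fit_linear_mass_shift(
    [1000.5, 2001.0, 3001.5, 2500.0], [1000.0, 2000.0, 3000.0], tolerance=2.0)
recalibrated = [a * m + b for m in [1000.5, 2001.0, 3001.5, 2500.0]]
print(recalibrated)
```

Applying `a * m + b` to every mass value of the spectrum keeps the intensities untouched and only moves them to slightly modified masses, as described for the recalibration step.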

34 Data Transformation Wavelet vs. Fourier
Aim: Transformation of the spectrum from the time-amplitude (mass spectra) domain into a time-frequency (Fourier) or time-scale (wavelet) representation
Benefit:
- decomposition into distinct frequency/scale bands
- significant features (peaks, patterns) occur on specific frequencies/scales
Steps for wavelet decomposition:
- low pass filter -> approximation coefficients
- high pass filter -> detail coefficients
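The two filter steps can be illustrated with the simplest wavelet, the Haar wavelet, in an unnormalised form (pairwise mean as low pass, pairwise half-difference as high pass). The choice of the Haar wavelet and the function names are assumptions made for this sketch; the slides do not say which wavelet is used.

```python
def haar_step(signal):
    """One decomposition level: returns (approximation, detail) coefficients."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approximation = [(a + b) / 2.0 for a, b in pairs]  # low pass filter
    detail = [(a - b) / 2.0 for a, b in pairs]         # high pass filter
    return approximation, detail


def haar_decompose(signal, levels):
    """Repeat the step on the approximation; returns (approximation, details)."""
    details = []
    approximation = list(signal)
    for _ in range(levels):
        approximation, detail = haar_step(approximation)
        details.append(detail)
    return approximation, details


spectrum = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
print(haar_step(spectrum))
print(haar_decompose(spectrum, levels=2))
```

Each level halves the resolution: the approximation keeps the coarse shape of the spectrum, while the detail coefficients record where fine structure sits along the m/z axis.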

35 Wavelets for Feature Extraction
Aim: Determine features from the spectra which are discriminant for class separation
Method:
- The wavelet transformation gives information about the signal localized in time (m/z) and frequency
- lower frequencies correspond to the coarse structural information of the spectrum
- higher frequencies correspond to fine/detailed information
- in contrast to the FFT, we learn where the feature is located in time (m/z)
Feature selection:
- we get many features (depending on the time resolution)
- brute force plus sophisticated feature selection is needed

36 Peak detection
Problem:
- A common set of peaks is needed for the later model stage
- Different peaks vary to a different extent over all spectra
- Small peaks that nevertheless give a good separation between classes might be overlooked when only single spectra are considered
[Figure: average spectrum vs. single spectrum]
Peaks are detected on the average spectrum to avoid overlooking peaks that are very small in some spectra (as in the picture).

37 Peak detection Solution - ClinProTools: average spectrum
- Determination of peak positions by use of the average spectrum
- Integration between start and end masses for detected peaks
[Figure: average spectrum. Blue areas indicate picked peaks; red areas indicate picked peaks in the model (see later).]

38 Peak detection and area calculation
ClinProTools: area of the peak between fixed start and end points
Other possibilities: peak intensity, peak range intensity
[Figure: a peak on the m/z axis with its start and end points, peak range, peak intensity, end point level and zero level marked.]

39 Average spectra per class
Peak at ca. 2022 Da: Idx 24 in the model of the 15 most important peaks for GA and SVM
[Figure: average spectra of classes 1 to 4]
In this example the average spectra already show a peak which clearly separates class 2 from all others. This does not happen very often. We also see that the average spectra are quite similar in the considered range.

40 Univariate Statistics: Getting a basic idea about the data
Calculation based on:
- peak intensities / peak areas
- descriptive / robust statistics
- Welch's t-test / Wilcoxon test
[Table: peak area statistics, sorted according to p-value]
The table shows part of the peak statistics results calculated with the ClinProTools software. The peaks are sorted according to decreasing separation power, as indicated by the P value. Note the low P values showing highly significant differences between the two classes. Signals with nearly identical intensities (marked with asterisks) in the data set can serve as internal controls in comparison to differentially expressed signals.
Column legend:
- Index: peak number
- AveMass: average mass
- Ave1+2: average intensity of class 1 and 2
- Dave: difference of average intensities
- StdDev1+2: standard deviation of class 1 and 2
- Conf1+2: confidence interval of class 1 and 2
- P value: probability that the respective intensity distribution can be observed by chance
- T-test: result of the t-test
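For one peak, the Welch statistic behind such a table can be computed with the standard library alone. This sketch returns the t value and the Welch-Satterthwaite degrees of freedom; turning them into a P value additionally needs the t distribution (for example from a statistics package), which is left out here. The function name and the toy peak areas are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(class1, class2):
    """Welch's t statistic and degrees of freedom for two groups of peak areas."""
    n1, n2 = len(class1), len(class2)
    v1, v2 = variance(class1), variance(class2)  # sample variances
    se2 = v1 / n1 + v2 / n2                      # squared standard error
    t = (mean(class1) - mean(class2)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Peak areas of one peak in two classes (toy numbers).
t, df = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t, df)
```

Repeating this for every peak and sorting by the resulting P value gives a ranking by separation power like the one described above. Unlike the classical t-test, the Welch variant does not assume equal variances in the two classes.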

41 Algorithms for Discriminant Analysis
Some alternatives to classical linear DA:
- Feature selection: Genetic Algorithms (GA) + Cluster Analysis
- Support Vector Machines (SVM)

42 GA: Application to MS data
Solution = combination of peaks
- Initial population: randomly generated solutions
- Start with multiple initial populations using a migration schema
- Each solution is assigned a fitness value according to its ability to separate two or more classes (using centroid or KNN clustering and by determination of between and within class distances)
- New generations of the population are formed using:
  - Selection: the fitter a solution, the higher its chance of being selected as a parent
  - Crossover: parents form new solutions by exchanging some of their peaks; new solutions replace parents
  - Mutation: random changes in solutions
- Result: combinations of peaks which separate the classes best
Workflow:
- Use a specific representation for each solution (in our case: a combination of peaks).
- Build k populations consisting of many solutions (initial solutions). It is expected that many populations plus an appropriate migration schema help to get a better solution and faster convergence. Idea: each population has its own solution and is optimized individually. After a predetermined number of steps some individuals are allowed to migrate between different populations. The individuals remain in their foreign population for a number of steps, scatter their genetic information, get new genetic information from the population and migrate back. This idea is influenced by nature, e.g. New Zealand as an individual population separated from the rest, with a specific migration probability across the small margin between New Zealand and Australia.
- Calculate a fitness value for each solution. We aim at maximizing the between class distance and minimizing the within class distances. In addition we can look at the performance of the clusterings and, for centroid clustering, consider parameters such as purity and number of clusters.
- Iteratively apply the “genetic operators” selection, crossover and mutation. Aim: the combination of two good solutions will yield an even better one; solutions with low fitness values will be eliminated.
- Stop after a predefined number of generations (or when a fitness limit has been reached, or ...).
- The result is the solution with the highest fitness value found so far (it does not need to be the global optimum).
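A deliberately small sketch of this loop is given below. It is not the ClinProTools algorithm: it uses a single population without migration, keeps the fitter half of each generation unchanged instead of replacing parents, and uses a simple between/within centroid distance ratio as fitness. All names and parameter values are assumptions for illustration.

```python
import math
import random


def fitness(solution, spectra, labels):
    """Between-class centroid distance divided by mean within-class spread."""
    centroids, within = {}, 0.0
    classes = sorted(set(labels))
    for c in classes:
        members = [[s[i] for i in solution] for s, l in zip(spectra, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids[c] = centroid
        within += sum(math.dist(m, centroid) for m in members) / len(members)
    between = sum(math.dist(centroids[a], centroids[b])
                  for a in classes for b in classes if a < b)
    return between / (within + 1e-9)


def genetic_peak_selection(spectra, labels, n_peaks,
                           pop_size=20, generations=50, seed=0):
    """Search for the combination of n_peaks peak indices with the best fitness."""
    rng = random.Random(seed)
    n_total = len(spectra[0])

    def score(solution):
        return fitness(solution, spectra, labels)

    population = [rng.sample(range(n_total), n_peaks) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:pop_size // 2]            # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(dict.fromkeys(a + b))       # crossover: mix parents' peaks
            child = rng.sample(pool, n_peaks)
            if rng.random() < 0.2:                  # mutation: swap in a random peak
                unused = [i for i in range(n_total) if i not in child]
                if unused:
                    child[rng.randrange(n_peaks)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return sorted(max(population, key=score))


# Toy data: peak 0 separates the classes, peaks 1-3 do not.
spectra = [[1.0, 5.0, 3.0, 0.1], [1.1, 4.0, 3.0, 0.2], [0.9, 6.0, 3.0, 0.1],
           [5.0, 5.0, 3.0, 0.2], [5.1, 4.0, 3.0, 0.1], [4.9, 6.0, 3.0, 0.2]]
labels = [1, 1, 1, 2, 2, 2]
print(genetic_peak_selection(spectra, labels, n_peaks=2))
```

Because the search is stochastic, the returned combination is the best one found, not necessarily the global optimum, which matches the remark at the end of the workflow.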

43 Fitness Test using k-NN
GA: Genetic Evolution
[Figure: two chromosomes, each a set of five peak masses, pass through 50-500 cycles of mutation, crossover, fitness test using k-NN and selection (discard disadvantages, keep advantages).]
Start set: chromosome 1 = 1000 1200 1700 2100 2500; chromosome 2 = 1700 1800 2000 2200 2300
After mutation: chromosome 1 = 1000 1200 1500 2100 2500; chromosome 2 = 1700 1800 2000 2150 2300
After crossover: chromosome 1 = 1000 1200 1500 2150 2300; chromosome 2 = 1700 1800 2000 2100 2500

44 KNN: k-nearest neighbor clustering
- Spectrum = point in R^n (e.g. areas of selected peaks)
- Determination of the k nearest neighbors for each spectrum
- Classification of all points using the classes of neighboring points
- Example: point A is classified as class 2, point B as class 1
- Fitness value: percentage of correctly classified points, calculation of between/within distances
[Figure: points A and B among points of class 1 and class 2]
So far the fitness value has not been defined, only vaguely described as the ability to separate two classes. Now the definition: for a given combination of n peaks, the peak areas span an n-dimensional space R^n. In this space the distance between two spectra is calculated just as in 2D and 3D, as the Euclidean distance, i.e. the square root of the sum of squared differences: sqrt( (x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 ). Usually k = 3. Often an odd number is used, and the majority of class memberships determines the classification result (see example point B).
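The classification rule and the "percentage of correctly classified points" can be sketched as below. Classifying each spectrum from all the other spectra (leave-one-out) is an assumption about how the percentage is obtained; the function names are illustrative.

```python
import math
from collections import Counter


def knn_classify(point, train_points, train_labels, k=3):
    """Majority class among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: math.dist(point, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def knn_fitness(points, labels, k=3):
    """Percentage of points classified correctly from all the other points."""
    correct = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        others = points[:i] + points[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        if knn_classify(point, others, other_labels, k) == label:
            correct += 1
    return 100.0 * correct / len(points)


# Each point holds the areas of two selected peaks for one spectrum.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [1, 1, 1, 2, 2, 2]
print(knn_classify((0.5, 0.5), points, labels))
print(knn_fitness(points, labels))
```

With an odd k and two classes the majority vote cannot tie, which is the reason for the remark about odd numbers.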

45 Centroid clustering
- Spectrum = point in R^n (e.g. areas of selected peaks)
- Spectrum by spectrum is analyzed (iterative process):
  - if it is the first spectrum or too far away from all existing clusters, a new cluster with just this spectrum is created
  - otherwise it is assigned to the nearest cluster and the centroid is recalculated
- Fitness value: pureness & #clusters (optimal: k clusters, each with all spectra of one class), calculation of between/within distances
[Figure: clusters 1 to 4]
The centroid of a cluster is the mean of all points belonging to this cluster. Result of the iterative process: clusters which contain spectra from either just one class or from both classes. Aim: construct pure clusters (containing spectra from just one class). Ultimate aim (not always reachable): exactly two clusters, each of which contains all spectra from one class.
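The iterative process can be sketched as follows. The threshold `max_distance` (what counts as "too far away") and the purity measure (fraction of clusters containing a single class) are assumptions; the slide does not give their exact definitions.

```python
import math


def centroid_clustering(points, max_distance):
    """Assign points one by one to the nearest cluster or open a new one."""
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for index, point in enumerate(points):
        best = None
        for cluster in clusters:
            d = math.dist(point, cluster["centroid"])
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            clusters.append({"centroid": list(point), "members": [index]})
        else:
            cluster = best[1]
            cluster["members"].append(index)
            n = len(cluster["members"])
            # Running mean: the centroid stays the mean of all member points.
            cluster["centroid"] = [(c * (n - 1) + x) / n
                                   for c, x in zip(cluster["centroid"], point)]
    return clusters


def purity(clusters, labels):
    """Fraction of clusters whose members all belong to one class."""
    pure = sum(1 for c in clusters
               if len({labels[i] for i in c["members"]}) == 1)
    return pure / len(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (1.0, 0.0)]
labels = [1, 1, 2, 2, 1]
clusters = centroid_clustering(points, max_distance=2.0)
print([c["members"] for c in clusters], purity(clusters, labels))
```

The result depends on the order in which the spectra are processed, which is one reason why the number of clusters is part of the fitness value alongside the purity.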

46 GA: Results Prediction capability of GA (plot for best 2 peaks)
Data: 3 Classes, 1 Cancer (blue [c4]), Control (red [c2]), Benign (green [c3])
What you see:
- GA has computed the best models for (here) a maximum of 25 peaks; from these, the 25 peaks which contributed most have been extracted (gives avg. 78% prediction)
- Two of these peaks have been picked for display (~61% pred.)
- For each spectrum the peak areas of the two peaks are computed and plotted in this diagram; each point represents one spectrum
* Prediction acc. for a model with 25 peaks

47 SVM: Support Vector Machine
SVM: calculation of the direction in R^n which separates best between two classes (supervised method)
PCA (principal component analysis): calculation of the direction in R^n which best explains the variability (unsupervised method, i.e. without looking at class memberships)
[Figure: left picture, PCA direction = SVM direction; right picture, SVM and PCA directions differ, with the support vectors marked.]
Difference between the well known PCA and SVM:
- PCA: unsupervised method (does not take the class membership into account), looks for the direction which explains the variability
- SVM: supervised method, tries to separate both classes
The direction of best separation and the direction of greatest variation may coincide (left picture), but they may also be quite different (right picture). The support vectors are the points which are closest to the margin, with some additional constraints. The support vectors (right picture) are the only points in the training set which determine the final solution of the optimization. The direction can be calculated from the support vectors, ignoring all other points.

48 What is SVM – Basic problem
Assume we have two classes of data points with two peaks. We now look for the line which optimally separates these two classes. Many decision boundaries can separate the two classes. Which should be chosen?
[Figure: class 1 and class 2 with several candidate boundaries; the green boundaries are valid but bad ones.]

49 This problem can be solved by mathematical optimization theory
What is SVM – Basic idea
The decision boundary should be as far away from the data of both classes as possible. We should maximize the margin, m.
[Figure: class 1 and class 2 separated by a boundary with margin m]
This problem can be solved by mathematical optimization theory.
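For reference, the margin maximization is usually written in the following standard hard-margin form. This is the textbook formulation, not taken from the slides: x_i denotes the peak-area vector of spectrum i, y_i in {-1, +1} its class, and w and b the normal vector and offset of the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,N,
\qquad\text{with margin}\quad m = \frac{2}{\lVert w\rVert}.
```

Minimizing the norm of w is equivalent to maximizing m. The objective is quadratic and the constraints are linear inequalities, which is why the problem is a quadratic optimization problem.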

50 SVM: Application to MS data
SVM: quadratic optimization problem, solved by an iterative process using Sequential Minimal Optimization
- In the simplest case a hyperplane separating the classes is calculated
- From this, the contribution of individual peaks is calculated
From spectra in 3 classes:
- Quadratic optimization problem: quadratic objective function; constraints are linear inequalities
- Iterative process applied: Sequential Minimal Optimization (SMO) from J. Platt (Microsoft Research); very fast, a special adaptation to the SVM problem
- The direction is a linear combination of the peaks; peaks with high absolute coefficients in this linear combination contribute most to this direction
We get:
- Separating hyperplanes
- Recognition & prediction accuracy
- Peak ranking highlighting potential biomarker patterns

51 SVM: Application to MS data - results
Note that it is plotted in 2D, but in fact it is high dimensional.
Class     Rec.   Pred.
Class 1   90     69
Class 2   86     83
Class 3: 76 (only one of the two values survives)
Rank    1      2      3
Index   7      17     18
Mass    1212   1450   1469

52 SVM: Results Prediction capability of SVM (plot for best 2 peaks)
Data: 3 Classes, 1 Cancer (blue), Control (red), Benign (green)
What you see:
- SVM has computed directions; from these, the 25 peaks which contributed most have been extracted (gives avg. 79% prediction)
- Two of these peaks have been picked for display (~61% pred.)
- For each spectrum the peak areas of the two peaks are computed and plotted in this diagram; each point represents one spectrum
[Figure annotation: ~75% pred. *]
* Prediction acc. for a model with 25 peaks
