Softberry Mass Spectra (SMS) processing tools


This is a collaborative project with Universal Prediction Limited (UK) (http://www.universal-prediction.com) for the analysis of mass spectra data.

Processing mass spectra: main analysis steps
1. Calibration
2. Data resampling
3. Data smoothing
4. Baseline detection and subtraction from intensity
5. Normalization
6. Peak identification
7. Peak alignment
8. Sample classification and patient outcome prediction from MS data

Calibration: removing systematic noise from equipment
A spectrum with a set of calibration peaks is used to find the transform that puts these peaks at their known MZ positions. The same transform is then applied to the sample spectra.
[Figure: raw calibration spectrum with spectrum peak locations and calibration peak locations; sample spectrum before (raw data) and after (calibrated data) the calibration transform]
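The slides do not state the form of the calibration transform. As a minimal sketch, assuming a linear transform fitted by least squares, the step could look as follows; the function names `fit_linear_calibration` and `calibrate` are hypothetical.

```python
def fit_linear_calibration(observed, known):
    """Least-squares fit of known = slope * observed + intercept.

    observed: MZ positions of calibration peaks found in the raw spectrum.
    known:    reference MZ positions of the same peaks.
    A linear transform is an assumption; the original tool may use another form.
    """
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(known) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, known))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(mz_values, slope, intercept):
    """Apply the fitted transform to the MZ axis of a sample spectrum."""
    return [slope * mz + intercept for mz in mz_values]


# Illustrative values only: observed peaks are shifted by 0.05% from the reference.
slope, intercept = fit_linear_calibration([1000.5, 2001.0, 3001.5],
                                          [1000.0, 2000.0, 3000.0])
calibrated = calibrate([2001.0], slope, intercept)
```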

Resampling: removing excessive data, transforming to a common MZ scale
Data resampling discards excessive data points and brings the MZ values to a common scale. As a result, different spectra have the same number of MZ points and are therefore comparable. Reducing the number of spectrum points lowers the noise and removes excessive data while preserving the spectrum shape.
[Figure: initial data vs. resampled data]
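One way to bring spectra onto a common MZ scale is linear interpolation onto a shared grid. The sketch below assumes that approach (the original tool may bin or average instead); `resample` and its arguments are hypothetical names.

```python
from bisect import bisect_right


def resample(mz, intensity, grid):
    """Linearly interpolate a spectrum onto a common MZ grid.

    mz must be sorted in ascending order. Grid points outside the measured
    range take the nearest edge intensity.
    """
    resampled = []
    for g in grid:
        if g <= mz[0]:
            resampled.append(intensity[0])
        elif g >= mz[-1]:
            resampled.append(intensity[-1])
        else:
            j = bisect_right(mz, g)  # mz[j - 1] <= g < mz[j]
            t = (g - mz[j - 1]) / (mz[j] - mz[j - 1])
            resampled.append(intensity[j - 1] + t * (intensity[j] - intensity[j - 1]))
    return resampled


# Two spectra resampled onto the same grid end up with the same number of points.
common_grid = [100.5, 101.5]
spectrum_a = resample([100.0, 101.0, 102.0], [0.0, 10.0, 20.0], common_grid)
```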

Smoothing: random noise elimination
Data smoothing is intended to eliminate random noise. During smoothing, the intensity value at each MZ point is averaged with the values at several neighboring points.
[Figure: initial data vs. smoothed data]
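Averaging each point with its neighbors corresponds to a moving-average filter. A minimal sketch, assuming a symmetric window whose half-width (`half_window`, a hypothetical parameter) is chosen by the user:

```python
def moving_average(intensity, half_window=2):
    """Replace each intensity by the mean of itself and its neighbors.

    The window shrinks at the spectrum edges so that every output point
    is an average of existing values only.
    """
    n = len(intensity)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensity[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single-point spike is spread over its neighbors.
smoothed = moving_average([0, 0, 9, 0, 0], half_window=1)
```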

Baseline processing
This step eliminates systematic artifacts that arise from the matrix and chemicals used in the experiments or from detector overload. These artifacts produce background noise that can be significant for some MZ values.
[Figure: initial data, baseline, and baseline-subtracted data]
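The slides do not say how the baseline is estimated. As an illustration only, the sketch below assumes a rolling-minimum baseline, which is one common simple choice; `rolling_min_baseline`, `subtract_baseline`, and `half_window` are hypothetical names.

```python
def rolling_min_baseline(intensity, half_window=50):
    """Estimate the baseline as the minimum intensity in a sliding window."""
    n = len(intensity)
    return [
        min(intensity[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def subtract_baseline(intensity, half_window=50):
    """Subtract the estimated baseline from the intensity at every point."""
    baseline = rolling_min_baseline(intensity, half_window)
    return [y - b for y, b in zip(intensity, baseline)]


# A constant background of 5 is removed; the peak height above it is kept.
corrected = subtract_baseline([5, 5, 15, 5, 5], half_window=1)
```

The window should be wider than the widest peak, otherwise the baseline follows the peaks and the subtraction flattens them.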

Normalization: bringing spectrum intensity to a common scale
Normalization brings peak intensity values to a common scale, which makes it possible to compare data from different spectra.
[Figure: initial data vs. normalized data]
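The normalization rule is not given in the slides. A common choice for mass spectra is scaling by the total ion current (the sum of all intensities); the sketch below assumes that rule, and `normalize_tic` and `target` are hypothetical names.

```python
def normalize_tic(intensity, target=1.0):
    """Scale a spectrum so that its intensities sum to `target`.

    An all-zero spectrum is returned unchanged to avoid division by zero.
    """
    total = sum(intensity)
    if total == 0:
        return list(intensity)
    return [target * y / total for y in intensity]


# After scaling, spectra with different overall signal levels share one scale.
normalized = normalize_tic([1, 1, 2], target=100.0)
```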

Peak identification
This step searches the spectrum for peaks with a high signal-to-noise ratio. Peaks are identified as local maxima of the spectrum.
[Figure: sample intensity with marked peak locations]
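A minimal sketch of this step: local maxima are kept when their height above the spectrum median exceeds a multiple of the noise level. Estimating the noise as the median absolute deviation, and the threshold `min_snr`, are assumptions for illustration, not the tool's documented method.

```python
from statistics import median


def find_peaks(mz, intensity, min_snr=3.0):
    """Return (mz, intensity) pairs for local maxima with high signal-to-noise."""
    center = median(intensity)
    noise = median(abs(y - center) for y in intensity)
    noise = noise if noise > 0 else 1e-12  # guard against a perfectly flat spectrum
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_local_max = intensity[i - 1] < intensity[i] >= intensity[i + 1]
        if is_local_max and (intensity[i] - center) / noise >= min_snr:
            peaks.append((mz[i], intensity[i]))
    return peaks


# The small bump at MZ=2 is rejected as noise; the maximum at MZ=5 is kept.
peaks = find_peaks([1, 2, 3, 4, 5, 6, 7], [1, 2, 1, 2, 20, 2, 1])
```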

Peak alignment
When several spectra are analyzed, the question arises whether they share common peaks. To answer it, peak locations and intensities must be compared across the spectra of interest. All spectra being compared must have gone through the previous steps with the same parameters.
[Figure: sample 1 and sample 2 with common peaks and a specific peak marked]
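One simple way to find common peaks is to pool the peaks of all samples, sort them by MZ, and start a new peak group whenever the gap to the previous peak exceeds a tolerance. The sketch below assumes this gap-based grouping and a hypothetical `tolerance` parameter; the algorithm actually used by the tool is not described in the slides.

```python
def group_peaks(spectra_peaks, tolerance=2.0):
    """Group peaks from several samples into peak groups by MZ proximity.

    spectra_peaks: one list of peak MZ values per sample.
    Returns a list of groups; each group is a list of (mz, sample_index).
    """
    tagged = sorted((mz, s) for s, peaks in enumerate(spectra_peaks) for mz in peaks)
    groups = []
    for mz, sample in tagged:
        if groups and mz - groups[-1][-1][0] <= tolerance:
            groups[-1].append((mz, sample))
        else:
            groups.append([(mz, sample)])
    return groups


def summarize(group):
    """Summary of a peak group: mass statistics and number of samples with a peak."""
    masses = [mz for mz, _ in group]
    return {
        "MeanMass": sum(masses) / len(masses),
        "MinMass": min(masses),
        "MaxMass": max(masses),
        "NumPeaks": len({s for _, s in group}),
    }


# Sample 0 and sample 1 share a peak near MZ 1000; the other peaks are specific.
groups = group_peaks([[1000.0, 2000.0], [1000.5, 3000.0]], tolerance=2.0)
```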

Using LDA with MS data to predict patient outcome
Dataset description. MS data were taken from the work of Gammerman et al., 2008.
Control data: We used 153 control samples (no ovarian cancer detected) as the 'NO' dataset. In this work we treated the control samples as a general pool of healthy people.
Cancer patient data: The full dataset contains 75 samples from 18 patients with identified ovarian cancer (OC), taken from 0 to 75 months prior to diagnosis. To train the LDA classifier we used the samples from these patients taken 1 to 6 months prior to diagnosis ('YES' dataset, 17 samples).

MS data processing
We used the algorithms described in Gammerman et al., 2008 to preprocess mass spectra from 228 samples in total. The processing included:
1. Calibration
2. Resampling
3. Smoothing
4. Normalization
5. Peak identification
6. Peak alignment and peak group detection
As a result, 374 peak groups were detected across all sample data.

List of top 20 peak groups with highest representation in analyzed samples

Peak Group Index
(PeakID)  MeanMass   MinMass    MaxMass    NumPeaks (of 228)  Max Intensity
5         3191.554   3188.161   3193.358   211                45.57914
20        1770.479   1769.719   1772.318   195                29.40414
18        2009.877   2009.076   2012.017   193                30.74098
24        825.7725   825.2985   826.2407   189                26.30554
42        3333.192   3329.355   3334.906   184                19.21943
2         2026.901   2025.914   2029.441   177                53.74245
37        2267.009   2266.025   2268.258                      20.36678
17        2985.741   2983.11    2989.592   167                31.3781
90        2552.984   2551.655   2554.576   157                11.41295
8         1894.954   1894.057   1896.423   147                42.05795
78        2114.491   2111.304   2116.45                       13.52459
7         1863.654   1862.77    1864.733   144                42.50182
10        1449.102   1448.24    1451.12    136                35.4617
56        1584.659   1582.731   1586.55    133                15.79827
55        2567.124   2563.25    2568.585   132                16.01036
23        944.728    944.0944   945.2638   130                27.91649
3         2647.657   2646.315   2648.923   126                50.25482
6         6647.589   6635.569   6651.674   121                44.14933
12        1395.111   1394.238   1397.255   120                34.98699

Selection of LDA features for classification of cancer and non-cancer samples
We used an LDA function with 2 prediction features:
1. Logarithm of the CA125 serum tumor marker level.
2. Logarithm of the MS signal intensity within the peak group MZ range.
For each spectrum, the intensity used for the second feature was chosen as follows:
- if a peak was present in the peak group, we took the logarithm of its intensity;
- if no peak was detected, we took the average signal intensity over the MZ range corresponding to the peak group;
- if the intensity values were all zero for MZ within the peak group range, we set the log intensity value to -10.
Thus, the LDF (linear discriminant function) is LDF = a1*x1 + a2*x2 + b, where x1 is log(CA125 level) and x2 is log(peak intensity) for a given peak group. We tested the utility of the x2 feature (MS intensity) for the peak groups with the largest number of peaks (listed on the previous slide); 20 peak groups were tested in total. For each peak group we calculated the LDF values and classified the samples as cancer or non-cancer. The classification performance (fraction of true predictions) was estimated for each of the 20 peak groups.
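The three rules for the MS intensity feature can be sketched as below. Several details are assumptions made for illustration: the natural logarithm is used, the tallest peak is taken when a group contains more than one peak in a spectrum, the logarithm is applied to the average intensity in the second case, and a non-positive average falls back to -10. The name `log_intensity_feature` is hypothetical.

```python
import math

FLOOR_LOG_INTENSITY = -10.0  # value used when the signal in the range is all zero


def log_intensity_feature(peaks, mz, intensity, mz_min, mz_max):
    """Compute x2 for one spectrum and one peak group [mz_min, mz_max].

    peaks:         (mz, intensity) pairs of detected peaks in the spectrum.
    mz, intensity: the processed spectrum itself.
    """
    in_group = [h for m, h in peaks if mz_min <= m <= mz_max and h > 0]
    if in_group:
        # Rule 1: a peak is present in the group (tallest one is an assumption).
        return math.log(max(in_group))
    window = [y for m, y in zip(mz, intensity) if mz_min <= m <= mz_max]
    if not window or all(y == 0 for y in window):
        # Rule 3: no signal at all within the peak group range.
        return FLOOR_LOG_INTENSITY
    mean_signal = sum(window) / len(window)
    if mean_signal <= 0:
        return FLOOR_LOG_INTENSITY  # assumption: guard against log of non-positive
    # Rule 2: no peak detected, use the average signal over the range.
    return math.log(mean_signal)
```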

Example of data input for LDA analysis
The information for LDA classification is represented as a table containing (1) sample index, (2) time of sampling (before diagnosis for cancer patients), (3) patient index (case), (4) logarithm of CA125 level (denoted as lnCA125), and (5-24) logarithm of mass spectra peak intensity for 20 peak groups (denoted as MZ_NNNN, where NNNN is the mean mass value for the peak group).

It was found that peak group 17 provides the best LDA classification performance when used together with the CA125 level:
LDF = a1*logCA125 + a2*logPi17 + b; a1 = 5.247, a2 = -0.006, b = -18.639
Here logPi17 is the logarithm of the peak intensity for peak group 17 (MeanMass 2985.741, MinMass 2983.11, MaxMass 2989.592 in the list of top peak groups).
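With the reported coefficients, evaluating the LDF for a sample is a single linear expression. The sketch below uses the zero threshold that the later slides apply (positive LDF values indicate cancer); the input values in the example are illustrative, not taken from the dataset.

```python
# Coefficients reported for CA125 combined with peak group 17.
A1, A2, B = 5.247, -0.006, -18.639


def ldf(log_ca125, log_peak17):
    """Linear discriminant function LDF = a1*logCA125 + a2*logPi17 + b."""
    return A1 * log_ca125 + A2 * log_peak17 + B


def classify(log_ca125, log_peak17):
    """Label a sample by the sign of its LDF value."""
    return "cancer" if ldf(log_ca125, log_peak17) > 0 else "non-cancer"


# Illustrative inputs: with logPi17 = 3.0 the boundary lies near logCA125 = 3.56.
print(classify(4.0, 3.0))  # cancer
print(classify(3.0, 3.0))  # non-cancer
```

Because a2 is small relative to a1, the decision is dominated by the CA125 term for typical log intensity values.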

LDA classification example for peak group #17
This peak group is defined for peaks with MZ values in the range [2983.0, 2989.6]. The distribution of CA125 and peak intensities for control samples (blue points) and OC patients (1-6 months before diagnosis; red points) is shown in the figure. Classification results are also shown: 7 control samples were classified as disease (< 5%).
[Figure: scatter plot with axes Log(CA125) and Log(I), MZ=2986; classes: cancer detected, control]
Number of samples=171 (control(0)=154;disease(1)=17)
Fraction of true predictions: 0.959064[164]
Class 0:
  Fraction of true positives : 0.954545[147]
  Fraction of false negatives : 0.045455[7]
Class 1:
  Fraction of true positives : 1.000000[17]
  Fraction of false negatives : 0.000000[0]
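The reported fractions follow directly from the counts in square brackets. A short check, with `class_fractions` as a hypothetical helper name:

```python
def class_fractions(correct, wrong):
    """Fractions of correctly and wrongly classified samples within one class."""
    total = correct + wrong
    return correct / total, wrong / total


control_ok, control_wrong = class_fractions(147, 7)   # class 0: 154 samples
disease_ok, disease_wrong = class_fractions(17, 0)    # class 1: 17 samples
overall = (147 + 17) / (154 + 17)                     # 164 of 171 samples

print(f"{overall:.6f} {control_ok:.6f} {control_wrong:.6f} {disease_ok:.6f}")
# 0.959064 0.954545 0.045455 1.000000
```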

LDF values distribution for control and cancer samples
LDF calculated for features: CA125 and peak #17 intensity.
[Figure: distribution of LDF values (axes: LDF values, number of samples) for control samples and samples taken 1-6 months prior to cancer detection; regions labeled non-cancer and cancer]

LDF value vs time before diagnosis
The LDF values were calculated for all samples from OC patients (18 patients). The results are shown in the figure, where the X axis is the time before diagnosis and the Y axis is the LDF value. For most samples the LDF value exceeds zero within 10 months before diagnosis. 5 patients show no increase of LDF values in this period (they have a small number of samples). One patient (ID 3480) has an LDF value greater than zero over the whole period. Thus, positive LDF values based on CA125 and the MS peak intensity in the range [2983.0, 2989.6] can be used as OC markers for prognosis within 6 months.
[Figure: LDF value vs. time (months) by patient ID, with cancer and non-cancer regions; annotation: wrong classification of non-cancer case for some OC patients at time=0]