Softberry Mass Spectra (SMS) processing tools


1 Softberry Mass Spectra (SMS) processing tools
  This is a collaborative project for analysis of mass spectra data with Universal Prediction Limited (UK) (

2 Processing mass spectra: main analysis steps
- Calibration
- Data resampling
- Data smoothing
- Detection of the baseline and its subtraction from intensity
- Normalization
- Peaks identification
- Peaks alignment
- Sample classification and patient outcome prediction from MS data

3 Calibration: removing systematic noise from equipment
A spectrum containing a set of calibration peaks is used to find the transform that moves these peaks to their known MZ positions.
[Figure: raw data with spectrum peak locations and calibration peak locations; sample spectrum after the calibration transform, raw data vs. calibrated data]
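The slide does not say which form the calibration transform takes. As a minimal sketch, assuming a linear transform fitted by least squares, the idea could look like the following. The function names and the example peak positions are hypothetical and are not taken from the SMS tools.

```python
def fit_linear_calibration(observed, known):
    """Least-squares fit of known ~ slope * observed + offset.

    Assumption: a linear transform is adequate; the slides do not specify one.
    """
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(known) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, known))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


def calibrate(mz_values, slope, offset):
    """Apply the fitted transform to every MZ value of a sample spectrum."""
    return [slope * mz + offset for mz in mz_values]


# Hypothetical calibration peaks: where they were observed vs. where they should be.
observed_peaks = [1001.0, 2003.0, 3005.0]
known_peaks = [1000.0, 2000.0, 3000.0]
slope, offset = fit_linear_calibration(observed_peaks, known_peaks)
calibrated = calibrate(observed_peaks, slope, offset)
```

The same `slope` and `offset` would then be applied to the MZ axis of each sample spectrum measured under the same conditions.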

4 Resampling: removing excessive data, transform to common MZ scale
Data resampling discards excess data points and brings the MZ values onto a common scale. As a result, different spectra have the same number of MZ points and are therefore comparable. Reducing the number of spectrum points lowers the noise and removes excess data while preserving the spectrum shape.
[Figure: initial data vs. resampled data]
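One common way to bring spectra onto a shared MZ grid is linear interpolation. The sketch below assumes that approach; the slides do not state which resampling method the tools use, and the function name is hypothetical.

```python
from bisect import bisect_right


def resample(mz, intensity, grid):
    """Linearly interpolate intensity onto a common MZ grid.

    Assumes mz is sorted in ascending order. Grid points outside the measured
    range take the nearest edge value.
    """
    resampled = []
    for g in grid:
        if g <= mz[0]:
            resampled.append(intensity[0])
        elif g >= mz[-1]:
            resampled.append(intensity[-1])
        else:
            j = bisect_right(mz, g)  # mz[j - 1] <= g < mz[j]
            t = (g - mz[j - 1]) / (mz[j] - mz[j - 1])
            resampled.append(intensity[j - 1] + t * (intensity[j] - intensity[j - 1]))
    return resampled
```

Resampling every spectrum with the same `grid` gives all of them the same number of points at the same MZ positions.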

5 Smoothing: random noise elimination
The smoothing procedure is intended to eliminate random noise in the data. During smoothing, the intensity value at each MZ point is averaged with those of several neighboring points.
[Figure: initial data vs. smoothed data]
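Averaging each point with its neighbors can be sketched as a simple moving average. The window size and the handling of the spectrum edges below are assumptions, since the slide does not give them.

```python
def moving_average(intensity, half_window=2):
    """Replace each intensity by the mean of itself and its neighbors.

    half_window is the number of neighbors taken on each side (an assumed
    parameter). Near the edges the window is truncated to the points that exist.
    """
    n = len(intensity)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensity[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A wider window removes more noise but also flattens narrow peaks, so the value would normally be chosen relative to the expected peak width.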

6 Baseline processing
This step eliminates systematic artifacts caused by the matrix and chemicals used in the experiments, or by detector overload. These artifacts produce background noise that can be significant for some MZ values.
[Figure: initial data, estimated baseline, and baseline-subtracted data]
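The slides do not describe how the baseline is estimated. As an illustration only, the sketch below assumes a rolling-minimum estimate, which is one simple option among many, followed by subtraction clipped at zero. Both function names are hypothetical.

```python
def rolling_min_baseline(intensity, half_window):
    """Estimate the baseline as the minimum intensity in a sliding window.

    Assumption: the window is wider than any real peak, so the minimum tracks
    the slowly varying background rather than the peaks themselves.
    """
    n = len(intensity)
    return [
        min(intensity[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]


def subtract_baseline(intensity, baseline):
    """Subtract the baseline point by point, clipping negative values to zero."""
    return [max(0.0, y - b) for y, b in zip(intensity, baseline)]
```

For a flat background of 5 with a single peak of height 15, this leaves a peak of height 10 on a zero background.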

7 Normalization: bring spectrum intensity to common scale
Normalization brings peak intensity values to a common scale, which makes it possible to compare data from different spectra.
[Figure: initial data vs. normalized data]
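A frequently used scheme is to divide each spectrum by its total intensity. The sketch below assumes that scheme; the slide does not say which normalization the tools apply.

```python
def normalize_total_intensity(intensity):
    """Scale a spectrum so that its intensities sum to 1.

    Assumption: total-intensity normalization. An all-zero spectrum is
    returned unchanged to avoid division by zero.
    """
    total = sum(intensity)
    if total == 0:
        return list(intensity)
    return [y / total for y in intensity]
```

After this step, a given intensity value represents the same share of the total signal in every spectrum.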

8 Peak identification
This step searches the spectrum for peaks with a high signal-to-noise ratio. Peaks themselves are identified as local maxima of the spectrum.
[Figure: sample intensity with peak locations marked]
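The two criteria above, local maximum and high signal-to-noise ratio, can be sketched as follows. The noise estimate (median absolute deviation from the median) and the default threshold are assumptions, not the method documented for the SMS tools.

```python
from statistics import median


def find_peaks(mz, intensity, min_snr=3.0):
    """Return (mz, intensity) pairs for local maxima that stand out from noise.

    Assumption: noise is estimated as the median absolute deviation of the
    intensities, and a peak must rise at least min_snr times that amount above
    the median. If the noise estimate is zero, any local maximum at or above
    the median passes.
    """
    level = median(intensity)
    noise = median([abs(y - level) for y in intensity])
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_local_max = intensity[i - 1] < intensity[i] >= intensity[i + 1]
        if is_local_max and (intensity[i] - level) >= min_snr * noise:
            peaks.append((mz[i], intensity[i]))
    return peaks
```

Small ripples that are local maxima but stay close to the background level are rejected by the threshold, which is why smoothing and baseline subtraction are run first.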

9 Peak alignment
When several spectra are analyzed, the question naturally arises whether they share common peaks. To answer it, peak locations and intensities must be compared across the spectra of interest. All spectra being compared must have gone through the previous steps with the same parameters.
[Figure: Sample 1 and Sample 2, with common peaks and a sample-specific peak marked]
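One simple way to find common peaks is to pool the peak positions from all spectra, sort them, and start a new group whenever the gap to the previous peak exceeds a tolerance. The sketch below assumes that greedy scheme and a fixed absolute tolerance; the actual grouping rule used by the tools is not given in the slides.

```python
def group_peaks(spectra_peaks, tolerance):
    """Group peak MZ positions from several spectra into peak groups.

    spectra_peaks holds one list of peak MZ values per sample. Peaks whose MZ
    differs from the previous peak in sorted order by at most tolerance (an
    assumed absolute MZ tolerance) fall into the same group.
    """
    tagged = sorted(
        (mz, sample) for sample, peaks in enumerate(spectra_peaks) for mz in peaks
    )
    groups = []
    for mz, sample in tagged:
        if groups and mz - groups[-1][-1][0] <= tolerance:
            groups[-1].append((mz, sample))
        else:
            groups.append([(mz, sample)])
    return groups


def common_groups(groups, n_samples):
    """Keep only the groups that contain a peak from every sample."""
    return [g for g in groups if len({sample for _, sample in g}) == n_samples]
```

Groups returned by `common_groups` correspond to common peaks; the remaining groups hold peaks specific to some of the samples.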

10 Using LDA with MS data to predict patient outcome
Dataset description. MS data were taken from the work of Gammerman et al., 2008.
Control data: we used 153 control samples (no ovarian cancer detected) as the ‘NO’ dataset. In this work we treated the control samples as a general pool of healthy people.
Cancer patient data: the full dataset contains 75 samples from 18 patients with identified ovarian cancer (OC), taken from 0 to 75 months prior to diagnosis. To train the LDA classifier we used the samples from these patients taken from 1 to 6 months prior to diagnosis (‘YES’ dataset, 17 samples).

11 MS data processing
We used the algorithms described in Gammerman et al., 2008 to preprocess mass spectra from 228 samples in total. The processing included:
- Calibration
- Resampling
- Smoothing
- Normalization
- Peak identification
- Peak alignment and peak group detection
As a result, 374 peak groups were detected across all sample data.

12 List of top 20 peak groups with highest representation in analyzed samples
Table columns: Peak Group Index, PeakID, MeanMass, MinMass, MaxMass, NumPeaks (of 228), Max Intensity

Peak Group Index    NumPeaks (of 228)
5                   211
20                  195
18                  193
24                  189
42                  184
2                   177
37
17                  167
90                  157
8                   147
78
7                   144
10                  136
56                  133
55                  132
23                  130
3                   126
6                   121
12                  120

13 Selection of LDA features for classification of cancer and non-cancer samples
We used an LDA function with 2 prediction features: (1) the logarithm of the CA125 serum tumor marker level; (2) the logarithm of the MS signal intensity within the peak group MZ range.
For each MS spectrum: if a peak was present in the peak group, we took the logarithm of its intensity; if no peak was detected, we took the average signal intensity over the MZ range corresponding to the peak group; if the intensity values were all zero within the peak group MZ range, we set the log intensity value to -10.
Thus the LDF (linear discriminant function) is LDF=a1*x1+a2*x2+b, where x1 is log(CA125 level) and x2 is log(peak intensity) for a given peak group.
We tested the utility of the x2 feature (MS intensity) for the peak groups with the largest number of peaks (listed on the previous slide); 20 peak groups were tested in total. For each peak group we calculated the LDF value and classified the samples as cancer or non-cancer. The classification performance (fraction of true predictions) was estimated for each of the 20 peak groups.
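The three-way rule for the MS intensity feature can be written out as a small function. This is a sketch under two assumptions not confirmed by the slide: that the natural logarithm is used, and that the logarithm is also applied to the average signal when no peak was detected. The function and parameter names are hypothetical.

```python
import math

ZERO_SIGNAL_LOG = -10.0  # value the slide assigns when the whole MZ range is zero


def peak_group_feature(peak_intensity, mz, intensity, mz_min, mz_max):
    """Compute the x2 feature for one spectrum and one peak group.

    peak_intensity is the intensity of the spectrum's peak in this group, or
    None if no peak was detected there. mz and intensity are the processed
    spectrum; mz_min and mz_max bound the peak group.
    """
    if peak_intensity is not None and peak_intensity > 0:
        return math.log(peak_intensity)
    # No peak detected: fall back to the average signal inside the group range.
    in_range = [y for m, y in zip(mz, intensity) if mz_min <= m <= mz_max]
    mean_signal = sum(in_range) / len(in_range) if in_range else 0.0
    if mean_signal <= 0:
        return ZERO_SIGNAL_LOG
    return math.log(mean_signal)
```

The fallback keeps every sample usable in the classifier even when peak detection finds nothing in the group's MZ range.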

14 Example of data input for LDA analysis
The input for LDA classification is represented as a table containing (1) sample index, (2) time of sampling (before diagnosis for cancer patients), (3) patient index (case), (4) logarithm of CA125 level (denoted as lnCA125), and (5-24) logarithm of mass spectra peak intensity for 20 peak groups (denoted as MZ_NNNN, where NNNN is the mean mass value for the peak group).

15 LDF=a1*logCA125+a2*logPi17+b; a1=5.247, a2=-0.006, b=-18.639
It was found that peak group 17 provides the best LDA classification performance when used together with the CA125 level. The resulting function is LDF=a1*logCA125+a2*logPi17+b, with a1=5.247, a2=-0.006, b=-18.639.
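With the coefficients quoted on this slide, evaluating the LDF for a sample is a one-line computation. The sketch below assumes the decision rule "LDF greater than zero means cancer", which matches how positive LDF values are interpreted later in the deck; the function names and example inputs are hypothetical.

```python
A1 = 5.247    # coefficient for log(CA125), from the slide
A2 = -0.006   # coefficient for log(peak group 17 intensity), from the slide
B = -18.639   # constant term, from the slide


def ldf(log_ca125, log_peak17):
    """Linear discriminant function LDF = a1*x1 + a2*x2 + b."""
    return A1 * log_ca125 + A2 * log_peak17 + B


def classify(log_ca125, log_peak17):
    """Return 1 (disease) if the LDF is positive, else 0 (control).

    Assumption: zero is the decision threshold.
    """
    return 1 if ldf(log_ca125, log_peak17) > 0 else 0
```

For example, `classify(4.0, 0.0)` yields class 1 and `classify(3.0, 0.0)` yields class 0.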

16 LDA classification example for peak group #17
This peak group is defined for peaks with MZ values in the range [2983.0, ]. The plot shows the distribution of CA125 and peak intensities for controls (blue points) and OC patients (1-6 months before diagnosis; red points). Classification results are also shown: 7 control samples were classified as disease (< 5%).
[Figure: Log(CA125) vs. Log(I) at MZ=2986; legend: cancer detected, control]
Number of samples=171 (control(0)=154; disease(1)=17)
Fraction of true predictions: [164]
Class 0: Fraction of true positives: [147]; Fraction of false negatives: [7]
Class 1: Fraction of true positives: [17]; Fraction of false negatives: [0]
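The bracketed counts can be tallied from true and predicted labels as sketched below. The label lists here are synthetic, built only to reproduce the counts quoted on the slide (154 controls of which 7 were called disease, and 17 disease samples all called disease). They are not the original data, and the function name is hypothetical.

```python
def summarize(true_labels, predicted_labels):
    """Tally correct and incorrect predictions for classes 0 and 1."""
    counts = {(t, p): 0 for t in (0, 1) for p in (0, 1)}
    for t, p in zip(true_labels, predicted_labels):
        counts[(t, p)] += 1
    correct = counts[(0, 0)] + counts[(1, 1)]
    total = len(true_labels)
    return {
        "correct": correct,
        "total": total,
        "fraction_true_predictions": correct / total,
        "controls_called_disease": counts[(0, 1)],
        "disease_called_control": counts[(1, 0)],
    }


# Synthetic labels matching the slide's counts, for illustration only.
true_labels = [0] * 154 + [1] * 17
predicted_labels = [0] * 147 + [1] * 7 + [1] * 17
summary = summarize(true_labels, predicted_labels)
```

With these labels, `summary["correct"]` is 164 out of 171, and the 7 misclassified controls are fewer than 5% of the 154 controls.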

17 LDF value distribution for control and cancer samples
LDF calculated from two features: CA125 and peak #17 intensity.
[Figure: distribution of LDF values (x-axis) against number of samples (y-axis) for non-cancer (control) samples and cancer samples taken 1-6 months prior to cancer detection]

18 LDF value vs time before diagnosis
The LDF values were calculated for all samples from the OC patients (18 patients). In the plot, the X axis is the time before diagnosis and the Y axis is the LDF value. For most samples the LDF value exceeds zero within the 10 months before diagnosis. Five cases show no increase in LDF value over this period (they have a small number of samples). One patient (ID 3480) has an LDF value greater than zero over the whole period. Thus positive LDF values based on CA125 and the MS peak intensity in the range [2983.0, ] can be used as OC markers for prognosis within 6 months.
[Figure: LDF value vs. time (months) by patient ID, with cancer and non-cancer regions marked; annotation: "Wrong classification of non-cancer case for some OC patients at time=0"]

