Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.

1 Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology

2 Re-ranking identified spectra (Keller Analytical Chemistry 2002) (Anderson J Proteome Research 2003) (Käll Nature Methods 2007)

3 EAMPK EAMPK? This is the problem we set out to solve

4 Modified problem: Is this peptide assignment correct? [Figure: spectrum plotted as intensity versus m/z, annotated with the candidate peptide VVVTGLGMLSPVGNTVESTWK; remaining labels: +2, 1304.4, +1, 888.14, +1]

5 Peptide-spectrum match features:
Total peptide mass
Charge (+1, +2 or +3)
Total ion current
Peak count
Preliminary SEQUEST score (Sp)
Sp rank
Cross-correlation score (XCorr)
Change in XCorr (delta Cn)
Mass difference
Percent of theoretical peaks matched
Percent of observed peaks matched
Percent of peptide fragment ion current matched
Percent sequence identity between top and second-ranked peptides

6

7 Uses linear discriminant analysis rather than SVM. Uses a four-dimensional feature space (XCorr, delta Cn, ln SpRank, delta mass). Uses EM to fit distributions to the discriminants of the two classes, yielding a probability. Learns a simple, independent probability model of the number of tryptic termini. Publicly available software, PeptideProphet, is widely used.
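
The step "uses EM to fit distributions to the discriminants of the two classes, yielding a probability" can be illustrated with a minimal sketch. It assumes, purely for simplicity, that both the correct and the incorrect class are Gaussian in the discriminant score; the distribution families PeptideProphet actually fits are not stated on the slide, and the function names below are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_class_em(scores, n_iter=100):
    """Fit a two-Gaussian mixture to discriminant scores with EM.

    Assumption: both classes are Gaussian (a simplification).
    Returns (posteriors, params), where posteriors[i] is the estimated
    probability that scores[i] comes from the higher-scoring ("correct") class.
    """
    lo, hi = min(scores), max(scores)
    # Index 0 = incorrect class (low scores), index 1 = correct class (high scores).
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sigma = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    pi_correct = 0.5
    post = [0.5] * len(scores)
    for _ in range(n_iter):
        # E-step: posterior probability of the correct class for each score.
        post = []
        for s in scores:
            a = pi_correct * normal_pdf(s, mu[1], sigma[1])
            b = (1.0 - pi_correct) * normal_pdf(s, mu[0], sigma[0])
            post.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: re-estimate each class from its weighted scores.
        for k, weights in ((1, post), (0, [1.0 - p for p in post])):
            total = sum(weights)
            if total <= 0:
                continue
            mu[k] = sum(w * s for w, s in zip(weights, scores)) / total
            var = sum(w * (s - mu[k]) ** 2 for w, s in zip(weights, scores)) / total
            sigma[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
        pi_correct = sum(post) / len(post)
    params = {"mu": mu, "sigma": sigma, "pi_correct": pi_correct}
    return post, params
```

Each returned posterior plays the role of the per-match probability described on the slide: scores near the high-scoring component get values near 1, scores near the low-scoring component get values near 0.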

8

9 [Figure labels: peptide-spectrum matches against the real database; peptide-spectrum matches against the shuffled database; q = 0.01]
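
A threshold such as q = 0.01 can be derived from the two sets of matches roughly as follows. This is a minimal sketch, assuming the real and shuffled databases are searched separately and are the same size, so that the number of shuffled-database matches above a score estimates the number of false matches among the real-database ones. The function name is hypothetical.

```python
def target_decoy_qvalues(target_scores, decoy_scores):
    """Return [(score, q_value)] for the target PSMs, best score first.

    Assumption: higher scores are better, and the decoy count above a
    threshold estimates the number of incorrect target PSMs above it.
    """
    labeled = [(s, True) for s in target_scores]
    labeled += [(s, False) for s in decoy_scores]
    # Best score first; on ties, decoys come first (conservative estimate).
    labeled.sort(key=lambda item: (-item[0], item[1]))

    n_target = 0
    n_decoy = 0
    fdrs = []  # estimated FDR when thresholding at each target score
    for score, is_target in labeled:
        if is_target:
            n_target += 1
            fdrs.append((score, min(1.0, n_decoy / n_target)))
        else:
            n_decoy += 1

    # q-value: smallest FDR at this threshold or any more permissive one.
    qvalues = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    qvalues.reverse()
    return qvalues
```

Counting the entries with a q-value of at most 0.01 gives the number of matches accepted at that threshold.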

10

11 Features

12

13 [Figure labels: 2780 PSMs; 13706 PSMs; 8050 PSMs; 12691 PSMs; 1% FDR; 10863 PSMs]

14 Cleaving with elastase

15 Variation by data set. Black lines are q = 0.01. Yellow line is y = x. Red line = equal q-value thresholds. [Panels: elastase data set; chymotrypsin data set]

16 Percolator best match SEQUEST best match

17

18 Protein identification

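19 The protein ID problem. [Figure: three layers labeled Proteins, Peptides and Spectra; sequence labels, run together in the transcript: EEAMPFKCYCYGGLGKCYCLLIGKFTEILYCDLNRVNILLGLPK; numeric labels: 1.0, 0.95, 0.98, 0.87, 0.74]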

20 The peptide-to-protein mapping is many-to-many

21 One- and two-peptide rules use a simple threshold. [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.03, 0.10, 0.91, 0.99, 0.97; threshold ≥ 0.90]
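
The one- and two-peptide rules amount to counting, for each protein, the peptides whose probability clears the threshold. A minimal sketch, in which the function name and the peptide-to-protein mapping are hypothetical and only the 0.90 threshold comes from the slide:

```python
def proteins_passing_rule(protein_to_peptides, peptide_prob,
                          threshold=0.90, min_peptides=2):
    """Return the proteins with at least `min_peptides` confident peptides.

    min_peptides=1 gives the one-peptide rule, min_peptides=2 the
    two-peptide rule. A peptide is confident if its probability is at
    least `threshold`; peptides with no probability count as 0.0.
    """
    passing = []
    for protein, peptides in protein_to_peptides.items():
        n_confident = sum(
            1 for pep in peptides if peptide_prob.get(pep, 0.0) >= threshold
        )
        if n_confident >= min_peptides:
            passing.append(protein)
    return sorted(passing)
```

Because each protein is judged independently, a confident peptide shared by several proteins counts toward all of them.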

22 IDPicker: Select the minimum number of proteins to explain the peptides. [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.03, 0.10, 0.91, 0.99, 0.97; threshold ≥ 0.90]
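
Selecting the minimum number of proteins that explain the confident peptides is a set-cover problem. The sketch below uses the standard greedy set-cover heuristic as an illustration. It is not claimed to be the procedure IDPicker implements, it does not guarantee a true minimum, and the function name is hypothetical.

```python
def greedy_minimal_protein_set(protein_to_peptides, confident_peptides):
    """Greedily pick proteins until every confident peptide is explained.

    Returns (chosen_proteins, unexplained_peptides). Ties are broken
    alphabetically by protein name so the result is deterministic.
    """
    uncovered = set(confident_peptides)
    chosen = []
    while uncovered:
        # Pick the protein that explains the most still-uncovered peptides.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(uncovered & set(protein_to_peptides[prot])),
        )
        gained = uncovered & set(protein_to_peptides[best])
        if not gained:
            break  # remaining peptides map to no protein in the input
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered
```

With many-to-many peptide-to-protein mappings, several protein sets of the same size can explain the data equally well. The greedy choice returns just one of them.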

23 ProteinProphet: Use an EM-like procedure… [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.03, 0.10, 0.91, 0.99, 0.97, 0.3, 0.7, 0.8, 0.2, 1, 1, 0.55, 0.45]

24 ProteinProphet. [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.03, 0.10, 0.91, 0.3, 0.7, 0.8, 0.2, 1, 0.45, 0.91, 0.03, 0.97, 0.99, 1, 0.55, 0.97]

25 ProteinProphet. [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.8 × 0.03, 0.10, 0.3 × 0.91, 0.7 × 0.91, 0.2 × 0.03, 0.45 × 0.97, 0.99, 0.55 × 0.97]

26

27 EM-like algorithm: E-step, M-step. [Equation annotations: all proteins containing peptide i; probability of protein n; weight of link from peptide i to protein n; maximum probability assigned to peptide i]
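
The slide's equations did not survive in the transcript, so the following is a hedged reconstruction from the annotations, using the form in which the ProteinProphet iteration is commonly described: the protein probability is P_n = 1 − Π_i (1 − w_i^n p_i), and the link weight is w_i^n = P_n divided by the sum of P over all proteins containing peptide i. The sketch leaves out ProteinProphet's further adjustments to peptide probabilities. The function name is hypothetical, and the code does not say which update is the E-step and which the M-step.

```python
def protein_prophet_like(protein_to_peptides, peptide_prob, n_iter=50):
    """Alternate protein-probability and link-weight updates.

    peptide_prob[i] is the maximum probability assigned to peptide i.
    Returns (protein_prob, weight), with weight keyed by (peptide, protein).
    """
    # For each peptide, list all proteins that contain it.
    parents = {}
    for prot, peptides in protein_to_peptides.items():
        for pep in peptides:
            parents.setdefault(pep, []).append(prot)

    # Start by sharing each peptide equally among its proteins.
    weight = {
        (pep, prot): 1.0 / len(prots)
        for pep, prots in parents.items()
        for prot in prots
    }
    protein_prob = {}
    for _ in range(n_iter):
        # P_n = 1 - prod_i (1 - w_i^n * p_i)
        for prot, peptides in protein_to_peptides.items():
            miss = 1.0
            for pep in peptides:
                miss *= 1.0 - weight[(pep, prot)] * peptide_prob[pep]
            protein_prob[prot] = 1.0 - miss
        # w_i^n = P_n / sum of P over all proteins containing peptide i
        for pep, prots in parents.items():
            total = sum(protein_prob[p] for p in prots)
            for prot in prots:
                if total > 0:
                    weight[(pep, prot)] = protein_prob[prot] / total
                else:
                    weight[(pep, prot)] = 1.0 / len(prots)
    return protein_prob, weight
```

In this sketch a shared peptide drifts toward the protein that has other supporting evidence, which is the behaviour the weighted links in the preceding figures illustrate.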

28 Nested Mixture Model (Li Ann Applied Statistics 2010). Modeled as mixture of present and absent. Model number of matches conditional on protein states. Model distribution of scores conditional on peptide states. [Figure: Proteins (X), Peptides (Y), Spectra (D); labels: 0.03, 0.10, 0.91, 0.03, 0.97, 0.99, 0.97]

29 The emergence of graphical Bayesian methods. Generatively model Y | X and D | Y, then perform inference to get Pr(X | D).
Shen et al. 2008: Model the MS/MS process generatively (forward) using free parameters. Sum over all possible protein and peptide states to get posterior probabilities. Use expectation-maximization to get parameter estimates.
Li et al. 2008: Model the MS/MS process generatively using an existing static peptide detectability model. Use Markov chain Monte Carlo to estimate posterior probabilities.
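
The recipe of modeling Y | X and D | Y and then inferring Pr(X | D) can be made concrete on a toy example by summing over every protein state. The sketch below is an illustration under stated assumptions, not the model of either paper. It assumes independent protein priors `gamma`, a noisy-OR emission with per-protein emission probability `alpha` and noise probability `beta`, and a spectrum likelihood proportional to the peptide probability. All of these names and default values are hypothetical. Peptide states are summed out analytically inside the loop.

```python
from itertools import product


def protein_posteriors(protein_to_peptides, peptide_prob,
                       alpha=0.9, beta=0.01, gamma=0.5):
    """Brute-force Pr(X_r = 1 | D) for a small toy model.

    Cost grows as 2**(number of proteins), so this is only for tiny inputs.
    """
    proteins = sorted(protein_to_peptides)
    peptides = sorted(peptide_prob)
    numer = {prot: 0.0 for prot in proteins}
    denom = 0.0
    for states in product((0, 1), repeat=len(proteins)):
        present = {prot for prot, x in zip(proteins, states) if x}
        # Prior Pr(X): independent Bernoulli(gamma) per protein.
        joint = 1.0
        for prot in proteins:
            joint *= gamma if prot in present else 1.0 - gamma
        for pep in peptides:
            n_parents = sum(
                1 for prot in present if pep in protein_to_peptides[prot]
            )
            # Pr(Y = 1 | X): noisy-OR over the present parent proteins.
            q = 1.0 - (1.0 - beta) * (1.0 - alpha) ** n_parents
            p = peptide_prob[pep]
            # Sum over Y of Pr(D | Y) Pr(Y | X), with Pr(D | Y=1) ~ p
            # and Pr(D | Y=0) ~ 1 - p (toy assumption).
            joint *= p * q + (1.0 - p) * (1.0 - q)
        denom += joint
        for prot in present:
            numer[prot] += joint
    return {prot: numer[prot] / denom for prot in proteins}
```

Exhaustive enumeration is exact but exponential in the number of proteins, which is why practical methods resort to EM, sampling, or exploiting graph structure.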

30 Fido Fido performs exact calculations on a Bayesian network model

31 Barista uses a neural network to score PSMs. [Figure: PSM feature vector; input units: 17 PSM features; hidden units; output unit]

32 The Barista model includes spectra, peptides and proteins. [Figure: Proteins R1, R2, R3; Peptides E1, E2, E3, E4; Spectra S1, S2, S3, S4, S5, S6, S7; annotations: neural network score function; number of peptides in protein R]

33 Model Training
Search against a database containing real (target) and shuffled (decoy) proteins.
For each protein, the label y ∈ {+1, −1} indicates whether it is a target or decoy.
Hinge loss function: L(F(R), y) = max(0, 1 − yF(R))
Goal: Choose parameters W such that F(R) > 0 if y = +1 and F(R) < 0 if y = −1.
repeat
    Pick a random protein (R_i, y_i)
    Compute F(R_i)
    if 1 − y_i F(R_i) > 0 then
        Make a gradient step to optimize L(F(R_i), y_i)
    end if
until convergence
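
The training loop can be sketched in runnable form under simplifying assumptions. The PSM score function here is linear rather than a neural network. The protein score F(R) is taken to be the mean, over the protein's peptides, of the best PSM score for each peptide, which is one plausible reading of the model slide. A fixed number of steps stands in for "until convergence". All function names are hypothetical.

```python
import random


def psm_score(w, b, x):
    """Linear stand-in for the neural network PSM score function."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def protein_score(w, b, protein):
    """F(R): mean over peptides of the best-scoring PSM per peptide.

    `protein` is a list of peptides; each peptide is a list of PSM
    feature vectors. Also returns the best feature vector per peptide.
    """
    best = [max(psms, key=lambda x: psm_score(w, b, x)) for psms in protein]
    total = sum(psm_score(w, b, x) for x in best)
    return total / len(protein), best


def train(proteins, labels, n_features, lr=0.1, n_steps=2000, seed=0):
    """Stochastic (sub)gradient descent on the hinge loss max(0, 1 - yF(R))."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_steps):
        i = rng.randrange(len(proteins))  # pick a random protein
        y = labels[i]
        score, best = protein_score(w, b, proteins[i])
        if 1.0 - y * score > 0:
            # Subgradient of the hinge loss with respect to w is
            # -y * mean(best vectors); step in the opposite direction.
            n = len(best)
            for x in best:
                for j in range(n_features):
                    w[j] += lr * y * x[j] / n
            b += lr * y
    return w, b
```

After training, `protein_score(w, b, protein)[0]` is positive for proteins the model treats as targets and negative for those it treats as decoys.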

34 Barista performs well in target/decoy evaluation

35 Why does Barista work well?
Sources of information loss during two-stage analysis:
Spectra that are not confidently assigned to a peptide during the initial search are lost.
Also lost are lower-ranked peptides that match a given spectrum, corresponding to
– the correct peptide when the top-ranked peptide is incorrect, or
– a second correct peptide when the spectrum is chimeric.
A single score is less informative than a rich feature vector describing the PSM.

