PeptideProphet Explained. Brian C. Searle, Proteome Software Inc., 1336 SW Bertha Blvd, Portland OR 97219, (503) 244-6027, www.proteomesoftware.com. An explanation of the PeptideProphet algorithm developed by Keller, A., Nesvizhskii, A. I., Kolker, E., and Aebersold, R. (2002) Anal. Chem. 74, 5383–5392.

Threshold model (slide diagram: spectra, match scores, peptides, and proteins, sorted by match score). Before PeptideProphet was developed, a threshold model was the standard way of evaluating the peptides matched by a search of MS/MS spectra against a protein database. The threshold model sorts search results by a match score.

Set some threshold (slide diagram: search results sorted by match score, with example thresholds: SEQUEST XCorr > 2.5 and dCn > 0.1; Mascot Score > 45; X!Tandem Score < 0.01). Next, a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexities, and database sizes.

Below-threshold matches dropped (slide diagram: results sorted by match score, split into "correct" above and "incorrect" below the threshold). Peptides identified with scores above the threshold are considered "correct" matches. Those with scores below the threshold are considered "incorrect". There is no gray area where something is possibly correct.

There has to be a better way. The threshold model has these problems, which PeptideProphet tries to solve:
–Poor sensitivity/specificity trade-off, unless you consider multiple scores simultaneously.
–No way to choose an error rate (e.g. p = 0.05).
–Need to have different thresholds for:
 –different instruments (QTOF, TOF-TOF, IonTrap)
 –ionization sources (electrospray vs MALDI)
 –sample complexities (2D gel spot vs MudPIT)
 –different databases (SwissProt vs NR)
–Impossible to compare results from different search algorithms, multiple instruments, and so on.

Creating a discriminant score (slide diagram: search results sorted by match score). PeptideProphet starts with a discriminant score. If an application uses several scores (SEQUEST uses XCorr, ΔCn, and Sp scores; Mascot uses ion scores plus identity and homology thresholds), these are first converted to a single discriminant score.

Discriminant score for SEQUEST ln( XCorr ) 8.4* ln(# AAs ) 7.4* 0.2*ln( rankSp ) 0.3* 0.96  Cn D  Mass                    SEQUEST’s XCorr (correlation score) is corrected for length of the peptide. High correlation is rewarded. SEQUEST’s  Cn tells how far the top score is from the rest. Being far ahead of others is rewarded. The top ranked by SEQUEST’s Sp score has ln(rankSp)=0. Lower ranked scores are penalized. Poor mass accuracy (big  Mass) is also penalized. For example, here’s the formula to combine SEQUEST’s scores into a discriminant score:

Histogram of scores (figure: histogram of discriminant scores; x-axis: discriminant score D, y-axis: number of spectra in each bin). Once PeptideProphet calculates the discriminant scores for all the spectra in a sample, it makes a histogram of these discriminant scores. For example, in the sample shown here, 70 spectra have scores around 2.5.
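A small NumPy sketch of this binning step. The discriminant scores here are simulated from two normal populations that stand in for correct and incorrect matches; all parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated discriminant scores: mostly incorrect matches (low D) plus some
# correct matches (high D). The populations and parameters are illustrative only.
scores = np.concatenate([
    rng.normal(loc=-1.0, scale=1.2, size=4000),   # stand-in for "incorrect" matches
    rng.normal(loc=3.0, scale=1.0, size=1000),    # stand-in for "correct" matches
])

# Histogram of discriminant scores: number of spectra in each bin.
counts, edges = np.histogram(scores, bins=40)
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"D in [{lo:6.2f}, {hi:6.2f}): {n:4d} spectra")
```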

Mixture of distributions (figure: the same histogram with fitted "correct" and "incorrect" curves overlaid; x-axis: discriminant score D, y-axis: number of spectra in each bin). This histogram shows the distributions of correct and incorrect matches. PeptideProphet assumes that these distributions are standard statistical distributions. Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
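The curve fitting is done with an expectation-maximization (EM) procedure. The sketch below fits a two-Gaussian mixture to the scores; this is a simplification (the original PeptideProphet paper fits a Gaussian to correct matches and a gamma distribution to incorrect ones), but the EM bookkeeping is the same. The function name, initialization, and iteration count are all illustrative.

```python
import numpy as np

def em_two_gaussians(d, n_iter=200):
    """Fit a two-component Gaussian mixture to discriminant scores with EM.

    Simplified sketch: the E-step/M-step bookkeeping is the same whatever
    component distributions are assumed.
    """
    d = np.asarray(d, dtype=float)
    # Crude initialization: one component low, one high.
    mu = np.array([np.percentile(d, 25), np.percentile(d, 75)])
    sigma = np.array([d.std(), d.std()])
    weight = np.array([0.5, 0.5])            # mixing proportions ("incorrect", "correct")

    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        dens = np.stack([
            weight[k]
            * np.exp(-0.5 * ((d - mu[k]) / sigma[k]) ** 2)
            / (sigma[k] * np.sqrt(2.0 * np.pi))
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0)

        # M-step: re-estimate each component from its responsibilities.
        for k in range(2):
            w = resp[k]
            mu[k] = np.average(d, weights=w)
            sigma[k] = np.sqrt(np.average((d - mu[k]) ** 2, weights=w))
            weight[k] = w.mean()
    return weight, mu, sigma, resp

# Example usage with the simulated scores from the previous sketch:
# weights, means, sds, resp = em_two_gaussians(scores)
# resp[1] is then each spectrum's probability of belonging to the high-scoring
# ("correct") component.
```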

Bayesian statistics (figure: the fitted "correct" and "incorrect" distributions over the discriminant-score histogram). Once the correct and incorrect distributions are drawn, PeptideProphet uses Bayesian statistics to compute the probability that a match is correct given a discriminant score D: p(+|D) = p(D|+)·p(+) / [p(D|+)·p(+) + p(D|−)·p(−)].

Probability of a correct match (figure: the two fitted distributions, with the region around D = 2.5 highlighted). The statistical formula looks fierce, but relating it to the histogram shows that the probability of a score of 2.5 being correct is simply the share of the spectra in that bin accounted for by the "correct" curve.
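A minimal sketch of that Bayes'-rule calculation under the same two-Gaussian simplification used above; the mixture parameters in the example call are invented, so the printed probability is only illustrative:

```python
from math import exp, pi, sqrt

def gaussian(x, mu, sd):
    """Normal density, used here for both mixture components (a simplification)."""
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def posterior_correct(D, prior_correct, mu_pos, sd_pos, mu_neg, sd_neg):
    """Bayes' rule: p(+|D) = p(D|+)p(+) / [p(D|+)p(+) + p(D|-)p(-)]."""
    num = prior_correct * gaussian(D, mu_pos, sd_pos)
    den = num + (1.0 - prior_correct) * gaussian(D, mu_neg, sd_neg)
    return num / den

# Example with invented mixture parameters: the probability that a match with
# discriminant score D = 2.5 is correct.
print(posterior_correct(2.5, prior_correct=0.2, mu_pos=3.0, sd_pos=1.0,
                        mu_neg=-1.0, sd_neg=1.2))
```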

PeptideProphet model is accurate (figure: computed vs. actual probability, for all technical replicates together [large data set] and for individual samples [small data sets]; Keller et al., Anal. Chem. 2002). Keller et al. checked PeptideProphet on a control data set for which they knew the right answer. Ideally, the PeptideProphet-computed probability should be identical to the actual probability, corresponding to a 45-degree line on this graph. They tested PeptideProphet with both large and small data sets and found good agreement with the real probability. Since it was published, the Institute for Systems Biology has used PeptideProphet on a number of protein samples of varying complexity.

PeptideProphet is more sensitive than the threshold model (figure: sensitivity vs. error rate; the ideal corner, labeled "correctly identifies everything, with no error", sits at zero error and 100% sensitivity; Keller et al., Anal. Chem. 2002). This graph shows the trade-offs between the errors (false identifications) and the sensitivity (the percentage of possible peptides identified). The ideal is zero error and everything identified (sensitivity = 100%). PeptideProphet corresponds to the curved line. Squares 1–5 are thresholds chosen by other authors.
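The same trade-off curve can be estimated directly from the per-spectrum probabilities that PeptideProphet reports. A minimal sketch (the function name is mine, and `probs` stands for a hypothetical list of one probability per spectrum):

```python
import numpy as np

def error_sensitivity_curve(probs):
    """Estimate the error-rate/sensitivity trade-off from per-spectrum probabilities.

    Above any probability cutoff, the expected number of correct matches is the
    sum of the accepted probabilities and the expected number of incorrect
    matches is the sum of (1 - p); dividing by the appropriate totals gives
    sensitivity and error rate at that cutoff.
    """
    p = np.sort(np.asarray(probs, dtype=float))[::-1]   # most confident first
    expected_correct = np.cumsum(p)                      # correct matches kept so far
    expected_incorrect = np.cumsum(1.0 - p)              # incorrect matches kept so far
    n_accepted = np.arange(1, p.size + 1)

    sensitivity = expected_correct / p.sum()             # fraction of all correct matches kept
    error_rate = expected_incorrect / n_accepted         # fraction of kept matches that are wrong
    return p, error_rate, sensitivity
```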

PeptideProphet compared to a SEQUEST XCorr cutoff of 2 (figure annotation: XCorr > 2, dCn > 0.1, NTT = 2; Keller et al., Anal. Chem. 2002). For example, for a threshold of XCorr > 2 and ΔCn > 0.1 with only fully tryptic peptides allowed (see square 5 on the graph), SEQUEST's error rate is only 2%. However, its sensitivity is only 0.6; that is, only 60% of the spectra are identified. Using PeptideProphet, the same 2% error rate identifies 90% of the spectra, because the discriminant score is tuned to provide better results.

PeptideProphet compared to a charge-dependent cutoff (figure annotation: +2 XCorr > 2; +3 XCorr > 2.5; dCn > 0.1; NTT >= 1; Keller et al., Anal. Chem. 2002). Another example uses a different threshold for charge +2 and charge +3 spectra (see square 2 on the graph). For this threshold, the error rate is 8% and the sensitivity is 80%. At an error rate of 8%, PeptideProphet identifies 95% of the peptides.

PeptideProphet allows you to choose an error rate (figure: the same sensitivity vs. error curve; Keller et al., Anal. Chem. 2002). A big advantage is that you can choose any error rate you like, such as 5% for inclusive searches or 1% for extremely accurate searches.
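In practice, choosing an error rate amounts to choosing the probability cutoff whose estimated error rate meets your target. A short, self-contained sketch along the lines of the previous one (hypothetical function name; the example probabilities are invented):

```python
import numpy as np

def threshold_for_error_rate(probs, target_error=0.01):
    """Return the most inclusive probability cutoff whose estimated error rate
    stays at or below the chosen target (e.g. 0.05 or 0.01)."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]            # most confident first
    error_rate = np.cumsum(1.0 - p) / np.arange(1, p.size + 1)   # estimated error at each cutoff
    acceptable = np.nonzero(error_rate <= target_error)[0]
    if acceptable.size == 0:
        return None                                              # target unreachable
    return p[acceptable[-1]]

# Example with invented probabilities and a 5% error target: prints 0.9,
# i.e. accept every match with probability >= 0.9.
print(threshold_for_error_rate([0.99, 0.98, 0.95, 0.90, 0.60, 0.40, 0.10], 0.05))
```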

There has to be a better way (revisited). Recall the problems that PeptideProphet was designed to fix:
–Poor sensitivity/specificity trade-off unless you consider multiple scores simultaneously.
–No way to choose an error rate (e.g. p = 0.05).
–Need to have different thresholds for:
 –different instruments (QTOF, TOF-TOF, IonTrap)
 –ionization sources (electrospray vs MALDI)
 –sample complexities (2D gel spot vs MudPIT)
 –different databases (SwissProt vs NR)
–Impossible to compare results from different search algorithms, multiple instruments, and so on.
How well did it do?

PeptideProphet: better scores (the slide repeats the problem list, adding "discriminant score"). The discriminant score combines the various scores into one optimal score.

PeptideProphet: better control of the error rate (the slide repeats the problem list, adding "discriminant score" and "estimate error with distributions"). The error vs. sensitivity curves derived from the distributions allow you to choose the error rate.

PeptideProphet: better adaptability (the slide repeats the problem list, adding "discriminant score", "estimate error with distributions", and "curve-fit distributions to data (EM)"). Each experiment has a different histogram of discriminant scores, to which the probability curves are automatically adapted.

PeptideProphet: better reporting (the slide repeats the problem list, adding "discriminant score", "estimate error with distributions", "curve-fit distributions to data (EM)", and "report P-values"). Because results are reported as probabilities, you can compare different programs, samples, and experiments.

PeptideProphet summary:
–Identifies more peptides in each sample.
–Allows trade-offs: wrong peptides against missed peptides.
–Provides probabilities that are easy to interpret and comparable between experiments.
–Automatically adjusts for each data set.
–Has been verified on many real proteome samples.