Interpreting MS/MS Proteomics Results

Presentation transcript:

Interpreting MS/MS Proteomics Results. Brian C. Searle, Proteome Software Inc., Portland, Oregon USA. Brian.Searle@ProteomeSoftware.com. NPC Progress Meeting (February 2nd, 2006). Illustrated by Toni Boudreault. The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product Scaffold. With that caveat aside…

Organization: Identify, SEQUEST, X! Tandem/Mascot, Differ, Combine. This is first and foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place. Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot. We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.

Start with a protein. So, this is proteomics, so we’re going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly.

Cut with an enzyme. And to use this technique you generally have to digest the protein into peptides about 8 to 20 amino acids in length and…

Select a peptide. …look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer.

Impart energy in collision cell. The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids.

Measure mass of daughter ions. The masses of these fragment ions are recorded using the second mass spectrometer. [Spectrum: intensity versus m/z, with peaks at 72.0 (A), 201.1 (AE), 298.1 (AEP), and 399.2 (AEPT).]

B-type Ions. These ions are commonly called B ions, based on nomenclature you don’t really want to know about. But the mass difference between the peaks corresponds directly to the amino acid sequence. [Spectrum: intensity versus m/z for A E P T I R, with successive peak spacings of 72.0, 129.0, 97.0, 101.0, 113.1, and 174.1 (the last including H2O).]

B-type Ions. For example, the A-E peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide. This is pretty neat, and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides. [Mass differences: A - 0 = 72.0; AE - A = 129.0; AEP - AE = 97.0; AEPT - AEP = 101.0; AEPTI - AEPT = 113.1; AEPTIR - AEPTI = 174.1 (including H2O).]
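The mass-difference idea can be sketched in a few lines of Python. This is only an illustration, assuming a clean, complete B-ion ladder (the peaks 72.0, 201.1, 298.1, 399.2 from the earlier slide), standard monoisotopic residue masses, and a 0.1 Da matching tolerance; the function names are hypothetical and not taken from any tool discussed in the talk.

```python
# Monoisotopic residue masses in daltons (standard values; I and L are isobaric,
# so "L" stands in for both here).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # charge carrier on a singly charged b ion


def closest_residue(mass, tolerance=0.1):
    """Return the residue nearest to `mass`, or '?' if none is within tolerance."""
    residue, best = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - mass))
    return residue if abs(best - mass) <= tolerance else "?"


def sequence_from_b_ions(peaks, tolerance=0.1):
    """Read a sequence off a b-ion ladder using peak-to-peak mass differences."""
    previous = PROTON  # an "empty" b ion would sit at the proton mass
    letters = []
    for peak in sorted(peaks):
        letters.append(closest_residue(peak - previous, tolerance))
        previous = peak
    return "".join(letters)


print(sequence_from_b_ions([72.0, 201.1, 298.1, 399.2]))
```

A real spectrum rarely offers a ladder this tidy, which is exactly the problem the following slides describe.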

So, it seems pretty easy, doesn’t it? But there are a couple of confounding factors. For example…

B-type Ions. …B ions have a tendency to degrade and lose carbon monoxide, producing…

A-type Ions. …A ions. Furthermore…

Y-type Ions. …the second half are represented as Y ions that sequence backwards. And, unfortunately, this is the real world, so…

Y-type Ions. …all the peaks have different measured heights, and many peaks can often be missing.

B-type, A-type, Y-type Ions. All these peaks are seen together simultaneously, and we don’t even know…

…what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques…

…there’s noise, producing a final spectrum that looks like…

…this, on a good day. And so it’s actually fairly difficult to…

…compute the mass differences to sequence the peptide, certainly in a computer automated way.

So the community needed a new technique. Now, it wasn’t all without hope…

Known Ion Types: B-type ions, A-type ions, Y-type ions. We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Known Ion Types: B-type ions, A-type ions, Y-type ions, B- or Y-type +2H ions, B- or Y-type -NH3 ions, B- or Y-type -H2O ions. …we also know a couple of other variations on those ions that come up. We even know something about the…

Known Ion Types: B-type and Y-type ions (100%), B- or Y-type +2H ions (50%), A-type ions and B- or Y-type -NH3 or -H2O ions (20%). …likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like! So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

Model Spectrum: ELVISLIVESK. So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle. *Courtesy of Dr. Richard Johnson http://www.hairyfatguy.com/

Model Spectrum. We can create a hypothetical spectrum based on our rules…

…where B and Y ions are estimated at 100%, +2H ions are estimated at 50%, and A ions and other stragglers are at 20%. [Legend: B/Y type ions (100%); B/Y +2H type ions (50%); A type ions and B/Y -NH3/-H2O (20%).]
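To make the rules concrete, here is a rough Python sketch that builds such a hypothetical spectrum for ELVISLIVESK. It rests on several assumptions that are not in the talk: standard monoisotopic masses, singly charged B and Y ions, and a reading of "+2H" as the doubly charged form of the same fragment. The function name and the tuple layout are made up for illustration.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER, AMMONIA, CO = 1.00728, 18.01056, 17.02655, 27.99491


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity, label) tuples for a model spectrum."""
    peaks = []
    n = len(peptide)
    for i in range(1, n):  # one cleavage site between each pair of residues
        b = sum(RESIDUE_MASS[r] for r in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[r] for r in peptide[i:]) + WATER + PROTON
        for label, mz in ((f"b{i}", b), (f"y{n - i}", y)):
            peaks.append((mz, 1.0, label))                         # B/Y ions: 100%
            peaks.append(((mz + PROTON) / 2, 0.5, label + " +2H"))  # assumed doubly charged: 50%
            peaks.append((mz - AMMONIA, 0.2, label + " -NH3"))      # neutral losses: 20%
            peaks.append((mz - WATER, 0.2, label + " -H2O"))
        peaks.append((b - CO, 0.2, f"a{i}"))                       # A ions: 20%
    return sorted(peaks)


for mz, intensity, label in model_spectrum("ELVISLIVESK")[:5]:
    print(f"{mz:9.4f}  {intensity:.1f}  {label}")
```

Comparing a measured spectrum against this list is then a matter of checking which model peaks have a measured peak nearby, which is what the next slides show graphically.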

Model Spectrum. So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…

Model Spectrum. …we can find where the overlap is between the hypothetical and the actual spectra…

Model Spectrum. …and say conclusively, based on the evidence, that the spectrum does belong to the ELVIS LIVES K peptide.

But who cares? The more important question is “what about situations where we don’t know the sequence?”

We guess! Well, we can guess.

PepSeq. And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical. [Candidates: AAAAAAAAAA, AAAAAAAAAC, AAAAAAAACC, AAAAAAACCC, …, ELVISLIVESK, …, WYYYYYYYYY, YYYYYYYYYY.] J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.

PepSeq. This was a start, but it’s clearly impossibly hard with larger peptides, and there’s a lot of room to overfit the data. [Slide: Impossibly hard after 7 or 8 amino acids! High false positive rate because you consider so many options.]

PepSeq. So obviously this isn’t going to work in the long run. Another strategy is needed!

Sequencing Explosion. We needed a new invention to come around, and that was shotgun Sanger sequencing. In 89 and 90 the Yeast and Human Genome projects were announced, followed by the first chromosome in 92, et cetera, et cetera. [Timeline: 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced; 1989 Yeast Genome project announced; 1990 Human Genome project announced; 1992 First chromosome (Yeast) sequenced; 1995 H. influenzae sequenced; 1996 Yeast Genome sequenced; 2000 Human Genome draft.]

Sequencing Explosion. In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was… Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

SEQUEST. …instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete human genome in a reasonable amount of time.

SEQUEST. [Search space sizes: 2*10^14, all possible 11mers (ELVISLIVESK); 2*10^10, all possible peptides in NR; 1*10^8, all tryptic peptides in NR; 4*10^6, all human tryptic peptides in NR.] So, in terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database from the NCBI, and roughly a 50 million fold difference for searching human tryptic peptides. So that was huge, and it made hypothetical spectrum matching feasible.
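The fold differences follow directly from the counts on the slide. A quick Python check, assuming 20 standard amino acids and taking the database sizes as the rounded figures quoted on the slide rather than measured values:

```python
# Every 11-residue sequence over the 20 standard amino acids.
all_11mers = 20 ** 11

# Rounded candidate counts quoted on the slide (not recomputed here).
search_spaces = {
    "all possible peptides in NR": 2e10,
    "all tryptic peptides in NR": 1e8,
    "all human tryptic peptides in NR": 4e6,
}

print(f"all possible 11mers: {all_11mers:.1e}")
for name, size in search_spaces.items():
    fold = all_11mers / size
    print(f"{name}: {size:.0e} candidates, {fold:,.0f} fold fewer than all 11mers")
```

Each restriction (real sequences, then tryptic cleavage, then a single organism) removes several orders of magnitude of candidates before any spectrum is scored.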

SEQUEST Model Spectrum. SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discrepancy between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization…

SEQUEST Model Spectrum. …like so. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.

SEQUEST Model Spectrum. And then they shifted the spectra back and…

SEQUEST Model Spectrum. …forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.

SEQUEST XCorr. This is another representation of the Cross Correlation and the Auto Correlation. [Figure: correlation score versus offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background).] Gentzel M. et al Proteomics 3 (2003) 1597-1610

SEQUEST XCorr. The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range (offsets from -75 to +75 AMU). The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification. [Figure: correlation score versus offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background).] Gentzel M. et al Proteomics 3 (2003) 1597-1610
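A toy version of this comparison, written as a sketch rather than as SEQUEST’s actual implementation: spectra are assumed to be binned into 1 AMU bins, the background is averaged over every nonzero offset from -75 to +75, and the score is formed as a ratio because that is how the talk words it (descriptions of the published algorithm usually subtract the background instead). All names are hypothetical.

```python
def correlation(observed, model, offset):
    """Dot product of the observed spectrum with the model shifted by `offset` bins."""
    total = 0.0
    for i, intensity in enumerate(observed):
        j = i + offset
        if 0 <= j < len(model):
            total += intensity * model[j]
    return total


def xcorr_ratio(observed, model, max_offset=75):
    """Direct comparison (offset 0) divided by the mean background at nonzero offsets."""
    direct = correlation(observed, model, 0)
    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(correlation(observed, model, t) for t in offsets) / len(offsets)
    return direct / background if background > 0 else float("inf")


def binned(peak_bins, size=500):
    """Unit-intensity spectrum with a peak in each listed 1 AMU bin."""
    spectrum = [0.0] * size
    for b in peak_bins:
        spectrum[b] = 1.0
    return spectrum


observed = binned([72, 150, 201, 298, 399])   # four real peaks plus one noise peak
right_model = binned([72, 201, 298, 399])
wrong_model = binned([90, 180, 270, 360])

print(xcorr_ratio(observed, right_model), xcorr_ratio(observed, wrong_model))
```

The point of the background term is visible even in this toy: a model that only lines up with the observed peaks when shifted scores no better than chance.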

SEQUEST DeltaCn. And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
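The slide’s formula did not survive in this transcript, so the sketch below uses the commonly quoted definition, the gap between the best and second-best XCorr divided by the best XCorr. Treat that definition, and the function name, as assumptions.

```python
def delta_cn(xcorrs):
    """Normalized gap between the best and second-best XCorr for one spectrum."""
    best, second = sorted(xcorrs, reverse=True)[:2]
    return (best - second) / best


# XCorr values for three candidate peptides matched to the same spectrum.
print(delta_cn([2.0, 4.0, 3.0]))
```

Only two candidates enter the calculation, however many were scored, which is why the talk calls it crude.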

Here’s another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness. [Grid: SEQUEST: Accuracy Score, Strong (XCorr); Relative Score, Weak (DeltaCn).]

Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill. [Grid: SEQUEST: Accuracy Score, Strong (XCorr); Relative Score, Weak (DeltaCn). Alternate Method: Accuracy Score, Weak; Relative Score, Strong.]

X! Tandem Scoring. by-Score = sum of intensities of peaks matching B-type or Y-type ions. HyperScore = by-Score with factorial terms attached. Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand waving argument. Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774
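The HyperScore formula itself is missing from this transcript. A commonly cited form multiplies the by-Score by the factorials of the number of matched B ions and matched Y ions; the sketch below assumes that form, and its names are illustrative only.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """by-Score times Nb! times Ny!, where the arguments are matched peak intensities."""
    by_score = sum(matched_b) + sum(matched_y)
    return by_score * factorial(len(matched_b)) * factorial(len(matched_y))


# Two matched B-ion peaks and three matched Y-ion peaks (relative intensities).
print(hyperscore([1.0, 0.5], [1.0, 1.0, 0.5]))
```

The factorials reward matching many ions far more steeply than matching a few intense ones, which is the "hand waving" part of the score.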

Distribution of “Incorrect” Hits. But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm. Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially… [Histogram: # of matches versus Hyper Score, with the second best and best hits marked.]

Estimate Likelihood (E-Value). …and the log of the distribution is relatively linear because of the exponential decay. [Plot: log(# of matches) versus Hyper Score, with the best hit marked.]

Estimate Likelihood (E-Value). If the distribution represents the number of random matches at any given score, the linear fit should correspond to the expected number of random matches. [Plot: log(# of matches) versus Hyper Score, with the linear fit labeled as the expected number of random matches and the best hit marked.]

Estimate Likelihood (E-Value). And from this, you can calculate the likelihood that the best match is random. In this case, a score of 60 corresponds with a log number of matches being -1, which means the estimated number of random matches for that score is 1 over 10 (0.1). This is called an E-Value, or Expected-Value. [Slide: Score of 60 has 1/10 chance of occurring at random.]
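The extrapolation can be sketched with an ordinary least-squares line through the log counts. This is a simplified illustration, not X! Tandem’s code: the histogram is synthetic and chosen so that the fit reproduces the slide’s example (a log count of -1 at a score of 60), and the function names are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def e_value(score_counts, best_score):
    """Extrapolate the log10 histogram of presumed-incorrect hits out to `best_score`."""
    scores = [s for s, count in score_counts.items() if count > 0]
    logs = [math.log10(score_counts[s]) for s in scores]
    slope, intercept = fit_line(scores, logs)
    return 10 ** (slope * best_score + intercept)


# Synthetic, exponentially decaying histogram: log10(count) = 5 - 0.1 * score.
counts = {score: 10 ** (5 - 0.1 * score) for score in range(10, 45, 5)}
print(e_value(counts, 60))
```

Because the line is fitted to the lower scoring hits only, the best hit is judged against what random matches alone would be expected to produce at that score.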

X! Tandem and Mascot. [Slide: E-Value = likelihood that match is incorrect relative to N guesses; empirical (X! Tandem). P-Value = likelihood that match is incorrect (E~P·N); theoretical (Mascot).] Now, X! Tandem calculates this E-Value empirically. Another search engine, Mascot, tries to get at the same kind of number using theoretical calculations, most likely based on the number of identified peaks and the likelihood of finding certain amino acids in the genome database. They’ve never explicitly published their algorithm, so we’ll never really know, but I suspect it’s something smart. I just want to bring up a point that we’ll touch on a little later…

X! Tandem and Mascot. [Slide: E-Value = likelihood that match is incorrect relative to N guesses; empirical (X! Tandem). P-Value = likelihood that match is incorrect (E~P·N); theoretical (Mascot). Probability = likelihood that match is correct. Note (Probability≠1-P)!] …the E-Value that X! Tandem calculates and the P-Value that Mascot calculates are probabilistically based, but they can only estimate the likelihood that the match is wrong. This is realistically not nearly as useful as knowing the probability that a peptide identification is right, which is NOT 1 minus the P-Value.

Now, let’s go back and fill in the X! Tandem part of our accuracy/relativity scoring grid. [Grid: SEQUEST: Accuracy Score, XCorr; Relative Score, DeltaCn. X! Tandem: Accuracy Score, HyperScore; Relative Score, E-Value.]

To reiterate, the XCorr is an excellent measure of accuracy…

…whereas the E-Value is an excellent measure of how good the best score is relative to the rest. If we assume that accuracy and relativity scores are independent measures of goodness, could we use both SEQUEST’s XCorr and X! Tandem’s E-Value together?

10 Protein Control Sample. And the answer is a resounding yes. Each point on this graph is a spectrum, where correct identifications are marked in red, while incorrect identifications are marked in blue. We know what’s correct and incorrect because this is a control sample. Although in general the spectra SEQUEST scores well are spectra X! Tandem also scores well, there is considerable scatter between the search engines. [Scatter plot: X! Tandem -log(E-Value) versus SEQUEST Discriminant Score.]

10 Protein Control Sample [scatter plot; axes: X! Tandem: -log(E-Value) and Mascot: Ion-Identity Score]
One might wonder whether X! Tandem and Mascot, which use similar scoring approaches, would benefit as much from being combined, but the answer is, surprisingly, still yes! Now, why are the scores so different?

Why So Different? SEQUEST: considers relative intensities. X! Tandem: considers semi-tryptic peptides; considers only B/Y-type ions. Mascot: considers theoretical P-Value relative to search space.
Well, here are a couple of possible reasons. SEQUEST is the only method to consider relative intensities.

X! Tandem is the only method to consider peptides outside the standard search space by default, such as semi-tryptic peptides. However, it’s also the only one whose score considers only B and Y ions, as opposed to a complete model.

And Mascot is the only search engine to compute a completely theoretical P-Value.

Consider Multiple Algorithms? [scatter plot; axes: X! Tandem: -log(E-Value) and Mascot: Ion-Identity Score]
So we clearly want to consider multiple search engines simultaneously, but how?

How To Compare Search Engines? SEQUEST: XCorr>2.5, DeltaCn>0.1. Mascot: Ion Score-Identity Score>0. X! Tandem: E-Value<0.01.
You can’t use a thresholding system because it’s impossible to find corresponding thresholds. For example, a SEQUEST match with an XCorr of 2.5 doesn’t mean the same thing as an X! Tandem match with an E-Value of 0.01.
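As a rough illustration of why these filters cannot be lined up against each other, the three thresholds on the slide can be written as separate pass/fail checks. This is only a sketch: the function and parameter names are made up for illustration, and only the cut-off values come from the slide.

```python
def passes_sequest(xcorr, delta_cn):
    """SEQUEST filter from the slide: XCorr > 2.5 and DeltaCn > 0.1."""
    return xcorr > 2.5 and delta_cn > 0.1


def passes_mascot(ion_score, identity_score):
    """Mascot filter from the slide: Ion Score minus Identity Score > 0."""
    return ion_score - identity_score > 0


def passes_xtandem(e_value):
    """X! Tandem filter from the slide: E-Value < 0.01."""
    return e_value < 0.01


# Hypothetical matches, one per search engine.
print(passes_sequest(xcorr=3.1, delta_cn=0.25))
print(passes_mascot(ion_score=42.0, identity_score=38.0))
print(passes_xtandem(e_value=0.5))
```

Each check answers yes or no on its own scale. Nothing in the code relates an XCorr of 2.5 to an E-Value of 0.01, which is the gap that converting scores to probabilities is meant to close.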

How To Compare Search Engines? Need to convert scores to probabilities!
The simplest way would be to convert the scores into probabilities and compare those. We advocate for Andrew Keller and Alexey Nesvizhskii’s Peptide Prophet approach because it actually calculates a true probability, not just a p-value.

X! Tandem approach, 10 Protein Control Sample (Q-ToF) [histogram; axes: # of Matches and Mascot: Ion-Identity Score; labels: Other Incorrect IDs for Spectrum, Possibly Correct?]
So if you remember, X! Tandem considers the best peptide match for a spectrum against a distribution of incorrect matches.

Peptide Prophet approach, 10 Protein Control Sample (Q-ToF) [histogram; axes: # of Matches and Mascot: Ion-Identity Score; labels: ALL Other “Best” Matches, Possibly Correct?] Keller, A. et al. Anal. Chem. 74, 5383-5392
Well, Peptide Prophet looks across the entire sample, and not at just one spectrum at a time. It compares the best match against all of the other best matches in the sample, and that distribution is clearly bimodal.

The low mode represents matches that are most likely wrong, while the high mode represents matches that are probably right.

Peptide Prophet approach, 10 Protein Control Sample (Q-ToF) [histogram with fitted curves; axes: # of Matches and Mascot: Ion-Identity Score; labels: “Incorrect”, “Correct”, Possibly Correct?]
Peptide Prophet curve-fits two distributions to the modes, following the assumption that the low-scoring distribution is “incorrect” and that the higher-scoring distribution is “correct”.
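One common way to fit two overlapping distributions to a bimodal set of scores is expectation-maximization (EM). The sketch below is an illustration of that general idea and is not taken from the slides: it assumes both modes are Gaussian purely for brevity, the function names are hypothetical, and the scores are made up.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal curve with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_modes(scores, iterations=50):
    """Fit an 'incorrect' (low) and a 'correct' (high) Gaussian with EM.

    Returns (fraction_correct, (mean, sd) of incorrect, (mean, sd) of correct).
    """
    n = len(scores)
    overall_mean = sum(scores) / n
    overall_sd = math.sqrt(sum((s - overall_mean) ** 2 for s in scores) / n)
    # Start the low mode at the lowest score and the high mode at the highest.
    mean_inc, mean_cor = float(min(scores)), float(max(scores))
    sd_inc = sd_cor = max(overall_sd, 1e-3)
    frac_cor = 0.5
    for _ in range(iterations):
        # E-step: how strongly each score belongs to the "correct" mode.
        resp = []
        for s in scores:
            cor = frac_cor * normal_pdf(s, mean_cor, sd_cor)
            inc = (1.0 - frac_cor) * normal_pdf(s, mean_inc, sd_inc)
            total = cor + inc
            resp.append(cor / total if total > 0.0 else 0.5)
        n_cor = sum(resp)
        n_inc = n - n_cor
        if n_cor < 1e-9 or n_inc < 1e-9:
            break  # one mode has vanished; keep the last usable estimates
        # M-step: re-estimate each mode from its weighted scores.
        mean_cor = sum(r * s for r, s in zip(resp, scores)) / n_cor
        mean_inc = sum((1.0 - r) * s for r, s in zip(resp, scores)) / n_inc
        var_cor = sum(r * (s - mean_cor) ** 2 for r, s in zip(resp, scores)) / n_cor
        var_inc = sum((1.0 - r) * (s - mean_inc) ** 2 for r, s in zip(resp, scores)) / n_inc
        sd_cor = max(math.sqrt(var_cor), 1e-3)
        sd_inc = max(math.sqrt(var_inc), 1e-3)
        frac_cor = n_cor / n
    return frac_cor, (mean_inc, sd_inc), (mean_cor, sd_cor)


# Made-up best-match scores with an obvious low mode and high mode.
scores = [-7, -6, -6, -5, -5, -5, -4, -4, -3, 3, 4, 4, 5, 5, 5, 6, 6, 7]
fraction_correct, incorrect, correct = fit_two_modes(scores)
print(fraction_correct, incorrect, correct)
```

The fitted fraction plays the role of the relative size of the “correct” mode, and the two (mean, sd) pairs describe the curves drawn over the histogram.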

These two distributions can be analyzed using Bayesian statistics with this formula. Now that formula looks pretty complex, but…

It just calculates the height of the correct distribution at a particular score, divided by the summed heights of both distributions at that score.

This is essentially the probability of having that score and being correct divided by the probability of just having that score.
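The ratio described here corresponds to the two-class form of Bayes’ rule, which can be written as follows. The notation is our own and is not taken from the slide: S is the score, + and − stand for “correct” and “incorrect”, p(S | +) and p(S | −) are the shapes of the two fitted distributions, and p(+) and p(−) are the fractions of matches in the sample that are correct and incorrect.

```latex
p(+ \mid S) \;=\; \frac{p(S \mid +)\, p(+)}{p(S \mid +)\, p(+) \;+\; p(S \mid -)\, p(-)}
```

The numerator is the height of the “correct” curve at score S once it has been scaled by how many correct matches there are, and the denominator is the height of both scaled curves added together.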

This is a neat method because it actually considers the likelihood of being correct, unlike X! Tandem and Mascot, which only calculate the probability of being incorrect. It’s because of this that Peptide Prophet can produce a true probability, which is important when the sample characteristics change.

For example, the control sample we’ve been looking at was derived from Q-ToF data, which produces pretty high quality results.

Q-ToF vs. Ion Trap [two histograms; axes: # of Matches and Mascot: Ion-Identity Score; labels: “Incorrect”, “Correct”, Possibly Correct?]
If you compare that to the same sample run on an Ion Trap, the probability of being correct is greatly diminished. If you’ll note, the Incorrect distribution doesn’t change very much between the two analyses; however, the likelihood that the identification is right changes dramatically!

As Peptide Prophet considers the correct distribution, it is immune to fluctuations between samples. P-Values and E-Values don’t consider this information, so they can’t be compared across multiple samples or across different examinations of the same sample, which is why we need to use Peptide Prophet for comparing two different search engines.
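A small numerical sketch can make the contrast concrete. It assumes Gaussian “incorrect” and “correct” distributions and invented numbers for the fraction of correct matches on each instrument; none of these values come from the slides, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal curve with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def p_value(score, incorrect):
    """Chance that an incorrect match scores at least this high.

    Uses only the 'incorrect' distribution, so it ignores how many
    correct matches the sample contains.
    """
    mean, sd = incorrect
    return 0.5 * math.erfc((score - mean) / (sd * math.sqrt(2.0)))


def prob_correct(score, frac_correct, incorrect, correct):
    """Probability of being correct: scaled 'correct' height over both heights."""
    cor = frac_correct * normal_pdf(score, *correct)
    inc = (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return cor / (cor + inc)


# Hypothetical (mean, sd) pairs; the incorrect curve is the same in both runs.
INCORRECT = (-10.0, 5.0)
CORRECT = (10.0, 5.0)
score = 3.0

# Hypothetical fractions of correct matches for each instrument.
for label, frac in (("Q-ToF", 0.5), ("Ion Trap", 0.1)):
    print(label,
          "p-value:", round(p_value(score, INCORRECT), 4),
          "probability correct:", round(prob_correct(score, frac, INCORRECT, CORRECT), 3))
```

Under these assumptions the p-value is identical for both runs because the incorrect distribution is unchanged, while the probability of being correct drops when fewer of the matches in the sample are correct.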

Consider Multiple Algorithms? [scatter plot; axes: X! Tandem: -log(E-Value) and Mascot: Ion-Identity Score]
So going back to the scatter plot between X! Tandem and Mascot, we can use Peptide Prophet to compute the score threshold that represents a 95% cut-off…

Consider Multiple Algorithms? X! Tandem: 2.6=95%. Mascot: -2.5=95%. [scatter plot; axes: X! Tandem: -log(E-Value) and Mascot: Ion-Identity Score]
Like so. This allows you to fairly consider the answers from both search engines simultaneously. The important thing to note is that if you looked at a different sample, these thresholds would change depending on the height of the correct distributions.
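To see how a 95% cut-off could be located and why it moves from sample to sample, here is a minimal sketch that searches for the score at which the probability of being correct reaches a target. It assumes Gaussian distributions with equal spread (so the probability rises steadily with the score), uses invented parameters, and is not how any particular tool is confirmed to implement this.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a normal curve with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_correct(score, frac_correct, incorrect, correct):
    """Scaled height of the 'correct' curve divided by both scaled heights."""
    cor = frac_correct * normal_pdf(score, *correct)
    inc = (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return cor / (cor + inc)


def score_cutoff(target, frac_correct, incorrect, correct, lo=-30.0, hi=30.0, steps=60):
    """Bisect for the lowest score whose probability of being correct reaches target.

    Assumes the probability increases with the score between lo and hi.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if prob_correct(mid, frac_correct, incorrect, correct) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical (mean, sd) pairs and fractions of correct matches.
INCORRECT = (-10.0, 5.0)
CORRECT = (10.0, 5.0)
cutoff_many_correct = score_cutoff(0.95, 0.5, INCORRECT, CORRECT)
cutoff_few_correct = score_cutoff(0.95, 0.1, INCORRECT, CORRECT)
print(cutoff_many_correct, cutoff_few_correct)
```

With the same two curve shapes, the sample with fewer correct matches needs a higher score to reach 95%, which is the sense in which the thresholds depend on the height of the correct distribution.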

Conclusion. All search engines use different criteria, producing different scores. Using multiple search engines simultaneously yields better results. Peptide Prophet can normalize search engine results.
So in conclusion, all of the search engines look at different criteria.

And we can leverage this to identify more peptides.

And that Peptide Prophet is a great mechanism for doing that because it calculates true probabilities, instead of p-values.

The End