
1 Statistical calibration of MS/MS spectrum library search scores Barbara Frewen January 10, 2011 University of Washington

2 Protein identification
Protein mixture (B0205.7 casein kinase, C29A12.3a lig-1 DNA ligase, C29E6.1a mucin-like protein, …) → digestion to peptides → peptides (EYWDYEAHMIEWGQIDDYQLVR, GGTNIITLLDVVK, VVVFLFDLLYFNGEPLV, YQTTGQVQYSCLVR, LIVVNSEDQLR, HPLISLLLLIAFYSTSSEAFVPK, …)

3 Acquiring MS/MS spectra
Workflow: cell lysis → isolate proteins → digest to peptides → load onto column → µLC/µLC → MS → MS/MS

4 Which proteins are in my sample?
(Same overview as slide 2: a protein mixture, e.g. B0205.7 casein kinase, C29A12.3a lig-1 DNA ligase, C29E6.1a mucin-like protein, …, is digested to peptides such as EYWDYEAHMIEWGQIDDYQLVR, GGTNIITLLDVVK, VVVFLFDLLYFNGEPLV, YQTTGQVQYSCLVR, LIVVNSEDQLR, HPLISLLLLIAFYSTSSEAFVPK, …)

5 Matching a spectrum to a peptide sequence
De novo: infer the peptide sequence from the m/z of the observed peaks
Database search: compare observed peaks to predicted peaks for each peptide in a list of candidate sequences
Library search: compare observed peaks to known spectra

6 Building a spectrum library
Ideally, infuse synthesized peptides:
– ISB has gold-standard spectra from five peptides per protein in human
– University of Washington (MacCoss) will have spectra from 790 transcription factors and 350 kinases
Alternatively, use high-quality peptide-spectrum matches from shotgun proteomics experiments:
– BiblioSpec now parses search results from SEQUEST, Mascot, X! Tandem, ProteinPilot, and Scaffold

7 Library file formats
BiblioSpec binary format: compact, fast
SQLite format: flexible/extensible, accessible
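One reason an SQLite-backed library counts as "accessible" is that it can be queried with standard tools. A minimal sketch using Python's built-in sqlite3 module; the file name, table, and column names (library.blib, RefSpectra, peptideSeq, precursorMZ) are illustrative assumptions, not necessarily the actual BiblioSpec schema:

```python
import sqlite3

# Open an existing SQLite spectrum library (path is illustrative).
# Table and column names below are hypothetical placeholders.
conn = sqlite3.connect("library.blib")
rows = conn.execute(
    "SELECT peptideSeq, precursorMZ FROM RefSpectra "
    "WHERE precursorMZ BETWEEN ? AND ?",
    (593.0, 595.0),
)
for peptide, mz in rows:
    print(peptide, mz)
conn.close()
```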

8 Using a spectrum library
Spectrum identification via library searching
Resource for designing SRM-directed experiments
Compact, unified format for compiling results and sharing between labs

9 Searching a spectrum library
MS/MS query spectra are searched with BiblioSpec against a library of identified spectra (built from SEQUEST identifications; entries such as 3 NGISLTIVR, 3 QWDKEPPR, 2 FMACSDEK, 1 CGCCLYNT, 2 GDTIENFK), producing a scored peptide ID list per scan (Scan1: 0.7 EGSSDEEVP…, 0.3 TFAEILNPI…, 0.2 ARFDLNNHD…; Scan2: 0.5 EDEESIRAV…, 0.2 WLGDDCFMV…, 0.1 IDRAAWKAV…; Scan3: 0.2 EITTRDMGN…, 0.1 GRNMCTAKL…).
[Figure: a query spectrum with peaks at m/z 300.4, 522.3, 593.9, 765.1, 940.4 compared to a library entry at precursor m/z 594.2, score = 0.2]
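The slide does not spell out the similarity score itself; a common choice for spectrum-spectrum comparison is a normalized dot product over binned peak intensities. A minimal sketch, assuming unit-width m/z bins and square-root intensity scaling (both assumptions for illustration, not necessarily BiblioSpec's exact settings):

```python
import math
from collections import defaultdict

def binned(peaks, bin_width=1.0):
    """Sum square-root intensities into fixed-width m/z bins."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz / bin_width)] += math.sqrt(intensity)
    return bins

def dot_product_score(query_peaks, library_peaks):
    """Normalized dot product between two spectra (1.0 = identical binned spectra)."""
    q, l = binned(query_peaks), binned(library_peaks)
    num = sum(q[b] * l.get(b, 0.0) for b in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in l.values()))
    return num / norm if norm else 0.0

# Example: compare a query spectrum against one library entry.
query = [(300.4, 10.0), (522.3, 35.0), (593.9, 80.0), (765.1, 20.0), (940.4, 5.0)]
library_entry = [(300.5, 12.0), (522.3, 30.0), (594.0, 75.0), (940.3, 8.0)]
print(round(dot_product_score(query, library_entry), 3))
```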

10 Comparing library and database search
Created a large library of spectra from worm peptides
Identified a different set of spectra using both library and database search
Compared BiblioSpec results with SEQUEST results to evaluate performance

spectrum  score  library   SEQUEST   agree?
34        0.17   AFEQWK    LVVAMK    no (false positive)
35        0.83   DLAVER    DLAVER    yes (true positive)
36        …

11 Similarity score discriminates between correct and incorrect matches
[Figures: histogram of search scores for agreeing vs. disagreeing matches; ROC and 1% ROC curves, AUC = 0.978]

12 BiblioSpec and SEQUEST results agree
BiblioSpec found 91% of SEQUEST IDs
Two reasons BiblioSpec and SEQUEST disagree:
– Query ion not in library
– BiblioSpec found a different peptide to be more similar
Only 7% of the query spectra that were not correctly identified were actually in the library; most disagreements occurred because the correct match was not in the library.

13 Compute p-values to evaluate results
The BiblioSpec search score provides good discrimination, but it is unclear where to place a threshold between correct and incorrect matches.
Use statistical methods to estimate the probability that a match is incorrect and to estimate the fraction of incorrect matches above a score threshold.

14 How likely is the match incorrect?
The p-value is the area to the right of the observed score under the distribution of scores for a spectrum against all possible incorrect matches: a low score leaves a large area to the right (e.g., p-value = 0.4); a high score leaves a small area (e.g., p-value = 0.01).
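In practice the null distribution is represented by a finite sample of scores from incorrect (decoy) matches, so the "area to the right" becomes a simple count. A minimal sketch, with made-up numbers and the common +1 smoothing convention as an assumption:

```python
def empirical_p_value(score, null_scores):
    """Fraction of null scores at or above the observed score.

    The +1 terms keep the estimate above zero when the observed score
    exceeds every sampled null score (a common convention).
    """
    exceed = sum(1 for s in null_scores if s >= score)
    return (exceed + 1) / (len(null_scores) + 1)

null_scores = [0.12, 0.25, 0.31, 0.08, 0.44, 0.19, 0.52, 0.27]
print(empirical_p_value(0.20, null_scores))  # low score -> large p-value
print(empirical_p_value(0.83, null_scores))  # high score -> small p-value
```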

15 Estimating the null distribution
We need a representative sample of scores from incorrect matches, guaranteed to be incorrect by using decoys.
In database searching, scores from decoy peptides are used to estimate the null distribution.
How can we create decoy spectra?

16 Generate decoy spectra by shifting the m/z of the peaks
Requirements: fast to generate; sequence agnostic; representative scores
Evaluation: score distributions mimic real spectra; generate a data set of incorrect matches to real spectra
[Figure: decoy spectrum vs. real spectrum]
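A minimal sketch of what shifting the peak m/z values could look like as a circular shift: each peak is moved by a fixed offset and wrapped around within the spectrum's m/z range, so the peak count and intensity distribution of the real spectrum are preserved. The offset value and the wrapping range are illustrative assumptions, not the exact procedure used here.

```python
def circular_shift_decoy(peaks, shift, mz_min=None, mz_max=None):
    """Create a decoy spectrum by circularly shifting peak m/z values.

    peaks: list of (mz, intensity). Intensities are kept unchanged, so the
    decoy mimics the real spectrum's peak count and intensity distribution.
    """
    mzs = [mz for mz, _ in peaks]
    lo = mz_min if mz_min is not None else min(mzs)
    hi = mz_max if mz_max is not None else max(mzs)
    span = hi - lo
    decoy = []
    for mz, intensity in peaks:
        shifted = lo + (mz - lo + shift) % span  # wrap around the m/z range
        decoy.append((shifted, intensity))
    return sorted(decoy)

real = [(300.4, 10.0), (522.3, 35.0), (593.9, 80.0), (765.1, 20.0), (940.4, 5.0)]
print(circular_shift_decoy(real, shift=150.0))
```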

17 Circularly shifted peaks are similar to real spectra

18 [figure-only slide]

19 Percolator computes p-values
Semi-supervised machine learning to classify correct versus incorrect matches
Trains with high-scoring real matches vs. decoy matches
Classifies all real matches using that model
http://per-colator.com
Käll et al. 2007 Nature Methods; Käll et al. 2008 Bioinformatics

20 Evaluate p-values
Compute p-values for incorrect matches to real spectra
Percolator p-values should correspond with rank-based p-values

ID           Percolator p-value  rank  rank/n
745AF_8518   0.000230787         1     1/n
691AF_10025  0.000461467         2     2/n
691AF_10107  0.000692201         3     3/n
691AF_10301  0.000922934         4     4/n
…            …                   …     …
691AF_5048   0.001153669         12    12/n
…            …                   …     …

21 Calibrating p-values
[Figure: calculated p-values plotted against rank p-values]
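The calibration check on this and the previous slide can be sketched as follows: for n known-incorrect matches, the i-th smallest computed p-value should be close to its rank-based expectation i/n, i.e. the points fall on the diagonal. The computed p-values in the example below are made up for illustration only.

```python
def rank_based_p_values(p_values):
    """Pair each computed p-value with its rank-based expectation i/n."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [(p, (i + 1) / n) for i, p in enumerate(ordered)]

# Illustrative computed p-values for known-incorrect matches (made up);
# well-calibrated p-values should be roughly uniform, so the two columns agree.
computed = [0.18, 0.42, 0.61, 0.77, 0.95]
for p, expected in rank_based_p_values(computed):
    print(f"computed={p:.2f}  expected={expected:.2f}")
```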

22 Better discrimination with p-values
Percolator combines: search score, delta m/z, delta search score, charge, peptide length, number of candidates, copies in library
[Figure: precision (tp / (tp + fp)) vs. recall (tp / (tp + fn))]

23 Better discrimination with p-values

24 p-values distinguish between correct and incorrect matches
[Figure: precision (tp / (tp + fp)) vs. recall (tp / (tp + fn))]

25 p-values distinguish between correct and incorrect matches

26 p-values provide a universal metric for comparing to other search results
Spectra are first searched against the library; high-scoring matches go into the compiled results, while low-scoring spectra are passed on to a database search, whose high-scoring matches are added to the same compiled results.

27 Acknowledgements
MacCoss lab: Jesse Canterbury, Michael Bereman, Jarrett Egertson, Greg Finney, Eileen Heimer, Edward Hsieh, Alana Killeen, Brendan MacLean, Gennifer Merrihew, Daniela Tomazela, Mike MacCoss
Bill Noble

28 [figure-only slide]

29 Number of real matches above a fixed q-value

q-value threshold  ranked by p-value  ranked by search score
0.001              3194               1605
0.01               3450               2683
0.05               3825               3421
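The counts in this table come from ranking matches (by p-value or by raw search score), estimating a q-value for each match, and counting how many real matches pass each threshold. A minimal sketch of target-decoy q-value estimation, using the standard monotone conversion from FDR to q-values as an assumption; the exact estimator behind the numbers above may differ, and the scores below are made up.

```python
def q_values(target_scores, decoy_scores):
    """Estimate a q-value for each target score by target-decoy competition.

    FDR at a score threshold is approximated as (#decoys >= t) / (#targets >= t);
    q-values are the running minimum of FDR taken from the low-score end.
    """
    labeled = [(s, True) for s in target_scores] + [(s, False) for s in decoy_scores]
    labeled.sort(key=lambda x: x[0], reverse=True)
    targets, decoys, fdrs = 0, 0, []
    for score, is_target in labeled:
        if is_target:
            targets += 1
        else:
            decoys += 1
        fdrs.append((score, is_target, decoys / max(targets, 1)))
    best, result = 1.0, []
    for score, is_target, fdr in reversed(fdrs):
        best = min(best, fdr)
        if is_target:
            result.append((score, best))
    return result  # (score, q-value) pairs for target matches

targets = [0.95, 0.91, 0.88, 0.71, 0.55, 0.40]
decoys = [0.62, 0.45, 0.33, 0.20]
passing = [s for s, q in q_values(targets, decoys) if q <= 0.01]
print(len(passing), "target matches at q <= 0.01")
```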

30 Percolator distinguishes between correct and incorrect matches

31 Spectrum-sequence assignments

spectrum  score  library   SEQUEST   agree?
34        0.17   AFEQWK    LVVAMK    no (false positive)
35        0.83   DLAVER    DLAVER    yes (true positive)
36        …

32 Test procedure
MS/MS spectra: whole worm lysate, 4 fractionation methods, 31 MudPITs, 6,634,874 spectra
SEQUEST + DTASelect → list of spectrum-sequence pairs: 366,400 spectra, estimated 51 false positives (file / scan / sequence, e.g., run1.ms2 404 DALLQW…, run1.ms2 651 PJAMVM…, run5.ms2 924 SAITTY…, …)
BlibBuild → library with multiple spectra per peptide
BlibFilter → filtered library: 26,708 spectra, 21,264 sequences, 3,573 proteins
Query spectra: unfractionated worm, one MudPIT, 220,845 spectra; similar DTASelect criteria, 14,926 spectra, 5,358 ions
BlibSearch → peptide ID list (per-scan scored matches, as on slide 9)

33 Optimize processing parameters
Noise removal:
– a fixed number of peaks
– a fixed fraction of the total intensity
– all peaks above a defined noise level
Intensity normalization:
– log transform
– bin peaks, divide by base peak in each bin
– square root of intensity
– square root weighted by peak m/z
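A minimal sketch of two of the options listed above (keep a fixed number of the most intense peaks, then take the square root of the intensities), applied in a configurable order as in the parameter tests on slide 36. The parameter names and default values are illustrative, not the settings chosen in the evaluation.

```python
import math

def keep_top_n_peaks(peaks, n=100):
    """Noise removal: keep the n most intense peaks, returned in m/z order."""
    return sorted(sorted(peaks, key=lambda p: p[1], reverse=True)[:n])

def sqrt_normalize(peaks):
    """Intensity normalization: replace each intensity with its square root."""
    return [(mz, math.sqrt(intensity)) for mz, intensity in peaks]

def process(peaks, n=100, noise_first=True):
    """Apply noise removal and intensity normalization in either order."""
    steps = [lambda p: keep_top_n_peaks(p, n), sqrt_normalize]
    if not noise_first:
        steps.reverse()
    for step in steps:
        peaks = step(peaks)
    return peaks

spectrum = [(300.4, 10.0), (522.3, 35.0), (593.9, 80.0), (765.1, 20.0), (940.4, 5.0)]
print(process(spectrum, n=3, noise_first=True))
```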

34 Uses of Spectrum Libraries
A basis for spectrum identification via spectrum-spectrum searches
A reference for designing SRM experiments (Skyline)
A repository for spectrum identifications: a unified format for consolidating results and sharing with other labs

35 Spectrum shuffling techniques
Blindly shuffle peaks
Shuffle blocks of peaks
Shift peaks circularly
Identify fragment ions from peptides, shuffle the sequence, and move the peaks accordingly

36 Parameter Test Results
Intensity adjustments: BIN = bin peaks, divide by max per bin; MZ = weight peak intensity by m/z; SQ = square root of intensity
Noise reduction: TOPN n = top n peaks used; HALF = top 50% of peak intensity
Processing order: N = noise first; I = intensity first

Intensity  Noise     Order  Score
MZ         TOPN 50   I      0.9918
MZ         TOPN 100  N      0.9915
MZ         HALF      I      0.9887
MZ         TOPN 200  N      0.9882
BIN        TOPN 100  N      0.9881
MZ         TOPN 100  I      0.9873
MZ         TOPN 200  I      0.9861
MZ         TOPN 50   N      0.9859
MZ         TOPN 300  N      0.9856
BIN        TOPN 200  N      0.9853
MZ         TOPN 300  I      0.9838
BIN        TOPN 50   I      0.9825
BIN        HALF      I      0.9811
SQ         TOPN 50   N      0.9807
BIN        TOPN 100  I      0.9803
BIN        TOPN 300  I      0.9788
SQ         TOPN 100  N      0.9787
BIN        TOPN 200  I      0.9777
BIN        TOPN 50   N      0.9769
BIN        TOPN 300  N      0.9766
SQ         TOPN 300  N      0.9761
SQ         HALF      I      0.9756
SQ         TOPN 200  N      0.9751
BIN        HALF      N      0.9635
MZ         HALF      N      0.9465
SQ         HALF      N      0.9442

