Presentation is loading. Please wait.

Presentation is loading. Please wait.

Protein Identification by Sequence Database Search Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center.

Similar presentations


Presentation on theme: "Protein Identification by Sequence Database Search Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center."— Presentation transcript:

1 Protein Identification by Sequence Database Search Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center

2 2 Peptide Mass Fingerprint Cut out 2D-Gel Spot

3 3 Peptide Mass Fingerprint Trypsin Digest

4 4 Peptide Mass Fingerprint MS

5 5 Peptide Mass Fingerprint

6 6 Trypsin: digestion enzyme Highly specific Cuts after K & R except if followed by P Protein sequence from sequence database In silico digest Mass computation For each protein sequence in turn: Compare computer generated masses with observed spectrum

7 7 Protein Sequence Myoglobin GLSDGEWQQV LNVWGKVEAD IAGHGQEVLI RLFTGHPETL EKFDKFKHLK TEAEMKASED LKKHGTVVLT ALGGILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISDA IIHVLHSKHP GDFGADAQGA MTKALELFRN DIAAKYKELG FQG

8 8 Protein Sequence Myoglobin GLSDGEWQQV LNVWGKVEAD IAGHGQEVLI RLFTGHPETL EKFDKFKHLK TEAEMKASED LKKHGTVVLT ALGGILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISDA IIHVLHSKHP GDFGADAQGA MTKALELFRN DIAAKYKELG FQG

9 9 Amino-Acid Masses Amino-AcidResidual MWAmino-AcidResidual MW AAlanine71.03712MMethionine131.04049 CCysteine103.00919NAsparagine114.04293 DAspartic acid115.02695PProline97.05277 EGlutamic acid129.04260QGlutamine128.05858 FPhenylalanine147.06842RArginine156.10112 GGlycine57.02147SSerine87.03203 HHistidine137.05891TThreonine101.04768 IIsoleucine113.08407VValine99.06842 KLysine128.09497WTryptophan186.07932 LLeucine113.08407YTyrosine163.06333

10 10 Peptide Mass & m/z Peptide Molecular Weight: N-terminal-mass (0.00) + Sum (AA masses) + C-terminal-mass (18.010560) Observed Peptide m/z: (Peptide Molecular Weight + z * Proton-mass (1.007825)) / z Monoisotopic mass values!

11 11 Peptide Masses 1811.90 GLSDGEWQQVLNVWGK 1606.85 VEADIAGHGQEVLIR 1271.66 LFTGHPETLEK 1378.83 HGTVVLTALGGILK 1982.05 KGHHEAELKPLAQSHATK 1853.95 GHHEAELKPLAQSHATK 1884.01 YLEFISDAIIHVLHSK 1502.66 HPGDFGADAQGAMTK 748.43 ALELFR

12 12 Peptide Mass Fingerprint GLSDGEWQQVLNVWGK VEADIAGHGQEVLIR LFTGHPETLEK HGTVVLTALGGILK KGHHEAELKPLAQSHATK GHHEAELKPLAQSHATK YLEFISDAIIHVLHSK HPGDFGADAQGAMTK ALELFR

13 13 Sample Preparation for Tandem Mass Spectrometry Enzymatic Digest and Fractionation

14 14 Single Stage MS MS

15 15 Tandem Mass Spectrometry (MS/MS) MS/MS

16 16 Peptide Fragmentation H…-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 AA residue i-1 AA residue i AA residue i+1 N-terminus C-terminus Peptides consist of amino-acids arranged in a linear backbone.

17 17 Peptide Fragmentation

18 18 Peptide Fragmentation -HN-CH-CO-NH-CH-CO-NH- RiRi R i+1 bibi y n-i y n-i-1 b i+1

19 19 Peptide Fragmentation -HN-CH-CO-NH-CH-CO-NH- RiRi CH-R’ bibi y n-i y n-i-1 b i+1 R” i+1 aiai x n-i cici z n-i

20 20 Peptide Fragmentation Peptide: S-G-F-L-E-E-D-E-L-K MWion MW 88b1b1 S GFLEEDELKy9y9 1080 145b2b2 SG FLEEDELKy8y8 1022 292b3b3 SGF LEEDELKy7y7 875 405b4b4 SGFL EEDELKy6y6 762 534b5b5 SGFLE EDELKy5y5 633 663b6b6 SGFLEE DELKy4y4 504 778b7b7 SGFLEED ELKy3y3 389 907b8b8 SGFLEEDE LKy2y2 260 1020b9b9 SGFLEEDEL Ky1y1 147

21 21 Peptide Fragmentation

22 22 Peptide Identification Given: The mass of the precursor ion, and The MS/MS spectrum Output: The amino-acid sequence of the peptide

23 23 Sequence Database Search

24 24 Sequence Database Search

25 25 Sequence Database Search

26 26 Sequence Database Search No need for complete ladders Possible to model all known peptide fragments Sequence permutations eliminated All candidates have some biological relevance Practical for high-throughput peptide identification Correct peptide might be missing from database!

27 27 Peptide Candidate Filtering Digestion Enzyme: Trypsin Cuts just after K or R unless followed by a P. Basic residues (K & R) at C-terminal attract ionizing charge, leading to strong y-ions “Average” peptide length about 10-15 amino-acids Must allow for “missed” cleavage sites

28 28 Peptide Candidate Filtering >ALBU_HUMAN MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKAL VLIAFAQYLQQCPFEDHVKLVNEVTEFAK… No missed cleavage sites MK WVTFISLLFLFSSAYSR GVFR R DAHK SEVAHR FK DLGEENFK ALVLIAFAQYLQQCPFEDHVK LVNEVTEFAK …

29 29 Peptide Candidate Filtering >ALBU_HUMAN MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKAL VLIAFAQYLQQCPFEDHVKLVNEVTEFAK… One missed cleavage site MKWVTFISLLFLFSSAYSR WVTFISLLFLFSSAYSRGVFR GVFRR RDAHK DAHKSEVAHR SEVAHRFK FKDLGEENFK DLGEENFKALVLIAFAQYLQQCPFEDHVK ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK …

30 30 Peptide Candidate Filtering Peptide molecular weight Only have m/z value Need to determine charge state Ion selection tolerance Mass for each amino-acid symbol? Monoisotopic vs. Average “Default” residual mass Depends on sample preparation protocol Cysteine almost always modified

31 31 Peptide Molecular Weight Same peptide, i = # of C 13 isotope i=0 i=1 i=2 i=3 i=4

32 32 Peptide Molecular Weight …from “Isotopes” – An IonSource.Com Tutorial

33 33 Peptide Molecular Weight Peptide sequence WVTFISLLFLFSSAYSR Potential phosphorylation? S,T,Y + 80 Da WVTFISLLFLFSSAYSR2018.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2098.06 WVTFISLLFLFSSAYSR2178.06 WVTFISLLFLFSSAYSR2178.06 …… WVTFISLLFLFSSAYSR2418.06 - 7 Molecular Weights - 64 “Peptides”

34 34 Peptide Scoring Peptide fragments vary based on The instrument The peptide’s amino-acid sequence The peptide’s charge state Etc… Search engines model peptide fragmentation to various degrees. Speed vs. sensitivity tradeoff y-ions & b-ions occur most frequently The scores have no apriority “scale”

35 35 Peptide Identification High-throughput workflows demand we analyze all spectra, all the time. Spectra may not contain enough information to be interpreted correctly...cell phone call drops in and out Spectra may contain too many irrelevant peaks …bad static Peptides may not match our assumptions …its all Greek to me “Don’t know” is an acceptable answer!

36 36 Peptide Identification Rank the best peptide identifications Is the top ranked peptide correct?

37 37 Peptide Identification Rank the best peptide identifications Is the top ranked peptide correct?

38 38 Peptide Identification Rank the best peptide identifications Is the top ranked peptide correct?

39 39 Peptide Identification Incorrect peptide has best score Correct peptide is missing? Potential for incorrect conclusion What score ensures no incorrect peptides? Correct peptide has weak score Insufficient fragmentation, poor score Potential for weakened conclusion What score ensures we find all correct peptides?

40 40 Statistical Significance Can’t prove particular identifications are right or wrong......need to know fragmentation in advance! A minimal standard for identification scores......better than guessing. p-value, E-value, statistical significance

41 41 Random Peptide Models "Generate" random peptides Real looking fragment masses No theoretical model! Must use empirical distribution Usually require they have the correct precursor mass Score function can model anything we like!

42 42 Random Peptide Models Fenyo & Beavis, Anal. Chem., 2003

43 43 Random Peptide Models Fenyo & Beavis, Anal. Chem., 2003

44 44 Random Peptide Models Truly random peptides don’t look much like real peptides Just use (incorrect) peptides from the sequence database! Caveats: Correct peptide (non-random) may be included Homologous incorrect peptides may be included (Incorrect) peptides are not independent

45 45 Extrapolating from the Empirical Distribution Often, the empirical shape is consistent with a theoretical model Geer et al., J. Proteome Research, 2004 Fenyo & Beavis, Anal. Chem., 2003

46 46 False Positive Rate Estimation A form of statistical significance Search engine independent Easy to implement Assumes a single threshold for all spectra Best if E-value or similar is used to compute a spectrum normalized score

47 47 False Positive Rate Estimation Each spectrum is a chance to be right, wrong, or inconclusive. At any given threshold, how many peptide identifications are wrong? Computed for an entire spectral dataset Given identification criteria: SEQUEST Xcorr, E-value, Score, etc., plus......threshold Use “decoy” sequences random, reverse, cross-species Identifications must be incorrect!

48 48 Decoy Search Strategies Concatenated target & decoy “Competition” for best hit... Masks good decoy scores due to spectral variation Separate searches Cleaner estimation of false hit distribution More conservative than concatenation Must ensure: Decoy searches do not change target peptide scores Single score distribution across dataset

49 49 Decoy Search Strategies Reversed Decoys Captures redundancy of peptide sequences Susceptible to mass-shift anomalies Bad choice for protein-level statistics Shuffled & Random Decoys Multiple independent decoys can be created. Better estimation of tail probabilities More conservative than reversed decoys

50 50 False Positive Rate Estimation: Concatenated Target & Decoy 1.Choose a threshold t. 2.Count # of (rank 1) target ids (T t ) with score ≥ t. 3.Count # of (rank 1) decoy ids (D t ) with score ≥ t. 4.Compute FPR = ( 2 x D t ) / ( T t + D t ) Principle: Decoy peptides equally likely as false hits at rank 1 Issues: What to do with decoy hits? Change in database size may affect scores

51 51 False Positive Rate Estimation: Separate Decoy Search 1.Choose a threshold t. 2.Count # of (rank 1) target ids (T t ) with score ≥ t. 3.Count # of (rank 1) decoy ids (D t ) with score ≥ t. 4.Compute FPR = D t / T t Principle: Find the distribution of false hit scores, apply to target Issues: Can choose to merge after the fact... Decoy search cannot change target scores A few good decoy scores can inflate small FDR values

52 52 Peptide Prophet Re-analysis of SEQUEST results Spectrum dependant scores (XCorr) + Additional features form discriminant score Assumes that many of the spectra are not correctly identified These identifications act like decoy hits

53 53 Peptide Prophet Distribution of spectral scores in the results Keller et al., Anal. Chem. 2002

54 54 Peptides to Proteins Nesvizhskii et al., Anal. Chem. 2003

55 55 Peptides to Proteins

56 56 Peptides to Proteins A peptide sequence may occur in many different protein sequences Variants, paralogues, protein families Separation, digestion and ionization is not well understood Proteins in sequence database are extremely non-random, and very dependent No great tools for assessing statistical confidence of protein identifications.

57 57 Mascot MS/MS Ions Search

58 58 Mascot MS/MS Search Results

59 59 Mascot MS/MS Search Results

60 60 Mascot MS/MS Search Results

61 61 Mascot MS/MS Search Results

62 62 Mascot MS/MS Search Results

63 63 Mascot MS/MS Search Results

64 64 Mascot MS/MS Search Results

65 65 Sequence Database Search Traps and Pitfalls Search options may eliminate the correct peptide Precursor mass tolerance too small Fragment m/z tolerance too small Incorrect precursor ion charge state Non-tryptic or semi-tryptic peptide Incorrect or unexpected modification Sequence database too conservative Unreliable taxonomy annotation

66 66 Sequence Database Search Traps and Pitfalls Search options can cause infinite search times Variable modifications increase search times exponentially Non-tryptic search increases search time by two orders of magnitude Large sequence databases contain many irrelevant peptide candidates

67 67 Sequence Database Search Traps and Pitfalls Best available peptide isn’t necessarily correct! Score statistics (e-values) are essential! What is the chance a peptide could score this well by chance alone? The wrong peptide can look correct if the right peptide is missing! Need scores (or e-values) that are invariant to spectrum quality and peptide properties

68 68 Sequence Database Search Traps and Pitfalls Search engines often make incorrect assumptions about sample prep Proteins with lots of identified peptides are not more likely to be present Peptide identifications do not represent independent observations All proteins are not equally interesting to report

69 69 Sequence Database Search Traps and Pitfalls Good spectral processing can make a big difference Poorly calibrated spectra require large m/z tolerances Poorly baselined spectra make small peaks hard to believe Poorly de-isotoped spectra have extra peaks and misleading charge state assignments

70 70 Summary Protein identification from tandem mass spectra is a key proteomics technology. Protein identifications should be treated with healthy skepticism. Look at all the evidence! Spectra remain unidentified for a variety of reasons.

71 71 Further Reading Matrix Science (Mascot) Web Site www.matrixscience.com Seattle Proteome Center (ISB) www.proteomecenter.org Proteomic Mass Spectrometry Lab at The Scripps Research Institute fields.scripps.edu UCSF ProteinProspector prospector.ucsf.edu


Download ppt "Protein Identification by Sequence Database Search Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center."

Similar presentations


Ads by Google