1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)

2 Outline Motivation of proteomics Mass spectrometry-based proteomics Instrumentation of mass spectrometry De novo sequencing algorithm Database search Algorithms of real software (e.g., sequence tags)

3 Motivation Proteins are working units of the cells –The number of found genes is much less than the number of expressed proteins –Directly related with cell processes and diseases >1,000,000 distinct protein forms ~30,000 human genes DNA Alternative splicing mRNA Protein Post-translational Modification >100,000 RNA messages SNP

4 Tools for Proteomics Edman degradation reaction NMR (Nuclear Magnetic Resonance) X-ray crystallography Protein array Mass Spectrometry

5 Mass Spectrometry-based Proteomics Primary sequence (sequencing, identification) Post-translational modification (PTM) (characterization) Quantitative proteomics (quantification) Protein-protein interaction

7 Components of Mass Spectrometer Ion source (ESI and MALDI) Mass analyzer (ion traps, TOF, Quadrupole, FT, etc.) –Mass-to-charge ratio (m/z) Ion detector

8 Peptide and Intact Protein Peptide: a fragment of protein Some enzymes, e.g. trypsin, break protein into peptides. Some technology put intact protein into the mass spectrometer

9 Peptide Fragmentation Peptides tend to fragment along the backbone. Fragments can also loose neutral chemical groups like NH 3 and H 2 O. H...-HN-CH-CO... NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 H+H+ N-TerminusC-Terminus Collision Induced Dissociation

10 Ideal Mass Spectrum

11 Real Mass Spectrum

12 N- and C-terminal Peptides N-terminal peptides C-terminal peptides

13 Terminal peptides and ion types Peptide Mass (D) 57 + 97 + 147 + 114 = 415 Peptide Mass (D) 57 + 97 + 147 + 114 – 18 = 397 without

14 N- and C-terminal Peptides N-terminal peptides C-terminal peptides 415 486 301 154 57 71 185 332 429

15 N- and C-terminal Peptides N-terminal peptides C-terminal peptides 415 486 301 154 57 71 185 332 429

16 N- and C-terminal Peptides 415 486 301 154 57 71 185 332 429

17 N- and C-terminal Peptides 415 486 301 154 57 71 185 332 429 Problem: Reconstruct peptide from the set of masses of fragment

18 Mass Spectra GVDLK mass 0 57 Da = ‘G’ 99 Da = ‘V’ L K DVG The peaks in the mass spectrum: –Prefix –Fragments with neutral losses (-H 2 O, -NH 3 ) –Noise and missing peaks. and Suffix Fragments. D H2OH2O

19 Protein Identification with MS/MS GVDLK mass 0 Intensity mass 0 MS/MS Peptide Identification:

20 Protein Identification by Tandem Mass Spectrometry Sequence MS/MS instrument De Novo interpretation Sherenga Database search Sequest

21 De Novo vs. Database Search W R A C V G E K D W L P T L T W R A C V G E K D W L P T L T De Novo AVGELTK Database Search Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. Mass, Score

22 Pros and Cons of de novo Sequencing Advantage: –Gets the sequences that are not necessarily in the database. –An additional similarity search step using these sequences may identify the related proteins in the database. Disadvantage: –Requires higher quality data. –Often contains errors.

23 Current Status It is still a open problem of protein sequencing no matter whether using de novo sequencing or database search methods Following algorithms only deal with simplified (or ideal) spectrums Some algorithms combine de novo sequencing and database search

24 Outline Motivation of proteomics Mass spectrometry-based proteomics Instrumentation of mass spectrometry De novo sequencing Database search Algorithms of real software (e.g., sequence tags)

25 De novo Peptide Sequencing Sequence

26 Peptide Sequencing Problem Goal: Find a peptide with maximal match between an experimental and theoretical spectrum. Input: –S: experimental spectrum –Δ : set of possible ion types –m: parent mass Output: –P: peptide with mass m, whose theoretical spectrum matches the experimental S spectrum the best

27 Procedure of De Novo Sequencing Build spectrum graph –How to create vertices (from masses) –How to create edges (from mass differences) Find best path or rank paths of spectrum graph –How to find candidate paths –How to score paths

28 S E Q U E N C E b Mass/Charge (M/Z) From Sequence to Spectrum

29 a Mass/Charge (M/Z) S E Q U E N C E From Sequence to Spectrum (cont.)

30 S E Q U E N C E Mass/Charge (M/Z) a is an ion type shift in b From Sequence to Spectrum (cont.)

31 y Mass/Charge (M/Z) E C N E U Q E S From Sequence to Spectrum (cont.)

32 Mass/Charge (M/Z) Intensity From Sequence to Spectrum (cont.)

33 Mass/Charge (M/Z) Intensity From Sequence to Spectrum (cont.)

34 noise Mass/Charge (M/Z) From Sequence to Spectrum (cont.)

35 MS/MS Spectrum Mass/Charge (M/z) Intensity

36 Some Mass Differences between Peaks Correspond to Amino Acids s s s e e e e e e e e q q q u u u n n n e c c c

37 Now decoding from spectrum to sequence…? Build spectrum graph

38 Vertices of Spectrum Graph Vertices are generated by reverse shifts corresponding to ion types Δ={δ 1, δ 2,…, δ k } Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ 1, s+δ 2, …, s+δ k } corresponding to potential N-terminal peptides Vertices of the spectrum graph: {initial vertex}  V(s 1 )  V(s 2 ) ...  V(s m )  {terminal vertex}

39 Reverse Shifts Shift in H 2 O+NH 3 Shift in H 2 O

40 Edges of Spectrum Graph Two vertices with mass difference corresponding to an amino acid A: –Connect with an edge labeled by A Gap edges for di- and tri-peptides – Potential sequence tag method (covered later)

41 Best Path of Spectrum Graph How to find candidate paths There are many paths, how to find the correct one? We need scoring to evaluate paths

42 Find Candidate Paths Heuristics: find a path with maximum number of edges Longest path problem in DAG DFS (Depth First Search)

43 Path Score p(P,S) = probability that peptide P produces spectrum S= {s 1,s 2,…s q } p(P, s) = the probability that peptide P generates a peak s Scoring = computing probabilities

44 Finding Optimal Paths in the Spectrum Graph For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P: Peptides = paths in the spectrum graph P’ = the optimal path in the spectrum graph Some software rank paths

45 Ions and Probabilities A peptide has all k peaks with probability and no peaks with probability A peptide also produces a ``random noise'' with uniform probability q R in any position.

46 Ratio Test Scoring for Partial Peptides Incorporates premiums for observed ions and penalties for missing ions. Example: for k=4, assume that for a partial peptide P’ we only see ions δ 1,δ 2,δ 4. The score is calculated as:

47 Why Not Sequence De Novo? De novo sequencing is still not very accurate! Less than 30% of the peptides sequenced were completely correct! Algorithm Amino Acid Accuracy Whole Peptide Accuracy Lutefisk (Taylor and Johnson, 1997). 0.5660.189 SHERENGA (Dancik et. al., 1999). 0.6900.289 Peaks (Ma et al., 2003). 0.6730.246 PepNovo (Frank and Pevzner, 2005). 0.7270.296

48 Thank you ! The End

49 De Novo vs. Database Search W R A C V G E K D W L P T L T W R A C V G E K D W L P T L T De Novo AVGELTK Database Search Database of known peptides MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

50 De Novo vs. Database Search: A Paradox de novo algorithms are much faster, even though their search space is much larger! A database search scans all peptides in the search space to find best one. De novo eliminates the need to scan all peptides by modeling the problem as a graph search. Why not sequence de novo?

51 Outline Motivation of proteomics Mass spectrometry-based proteomics Instrumentation: Mass Spectrometry De novo sequencing algorithm Database search Algorithms of real software (e.g., sequence tags)

52 Peptide Identification Problem Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum. Input: –S: experimental spectrum –database of peptides –Δ : set of possible ion types –m: parent mass Output: –A peptide of mass m from the database whose theoretical spectrum matches the experimental S spectrum the best

53 MS/MS Database Search Database search in mass-spectrometry has been very successful in identification of already known proteins. Experimental spectrum can be compared with theoretical spectra of database peptides to find the best fit. SEQUEST (Yates et al., 1995) But reliable algorithms for identification of modified peptides is a much more difficult problem.

54 Post-Translational Modifications Proteins are involved in cellular signaling and metabolic regulation. They are subject to a large number of biological modifications. Almost all protein sequences are post- translationally modified and 200 types of modifications of amino acid residues are known.

55 Examples of Post-Translational Modification Post-translational modifications increase the number of “letters” in amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.

56 Search for Modified Peptides: Virtual Database Approach Yates et al.,1995: an exhaustive search in a virtual database of all modified peptides. Exhaustive search leads to a large combinatorial problem, even for a small set of modifications types. Problem (Yates et al.,1995). Extend the virtual database approach to a large set of modifications.

57 Exhaustive Search for Modified Peptides YFDSTDYNMAK 2 5 =32 possibilities, with 2 types of modifications! Phosphorylation? Oxidation? For each peptide, generate all modifications. Score each modification.

58 Modified Peptide Identification Problem Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum. Input: –S: experimental spectrum –database of peptides –Δ : set of possible ion types –m: parent mass –Parameter k (# of mutations/modifications) Output: –A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum matches the experimental S spectrum the best

59 Peptide Identification Problem: Challenge Very similar peptides may have very different spectra! Goal: Define a notion of spectral similarity that correlates well with the sequence similarity. If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

60 Spectrum Alignment See 8.14 and 8.15 in the text book for one algorithm Complicated for real spectrums

61 Quality Measure of Mass Spectrometer Sensitivity Mass accuracy Resolution Dynamic range

62 Ion Types Some masses correspond to fragment ions, others are just random noise Knowing ion types Δ={δ 1, δ 2,…, δ k } lets us distinguish fragment ions from noise We can learn ion types δ i and their probabilities q i by analyzing a large test sample of annotated spectra.

63 Reverse Shifts Two peaks b-H 2 O and b are given by the Mass Spectrum With a +H 2 O shift, if two peaks coincide that is a possible vertex. Mass/Charge (M/Z) Intensity Red: Mass Spectrum Blue: shift (+H 2 O) b/b-H 2 O+H 2 O b-H 2 O b+H 2 O

64 Sequencing of Modified Peptides De novo peptide sequencing is invaluable for identification of unknown proteins: However, de novo algorithms are designed for working with high quality spectra with good fragmentation and without modifications. Another approach is to compare a spectrum against a set of known spectra in a database.

65 Database Search: Sequence Analysis vs. MS/MS Analysis Sequence analysis: similar peptides (that a few mutations apart) have similar sequences MS/MS analysis: similar peptides (that a few mutations apart) have dissimilar spectra

66 Deficiency of the Shared Peaks Count Shared peaks count (SPC): intuitive measure of spectral similarity. Problem: SPC diminishes very quickly as the number of mutations increases. Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.

67 Ions and Probabilities Tandem mass spectrometry is characterized by a set of ion types {δ 1,δ 2,..,δ k } and their probabilities {q 1,...,q k } δ i -ions of a partial peptide are produced independently with probabilities q i

68 De Novo vs. Database Search: The database of all peptides is huge ≈ O(20 n ). The database of all known peptides is much smaller ≈ O(10 8 ). However, de novo algorithms can be much faster, even though their search space is much larger! A database search scans all peptides in the database of all known peptides search space to find best one. De novo eliminates the need to scan database of all peptides by modeling the problem as a graph search.

69 Probabilistic Model For a position t δj  T i the probability p(t, P,S) that peptide P produces a peak at position t. Similarly, for t  R, the probability that P produces a random noise peak at t is:

70 Probabilistic Score For a peptide P with n amino acids, the score for the whole peptides is expressed by the following ratio test:

71 For a position t that represents ion type d j : q j, if peak is generated at t p(P,s t ) = 1-q j, otherwise Peak Score

72 Peak Score (cont.) For a position t that is not associated with an ion type: q R, if peak is generated at t p R (P,s t ) = 1-q R, otherwise q R = the probability of a noisy peak that does not correspond to any ion type

1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)

Similar presentations

Presentation on theme: "1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)

Similar presentations

Presentation on theme: "1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)"— Presentation transcript:

Similar presentations

About project

Feedback