Wi’07Bafna Proteomics via Mass Spectrometry (a bioinformatics perspective) Vineet Bafna www.cse.ucsd.edu/~vbafna.

Slides:



Advertisements
Similar presentations
Protein Quantitation II: Multiple Reaction Monitoring
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
In-depth Analysis of Protein Amino Acid Sequence and PTMs with High-resolution Mass Spectrometry Lian Yang 2 ; Baozhen Shan 1 ; Bin Ma 2 1 Bioinformatics.
MN-B-C 2 Analysis of High Dimensional (-omics) Data Kay Hofmann – Protein Evolution Group Week 5: Proteomics.
How to identify peptides October 2013 Gustavo de Souza IMM, OUS.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
CSE182 CSE182-L12 Mass Spectrometry Peptide identification.
CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
Protein Sequencing and Identification by Mass Spectrometry.
Fa 05CSE182 CSE182-L7 Protein sequencing and Mass Spectrometry.
Heuristic alignment algorithms and cost matrices
Peptide Identification by Tandem Mass Spectrometry Behshad Behzadi April 2005.
Data Processing Algorithms for Analysis of High Resolution MSMS Spectra of Peptides with Complex Patterns of Posttranslational Modifications Shenheng Guan.
Fa 06CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.
Mass Spectrometry Peptide identification
The restriction mapping problem revisited Gopal Pandurangan and H. Ramesh Journal of Computer and System Sciences 526~544(2002)
Fa 06CSE182 CSE182-L11 Protein sequencing and Mass Spectrometry.
Fa 05CSE182 CSE182-L8 Mass Spectrometry. Fa 05CSE182 Bio. quiz What is a gene? What is a transcript? What is translation? What are microarrays? What is.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
CSE182-L10 MS Spec Applications + Gene Finding + Projects.
Proteomics Informatics – Protein identification II: search engines and protein sequence databases (Week 5)
Previous Lecture: Regression and Correlation
Mass Spectrometry. What are mass spectrometers? They are analytical tools used to measure the molecular weight of a sample. Accuracy – 0.01 % of the total.
My contact details and information about submitting samples for MS
CSE182-L12 Mass Spectrometry Peptide identification CSE182.
1 Mass Spectrometry-based Proteomics Xuehua Shen (Adapted from slides with textbook)
Proteomics Josh Leung Biology 1220 April 13 th, 2010.
Fa 05CSE182 CSE182-L9 Mass Spectrometry Quantitation and other applications.
CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
Protein sequencing and Mass Spectrometry. Sample Preparation Enzymatic Digestion (Trypsin) + Fractionation.
Proteome.
Tryptic digestion Proteomics Workflow for Gel-based and LC-coupled Mass Spectrometry Protein or peptide pre-fractionation is a prerequisite for the reduction.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
The dynamic nature of the proteome
PROTEIN STRUCTURE NAME: ANUSHA. INTRODUCTION Frederick Sanger was awarded his first Nobel Prize for determining the amino acid sequence of insulin, the.
PROTEIN QUANTIFICATION AND PTM JUN SIN HSS.I. PROJECT 1.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
INF380 - Proteomics-91 INF380 – Proteomics Chapter 9 – Identification and characterization by MS/MS The MS/MS identification problem can be formulated.
Generating Peptide Candidates from Protein Sequence Databases for Protein Identification via Mass Spectrometry Nathan Edwards Informatics Research.
Common parameters At the beginning one need to set up the parameters.
1 Chemical Analysis by Mass Spectrometry. 2 All chemical substances are combinations of atoms. Atoms of different elements have different masses (H =
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
Laxman Yetukuri T : Modeling of Proteomics Data
INF380 - Proteomics-101 INF380 – Proteomics Chapter 10 – Spectral Comparison Spectral comparison means that an experimental spectrum is compared to theoretical.
Proteomics The science of proteomics Applications of proteomics Proteomic methods a. protein purification b. protein sequencing c. mass spectrometry.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Protein Identification via Database searching Attila Kertész-Farkas Protein Structure and Bioinformatics Group, ICGEB, Trieste.
CSE182 CSE182-L12 Mass Spectrometry Peptide identification.
INF380 - Proteomics-71 INF380 – Proteomics Chap 7 –Protein Identification and Characterization by MS Protein identification in our context means that we.
CSE182 CSE182-L11 Protein sequencing and Mass Spectrometry.
Peptide Identification via Tandem Mass Spectrometry Sorin Istrail.
Multiple flavors of mass analyzers Single MS (peptide fingerprinting): Identifies m/z of peptide only Peptide id’d by comparison to database, of predicted.
EBI is an Outstation of the European Molecular Biology Laboratory. In silico analysis of accurate proteomics, complemented by selective isolation of peptides.
Error tolerant search Large number of spectra remain without significant score. Reasonable number of fragment ion peaks might have not match. – Underestimated.
Tag-based Blind Identification of PTMs with Point Process Model 1 Chunmei Liu, 2 Bo Yan, 1 Yinglei Song, 2 Ying Xu, 1 Liming Cai 1 Dept. of Computer Science.
De Novo Peptide Sequencing via Probabilistic Network Modeling PepNovo.
Salamanca, March 16th 2010 Participants: Laboratori de Proteomica-HUVH Servicio de Proteómica-CNB-CSIC Participants: Laboratori de Proteomica-HUVH Servicio.
ISA Kim Hye mi. Introduction Input Spectrum data (Protein database) Peptide assignment Peptide validation manual validation PeptideProphet.
Using Scaffold OHRI Proteomics Core Facility. This presentation is intended for Core Facility internal training purposes only.
김지형. Introduction precursor peptides are dynamically selected for fragmentation with exclusion to prevent repetitive acquisition of MS/MS spectra.
Quantitation using Pseudo-Isobaric Tags (QuPIT) and Quantitation using Pseudo-isobaric Amino acids in Cell culture (QuPAC) Parimal Samir Andrew J. Link.
MassMatrix Search Results Explained
Protein Identification via Database searching
Protein/Peptide Quantification
Proteomics Informatics David Fenyő
A perspective on proteomics in cell biology
NoDupe algorithm to detect and group similar mass spectra.
Proteomics Informatics David Fenyő
Kuen-Pin Wu Institute of Information Science Academia Sinica
Presentation transcript:

Wi’07Bafna Proteomics via Mass Spectrometry (a bioinformatics perspective) Vineet Bafna

Wi’07Bafna Nobel Citation 2002

Wi’07Bafna Nobel Citation, 2002

Wi’07Bafna Proteomics via MS Enzymatic Digestion (Trypsin) + Fractionation Q: Sufficient to identify peptides?

Wi’07Bafna Peptide MS Instrument software usually detects peaks, and computes features (peak, area, m/z…) m/z

Wi’07Bafna Single Stage MS Mass Spectrometry

Wi’07Bafna MS versus Micro-array cDNA sample Protein/Peptide? sample Unlike micro-array, peptide id is not trivial at the end of the MS experiment! Identification is an important part of pre- processing

Wi’07Bafna MS based proteomics Identification – Identify all the proteins in the proteome, specific organelles, specific pathways, complexes… Quantitation – Is a protein differentially-expressed in certain conditions? Others – Protein 3D structure, protein protein interactions,… We will consider an informatics-centered perspective

Wi’07Bafna Protein Identification The preferred mode is through tandem mass spectrometry of peptides. Is identifying peptides sufficient? Rough probability for co-occurrence of a 15-aa peptide? With higher accuracy instruments, it may be possible to do intact proteins as well.

Wi’07Bafna Tandem MS of peptides Secondary Fragmentation Ionized parent peptide

Wi’07Bafna The peptide backbone H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 AA residue i-1 AA residue i AA residue i+1 N-terminus C-terminus The peptide backbone breaks to form fragments with characteristic masses.

Wi’07Bafna Ionization The peptide backbone breaks to form fragments with characteristic masses. Ionized parent peptide H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 AA residue i-1 AA residue i AA residue i+1 N-terminus C-terminus H+H+

Wi’07Bafna Fragment ion generation H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 AA residue i-1 AA residue i AA residue i+1 N-terminus C-terminus The peptide backbone breaks to form fragments with characteristic masses. Ionized peptide fragment H+H+

Wi’07Bafna Tandem MS for Peptide ID 147 K 1166 L E D E E L F G S y ions b ions [M+2H] 2+ m/z % Intensity

Wi’07Bafna Peak Assignment 147 K 1166 L E D E E L F G S y ions b ions y2y2 y3y3 y4y4 y5y5 y6y6 y7y7 b3b3 b4b4 b5b5 b8b8 b9b9 [M+2H] 2+ b6b6 b7b7 y9y9 y8y8 m/z % Intensity Peak assignment implies Sequence (Residue tag) Reconstruction!

Wi’07Bafna Ion types, and offsets P = prefix residue mass S = Suffix residue mass b-ions = P+1 – (NH2-CHR-CO-..-NH-CHR-CO(+)) y-ions = S+19 – (NH3(+)-CHR-CO-..NH-CHR-COOH) a-ions = P-27, and so on.. H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH R i-1 RiRi R i+1 AA residue i-1 AA residue i AA residue i+1 N-terminus C-terminus H+H+

Wi’07Bafna MS Quiz: Why aren’t all tandem MS peaks of the same intensity? Do the intensities for a peptide vary from spectrum to spectrum?

Wi’07Bafna Database Searching for peptide ID For every peptide from a database – Reject if it has the wrong mass, else: – Generate a hypothetical spectrum – Compute a correlation between observed and experimental spectra – Choose the best Database searching is very powerful and is the de facto standard for MS. – Sequest, Mascot, Inspect, and many others …SARLSQETFSDLWKLLPENNVLSPLP….

Wi’07Bafna So what’s new? The Id picture is very simplistic. Only 20-30% of spectra are conclusively identified. Many reasons: – Spectra are noisy. – Databases are incomplete. Sometimes, we need to do a de novo interpretation – Post-translational modifications. – Instrument performance is critical. The algorithms for identification must be sensitive to these issues. We present a systematic look at identification software.

Wi’07Bafna Modules for Peptide Id Interpretation (D) Input Spectrum Output: all that can be extracted from the spectrum (peptides/tags/parent mass/charge) Indexing/Filtering Input: Db (set of peptides) Output: pre-processing of the database, peptide subset. Scoring Input; peptide set, spectrum Output: ranked list of scores Validation Significance of the top hit. I/F SVD Db

Wi’07Bafna De novo interpretation of mass spectra The so called de novo algorithms focus exclusively on the D module. There is no database (I/F). Limited scoring and validation Important when no database exists! – Also important for db search I/F SVD

Wi’07Bafna De Novo Interpretation: Example S G E K b-ions y-ions b y y M/Z b 1 1 2

Wi’07Bafna The simplest case Suppose only (and all) the prefix ions were visible. Would identification be easy? We have two problems: – There is a mix of b and y ions. Separating them is critical! – Other ions besides b,y, including neutral losses, noise and so on. We need to account for them. S G E K b-ions y-ions S G E K

Wi’07Bafna Separating b-, and y-ions is solved using a combinatorial formulation (forbidden pairs) Separating b,y from all others is solved using a statistical approach. Together, they form the basis for a de novo sequencer.

Wi’07Bafna De Novo Interpretation: Example S G E K b-ions y-ions b y y M/Z b Ion Offsets b=P+1 y=S+19=M-P+19

Wi’07Bafna Computing possible prefixes We know the parent mass M=401. Consider a mass value 88 Assume that it is a b-ion, or a y-ion If b-ion, it corresponds to a prefix of the peptide with residue mass 88-1 = 87. If y-ion, y=M-P+19. – Therefore the prefix has mass P=M-y+19= =332 Compute all possible Prefix Residue Masses (PRM) for all ions.

Wi’07Bafna Putative Prefix Masses Prefix Mass M=401 by S G E K Only a subset of the prefix masses are correct. The correct mass values form a ladder of amino-acid residues

Wi’07Bafna Spectral Graph Each prefix residue mass (PRM) corresponds to a node. Two nodes are connected by an edge if the mass difference is a residue mass G

Wi’07Bafna Spectral Graph Each peak, when assigned to a prefix/suffix ion type generates a unique prefix residue mass. Spectral graph: – Each node u defines a putative prefix residue M(u). – (u,v) in E if M(v)-M(u) is the residue mass of an a.a. (tag) or 0. – Paths in the spectral graph correspond to a interpretation S G E K

Wi’07Bafna Re-defining de novo interpretation Find a subset of nodes in spectral graph s.t. – 0, M are included – Each peak contributes at most one node (interpretation)(*) – Each adjacent pair (when sorted by mass) is connected by an edge (valid residue mass) – An appropriate objective function (ex: the number of peaks interpreted) is maximized S G E K G

Wi’07Bafna Two problems Too many nodes. – A. Only a small fraction correspond to b/y ions (leading to true PRMs). – B. Even if the b/y ions were correctly predicted, each peak generates multiple possibilities, only one of which is correct. We need to find a path that uses each peak only once (algorithmic problem). – In general, the forbidden pairs problem is NP-hard S G E K

Wi’07Bafna However,.. The b,y ions have a special non-interleaving property Consider pairs (b 1,y 1 ), (b 2,y 2 ) – Note that b 1 +y 1 = b 2 +y 2 – If (b 1 y 2

Wi’07Bafna Non-Intersecting Forbidden pairs S G E K If we consider only b,y ions, ‘forbidden’ node pairs are non-intersecting, The de novo problem can be solved efficiently using a dynamic programming technique

Wi’07Bafna The forbidden pairs method There may be many paths that avoid forbidden pairs. We choose a path that maximizes an objective function, – EX: the number of peaks interpreted – Here we assume a function , which gives a score to a PRM. The score captures the likelihood that the PRM is correct.

Wi’07Bafna The forbidden pairs method Sort the PRMs according to increasing mass values. For each node u, f(u) represents the forbidden pair Let m(u) denote the mass value of the PRM u f(u)

Wi’07Bafna D.P. for forbidden pairs Consider all pairs u,v – m[u] M/2 Define S(u,v) as the best score of a forbidden pair path from 0->u, v- >M Is it sufficient to compute S(u,v) for all u,v? uv

Wi’07Bafna D.P. for forbidden pairs Note that the best interpretation is given by uv

Wi’07Bafna D.P. for forbidden pairs Denote the forbidden pair of node v by f(v). – What is f(f(v))? Note that we have one of two cases. 1. Either u v) 2. Or, u > f(v) (and f(u) < v) Case 1. – Extend v, do not touch f(u) u f(u) vw

Wi’07Bafna The complete algorithm for all u /* increasing mass values from 0 to M/2 */ for all v /* decreasing mass values from M to M/2 */ if (u > f[v]) else if (u < f[v]) If (u,v)  E /* maxI is the score of the best interpretation */ maxI = max {maxI,S[u,v]}

Wi’07Bafna De Novo: Second issue Given only b,y ions, a forbidden pairs path will solve the problem. However, recall that there are MANY other ion types. – Typical length of peptide: 15 – Typical # peaks? ? – #b/y ions? – Most ions are “Other” a ions, neutral losses, isotopic peaks….

Wi’07Bafna De novo: Weighting nodes in Spectrum Graph Factors determining if the ion is b or y – Intensity – Support ions b- and y-ions are the most likely ions to lose water/ammonia – Isotopic peaks

Wi’07Bafna Offset frequency function b, and y-ions show offsets due to neutral losses

Wi’07Bafna De novo: Weighting nodes A probabilistic network to model support ions (Pepnovo)

Wi’07Bafna De Novo Interpretation Summary The main challenge is to separate b/y ions from everything else (weighting nodes), and separating the prefix ions from the suffix ions (Forbidden Pairs). As always, the abstract idea must be supplemented with many details. – Noise peaks, incomplete fragmentation – A PRM is first scored on its likelihood of being correct, and the forbidden pair method is applied subsequently.

Wi’07Bafna Db search versus de novo interpretation FilterValidationScore De novo Db 55M peptides 1.Traditional db search simply have the scoring module. 2.De novo is useful when the peptide is not in the database, but not as accurate. 3.It can be thought of as a database search over a much larger database. 4.PT modifications change the picture.

Wi’07Bafna Filtering FilterValidationScoreextension De novo Db 55M peptides Candidate Peptides (700) 1.Db indexing/filtering is a key mechanism for reducing the search space

Wi’07Bafna Filtering Define a filter as a computational tool that rapidly screens a database, removing much of it but retaining the true peptide. Can you suggest commonly used filters? 1. Parent mass 2. Trypsin digested peptides

Wi’07Bafna Parent Mass filter Sort all peptides in the database by their parent mass. Search only the peptides that are within some mass tolerance. The filter does not work when you have modifications.

Wi’07Bafna The dynamic nature of the proteome The proteome of the cell is changing Various extra-cellular, and other signals activate pathways of proteins. A key mechanism of protein activation is PT modification These pathways may lead to other genes being switched on or off Mass Spectrometry is key to probing the proteome

Wi’07Bafna Db search for putatively modified peptides. Ex:YFDSTDYNMAK 2 5 =32 possibilities, with 2 types of modifications! In contrast, de novo search space does not change significantly. Phosphorylation? oxidation For each peptide, generate all mods. Score each modification Is parent mass still a good filter?

Wi’07Bafna Enzymatic digestion rules as a filter Consider only tryptic peptides – Trypsin cleaves after R,K (not if RP, or KP) Tryptic peptide filters may not be very effective – Missed cleavage – End-point degradation – Endogenous peptide activity

Wi’07Bafna Example of non-tryptic peptide analysis Experiment: a massive oversampling of a proteome (14M spectra of a prokaryotic genome) Plot the absolute postion of the most N-terminal peptide (not-necessarily tryptic) (Gupta et al., 2007) Two peaks are seen, at position 2, and position ~22

Wi’07Bafna Signal Peptide discovery

Wi’07Bafna PT modifications/processing The problem was not intractable, but impractical. Identifications of modified peptides was not routine, and is left to specialized cases. Better filtering technology makes it practical to explore modifications on a large scale. Increase in time is modest with increasing number of modifications. The technology can only improve. Should be possible to improve speed by another order of magnitude.

Wi’07Bafna We will filter databases via a trick from sequence searching

Wi’07Bafna Sequence Search Basics Q: Given k words (s i has length l i ), and a database of size n, find all matches to these words in the database string. How fast can this be done? 1:POTATO 2:POTASSIUM 3:TASTE P O T A S T P O T A T O dictionary database

Wi’07Bafna Dict. Matching & string matching How fast can you do it, if you only had one word of length m? – Trivial algorithm O(nm) time – Pre-processing O(m), Search O(n) time. Dictionary matching – Trivial algorithm (l 1 +l 2 +l 3 …)n – Using a keyword tree, l p n (l p is the length of the longest pattern) – Aho-Corasick: O(n) after preprocessing O(l 1 +l 2..) We will consider the most general case

Wi’07Bafna Sequence tag filters Basics 1. De novo sequencing can be used to get partial sequence information from the spectra. 2. Exact matching for sequence is fast. 3. We can search a database (size n) with k substrings (of any length) in time that is proportional to n, but independent of the size or number of substrings. Aho-Corasick trie data-structure.

Wi’07Bafna Dictionary matching YAK S N N F F AT YFAK YFNS FNTA …..Y F R A Y F N T A….. In each step, either f, or l is incremented. Total time is 2n, independent of automaton size. f l

Wi’07Bafna Tag-based filtering A tag is a short peptide with a prefix and suffix mass Efficient: An average tripeptide tag matches Swiss-Prot ~700 times Tagging is related to de novo sequencing yet different. Objective: Compute a subset of short strings, at least one of which must be correct. Longer tags=> better filter. Analogy: Using tags to search the proteome is similar to moving from full Smith-Waterman alignment to BLAST

Wi’07Bafna Tag generation W R A C V G E K D W L P T L T TAG Prefix Mass AVG 0.0 WTD PET Using local paths in the spectrum graph, construct peptide tags. Use the top ten tags to filter the database Tagging is related to de novo sequencing yet different. Objective: Compute a subset of short strings, at least one of which must be correct. Longer tags=> better filter.

Wi’07Bafna Tag-based search Tags (from multiple spectra) are used to construct a trie For each string match, attempt to extend, matching the prefix and suffix mass using flanking sequence (and PTMs) Retain best matches for detailed scoring

Wi’07Bafna Tag based search (dictionary matching) YFD DST STD TDY YNM Y M FD N Y M FD N …..YFDSTGSGIFDESTMTKTYFDSTDYNMAK…. De novo trie scan

Wi’07Bafna Speed & Sensitivity FilteredPeptides: # peptides that pass the filter Time: Scan time + filteredpeptides * Scoring-time For sequence tag filters, the scan time can be amortized out, by combining scan for many spectra all at once. – Build one automaton from multiple spectra Thus, filter efficiency is key to speed. Filter sensitivity is also important

Wi’07Bafna Given: – tag with prefix and suffix masses xyz – match in the database Compute if a suffix and prefix match with allowable modifications. How fast can this be done? Compute a candidate peptide with most likely attachment point. Fast Extension xyz

Wi’07Bafna Filtering Database filtering is a critical (and relatively unexplored) strategy for MS searches. De novo sequencing to get tags is an effective strategy Are other forms of filtering possible? – Alg. Question: given a spectrum, find all peptides that match a subset of theoretical fragments. How quickly can you do that?

Wi’07Bafna Overview FilterSignificanceScoreextension De novo Db 55M peptides Candidate Peptides (700)

Wi’07Bafna Scoring Input: – Candidate peptide with attached modifications – Spectrum Output: – Score function: – Key: the score must normalize for length, as variable modifications can change peptide length.

Wi’07Bafna Score function Score is a log-odds function. Numerator: Prob. That the spectrum was generated by n theoretical fragments from the peptide. Denominator: Probability that the spectrum was generated by n randomly generated peaks

Wi’07Bafna Probabilistic Scoring Empirically computed Depends upon tolerance Theoretical fragments Peaks i-th peak j-th frag.

Wi’07Bafna Overview FilterValidationScoreextension De novo Db 55M peptides Candidate Peptides (700)

Wi’07Bafna P-value computation The score function allows us to rank peptides. The top scoring one may not be the correct one. We consider a collection of +ve (top scoring correct one), and -ve spectra. Consider a bunch of other scores that help separate the +ve from the -ve.

Wi’07Bafna Quality scores and p-values Features: Score S: as computed Explained Intensity I: fraction of total intensity explained by annotated peaks. Explained peaks: fraction of top 25 peaks annotated. b-y score B: fraction of b+y ions annotated  -score : difference between the best and second best Choose a final score as a (linear) combination of the features. The weights can be trained using a discriminative strategy.

Wi’07Bafna Separating power of features

Wi’07Bafna Computing confidence values Discriminative training of feature weights is used to maximize the separation. The distribution of -ve example scores can supply a confidence value.

Wi’07Bafna Common ID Tools Sequest – De Novo + Filtering: Parent Mass, Enzyme specificity – Scoring: Simple cross-correlation of theoretical and experimental peaks – Validation: Xcorr (based on difference between best and second best) Mascot: Similar MS-BLAST – Uses 3rd party de novo prediction – Filtering using Blast code – Limited scoring – Validation at the protein level

Wi’07Bafna Identification summary While MS technologies are important for proteomics in general, they are the key technology for identification. They can probe the proteome dynamically, and help identify mutations and modifications. The algorithms are continually improved by improved versions of one or more modules.

Wi’07Bafna Modifications While we discussed modification in general, identification and validation of modifications remains an important theme. Larger modifications are interesting in themselves (such that glycan chains in glycosylation), and might be identified using MS ABRF delta mass resource

Wi’07Bafna Isotopic Profiles

Wi’07Bafna Mass Measurement? VAPEEHPVLLTEAPLNPK (Mol. Mass=1953)

Wi’07Bafna Mass-Charge ratio The charge is due to a proton (H+) which has a mass of ~1 Da The X-axis is (M+Z)/Z – Z=1 implies that peak is at M+1 – Z=2 implies that peak is at (M+2)/2 M=1000, Z=2, peak position is at 501 – Suppose you see a peak at 501. Is the mass 500, or is it 1000?

Wi’07Bafna Isotopic peaks Ex: Consider peptide SAM Mass = You should see: Instead, you see

Wi’07Bafna Isotopes C-12 is the most common. Suppose C-13 occurs with probability 1% EX: SAM – Composition: C11 H22 N3 O5 S1 What is the probability that you will see a single C-13? MS spectrum for SAM

Wi’07Bafna All atoms have isotopes Isotopes of atoms – O16,18, C-12,13, S32,34…. – Each isotope has a frequency of occurrence If a molecule (peptide) has a single copy of C-13, that will shift its peak by 1 Da With multiple copies of a peptide, we have a distribution of intensities over a range of masses (Isotopic profile). How can you compute the isotopic profile of a peak?

Wi’07Bafna Isotope Calculation Denote: – N c : number of carbon atoms in the peptide – P c : probability of occurrence of C-13 (~1%) – Then +1 N c =50 +1 N c =200

Wi’07Bafna Isotope Calculation Example Suppose we consider Nitrogen, and Carbon N N : number of Nitrogen atoms P N : probability of occurrence of N-15 Pr(peak at M) Pr(peak at M+1)? Pr(peak at M+2)? How do we generalize? How can we handle Oxygen (O-16,18)?

Wi’07Bafna General isotope computation Definition: – Let p i,a be the abundance of the isotope with mass i Da above the least mass – Ex: P 0,C : abundance of C-12, P 2,O : O-18 etc. Characteristic polynomial Prob{M+i}: coefficient of x i in  (x) (a binomial convolution)

Wi’07Bafna Quiz How can you determine the charge on a peptide?  Difference between the first and second isotope peak is 1/Z  Proposal:  Given a mass, predict a composition, and the isotopic profile  Do a ‘goodness of fit’ test to isolate the peaks corresponding to the isotope  Compute the difference

Wi’07Bafna Isotopic Profile Application In DxMS, hydrogen atoms are exchanged with deuterium The rate of exchange indicates how buried the peptide is (in folded state) Consider the observed characteristic polynomial of the isotope profile  t1,  t2, at various time points. Then The estimates of p 1,H can be obtained by a deconvolution Such estimates at various time points should give the rate of incorporation of Deuterium, and therefore, the accessibility.

Wi’07Bafna Quantitation via MS

Wi’07Bafna Quantitation The intensity of the peak depends upon – Abundance, ionization potential, substrate etc. Two peptides with the same abundance can have very different intensities. Assumption: relative abundance can be measured by comparing the ratio of a peptide in 2 samples.

Wi’07Bafna Quantitation issues The two samples might be from a complex mixture. How do we identify identical peptides in two samples? In micro-array this is possible because the cDNA is spotted in a precise location? Can we have a ‘location’ for proteins/peptides MS based quantitation must be coupled with quantitation

Wi’07Bafna 2D Gel based separation Intact proteins are used Iso-electric focusing and Mol Weight used to separate. Intensity of spot is used to measure abundance. Problems: All 3 measurements are not that precise (low reproducibility) Labor intensive Many proteins do not get separated.

Wi’07Bafna LC-MS based separation As the peptides elute (separated by physiochemical properties), spectra is acquired. HPLC ESI TOF Spectrum (scan) p1 p2 pn p4 p3

Wi’07Bafna LC-MS Maps time m/z I Peptide 2 Peptide 1 x x x x x x x x x x x x x x time m/z Peptide 2 elution A peptide/feature can be labeled with the triple (M,T,I): – monoisotopic M/Z, centroid retention time, and intensity An LC-MS map is a collection of features

Wi’07Bafna Peptide Features Isotope pattern Elution profile Peptide (feature) Capture ALL peaks belonging to a peptide for quantification !

Wi’07Bafna Relative abundance using MS Differential Isotope labeling (ICAT/SILAC) External standards (AQUA) Direct Map comparison

Wi’07Bafna ICAT Introduced by Aebersold ICAT reagent is attached to particular amino-acids (Cys) Affinity purification leads to simplification of complex mixture “diseased” Cell state 1 Cell state 2 “Normal” Label proteins with heavy ICAT Label proteins with light ICAT Combine Fractionate protein prep - membrane - cytosolic Proteolysis Isolate ICAT- labeled peptides Nat. Biotechnol. 17: ,1999

Wi’07Bafna Differential analysis using ICAT ICAT pairs at known distance diseased normal

Wi’07Bafna ICAT issues The tag is heavy, and decreases the dynamic range of the measurements. The tag might break off Only Cysteine containing peptides are retrieved Non-specific binding to strepdavidin

Wi’07Bafna Serum ICAT data

Wi’07Bafna Serum ICAT data Instead of pairs, we see entire clusters at 0, +8,+16,+22 What is going on?

Wi’07Bafna SILAC A different isotope labeling strategy Mammalian cells do not ‘manufacture’ all amino- acids. Where do they come from? Labeled amino-acids are added to amino-acid deficient culture, and are incorporated into all proteins as they are synthesized No chemical labeling or affinity purification is performed. Leucine can be used (10% abundance vs 2% for Cys)

Wi’07Bafna SILAC vs ICAT Leucine is higher abundance than Cys No affinity tagging done Fragmentation patterns for the two peptides are identical – Identification is easier Ong et al. MCP, 2002

Wi’07Bafna Incorporation of Leu-d3 at various time points Doubling time of the cells is 24 hrs. Peptide = VAPEEHPVLLTEAPLNPK What is the charge on the peptide?

Wi’07Bafna Quantitation on controlled mixtures

Wi’07Bafna Identification MS/MS of differentially labeled peptides

Wi’07Bafna Questions The quantitation of peptides also depends upon the amount mixed right in the beginning (what if you mixed more of one sample?) How can you control for such errors? What happens when you see singletons (unpaired features)? – In ICAT, it can be a non-Cys peptide (chemical noise), OR – Under-expressed Cys-peptide – How can you differentiate between the two cases?

Wi’07Bafna Map comparison The retention time and M/Z are a signature for the peptide. Can we use this signature to compare the intensity of the same peptide in two samples? T M/Z

Wi’07Bafna Map 1 (normal) Map 2 (diseased) Map Comparison for Quantification

Wi’07Bafna Data reduction (feature detection) Features Each feature is represented by – Monoisotopic M/Z, centroid retention time, aggregate intensity

Wi’07Bafna Comparison of features across maps Hard to reduce features to single spots Matching paired features is critical M/Z is accurate, but time is not. A time scaling might be necessary

Wi’07Bafna Time scaling: Alignment Each time scan is a vector of intensities. Two scans in different runs can be scored for similarity (using a dot product) Compute an alignment to match scans against each other. Advantage: does not rely on feature detection. Disadvantage: Might not handle affine shifts in time scaling, but is better for local shifts

Wi’07Bafna Spectral comparison The dot product is the standard technique for comparing two spectra Bin the peaks by partitioning the masses. Each mass bin gets an aggregate intensity corresponding to the peaks that fall in it. The normalized dot-product of the resulting vectors measures the similarity partitions Vector v 1 v1v1 v2v2

Wi’07Bafna Basic geometry What is ||x|| 2 ? What is x/||x|| Dot product x=(x 1,x 2 ) y

Wi’07Bafna Matching runs Compute a matching of the vectors with the best score. What should the indel penalties be like? v1v1 w1w1 v2v2 vjvj wiwi w i T v j

Wi’07Bafna Time scaling: Approach 2 (geometric matching) Match features based on M/Z, and (loose) time matching. Objective  f (t 1 -t 2 ) 2 Let t 2 ’ = a t 2 + b. Select a,b so as to minimize  f (t 1 -t’ 2 ) 2

Wi’07Bafna Geometric Matching Pair up everything with identical m/z – Very loose constraints on time. Give every pair a cost equal to the time difference Iterate over all (a,b), to find one that mimimizes the minimum cost matching.

Wi’07Bafna Comparing the different approaches Time scaling via Geometric matching depends critically upon reliable feature identification. Time scaling via spectral comparison depends upon coordinated elution of peptide features

Wi’07Bafna Quantitation Summary Once we have features matched across runs, we have data identical to microarrays. Features can be ‘identified’ in separate MS2 experiments Feature detection, LC-MS mapping/ICAT decouple the identification from the quantitation The difficulty of producing such data makes it a challenging problem for bioinformatics run feature intensity (Identification via MS2)

Wi’07Bafna Other applications of MS

Wi’07Bafna MS application: Protein-protein interaction Proteins combine to form functional complexes. An antibody is a special kind of protein that can recognize a specific protein Use an antibody to recognize a protein in a complex. Isolate & Purify the complex that binds to the antibody. Identify all the proteins in the complex via mass spectrometry.

Wi’07Bafna MS application:Protein Structure Use chemical cross-linkers to link spatially proximal residues. Denature and digest the protein. Identify the cross-linked peptides. This provides extra structural constraints which help predict structure.

Wi’07Bafna Cross-linking Cross-links are ‘fixed’ length polymers that bind to amino- acids. How can they help predict structure? Protocol – Cross-link native protein – Denature, digest – MS/MS (identify cross- linked peptides) Potentially valuable, but not widely used

Wi’07Bafna Identifying Cross-linked peptides Identify all peptide pairs, whose mass explains the parent mass. Given a list of peptide pairs, find the pair, and the linked position that best explains the MS2 data. What is the number of possible candidate pairs. Fragmentation in the presence of linkers is poorly understood How do you separate cross- linked peptides from singly linked, and non-cross-linked peptides?

Wi’07Bafna Overlap peptides and shotgun based identification

Wi’07Bafna Motivation Database search of MS/MS spectra works well whenever the protein sequence is known and the protein is not modified/mutated BUT: – Not all protein sequences are available Examples include proteins from snake and scorpion venom Integrilin, a successful blood clot prevention drug distributed by Millenium, was derived from rattlesnake venom Sequences are still determined by Edman sequencing – Database search is likely to fail if the analyzed protein contains unexpected post-translational modifications or mutations.

Wi’07Bafna Shotgun Assembly How to use the redundant information in MS/MS spectra from overlapping peptides to construct a de-novo interpretation of the protein sequence? Non-specific proteases or sets of proteases with different specificities result in very rich digestion patterns:

Wi’07Bafna Unanticipated modifications

Wi’07Bafna MS-Alignment Dynamic Programming can be used to capture mass-offsets (putative PTMs). In large data-sets, true PTMs should be over-represented. Tsur et al.’ Nat. Bio. 2005

Wi’07Bafna PTM Frequency Matrix 50,000 spectra from a sample of IKKb were searched in blind mode, and identifications with p- value <0.05 were retained Cell shading indicates the number of annotations with modification ( , a)

Wi’07Bafna PTM Frequency Matrix Oxidation Methylation Sodium Double oxidation Dimethylation

Wi’07Bafna PTM selection: Output

Wi’07Bafna Filtering  -correct annotations M+17 (from oxidized methionine, incorrect mass) A+14 (from methylated lysine, incorrect placement)

Wi’07Bafna Overlapping peptides

Wi’07Bafna Features for robust PTM identification A number of features are used to validate mass-shifts obtained from MS-alignment, including – evidence from overlapping peptides, – number of spectra. – Delta scores from other possibilities, – Evidence from multiple sites, de-novo identifications etc. – These features are used to train an SVM. Validation is done at spectrum, peptide, and site levels Final searches performed using a fixed false discovery rates on a random database.

Wi’07Bafna Blind Search of a lens data-set Human lenses are a rich source of modified proteins: – Lens proteins do not turnover, but accumulate modifications over time – Hundreds of papers in the last 20 years reporting crystallin modifications – Yates‘ lab pioneered use of various proteases for high-throughput PTM validation/discovery in lenses (MacCoss et al, 2003) – Sample from a 93-year old patient – Wilmarth et al. (Jnl. Prot. Res.,06)

Wi’07Bafna Modifications on cataractous lens LocationModification Mass Putative annotation S,T-18dehydration Q-17deamidation W-2cross-linking H14methylation M,W16oxidation S,H28double methylation N-term42acetylation N-term43carbamylation K,non-terminal43carbamylation W44carboxylation R55unknown K58carboxymethylation K72carboxyethylation LocationModification mass TypePutative annotationComment M-48Chem. artifactloss of methane sulfenic acidreported on same site W4PTMkynureninereported in cataractous lenses S30/73unknown W32PTMformylkynureninereported in cataractous lenses N-term57unknowncarboxyamidomethylationIn-vivo N-term modification? N-term271unknown Table 1: Rediscovered all modifications previously identified by database search except deamidation on N/Q (+1 Da). Table 2: Identified 6 new modification events

Wi’07Bafna Lens: Unknown modifications Unknown modification R+55 was found on both data-sets, and assigned to overlapping sets of sites. Some evidence for Q+161 (glycosylation?) in a few spectra

Wi’07Bafna Spectra for putative R+55 modification

Wi’07Bafna 2. Selecting modifications (summary) A ‘strength in numbers’ approach: The more spectra, the better. – A recent unpublished search with ~18M spectra reveals ~500 modified sites, with a 2% FDR Overlapping peptides are strong evidence (incorrect matches unlikely to overlap) These and other features are being used in an automated tool for identifying modified sites.

Wi’07Bafna Eukaryotic genes A eukaryotic gene has a complex structure. A gene may be transcribed (translated) into many alternative isoforms. A ‘typical’ exon is ~150bp. A ‘typical’ tryptic peptide is ‘15’ aa. What is the probability that a typical peptide crosses a splice junction? ATG 5’ UTR intronexon 3’ UTR Translation start

Wi’07Bafna Searching spliced-exon graph Instead of searching 6 frame translations, we search a compact representation of putative exons, and exon-pairs! Each putative exon is a node. Splicing and SNPs are represented by edges The tag based search extends efficiently to spliced-exon graphs.

Wi’07Bafna Genomic Search Results ~18M spectra from a kidney cell line were searched against a human splice-exon graph. Validation of 39,000 exons and 11,000 introns Novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins. Discover over 40 alternative splicing events. 308 coding SNPs. Tanner et al., Genome Research

Wi’07Bafna Retinoblastoma gene (novel 3’ Exon)

Wi’07Bafna EX: Novel Exons intron

Wi’07Bafna Validating hypothetical protein

Wi’07Bafna Conclusion Key technology for proteomics Leading technology for protein identification, PT modifications, and protein level quantitation Applications to protein structure, interactions, pathways and other proteomic problems