Protein Identification by Peptide Mass Fingerprinting

Aim: Introduce concepts behind PMF and some tools that use the technique

Protein identification by mass spectrometry - background
Endoproteases cut at specific positions in the amino acid sequence. Trypsin cuts after R (arginine) and K (lysine). Digestion of a protein with trypsin yields a set of peptides whose masses form a peptide fingerprint. A mass spectrometer measures the mass-to-charge ratio (m/z) of the peptides.

Theoretical digest of a protein
A specific enzyme, here trypsin, cuts the protein into peptides. Trypsin cuts after arginine (R) and lysine (K), and the masses of the peptides can then be calculated. Each protein in a database can be cut in silico to generate a list of peptide masses.

Protein sequence:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
SIPETQKGVIFYESHGKLEHKDIPVPKPKANELLINVKYSGVCHTDLHAWHGDWPLPVKLPLVGGHEGAGVVVGMGENVKGWKIGDYAGIKWLNGSCMACEYCELGNESNCPHADLSGYTHDGSFQQYATADAVQAAHIPQGTDLAQVAPILCAGITVYKALKSANLMAGHWVAISGAAGGLGSLAVQYAKAMGYRVLGIDGGEGKEELFRSIGGEVFIDFTKEKDIVGAVLKATDGGAHGVINVSVSEAAIEASTRYVRANGTTVLVGMPAGAKCCSDVFNQVVKSISIVGSYVGNRADTREALDFFARGLVKSPIKVVGLSTLPEIYEKMEKGQIVGRYVVDTS

Theoretical tryptic peptides (in angle brackets, with the flanking residues outside):
K <EK> D, K <ALK> S, K <GWK> I, K <MEK> G, R <GLVK> S, R <YVR> A, K <SPIK> V, R <ADTR> E,
K <LEHK> D, K <AMGYR> V, K <GQIVGR> Y, K <EELFR> S, <SIPETQK> G, R <YVVDTSK>,
K <DIVGAVLK> A, K <IGDYAGIK> W, K <DIPVPKPK> A, R <VLGIDGGEGK> E, R <EALDFFAR> G,
K <ANELLINVK> Y, K <GVIFYESHGK> L, K <CCSDVFNQVVK> S, K <SISIVGSYVGNR> A,
R <SIGGEVFIDFTK> E, R <ANGTTVLVGMPAGAK> C, K <VVGLSTLPEIYEK> M,
K <LPLVGGHEGAGVVVGMGENVK> G, K <ATDGGAHGVINVSVSEAAIEASTR> Y, K <YSGVCHTDLHAWHGDWPLPVK> L

Peptide masses:
275.3 330.4 389.4 406.5 415.5 436.5 443.5 461.4 525.6 596.7 628.7 692.7 801.8 810.9 813.9 835.9 893.1 944.1 968.1 1013.2 1136.2 1241.4 1251.4 1312.4 2019.3 2312.4 2418.7
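The in-silico digest described on this slide can be sketched in Python. This is a minimal illustration, not the method of any particular tool. The function names are hypothetical. The residue masses are standard average masses, chosen because the first listed masses (275.3, 330.4) agree with average neutral masses for EK and ALK; that reading of the slide is an inference. The option to skip cleavage before proline is also an assumption, suggested by the uncut peptide DIPVPKPK in the list.

```python
# Standard average residue masses in Da (assumed mass type, see lead-in).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added per peptide (free N- and C-terminus)


def tryptic_digest(sequence, missed_cleavages=0, skip_before_proline=True):
    """Cut after K or R; optionally do not cut when the next residue is P."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        next_is_pro = i + 1 < len(sequence) and sequence[i + 1] == "P"
        if aa in "KR" and not (skip_before_proline and next_is_pro):
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder
    # Join neighbouring pieces to model up to `missed_cleavages` missed sites.
    peptides = []
    for n in range(missed_cleavages + 1):
        for j in range(len(pieces) - n):
            peptides.append("".join(pieces[j:j + n + 1]))
    return peptides


def peptide_mass(peptide):
    """Average neutral mass of an unmodified peptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    fragment = "SIPETQKGVIFYESHGKLEHKDIPVPKPKANELLINVK"  # start of the sequence above
    for pep in tryptic_digest(fragment):
        print(pep, round(peptide_mass(pep), 1))
```

Running the same two functions over every entry of a sequence database gives the theoretical peak lists that experimental spectra are matched against.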

MALDI-TOF mass spectrum
A mass spectrum is acquired from a protein digest

Peptide mass fingerprinting - principle
A MALDI-TOF spectrum of a protein digest is matched against theoretical digestions of a protein sequence database. A scoring system is used to identify the best database hit. The idea was first presented at a conference in 1989, and the first papers were published in 1993 by five independent groups.

Peptide mass fingerprinting = PMF MS database matching
Experimental path: Protein(s) -> enzymatic digestion -> Peptides -> Mass spectra -> Peaklist
Theoretical path: Sequence database entry -> in-silico digestion -> Theoretical proteolytic peptides -> Theoretical peaklist
Match: the experimental peaklist is compared with the theoretical peaklists.
Result: ranked list of protein candidates
Example sequence database entry:
…MAIILAGGHSVRFGPKAFAEVNGETFYSRVITLESTNMFNEIIISTNAQLATQFKYPNVVIDDENHNDKGPLAGIYTIMKQHPEEELFFVVSVDTPMITGKAVSTLYQFLV …
Example theoretical proteolytic peptides:
- MAIILAGGHSVR
- FGPK
- AFAEVNGETFYSR
- VITLESTNMFNEIIISTNAQLATQFK
- YPNVVIDDENHNDK
…

Some critical parameters for PMF
- Spectrum peak extraction and deisotoping. In the ideal case, all peptide peaks in a spectrum are used for matching, and no other peaks.
- Is the protein in the database?
- Scoring algorithm
- Protein modifications considered
- Calibration / mass accuracy

From raw data to peak list
Spectrum(m) = baseline(m) + signal(m) + noise(m)
Source: Markus Müller

Peptide mass fingerprinting
What you have:
- Set of peptide mass values
- Information about the protein: molecular weight, pI, species
- Information about the experimental conditions: mass spectrometer precision, calibration used, possibility of missed cleavages, possible modifications
- Biological characteristics: post-translational modifications, fragments
What to do: match this information against a protein sequence database
What you get: a list of probable identified proteins

Data about the protein
2D gels: molecular weight and pI
Importance:
- reduces the search space
- favours the “good proteins”
Problems:
- 2D gels not always available
- imprecise measurement
- proteins may be fragmented
- proteins may be modified
Species: databases are not always complete; search on closely related species

Data about experimental conditions
Accepted mass tolerance: needed because of imprecise measurements and calibration problems.
Source: Daniel C. Liebler. Introduction to Proteomics: Tools for the New Biology. Humana Press, 2002.
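A mass tolerance can be expressed in Dalton (absolute) or in ppm (relative to the theoretical mass). The sketch below shows both forms; the function names are hypothetical and chosen only for illustration.

```python
def mass_error(observed, theoretical, unit="Da"):
    """Signed error of an observed mass, in Da or in ppm of the theoretical mass."""
    delta = observed - theoretical
    if unit == "Da":
        return delta
    if unit == "ppm":
        return delta / theoretical * 1e6
    raise ValueError("unit must be 'Da' or 'ppm'")


def within_tolerance(observed, theoretical, tolerance, unit="Da"):
    """True if the absolute error does not exceed the accepted tolerance."""
    return abs(mass_error(observed, theoretical, unit)) <= tolerance


if __name__ == "__main__":
    # An error of 0.05 Da at mass 1000 corresponds to about 50 ppm.
    print(mass_error(1000.05, 1000.0, "Da"))
    print(mass_error(1000.05, 1000.0, "ppm"))
    print(within_tolerance(1000.05, 1000.0, 0.1, "Da"))
    print(within_tolerance(1000.05, 1000.0, 20, "ppm"))
```

A fixed Da window is relatively looser for small peptides than for large ones, whereas a ppm window scales with the mass.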

Need for calibration
Example: Helicobacter spectrum searched against UniProt from ExPASy.
Figure panels: 0.1 Da mass tolerance; Da mass tolerance; Da tol, recalibrated

Data about experimental conditions
Number of allowed missed cleavages: digestion is not always perfect.
Chemical modifications: depend on the sample preparation.
- carboxymethylation of cysteines
- oxidation of methionines
- esterification of the side-chain carboxylic groups of glutamic (E) and aspartic (D) acids, together with the carboxyl-terminal group

Biological characteristics
- Post-translational modifications: identification and characterisation of proteins; use of database annotations
- Protein fragments: many forms for each protein, experimental problems
- Protein variants: mutations
What can be handled depends on the databases available in the identification tools, and on the tools themselves.

Scoring systems
Essential for the identification! A score gives a confidence value to each matched protein.
Three types of scores:
- Shared peaks count: simply counts the number of matched mass values (peaks)
- Probabilistic scores: the confidence value depends on probabilistic models or statistical knowledge used during the match (obtained from the databases)
- Statistical learning: knowledge extracted from the influence of different properties used to match the proteins (obtained from the databases)
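The simplest of these scores, the shared peaks count, can be sketched as follows. The database content and protein names are made up for illustration, and the function names are hypothetical.

```python
def shared_peak_count(peaks, theoretical_masses, tolerance_da=0.1):
    """Number of experimental peaks that match at least one theoretical mass."""
    count = 0
    for peak in peaks:
        if any(abs(peak - mass) <= tolerance_da for mass in theoretical_masses):
            count += 1
    return count


def rank_proteins(peaks, database, tolerance_da=0.1):
    """Return (protein, score) pairs sorted by decreasing shared peaks count."""
    scores = {
        name: shared_peak_count(peaks, masses, tolerance_da)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical theoretical peak lists for two database entries.
    database = {
        "protein_A": [275.3, 330.4, 406.5, 835.9],
        "protein_B": [275.3, 500.0],
    }
    peaks = [275.3, 330.45, 406.5, 999.9]
    print(rank_proteins(peaks, database))
```

This count ignores how likely a match is to occur by chance, which is what the probabilistic scores add.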

Filtering
Removal of contaminant peaks before database searching:
- trypsin autolytic peaks
- keratins (hair, skin)
- matrix peaks
- machine artefacts

Taking advantage of contaminants!
Figure labels: identification of contaminants (< 20 ppm); outlier; multipoint calibration on contaminants.
Source: Karin Hjerno
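Multipoint calibration on contaminants can be illustrated with a straight-line least-squares fit. This is a sketch under assumptions: the linear model and the function names are chosen here for illustration, and the reference values are commonly quoted trypsin autolysis masses that should be checked for the enzyme actually used. The observed values are simulated. Outlier rejection, which the figure also mentions, is not implemented.

```python
def fit_linear_calibration(observed, reference):
    """Least-squares fit of reference = slope * observed + intercept."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def recalibrate(peaks, slope, intercept):
    """Apply the fitted correction to every peak of the spectrum."""
    return [slope * peak + intercept for peak in peaks]


if __name__ == "__main__":
    # Illustrative reference masses for trypsin autolysis peaks (assumption).
    reference = [842.51, 1045.56, 2211.10]
    # Simulated miscalibrated measurements of the same peaks.
    observed = [mass * 1.0001 + 0.02 for mass in reference]
    slope, intercept = fit_linear_calibration(observed, reference)
    print(recalibrate(observed, slope, intercept))
```

Once the correction is fitted on the contaminant peaks, it is applied to all peaks, which allows a narrower mass tolerance in the database search.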

Some PMF tools
Non-exhaustive list!
Tool: source website
- Aldente
- Mascot
- MS-Fit: prospector.ucsf.edu/
- ProFound: prowl.rockefeller.edu/profound_bin/WebProFound.exe
- PepMAPPER: wolf.bms.umist.ac.uk/mapper/
- PeptideSearch
- PepFrag: prowl.rockefeller.edu/prowl/pepfragch.html

MS-Fit and MOWSE*
- NCBInr and other databases, index of masses (many enzymes).
- Considers chemical and biological modifications.
- Statistical score that considers the mass frequencies: the frequency of peptide masses is calculated over all protein masses for the whole database, and the frequencies are then normalised. The protein score is inversely proportional to the product of the normalised frequencies of the matched masses.
- The pFactor reduces the weight of masses with missed cleavages in the frequency computation.
Note: why talk about MS-Fit before Mascot? Because its scoring algorithm (Mowse) is used by Mascot, so it is better to explain it first.
*MOlecular Weight SEarch

Mowse score considers the peptide frequencies.

Mascot
- Choice of several databases.
- Considers multiple chemical modifications.
- 0 to 9 missed cleavages.
- Score based on a combination of probabilistic and statistical approaches (built on the Mowse score).
- Does not consider SwissProt annotations.

Mascot - principles
Probability-based scoring
- Computes the probability P that a match is random.
- Significance threshold p < 0.05 (accepting that the probability of the observed event occurring by chance is less than 5%).
- The significance of a result depends on the size of the database being searched.
- Mascot shades the insignificant hits in green.
- Score = -10 log10(P)
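The score formula can be checked numerically. In the sketch below, `probability_score` is the formula from the slide. The second function, `significance_threshold`, is an assumption about how database size enters: it divides the 5% level by the number of database entries, so that a larger database requires a higher score. Both function names are hypothetical.

```python
import math


def probability_score(p):
    """Score = -10 * log10(P), where P is the probability that a match is random."""
    return -10.0 * math.log10(p)


def significance_threshold(num_entries, alpha=0.05):
    """Score needed for significance level `alpha` in a database of `num_entries`.

    Assumption for illustration: the chance of a random match is taken to
    scale with the number of entries searched.
    """
    return -10.0 * math.log10(alpha / num_entries)


if __name__ == "__main__":
    print(probability_score(1e-5))            # P = 0.00001 gives a score of 50
    print(significance_threshold(1_500_000))  # about 74.8 for 1.5 million entries
    print(significance_threshold(15_000))     # smaller database, lower threshold
```

Under this assumption the same score can be significant in a small, species-restricted database and insignificant in a large one.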

Mascot Input

Decoy Output Hints about the significance of the score

Output: sequence coverage, peptides matched, error function
The first time a peptide match to a query (one spectrum) appears in the report, it is shown in bold face. Whenever the top-ranking peptide match appears, it is shown in red. This means that protein hits with peptide matches that are both bold and red are the most likely assignments. These hits represent the highest-scoring protein that contains one or more top-ranking peptide matches.

Aldente
- SwissProt/TrEMBL database, indexed masses (trypsin and many other enzymes).
- Considers chemical modifications and user-specified modifications.
- Considers biological modifications (SWISS-PROT annotations).
- 0 or 1 missed cleavages.
- Uses a robust alignment method (Hough transform), which determines the deviation function of the spectrometer, resolves ambiguities, and is less sensitive to noise.

Aldente – matching and alignment principle
All the (peak, peptide) pairs are considered.
[Figures on the following slides: experimental masses (peaks) plotted against theoretical masses (peptides)]

The user defines the area of the error space to search in
Dalton error space; ppm error space; Dalton and ppm error space.

Aldente locates and solves ambiguities
Examples: 1 peak / 3 peptides; 2 peaks / 2 peptides.

Aldente finds the best way to fit the peptides with the peaks

Aldente – summary
The Hough transform estimates, from the experimental data, the deviation function of the mass spectrometer (the calibration error function). The program optimizes the set of best matches, excluding noise and outliers, to find the best alignment.
[Figure: experimental masses (peaks) vs. theoretical masses (peptides), showing the spectrometer calibration error and the internal error]

Aldente - Input

Output Hints about the significance of the score

Information from Swiss-Prot annotation
Processed protein (signal peptide is cleaved).

What is the expected information in an identification result?
- A summary of the search parameters
- A list of potentially identified proteins (AC numbers) with scores and other evidence
- Possibilities to validate or invalidate the provided results (information on the data processing, on the statistics, links to external resources, etc.)
- Possibilities to export the (validated) data in various formats

Hints to know when the identification is correct
- The higher the number of masses that match the protein, the better.
- Good sequence coverage: the larger the sub-sequences and the higher the sequence coverage value, the better.
- Scores: the higher, the better.
- Distance to the 2nd hit (but there can be more than one protein).
- Filter on the correct species if you know it (reduces the search space, time, and errors).
- Better when the errors are more or less constant among all peptides found.
- If you have time, try many tools and compare the results.
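Sequence coverage, one of the hints above, is the fraction of residues in the protein that lie inside at least one matched peptide. A minimal sketch follows, with a hypothetical function name and a short fragment used as the protein.

```python
def sequence_coverage(protein, matched_peptides):
    """Fraction of protein residues covered by at least one matched peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in matched_peptides:
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    protein = "SIPETQKGVIFYESHGKLEHK"   # 21 residues
    matched = ["SIPETQK", "LEHK"]       # 7 + 4 residues matched
    print(round(100 * sequence_coverage(protein, matched), 1), "% coverage")
```

Overlapping peptides, such as those produced by missed cleavages, are counted once per residue, so the value never exceeds 100%.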

Protein characterization with PMF data
One protein entry does not represent one unique molecule:
- Exact primary structure
- Splicing variants
- Sequence conflicts
- PTMs
Characterization tools at ExPASy using peptide mass fingerprinting data (prediction tools):
- FindMod: PTMs and AA substitutions
- GlycoMod: oligosaccharide structures
- FindPept: unspecific cleavages

Detection of PTMs in MS
A shift in m/z between an unmodified tryptic mass and an observed mass points to a PTM.
[Figure: two spectra, m/z axis from 600 to 2200]
Unmodified tryptic masses: 624.3, 769.8, 893.4, 994.5, 1056.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6
Tryptic masses of a modified protein: 624.3, 769.8, 893.4, 994.5, 1070.1, 1326.7, 1501.9, 1759.8, 1923.4, 2100.6
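The comparison in this figure, where the peak at 1056.1 moves to 1070.1, can be sketched as a search for unexplained peaks that sit at a known mass shift from a theoretical peptide. The shift table is a small, rounded, illustrative subset and the function name is hypothetical; a real tool uses a far larger table and also checks that the peptide contains a residue that can carry the modification.

```python
# Rounded mass shifts in Da for a few common modifications (illustrative subset).
KNOWN_SHIFTS = {
    "methylation": 14.02,
    "oxidation": 16.00,
    "acetylation": 42.04,
    "phosphorylation": 79.98,
}


def find_putative_modifications(theoretical, observed, tolerance=0.1):
    """Return (theoretical mass, observed mass, modification) candidates."""
    # Peaks already explained by an unmodified peptide are ignored.
    unmatched = [
        obs for obs in observed
        if not any(abs(obs - theo) <= tolerance for theo in theoretical)
    ]
    hits = []
    for obs in unmatched:
        for theo in theoretical:
            for name, shift in KNOWN_SHIFTS.items():
                if abs((obs - theo) - shift) <= tolerance:
                    hits.append((theo, obs, name))
    return hits


if __name__ == "__main__":
    unmodified = [624.3, 769.8, 893.4, 994.5, 1056.1,
                  1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
    measured = [624.3, 769.8, 893.4, 994.5, 1070.1,
                1326.7, 1501.9, 1759.8, 1923.4, 2100.6]
    print(find_putative_modifications(unmodified, measured))
```

With the masses from the figure, the only unexplained peak is 1070.1, and its 14.0 Da distance from 1056.1 is consistent with a methylation within the chosen tolerance.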

FindMod
http://www.expasy.org/tools/findmod/
Input: DB entry, experimental masses, experimental options, AA modifications

FindMod output
- unmodified peptides, modified peptides known in SWISS-PROT, and chemically modified peptides
- putatively modified peptides predicted by mass differences
- putative AA substitutions
