Popitam, a mutation/modification-tolerant method for identifying proteins from mass spectrometry (MS/MS) data
Patricia Hernandez, Swiss Institute of Bioinformatics

Overview
- proteomics
  - proteome
  - proteome visualization: 2D gels
- protein identification
  - classical workflow
  - shared peak count
- modifications and identification
  - modified peptides
  - SPC
  - spectral alignment, de novo sequencing, tag extraction
- Popitam
  - overview
  - tags
  - scoring function, genetic programming
  - some results

Proteome
--> Proteomics: the science that studies the proteins expressed by a genome
--> proteome: changes with the state of development, the tissue or the environmental conditions
--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins...

2D gels
--> a simple way to "see" a proteome
--> numerous proteins from a biological sample (example: blood) are separated according to 2 criteria:
    - molecular weight of the protein
    - isoelectric point
--> this method separates thousands of proteins simultaneously and displays them on a two-dimensional map
--> spot = (generally) one purified protein
--> we can "see" the proteins, but we do not know which protein a given spot corresponds to...

Spot identification: classical workflow
--> identify a spot = give a protein name to a spot
--> protein databases (for example SwissProt)
    - record all known protein sequences
    - annotated
MS identification (PMF):
- select an unknown purified protein
- cut the aa chain into peptides (after every K and R aa)
- measure the mass of the peptides by MS
MS/MS identification:
- select a peptide
- fragment it
- measure the mass of the fragments by MS
Example:
protein: MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…
peptides: MGQGWATAGLPSFR PEPYK CYGHPVPSQEASQQVTVK...
fragments: MG MGQ MGQGWA WAT WATA...
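The "cut after every K and R" step can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide; the function name `digest` is hypothetical, and real trypsin rules (for example missed cleavages) are ignored.

```python
def digest(sequence, cleave_after="KR"):
    """Cut a protein sequence after every residue listed in cleave_after."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder when the sequence does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


protein = "MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK"
peptides = digest(protein)
```

On the slide's example sequence this yields the three peptides shown on the slide, followed by the final peptide THGTSSQATTSSQK.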

Shared peak count
MS spectrum: list of the masses of the peptides that constitute the protein of interest
MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest
--> detection --> ions --> noise
Protein database (example entry: hbb_human):
- MS: virtually cut the theoretical sequence into peptides and compute the masses
- MS/MS: virtually cut the theoretical sequence into peptides, further cut the peptides into fragments, and compute the masses
Compare the lists of experimental and theoretical masses in order to find the best match between experimental and virtual spectra.
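A minimal sketch of a shared peak count, assuming spectra are plain lists of masses and that two masses match when they differ by at most a tolerance. The names `shared_peak_count` and `best_match`, the tolerance value and the toy masses are all hypothetical.

```python
import bisect


def shared_peak_count(experimental, theoretical, tol):
    """Count experimental masses that lie within tol of some theoretical mass."""
    theo = sorted(theoretical)
    count = 0
    for mass in experimental:
        i = bisect.bisect_left(theo, mass - tol)  # first theoretical mass >= mass - tol
        if i < len(theo) and theo[i] <= mass + tol:
            count += 1
    return count


def best_match(experimental, database, tol):
    """Return the database entry whose virtual spectrum shares the most peaks."""
    return max(database, key=lambda name: shared_peak_count(experimental, database[name], tol))


# Toy masses, invented for illustration only.
exp_spectrum = [100.1, 200.0, 350.4, 500.0]
virtual_spectra = {
    "candidate_1": [100.0, 200.3, 350.0, 420.0],
    "candidate_2": [100.0, 260.0, 410.0],
}
spc = shared_peak_count(exp_spectrum, virtual_spectra["candidate_1"], tol=0.5)
winner = best_match(exp_spectrum, virtual_spectra, tol=0.5)
```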

Modified peptides (1)
The sequence in the database may differ from the experimental peptide:
PTMs
--> most eukaryote proteins
--> addition of a chemical group:
    - methylation: +14
    - phosphorylation: +80
    - glycosylation: >800...
--> participate in:
    - protein structures
    - protein functions
    - control of metabolic pathways
CONFLICT (different sources report differing sequences)
--> in about 4'600 human entries
VARIANT (authors report that sequence variants exist) = alleles
--> in about 2'200 human entries
MUTATIONS associated with diseases
--> 187 references to mutations and diseases in the COMMENTS section

Modified peptides (2)
[Figure: a modified protein (MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…) --> digestion --> MS, selection of the peptide (PEPYK) --> fragmentation (PEP, PYK, EPYK); two spectra, axes: m/z, intensity]

SPC and modified peptides
"Shared peak count" algorithms have to introduce modifications into the theoretical peptide databases.
[Figure: theoretical peptide spectrum, experimental MS/MS spectrum and modified experimental MS/MS spectrum; axes: m/z, intensity]

Database size (1)
Protein: AAIEGKLMQRAPALK
Initial database: AAIEGK LMQR APALK
New database, if the two following modifications are taken into account:
- modification occurring on amino acid A: A->a
- modification occurring on amino acids L: L->l and E: E->e
= all the peptides from the initial database, plus all modified peptides that can be built from the initial database:
AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK
LMQR lMQR
APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK
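The expansion of the database can be reproduced with a short enumeration. This sketch follows the slide's notation (a lowercase letter marks a modified residue); the function name `modified_variants` is hypothetical.

```python
from itertools import product


def modified_variants(peptide, modifiable):
    """List every variant of peptide in which each modifiable residue
    is either left as is (uppercase) or modified (lowercase)."""
    choices = [(aa, aa.lower()) if aa in modifiable else (aa,) for aa in peptide]
    return ["".join(combo) for combo in product(*choices)]


initial_database = ["AAIEGK", "LMQR", "APALK"]
new_database = [v for pep in initial_database for v in modified_variants(pep, "ALE")]
```

Each peptide with k modifiable positions contributes 2^k entries, so the three peptides of the slide grow to 8 + 2 + 8 = 18 entries.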

Database size (2)
Aim: assess the number of peptides that contain zero, one, two... "positions" for a possible modification.
B(L,p,k) gives the probability of having k positions of modification in a sequence of length L, if p is the probability that a position may be modified (we assume the positions to be independent).
L = 10, p = 1/20: 800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c
L = 10, p = 5/20: 800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
N0: xxxx
N1: oxxx xoxx xxox xxxo
N2: ooxx oxox oxxo xoox xoxo xxoo
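The two breakdowns are consistent with B(L,p,k) being the binomial probability C(L,k) p^k (1-p)^(L-k), which follows from the independence assumption. A minimal sketch that recomputes them (the function names are hypothetical):

```python
from math import comb


def binomial(L, p, k):
    """Probability that exactly k of L independent positions are modifiable."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def breakdown(n, L, p, k_max):
    """Expected number of peptides (out of n) with 0, 1, ..., k_max modifiable positions."""
    return [round(n * binomial(L, p, k)) for k in range(k_max + 1)]


one_locus = breakdown(800_000, 10, 1 / 20, 4)
five_loci = breakdown(800_000, 10, 5 / 20, 4)
```

The computed values agree with the slide's figures to within a few units; the small differences look like rounding on the slide.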

Database size (3)
Expected numbers of peptides that may contain exactly M modifications (formula not preserved in the transcript)
Expected size of the database when taking into account 0 to M modifications (formula not preserved in the transcript)
N0: xxxx
N1: oxxx xoxx xxox xxxo
N2: ooxx oxox...

Database size (3)
SwissProt Human, 10'000 proteins
n = 806'787 peptides in the mass range [300, 3000] (approximately 3 to 30 aa)
L = 11 amino acids
- 0 to 3 modifications occurring on one specific amino acid: p = 1/20
  P_0to3_mod = 1'375'700 + c
- 0 to 3 modifications that may occur on several loci, e.g. phosphorylation: H, D, S, T, Y (eukaryotes): p = 5/20
  P_0to3_mod = 4'865'100 + c
- 0 to 3 modifications that may occur on every amino acid: p = 1
  P_0to3_mod = 3.97e12 + c
Mutation scenario: each amino acid may mutate into one of the remaining 19 amino acids
  all possible words = 19^k - 1
  P_1_mut = 1.16e14
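The formula behind these totals is not in the transcript. One reading that reproduces the first two figures is the following assumption: a peptide with k modifiable positions contributes one database entry for each way of modifying 0 to M of them, so the expected size is n * sum_k B(L,p,k) * sum_{m=0..min(k,M)} C(k,m), with the sum over k stopped at k = 4 and the remainder left in the "+ c" term. The sketch below implements that reading; the function names are hypothetical, and it does not reproduce the p = 1 and mutation figures.

```python
from math import comb


def binomial(L, p, k):
    """Probability that exactly k of L independent positions are modifiable."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def expected_database_size(n, L, p, max_mods, k_max):
    """Assumed reading: every peptide with k modifiable positions is stored once
    per subset of at most max_mods modified positions; terms beyond k_max are dropped."""
    total = 0.0
    for k in range(k_max + 1):
        variants = sum(comb(k, m) for m in range(min(k, max_mods) + 1))
        total += binomial(L, p, k) * variants
    return n * total


size_one_locus = expected_database_size(806_787, 11, 1 / 20, max_mods=3, k_max=4)
size_five_loci = expected_database_size(806_787, 11, 5 / 20, max_mods=3, k_max=4)
```

Under this assumption the two results fall within about fifty peptides of 1'375'700 and 4'865'100, which is compatible with the slide rounding to the nearest hundred.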

Other strategies
2 major problems:
- size of the database
- a priori knowledge of the deltaMass due to the modification
Solutions: define an identification algorithm that is not based on an SPC
--> spectral convolution/alignment
    - PEDENTA (2000)
--> de novo sequencing followed by sequence matching
    - extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...
    - extraction of one or several small tags (PeptideSearch, 1994), Patchwork sequencing...
--> Popitam (2003): "guided" sequencing

Spectral convolution/alignment
Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J.Comput.Biol. 2000, 7:777-787
Key idea: k-similarity D(k)
Given S_exp and S_theo, the goal is to find a series of k shifts in S_exp that makes S_exp and S_theo as similar as possible. D(k) represents the maximum number of elements in common between a theoretical and an experimental spectrum after k shifts.
$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1 & \text{if } (i',j') \text{ and } (i,j) \text{ are co-diagonal} \\ D_{i'j'}(k-1) + 1 & \text{otherwise} \end{cases}$$
[Figure: theo. MS/MS spectrum (peaks A to F) aligned with the exp. MS/MS spectrum. SPC score: D(k=0) = 2; SA score: D(k=2) = 6]
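A naive dynamic-programming sketch of k-similarity, written for readability rather than speed. It assumes integer masses, anchors both spectra at mass 0 so that D(0) equals the shared peak count, and counts "at most k" shifts. The function name and the toy spectra are hypothetical; real data would need a mass tolerance instead of exact equality.

```python
def k_similarity(theoretical, experimental, max_shifts):
    """Return [D(0), ..., D(max_shifts)] for two spectra given as integer masses."""
    a = [0] + sorted(theoretical)   # anchor both spectra at mass 0
    b = [0] + sorted(experimental)
    unreachable = -1
    D = [[[unreachable] * (max_shifts + 1) for _ in b] for _ in a]
    D[0][0] = [0] * (max_shifts + 1)
    best = [0] * (max_shifts + 1)
    for i in range(1, len(a)):
        for j in range(1, len(b)):
            for k in range(max_shifts + 1):
                for ip in range(i):
                    for jp in range(j):
                        # Co-diagonal pairs keep the same offset between the spectra.
                        if a[i] - a[ip] == b[j] - b[jp]:
                            prev = D[ip][jp][k]
                        elif k > 0:
                            prev = D[ip][jp][k - 1]   # a new shift is spent here
                        else:
                            prev = unreachable
                        if prev != unreachable and prev + 1 > D[i][j][k]:
                            D[i][j][k] = prev + 1
                best[k] = max(best[k], D[i][j][k])
    return best


theo = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
exp = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
scores = k_similarity(theo, exp, max_shifts=2)
```

In this toy case the second half of the experimental spectrum is displaced by 5 mass units: five peaks are shared without any shift, and a single shift recovers all ten.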

De novo sequencing
Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075
--> longest path problem in a directed acyclic graph
--> dynamic programming
--> complete sequences
--> mutations, but no modifications
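The longest-path formulation can be illustrated on a toy graph. The sketch assumes nodes are integer (nominal) prefix masses, and that an edge joins two nodes when their difference equals one residue mass from a small, deliberately incomplete table; it is not the algorithm of the cited paper.

```python
# Nominal residue masses for a few amino acids (I and Q are omitted because
# they share the nominal masses of L and K).
RESIDUES = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 113: "L", 128: "K", 129: "E"}


def longest_sequence(nodes):
    """Longest path in the DAG whose edges link masses one residue apart."""
    nodes = sorted(nodes)
    best = {mass: "" for mass in nodes}   # best sequence ending at each node
    for idx, mass in enumerate(nodes):
        for prev in nodes[:idx]:          # predecessors are always lighter
            aa = RESIDUES.get(mass - prev)
            if aa is not None and len(best[prev]) + 1 > len(best[mass]):
                best[mass] = best[prev] + aa
    return max(best.values(), key=len)


# Prefix masses of the toy peptide GAVLK, plus two noise peaks (150 and 300).
sequence = longest_sequence([0, 57, 128, 227, 340, 468, 150, 300])
```

Note that the edge 0 --> 128 (K) competes with the two-edge path G, A; the dynamic programme keeps the longer one.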

Tag extraction
Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal.Chem. 1994, 66:4390-4399
--> island of sequence ions
--> the tags (m1-SEQ-m2) are manually extracted
--> 2 steps: tags as filtering, then SPC
Schlosser A, Lehmann WD: Patchwork peptide sequencing: Extraction of sequence information from accurate mass data of peptide tandem mass spectra recorded at high resolution. Proteomics. 2002, 2:524-533
--> based on very accurate masses (10 mDa)
--> small tags are extracted from low mass regions (2 aa)

Popitam: key ideas
Spectrum graph
--> good way to structure the information contained in the MS/MS spectrum, allows mutations
Tags
--> modified source peptides --> fragmented spectra
Search space
--> use database information during tag extraction
--> take into account only mutations compatible with the spectrum (graph)
--> make only modification scenarios compatible with the current theoretical peptide
Scoring function
--> take into account a lot of parameters
--> genetic programming

Popitam overview
[Diagram: MS/MS spectrum --> graph (initial node ... final node); peptide sequence database (any source of biological sequences) --> filter --> P1, P2...]
For each Pi: extractTags(); processTags(); score();
--> IDENTIFICATION: I(P1), I(P2)...

Spectrum graph
measured mass [m/z] --> bMass (ideal fragmentation)
- selection based on intensity
- for each peak, make all possible hypotheses (b+, b+ -NH3, y++, a+ -H2O...)
- # nodes > # peaks
- families
"N-term": bMass = chargeNb * m/z - (chargeNb-1) - offset
"C-term": bMass = PM - […]
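The N-term conversion can be written down directly from the slide. In the sketch below only the formula comes from the slide; the offset table, the charge states and the function names are illustrative assumptions, and the C-term case is left out because its formula is elided in the transcript.

```python
def b_mass_n_term(mz, charge, offset):
    """Slide formula for N-terminal ions: bMass = chargeNb * m/z - (chargeNb - 1) - offset."""
    return charge * mz - (charge - 1) - offset


# Hypothetical offsets (observed ion minus ideal b ion), for illustration only.
ASSUMED_OFFSETS = {"b": 0.0, "b-NH3": -17.03, "b-H2O": -18.01}


def node_hypotheses(mz, charges=(1, 2)):
    """One candidate node per (ion type, charge) hypothesis for a single peak."""
    return [
        (ion, charge, b_mass_n_term(mz, charge, offset))
        for charge in charges
        for ion, offset in ASSUMED_OFFSETS.items()
    ]


hypotheses = node_hypotheses(400.0)
```

With three ion types and two charge states, one peak already produces six candidate nodes, which is why the graph has more nodes than the spectrum has peaks.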

Tag extraction
[Figure: spectrum graph with 9 nodes and 11 edges --> 21 tags]
peLTE peLet peLvm peITE peIet peIvm petlE
LTE Let Lvm ITE Iet Ivm tlE
ck TE et vm go E V
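Unguided tag extraction amounts to listing every path of the graph, which is why a small graph already yields many tags. A minimal sketch on a toy graph (not the slide's 9-node graph, whose edges are not recoverable from the transcript); the dictionary layout and the function name are assumptions.

```python
def all_tags(graph):
    """Enumerate the label string of every path with at least one edge.
    graph maps a node to a list of (next_node, edge_label) pairs."""
    tags = []

    def walk(node, prefix):
        for next_node, label in graph.get(node, []):
            tag = prefix + label
            tags.append(tag)
            walk(next_node, tag)

    for start in graph:
        walk(start, "")
    return tags


# Toy acyclic graph: 4 nodes, 4 edges (two parallel edges L and I between nodes 0 and 1).
toy_graph = {0: [(1, "L"), (1, "I")], 1: [(2, "T")], 2: [(3, "E")]}
tags = all_tags(toy_graph)
```

Even this four-edge graph produces nine tags, so the number of tags grows much faster than the number of edges.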

Tag extraction (2)
Pentium, 1.6 GHz

AIGGGLSSVGGSSTIK (1159 peaks)
1    16/97      5.6*10^4     0m02s
2    30/338     5.4*10^6     0m27s
3    44/692     5.7*10^7     3m16s
4    58/1121    3.4*10^8     21m09s
5    72/1667    2.3*10^9     2h17m07s

AHFSISNSAEDPFIAIHADSK (145 peaks)
1    24/121     6.1*10^4     0m02s
2    46/308     1.9*10^8     16m15s
3    68/831     2.0*10^10    22h06m47s

LVNELTEFAK (125 peaks)

Tag extraction (3)
Recursively extract from the graph all tags that are compatible with the current theoretical peptide
--> a tag = a path (bMass, edge label, ionic hypothesis…)
[Figure: theoretical peptide ACCACMCAK; tags shown on the graph: CACMCAK, CMCAK, MCAK, CA, Ck, A, k]
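The idea of guiding the extraction with the theoretical peptide can be sketched as a recursive walk that only follows edges whose label matches the next residues of the peptide. This is a simplified illustration under stated assumptions: exact label matching only, no mutation or modification gaps, a hypothetical graph layout and function name.

```python
def guided_tags(graph, peptide, min_len=2):
    """Return the set of (tag, node_path) pairs that spell a substring of peptide.
    graph maps a node to a list of (next_node, edge_label) pairs."""
    found = set()

    def walk(node, pos, start, path):
        for next_node, label in graph.get(node, []):
            if peptide.startswith(label, pos):   # edge agrees with the peptide
                new_path = path + [next_node]
                end = pos + len(label)
                if end - start >= min_len:
                    found.add((peptide[start:end], tuple(new_path)))
                walk(next_node, end, start, new_path)

    for node in graph:
        for pos in range(len(peptide)):
            walk(node, pos, pos, [node])
    return found


toy_graph = {0: [(1, "L"), (1, "I")], 1: [(2, "T")], 2: [(3, "E")]}
tags = guided_tags(toy_graph, "LVNELTEFAK")
```

On this toy graph the peptide prunes the search to three tags (LT, LTE, TE); the branch through the edge labelled I is never explored.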

Tag processing
- discard subtags
- discard tags that begin the theo. peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags

AVVQDPALKPLALVYGEATSR
PeakNb: 1260   ParentMass: 2197.15   NodeNb: 86   EdgeNb: 142 / 1098
29 tags --> 13 subSeqs

Tag          Nodes
KplALVYGE    30 39 43 45 50 58 63 64 68
plALVYGE     39 43 45 50 58 63 64 68
ALVYGE       43 45 50 58 63 64 68
LVYGE        45 50 58 63 64 68
VYGE         50 58 63 64 68
YGE          58 63 64 68
paLKplALvy   0 4 10 16 22 26 31 42
LKplALvy     4 10 16 22 26 31 42
KplALvy      10 16 22 26 31 42
plALvy       16 22 26 31 42
ALvy         22 26 31 42
LKPla        10 13 19 22 31
LKPla        10 14 19 22 31
KPla         13 19 22 31
KPla         14 19 22 31
PLAlv        29 35 40 42 48
LAlv         35 40 42 48
DpaL         65 69 78 84
LKP          11 15 20 24
LVY          16 19 24 29
LVY          44 49 57 62
PAL          19 22 26 31
QDP          10 16 20 24
alkpL        54 63 71 75
avVqd        0 5 9 18
dpAL         37 43 45 50
avVQD        55 60 65 70 75
VQD          60 65 70 75
paLK         59 66 69 75
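The first processing rule, discarding subtags, can be sketched as follows. The sketch assumes that a subtag is a tag whose node path is a contiguous part of a longer tag's path; the function name is hypothetical and only a few of the tags listed above are used.

```python
def discard_subtags(tags):
    """Keep only tags whose node path is not contained in a longer tag's path.
    tags is a list of (label, node_path) pairs, node_path being a tuple."""

    def is_subpath(short, long):
        n = len(short)
        return n < len(long) and any(long[i:i + n] == short for i in range(len(long) - n + 1))

    return [tag for tag in tags if not any(is_subpath(tag[1], other[1]) for other in tags)]


some_tags = [
    ("KplALVYGE", (30, 39, 43, 45, 50, 58, 63, 64, 68)),
    ("plALVYGE", (39, 43, 45, 50, 58, 63, 64, 68)),
    ("YGE", (58, 63, 64, 68)),
    ("LVY", (44, 49, 57, 62)),
    ("avVQD", (55, 60, 65, 70, 75)),
    ("VQD", (60, 65, 70, 75)),
]
kept = [label for label, _ in discard_subtags(some_tags)]
```

Comparing node paths rather than label strings keeps the two LVY-like cases apart: a tag with the same letters but different nodes is not a subtag.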

Subsequence processing (1)
Aim: find all possible arrangements of subsequences, given the theoretical peptide, BUT do not include in the same arrangement tags that are incompatible with the others.
Compatibility rules:
--> no peak shared
--> beginMasses must respect positions in the sequences

#   Subsequence   beginMass   Peaks
0   KplALVYGE     794.41      0 1 2 6 15 19 21 27 30
1   LKPla         282.17      2 7 29 33 41
2   PLAlv         785.34      6 8 19 21 28
3   DpaL          1673.89     14 20 31 36
4   LKP           284.11      17 22 32 36
5   LVY           410.26      14 22 28 29
...

Theoretical peptide: A V V Q D P A L K P L A L V Y G E A T S R (positions 0 to 20)

Compatibility graph
[Figure: compatibility matrix between subsequences 0, 1, 2, 3, 4, 5...; the column alignment is not preserved in the transcript]
Each clique found in the graph is a possible arrangement of subsequences.
Here, 91 cliques, but most of them are really uninteresting.
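A compatibility graph and its cliques can be sketched with the six subsequences listed on the slide. This illustration applies only the first rule (no shared peak), since the positional rule needs information that is not in the transcript, and it lists maximal cliques with a plain Bron-Kerbosch recursion; the slide does not say which clique algorithm Popitam uses.

```python
PEAKS = {
    0: {0, 1, 2, 6, 15, 19, 21, 27, 30},   # KplALVYGE
    1: {2, 7, 29, 33, 41},                 # LKPla
    2: {6, 8, 19, 21, 28},                 # PLAlv
    3: {14, 20, 31, 36},                   # DpaL
    4: {17, 22, 32, 36},                   # LKP
    5: {14, 22, 28, 29},                   # LVY
}


def compatibility_graph(peaks):
    """Link two subsequences when they share no peak (first rule only)."""
    return {a: {b for b in peaks if b != a and not peaks[a] & peaks[b]} for a in peaks}


def maximal_cliques(adjacency):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


arrangements = maximal_cliques(compatibility_graph(PEAKS))
```

With only these six subsequences and the peak rule, five maximal arrangements remain, for example subsequences 1, 2 and 3 together.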

Scoring function (1)
Arrangements for AVVQDPALKPLALVYGEATSR:
- KplALVYGE 794.4, LKP 284.1, LVY 1202.7
- KplALVYGE 794.4, LKP 284.1, LVY 1202.7, avVqd 1.0
- KplALVYGE 794.4, LKP 284.1, avVqd 1.0
- KplALVYGE 794.4, LVY 1202.7
...
--> 2-level scoring:
- scoring linked to the subsequences (local)
  subscores:
  - number of tags that compose the subsequence
  - length of the subsequence
  - occurrence probabilities of the hypothesized ionic types (geometric/arithmetic mean)
- scoring linked to the arrangement (global)
  subscores:
  - global coverage
  - linear regression

Scoring function (2)
How can we combine the subscores in order to build an efficient scoring function?
--> empirical function (expert knowledge)
--> probabilistic function
--> function built using GENETIC PROGRAMMING

GENETIC PROGRAMMING
population of "programs": trees
- nodes:
  - mathematical operators (+, -, *, /, ^,...)
  - boolean operators (AND, OR, NOT...)
  - conditional operators (if-then-else...)
  - iterative functions (do-until...)
  - other specific functions...
- leaves: subscores, coefficients
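A program of this kind can be represented as a nested tuple and evaluated recursively. The sketch below is a minimal illustration, assuming tuples of the form (operator, arguments...), string leaves naming subscores and numeric leaves as coefficients; the subscore names and the example tree are invented.

```python
import operator

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul, ">": operator.gt}


def evaluate(tree, subscores):
    """Evaluate a scoring-function tree on a dictionary of subscores."""
    if isinstance(tree, str):             # leaf: a named subscore
        return subscores[tree]
    if isinstance(tree, (int, float)):    # leaf: a coefficient
        return tree
    op, *args = tree
    if op == "if":                        # conditional node: (if, condition, then, else)
        condition, then_branch, else_branch = args
        chosen = then_branch if evaluate(condition, subscores) else else_branch
        return evaluate(chosen, subscores)
    if op == "/":                         # protected division, a common GP convention
        numerator, denominator = (evaluate(arg, subscores) for arg in args)
        return numerator / denominator if denominator != 0 else 1.0
    return FUNCTIONS[op](*(evaluate(arg, subscores) for arg in args))


# Hypothetical candidate: 2 * coverage + (length if tag_count > 3 else 0)
candidate = ("+", ("*", 2.0, "coverage"), ("if", (">", "tag_count", 3), "length", 0.0))
score = evaluate(candidate, {"coverage": 0.5, "tag_count": 4, "length": 9})
```

Keeping programs as plain nested tuples makes the genetic operators simple to express, since mutation and crossing-over reduce to replacing or swapping sub-tuples.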

Genetic operators (1)
Initialization: programs are initially determined at random (structure, functions, values).
Iterations: at each iteration, the programs are evaluated (fitness function). Only the best are allowed to reproduce, using genetic operators (permutation, mutation, crossing-over...).

Genetic operators (2)

Genetic programming
Genetic programming allows testing several scoring functions and making them "cleverly" evolve in order to find an optimal one.
[Diagram: tree population (scoring function 1, scoring function 2, scoring function 3), each evaluated by the fitness function]
Fitness:
if (correctId())
    s_i ∈ ]0.5;1[ (according to the discriminative power)
else {
    if (belongToList())
        s_i ∈ ]0;0.5] (according to the position in the list)
    else
        s_i = 0;
}
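The fitness rule can be sketched in Python once the two unspecified mappings are fixed. Everything beyond the three cases of the slide is an assumption here: the discriminative power is taken to be the score gap between the correct peptide and the runner-up, the position in the list is mapped linearly, and the function names are hypothetical.

```python
def spectrum_fitness(rank, gap, list_size):
    """Fitness contribution s_i of one spectrum.
    rank: 1-based position of the correct peptide in the result list, or None.
    gap: assumed measure of discriminative power (score gap to the runner-up, > 0)."""
    if rank == 1:
        # Correct identification: value in ]0.5;1[, growing with the gap.
        return 0.5 + 0.5 * gap / (1.0 + gap)
    if rank is not None and rank <= list_size:
        # Correct peptide is in the list but not first: value in ]0;0.5].
        return 0.5 * (list_size - rank + 1) / list_size
    return 0.0


def fitness(results, list_size=10):
    """Mean s_i over a training set of (rank, gap) results."""
    return sum(spectrum_fitness(rank, gap, list_size) for rank, gap in results) / len(results)


training_results = [(1, 1.0), (2, 0.0), (None, 0.0)]
overall = fitness(training_results)
```

A scoring function that ranks the correct peptide first with a clear margin on most training spectra therefore approaches a fitness of 1.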

Some results
