Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes.

Similar presentations


Presentation on theme: "A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes."— Presentation transcript:

1 A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes

2 Outline of the talk  Protein families signatures  Similar Fragment Merging Approach (Protomata-L) Characterization  Similar Fragment Pairs (SFPs)  Ordering the SFPs Generalization  Merging of SFP in an automaton  Gap generalization  Identification of Physico-chemical properties  Experiments

3 Protein families  Amino acid alphabet :  Protein sequence :  Protein data set : >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY PIKSNQTTGAVQDNVKVSLAFGLSI… >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY PIKSNQTTGAVQDNVKVSLAFGLSI… >AQP2_RAT MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWAS SPPSVLQIAVAFGLGIGILVQALGH… >AQP3_MOUSE MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLIL VMFGCGSVAQVVLSRGTHGGFLT… {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} Common function & Common topology (3D structure)

4 Characterization of a protein family x x x x x x C H x \ / x x Zn x x / \ x C H x x C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H ZBT11...Csi..CgrtLpklyslriHmlk..H... ZBT10...Cdi..CgklFtrrehvkrHslv..H... ZBT34...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern

5 Expressivity classes of patterns PROSITE PRATT TEIRESIAS PROTOMATA-L

6 Characterization

7 Similar Fragment Pairs  Significantly similar fragment pairs (SFPs)  Natural selection  Important area characterization Data set D:

8 Ordering the SFPs  Problem :  Solution : ordering the SFPs by scoring each SFP S(f1,f2)= ? 3 different scoring functions :  dialign Sd  support Ss  implication Si

9 Dialign Score  Sd ( f1, f2 ) = - log P ( L, Sim ) L = |f1| = |f2| Sim = Sum of the individual similarity values P = Probability that a random SFP of the same L has the same S Blossum62 similarity

10 Support Score  Taking into account the representativeness of SFP  Ss (f1,f2,D) = Number of sequences supporting f1 f2 f is supported by f with respect the triangular inequality : Sd(f,f1) + Sd(f,f2)  Sd(f1,f2)

11 Implication Score  Taking into account a counter-example set N  Discriminative fragments  Lerman index: Si(f1,f2,D,N) =  avec P(X) = -P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N) P( Ss(f1,f2,D) ) x |N| |X| |D| + |N|

12 Generalization

13 From protein data sets to automata MASEIKLFW M A S E I K L F W

14 From protein data sets to automata MASEIKLFW MGYEVKYRV M G Y E V K Y R V M A S E I K L F W

15 Merging SFPs MASEIKLFW MGYEVKYRV M G Y E V K Y R V M A S E I K L F W

16 Merging SFPs MASEIKLFW MGYEVKYRV M G Y E K Y R V M A S L F W [I, V]

17 Merging SFPs MASEIKLFW MGYEVKYRV M G Y E [I, V] K Y R V M A S L F W MASEVKLFM MGYEIKYRV MASEIKYRV MGYEVKLFW MASEVKYRV MGYEIKLFW

18 Protein Sequence Data Set List of SFPs MCA Automaton / Regular Expression Ordered List of SFPs MERGING

19 Gap Generalization  Merging on themself non-representative transitions  Treat them as "gaps"

20 Identification of Physico-chemical properties  Similar Fragments ~ potential function area  Amino acids share out the same position  Physicochemical property at play  => Generalization from a group (of amino acids) to a Taylor group I,V I,Q,W,P aliphatic x I,L,V no information C C [I,V] [I,L,V] C C [I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

21 Likelihood ratio test  To decide if the multi-set A has been generated according to a physico-chemical group G or not by a likelihood ratio test:  Given a threshold, we test the expansion of A to G and reject it when LR G/A <

22 Experiments

23 MIP : the Major Intrinsic Protein Family Family  MIP Subfamilies  AQP, Glpf, Gla

24 Data sets UNIPROT MIP in SWISS-PROT Set « T » (159 seq) Set « M» (44 seq) identity<90% Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq) Blast(1

25 Experiments  First Common Fragment on a Family MIP family Positive set Comparison with pattern discovery tools Teiresias Pratt Protomata-L (short pattern)  Water-specific Characterization MIP sub-families Positive and negative sets Leave-one-out cross-validation  Protomata-L (short to long pattern)

26 First Common Fragment Automaton  Results of 4 patterns scanned on Swiss-Prot protein Database Set « M» (44 seq) Learning Set Learning set Set « T » (159 seq) Target set

27 From short automata to long automata  Previous experiment only the first SFPs of the ordered list of SFPs short automaton first common fragment automaton  Next experiment larger cut-offs in the list of SFPs Protomat-L is able to create longer automata with more common subparts Long patterns are closed of the topoly (3D-structure) of the family

28 Water-specific characterization  Leave-one-out cross-validation Learning set  W+ \ Si : Positive learning set  W- \ Sj : Negative learning set Test set  { Si U Sj } Control set  Set T Implication score Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq)

29 Leave-one-out cross-validation

30 Error Correcting Cost The error correcting cost of a sequence S represents the distance (blossum similarity) between S and the closest sequence given by the automaton A. Distibution of sequences with long automata (size Approx. 100)

31 Leave-one-out cross-validation With Error Correcting Cost

32 Leave-one-out cross-validation

33 Conclusion & Perspective  Good characterization of protein family using automata (-> hmm structure)  No need of a multiple alignment  greedy data-driven algorithm Important subparts localization Physico-chemical identification and generalization  Counter example sets  Bringing of knowledge is possible in automata (-> 2D structure)

34 Questions ? ? ? ? ? ? ? ? ? ? ? ? ? ? ??

35 Demo

36 Protomata-L ’s Approach  First Common Fragment

37 Protomata-L ’s Approach  To get a more precise automaton

38 IDENTIFICATION OF PHYSICOCHEMICAL GROUPS Data set (Protein sequences) Pairs of fragments SORT EXTRACTION Initial Automaton(MCA) MERGING IDENTIFICATION OF « GAPS »

39 Structural discrimination

40

41 Aromatique Hydrophobe Non Informatif Generalization of an Aquaporins automaton

42 Physico-chemical properties identification Ratio likelihood test Aliphatic Small x


Download ppt "A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes."

Similar presentations


Ads by Google