Download presentation

Presentation is loading. Please wait.

Published byMarshall Castles Modified over 2 years ago

1
A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes

2
Outline of the talk Protein families signatures Similar Fragment Merging Approach (Protomata-L) Characterization Similar Fragment Pairs (SFPs) Ordering the SFPs Generalization Merging of SFP in an automaton Gap generalization Identification of Physico-chemical properties Experiments

3
Protein families Amino acid alphabet : Protein sequence : Protein data set : >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY PIKSNQTTGAVQDNVKVSLAFGLSI… >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHY PIKSNQTTGAVQDNVKVSLAFGLSI… >AQP2_RAT MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWAS SPPSVLQIAVAFGLGIGILVQALGH… >AQP3_MOUSE MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLIL VMFGCGSVAQVVLSRGTHGGFLT… {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} Common function & Common topology (3D structure)

4
Characterization of a protein family x x x x x x C H x \ / x x Zn x x / \ x C H x x C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H ZBT11...Csi..CgrtLpklyslriHmlk..H... ZBT10...Cdi..CgklFtrrehvkrHslv..H... ZBT34...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern

5
Expressivity classes of patterns PROSITE PRATT TEIRESIAS PROTOMATA-L

6
Characterization

7
Similar Fragment Pairs Significantly similar fragment pairs (SFPs) Natural selection Important area characterization Data set D:

8
Ordering the SFPs Problem : Solution : ordering the SFPs by scoring each SFP S(f1,f2)= ? 3 different scoring functions : dialign Sd support Ss implication Si

9
Dialign Score Sd ( f1, f2 ) = - log P ( L, Sim ) L = |f1| = |f2| Sim = Sum of the individual similarity values P = Probability that a random SFP of the same L has the same S Blossum62 similarity

10
Support Score Taking into account the representativeness of SFP Ss (f1,f2,D) = Number of sequences supporting f1 f2 f is supported by f with respect the triangular inequality : Sd(f,f1) + Sd(f,f2) Sd(f1,f2)

11
Implication Score Taking into account a counter-example set N Discriminative fragments Lerman index: Si(f1,f2,D,N) = avec P(X) = -P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N) P( Ss(f1,f2,D) ) x |N| |X| |D| + |N|

12
Generalization

13
From protein data sets to automata MASEIKLFW M A S E I K L F W

14
From protein data sets to automata MASEIKLFW MGYEVKYRV M G Y E V K Y R V M A S E I K L F W

15
Merging SFPs MASEIKLFW MGYEVKYRV M G Y E V K Y R V M A S E I K L F W

16
Merging SFPs MASEIKLFW MGYEVKYRV M G Y E K Y R V M A S L F W [I, V]

17
Merging SFPs MASEIKLFW MGYEVKYRV M G Y E [I, V] K Y R V M A S L F W MASEVKLFM MGYEIKYRV MASEIKYRV MGYEVKLFW MASEVKYRV MGYEIKLFW

18
Protein Sequence Data Set List of SFPs MCA Automaton / Regular Expression Ordered List of SFPs MERGING

19
Gap Generalization Merging on themself non-representative transitions Treat them as "gaps"

20
Identification of Physico-chemical properties Similar Fragments ~ potential function area Amino acids share out the same position Physicochemical property at play => Generalization from a group (of amino acids) to a Taylor group I,V I,Q,W,P aliphatic x I,L,V no information C C [I,V] [I,L,V] C C [I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

21
Likelihood ratio test To decide if the multi-set A has been generated according to a physico-chemical group G or not by a likelihood ratio test: Given a threshold, we test the expansion of A to G and reject it when LR G/A <

22
Experiments

23
MIP : the Major Intrinsic Protein Family Family MIP Subfamilies AQP, Glpf, Gla

24
Data sets UNIPROT MIP in SWISS-PROT Set « T » (159 seq) Set « M» (44 seq) identity<90% Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq) Blast(1

25
Experiments First Common Fragment on a Family MIP family Positive set Comparison with pattern discovery tools Teiresias Pratt Protomata-L (short pattern) Water-specific Characterization MIP sub-families Positive and negative sets Leave-one-out cross-validation Protomata-L (short to long pattern)

26
First Common Fragment Automaton Results of 4 patterns scanned on Swiss-Prot protein Database Set « M» (44 seq) Learning Set Learning set Set « T » (159 seq) Target set

27
From short automata to long automata Previous experiment only the first SFPs of the ordered list of SFPs short automaton first common fragment automaton Next experiment larger cut-offs in the list of SFPs Protomat-L is able to create longer automata with more common subparts Long patterns are closed of the topoly (3D-structure) of the family

28
Water-specific characterization Leave-one-out cross-validation Learning set W+ \ Si : Positive learning set W- \ Sj : Negative learning set Test set { Si U Sj } Control set Set T Implication score Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq)

29
Leave-one-out cross-validation

30
Error Correcting Cost The error correcting cost of a sequence S represents the distance (blossum similarity) between S and the closest sequence given by the automaton A. Distibution of sequences with long automata (size Approx. 100)

31
Leave-one-out cross-validation With Error Correcting Cost

32
Leave-one-out cross-validation

33
Conclusion & Perspective Good characterization of protein family using automata (-> hmm structure) No need of a multiple alignment greedy data-driven algorithm Important subparts localization Physico-chemical identification and generalization Counter example sets Bringing of knowledge is possible in automata (-> 2D structure)

34
Questions ? ? ? ? ? ? ? ? ? ? ? ? ? ? ??

35
Demo

36
Protomata-L ’s Approach First Common Fragment

37
Protomata-L ’s Approach To get a more precise automaton

38
IDENTIFICATION OF PHYSICOCHEMICAL GROUPS Data set (Protein sequences) Pairs of fragments SORT EXTRACTION Initial Automaton(MCA) MERGING IDENTIFICATION OF « GAPS »

39
Structural discrimination

41
Aromatique Hydrophobe Non Informatif Generalization of an Aquaporins automaton

42
Physico-chemical properties identification Ratio likelihood test Aliphatic Small x

Similar presentations

OK

PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on central limit theorem for dummies Ppt on producers consumers and decomposers 3rd Plant life cycle for kids ppt on batteries Ppt on network switching subsystem Ppt on tricks in mathematics Ppt on meeting etiquettes Ppt on life study of mathematician euclid Ppt on zener diode current Seminar ppt on mobile computing free download Ppt on art and craft movement pottery