Protein motif extraction with neuro-fuzzy optimization. Authors: Bill C. H. Chang and Saman K. Halgamuge.


1 Protein motif extraction with neuro-fuzzy optimization. Authors: Bill C. H. Chang and Saman K. Halgamuge. Adviser: K. T. Sun. Presenter: Wei-Liang Liu. Bioinformatics, Vol. 18, no. 8, 2002, pages 1084–1090.

2 Introduction (1/2) We present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences. The algorithm uses a statistical method to find short patterns with high frequency, followed by neural network training to optimize the final classification accuracy. Fuzzy logic is used to increase the flexibility of protein motifs.

3 Introduction (2/2) Sequence motif discovery algorithms can generally be categorized into three types: (1) string alignment algorithms, (2) exhaustive enumeration algorithms, (3) heuristic methods.

4 String alignment algorithms These find sequence motifs by minimizing a cost function related to the edit distances between sequences. Multiple alignment of sequences is an NP-hard problem, and its computation time increases exponentially with the sequence size.

5 Exhaustive enumeration algorithms Exhaustive enumeration algorithms are guaranteed to find the optimal motif, but run in exponential time with respect to the length of the motif.

6 Heuristic methods Heuristic methods can have better performance but are usually less flexible.

7 Neuro-Fuzzy system A neuro-fuzzy system is a neural network and a fuzzy system mapped to each other, thus providing the advantages of both systems (Halgamuge and Glesner, 1994). When it is used as a classifier, the outputs are class labels, and therefore no conventional defuzzification is applied.

8 Example of a sequence One example of sequence data is the human zinc finger sequence ZNF117 [6]: MKRHEMVAKHLVMFYYFAQHLWPEQNIRDSFQKVTLRR YRKCGYENLQLRKGCKSVVECKQHKGDYSGLNQCLKTT LSKIFQCNKYVEVFHKISNSNRHKMRHTENKHFKCKECR KTFCMLSHLTQHKRIHTRVNFYKCEAYGRAFNWSSTLNK HKRIHTGEKPYKCKECGKAFNQTSHLIRHKRIHTEEKPYK CEECGKAFNQSSTLTTHNIIHTGEIPYKCEKCVRAFNQAS KLTEHKLIHTGEKRYECEECGKAFNRSSKLTEHKYIHTGE KLYKCEECDKAFNLSSTLTKHKVIHTGEKLYKCKECGKA FKQFSHLAIHNIIHTGEKLYKCEECGKAFNSSSNLTAHKK NRTGEKPYKCEECGKANLSSTLTPHKTIHI

9 Algorithm The aim of this algorithm is to find a consensus pattern, or motif, from sequences belonging to the same family. This motif can be either a rigid or a flexible pattern. A rigid pattern may be A–x(5)–B, where there is a fixed number of gaps/wildcards (in this case, five) between the two patterns A and B. In a flexible pattern, the number of gaps is represented by a lower bound and an upper bound, such as x(2,4).

10 Algorithm has four main steps The proposed motif extraction algorithm has four main steps: sequence preprocessing, motif generation, motif selection, and motif optimization.

11 Overview of the algorithm

12 Sequence Preprocessing The aim of the preprocessing step is to select the ‘more’ important ‘features’ within the sequences of a single family, so that the actual motif extraction becomes faster.

13 Example (1/2) Consider the pattern ABC–x(1,3)–DEF, where x(1,3) represents wild cards of length 1 to 3. Any amino acid symbol can match a wild card. The sequences ABCHHDEF and ABCAAADEF both satisfy this consensus pattern. The consensus pattern ABC–x(1,3)–DEF can also be written as A–x(0)–B–x(0)–C–x(1,3)–D–x(0)–E–x(0)–F.
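
As an illustration only (not part of the original slides), the pattern notation on this slide can be translated into a regular expression. The sketch below assumes patterns are written with ASCII hyphens rather than the dashes used on the slides, and the helper name `pattern_to_regex` is hypothetical.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate e.g. 'ABC-x(1,3)-DEF' into a regular expression string."""
    parts = []
    for token in pattern.split("-"):
        gap = re.fullmatch(r"x\((\d+)(?:,(\d+))?\)", token)
        if gap is None:
            parts.append(re.escape(token))   # literal residues such as 'ABC'
        else:
            low = gap.group(1)
            high = gap.group(2) or low       # x(5) means exactly five wild cards
            parts.append(".{%s,%s}" % (low, high))
    return "".join(parts)

regex = pattern_to_regex("ABC-x(1,3)-DEF")
for seq in ("ABCHHDEF", "ABCAAADEF", "ABCDEF"):
    # The first two sequences satisfy the pattern; the third has no wild card.
    print(seq, bool(re.search(regex, seq)))
```

The rigid form A-x(5)-B and the expanded form with x(0) gaps go through the same function, since x(n) is treated as the interval x(n,n).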

14 Example (2/2) As a general form, a sequence pattern can be represented as a series of events and intervals (Chang and Halgamuge, 2001): $E_1 - I_{1,2} - E_2 - I_{2,3} - \dots - I_{(N-1),N} - E_N$, where $E_1$ is the first event and $I_{1,2}$ is the interval gap between the first and second events.

15 Vector generation Each element of the vector represents a combination of two events, $E_i$ and $E_j$, and their gap $I_{i,j}$ (where $E_i$ occurs before $E_j$), and the value of each element is either 1 or 0. A value of 1 means ‘in this sequence, there is an occurrence of character $E_i$ with interval $I_{i,j}$ before $E_j$’; a value of 0 means there is no such occurrence.

16 Example Let us assume the first element of a vector represents ‘A–x(0)–A’. The value of this element will be 1 for sequence ‘AABCD’ and 0 for sequence ‘ABACD’, as the short pattern A–x(0)–A occurs in the first sequence but not in the second.
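
A minimal sketch of how such a vector element could be computed, assuming a sparse representation (a set of the event-gap-event triples that occur) instead of a full 0/1 vector. The function name `pair_features` and the `max_gap` default are assumptions made for illustration, not taken from the slides.

```python
def pair_features(seq: str, max_gap: int = 19) -> set:
    """Return every (first event, gap, second event) triple occurring in seq."""
    found = set()
    for i, first in enumerate(seq):
        # j is the position of the second event; the gap is the number of
        # symbols strictly between positions i and j.
        for j in range(i + 1, min(i + max_gap + 2, len(seq))):
            found.add((first, j - i - 1, seq[j]))
    return found

feature = ("A", 0, "A")                          # the short pattern A-x(0)-A
print(int(feature in pair_features("AABCD")))    # 1
print(int(feature in pair_features("ABACD")))    # 0
```

A triple that is present in the set corresponds to a vector element of 1; an absent triple corresponds to 0.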

17 Size of vector For protein sequences, the number of possible events is 20 (there are 20 amino acids). Since only nine of the roughly 1300 motif patterns in PROSITE have interval gaps of more than 20 (Hart et al., 2000), a maximum gap of 20 between any two events should be satisfactory. Therefore the size of the vector is 20 × 20 × 20 = 8000, and the vector can be implemented as 13-bit ($2^{13} = 8192$) binary data.
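
One possible reading of the 13-bit remark is that every one of the 8000 elements can be addressed by an index that fits in 13 bits. The sketch below shows such an indexing scheme; the residue ordering, the choice of gap values 0 to 19, and the name `feature_index` are all assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues (assumed order)
N_GAPS = 20                            # assumed gap values 0..19

def feature_index(first: str, gap: int, second: str) -> int:
    """Map an (event, gap, event) triple to a position in the vector."""
    a = AMINO_ACIDS.index(first)
    b = AMINO_ACIDS.index(second)
    return (a * N_GAPS + gap) * len(AMINO_ACIDS) + b

size = len(AMINO_ACIDS) * N_GAPS * len(AMINO_ACIDS)
print(size)                            # 8000
print(feature_index("Y", 19, "Y"))     # 7999, the largest index
print((size - 1).bit_length())         # 13
```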

18 Protein sequences

19 Feature selection Important features are selected by keeping the vector elements whose values are above a certain threshold (e.g. 0.90). The value of each vector element represents the frequency of occurrence of a particular $E_i - I_{i,j} - E_j$ pattern. For example, if the element that represents A–x(0)–A has a value of 0.99, then 99% of this group of sequences have ‘AA’ somewhere in their sequences.
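
The thresholding step can be sketched as follows. The toy family, the helper names, and the use of "greater than or equal to" for the threshold comparison are assumptions; the slides do not give the implementation.

```python
from collections import Counter

def pair_features(seq, max_gap=19):
    """Set of (first event, gap, second event) triples occurring in seq."""
    return {(seq[i], j - i - 1, seq[j])
            for i in range(len(seq))
            for j in range(i + 1, min(i + max_gap + 2, len(seq)))}

def frequent_features(family, threshold=0.90):
    """Keep the features whose frequency across the family meets the threshold."""
    counts = Counter()
    for seq in family:
        counts.update(pair_features(seq))   # each sequence counts at most once
    return {f: c / len(family) for f, c in counts.items()
            if c / len(family) >= threshold}

toy_family = ["CAAC", "CHHC", "CKKCA"]      # hypothetical sequences
print(frequent_features(toy_family))        # only C-x(2)-C occurs in all three
```

Because each sequence contributes a set, a feature that occurs several times in one sequence is still counted once, which matches the "fraction of sequences" reading of the frequency.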

20 Motif generation (1/3) For example, if a motif pattern C–x(2)–C–x(3)–F occurs in 90% of the sequences in the family, the short patterns (or important features) (1) C–x(2)–C, (2) C–x(3)–F, and (3) C–x(6)–F must all exist at a frequency of 90% or greater in the sequences. But the reverse is not always true.
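
The short patterns implied by a rigid motif follow from simple position arithmetic: a gap of g symbols places the next event g + 1 positions later. A small sketch, with the hypothetical helper `implied_pairs`:

```python
from itertools import combinations

def implied_pairs(events, gaps):
    """List the (event, gap, event) short patterns implied by a rigid motif.

    events: e.g. ['C', 'C', 'F'] and gaps: [2, 3] for C-x(2)-C-x(3)-F.
    """
    positions = [0]
    for gap in gaps:
        positions.append(positions[-1] + gap + 1)
    return [(events[i], positions[j] - positions[i] - 1, events[j])
            for i, j in combinations(range(len(events)), 2)]

print(implied_pairs(["C", "C", "F"], [2, 3]))
# [('C', 2, 'C'), ('C', 6, 'F'), ('C', 3, 'F')]
```

The gap of 6 between the first C and F is 2 + 1 + 3: the two gaps plus the intermediate C itself.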

21 Motif generation (2/3) Fig. 2. Connect important features to form a motif candidate.

22 Motif generation (3/3) In Figure 2, F–x(2)–S is not connected because, for a motif C–x(2)–C–x(3)–F–x(2)–S to occur frequently, the short patterns C–x(9)–S and C–x(6)–S would have to occur frequently as well (which is not the case here).

23 A good motif pattern A good motif pattern can be simply described as one that: (1) correctly identifies protein sequences belonging to the family it represents, i.e. maximizes ‘true positives’; (2) does not identify protein sequences belonging to other families, i.e. minimizes ‘false positives’.
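
These two criteria can be measured as rates over labelled sequences. The sketch below is an illustration under assumptions: the motif is expressed as a regular expression, the sequences are made up, and `rates` is a hypothetical helper rather than anything defined in the slides.

```python
import re

def rates(regex, family, others):
    """Return (true-positive rate, false-positive rate) for a motif regex."""
    tp = sum(bool(re.search(regex, s)) for s in family) / len(family)
    fp = sum(bool(re.search(regex, s)) for s in others) / len(others)
    return tp, fp

# Hypothetical data: C-x(2)-C-x(3)-F written as a regular expression.
family = ["AACKKCAAAF", "CHHCDDDF", "GGCAACF"]
others = ["CKKCAAAF", "AAAA"]
tp_rate, fp_rate = rates(r"C.{2}C.{3}F", family, others)
print(tp_rate, fp_rate)
```

Here two of the three family sequences match, and one of the two non-family sequences matches as well, so widening a motif to recover the missed family member would have to be weighed against further false positives.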

24 Motif optimization (1/2)

25 Motif optimization (2/2) The inputs to the network are event intervals. The simple rule (black node in the ‘rule base’ layer of Figure 3) in the neuro-fuzzy system is: ‘IF $I_1$ is $\mu_1$ and $I_2$ is $\mu_1$, THEN output is $\mu_{class}$’. $\mu_{class}$ is the output of the neuro-fuzzy network.

26 Fuzzy inference system A fuzzy inference system embedded in a neural network has three main steps: fuzzification, fuzzy inference, and defuzzification.

27 Sequence Preprocessing (1/3) For example, let T = AGCCTGAT. The first and second level distribution matrices are shown in Table 1:

28 Sequence Preprocessing (2/3)

29 Sequence Preprocessing (3/3)

30 Sequence Fuzzification (1/2) The value of the event interval is also fuzzified. For example, if pattern P = TφφG (that is, P = T–x(2)–G), the event interval fuzzy membership function can be defined as shown in Figure 4.

31 Sequence Fuzzification (2/2)

32 Sequence Inference This step aims to find the subsequence in text T that is most ‘similar’ to pattern P. The inference rule used here is: IF event $A_1$ occurs AND event $A_2$ occurs AND the event interval between $A_1$ and $A_2$ is $I_1$ AND … event $A_{n-1}$ occurs AND event $A_n$ occurs AND the event interval between $A_{n-1}$ and $A_n$ is $I_{n-1}$, THEN pattern P exists in text T with degree $Y_i$.

33 Fuzzy Sequence Pattern Matching Algorithm (example) The general structure of a C2H2 zinc finger protein motif (a motif is the signature of a particular group of sequences) is [2]: CφφCφφφφφφφφφφφφHφφH

34 Sequence Preprocessing (example) CφφCφφφφφφφφφφφφHφφH

35 Sequence Fuzzification (example) We use the following fuzzy rules to describe the event intervals: R1: If the event interval between the first two Cs is $I_1$, then the membership value is $\mu_1$. R2: If the event interval between C and H is $I_2$, then the membership value is $\mu_2$. R3: If the event interval between the last two Hs is $I_3$, then the membership value is $\mu_3$.

36 Sequence Inference (example) The inference rule used here is: IF the event interval between the first two Cs is $I_1$ AND the event interval between C and H is $I_2$ AND the event interval between the last two Hs is $I_3$, THEN pattern P exists in text T with degree $Y_i$, where $Y_i = \mu_1 \times \mu_2 \times \mu_3$ and $Y = \max(Y_1, Y_2, Y_3, \dots, Y_m)$.
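
The product-then-maximum inference can be sketched as below. The membership functions are defined in a figure that is not reproduced in this transcript, so the triangular shape, its widths, and the candidate interval triples used here are assumptions chosen only to make the arithmetic concrete.

```python
def triangular(center, width):
    """Hypothetical membership function: 1 at center, 0 beyond +/- width."""
    def mu(x):
        return max(0.0, 1.0 - abs(x - center) / width)
    return mu

# One membership function per interval of C-x(2)-C-x(12)-H-x(2)-H (assumed shapes).
mus = [triangular(2, 2), triangular(12, 4), triangular(2, 2)]

def match_degree(intervals):
    """Y_i: product of the membership values of the observed intervals."""
    y = 1.0
    for mu, interval in zip(mus, intervals):
        y *= mu(interval)
    return y

# Hypothetical candidate matches, each given as (I1, I2, I3).
candidates = [(2, 12, 2), (3, 12, 2), (2, 14, 3)]
degrees = [match_degree(c) for c in candidates]
print(degrees)        # [1.0, 0.5, 0.25]
print(max(degrees))   # Y = 1.0
```

Using a product means that a single interval far from its expected value drives the whole degree to zero, while the maximum picks the best candidate subsequence in the text.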

37 Classify

38 Sum of square error For example, sequence Z is ACCABBDACA, and the preliminary motif is A–x(2)–A–x(2)–A. The possible matches are (a) ACCABBDA (A–x(2)–A–x(3)–A) and (b) ABBDACA (A–x(3)–A–x(1)–A). The sum of square error is, for (a): $(2 - 2)^2 + (3 - 2)^2 = 1$, and for (b): $(3 - 2)^2 + (1 - 2)^2 = 2$. So (a) is the ‘most similar match’, and its event interval values (2, 3) are used as training input data.
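
The selection of the most similar match can be reproduced with a brute-force search. This is a sketch under assumptions: it enumerates every combination of positions whose symbols spell the motif's events, which also includes two candidates the slide does not list (both with larger errors), and `most_similar_match` is a hypothetical name.

```python
from itertools import combinations

def most_similar_match(seq, events, gaps):
    """Return (sum of square error, observed intervals) of the best match."""
    best = None
    for positions in combinations(range(len(seq)), len(events)):
        if any(seq[p] != e for p, e in zip(positions, events)):
            continue   # the symbols at these positions do not spell the events
        found = tuple(b - a - 1 for a, b in zip(positions, positions[1:]))
        sse = sum((f - g) ** 2 for f, g in zip(found, gaps))
        if best is None or sse < best[0]:
            best = (sse, found)
    return best

print(most_similar_match("ACCABBDACA", "AAA", (2, 2)))   # (1, (2, 3))
```

The returned intervals (2, 3) are the values the slide uses as training input.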

39 Result of C2H2 zinc finger protein (1/3)

40 Result of C2H2 zinc finger protein (2/3)

41 Result of C2H2 zinc finger protein (3/3)

42 Result of EGF Protein (1/3)

43 Result of EGF Protein (2/3)

44 Result of EGF Protein (3/3)

45 Discussion The optimization of motif patterns in both the EGF and zinc finger protein families increases the rate of true positives. However, as the true-positive rate increases, the false-positive rate also increases. An interesting observation is that, in comparison to the motifs suggested in PROSITE, the motifs identified by our method are more flexible and broad.

46 Conclusion and future work For future research, optimization of the neuro-fuzzy system will be further investigated in order to implement fuzzy membership functions for events.

