Protein motif extraction with neuro-fuzzy optimization. Authors: Bill C. H. Chang and Saman K. Halgamuge.


1 Protein motif extraction with neuro-fuzzy optimization. Author: Bill C. H. Chang and Saman K. Halgamuge. Adviser: K. T. Sun. Presenter: Wei-Liang Liu. BIOINFORMATICS Vol. 18, Pages 1084–1090.

2 Introduction (1/2) We present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences. This algorithm involves a statistical method to find short patterns with high frequency and then neural network training to optimize the final classification accuracies. Fuzzy logic is used to increase the flexibility of protein motifs.

3 Introduction (2/2) Sequence motif discovery algorithms can be generally categorized into three types: (1) string alignment algorithms, (2) exhaustive enumeration algorithms, and (3) heuristic methods.

4 String alignment algorithms. Find sequence motifs by minimizing a cost function which is related to the edit distances between sequences. Multiple alignment of sequences is an NP-hard problem, and its computational time increases exponentially with the sequence size.

5 Exhaustive enumeration algorithms. Exhaustive enumeration algorithms are guaranteed to find the optimal motif, but run in exponential time with respect to the length of the motif.

6 Heuristic methods. Heuristic methods can have better performance but are usually less flexible.

7 Neuro-fuzzy system. A neuro-fuzzy system is a neural network and a fuzzy system mapped to each other, thus providing the advantages of both systems (Halgamuge and Glesner, 1994). When it is used as a classifier, the outputs are class labels and therefore no conventional defuzzification is applied.

8 Example of a sequence. One example of protein sequence data is the human zinc finger sequence ZNF117 [6]: MKRHEMVAKHLVMFYYFAQHLWPEQNIRDSFQKVTLRR YRKCGYENLQLRKGCKSVVECKQHKGDYSGLNQCLKTT LSKIFQCNKYVEVFHKISNSNRHKMRHTENKHFKCKECR KTFCMLSHLTQHKRIHTRVNFYKCEAYGRAFNWSSTLNK HKRIHTGEKPYKCKECGKAFNQTSHLIRHKRIHTEEKPYK CEECGKAFNQSSTLTTHNIIHTGEIPYKCEKCVRAFNQAS KLTEHKLIHTGEKRYECEECGKAFNRSSKLTEHKYIHTGE KLYKCEECDKAFNLSSTLTKHKVIHTGEKLYKCKECGKA FKQFSHLAIHNIIHTGEKLYKCEECGKAFNSSSNLTAHKK NRTGEKPYKCEECGKANLSSTLTPHKTIHI

9 Algorithm. The aim of this algorithm is to find a consensus pattern, or motif, from sequences belonging to the same family. This motif can be either a rigid or a flexible pattern. A rigid pattern may be A–x(5)–B, where there is a fixed number of gaps/wildcards (in this case, five) between the two patterns A and B. In a flexible pattern, the number of gaps is given by a lower bound and an upper bound, such as x(2,4).
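As an aside (not part of the paper), rigid and flexible patterns of this form map directly onto regular expressions, which makes the notation easy to experiment with. The helper below is an illustrative sketch of that translation; the function name is hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Convert a pattern such as 'A-x(5)-B' (rigid) or
    'ABC-x(1,3)-DEF' (flexible) into a Python regular expression.
    'x(n)' is a gap of exactly n wildcards; 'x(m,n)' is a gap of
    m to n wildcards. Any character may fill a wildcard position."""
    out = []
    for token in pattern.split('-'):
        m = re.fullmatch(r'x\((\d+)(?:,(\d+))?\)', token)
        if m:
            lo, hi = m.group(1), m.group(2)
            out.append('.{%s}' % lo if hi is None else '.{%s,%s}' % (lo, hi))
        else:
            out.append(re.escape(token))
    return ''.join(out)

flexible = pattern_to_regex('ABC-x(1,3)-DEF')
print(flexible)                               # ABC.{1,3}DEF
print(bool(re.search(flexible, 'ABCHHDEF')))  # True: gap of 2
print(bool(re.search(flexible, 'ABCAAADEF'))) # True: gap of 3
```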

10 Algorithm has four main steps. The proposed motif extraction algorithm has four main steps: sequence preprocessing, motif generation, motif selection, and motif optimization.

11 Overview of the algorithm

12 Sequence Preprocessing. The aim of the preprocessing step is to select the 'more' important 'features' within the sequences of a single family so that the actual motif extraction becomes faster.

13 Example (1/2) ABC–x(1,3)–DEF, where x(1,3) represents wild cards of length 1 to 3. Any amino acid symbol can match a wild card. Sequences ABCHHDEF and ABCAAADEF both satisfy the above consensus pattern. The consensus pattern ABC–x(1,3)–DEF can also be written as A–x(0)–B–x(0)–C–x(1,3)–D–x(0)–E–x(0)–F.

14 Example (2/2) In general form, a sequence pattern can be represented as a series of events and intervals (Chang and Halgamuge, 2001): E1–I1,2–E2–I2,3–...–I(N−1),N–EN, where E1 is the first event and I1,2 is the interval gap between the first and second events.

15 Vector generation. Each element of the vector represents a combination of two events Ei and Ej and their gap Ii,j (where Ei occurs before Ej), and the value of each element is either 1 or 0. A value of 1 means 'in this sequence, there is an occurrence of character Ei with interval Ii,j before Ej'; a value of 0 means there is no such occurrence.

16 Example. Let us assume the first element of a vector represents 'A–x(0)–A'. The value of this element will be 1 for sequence 'AABCD' and 0 for sequence 'ABACD', as the short pattern A–x(0)–A occurs in the first sequence but not the second.
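The element test in this example can be sketched as a small check (illustrative Python, not the paper's implementation; the function name is hypothetical):

```python
def has_short_pattern(seq, e1, gap, e2):
    """Return 1 if character e1 occurs with exactly `gap` wildcard
    positions before character e2 anywhere in seq, else 0."""
    for i, c in enumerate(seq):
        j = i + gap + 1          # position that e2 must occupy
        if c == e1 and j < len(seq) and seq[j] == e2:
            return 1
    return 0

print(has_short_pattern('AABCD', 'A', 0, 'A'))  # 1: 'AA' occurs
print(has_short_pattern('ABACD', 'A', 0, 'A'))  # 0: no adjacent 'AA'
```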

17 Size of Vector. For protein sequences, the number of possible events is 20 (there are 20 amino acids). Considering that only nine patterns out of around 1300 motif patterns in PROSITE have interval gaps of more than 20 (Hart et al., 2000), a maximum gap of 20 between any two events should be satisfactory. Therefore the size of the vector is 20 × 20 × 20 = 8000, and an element index can be implemented as 13-bit (2^13 = 8192) binary data.
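One possible way to pack a short pattern into such a 13-bit index is sketched below. The packing order (E1, then gap, then E2) is an assumption made for illustration, not necessarily the paper's encoding.

```python
AMINO = 'ACDEFGHIKLMNPQRSTVWY'  # the 20 standard amino acids

def element_index(e1, gap, e2, max_gap=20):
    """Map a short pattern E1-x(gap)-E2 to a unique index in
    [0, 8000). Mixed-radix packing: 20 choices for e1, max_gap
    choices for the gap, 20 choices for e2."""
    return (AMINO.index(e1) * max_gap + gap) * 20 + AMINO.index(e2)

print(element_index('A', 0, 'A'))    # 0: the first pattern, A-x(0)-A
print(element_index('Y', 19, 'Y'))   # 7999: the last pattern
print((7999).bit_length())           # 13 bits suffice (2^13 = 8192)
```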

18 Protein sequences

19 Feature selection. Features are selected as the elements above a certain threshold value (e.g. 0.90). The value of each vector element represents the frequency of occurrence of a particular Ei–Ii,j–Ej pattern. For example, if an element which represents A–x(0)–A has a value of 0.99, then 99% of this group of sequences have 'AA' somewhere in their sequences.
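The thresholding step can be sketched as follows (illustrative Python; the sequences and candidate patterns are made up for the example):

```python
def select_features(sequences, candidates, threshold=0.90):
    """Keep short patterns (e1, gap, e2) whose frequency across the
    family meets the threshold. Frequency = fraction of sequences
    containing the pattern at least once."""
    def occurs(seq, e1, gap, e2):
        return any(seq[i] == e1 and i + gap + 1 < len(seq)
                   and seq[i + gap + 1] == e2 for i in range(len(seq)))
    selected = {}
    for e1, gap, e2 in candidates:
        freq = sum(occurs(s, e1, gap, e2) for s in sequences) / len(sequences)
        if freq >= threshold:
            selected[(e1, gap, e2)] = freq
    return selected

seqs = ['CAACF', 'CKACF', 'CAACW', 'CAACF', 'CAACF',
        'CAACF', 'CAACF', 'CAACF', 'CAACF', 'CAACF']
# C-x(2)-C occurs in all 10 sequences; C-x(3)-F in 9 of 10
print(select_features(seqs, [('C', 2, 'C'), ('C', 3, 'F')], 0.90))
```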

20 Motif generation (1/3) For example, if a motif pattern C–x(2)–C–x(3)–F occurs in 90% of the sequences in the family, then the short patterns (or important features) (1) C–x(2)–C, (2) C–x(3)–F, and (3) C–x(6)–F must all exist at a frequency of 90% or greater in the sequences. But the reverse is not always true.

21 Motif generation (2/3) Fig. 2. Connecting important features to form a motif candidate.

22 Motif generation (3/3) In Figure 2, F–x(2)–S is not connected because, for a motif C–x(2)–C–x(3)–F–x(2)–S to occur frequently, the short patterns C–x(9)–S and C–x(6)–S would have to occur frequently as well (which is not the case here).
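The connection test described in these slides can be sketched as follows: two short patterns chain into a longer motif candidate only if the implied bridging pattern is itself among the frequent features (illustrative Python; `can_connect` is a hypothetical helper, not from the paper):

```python
def can_connect(left, right, frequent):
    """left = (e1, g, e2), right = (e2, h, e3). They chain into
    e1-x(g)-e2-x(h)-e3 only if the implied bridge e1-x(g+h+1)-e3
    (the gap spans g wildcards, the shared event, and h wildcards)
    is itself among the frequent short patterns."""
    e1, g, e2 = left
    r1, h, e3 = right
    if e2 != r1:
        return False
    return (e1, g + h + 1, e3) in frequent

frequent = {('C', 2, 'C'), ('C', 3, 'F'), ('C', 6, 'F')}
# C-x(2)-C and C-x(3)-F chain: the bridge C-x(6)-F is frequent
print(can_connect(('C', 2, 'C'), ('C', 3, 'F'), frequent))  # True
# F-x(2)-S cannot be appended: the bridge C-x(6)-S is absent
print(can_connect(('C', 3, 'F'), ('F', 2, 'S'), frequent))  # False
```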

23 A good motif pattern. A good motif pattern can be simply described as one that: (1) correctly identifies protein sequences belonging to the family it represents, i.e. maximizes 'true positives'; (2) does not identify protein sequences belonging to other families, i.e. minimizes 'false positives'.
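These two criteria are the familiar true-positive and false-positive rates. A minimal sketch, with made-up predictions and labels:

```python
def confusion_rates(predictions, labels):
    """predictions: whether the motif matched each sequence;
    labels: whether the sequence truly belongs to the family.
    Returns (true-positive rate, false-positive rate)."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# 4 family sequences, 4 outsiders; the motif matches 3 + 1 of them
tpr, fpr = confusion_rates([1, 1, 1, 0, 1, 0, 0, 0],
                           [1, 1, 1, 1, 0, 0, 0, 0])
print(tpr, fpr)  # 0.75 0.25
```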

24 Motif optimization (1/2)

25 Motif optimization (2/2) The inputs to the network are the event intervals. The simple rule (black node in the 'rule base' layer of Figure 3) in the neuro-fuzzy system is: 'IF I1 is μ1 AND I2 is μ2, THEN output is μclass', where μclass is the output of the neuro-fuzzy network.

26 Fuzzy inference system. A fuzzy inference system embedded in a neural network has three main steps: fuzzification, fuzzy inference, and defuzzification.

27 Sequence Preprocessing (1/3) For example, let T = AGCCTGAT. The first- and second-level distribution matrices are shown in Table 1:

28 Sequence Preprocessing (2/3)

29 Sequence Preprocessing (3/3)

30 Sequence Fuzzification (1/2) The value of the event interval is also fuzzified. For example, if pattern P = TφφG (equivalently, P = T–x(2)–G), the event interval fuzzy membership function can be defined as shown in Figure 4.
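Figure 4 is not reproduced in this transcript, but a common choice for such an interval membership function is a triangle peaking at the nominal gap. The sketch below assumes that shape and an arbitrary width; both are assumptions, not the paper's definition.

```python
def triangular_membership(interval, center=2, width=2):
    """Fuzzy membership for an event interval: 1.0 at the nominal
    gap, falling linearly to 0 at center +/- width. The triangular
    shape and the width are assumptions made for illustration."""
    return max(0.0, 1.0 - abs(interval - center) / width)

# For P = T-x(2)-G, a gap of 2 matches fully; 1 or 3 only partially
print(triangular_membership(2))  # 1.0
print(triangular_membership(3))  # 0.5
print(triangular_membership(5))  # 0.0
```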

31 Sequence Fuzzification (2/2)

32 Sequence Inference. This step aims to find the subsequence in text T most 'similar' to pattern P. The inference rule used here is: IF event A1 occurs AND event A2 occurs AND the event interval between A1 and A2 is I1 AND ... event An−1 occurs AND event An occurs AND the event interval between An−1 and An is In−1, THEN pattern P exists in text T with degree Yi.

33 Fuzzy Sequence Pattern Matching Algorithm (example) The general structure of a C2H2 zinc finger protein motif (a motif is the signature of a particular group of sequences) is [2]: CφφCφφφφφφφφφφφφHφφH

34 Sequence Preprocessing (example) CφφCφφφφφφφφφφφφHφφH

35 Sequence Fuzzification (example) We use the following fuzzy rules to describe the event intervals: R1: If the event interval between the first two Cs is I1, then the membership value is μ1. R2: If the event interval between C and H is I2, then the membership value is μ2. R3: If the event interval between the last two Hs is I3, then the membership value is μ3.

36 Sequence Inference (example) The inference rule used here is: IF the event interval between the first two Cs is I1 AND the event interval between C and H is I2 AND the event interval between the last two Hs is I3, THEN pattern P exists in text T with degree Yi, where Yi = μ1 × μ2 × μ3 and Y = max(Y1, Y2, Y3, ..., Ym).
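The product-then-max inference on this slide can be sketched as follows (illustrative Python; the triangular memberships, their centres, and their widths are assumptions for the example):

```python
def match_degree(intervals, memberships):
    """Degree that pattern P exists at one candidate position:
    the product Yi = mu1 * mu2 * ... of the interval memberships."""
    y = 1.0
    for interval, mu in zip(intervals, memberships):
        y *= mu(interval)
    return y

def best_match(candidate_intervals, memberships):
    """Y = max over all candidate subsequences of their degrees Yi."""
    return max(match_degree(iv, memberships) for iv in candidate_intervals)

# C2H2-style rule with three intervals; triangular memberships
# centred on assumed canonical gaps 2, 12 and 2, width 2.
mus = [lambda i, c=c: max(0.0, 1.0 - abs(i - c) / 2) for c in (2, 12, 2)]
print(match_degree((2, 12, 2), mus))              # 1.0: exact spacing
print(match_degree((3, 11, 2), mus))              # 0.5 * 0.5 * 1.0 = 0.25
print(best_match([(2, 12, 2), (3, 12, 2)], mus))  # 1.0
```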

37 Classify

38 Sum of square error. For example, sequence Z is ACCABBDACA and the preliminary motif is A–x(2)–A–x(2)–A. The possible matches are (a) ACCABBDA (A–x(2)–A–x(3)–A) and (b) ABBDACA (A–x(3)–A–x(1)–A). The sums of square errors are: (a) (2 − 2)² + (3 − 2)² = 1; (b) (3 − 2)² + (1 − 2)² = 2. So (a) is the 'most similar match' and its event interval values (2, 3) are used as training input data.
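The worked example on this slide can be verified in a few lines (illustrative Python):

```python
def sum_square_error(match_intervals, motif_intervals):
    """Distance between one candidate match and the preliminary
    motif: sum of squared differences of the event intervals."""
    return sum((m - t) ** 2 for m, t in zip(match_intervals, motif_intervals))

# Preliminary motif A-x(2)-A-x(2)-A; candidates found in Z = ACCABBDACA
motif = (2, 2)
candidates = {'a': (2, 3),   # ACCABBDA -> A-x(2)-A-x(3)-A
              'b': (3, 1)}   # ABBDACA  -> A-x(3)-A-x(1)-A
errors = {k: sum_square_error(v, motif) for k, v in candidates.items()}
print(errors)  # {'a': 1, 'b': 2}
best = min(errors, key=errors.get)
print(best, candidates[best])  # a (2, 3): used as a training input
```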

39 Result of C2H2 zinc finger protein (1/3)

40 Result of C2H2 zinc finger protein (2/3)

41 Result of C2H2 zinc finger protein (3/3)

42 Result of EGF Protein (1/3)

43 Result of EGF Protein (2/3)

44 Result of EGF Protein (3/3)

45 Discussion. The optimization of motif patterns in both the EGF and the zinc finger protein families increases the rate of true positives. However, with an increase in the true-positive rate, the false-positive rate also increases. An interesting observation is that, in comparison to the motifs suggested in PROSITE, the motifs identified by our method are more flexible and broad.

46 Conclusion and future work. For future research, optimization of the neuro-fuzzy system will be further investigated to implement fuzzy membership functions for events.