
Intelligent Systems and Software Engineering Lab (ISSEL) – ECE – AUTH 10th Panhellenic Conference in Informatics Machine Learning and Knowledge Discovery.


1 Protein Classification with Multiple Algorithms
S. Diplaris, G. Tsoumakas, P. A. Mitkas, I. Vlahavas
Intelligent Systems and Software Engineering Lab (ISSEL) – ECE – AUTH
Machine Learning and Knowledge Discovery Group (MLKD) – CSD – AUTH
10th Panhellenic Conference in Informatics, Volos, 11-13 November 2005

2 Aristotle University of Thessaloniki 20/05/2015 Protein Classification with Multiple Algorithms
Outline
- Introduction
- Motif-based protein classification
- Combining classification methods
- Experiments
- Results and Discussion
- Conclusions

3 Introduction
- The number of protein sequences in public biological databases is constantly increasing. These will be dwarfed by the sequences from the environmental sequencing projects currently underway.
- There is a growing imbalance between the number of sequences in databases and the information about their structure and function.
- Protein function prediction can save time and money.

4 Discovering protein functionality
- A protein's biological effect can be identified in two ways:
  - Time-consuming and expensive experiments that are not always applicable (in vitro)
  - Computational methods, such as data mining (in silico)

5 Protein families
- Proteins are categorized into families according to their functionality.
- Proteins belonging to the same family are structurally related and thus have similar properties.

6 Protein Motifs and Profiles
- The behavior of a protein is a function of many motifs and profiles, some of which overpower others.
- Profiles are computational representations of multiple sequence alignments using hidden Markov models.
- Motifs are short conserved sub-sequences that usually correspond to active or functional sites.

7 Problem Description
[Diagram labels: unknown protein sequence; known-family proteins; "What are its properties?"]

8 Background
- Databases
  - Protein databases (Prosite, Swiss-Prot)
  - Motif/profile databases (Pfam, Prints)
- Direct knowledge of protein function from motifs is impossible
  - Use machine learning methods to discover similarities in protein chains
  - Classify proteins in families using their motifs and profiles

9 Protein Classification
[Diagram labels: motif-based protein data; classification algorithm; classifier induction; unknown protein; 10-fold validation; algorithm evaluation; prediction of protein function]
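The 10-fold validation step named on this slide can be sketched as follows. This is a minimal illustration, not the authors' setup: the function name `k_fold_indices` is hypothetical, the split is a plain round-robin without shuffling or stratification, and the 23 items are an arbitrary toy size.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each item is tested exactly once."""
    # Round-robin assignment: item i goes to fold i % k (an assumption;
    # real experiments usually shuffle or stratify first).
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, f in enumerate(folds) if f_idx != i for j in f]
        yield train, test


splits = list(k_fold_indices(23, k=10))
for train, test in splits:
    # A classifier would be induced on `train` and evaluated on `test` here.
    assert not set(train) & set(test)
print(len(splits), "folds")
```

Averaging the per-fold accuracy over the ten splits gives the evaluation figure used to compare algorithms.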

10 Data Preprocessing (1/2)
[Diagram labels: PATTERNS; PROFILES; PROTEIN CLASSES; GenMiner; OUTPUT FILE]
Protein: P04591 (Gag polyprotein)
Belongs in class: PDOC50158
Contains motif: PS50158 (ZF_CCHC)

11 Data Preprocessing (2/2)
Protein X: VLAEAMSQVT NSATIMMQRG NFRNQRKIVK CFNCGKEGHT ARNCRAPRKK GCWKCGKEGH
Motifs in protein X: m3, m5, m6
N-bit binary pattern: 0 0 1 0 1 0 1 0 ...
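The encoding on this slide, where each protein becomes an N-bit pattern with one bit per known motif, might look like the sketch below. The function name `encode_motifs` and the eight-motif vocabulary `m1`..`m8` are assumptions made for illustration; the bit order in the slide's own figure may differ.

```python
from typing import Iterable, List


def encode_motifs(protein_motifs: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Return one bit per vocabulary motif: 1 if the protein contains it, else 0."""
    present = set(protein_motifs)
    return [1 if motif in present else 0 for motif in vocabulary]


# Hypothetical vocabulary of N = 8 motifs, in a fixed order.
vocabulary = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]

# A protein in which motifs m3, m5 and m6 were found.
pattern = encode_motifs(["m3", "m5", "m6"], vocabulary)
print(pattern)  # [0, 0, 1, 0, 1, 1, 0, 0]
```

With the full dataset described later in the talk, the vocabulary would hold every motif, so each protein would be a fixed-length binary vector that any standard classifier can consume.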

12 Combining Multiple Classification Algorithms
- Motivation: accuracy improvement
- Algorithms differ in:
  - Biases for generalizing from examples
  - Knowledge representation
- Each algorithm tends to err on different parts of the instance space
- Solution: efficient combination of algorithms to correct uncorrelated errors
  - Classifier selection
  - Classifier fusion

13 Classifier Selection
- Select a single algorithm for classifying a new instance
- Known approaches:
  - SelectBest: evaluation and selection (ES)
  - Selection based on performance in similar learning domains: estimate performance in the k nearest neighbors and rank algorithms
  - Dynamic Selection: use a different algorithm in different parts of the instance space
  - Dynamic Weighting: local performance around the meta-instance space
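As a rough illustration of the evaluation-and-selection idea behind SelectBest, the sketch below picks the single algorithm with the highest estimated accuracy. The function name `select_best`, the algorithm names, and the accuracy values are all invented for the example and are not results from the talk.

```python
from typing import Dict


def select_best(estimated_accuracy: Dict[str, float]) -> str:
    """Return the name of the algorithm with the highest estimated accuracy."""
    # Ties go to the algorithm listed first (dict insertion order).
    return max(estimated_accuracy, key=estimated_accuracy.get)


# Hypothetical cross-validated accuracies on the training data.
estimated_accuracy = {"algorithm_A": 0.81, "algorithm_B": 0.93, "algorithm_C": 0.88}

chosen = select_best(estimated_accuracy)
print(chosen)  # algorithm_B
```

Only the chosen algorithm is then used to classify new instances; the others are discarded.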

14 Classifier Fusion
- Fuse decisions from all algorithms
- Known approaches:
  - Voting (V): each model outputs a class value; the majority class wins
  - Weighted Voting (WV): each model votes with a coefficient based on its accuracy
  - Stacking with Multi-Response Model Trees (SMT): learn a meta-level model that predicts the correct class based on the base-level classifiers; the most accurate classifier of the Stacking family
  - Selective Fusion (SF): use statistical procedures to select the best sub-group of classifiers, then use WV in this subgroup to decide
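The two voting schemes listed above can be sketched in a few lines. This is an illustrative version only: the function names, the class labels, and the weights are hypothetical, and the slides do not say exactly how the accuracy-based coefficients are computed.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def majority_vote(predictions: List[str]) -> str:
    """Voting (V): every model has one vote; the most frequent class wins."""
    # On a tie, the class that was seen first wins.
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions: List[str], weights: List[float]) -> str:
    """Weighted Voting (WV): each model's vote counts in proportion to its weight."""
    scores: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical models predict a family for the same protein.
predictions = ["family_1", "family_2", "family_2"]
# Hypothetical weights, e.g. each model's estimated accuracy.
weights = [0.9, 0.3, 0.4]

print(majority_vote(predictions))           # family_2
print(weighted_vote(predictions, weights))  # family_1
```

The example shows how the two schemes can disagree: two weak models outvote one strong model under plain voting, but not once votes are weighted.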

15 Experiments
- Dataset: the 10 most important protein families
  - 662 proteins
  - 1182 motifs
  - Some proteins belonged to more than one class
  - Separate classes were created for these groups of proteins
  - Finally: 32 different classes
- Evaluation: 10-fold validation
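One way to read the step of creating separate classes for proteins that belong to several families is to treat each distinct combination of families as its own class. The sketch below assumes that reading; the protein IDs, family names, and the `+` separator are hypothetical.

```python
from typing import Dict, Set


def combine_labels(families: Set[str]) -> str:
    """Map a set of family labels to one class name, independent of set order."""
    return "+".join(sorted(families))


# Hypothetical proteins; P2 and P3 belong to the same two families.
proteins: Dict[str, Set[str]] = {
    "P1": {"famA"},
    "P2": {"famA", "famB"},
    "P3": {"famB", "famA"},
    "P4": {"famB"},
}

classes = {pid: combine_labels(fams) for pid, fams in proteins.items()}
distinct_classes = sorted(set(classes.values()))
print(distinct_classes)  # ['famA', 'famA+famB', 'famB']
```

Under this scheme the number of classes grows with the number of distinct family combinations, which is consistent with 10 families yielding 32 classes.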

16 Experiments
- 9 classification algorithms
  - DT, C4.5: decision trees
  - RIPPER (JRip), PART: rule learning
  - K-nearest neighbor (IBk)
  - K*: instance-based algorithm with entropic distance measure
  - Naïve Bayes (NB)
  - SMO: Sequential Minimal Optimization for training a support vector classifier using polynomial kernels
  - RBF: radial basis function network
- 5 classifier combination methods: SMT, V, WV, SB, SF

17 Results 1

18 Discussion 1
- The reputation of SVM as a state-of-the-art classification method is verified
  - Decision trees and instance-based learning also perform well
  - Naïve Bayes and RBF exhibit quite low performance
- A biologist could use SMO, but
  - The rest of the well-performing algorithms could generalize better
  - By combining all these algorithms or a subset of them we get…

19 Results 2

20 Discussion 2
- WV performed better than SB
  - The voting procedure corrects uncorrelated errors
- V did not perform well
  - Bad-performing models
- State-of-the-art SMT performed very badly
  - The large number of classes led to high dimensionality in the meta-level dataset
- SF performed very well
  - Selected the best subset of models and fused them
  - 6.3 models on average over the 10 folds
Combination of multiple algorithms using appropriate selection results in error reduction
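The dimensionality remark about SMT can be made concrete with a small sketch. It assumes, as is common in stacking but not stated on the slides, that each base classifier contributes one probability per class to the meta-level instance; the uniform distributions are placeholder values.

```python
from typing import List


def meta_level_instance(distributions: List[List[float]]) -> List[float]:
    """Concatenate the class-probability vectors of all base classifiers."""
    return [p for dist in distributions for p in dist]


n_classifiers = 9   # base-level algorithms used in the experiments
n_classes = 32      # classes in the dataset

# Placeholder output: every base classifier returns a uniform distribution.
uniform = [[1.0 / n_classes] * n_classes for _ in range(n_classifiers)]

meta = meta_level_instance(uniform)
print(len(meta))  # 288 meta-level attributes
```

Under that assumption the meta-level learner sees 9 x 32 = 288 attributes per instance, with only 662 proteins to learn from.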

21 Conclusions
- Comparative study of different classification algorithms and algorithm combination methods in motif-based protein classification
- To successfully apply the algorithms we need:
  - Multiple algorithms
  - A proper method to discard bad-performing algorithms and combine the best

22 Future issues
- Multi-label instances
- Effectiveness of alternative representations of the problem
- Discover new profiles directly from the protein sequences

