1 Evaluation of Techniques for Classifying Biological Sequences Authors: Mukund Deshpande and George Karypis Speaker: Sarah Chan CSIS DB Seminar May 31, 2002

2 Presentation Outline  Introduction  Traditional Approaches (kNN, Markov Models) to Sequence Classification  Feature Based Sequence Classification  Experimental Evaluation  Conclusions

3 Introduction  The amount of biological sequence data available in public databases is increasing exponentially GenBank: 16 billion DNA base-pairs PIR: over 230,000 protein sequences  Strong sequence similarity often translates to functional and structural relations  Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences E.g. to assign a protein sequence to a protein family

4 Introduction  K-nearest neighbor, Markov models and hidden Markov models have been used extensively They take into account the sequential constraints present in the datasets  Motivation: few attempts have been made to use traditional machine learning classification algorithms such as decision trees and support vector machines They were thought to be incapable of modeling the sequential nature of the datasets

5 Focus of This Paper  To evaluate some widely used sequence classification algorithms K-nearest neighbor Markov models  To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier

6 Problem Definition - Sequence Classification  A sequence S_r = (x_1, x_2, x_3, ..., x_l) is an ordered list of symbols  The alphabet Σ of symbols: known in advance and of fixed size N  Each sequence S_r has a class label C_r  Assumption: two class labels only (C+, C-)  Goal: to correctly assign a class label to a test sequence

7 Approach 1: K Nearest Neighbor (KNN) Classifiers  To classify a test sequence S_r Locate the K training sequences most similar to S_r Assign to S_r the class label that occurs most often among those K sequences  Key task: to compute the similarity between two sequences

8 Approach 1: K Nearest Neighbor (KNN) Classifiers  Alignment score as similarity function Compute an optimal alignment between two sequences (by dynamic programming, hence computationally expensive), and then Score this alignment: the score is a function of the no. of matched and unmatched symbols in the alignment

9 Approach 1: K Nearest Neighbor (KNN) Classifiers  Two variations Global alignment score  Align sequences across their entire length  Can capture position specific patterns  Need to be normalized due to varying sequence lengths Local alignment score  Only portions of two sequences are aligned  Can capture small substrings of symbols which are present in the two sequences but not necessarily at the same position
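
A minimal Python sketch of the global-alignment variant, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) and majority voting over the K nearest training sequences; the paper's actual scoring matrices and normalization are not reproduced here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, normalized by sequence length."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap              # gaps along the first sequence
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap              # gaps along the second sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    # Normalization is needed because sequence lengths vary (as the slide notes).
    return dp[n][m] / max(n, m)

def knn_classify(test_seq, train_seqs, train_labels, k=5):
    """Assign the class label occurring most often among the K most similar sequences."""
    ranked = sorted(zip(train_seqs, train_labels),
                    key=lambda pair: global_alignment_score(test_seq, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)
```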

10 Approach 2.1: Simple Markov Chain Classifiers  To build a simple Markov chain based classification model Partition the training sequences according to their class labels Build a simple Markov chain (M) for each smaller dataset  To classify a test sequence S_r Compute the likelihood of S_r being generated by each Markov chain M, i.e. P(S_r | M) Assign to S_r the class label associated with the Markov chain that gives the highest likelihood

11 Approach 2.1: Simple Markov Chain Classifiers  Log-likelihood ratio (for two-class problems): L(S_r) = log [ P(S_r | M+) / P(S_r | M-) ]  If L(S_r) >= 0, then C_r = C+, else C_r = C-  Markov principle (for a 1st order Markov chain): each symbol in a sequence depends only on its preceding symbol, so P(S_r | M) = P(x_1) · Π_{i=2..l} P(x_i | x_{i-1})

12 Approach 2.1: Simple Markov Chain Classifiers  Transition probability for the state pair (x_{i-1}, x_i): P(x_i | x_{i-1})  Each symbol is associated with a state  A Transition Probability Matrix (TPM) is built for each class

13 Approach 2.1: Simple Markov Chain Classifiers  Example
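
The worked example on this slide is not reproduced here; the following minimal Python sketch illustrates the same idea, assuming Laplace smoothing for unseen transitions (an implementation detail not taken from the paper): one transition probability matrix is estimated per class, and a test sequence is assigned to the class whose chain gives the higher log-likelihood.

```python
from collections import defaultdict
from math import log

def train_markov_chain(seqs, alphabet):
    """Estimate P(x_i | x_{i-1}) from the training sequences of one class."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    tpm = {}
    for prev in alphabet:
        total = sum(counts[prev][c] for c in alphabet)
        # Laplace smoothing keeps log-probabilities finite for unseen transitions
        tpm[prev] = {c: (counts[prev][c] + 1) / (total + len(alphabet)) for c in alphabet}
    return tpm

def log_likelihood(seq, tpm):
    return sum(log(tpm[prev][cur]) for prev, cur in zip(seq, seq[1:]))

def classify(seq, tpm_pos, tpm_neg):
    """Class is C+ when the log-likelihood ratio L(S_r) is non-negative."""
    return '+' if log_likelihood(seq, tpm_pos) >= log_likelihood(seq, tpm_neg) else '-'

# Toy usage with hypothetical DNA sequences
alphabet = ['A', 'C', 'G', 'T']
tpm_pos = train_markov_chain(['ACGTACGT', 'ACGTCGTA'], alphabet)
tpm_neg = train_markov_chain(['AAAATTTT', 'AATTAATT'], alphabet)
print(classify('ACGTACG', tpm_pos, tpm_neg))
```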

14 Approach 2.1: Simple Markov Chain Classifiers  Higher (k-th) order Markov chain Transition probability for a symbol x_i is computed by looking at its k preceding symbols No. of states = N^k, each associated with a sequence of k symbols Size of TPM = N^(k+1) (N^k rows x N columns) Pros: better classification accuracy, since longer ordering constraints are captured Cons: the no. of states grows exponentially with the order → many infrequent states → poor probability estimates

15 Approach 2.2: Interpolated Markov Models (IMM)  Build a series of Markov chains starting from the 0th order up to the k-th order  Transition probability for a symbol: P(x_i | x_{i-1}, x_{i-2}, ..., x_1, IMM_k) = weighted sum of the transition probabilities of the different order chains, from the 0th order up to the k-th order Weights: often based on the distribution of the different states in the various order Markov models; the right weighting method appears to be dataset dependent
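
A minimal sketch of how an interpolated transition probability could be computed, assuming fixed hand-picked weights; the actual weighting schemes evaluated in the paper are more elaborate and dataset dependent.

```python
def imm_probability(context, symbol, chain_probs, weights):
    """Weighted sum of transition probabilities from the 0th up to the k-th order chain.

    chain_probs[j] maps a length-j context string to a dict of P(symbol | context);
    weights[j] is the interpolation weight of the order-j chain (weights sum to 1).
    """
    prob = 0.0
    for order, w in enumerate(weights):
        ctx = context[-order:] if order > 0 else ''
        table = chain_probs[order].get(ctx, {})
        prob += w * table.get(symbol, 0.0)
    return prob

# Toy usage: a 0th and 1st order chain over {A, C, G, T} with equal weights
chain_probs = [
    {'': {'A': 0.4, 'C': 0.2, 'G': 0.2, 'T': 0.2}},    # 0th order
    {'A': {'A': 0.1, 'C': 0.6, 'G': 0.2, 'T': 0.1}},   # 1st order (partial)
]
print(imm_probability('A', 'C', chain_probs, weights=[0.5, 0.5]))
```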

16 Approach 2.3: Selective Markov Models (SMM)  Build various order Markov chains  Prune non-discriminatory states from the higher order chains (how will be explained shortly)  Conditional probability P(x_i | x_{i-1}, x_{i-2}, ..., x_1, SMM_k) is the probability given by the highest order chain among the remaining (non-pruned) states

17 Approach 2.3: Selective Markov Models (SMM)  Key task: to decide which states are non-discriminatory  Simplest way: use a frequency threshold and prune all states which occur less often than it  Method used in the experiments: specify the frequency threshold as a parameter; a state-transition pair is kept only if it occurs at least that many times more frequently than its expected frequency under a uniform distribution
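
A minimal sketch of this frequency-based pruning, assuming that a pair's expected frequency under a uniform distribution is total_count / N^(k+1) for an order-k chain over an alphabet of size N; the function and variable names are illustrative only.

```python
def prune_states(pair_counts, alphabet_size, order, threshold):
    """Keep a (state, symbol) pair only if it occurs `threshold` times more often than expected."""
    total = sum(pair_counts.values())
    expected = total / (alphabet_size ** (order + 1))   # uniform-distribution expectation
    return {pair: c for pair, c in pair_counts.items() if c >= threshold * expected}

# Toy usage for a 1st order chain over {A, C, G, T}: 16 possible (state, symbol) pairs
counts = {('A', 'C'): 30, ('C', 'G'): 12, ('G', 'T'): 2, ('T', 'A'): 4}
print(prune_states(counts, alphabet_size=4, order=1, threshold=2.0))
```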

18 Approach 3: Feature Based Sequence Classification  Sequences are transformed into a form that can be used by traditional machine learning algorithms  Extraction of features that take the sequential nature of the sequences into account  The features are motivated by Markov models; support vector machines (SVMs) are used as the classifier

19 Approach 3: Feature Based Sequence Classification  SVM A relatively new learning algorithm by Vapnik (1995) Objective: Given a training set in a vector space, find the best hyperplane (with max. margin) that separates two classes Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (QP) Well-suited for high dimensional data Require lots of memory and CPU time

20 Approach 3: Feature Based Sequence Classification  SVM – Maximum margin (a) A separating hyperplane with a small margin. (b) A separating hyperplane with a larger margin. A better generalization is expected from (b).

21 Approach 3: Feature Based Sequence Classification  SVM – Feature space mapping Mapping data into a higher dimensional feature space (by using kernel functions) where they are linearly separable.

22 Approach 3: Feature Based Sequence Classification  Vector space view: the simple 1st order Markov chain classifier is equivalent to L(S_r) = u^T w, where u and w are vectors of length N^2 and each dimension corresponds to a unique pair of symbols  Element of u: frequency of that symbol pair in the sequence  Element of w: log-ratio of the conditional probabilities for the + and - classes

23 Approach 3: Feature Based Sequence Classification  Vector space view - Example (simple 1 st order Markov chain)

24 Approach 3: Feature Based Sequence Classification  Vector space view All variants of Markov chains described previously can be transformed in a similar manner  Dimensionality of the new space: For higher order Markov chains: N^(k+1) For IMM: N + N^2 + ... + N^(k+1) For SMM: the no. of non-pruned states Each sequence is viewed as a frequency vector Allows the use of any traditional classifier that operates on objects represented as multi-dimensional vectors
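
A minimal end-to-end sketch of this feature-based approach, using scikit-learn's linear SVM purely for illustration (the paper's actual SVM implementation and parameters are not reproduced): each sequence becomes a frequency vector over all length-(k+1) substrings, i.e. the states of an order-k Markov chain.

```python
from itertools import product
from sklearn.svm import LinearSVC

def sequence_to_vector(seq, alphabet, k=1):
    """Frequency vector over all length-(k+1) symbol combinations (N^(k+1) dimensions)."""
    dims = [''.join(p) for p in product(alphabet, repeat=k + 1)]
    index = {d: i for i, d in enumerate(dims)}
    vec = [0] * len(dims)
    for i in range(len(seq) - k):
        vec[index[seq[i:i + k + 1]]] += 1
    return vec

# Toy training data with hypothetical DNA sequences and two classes
alphabet = ['A', 'C', 'G', 'T']
train_seqs = ['ACGTACGT', 'ACGTCGTA', 'AAAATTTT', 'AATTAATT']
train_labels = ['+', '+', '-', '-']

X = [sequence_to_vector(s, alphabet, k=1) for s in train_seqs]
clf = LinearSVC().fit(X, train_labels)
print(clf.predict([sequence_to_vector('ACGTAC', alphabet, k=1)]))
```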

25 Experimental Evaluation  5 different datasets, each with 2-3 classes Table 1

26 Experimental Evaluation  Methodology Performance of algorithms was measured using classification accuracy Ten-way cross validation was used Experiments were restricted to two class problems
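
A minimal sketch of this measurement protocol with scikit-learn, using randomly generated placeholder data in place of the real feature vectors; only the ten-way cross-validated accuracy computation is illustrated.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Placeholder data: 100 "sequences" as 16-dimensional frequency vectors, two classes.
# With the real datasets these vectors would come from the feature extraction
# described in the feature-based approach above.
rng = np.random.default_rng(0)
X = rng.random((100, 16))
y = rng.integers(0, 2, size=100)

scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring='accuracy')
print('mean ten-way cross-validation accuracy:', scores.mean())
```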

27 KNN Classifiers  “Cosine” Sequence: frequency vector of the different symbols in it Similarity between sequences: cosine of the two vectors Does not take sequential constraints into account Table 2

28 KNN Classifiers 1. ‘Global’ outperforms the other two for all K 2. For PS-HT and PS-TS, performance of ‘Cosine’ is comparable to that of ‘Global’ as limited sequential info. can be exploited Table 2

29 KNN Classifiers 3. ‘Local’ performs very poorly, esp. on protein sequences → not good to base classification only on a single substring 4. Accuracy decreases when K increases Table 2

30 Simple Markov Chains vs. Their Feature Spaces 1. Accuracy improves with the order of each model Only exceptions: For PS-*, accuracy peaks at the 2nd/1st order, as the sequences are very short → higher order models and their feature spaces contain very few examples for calculating transition probabilities Table 3

31 Simple Markov Chains vs. Their Feature Spaces 2. SVM achieves higher accuracies than simple Markov chains (often 5-10% improvement) Table 3

32 IMM vs. Their Feature Spaces 1. SVM achieves higher accuracies than IMM for most datasets Exceptions: For P-*, higher order IMM models do considerably better (no explanation provided) Table 4

33 IMM vs. Their Feature Spaces 2. Simple Markov chain based classifiers usually outperform IMM Only exceptions: PS-*, since the sequences are comparatively short → greater benefit in using different order Markov states Table 4

34 IMM Based Classifiers vs. Simple Markov Chain Based Classifiers Table 4 (IMM based) and part of Table 3 (simple Markov chain based)

35 SMM vs. Their Feature Spaces Table 5a  The pruning parameter is the frequency threshold used in pruning states of the different order Markov chains

36 Table 5b Table 5c

37 SMM vs. Their Feature Spaces 1. SVM usually achieves higher accuracies than SMM 2. For many problems SMM achieves higher accuracy as the frequency threshold increases, but the gains are rather small Maybe because the pruning strategy is too simple

38 Conclusions 1. The SVM classifier used on the feature spaces of the different Markov chains (and their variants) achieves substantially better accuracies than the corresponding Markov chain classifier.  The linear classification models learnt by SVM are better than those learnt by the Markov chain based approaches

39 Conclusions 2. Proper feature selection can improve accuracy, but an increase in the amount of available information does not necessarily guarantee it (except for PS-*). The max. accuracy attained by SVM on IMM's feature spaces is always lower than that attained on the feature spaces of the simple Markov chains. Even with simple frequency based feature selection, as done in SMM, the overall accuracy is higher.

40 Conclusions 3. KNN with global alignments can take advantage of the relative positions of symbols in aligned sequences Simple experiment: an SVM incorporating info. about the position of symbols was able to achieve an accuracy > 97% → position specific info. can be useful for building effective classifiers for biological sequences.
Dataset | Highest accuracy | Scheme which achieves the highest accuracy
S-EI | 0.9390 | KNN (K=5, with global sequence alignment)
P-MS | 0.9719 | KNN (K=1, with global sequence alignment)

41 References  Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.  Ming-Hsuan Yang. Presentation entitled “Gentle Guide to Support Vector Machines”.  Alexander Johannes Smola. Presentation entitled “Support Vector Learning: Concepts and Algorithms”.

