
Support Vector Machine and String Kernels for Protein Classification Christina Leslie Department of Computer Science Columbia University

Learning Sequence-based Protein Classification
Problem: classification of protein sequence data into families and superfamilies
Motivation: many proteins have been sequenced, but their structure/function often remains unknown
Goal: infer structure/function from sequence-based classification

Sequence Data versus Structure and Function
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Sequences for the four chains of human hemoglobin
Tertiary structure; function: oxygen transport

Structural Hierarchy
SCOP: Structural Classification of Proteins
Interested in superfamily-level homology – remote evolutionary relationship

Learning Problem
Reduce to a binary classification problem: positive (+) if an example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection
Use a supervised learning approach to train a classifier
(Diagram: labeled training sequences → learning algorithm → classification rule)

Two Supervised Learning Approaches to Classification
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn the decision boundary between positive and negative examples. Uses both positive and negative training sequences.

Hidden Markov Models for Protein Families
Standard generative model: profile HMM
Training data: multiple alignment of examples from the family
Columns of the alignment determine model topology

7LES_DROME  LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK
ABL1_CAEEL  IIMHNKLGGGQYGDVYEGYWK      RHDCTIAVKALK
BFR2_HUMAN  LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD.....A
TRKA_HUMAN  IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK

Profile HMMs for Protein Families
Match, insert and delete states
Observed variables: symbol sequence, x_1 .. x_L
Hidden variables: state sequence, π_1 .. π_L
Parameters: transition and emission probabilities
Joint probability: P(x, π | θ)
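For context, the joint probability on this slide factors in the standard way for HMMs (a sketch in the usual notation, with transition probabilities a, emission probabilities e, and π_0 the begin state; silent delete states, which emit nothing, are glossed over here for simplicity):

```latex
P(x, \pi \mid \theta) \;=\; \prod_{i=1}^{L} a_{\pi_{i-1}\,\pi_i}\, e_{\pi_i}(x_i)
```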

HMMs: Pros and Cons
Ladies and gentlemen, boys and girls: let us leave something for next week…

Discriminative Learning
Discriminative approach: train on both positive and negative examples to learn the classifier
Modern computational learning theory
Goal: learn a classifier that generalizes well to new examples
Do not use training data to estimate parameters of a probability distribution – the "curse of dimensionality"

Learning Theoretic Formalism for the Classification Problem
Training and test data drawn i.i.d. from a fixed but unknown probability distribution D on X × {-1, 1}
Labeled training set S = {(x_1, y_1), …, (x_m, y_m)}

Support Vector Machines (SVMs)
We use the SVM as our discriminative learning algorithm
Training examples are mapped to a (usually high-dimensional) feature space by a feature map F(x) = (F_1(x), …, F_d(x))
Learn a linear decision boundary: trade-off between maximizing the geometric margin of the training data and minimizing margin violations

SVM Classifiers
Linear classifier defined in feature space by f(x) = ⟨w, x⟩ + b
The SVM solution gives w = Σ_i α_i x_i as a linear combination of support vectors, a subset of the training vectors

Advantages of SVMs
Large margin classifier: leads to good generalization (performance on test sets)
Sparse classifier: depends only on support vectors; leads to fast classification and good generalization
Kernel method: as we'll see, we can introduce sequence-based kernel functions for use with SVMs

Hard Margin SVM
Assume the training data are linearly separable in feature space
Space of linear classifiers f_{w,b}(x) = ⟨w, x⟩ + b, giving the decision rule h_{w,b}(x) = sign(f_{w,b}(x))
If |w| = 1, the geometric margin of the training data for h_{w,b} is γ_S = min_S y_i (⟨w, x_i⟩ + b)

Hard Margin Optimization
Hard margin SVM optimization: given training data S, find the linear classifier h_{w,b} with maximal geometric margin γ_S
Convex quadratic dual optimization problem
Sparse classifier in terms of support vectors
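The dual problem is not spelled out on the slide; for reference, the standard hard-margin dual is

```latex
\max_{\alpha \ge 0} \;\; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j \rangle
\qquad \text{subject to} \quad \sum_{i=1}^{m} \alpha_i y_i = 0,
```

with w = Σ_i α_i y_i x_i (the previous slide's weights absorb the labels, α_i ← α_i y_i); the support vectors are exactly the training points with α_i > 0.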

Hard Margin Generalization Error Bounds
Theorem [Cristianini, Shawe-Taylor]: Fix a real value M > 0. For any probability distribution D on X × {-1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any linear hypothesis h with geometric margin γ_S ≥ M on S has error no more than Err_D(h) ≤ ε(m, δ, M, R), provided that m is big enough

SVMs for Protein Classification
Want to define a feature map from the space of protein sequences to a vector space
Goals:
Computational efficiency
Competitive performance with known methods
No reliance on a generative model – a general method for sequence-based classification problems

Spectrum Feature Map for SVM Protein Classification
New feature map based on the spectrum of a sequence
1. C. Leslie, E. Eskin, and W. Noble, The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pacific Symposium on Biocomputing, 2002.
2. C. Leslie, E. Eskin, J. Weston and W. Noble, Mismatch String Kernels for SVM Protein Classification. NIPS 2002.

The k-Spectrum of a Sequence
Feature map for the SVM based on the spectrum of a sequence
The k-spectrum of a sequence is the set of all k-length contiguous subsequences that it contains
Feature map is indexed by all possible k-length subsequences ("k-mers") from the alphabet of amino acids
Dimension of feature space = 20^k
Generalizes to any sequence data
Example (k = 3): AKQDYYYYEI → AKQ, KQD, QDY, DYY, YYY, YYE, YEI

k-Spectrum Feature Map
Feature map for the k-spectrum with no mismatches: for sequence x, F^(k)(x) = (F_t(x))_{k-mers t}, where F_t(x) = # occurrences of t in x
Example: F^(3)(AKQDYYYYEI) = (0, 0, …, 1, …, 1, …, 2), with coordinates indexed by AAA, AAC, …, AKQ, …, DYY, …, YYY
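A minimal Python sketch of this feature map (the function name is illustrative, not from the papers):

```python
from collections import Counter

def spectrum_features(seq, k):
    """k-spectrum feature map: count occurrences of each k-mer in seq.
    Absent k-mers implicitly have count 0 in the 20**k-dimensional space."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# The slide's example with k = 3: YYY occurs twice, the other 3-mers once
print(spectrum_features("AKQDYYYYEI", 3))
# Counter({'YYY': 2, 'AKQ': 1, 'KQD': 1, 'QDY': 1, 'DYY': 1, 'YYE': 1, 'YEI': 1})
```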

(k,m)-Mismatch Feature Map
Feature map for the k-spectrum, allowing m mismatches: if s is a k-mer, F^(k,m)(s) = (F_t(s))_{k-mers t}, where F_t(s) = 1 if s is within m mismatches of t, 0 otherwise
Extend additively to longer sequences x by summing over all k-mers s in x
Example: the (3,1)-neighborhood of AKQ includes DKQ, EKQ, AAQ, AKY, …
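A brute-force sketch of this map (the names are illustrative; the actual method never enumerates neighborhoods explicitly but uses the trie traversal described below):

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def mismatch_neighborhood(kmer, m, alphabet=ALPHABET):
    """All k-mers within Hamming distance m of kmer."""
    neighbors = {kmer}
    for _ in range(m):  # each round allows one more substitution
        neighbors |= {n[:i] + a + n[i + 1:]
                      for n in neighbors
                      for i in range(len(n))
                      for a in alphabet}
    return neighbors

def mismatch_features(seq, k, m):
    """(k,m)-mismatch feature map: each k-mer instance in seq contributes
    1 to every coordinate t in its mismatch neighborhood; contributions
    are summed additively over the whole sequence."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats.update(mismatch_neighborhood(seq[i:i + k], m))
    return feats
```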

The Kernel Trick
To train an SVM, we can use a kernel rather than an explicit feature map
For sequences x, y and feature map F, the kernel value is the inner product in feature space: K(x, y) = ⟨F(x), F(y)⟩
Gives a sequence similarity score
Example of a string kernel
Can be computed efficiently via traversal of a trie data structure

Computing the (k,m)-Spectrum Kernel
Use a trie (retrieval tree) to organize a lexical traversal of all instances of k-length patterns (with mismatches) in the training data
Each path down to a leaf in the trie corresponds to a coordinate of the feature map
Kernel values for all training sequences are updated at each leaf node
If m = 0, traversal time for the trie is linear in the size of the training data
Traversal time grows exponentially with m, but usually small values of m are useful
Depth-first traversal makes efficient use of memory
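A compact sketch of this traversal, reconstructed from the description above rather than taken from the authors' code: the trie is implicit in the recursion, and `live` tracks, for every k-mer instance in the data, the mismatches accumulated along the current path; instances exceeding m are pruned.

```python
def mismatch_kernel_matrix(seqs, k, m, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Kernel matrix for the (k,m)-mismatch kernel via depth-first trie traversal."""
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]
    # every k-mer instance starts at the root with zero mismatches:
    # (sequence index, start position, mismatch count)
    root = [(i, p, 0) for i, s in enumerate(seqs) for p in range(len(s) - k + 1)]

    def descend(depth, live):
        if not live:
            return  # dead subtree: no surviving instance, prune
        if depth == k:
            # leaf = one coordinate t of the feature map;
            # counts[i] is F_t(seqs[i]), so update all pairwise kernel values
            counts = [0] * n
            for i, _, _ in live:
                counts[i] += 1
            for i in range(n):
                for j in range(n):
                    K[i][j] += counts[i] * counts[j]
            return
        for a in alphabet:  # branch on the next trie symbol
            descend(depth + 1,
                    [(i, p, mm + (seqs[i][p + depth] != a))
                     for i, p, mm in live
                     if mm + (seqs[i][p + depth] != a) <= m])

    descend(0, root)
    return K
```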

Example: Traversing the Mismatch Tree
Traversal for input sequence: AVLALKAVLL, k = 8, m = 1

Example: Computing the Kernel for a Pair of Sequences
Traversal of the trie for k = 3 (m = 0):
s1: EADLALGKAVF
s2: ADLALGADQVFNG
Following the path A → D → L, update the kernel value K(s1, s2) by adding the contribution for feature ADL
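For m = 0 the trie's answer can be checked against the explicit feature maps; a sketch using the slides' pair of sequences, whose shared 3-mers are ADL, DLA, LAL and ALG:

```python
from collections import Counter

def spectrum_kernel(x, y, k):
    """K(x, y) = <F(x), F(y)> via explicit k-spectrum feature maps;
    the trie traversal computes the same sum leaf by leaf."""
    count = lambda s: Counter(s[i:i + k] for i in range(len(s) - k + 1))
    fx, fy = count(x), count(y)
    return sum(c * fy[t] for t, c in fx.items())

print(spectrum_kernel("EADLALGKAVF", "ADLALGADQVFNG", 3))  # -> 4
```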

Fast Prediction
SVM training determines the subset of training sequences corresponding to support vectors, and their weights: (x_i, α_i), i = 1 .. r
Prediction with no mismatches: represent the SVM classifier by a hash table mapping support k-mers to weights; test sequences can then be classified in linear time via look-up of their k-mers
Prediction with mismatches: represent the classifier as a sparse trie; traverse the k-mer paths that occur (with mismatches) in the test sequence
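A sketch of the no-mismatch case (function names are illustrative): the trained classifier f(x) = Σ_i α_i K(x_i, x) + b collapses into a single k-mer → weight table, so scoring a test sequence costs one hash look-up per k-mer.

```python
from collections import defaultdict

def build_score_table(support, k):
    """support: list of (sequence, alpha) pairs from SVM training.
    Each k-mer t gets weight sum_i alpha_i * F_t(x_i)."""
    table = defaultdict(float)
    for seq, alpha in support:
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]] += alpha
    return table

def predict(seq, table, k, b=0.0):
    """Linear-time classification: sum the per-k-mer scores of seq plus bias."""
    score = b + sum(table.get(seq[i:i + k], 0.0)
                    for i in range(len(seq) - k + 1))
    return 1 if score >= 0 else -1
```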

Experimental Design
Tested with a set of experiments on the SCOP dataset
Experiments designed to ask: could the method discover a new family of a known superfamily?
Diagram from Jaakkola et al.

Experiments
160 experiments for 33 target families from 16 superfamilies
Compared results against:
SVM-Fisher
SAM-T98 (HMM-based method)
PSI-BLAST (heuristic alignment-based method)

Conclusions from SCOP Experiments
The spectrum kernel with an SVM performs as well as the best-known methods for the remote homology detection problem
Efficient computation of the string kernel
Fast prediction: can precompute per-k-mer scores and represent the classifier as a lookup table, giving linear-time prediction for both the spectrum kernel and the (unnormalized) mismatch kernel
A general approach to classification problems for sequence data

Feature Selection Strategies
Explicit feature filtering: compute a score for each k-mer, based on training data statistics, during the trie traversal, and filter as we compute the kernel
Feature elimination as a wrapper for SVM training: eliminate features corresponding to small components w_i of the vector w defining the SVM classifier
Kernel principal component analysis: project onto principal components prior to training

Ongoing and Future Work
New families of string kernels and mismatching schemes
Applications to other sequence-based classification problems, e.g. splice site prediction
Feature selection: explicit and implicit dimension reduction
Other machine learning approaches to using sparse string-based models for classification, e.g. boosting with string-based classifiers