Mismatch String Kernels for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.

Presentation transcript:

Mismatch String Kernels for SVM Protein Classification. Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble. Presented by Pradeep Anand Ravindranath and Mei Sze Lam.

Introduction. Problem in computational biology: classification of proteins into functional and structural classes based on homology of protein sequence data.

Methods for Protein Classification and Homology Detection. Pairwise sequence alignment; profiles for protein families; consensus patterns using motifs; profile HMMs.

Focus: remote homology detection.

How is the problem handled currently? Fisher-SVM: one of the successful discriminative techniques for protein classification, and the best-performing method for remote homology detection.

Fisher-SVM. Build a profile HMM for the positive training sequences, defining a log-likelihood function $\log P(x \mid \theta)$ for any protein sequence $x$. $\theta_0$ is the maximum-likelihood estimate of the model parameters.

The gradient vector $\nabla_\theta \log P(x \mid \theta)\,\big|_{\theta = \theta_0}$ assigns to each (positive or negative) training sequence $x$ an explicit feature vector, called its Fisher scores. This feature mapping defines a kernel function, called the Fisher kernel. The Fisher kernel can then be used to train an SVM classifier.
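Written out, the construction on this slide looks roughly as follows. The plain inner product is an assumption made here for illustration: the slides only say that the feature mapping "defines a kernel function" and do not state whether a normalisation or a further kernel is applied on top of the Fisher scores.

```latex
U_x \;=\; \nabla_\theta \log P(x \mid \theta)\,\Big|_{\theta = \theta_0},
\qquad
K(x, y) \;=\; \langle U_x,\, U_y \rangle
```

Here $U_x$ (a symbol introduced for this sketch) is the Fisher score vector of sequence $x$; it has one coordinate per HMM parameter.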

Strengths. Combines biological information encoded in an HMM with the discriminative power of the SVM algorithm.

Negatives. Needs lots of data or sophisticated priors to train the HMM. Computing the kernel matrix is expensive, because calculating the Fisher scores requires computing the forward and backward probabilities from the Baum-Welch algorithm.

Mismatch-SVM. The (k,m)-mismatch kernel is based on a feature map to a vector space indexed by all possible subsequences of amino acids of a fixed length k. Each instance of a fixed k-length subsequence in an input sequence contributes to all feature coordinates differing from it by at most m mismatches.

Mismatch kernel. Thus, the mismatch kernel adds the biologically important idea of mismatching to the computationally simpler spectrum kernel. The paper describes how to compute the new kernel efficiently using a mismatch tree data structure for values of (k,m) useful in this application.

Advantages. By using the mismatch tree data structure, the kernel is fast enough to use on real datasets. Considerably less expensive than the Fisher kernel. Performance equal to Fisher-SVM. Outperforms the other methods compared (SAM-T98 and PSI-BLAST).

This kernel does not depend on any generative model and can be used for other sequence-based classification problems.

Feature Maps for Strings. The (k,m)-mismatch kernel is based on a feature map from the space of all finite sequences over an alphabet $A$ of size $|A| = l$ to the $l^k$-dimensional vector space indexed by the set of k-length subsequences ("k-mers") from $A$, where $A$ is the alphabet representing amino acids and $l$ is the number of amino acids.

If $\alpha$ is a k-mer, let $N_{(k,m)}(\alpha)$ be the set of all k-length sequences $\beta$ from $A$ differing from $\alpha$ by at most $m$ mismatches. We define our feature map $\Phi_{(k,m)}$ as $\Phi_{(k,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in A^k}$, where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to $N_{(k,m)}(\alpha)$, and $\phi_\beta(\alpha) = 0$ otherwise.

For a sequence $x$ of any length, we extend the map additively by summing the feature vectors for all the k-mers in $x$: $\Phi_{(k,m)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} \Phi_{(k,m)}(\alpha)$. The (k,m)-mismatch kernel is given by $K_{(k,m)}(x, y) = \langle \Phi_{(k,m)}(x), \Phi_{(k,m)}(y) \rangle$. For $m = 0$, we retrieve the k-spectrum kernel.
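The definitions above can be turned into a direct, brute-force computation. The following is a minimal sketch rather than the authors' implementation: it builds the sparse feature vector by enumerating the mismatch neighbourhood of every k-mer, which is simple but much slower than the mismatch tree. The names `AMINO_ACIDS`, `neighborhood`, `mismatch_features` and `mismatch_kernel` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet, l = 20


def neighborhood(alpha, m, alphabet=AMINO_ACIDS):
    """Return N_(k,m)(alpha): all strings within m mismatches of alpha."""
    if m == 0 or not alpha:
        return {alpha}
    head, tail = alpha[0], alpha[1:]
    # Keep the first letter: the mismatch budget is unchanged.
    result = {head + rest for rest in neighborhood(tail, m, alphabet)}
    # Substitute the first letter: one mismatch is spent.
    for c in alphabet:
        if c != head:
            result.update(c + rest for rest in neighborhood(tail, m - 1, alphabet))
    return result


def mismatch_features(x, k, m, alphabet=AMINO_ACIDS):
    """Sparse Phi_(k,m)(x): maps each k-mer beta to its feature coordinate."""
    features = Counter()
    for i in range(len(x) - k + 1):
        for beta in neighborhood(x[i:i + k], m, alphabet):
            features[beta] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet=AMINO_ACIDS):
    """K_(k,m)(x, y) as the inner product of the two sparse feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())


# With m = 0 this is the 3-spectrum kernel; the two sequences share the
# 3-mers ADL, DLA, LAL and ALG, each occurring once in both.
print(mismatch_kernel("EADLALGKAVF", "ADLALGADQVFNG", k=3, m=0))  # 4
```

Storing only the non-zero coordinates in a `Counter` avoids materialising the full $l^k$-dimensional vector.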

Fisher Scores and Spectrum Kernel. Even though the spectrum and mismatch feature maps are defined without any reference to a generative model, there is some similarity between the k-spectrum feature map and the Fisher scores associated with an order k-1 Markov chain model.

Efficient Computation of the Mismatch Kernel: Mismatch Tree Data Structure. The mismatch tree data structure is used to represent the feature space (the set of all k-mers) and to perform a lexical traversal of all k-mer instances occurring in the sample dataset that match each feature with up to m mismatches.

Example: Traversing the Mismatch Tree. Traversal for input sequence AVLALKAVLL, k=8, m=1.

Example: Computing the Kernel for a Pair of Sequences. Traversal of the trie for k=3 (m=0), with s1 = EADLALGKAVF and s2 = ADLALGADQVFNG. Descending the path A, D, L reaches the leaf for the 3-mer ADL, which occurs in both sequences. Update the kernel value K(s1, s2) by adding the contribution for feature ADL.

Efficiency Issues for Kernel Computation. Depth-first search implemented as a recursive function: efficient use of memory, so large datasets are not a problem.
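The depth-first traversal described on the last few slides can be sketched as a recursive function that never builds the tree explicitly. This is an illustrative reconstruction under stated assumptions, not the authors' code: each node is represented only by the list of k-mer instances still "alive" on the current path, stored as (sequence index, start offset, mismatches so far), and `mismatch_kernel_matrix` and `AMINO_ACIDS` are hypothetical names.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def mismatch_kernel_matrix(seqs, k, m, alphabet=AMINO_ACIDS):
    """Compute the full (k,m)-mismatch kernel matrix by depth-first traversal."""
    n_seqs = len(seqs)
    kernel = [[0] * n_seqs for _ in range(n_seqs)]
    # At the root every k-mer instance is alive with zero mismatches.
    root = [(s, i, 0) for s, x in enumerate(seqs) for i in range(len(x) - k + 1)]

    def descend(instances, depth):
        if not instances:
            return  # prune: no k-mer instance matches this path
        if depth == k:
            # Leaf = one feature beta; count surviving instances per sequence.
            counts = Counter(s for s, _, _ in instances)
            for s, count_s in counts.items():
                for t, count_t in counts.items():
                    kernel[s][t] += count_s * count_t
            return
        for letter in alphabet:
            survivors = []
            for s, start, mismatches in instances:
                new_mismatches = mismatches + (seqs[s][start + depth] != letter)
                if new_mismatches <= m:
                    survivors.append((s, start, new_mismatches))
            descend(survivors, depth + 1)

    descend(root, 0)
    return kernel


# Reproduces the k=3, m=0 example: the off-diagonal entry counts shared 3-mers.
K = mismatch_kernel_matrix(["EADLALGKAVF", "ADLALGADQVFNG"], k=3, m=0)
print(K[0][1])  # 4
```

Only the instance lists along the current root-to-leaf path are held in memory at any time, which is the memory advantage the slide attributes to the recursive depth-first search.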

Computational Cost. The theoretical (big-O) computational cost depends on the following quantities. The number of mismatches, m: as this increases, the computational cost increases exponentially. The number of different amino acids, l (the alphabet size). The number of characters in each subsequence (k-mer), k: the classifier breaks proteins up into k-mers of length 5 to 6. (Longer strings are handled by summing the feature vectors of their k-mers.) The total length of the sample data, N.

Computational Cost (cont'd). Worst-case scenario for M sequences, where M is just the number of sequences to be processed. Supposing the M sequences all have equal length n (so that N = M x n), the maximum number of non-zero entries is M x M x n = M x N.
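To make the exponential dependence on m concrete, the size of one k-mer's mismatch neighbourhood can be counted directly: choose which i positions to change, then one of the l - 1 other letters for each. This counting formula is a standard combinatorial fact added here for illustration, it is not stated on the slide, and `neighborhood_size` is a hypothetical helper name.

```python
from math import comb


def neighborhood_size(k, m, l=20):
    """Number of k-mers within m mismatches of a fixed k-mer over l letters."""
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))


# Each extra allowed mismatch multiplies the work per k-mer considerably.
for m in range(3):
    print(m, neighborhood_size(k=5, m=m))
# prints:
# 0 1
# 1 96
# 2 3706
```

Every k-mer instance in the data touches this many feature coordinates, which is why only small values of m are practical.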

Training and Test Data. Classes of superfamily proteins (taken from the SCOP database); families of proteins, each belonging to 1 of 33 superfamilies. Positive (+): the class we are interested in. Negative (-): any class other than that class. For each class, we want to know whether a given protein sequence belongs to that class (yes or no). 160 experiments were performed on 33 classes.

Implementation and Comparison of Methods. We test the mismatch kernel with a publicly available SVM implementation. Four methods are compared: the mismatch kernel and the Fisher kernel (both use the SVM implementation), SAM-T98 (HMM), and PSI-BLAST (alignment scoring).

Performance Measurement: ROC and ROC50. Show on board: the closer the score is to 1, the better; a higher score means more true positives relative to false positives.
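As a rough illustration of the two scores, the sketch below computes the area under the ROC curve up to the first n false positives, normalised so that a perfect ranking scores 1. Several things are assumptions made here rather than taken from the slides: `roc_n` is a hypothetical function name, tied scores are not treated specially, and n is clamped to the number of negatives when fewer than n are available.

```python
def roc_n(scores, labels, n=None):
    """Normalised ROC area up to the first n false positives.

    labels are truthy for positives; n=None uses every negative, which
    gives the ordinary ROC score, and n=50 gives ROC50.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, label in ranked if label)
    total_neg = len(ranked) - total_pos
    if n is None or n > total_neg:
        n = total_neg
    if total_pos == 0 or n == 0:
        raise ValueError("need at least one positive and one negative example")

    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break
    return area / (n * total_pos)


scores = [0.9, 0.8, 0.7, 0.6]
print(roc_n(scores, [1, 1, 0, 0]))       # 1.0, perfect ranking
print(roc_n(scores, [1, 0, 1, 0]))       # 0.75
print(roc_n(scores, [1, 0, 1, 0], n=1))  # 0.5
```

Restricting the area to the first few false positives rewards methods that place true homologs at the very top of the ranking.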

Performance Comparison. Comparison of four homology detection methods. Many of the Mismatch-SVM and Fisher kernel ROC scores fall close to 1, meaning there is a low false-positive (FP) error rate. Both classifiers manage to classify almost all of the 33 classes with ROC score > 0.85.

Mismatch vs. Spectrum (ROC and ROC50 plots). The mismatch kernel outperforms the spectrum kernel.

Mismatch vs. Fisher (ROC and ROC50 plots). No significant difference!

Discussion & Conclusion. What was it for? Constructing a kernel for homology detection. What was achieved? A kernel that is equal in performance to the best known classifier but with a lower computational cost. Future work: since it does not depend on a generative model (unlike Fisher), it can easily be used for other sequence problems, e.g. splice site prediction. Since it is computationally cheaper (i.e. faster), it can be used for practical biological purposes, e.g. multiclass prediction.