Semi-supervised protein classification using cluster kernels. Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff and William Stafford Noble.

Presentation transcript:

Semi-supervised protein classification using cluster kernels Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff and William Stafford Noble

Why Semi-supervised & How
Unlabeled data may contain information (structure and distribution information) that is helpful for improving classification performance.
Labeled instances are important hints for clustering.
Cluster assumption: two points that can be connected by a high-density path (i.e. that lie in the same cluster) are likely to have the same label, and the decision boundary should lie in a low-density region.

Why Semi-supervised & How (Cont.)
Recent work in semi-supervised learning has focused on changing the representation given to a classifier by taking into account the structure described by the unlabeled data.
An example: profile hidden Markov models (HMMs) can be trained iteratively on both positively labeled and unlabeled examples by pulling in close homologs and adding them to the positive set.

Main Results of the Paper
We develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences.
We show that these methods greatly improve classification performance.
We achieve performance equal or superior to previously presented cluster kernel methods while achieving far greater computational efficiency.

Neighborhood Kernel & Bagged Kernel
The neighborhood kernel uses averaging over a neighborhood of sequences defined by a local sequence similarity measure.
The bagged kernel uses bagged clustering of the full sequence dataset to modify the base kernel.
The main idea is to change the distance metric so that the relative distance between two points is smaller if the points are in the same cluster.

Neighborhood Kernel & Bagged Kernel (Cont.)
The kernels make use of two complementary (dis)similarity measures:
A base kernel representation that implicitly exploits features useful for discrimination between classes (the mismatch string kernel).
A distance measure that describes how close examples are to each other (BLAST or PSI-BLAST E-values).
We note that string kernels have proved to be powerful representations for SVM classification, but they do not give sensitive pairwise similarity scores like the BLAST family of methods.
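For concreteness, here is a naive Python sketch of the (k, m)-mismatch feature map used as the base kernel (the paper uses k = 5, m = 1). This is only a brute-force illustration of the definition; the real implementation uses an efficient mismatch-tree data structure, and the function names here are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hamming_ball(kmer, m):
    """All k-mers within Hamming distance m of kmer (handles m <= 1 only)."""
    yield kmer
    if m >= 1:
        for pos in range(len(kmer)):
            for a in AMINO_ACIDS:
                if a != kmer[pos]:
                    yield kmer[:pos] + a + kmer[pos + 1:]

def mismatch_features(seq, k=5, m=1):
    """(k, m)-mismatch feature map: every k-mer in seq adds a count to
    each k-mer within Hamming distance m of it."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in hamming_ball(seq[i:i + k], m):
            counts[neighbor] += 1
    return counts

def mismatch_kernel(x, y, k=5, m=1):
    """Base kernel value: inner product of the two sparse feature vectors."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    return sum(c * fy.get(kmer, 0) for kmer, c in fx.items())
```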

Neighborhood Kernel (1)
Use a standard sequence similarity measure such as BLAST or PSI-BLAST to define a neighborhood for each input sequence.
The neighborhood Nbd(x) of sequence x is the set of sequences x′ whose similarity score to x is below a fixed E-value threshold, together with x itself.
Given a fixed original feature representation, x is represented by the average of the feature vectors for the members of its neighborhood.
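In symbols, a LaTeX sketch of this definition (τ denotes the E-value cutoff and Φ_orig the base feature map; the kernel follows by expanding the inner product):

```latex
\mathrm{Nbd}(x) = \{\, x' : \text{E-value}(x, x') \le \tau \,\} \cup \{x\},
\qquad
\Phi_{\mathrm{nbd}}(x) = \frac{1}{|\mathrm{Nbd}(x)|} \sum_{x' \in \mathrm{Nbd}(x)} \Phi_{\mathrm{orig}}(x'),
\qquad
K_{\mathrm{nbd}}(x, y) = \frac{1}{|\mathrm{Nbd}(x)|\,|\mathrm{Nbd}(y)|}
  \sum_{x' \in \mathrm{Nbd}(x)} \; \sum_{y' \in \mathrm{Nbd}(y)} K_{\mathrm{orig}}(x', y').
```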

Neighborhood Kernel (2)
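Since the averaging happens in feature space, the neighborhood kernel can be computed directly from the base kernel matrix. Below is a minimal Python sketch under stated assumptions: a precomputed base kernel over all labeled and unlabeled sequences and precomputed E-value neighborhoods; the function and variable names are mine, not the paper's.

```python
import numpy as np

def neighborhood_kernel(K_orig, neighborhoods):
    """Neighborhood kernel: average the base kernel over all pairs drawn
    from the two sequences' E-value neighborhoods.

    K_orig        : (n, n) base kernel matrix (e.g. the mismatch kernel)
                    over labeled + unlabeled sequences.
    neighborhoods : list of index lists; neighborhoods[i] contains i itself
                    plus every sequence within the E-value cutoff of i.
    """
    n = K_orig.shape[0]
    K_nbd = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K_nbd[i, j] = K_orig[np.ix_(neighborhoods[i], neighborhoods[j])].mean()
    # normalize so that each sequence has unit self-similarity
    d = np.sqrt(np.diag(K_nbd))
    return K_nbd / np.outer(d, d)
```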

Bagged Mismatch Kernel Using k-means
(1) Run k-means n times (bagged runs), giving a cluster assignment c_p(x_i) for each run p = 1, …, n and each sequence x_i.
(2) Build the bagged-clustering kernel K_bag(x_i, x_j) = (1/n) Σ_p [c_p(x_i) = c_p(x_j)], the fraction of runs in which the two sequences fall in the same cluster.
(3) Take the product K(x_i, x_j) = K_bag(x_i, x_j) · K_orig(x_i, x_j), where K_orig is the base mismatch kernel.

Step (2) is a valid kernel because, up to the 1/n factor, it is the inner product in an nk-dimensional space with feature map Φ(x_i) = ([c_p(x_i) = q] : p = 1, …, n, q = 1, …, k). Products of kernels, as in Step (3), are also valid kernels. The intuition behind the approach is that the original kernel is rescaled by the 'probability' that two points are in the same cluster, hence encoding the cluster assumption.
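A minimal sketch of the bagged kernel, assuming scikit-learn's KMeans and a vector representation X of the sequences to cluster; the paper's exact bagging scheme (which representation is clustered, whether subsets are resampled) may differ, and the function signature is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def bagged_cluster_kernel(X, K_base, k=100, n_runs=10, seed=0):
    """Rescale a base kernel by the fraction of k-means runs in which two
    points land in the same cluster (Steps (1)-(3) above).

    X      : (n, d) vector representation of the sequences, used only for clustering.
    K_base : (n, n) base kernel matrix (e.g. the mismatch kernel).
    """
    n = X.shape[0]
    K_bag = np.zeros((n, n))
    for run in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=seed + run).fit_predict(X)
        K_bag += (labels[:, None] == labels[None, :])   # same-cluster indicator
    K_bag /= n_runs                                     # Step (2)
    return K_bag * K_base                               # Step (3): elementwise product
```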

Revisiting the Key Idea
Changing the representation given to a classifier by taking into account the structure described by the unlabeled data.

Experiments
We use the same 54 target families and the same test and training set splits as in the remote homology experiments of Liao and Noble (2002). We evaluate in three regimes: a semi-supervised setting, a transductive setting, and large-scale experiments.
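As a rough illustration of how a precomputed cluster kernel plugs into the per-family SVM evaluation (a sketch with hypothetical names, not the paper's code; the paper reports ROC and ROC50 scores, while this uses the full ROC area for brevity):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def evaluate_family(K, train_idx, test_idx, y_train, y_test):
    """Train a one-vs-rest SVM for one target family from a precomputed
    kernel matrix K and score the held-out test sequences."""
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K[np.ix_(train_idx, train_idx)], y_train)           # training Gram matrix
    scores = clf.decision_function(K[np.ix_(test_idx, train_idx)])  # test-vs-train kernel values
    return roc_auc_score(y_test, scores)
```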

Semi-supervised setting
A signed rank test shows that the neighborhood mismatch kernel yields significant improvement over adding homologs (P-value = 3.9 × 10⁻⁵).
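The signed rank comparison can be reproduced on per-family scores with SciPy's Wilcoxon test; the function and variable names below are hypothetical.

```python
from scipy.stats import wilcoxon

def compare_methods(roc50_method_a, roc50_method_b):
    """Paired Wilcoxon signed rank test over per-family ROC50 scores
    (e.g. 54 values per method, one per target family)."""
    _, p_value = wilcoxon(roc50_method_a, roc50_method_b)
    return p_value
```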

Transductive Setting A signed rank test (with adjusted P-value cutoff of 0.05) finds no significant difference between the neighborhood kernel, the bagged kernel (k = 100) and the random walk kernel in this transductive setting.

Large-scale experiments
Using 101,602 Swiss-Prot protein sequences as additional unlabeled data.

Large-scale experiments (Cont.)

Although the cluster kernel method does not outperform the profile kernel on average, results of the two methods are similar (20 wins, 25 losses, 9 ties for the cluster kernel); a signed rank test with a P-value threshold of 0.05 finds no significant difference in performance between the two methods. In practice, in the large-scale experiments reported, the profile kernel takes less time to compute than the neighborhood kernel, since we do not limit the number of sequences represented in the neighborhoods and hence the neighborhood representation is much less compact than the profile kernel representation.

Large-scale experiments (Cont.) The profile kernel might perform best when the training profiles themselves are not too distant from the test sequences (as measured, e.g. by PSIBLAST E-value) Whereas the neighborhood kernel might perform better in the more difficult situation where the test sequences are poorly described by every positive training profile.

Conclusion
We described two kernels, the neighborhood mismatch kernel and the bagged mismatch kernel, which combine both approaches and yield state-of-the-art performance in protein classification.
Revisiting the key idea: changing the representation given to a classifier by taking into account the structure described by the unlabeled data.

Thanks