A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction
Li Lihong (Anna Lee), Computer Science, April 22

Contents
1. Abstract
2. Introduction
3. Materials and Methods
4. Dealing with Class Imbalance: A New Supervised Over-Sampling Method
5. Experimental Results and Analysis
6. Conclusion

Abstract
- Protein-nucleotide interactions are ubiquitous and are useful for both protein function annotation and drug design.
- The goal is to identify interaction residues solely from protein sequences.
- This is a typical imbalanced learning problem, yet little attention has been paid to the negative impact of class imbalance.
- We propose a new supervised over-sampling algorithm and implement a predictor, called TargetSOS.
- Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS.

Introduction
Outline: motivation (why?); early-stage efforts; machine-learning-based methods (great success); the imbalanced learning problem; solutions; the over-sampling technique used in this study.
Why?
1. Nucleotides play critical roles in various metabolic processes.
2. Predicting protein-nucleotide binding residues is of significant importance for protein function analysis and drug design.
Early-stage efforts
1. Motif-based methods dominated this field.
2. Challenges: they characterize protein-nucleotide interactions within a relatively narrow range (usually only for a single nucleotide type) and require tertiary protein structure as input.
Machine-learning-based methods (great success)
1. ATPint demonstrated the feasibility of predicting ATP-binding residues solely from protein sequence information.
2. NsitePred extended prediction to multiple nucleotide types based on larger training datasets.

The Imbalanced Learning Problem
1. The number of negative samples is significantly larger than that of positive samples.
2. No existing methods have considered this serious class imbalance phenomenon.
Solutions
Sample-rescaling-based methods, learning-based methods, active learning, kernel learning, and hybrid methods.
The supervised over-sampling technique in this study
1. Sample rescaling is a basic technique that balances the sizes of the different classes by changing the number and distribution of samples within them.
2. Over-sampling differs from the under-sampling technique: it enlarges the minority class rather than shrinking the majority class.
3. Representative over-sampling methods: ROS, SMOTE, ADASYN, and the proposed SOS.
4. New predictor: TargetSOS.

Materials and Methods
- Benchmark Datasets
- Feature Representation and Classifier
  A. Extract Feature Vector from the Position-Specific Scoring Matrix
  B. Extract Feature Vector from the Predicted Protein Secondary Structure
  C. Support Vector Machine

Benchmark Datasets
Two benchmark datasets were chosen to evaluate the efficacy of the proposed SOS algorithm and of the implemented predictor.
ATP dataset: non-redundant ATP-interacting protein sequences, with 3,104 ATP-binding residues and a much larger number of non-binding residues (see Table 1).
NUC5: a multiple-nucleotide-interacting dataset covering five nucleotide types. It consists of 227, 321, 140, 56, and 105 protein sequences that interact with ATP, ADP, AMP, GTP, and GDP, respectively.
In both datasets, the maximal pairwise sequence identity is less than 40%.
Table 1 summarizes the detailed compositions of the two benchmark datasets.

Feature Representation and Classifier
The position-specific scoring matrix (PSSM) and the predicted protein secondary structure (PSS), both of which have been demonstrated to be especially useful for protein-nucleotide binding residue prediction, are used to extract discriminative feature vectors. A support vector machine (SVM) is used as the classifier for constructing the prediction model.

A. Extract Feature Vector from the Position-Specific Scoring Matrix
PSSM is widely used in bioinformatics. In this study, we obtain the PSSM of a query protein sequence by running PSI-BLAST against the Swiss-Prot database for three iterations with an E-value cutoff. Each score x contained in the PSSM is normalized using the logistic function $f(x) = 1/(1 + e^{-x})$. Based on the normalized PSSM, the feature vector for each residue in the protein sequence, denoted Logistic PSSM, is extracted by applying a sliding-window technique with a window size of 17. The dimensionality of the Logistic PSSM feature vector of a residue is therefore $17 \times 20 = 340$.
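As a concrete illustration, a minimal NumPy sketch of the two steps just described, normalization and windowing (the zero-padding at sequence ends and all names are assumptions, not the paper's code):

```python
import numpy as np

def window_features(mat, window=17):
    """Per-residue sliding-window features from an L x C matrix.

    Zero-pads both ends so boundary residues also get full windows
    (an assumption; the slide does not say how ends are handled).
    Returns an L x (window * C) array.
    """
    half = window // 2
    pad = np.zeros((half, mat.shape[1]))
    padded = np.vstack([pad, mat, pad])
    return np.asarray([padded[i:i + window].ravel() for i in range(len(mat))])

def logistic_pssm_features(pssm, window=17):
    """Logistic PSSM features: normalize an L x 20 PSSM with the logistic
    function, then window it; 17 * 20 = 340-D per residue."""
    return window_features(1.0 / (1.0 + np.exp(-pssm)), window)
```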

B. Extract Feature Vector from the Predicted Protein Secondary Structure
PSIPRED predicts the probabilities of each residue in a query protein sequence belonging to three secondary structure classes, i.e., coil, helix, and strand. We obtain the predicted secondary structure by running PSIPRED against the query sequence. The result is an $L \times 3$ probability matrix, where L is the length of the protein sequence. Similar to the Logistic PSSM feature extraction, a $17 \times 3 = 51$-D feature vector, denoted PSS, is extracted for each residue by applying a sliding window of size 17.
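The same windowing routine applies directly to the PSIPRED output, and the two per-residue vectors are presumably concatenated (340 + 51 = 391-D), although the slide does not state this explicitly:

```python
# pss: L x 3 PSIPRED probability matrix (coil, helix, strand).
pss_feats = window_features(pss, window=17)       # L x 51 PSS features
# Assumed concatenation into one 340 + 51 = 391-D vector per residue.
features = np.hstack([logistic_pssm_features(pssm), pss_feats])
```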

C. Support Vector Machine
We use SVM as the base learning model to evaluate the efficacy of the proposed SOS algorithm. Let $\{(x_i, y_i)\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$, be the set of samples, where +1 and -1 are the labels of the positive and negative classes, respectively. In linearly separable cases, SVM constructs a hyperplane that separates the samples of the two classes with a maximum margin. The optimal separating hyperplane (OSH) is constructed by finding a weight vector $w$ and a bias $b$ that minimize $\frac{1}{2}\lVert w \rVert^2$ and satisfy the following conditions:

$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N.$

To allow for mislabeled examples, a soft-margin technique is used. For each training sample, a corresponding slack variable $\xi_i \ge 0$, $i = 1, 2, \dots, N$, is introduced. Accordingly, the relaxed separation constraint is given as:

$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$

Then, the OSH can be solved by minimizing

$\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i,$

where C is a penalty parameter that trades margin width against training errors. Furthermore, to address non-linearly separable cases, the "kernel substitution" technique is introduced as follows: first, the input vector $x_i \in \mathbb{R}^d$ is mapped into a higher-dimensional Hilbert space, H, by a non-linear kernel function, $K(x_i, x_j)$; then, the OSH in the mapped space, H, is solved using a procedure similar to that for the linear case, and the decision function is given by:

$f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right).$
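For concreteness, a minimal scikit-learn sketch of training such a kernel SVM; the library choice, RBF kernel, and parameter values are assumptions, since the slides do not name an implementation:

```python
from sklearn.svm import SVC

# X: N x d residue feature matrix; y: labels in {+1, -1}.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
scores = clf.decision_function(X_test)   # signed distance to the OSH
labels = clf.predict(X_test)             # sign of the decision function
```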

Dealing with Class Imbalance: A New Supervised Over-Sampling Method
A. Random Over-sampling (ROS)
B. Synthetic Minority Over-sampling Technique (SMOTE)
C. Adaptive Synthetic Sampling (ADASYN)
D. Proposed Supervised Over-sampling (SOS)

A. Random Over-sampling
In the ROS technique, the minority set $S_{min}$ is augmented by replicating randomly selected samples within the set. ROS is easy to perform, but models trained on the replicated samples tend to over-fit. To mitigate this problem, SMOTE and ADASYN are introduced next.
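A minimal NumPy sketch of ROS (array and function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, n_new):
    """ROS: append n_new randomly replicated minority samples (with replacement)."""
    return np.vstack([X_min, X_min[rng.integers(0, len(X_min), size=n_new)]])
```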

B. Synthetic Minority Over-sampling Technique
For each sample $x_i$ in $S_{min}$, let $S_i^K$ be the set of the K-nearest neighbors of $x_i$ in $S_{min}$ under the Euclidean distance metric. To synthesize a new sample, an element of $S_i^K$, denoted $\hat{x}_i$, is selected; the feature-vector difference between $\hat{x}_i$ and $x_i$ is multiplied by a random number $\lambda$ between 0 and 1, and the result is added to $x_i$:

$x_{new} = x_i + \lambda \cdot (\hat{x}_i - x_i).$
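A minimal sketch of SMOTE under these definitions (k and all names are illustrative; it reuses np and rng from the ROS sketch above):

```python
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5):
    """Synthesize n_new minority samples by interpolation toward neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is its own nearest neighbor
    _, nbrs = nn.kneighbors(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))              # random minority sample x_i
        j = nbrs[i][rng.integers(1, k + 1)]       # random neighbor from S_i^K
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(out)
```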

C. Adaptive Synthetic Sampling
SMOTE creates the same number of synthetic samples for each original minority sample without considering the neighboring majority samples, which increases the occurrence of overlap between classes. In view of this limitation, ADASYN is introduced: it adaptively allocates more synthetic samples to minority samples that are harder to learn, i.e., those with more majority-class neighbors.
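A sketch of ADASYN's allocation step following He et al.'s standard formulation (not code from this paper; it reuses np, rng, and NearestNeighbors from the sketches above):

```python
def adasyn(X, y, minority_label=1, n_new=1000, k=5):
    """ADASYN: allocate synthetic samples by local majority density."""
    X_min = X[y == minority_label]
    # r_i: fraction of majority samples among each minority point's k nearest
    # neighbors in the full dataset -- a proxy for learning difficulty.
    _, nbrs_all = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    r = np.array([(y[nbrs_all[i][1:]] != minority_label).mean()
                  for i in range(len(X_min))])
    r = r / r.sum()                              # normalize to a distribution
    budget = np.rint(r * n_new).astype(int)      # more samples where r_i is high
    # Interpolate within the minority class, SMOTE-style, per budget.
    _, nbrs_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    out = []
    for i, gi in enumerate(budget):
        for _ in range(gi):
            j = nbrs_min[i][rng.integers(1, k + 1)]
            out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.asarray(out)
```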

D. Proposed Supervised Over-sampling
The proposed SOS algorithm balances the classes by synthesizing additional samples for the minority class through a supervised process, i.e., one that uses the class labels to guide where new samples are synthesized (as summarized in the Conclusion).
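The slide carries no algorithmic detail, so the following is only a plausible illustration of what supervised over-sampling can mean, not the paper's actual SOS procedure: SMOTE-style candidates are kept only when a classifier trained on the original labeled data still assigns them to the minority class. Every name and design choice here is an assumption.

```python
def supervised_oversample(X, y, minority_label=1, n_new=1000, k=5):
    """Hypothetical supervised filter -- NOT the paper's actual SOS algorithm.

    Generates SMOTE-style candidates, then keeps only those that a
    classifier trained on the original data also labels as minority,
    so the supervision signal vetoes candidates that drift into the
    majority region. Reuses SVC, np, rng, and smote from the sketches above.
    """
    screen = SVC(kernel="rbf", gamma="scale").fit(X, y)
    X_min = X[y == minority_label]
    kept = []
    for _ in range(1000):                         # cap retries
        if len(kept) >= n_new:
            break
        cand = smote(X_min, n_new=64, k=k)        # candidate batch
        kept.extend(cand[screen.predict(cand) == minority_label])
    return np.asarray(kept[:n_new])
```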

Experimental Results and Analysis
- Evaluation Indexes
- Supervised Over-Sampling Helps to Enhance Prediction Performance
- Comparisons with Other Over-Sampling Methods
- Comparisons with Existing Predictors
  A. Cross-Validation Test
  B. Independent Validation Test

Evaluation Indexes
Let TP, FP, TN, and FN be the abbreviations for true positives, false positives, true negatives, and false negatives, respectively. Then, Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), and the Matthews correlation coefficient (MCC) are defined as follows:

$Sen = \frac{TP}{TP + FN}, \quad Spe = \frac{TN}{TN + FP}, \quad Acc = \frac{TP + TN}{TP + TN + FP + FN},$

$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$
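These standard definitions translate directly to code; a small sketch:

```python
import math

def evaluation_indexes(tp, fp, tn, fn):
    """Sen, Spe, Acc, and MCC from the confusion-matrix counts."""
    sen = tp / (tp + fn)                          # sensitivity (recall)
    spe = tn / (tn + fp)                          # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)         # accuracy
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sen, spe, acc, mcc
```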

Supervised Over-Sampling Helps to Enhance Prediction Performance

Figure 1. ROC curves of predictions with and without SOS for ATP168 and ATP227 over five-fold cross-validation. (a) ROC curves for ATP168; (b) ROC curves for ATP227.

Comparisons with Other Over-Sampling Methods

Comparisons with Existing Predictors
A. Cross-Validation Test
B. Independent Validation Test

A. Cross-Validation Test.

B. Independent Validation Test.

Conclusion
In this study, a new SOS algorithm that balances the samples of different classes by synthesizing additional samples for the minority class through a supervised process is proposed to address imbalanced learning problems. We apply the proposed SOS algorithm to protein-nucleotide binding residue prediction, and a web server, called TargetSOS, is implemented. Cross-validation tests and independent validation tests on two benchmark datasets demonstrate that the proposed SOS algorithm helps to improve the performance of protein-nucleotide binding residue prediction. The findings of this study enrich the understanding of class-imbalance learning, and the proposed algorithm is sufficiently flexible to be applied to other bioinformatics problems in which class imbalance exists, such as protein functional residue prediction and disulfide bond prediction.

Thank you for your attention!