Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah


Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Techniques Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah Department of Computer Science and Biological Sciences The University of New Orleans, New Orleans, LA

Presentation Overview Introduction Data Collection Feature Extraction Hierarchical Classification Strategies Machine Learning Methods for the Prediction of Hierarchical Categories Framework for the Stacking Model Results Conclusions

Transposable Elements (TE) Transposable elements (TEs), or jumping genes, are DNA sequences that have the intrinsic capability to move within a host genome from one genomic location to another; the new location can be on the same or a different chromosome. TEs were first discovered by Barbara McClintock, working in maize, in 1948. TEs play an important role in modifying the functionality of genes, and in mutation and chromosome breakage; e.g., insertion of L1-type TEs into tumor suppressor genes can lead to cancer.

Illustration of Wicker’s Taxonomy for Transposable Elements

Data Collection For our study, we collected pre-annotated DNA sequences of TEs. The repetitive DNA sequences used for TE annotation were obtained from two public repositories: Repbase and PGSB. The Repbase repository contains TEs from different eukaryotic species. PGSB is a compilation of plant repetitive sequences drawn from several databases: TREP, TIGR repeats, PlantSat, and GenBank.

Data Collection Table 1. Overall Statistics of Datasets

Feature Extraction Each TE in a dataset is represented by a set of k-mer features, obtained by counting the frequency of every substring of length k = 2, 3, and 4. For example, for the sequence CCGCAAAAGTTGTC: for k = 2, the possible combinations with their frequency counts in this sequence are CC=1, CG=1, GC=1, CA=1, AA=3, AG=1, GT=2, TT=1, TG=1, TC=1; for k = 3, they are CCG=1, CGC=1, GCA=1, CAA=1, AAA=2, AAG=1, AGT=1, GTT=1, TTG=1, TGT=1, GTC=1. Feature values were standardized such that the mean = 0 and the standard deviation = 1.
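The k-mer counting described above can be sketched in a few lines of Python; this is a minimal illustration of the frequency count, not the authors' actual extraction code:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("CCGCAAAAGTTGTC", 2)
# e.g. counts["AA"] == 3 and counts["GT"] == 2, matching the slide's example
```

Concatenating the counts for k = 2, 3, and 4 (over all possible k-mers, with zeros for absent ones) yields the feature vector for one TE, which is then standardized.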

Hierarchical Classification Strategies Classification of TEs can be treated as a hierarchical classification problem. The hierarchy can be represented by a directed acyclic graph or a tree. Hierarchical classification of TEs is performed using top-down strategies. Two recent top-down strategies for the hierarchical classification of TEs are: non-Leaf Local Classifier per Parent Node (nLLCPN) and Local Classifier per Parent Node and Branch (LCPNB).

non-Leaf Local Classifier per Parent Node Approach (nLLCPN) In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph. In the figure's example, a sequence …CCGCAAAAGTTGTC… is first classified at the root as either 1 or 2; at node 2 it is classified as either node 2 itself or 2.1; at node 2.1 as either itself or 2.1.1; and finally, at node 2.1.1, it is assigned to the leaf 2.1.1.2. (The example taxonomy: Root → 1, 2; 1 → 1.1, 1.4, 1.5; 1.1 → 1.1.1, 1.1.2; 2 → 2.1; 2.1 → 2.1.1; 2.1.1 → 2.1.1.1, 2.1.1.2, 2.1.1.5, 2.1.1.8.)

Local Classifier per Parent Node and Branch (LCPNB) In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction probabilities are obtained for all the classes. The final label is the node whose root-to-node path has the highest average probability. In the figure's example, the path leading to the final classification is 2 (0.6) → 2.1 (1.0) → 2.1.1 (0.8) → 2.1.1.1 (0.4), with Average = (0.6 + 1.0 + 0.8 + 0.4)/4 = 0.7.
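The path-averaging step of LCPNB can be sketched as follows. The probabilities along the winning path are taken from the slide's worked example; the values for nodes 1 and 2.1.1.2 are assumed here purely for illustration:

```python
# Per-node prediction probabilities from the local classifiers.
# Winning-path values are from the slide; the rest are illustrative.
probs = {
    "1": 0.4, "2": 0.6,              # children of the root
    "2.1": 1.0,                      # under node 2
    "2.1.1": 0.8,                    # under node 2.1
    "2.1.1.1": 0.4, "2.1.1.2": 0.2,  # leaves under node 2.1.1
}

def path_score(label):
    """Average the probabilities along the path from the root to `label`."""
    parts = label.split(".")
    path = [".".join(parts[:i + 1]) for i in range(len(parts))]
    return sum(probs[p] for p in path) / len(path)

best = max(["2.1.1.1", "2.1.1.2"], key=path_score)
# path 2 -> 2.1 -> 2.1.1 -> 2.1.1.1 averages (0.6 + 1.0 + 0.8 + 0.4)/4 = 0.7
```

Ranking candidates by average path probability lets a confident decision high in the tree compensate for a weaker decision near the leaves.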

Selection of Hierarchical Classification Strategy

Machine Learning Algorithms We applied several machine learning methods at each non-leaf node of the directed acyclic graph. Artificial Neural Network (ANN) K-Nearest Neighbor (KNN) Logistic Regression (LogReg) ExtraTree Classifier (ET) Random Forest (RF) Gradient Boosting Classifier (GBC) AdaBoost Support Vector Machines (SVM)

Framework of the Stacking-Based Model Stacking is an ensemble technique that combines several machine learning algorithms into one predictive model. In the first stage of learning, a layer of base models is trained using state-of-the-art ML algorithms from the Scikit-learn library. In the second stage, the first-level prediction probabilities from these base models are used as features and augmented with the original feature vector. The augmented feature vector is then used to train the final-level learner, i.e., the meta-classifier.
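As a rough sketch, scikit-learn's StackingClassifier with passthrough=True implements the same two-stage idea: base-model prediction probabilities are concatenated with the original features before training the meta-classifier. This is an illustration on synthetic data, not the ClassifyTE pipeline itself (which trains one such stacked model per parent node):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the standardized k-mer feature vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Base learners produce prediction probabilities; passthrough=True appends
# the original feature vector for the meta-classifier, as described above.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(kernel="rbf", probability=True)),
        ("et", ExtraTreesClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,
    cv=5,
)
stack.fit(X, y)
```

Internally, cv=5 generates the base models' out-of-fold prediction probabilities, so the meta-classifier is not trained on predictions the base models made for their own training data.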

Selection of Base and Meta-Learners The selection of base and meta-learners was influenced by their working techniques. The hierarchical classification strategy was selected based on performance: LCPNB. The Support Vector Machine (SVM) with RBF kernel was optimized using a coarse grid search followed by a fine search. Table 2. Combination of base learners and meta-learners for two-tier learning
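The coarse-then-fine search for the RBF SVM might look like the sketch below; the parameter grids and data are illustrative assumptions, not the values used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Coarse grid over C and gamma; a finer grid would then be centered on the
# best coarse values (the "fine search" step).
coarse = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]},
    cv=5,
)
coarse.fit(X, y)
best_C = coarse.best_params_["C"]
best_gamma = coarse.best_params_["gamma"]
```

A second GridSearchCV over a narrow range around best_C and best_gamma would then refine the choice.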

Training Procedure for the Stacking-Based Model Figure 1. Architecture of the stacking-based model for each parent node. (a) Training base classifiers on the instances of the child nodes, which yields a set of prediction probabilities. (b) Augmenting the set of prediction probabilities with the feature vector to train the meta-classifier.

Performance Evaluation (Metrics)

hierarchical precision: $hP = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |Z_i|}$

hierarchical recall: $hR = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |C_i|}$

hierarchical f-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$

Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively. The performance of each classifier is evaluated using a 10-fold cross-validation strategy.
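The three hierarchical metrics can be computed directly from the formulas above. In this small sketch, each instance's true and predicted classes are the sets of nodes on its root-to-label path (the example instance and its sets are hypothetical):

```python
def hierarchical_metrics(true_sets, pred_sets):
    """Micro-averaged hierarchical precision, recall, and F-measure.

    Each element of true_sets / pred_sets is the set of classes on the
    path from the root for one instance (ancestors included).
    """
    overlap = sum(len(Z & C) for Z, C in zip(pred_sets, true_sets))
    hP = overlap / sum(len(Z) for Z in pred_sets)
    hR = overlap / sum(len(C) for C in true_sets)
    hF = 2 * hP * hR / (hP + hR)
    return hP, hR, hF

# One instance: true path 2 -> 2.1 -> 2.1.1, predicted 2 -> 2.1 -> 2.1.2.
# Two of three predicted nodes are correct, so hP = hR = hF = 2/3.
hP, hR, hF = hierarchical_metrics(
    [{"2", "2.1", "2.1.1"}], [{"2", "2.1", "2.1.2"}]
)
```

Because ancestors are counted, a prediction that goes wrong only at the last level still receives partial credit, which is the point of the hierarchical variants of precision and recall.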

Results – Evaluation of Performance of Stacking-Based Frameworks Table 3. Performance of stacked models through 10-fold cross-validation using the LCPNB classification strategy. CM1: Base classifiers – KNN + SVM + ET; Meta-classifier – LogReg

Results – Finalizing the Stacking Model Selection of the final model: ClassifyTE is CM1. Fig 3. hF for stacked models on the PGSB, Repbase, and Combined datasets.

Results – Performance of ClassifyTE Table 4. Comparison of ClassifyTE with other state-of-the-art methods on the PGSB, Repbase, and Mixed datasets

Conclusions and Future Work With the intent of improving on the results of individual machine learning methods, we designed and developed a stacking-based model. We found that ClassifyTE outperforms the existing hierarchical classification methods for transposable elements in the literature. We plan to package it and make it available as software for biologists and researchers. Our tool will be online here: https://bmll.cs.uno.edu/ Offline code and data will be here: http://cs.uno.edu/~tamjid/Software.html

Thank You! Any Questions? 9/5/2019