
1 Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Techniques
Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah Department of Computer Science and Biological Sciences The University of New Orleans, New Orleans, LA

2 Presentation Overview
Introduction Data Collection Feature Extraction Hierarchical Classification Strategies Machine Learning Methods for the Prediction of Hierarchical Categories Framework for the Stacking Model Results Conclusions

3 Transposable Elements (TEs)
Transposable elements (TEs), or "jumping genes," are DNA sequences with the intrinsic capability to move within a host genome from one genomic location to another. The destination can be on the same chromosome or a different one. TEs were first discovered by Barbara McClintock, working in maize, in 1948. TEs play an important role in modifying the functionality of genes and in mutation and chromosome breakage; e.g., insertion of L1-type TEs into tumor suppressor genes can lead to cancer.

4 Illustration of Wicker’s Taxonomy for Transposable Elements

5 Data Collection
For our study, we collected pre-annotated DNA sequences of TEs. The annotated repetitive DNA sequences were obtained from two public repositories: Repbase and PGSB. Repbase contains TEs from different eukaryotic species. PGSB is a compilation of plant repetitive sequences drawn from several databases: TREP, TIGR repeats, PlantSat, and GenBank.

6 Data Collection Table 1. Overall Statistics of Datasets

7 Feature Extraction
Each TE in a dataset is represented by a set of k-mers, obtained by counting the frequency of every substring of length k = 2, 3, 4.
For example, for the sequence CCGCAAAAGTTGTC:
For k = 2, the combinations present, with their frequency counts: CC=1, CG=1, GC=1, CA=1, AA=3, AG=1, GT=2, TT=1, TG=1, TC=1
For k = 3, the combinations present, with their frequency counts: CCG=1, CGC=1, GCA=1, CAA=1, AAA=2, AAG=1, AGT=1, GTT=1, TTG=1, TGT=1, GTC=1
Feature values were standardized such that the mean = 0 and the standard deviation = 1.
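The k-mer counting above can be sketched in Python (the sequence and k values are the slide's example; the function name and fixed-length vector are illustrative):

```python
from itertools import product

def kmer_counts(seq, k):
    """Frequency count of every length-k substring of a DNA sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

seq = "CCGCAAAAGTTGTC"
print(kmer_counts(seq, 2))  # AA and GT occur 3 and 2 times, the rest once

# A fixed-length feature vector enumerates all 4**k possible k-mers,
# so every TE maps to a vector of the same dimensionality.
alphabet = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 dimers
vector = [kmer_counts(seq, 2).get(m, 0) for m in alphabet]
```

The per-k vectors for k = 2, 3, 4 are concatenated, then standardized to zero mean and unit variance as the slide states.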

8 Hierarchical Classification Strategies
Classification of TEs can be treated as hierarchical classification problem The hierarchical classification can be represented by a directed acyclic graph or a tree Hierarchical classification of TEs is performed based on top-down strategies Two recent top-down strategies for the hierarchical classification of TEs are: non-Leaf Local Classifier per Parent Node (nLLCPN) Local Classifier per Parent Node and Branch (LCPNB)

9 non-Leaf Local Classifier per Parent Node Approach (nLLCPN)
In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
[Figure: an example hierarchy (Root → 1, 2; 1 → 1.1, 1.4, 1.5; 1.1 → 1.1.1, 1.1.2; 2 → 2.1; 2.1 → 2.1.1). The sequence …CCGCAAAAGTTGTC… is classified at the root as either 1 or 2, then at node 2 as either node 2 itself or 2.1, then at node 2.1 as either itself or 2.1.1, and so on down the hierarchy.]
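The top-down walk can be sketched as follows. The Node structure and the Fixed stub (standing in for a trained local classifier) are illustrative, not the paper's implementation:

```python
class Node:
    """One node of the hierarchy; non-leaf nodes hold a local classifier."""
    def __init__(self, label, children=(), clf=None):
        self.label = label
        self.children = list(children)
        self.clf = clf  # None at leaf nodes

class Fixed:
    """Stand-in for a trained multi-class classifier with a fixed output."""
    def __init__(self, out):
        self.out = out
    def predict(self, X):
        return [self.out]

def predict_nllcpn(node, x):
    """Walk down from the root; at each non-leaf node the local classifier
    picks either a child (descend) or the node itself (stop early)."""
    path = []
    while node.clf is not None:
        choice = node.clf.predict([x])[0]
        if choice == node.label:  # classifier chose to stop at this node
            break
        node = next(c for c in node.children if c.label == choice)
        path.append(node.label)
    return path

# The slide's example path: Root -> 2 -> 2.1 -> 2.1.1
leaf = Node("2.1.1")
n21 = Node("2.1", [leaf], Fixed("2.1.1"))
n2 = Node("2", [n21], Fixed("2.1"))
root = Node("Root", [Node("1"), n2], Fixed("2"))
print(predict_nllcpn(root, "...CCGCAAAAGTTGTC..."))  # ['2', '2.1', '2.1.1']
```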

10 Local Classifier per Parent Node and Branch (LCPNB)
In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph, and prediction probabilities are obtained for all the classes.
[Figure: the same example hierarchy, with a local prediction probability at each decision. The path leading to the final classification starts at 2 (probability 0.6) and continues with probabilities 1.0, 0.8, and 0.4 at the successive levels; its score is the average of the probabilities along the path: (0.6 + 1.0 + 0.8 + 0.4)/4 = 0.7.]
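The branch-scoring step can be sketched as averaging the per-level probabilities of each candidate path and keeping the best one. The winning path's probabilities are from the slide; the losing branch's values are made up for contrast:

```python
def path_score(probs):
    """LCPNB scores a candidate root-to-node path by averaging the local
    prediction probabilities produced along it."""
    return sum(probs) / len(probs)

candidates = {
    ("2", "2.1", "2.1.1"): [0.6, 1.0, 0.8, 0.4],  # from the slide
    ("1", "1.1", "1.1.1"): [0.4, 0.2, 0.2, 0.2],  # illustrative only
}
best = max(candidates, key=lambda p: path_score(candidates[p]))
print(best, round(path_score(candidates[best]), 2))  # -> ('2', '2.1', '2.1.1') 0.7
```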

11 Selection of Hierarchical Classification Strategy

12 Machine Learning Algorithms
We applied several machine learning methods at each non-leaf node of the directed acyclic graph. Artificial Neural Network (ANN) K-Nearest Neighbor (KNN) Logistic Regression (LogReg) ExtraTree Classifier (ET) Random Forest (RF) Gradient Boosting Classifier (GBC) AdaBoost Support Vector Machines (SVM)
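All eight candidate learners are available in Scikit-learn; a minimal instantiation sketch (default hyperparameters here, whereas the study tuned them separately):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.svm import SVC

# One candidate local classifier per non-leaf node is chosen from this pool.
candidates = {
    "ANN": MLPClassifier(),
    "KNN": KNeighborsClassifier(),
    "LogReg": LogisticRegression(max_iter=1000),
    "ET": ExtraTreesClassifier(),
    "RF": RandomForestClassifier(),
    "GBC": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "SVM": SVC(probability=True),  # probabilities needed for LCPNB scoring
}
```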

13 Framework of the Stacking-Based Model
Stacking is an ensemble technique that combines several machine learning algorithms into one predictive model. In the first stage of learning, a layer of base models is trained using state-of-the-art ML algorithms from the Scikit-learn library. In the second stage, the first-level prediction probabilities from these base models are used as features and augmented with the original feature vector. The augmented feature vector is then used to train the final-level learner, i.e., the meta-classifier.
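A minimal sketch of this two-stage idea with Scikit-learn's `StackingClassifier`, using the KNN + SVM + ET base layer with a logistic-regression meta-classifier named later in the slides as CM1 (the synthetic data and hyperparameters here are illustrative, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(kernel="rbf", probability=True)),
        ("et", ExtraTreesClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,  # augment meta-features with the original feature vector
)
stack.fit(X, y)
print(stack.score(X, y))
```

`passthrough=True` is what realizes the augmentation step: the meta-classifier sees the base models' prediction probabilities concatenated with the original features.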

14 Selection of Base and Meta-Learners
The selection of base and meta-learners was influenced by how each method works. The hierarchical classification strategy was selected based on performance: LCPNB. The Support Vector Machine (SVM) with RBF kernel was optimized using a grid search followed by a fine search. Table 2. Combinations of base learners and meta-learners for two-tier learning.
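The coarse-then-fine SVM tuning can be sketched with `GridSearchCV`; the grid values and data below are illustrative, not the ones used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=10, random_state=0)

# Coarse grid over wide ranges of C and gamma.
coarse = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
                      cv=3).fit(X, y)
C0, g0 = coarse.best_params_["C"], coarse.best_params_["gamma"]

# Fine search in a narrow window around the coarse optimum.
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [C0 / 2, C0, C0 * 2], "gamma": [g0 / 2, g0, g0 * 2]},
                    cv=3).fit(X, y)
print(fine.best_params_)
```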

15 Training Procedure for Stacking Based Model
Figure 1. Architecture of the stacking-based model for each parent node. (a) Training base classifiers with the instances of the child nodes, which yields a set of prediction probabilities. (b) Augmenting the set of prediction probabilities with the feature vector to train the meta-classifier.

16 Performance Evaluation (Metrics)
hierarchical precision: $hP = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |Z_i|}$
hierarchical recall: $hR = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |C_i|}$
hierarchical f-measure: $hF = \frac{2 \cdot hP \cdot hR}{hP + hR}$
Here, $C_i$ and $Z_i$ represent the sets of true and predicted classes for an instance $i$, respectively. The performance of each classifier is evaluated using a 10-fold cross-validation strategy.
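The three metrics transcribe directly into code, assuming each instance's true and predicted class paths are given as Python sets (the function name and the worked example are illustrative):

```python
def hierarchical_metrics(true_sets, pred_sets):
    """hP, hR, hF over per-instance sets of true (C_i) and predicted (Z_i)
    classes, summed across instances before dividing."""
    inter = sum(len(z & c) for z, c in zip(pred_sets, true_sets))
    hP = inter / sum(len(z) for z in pred_sets)
    hR = inter / sum(len(c) for c in true_sets)
    hF = 2 * hP * hR / (hP + hR)
    return hP, hR, hF

# One instance whose true path is 2 -> 2.1 -> 2.1.1 but which was
# predicted only down to 2 -> 2.1: precise but incomplete.
true_sets = [{"2", "2.1", "2.1.1"}]
pred_sets = [{"2", "2.1"}]
hP, hR, hF = hierarchical_metrics(true_sets, pred_sets)
# hP = 2/2 = 1.0, hR = 2/3, hF = 0.8
```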

17 Results – Evaluation of Performance of Stacking-Based Frameworks
Table 3. Performance of stacked models through 10-fold cross-validation using the LCPNB classification strategy. CM1: base classifiers – KNN + SVM + ET; meta-classifier – LogReg.

18 Results – Finalizing Stacking Model
Selection of the final model: ClassifyTE uses configuration CM1. Fig 3. hF for stacked models on the PGSB, Repbase, and combined datasets.

19 Results – Performance of ClassifyTE
Table 4. Comparison of ClassifyTE with other state-of-the-art methods on the PGSB, Repbase, and Mixed datasets

20 Conclusions and Future Work
With the intent of improving on the results of individual machine learning methods, we designed and developed a stacking-based model. We found that ClassifyTE outperforms the existing hierarchical classification methods for transposable elements in the literature. We plan to package it and make it available as software for biologists and researchers. Our tool will be online here:  Offline code and data will be here: 

21 Thank You! Any Questions? 9/5/2019

