Classifying bacteriophage based on genomic information


1 Classifying bacteriophage based on genomic information
ECE 539 Daniel Griffith

2 Overview
Purpose: This project aimed to classify bacteriophages by their nucleotide sequences, using a Support Vector Machine (SVM) classifier and Principal Component Analysis (PCA) for dimensionality reduction.
Tasks completed:
- Aggregated bacteriophage genomes and converted them into numeric vectors
- Trained an SVM classifier to predict the class of an unknown bacteriophage genome
- Performed PCA dimensionality reduction to find the optimal number of features
Assessment: Overall, the classification rate was below 50%, but this was decent given the large number of classes.

3 Approach – data acquisition
Goal: convert each DNA sequence into a numerical vector so that it can be classified with an SVM.
Steps:
1. Download all bacteriophage genomes from the NCBI database
2. Sort by phage class and remove classes that have too few members
3. Convert each nucleotide sequence to a 12-dimensional vector
4. Divide into training and testing sets
Two datasets were produced: one with all class sizes ≥ 8 and one with all class sizes ≥ 20.
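The slide does not say how the 12-dimensional vector is computed. One encoding that fits the description is a 12-dimensional "natural vector": for each of A, C, G, and T, the count, the mean position, and a normalized second central moment of the positions. The sketch below assumes that encoding and Python 3 (the project itself used Python 2.7). The function names `natural_vector` and `filter_small_classes` are hypothetical; only the class-size thresholds (8 and 20) come from the slide.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return a 12-dim list: counts, mean positions, and normalized second
    central moments of the positions of A, C, G, T (in that order).

    Assumption: this encoding is a guess at what the slides used.
    Characters other than A/C/G/T (e.g. N) count toward the sequence
    length but contribute no features of their own.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Nucleotide absent: use zeros for all three features
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def filter_small_classes(labels, min_size):
    """Return indices of samples whose class has at least min_size members."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]


# Toy usage with made-up sequences and class labels
genomes = ["ACGTACGT", "AACCGGTT", "TTTTACGA", "GGGCCCAT"]
classes = ["A1", "A1", "A1", "B2"]
keep = filter_small_classes(classes, min_size=2)
features = [natural_vector(genomes[i]) for i in keep]
```

Building the ≥ 8 and ≥ 20 datasets would then amount to calling `filter_small_classes` with each threshold before the train/test split.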

4 Approach – PCA and SVM computation
All data modification, classification, and dimensionality reduction were carried out in Python 2.7 on a personal machine.
SVM: scikit-learn LinearSVC (linear SVM); parameters: C = 1, balanced class weights
PCA: scikit-learn (sklearn.decomposition.PCA); default parameters
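As a rough illustration of the setup on this slide, the sketch below chains `sklearn.decomposition.PCA` and `LinearSVC` with C = 1 and balanced class weights. Several things here are assumptions rather than details from the slides: the use of a `Pipeline`, the hypothetical helper `build_classifier`, the synthetic stand-in data, and a current scikit-learn on Python 3 instead of Python 2.7. No feature scaling is applied because the slides do not mention any.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_classifier(n_components=None):
    """PCA followed by a linear SVM.

    n_components=None keeps all components (the PCA default); C=1 and
    class_weight="balanced" follow the parameters listed on the slide.
    """
    return make_pipeline(
        PCA(n_components=n_components),
        LinearSVC(C=1.0, class_weight="balanced"),
    )


# Synthetic stand-in for the 12-dim genome vectors: three well-separated
# classes, each shifted along a different feature axis.
rng = np.random.RandomState(0)
centers = np.zeros((3, 12))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 5.0
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 12)) for c in centers])
y = np.repeat(["class_a", "class_b", "class_c"], 30)

clf = build_classifier(n_components=8)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
print("training accuracy:", train_accuracy)
```

Wrapping both steps in one pipeline means the PCA projection is fitted on the training data only and reused unchanged at prediction time.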

5 Results

6 Discussion
- Classification rate, precision, and recall are lower than ideal
  - Unbalanced class sizes are difficult to classify
  - In general, larger classes were easier to classify
  - A more stringent class size threshold or a greater number of features per genome might improve performance
- Reducing the dimensionality of the 12-D feature vector has a relatively small effect on classifier performance
  - Reducing from 12 to 8 features gives a similar classification rate
- Other classification methods, such as clustering or an MLP, might have more success
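The comparison between 12 and 8 features could be reproduced with a sweep like the one sketched here, which refits the PCA + linear SVM combination for several component counts and reports accuracy plus macro-averaged precision and recall. The helper name `sweep_pca_components`, the stratified 75/25 split, the macro averaging, and the unbalanced synthetic data are all assumptions; the slides state only that the data were divided into training and testing sets.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def sweep_pca_components(X, y, component_counts, test_size=0.25, seed=0):
    """Fit PCA + LinearSVC for each component count and score on a held-out set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    results = {}
    for k in component_counts:
        model = make_pipeline(
            PCA(n_components=k),
            LinearSVC(C=1.0, class_weight="balanced"),
        )
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[k] = {
            "accuracy": accuracy_score(y_test, pred),
            # Macro averaging weights every class equally, so small
            # classes affect the score as much as large ones.
            "precision": precision_score(
                y_test, pred, average="macro", zero_division=0
            ),
            "recall": recall_score(
                y_test, pred, average="macro", zero_division=0
            ),
        }
    return results


# Unbalanced synthetic classes (40, 20, and 8 members) as stand-in data
rng = np.random.RandomState(0)
sizes = [40, 20, 8]
centers = np.zeros((3, 12))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 5.0
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(n, 12)) for c, n in zip(centers, sizes)]
)
y = np.repeat(["large", "medium", "small"], sizes)

results = sweep_pca_components(X, y, component_counts=[12, 8, 4])
for k in sorted(results, reverse=True):
    print(k, results[k])
```

On real genome vectors, per-class precision and recall (`average=None`) would show directly whether larger classes are in fact easier to classify.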

