Improvement of SSR Redundancy Identification by Machine Learning Approach Using Dataset from Cotton Marker Database Pengfei Xuan 1,2, Feng Luo 2, Albert.

Presentation transcript:

Improvement of SSR Redundancy Identification by Machine Learning Approach Using Dataset from Cotton Marker Database

Pengfei Xuan 1,2, Feng Luo 2, Albert Abbott 1, Don Jones 3, and Anna Blenda 1

1 Department of Genetics and Biochemistry, Clemson University, Biosystems Research Complex, 51 New Cherry Street, Clemson, SC, 29634, USA
2 School of Computing, Clemson University, 100 McAdams, Clemson, SC, 29634, USA
3 Cotton Incorporated, 6399 Weston Parkway, Cary, NC, 27513, USA

INTRODUCTION

Microsatellites, or simple sequence repeats (SSRs), are used as molecular markers with wide-ranging applications in cotton molecular breeding. The Cotton Marker Database (CMD) provides centralized access to publicly available cotton molecular data. In collaboration with the contributing researchers, we have summarized and provided high-quality data for 11,938 SSRs displayed through CMD. However, SSR redundancy is a common and inevitable issue for projects coming from different research groups. Detecting SSR redundancy by aligning the SSR-containing sequences gives a high number of false positives even under stringent parameters, because similarity is identified from the sequence comparison alone. To improve the accuracy of redundant SSR detection and reduce the cost of expert intervention in polymorphism discovery, we propose applying a machine learning approach based on the Support Vector Machine (SVM) algorithm [1, 2].

REFERENCES

1. R.-E. Fan, P.-H. Chen, C.-J. Lin. Working set selection using the second order information for training SVM. Journal of Machine Learning Research.
2. Lakshmi K, John J. Application of machine learning in SNP discovery. BMC Bioinformatics.

MATERIALS AND METHODS

The CMD SSR dataset (847 markers) was used as training, testing and prediction sets for the SVM algorithm (Figure 1). We chose 4 important SSR features:

- Percent match of primer sequences: The SSR primer sequence is an important reference factor in genetic research; it is used to isolate targeted sections of DNA for amplification in PCR. The primer sequence alignment can be calculated by the CD-HIT program.
- Primer match type: Type 1: forward to forward match, or reverse to reverse match. Type 2: forward to reverse match, or reverse to forward match.
- Motif similarity: SSR motif similarity is another important factor reflecting the degree of SSR redundancy.
- Percent match of SSR-containing sequences: A BLAST search compares a query sequence with a library or database of sequences and identifies library sequences that resemble the query sequence above a certain threshold.
- SSR genetic map position: Based on this feature, the training data were manually selected and the final results were evaluated.

Figure 1 (box labels): SSR Training Data (4 features), 99 (Positive) & 106 (Negative); Expert Decision; SVM Program LIBSVM; Parameter Scaling; Cross-validation; Grid-search; Kernel Functions; Best Parameters; Classification Model; SSR Testing Data, 100 (Positive) & 119 (Negative); Performance Verification; SSR Prediction Data, 648 (Positive); Model Prediction.

Table 1. Evaluation of results obtained for the tested data.
Data: SSR Training and Testing Data
Columns: Kernel Function, TP*, FP*, TN*, FN*, Sensitivity %, Specificity %, Precision %, Accuracy %, F-score %
linear ... 98.00
polynomial ... ---
radial basis ... 97.31
sigmoid ... 98.02

DISCUSSION

Our experiment showed that this machine learning approach, based on the 4 selected features, gives high sensitivity and specificity. After the first step of SSR redundancy detection, which is based on the SSR-containing sequence alignment, it can be used either to identify questionable similarity results (Example A) or to confirm the initial SSR similarity (Example B).
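As an illustration of how the four features listed under MATERIALS AND METHODS might be turned into SVM input, the sketch below encodes one SSR pair as a numeric vector and rescales each feature to [-1, 1]. The poster only names a "Parameter Scaling" step; the class name `SSRPair`, its field names, the numeric codes for the match types, the [-1, 1] range and the second data row are all assumptions made for this example, not details taken from the poster.

```python
from dataclasses import dataclass

# Assumed numeric codes for the two primer match types (not from the poster).
MATCH_TYPE_CODES = {"type1": 1.0, "type2": 2.0}


@dataclass
class SSRPair:
    """One candidate redundant SSR pair; field names are illustrative."""
    primer_match: float      # percent match of primer sequences
    match_type: str          # "type1" or "type2"
    motif_similarity: float  # percent motif similarity
    ssr_blast: float         # BLAST value for the SSR-containing sequences

    def to_vector(self):
        """Return the four features as a list of floats."""
        return [
            self.primer_match,
            MATCH_TYPE_CODES[self.match_type],
            self.motif_similarity,
            self.ssr_blast,
        ]


def fit_scaler(vectors, lower=-1.0, upper=1.0):
    """Learn per-feature min/max and return a function that rescales a vector."""
    mins = [min(col) for col in zip(*vectors)]
    maxs = [max(col) for col in zip(*vectors)]

    def scale(vec):
        scaled = []
        for x, lo, hi in zip(vec, mins, maxs):
            if hi == lo:
                # Constant feature: map to the midpoint (a choice made here).
                scaled.append((lower + upper) / 2.0)
            else:
                scaled.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        return scaled

    return scale


pairs = [
    SSRPair(96.0, "type1", 100.0, 812.0),  # values shown for NAU864 - MUSS298
    SSRPair(80.0, "type2", 50.0, 400.0),   # made-up pair, for illustration only
]
vectors = [p.to_vector() for p in pairs]
scale = fit_scaler(vectors)
scaled = [scale(v) for v in vectors]
print(scaled)
```

The scaler is fitted on the training vectors only and then reused for testing and prediction data, so that all three sets share the same feature ranges.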
The SVM algorithm can subsequently be used to directly filter the data generated by the BLAST alignment program.

We gratefully acknowledge Cotton Incorporated for funding the CMD project and related research.

Figure 1. The machine learning workflow.

RESULT

SVM with different kernel functions was applied to develop a method for accurate detection of SSR redundancy. The best results were obtained with the sigmoid kernel, whose sensitivity and F-score values were higher than those of the other kernel functions tested (Table 1). These results indicate that the SVM-based method identifies true SSR redundancy with high accuracy.

EXAMPLES of SSR Prediction Data

Example A. Similarity of 2 SSRs based on initial sequence alignment, but disagreeing with SVM results. The genetic map positions of the 2 SSRs do not match, which indicates that the SVM prediction is correct.

Redundancy Pair: NAU864 - MUSS298
Primer Match: 96%
Match Type: Forward – Forward
Motif Similarity: 100%
SSR Blast: 812

Columns: Marker Name, Ch/LG, Position (cM), Cross, Map
NAU RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) 2006
MUSS RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) 2006

Example B. SSR similarity based on initial sequence alignment and confirmed by SVM.

Redundancy Pair: BNL - BNL
Primer Match: %
Match Type: Reverse – Reverse
Motif Similarity: 100%
SSR Blast: 670

Columns: Marker Name, Ch/LG, Position (cM), Cross, Map
BNL3031 9/LG
  BC1-RIL: ("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense))
  BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]
  BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]
  BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2")
  BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") 2004
BNL1672 9/LG
  BC1-RIL: ("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense))
  BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]
  BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"]
  BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2")
  BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") 2004

* TP: true positive, FP: false positive, TN: true negative, FN: false negative.
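The measures named in the Table 1 columns (sensitivity, specificity, precision, accuracy, F-score) can all be derived from the four counts TP, FP, TN and FN. The poster does not spell out its formulas, so the sketch below uses the standard definitions, which is an assumption; the function name `classification_metrics` is hypothetical, and the counts are made up for illustration (chosen only to add up to the 100 positive and 119 negative testing pairs). They are not the poster's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return standard classification measures, in percent, from the counts."""

    def pct(numerator, denominator):
        # Undefined ratios (zero denominator) are reported as NaN.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": pct(tp, tp + fn),             # TP / (TP + FN)
        "specificity": pct(tn, tn + fp),             # TN / (TN + FP)
        "precision": pct(tp, tp + fp),               # TP / (TP + FP)
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        # F-score as the harmonic mean of precision and sensitivity,
        # written in its equivalent count form 2TP / (2TP + FP + FN).
        "f_score": pct(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts for illustration only; NOT values from Table 1.
example = classification_metrics(tp=90, fp=10, tn=109, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Computing every measure from the same four counts keeps the columns of such a table mutually consistent and makes it easy to compare kernels on identical test data.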