Optimization of SVM Parameters for Promoter Recognition in DNA Sequences Robertas Damaševičius Software Engineering Department, Kaunas University of Technology.


Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų , Kaunas, Lithuania

Continuous Optimization and Knowledge-Based Technologies – EurOPT’

Data: genetic (DNA) sequences
- Meaning: represent genetic information stored in the DNA molecule in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information
  - protein-coding genes
  - regulatory sequences
  - mRNA sequences responsible for protein structure
  - directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is “junk DNA” (its biological function is not fully understood)
- Aim: identify structural parts of DNA
  - introns, exons, promoters, splice sites, etc.

What are promoters?
- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites or “boxes”, short DNA subsequences which are (usually) conserved
[Figure: gene structure diagram with the labels Promoter, Start, exon1, intron, exon2, exon3, Stop, Gene]

Promoter recognition problem
- Multitude of promoter “boxes” (nucleotide patterns)
  - TATA, Pribnow, Gilbert, DPE, E-box, Y-box, …
- “Boxes” within a species are conserved, but there are many exceptions to this rule
(a) Exact pattern = TACACC
    CAATGCAGGA TACACC GATCGGTA
(b) Pattern with mismatches = TACACC + 1 mismatch
    CAATGCAGGA TTCACC GATCGGTA
(c) Degenerate pattern = TASDCC (S = {C, G}, D = {A, G, T})
    CAATGCAGGA TAGTCC GATCGGTA
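
The three matching modes on this slide (exact, with mismatches, degenerate) can be illustrated with one small scanner. This is a minimal sketch rather than the author's method: the function name `find_pattern` and the `DEGENERATE` table are hypothetical, and only the two degenerate symbols defined on the slide (S and D) are included.

```python
# Symbols allowed in a pattern; plain bases match only themselves.
# S and D follow the definitions given on the slide.
DEGENERATE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG",   # S = {C, G}
    "D": "AGT",  # D = {A, G, T}
}

def find_pattern(sequence, pattern, max_mismatches=0):
    """Return start positions where pattern matches with at most max_mismatches."""
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # Count positions whose base is not allowed by the pattern symbol.
        mismatches = sum(
            base not in DEGENERATE[symbol]
            for base, symbol in zip(window, pattern)
        )
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

# The three example sequences from the slide, with spaces removed.
exact = find_pattern("CAATGCAGGATACACCGATCGGTA", "TACACC")
one_off = find_pattern("CAATGCAGGATTCACCGATCGGTA", "TACACC", max_mismatches=1)
degenerate = find_pattern("CAATGCAGGATAGTCCGATCGGTA", "TASDCC")
```

In each case the box starts at position 10 (0-based), right after the ten-base prefix CAATGCAGGA.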

Support Vector Machine (SVM)
[Formula not preserved in the transcript.] Its symbols denote the training data vectors, the unknown data vectors, the target space, and the kernel function.
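
For orientation, the standard textbook form of a kernel SVM decision function is given here. It is an assumption that the slide showed something similar; all symbol names ($x_i$, $x$, $y_i$, $\alpha_i$, $b$, $K$, $N$) are chosen for illustration and are not taken from the original.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b \right),
\qquad y_i \in Y = \{-1, +1\}
```

Here $x_i$ are the training data vectors, $x$ is an unknown data vector, $Y$ is the target space, $K$ is the kernel function, and $\alpha_i$ and $b$ are learned during training.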

Quality of classification
- Training data
  - size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space
  - orthogonal, single nucleotide, nucleotide grouping, ...
- Selection of an optimal kernel function
  - linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters
  - regularization parameter, cost factor
- Selection of SVM parameter values is an optimization problem
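
As an illustration of the orthogonal mapping mentioned above, the sketch below encodes each nucleotide as a 4-element indicator vector and concatenates them. The assignment of positions to bases and the name `encode_orthogonal` are assumptions; the slides do not specify the exact encoding used.

```python
# Assumed position assignment: one indicator slot per nucleotide.
ORTHOGONAL = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def encode_orthogonal(sequence):
    """Map a DNA string to a flat 0/1 feature vector of length 4 * len(sequence)."""
    features = []
    for base in sequence.upper():
        features.extend(ORTHOGONAL[base])
    return features

vector = encode_orthogonal("TACACC")
```

Under this encoding a 300 bp sequence becomes a 1200-dimensional vector with exactly one nonzero entry per nucleotide.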

SVM optimization strategies
- Kernel optimization
  - Putting additional parameters
  - Designing new kernels
- Parameter optimization
  - Learning parameters only
  - Kernel parameters only
  - Learning & kernel parameters
- Optimization decisions
  - Optimization method
  - Objective function

SVM (hyper)parameters
- Kernel parameters
- Learning parameters

SVM parameter optimization methods

Random search
  Advantages: Simplicity.
  Disadvantages: Depends on selection of random points and their distribution. Very slow as the size of the parameter space increases.

Grid search
  Advantages: Simplicity. A starting point is not required.
  Disadvantages: Box-constraints for grid are necessary. No optimality criteria for the solution. Computationally expensive for a large number of parameters. Solution depends upon coarseness of the grid.

Nelder-Mead
  Advantages: Few function evaluations. Good convergence and stability.
  Disadvantages: Can fail if the initial simplex is too small. No proof of convergence.
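
The trade-offs of the first two methods can be seen on a toy problem. The sketch below is illustrative only: `toy_objective` is a hypothetical stand-in for a classification error, and the parameter names `c` and `gamma` are assumptions, not the parameters tuned in the presentation.

```python
import itertools
import random

def toy_objective(c, gamma):
    """Hypothetical stand-in for an error to minimize; optimum at (2.0, 0.5)."""
    return (c - 2.0) ** 2 + (gamma - 0.5) ** 2

def grid_search(objective, c_values, gamma_values):
    """Evaluate every grid point; needs box constraints but no starting point."""
    return min(itertools.product(c_values, gamma_values),
               key=lambda point: objective(*point))

def random_search(objective, c_range, gamma_range, n_points, seed=0):
    """Evaluate uniformly sampled points; the result depends on the sample drawn."""
    rng = random.Random(seed)
    points = [(rng.uniform(*c_range), rng.uniform(*gamma_range))
              for _ in range(n_points)]
    return min(points, key=lambda point: objective(*point))

# A coarse grid misses the optimum; a finer grid that contains it finds it.
coarse = grid_search(toy_objective, [0.0, 1.0, 3.0, 4.0], [0.0, 1.0])
fine = grid_search(toy_objective, [0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.0])
sampled = random_search(toy_objective, (0.0, 4.0), (0.0, 1.0), n_points=50)
```

The coarse grid needs 8 evaluations and the fine grid 15, which shows how the cost grows with grid resolution and with the number of parameters.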

Dataset
- Drosophila sequence datasets:
  - Promoter dataset: 1842 sequences, each 300 bp long, from -250 bp to +50 bp relative to the gene transcription site location
  - Intron dataset: 1799 sequences, each 300 bp long
  - Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long
- Datasets for SVM classifier:
  - Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
  - Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)
- Datasets are unbalanced:
  - 29.5% promoters vs. 70.5% non-promoters in the training dataset
  - 28.3% promoters vs. 71.7% non-promoters in the test dataset
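
The class-balance percentages follow directly from the example counts on this slide, as the short check below shows. The variable and function names are illustrative.

```python
# Example counts as given on the slide.
train_counts = {"promoters": 372, "introns": 361, "CDS": 527}
test_counts = {"promoters": 1842, "introns": 1799, "CDS": 2859}

def promoter_share(counts):
    """Percentage of promoter (positive) examples in a dataset."""
    total = sum(counts.values())
    return 100.0 * counts["promoters"] / total

train_share = round(promoter_share(train_counts), 1)
test_share = round(promoter_share(test_counts), 1)
```

Introns and coding sequences together form the negative (non-promoter) class, so the non-promoter share is 100 minus each value.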

Classification requisites
- Feature mapping: orthogonal
- Kernel function: power series kernel
- Metrics:
  - Specificity (SPC)
  - Sensitivity (TPR)
- SVM classifier: SVMlight
- SVM parameter optimization method:
  - Modified Nelder-Mead (downhill simplex)
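
The two metrics listed above have standard definitions in terms of confusion-matrix counts. The sketch below uses those definitions; the counts in the example are hypothetical and are not results from the presentation.

```python
def sensitivity(tp, fn):
    """True positive rate (TPR): share of real promoters that were recognized."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SPC: share of non-promoters that were correctly rejected."""
    return tn / (tn + fp)

# Hypothetical counts for illustration only.
tpr = sensitivity(tp=90, fn=10)
spc = specificity(tn=160, fp=40)
```

Reporting both values matters for an unbalanced dataset, because a classifier that rejects everything would reach a specificity of 1.0 with a sensitivity of 0.0.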

Modification of Nelder-Mead
- Optimization time problem:
  - A call to the SVM training and testing function is very time-costly for large datasets
  - Requires many evaluations of the objective function
- Modifications:
  - Function value caching
  - Normalization after reflection step
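
The first modification, function value caching, can be sketched as a wrapper around the objective, combined with a simplified Nelder-Mead loop. This is a sketch under stated assumptions, not the author's implementation: the coefficients (reflection 1, expansion 2, contraction 0.5, shrink 0.5) are the common textbook defaults, `expensive_objective` is a hypothetical stand-in for an SVM train-and-test call, and the normalization step is omitted because the slide does not define it.

```python
def cached(objective, digits=6):
    """Wrap objective so points equal after rounding are evaluated only once."""
    cache = {}
    calls = {"real": 0, "total": 0}
    def wrapper(point):
        calls["total"] += 1
        key = tuple(round(v, digits) for v in point)
        if key not in cache:
            calls["real"] += 1          # only here is the costly function run
            cache[key] = objective(point)
        return cache[key]
    wrapper.calls = calls
    return wrapper

def nelder_mead(f, start, step=0.5, iterations=200):
    """Simplified downhill simplex minimizer (textbook coefficients assumed)."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):                  # initial simplex: start plus axis steps
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)
    for _ in range(iterations):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:                       # shrink all vertices toward the best one
                simplex = [best] + [
                    [b + 0.5 * (v - b) for b, v in zip(best, vertex)]
                    for vertex in simplex[1:]
                ]
    return min(simplex, key=f)

def expensive_objective(point):
    """Hypothetical stand-in for SVM training plus testing; minimum at (1, -2)."""
    x, y = point
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

objective = cached(expensive_objective)
best_point = nelder_mead(objective, [0.0, 0.0])
```

Because the simplex is re-sorted and compared in every iteration, most calls ask for a vertex that was already evaluated, so `objective.calls["real"]` stays well below `objective.calls["total"]`.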

Classification results

Classification evaluation metrics: specificity (SPC) and sensitivity (TPR)

Kernel            No. of optimized  Type of optimized  Specificity  Sensitivity
                  parameters        parameters         (SPC)        (TPR)
Linear            -                 none               84.83%       58.25%
Linear            3                 learning           91.23%       81.38%
Polynomial        -                 none               81.81%       44.90%
Polynomial        6                 learning + kernel  87.64%       67.48%
Power series (2)  3                 kernel             94.85%       89.69%
Power series (3)  4                 kernel             94.92%       89.95%

ROC plot
[Figure: ROC plot, not preserved in the transcript]

Conclusions
- An SVM classifier alone cannot achieve satisfactory classification results for a complex unbalanced dataset
- SVM parameter optimization can improve classification results significantly
- Best results can be achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters

Ongoing work and future research
- Application of SVM parameter optimization to the splice site recognition problem [presented in CISIS’2008]
- Selection of rules for optimal DNA sequence mapping to the feature space [accepted to WCSB’2008]
- Analysis of the relationships between data characteristics and classifier behavior [accepted to IS’2008]
- Automatic derivation of formal grammar rules [accepted to KES’2008]
- Structural analysis of sequences using SVM with grammar inference [accepted to ITA’2008]

Thank You. Any questions?