
1 Optimization of SVM Parameters for Promoter Recognition in DNA Sequences
Robertas Damaševičius
Software Engineering Department, Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
Email: damarobe@soften.ktu.lt

2 Data: genetic (DNA) sequences
- Meaning: genetic information stored in the DNA molecule, in symbolic form
- Syntax: 4-letter alphabet {A, C, G, T}
- Complexity: numerous layers of information
  - protein-coding genes
  - regulatory sequences
  - mRNA sequences responsible for protein structure
  - directions for DNA packaging and unwinding, etc.
- Motivation: over 95% is "junk DNA" whose biological function is not fully understood
- Aim: identify structural parts of DNA
  - introns, exons, promoters, splice sites, etc.

3 What are promoters?
- Promoter: a regulatory region of DNA located upstream of a gene, providing a control point for gene transcription
- Function: by binding to the promoter, specific proteins (transcription factors) can either promote or repress the transcription of a gene
- Structure: promoters contain binding sites, or "boxes" – short DNA subsequences that are (usually) conserved

[Diagram: a promoter upstream of a gene; the gene runs from Start to Stop and contains exons (exon1, exon2, exon3) separated by introns]

4 Promoter recognition problem
- Multitude of promoter "boxes" (nucleotide patterns): TATA, Pribnow, Gilbert, DPE, E-box, Y-box, ...
- "Boxes" within a species are conserved, but there are many exceptions to this rule
- (a) Exact pattern = TACACC
  CAATGCAGGA TACACC GATCGGTA
- (b) Pattern with mismatches = TACACC + 1 mismatch
  CAATGCAGGA TTCACC GATCGGTA
- (c) Degenerate pattern = TASDCC (S = {C, G}, D = {A, G, T})
  CAATGCAGGA TAGTCC GATCGGTA
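
The three matching modes above differ only in how a pattern symbol may pair with a nucleotide. A minimal sketch of such a matcher (a hypothetical illustration, not code from the talk):

```python
# Degenerate symbols from example (c): S = {C, G}, D = {A, G, T}.
DEGENERATE = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "S": {"C", "G"}, "D": {"A", "G", "T"},
}

def matches(pattern, window, max_mismatches=0):
    """True if `window` fits `pattern`, allowing up to `max_mismatches`."""
    mismatches = sum(base not in DEGENERATE[sym]
                     for sym, base in zip(pattern, window))
    return mismatches <= max_mismatches

def find_box(pattern, sequence, max_mismatches=0):
    """Yield every start position where the pattern occurs."""
    k = len(pattern)
    for i in range(len(sequence) - k + 1):
        if matches(pattern, sequence[i:i + k], max_mismatches):
            yield i

# Example (c): the degenerate pattern TASDCC matches TAGTCC.
print(list(find_box("TASDCC", "CAATGCAGGATAGTCCGATCGGTA")))  # [10]
```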

5 Support Vector Machine (SVM)

The SVM decision function for an unknown vector x is

f(x) = sgn( Σ_i α_i y_i K(x_i, x) + b ),

where x_i are training data vectors, x are unknown data vectors, y_i ∈ {−1, +1} is the target space, and K(·, ·) is the kernel function.
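
To make the decision function concrete, here is a small NumPy sketch (an illustrative stand-in, not the SVM-light implementation used later in the talk; the RBF kernel is only an example choice):

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    """Example kernel: K(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alphas, labels, support_vectors))
    return np.sign(s + b)
```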

6 Quality of classification
- Training data: size of dataset, generation of negative examples, imbalanced datasets
- Mapping of data into feature space: orthogonal, single nucleotide, nucleotide grouping, ...
- Selection of an optimal kernel function: linear, polynomial, RBF, sigmoid
- Kernel function parameters
- SVM learning parameters: regularization parameter, cost factor

Selection of SVM parameter values is an optimization problem.
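
For the "orthogonal" feature mapping listed above, the usual convention is a one-hot encoding of each nucleotide; a brief sketch under that assumption (the exact mapping used in the talk is not shown on the slide):

```python
import numpy as np

# One-hot ("orthogonal") encoding: each nucleotide becomes a 4-D unit vector.
ORTHOGONAL = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def encode(sequence):
    """Map a DNA string to a flat numeric feature vector of length 4 * len."""
    return np.array([bit for base in sequence for bit in ORTHOGONAL[base]])

x = encode("TATAAT")   # the Pribnow box
print(x.shape)         # (24,) -> 4 features per nucleotide
```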

7 SVM optimization strategies
- Kernel optimization
  - Adding extra parameters
  - Designing new kernels
- Parameter optimization
  - Learning parameters only
  - Kernel parameters only
  - Learning & kernel parameters
- Optimization decisions
  - Optimization method
  - Objective function

8 SVM (hyper)parameters
- Kernel parameters
- Learning parameters

9 SVM parameter optimization methods

Method        | Advantages                                    | Disadvantages
------------- | --------------------------------------------- | -------------
Random search | Simplicity.                                   | Depends on the selection of random points and their distribution. Very slow as the size of the parameter space increases.
Grid search   | Simplicity. A starting point is not required. | Box constraints for the grid are necessary. No optimality criteria for the solution. Computationally expensive for a large number of parameters. The solution depends on the coarseness of the grid.
Nelder-Mead   | Few function evaluations. Good convergence and stability. | Can fail if the initial simplex is too small. No proof of convergence.
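
A hypothetical sketch of how the Nelder-Mead row translates into code, using SciPy's simplex implementation with scikit-learn's SVC standing in for SVM-light; searching in log space (so the simplex cannot produce negative parameter values) is an assumption, not something stated on the slide:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(log_params, X, y):
    """Negative cross-validated accuracy of an SVM with the given
    (C, gamma); minimizing this maximizes classification quality."""
    C, gamma = np.exp(log_params)
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return -cross_val_score(clf, X, y, cv=3).mean()

def tune(X, y):
    """Nelder-Mead search over (log C, log gamma)."""
    result = minimize(objective, x0=np.log([1.0, 0.1]), args=(X, y),
                      method="Nelder-Mead")
    return np.exp(result.x)   # best (C, gamma) found
```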

10 Dataset
Drosophila sequence datasets:
- Promoter dataset: 1842 sequences, each 300 bp long, spanning -250 bp to +50 bp with respect to the gene transcription start site
- Intron dataset: 1799 sequences, each 300 bp long
- Coding sequence (CDS) dataset: 2859 sequences, each 300 bp long

Datasets for the SVM classifier:
- Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
- Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)

Datasets are unbalanced:
- 29.5% promoters vs. 70.5% non-promoters in the training dataset
- 28.3% promoters vs. 71.7% non-promoters in the test dataset
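
The imbalance percentages follow directly from the counts above. The cost-factor computation below (negatives over positives, as passed to SVM-light's -j option) is an assumption about how the imbalance could be compensated; the slide itself does not say how it was handled:

```python
train_pos, train_total = 372, 1260
test_pos, test_total = 1842, 6500

print(f"train: {train_pos / train_total:.1%} promoters")  # 29.5%
print(f"test:  {test_pos / test_total:.1%} promoters")    # 28.3%

# Ratio of negative to positive examples, usable as SVM-light's
# cost factor -j (an assumption here, not stated on the slide).
cost_factor = (train_total - train_pos) / train_pos
print(f"cost factor j ~ {cost_factor:.2f}")               # ~2.39
```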

11 Classification requisites
- Feature mapping: orthogonal
- Kernel function: power series kernel
- Metrics: specificity (SPC), sensitivity (TPR)
- SVM classifier: SVM-light
- SVM parameter optimization method: modified Nelder-Mead (downhill simplex)
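
The slide names a power series kernel but not its exact form; a common reading is K(x, z) = Σ_k a_k (x·z)^k, with the coefficients a_k as the tunable kernel parameters (matching the 3 and 4 parameter counts in the results table). A sketch under that assumption, with scikit-learn's callable-kernel interface standing in for SVM-light:

```python
import numpy as np
from sklearn.svm import SVC

def power_series_kernel(coeffs):
    """K(x, z) = sum_k coeffs[k] * (x . z)**k. Non-negative coefficients
    keep the kernel positive semidefinite. Returns a Gram-matrix callable
    suitable for SVC(kernel=...)."""
    def kernel(X, Z):
        dot = X @ Z.T
        return sum(a * dot ** k for k, a in enumerate(coeffs))
    return kernel

# "Power series (3)": four tunable coefficients, as in the results table.
clf = SVC(kernel=power_series_kernel([1.0, 0.5, 0.25, 0.125]))
```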

12 Modification of Nelder-Mead
Optimization time problem:
- Each call to the SVM training and testing function is very time-costly for large datasets
- Many evaluations of the objective function are required

Modifications:
- Function value caching
- Normalization after the reflection step
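
Function value caching is straightforward to sketch: when the simplex revisits a vertex (e.g. after a failed contraction), the stored score is reused instead of re-running the expensive SVM train/test cycle. A minimal illustration; the rounding-based cache key is an assumption:

```python
_cache = {}

def cached_objective(params, raw_objective, decimals=8):
    """Wrap a costly objective so repeated parameter vectors are free."""
    key = tuple(round(p, decimals) for p in params)
    if key not in _cache:
        _cache[key] = raw_objective(params)  # costly SVM train + test
    return _cache[key]
```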

13 Classification results

Kernel           | No. of optimized parameters | Type of optimized parameters | Specificity (SPC) | Sensitivity (TPR)
---------------- | --------------------------- | ---------------------------- | ----------------- | -----------------
Linear           | -                           | none                         | 84.83%            | 58.25%
Linear           | 3                           | learning                     | 91.23%            | 81.38%
Polynomial       | -                           | none                         | 81.81%            | 44.90%
Polynomial       | 6                           | learning + kernel            | 87.64%            | 67.48%
Power series (2) | 3                           | kernel                       | 94.85%            | 89.69%
Power series (3) | 4                           | kernel                       | 94.92%            | 89.95%
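
For reference, the two reported metrics computed from a confusion matrix (an illustrative helper, not code from the talk):

```python
from sklearn.metrics import confusion_matrix

def spc_tpr(y_true, y_pred):
    """Specificity SPC = TN / (TN + FP) and sensitivity TPR = TP / (TP + FN)
    for binary promoter (1) vs. non-promoter (0) labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp), tp / (tp + fn)
```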

14 ROC plot

[Figure: ROC plot of the classification results]

15 Conclusions
- The SVM classifier alone cannot achieve satisfactory classification results on a complex unbalanced dataset
- SVM parameter optimization can improve classification results significantly
- The best results are achieved when SVM parameter optimization is combined with kernel function modification
- The power series kernel is particularly suitable for optimization because of its larger number of kernel parameters

16 Ongoing work and future research
- Application of SVM parameter optimization to the splice site recognition problem [presented at CISIS'2008]
- Selection of rules for optimal DNA sequence mapping to the feature space [accepted to WCSB'2008]
- Analysis of the relationships between data characteristics and classifier behavior [accepted to IS'2008]
- Automatic derivation of formal grammar rules [accepted to KES'2008]
- Structural analysis of sequences using SVM with grammar inference [accepted to ITA'2008]

17 Thank you. Any questions?

