

Optimized Numerical Mapping Scheme for Filter-Based Exon Location in DNA Using a Quasi-Newton Algorithm
P. Ramachandran, W.-S. Lu, and A. Antoniou
Department of Electrical Engineering, University of Victoria, BC, Canada
ISCAS 2010, Paris

2 DNA
The instructions to build and maintain a living organism are encoded in its DNA.
DNA is composed of smaller components called nucleotides, namely adenine, thymine, guanine, and cytosine (A, T, G, and C).
DNA comprises a pair of strands.

3 DNA (cont’d)
Nucleotides pair up across the two strands.
A always pairs with T and G always pairs with C.
Figure: Symbolic representation of a DNA sequence.
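The pairing rule can be illustrated with a minimal sketch. The function name `complement` and the example strand are hypothetical and serve only as an illustration.

```python
# Base-pairing rule stated on the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the bases that pair with each base of the given strand."""
    return "".join(PAIRS[base] for base in strand.upper())

print(complement("ATGC"))
```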

4 Genes
Regions in a genome that code for proteins are called genes.

5 Exons and Introns
Genes are further split into coding regions called exons and noncoding regions called introns.

6 Location of Exons
Accurate location of exons in genomes is very important for understanding life processes.
The power spectra of DNA segments corresponding to exons exhibit a relatively strong component at the period-3 frequency, $\omega = 2\pi/3$. This is known as the period-3 property.
Thus, exons can be located by mapping the DNA characters into numbers and then tracking the strength of the period-3 component along the length of the DNA sequence of interest.
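As a rough illustration of the period-3 property, the sketch below measures the power of a numerical sequence at the frequency 2π/3, which corresponds to a period of three samples. The function name `period3_power` and the two toy sequences are assumptions made for this example and are not taken from the presentation.

```python
import cmath
import math

def period3_power(values):
    """Power of the discrete Fourier component at frequency 2*pi/3."""
    total = sum(v * cmath.exp(-2j * math.pi * n / 3)
                for n, v in enumerate(values))
    return abs(total) ** 2

# Hypothetical toy sequences: one repeats every three positions, one is flat.
periodic = [1, 0, 0] * 10
flat = [1] * 30

print(period3_power(periodic))  # large: strong period-3 component
print(period3_power(flat))      # close to zero: no period-3 component
```

Evaluating this quantity over a sliding window would track the strength of the period-3 component along the sequence.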

7 EIIP Values
Previously, we used electron-ion interaction potential (EIIP) values in conjunction with a filtering technique for exon location.
Here, we propose the use of an optimized set of nucleotide weights, which we refer to as pseudo-EIIP values, that significantly improves the accuracy of our exon-location technique.

8 Filter-Based Exon Location Technique
1. The DNA character sequence of interest is mapped onto a numerical sequence using EIIP values.
Table: EIIP values for the nucleotides adenine, thymine, guanine, and cytosine.
2. A narrowband bandpass digital filter with its passband centered at the period-3 frequency is used to filter the resulting numerical sequence.
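Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the EIIP numbers are the values commonly cited in the literature, and the filter is a simple two-pole resonator chosen for brevity, not necessarily the filter design used by the authors. The names `map_dna`, `bandpass_period3`, and the pole radius `r` are hypothetical.

```python
import math

# Assumption: commonly cited EIIP values, used here for illustration only.
EIIP = {"A": 0.1260, "T": 0.1335, "G": 0.0806, "C": 0.1340}

def map_dna(sequence, weights=EIIP):
    """Step 1: map each nucleotide character to its numerical weight."""
    return [weights[base] for base in sequence.upper()]

def bandpass_period3(x, r=0.99):
    """Step 2 (sketch): resonator with poles at r*exp(+/-j*2*pi/3), zeros at z = +/-1."""
    w0 = 2 * math.pi / 3
    a1 = -2 * r * math.cos(w0)   # denominator coefficient of z^-1
    a2 = r * r                   # denominator coefficient of z^-2
    gain = (1 - r * r) / 2       # rough passband gain normalization
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = gain * (xn - x2) - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

numeric = map_dna("ATGGCCATG")
filtered = bandpass_period3(numeric)
print(len(numeric), len(filtered))
```

A pole radius closer to 1 narrows the passband at the cost of a longer transient.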

9 Filter-Based Technique (cont’d)
3. The filtered output is an amplitude-modulated signal, which is demodulated by filtering its power using a lowpass filter. The exon locations are identified as distinct peaks.
Figure: Exon location system.
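The demodulation in step 3 can be sketched as squaring the filtered signal and smoothing the result. The moving-average lowpass filter, the function name `demodulate`, and the window length are assumptions for this illustration; the presentation does not specify the lowpass design.

```python
def demodulate(filtered, window=5):
    """Square the signal to get its power, then smooth with a causal moving average."""
    power = [v * v for v in filtered]
    envelope = []
    running = 0.0
    for n, p in enumerate(power):
        running += p
        if n >= window:
            running -= power[n - window]  # drop the sample leaving the window
        envelope.append(running / window)
    return envelope

# Hypothetical amplitude-modulated burst surrounded by silence.
burst = [0.0] * 20 + [(-1.0) ** n for n in range(20)] + [0.0] * 20
envelope = demodulate(burst)
print(envelope.index(max(envelope)))  # index of the first envelope maximum
```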

10 Receiver Operating Characteristic (ROC) Technique
The ROC technique is a tool for evaluating prediction techniques in terms of their performance.
It is based on metrics known as the true positive rate (TPR) and the false positive rate (FPR):
$\mathrm{TPR} = \dfrac{TP}{TP + FN}$ and $\mathrm{FPR} = \dfrac{FP}{FP + TN}$
TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively, of the predicted exon locations relative to a set of known true locations.
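A minimal sketch of the two rates, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The per-position 0/1 labels and the function name `rates` are hypothetical; how predictions are matched to known exon locations is not specified here.

```python
def rates(predicted, actual):
    """Return (TPR, FPR) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: 1 marks an exon position, 0 a non-exon position.
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 1, 1]
print(rates(predicted, actual))
```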

11 ROC Technique (cont’d)
The TPR is plotted versus the FPR to obtain a point in the ROC plane as illustrated.
Since the TPR and FPR range from 0 to 1, the total area of the ROC plane is unity.
Figure: ROC plane.

12 ROC Technique (cont’d)
The northwest corner, (0, 1), represents perfect prediction and the goal of any prediction technique is to reach this point.
The area under the ROC curve (AUC) is a good indicator of the overall performance of an exon-location technique.
The greater the AUC, the better the performance.
Figure: ROC plane.
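One way to obtain an ROC curve and its AUC is to sweep a decision threshold over a set of scores and integrate the resulting points with the trapezoidal rule. The sketch below assumes this threshold sweep; the function names and the toy scores are hypothetical and the presentation does not state how its ROC points were generated.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) for thresholds at each distinct score, highest first."""
    positives = sum(labels)
    negatives = len(labels) - positives  # both classes must be present
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores; labels mark true exon positions with 1.
scores = [0.9, 0.8, 0.7, 0.6]
print(auc(roc_points(scores, [1, 1, 0, 0])))  # perfect ranking
print(auc(roc_points(scores, [1, 0, 1, 0])))  # imperfect ranking
```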

13 Proposed Training Procedure
A better set of nucleotide weights can be obtained by maximizing the AUC corresponding to a training set of DNA sequences or, equivalently, by minimizing the quantity 1−AUC.
A quasi-Newton algorithm based on the BFGS updating formula was found to give good results.
Closed-form expressions for the objective function and gradient are not possible for this problem and, therefore, they are evaluated numerically.
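The following is a generic sketch of a quasi-Newton iteration with the BFGS update of the inverse Hessian estimate and a gradient evaluated by central differences. It is not the authors' implementation: the backtracking line search, the tolerances, and the quadratic test objective are assumptions, and in the actual procedure the objective would be 1−AUC computed over the training set.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def bfgs_minimize(f, x0, tol=1e-6, max_iter=200):
    n = len(x0)
    x = list(x0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # inverse Hessian estimate
    g = numerical_gradient(f, x)
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:
            break
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search with the Armijo sufficient-decrease test.
        step, fx = 1.0, f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while step > 1e-12 and f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = numerical_gradient(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # apply the BFGS update only if the curvature condition holds
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1 + yHy / sy) * s[i] * s[j]
                                - Hy[i] * s[j] - s[i] * Hy[j]) / sy
        x, g = x_new, g_new
    return x

# Hypothetical smooth test objective with its minimum at (1, -0.5).
def objective(v):
    return (v[0] - 1) ** 2 + 2 * (v[1] + 0.5) ** 2

print(bfgs_minimize(objective, [0.0, 0.0]))
```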

14 Training Procedure (cont’d)
For consistency between the optimized nucleotide weights and the EIIP values, we need to ensure that
- the four variables are always positive and
- their numerical values are normalized at the end of each iteration such that their sum is always equal to the sum of the EIIP values.

15 Training Procedure (cont’d)
- Positive values can be achieved by replacing each variable by its square in the objective function.
- The normalization can be achieved by multiplying the weights in each iteration by a scaling factor equal to a constant, the sum of the actual EIIP values, divided by the sum of the current optimized nucleotide weights.
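The two constraints can be sketched as below. The function names and the sample variable values are hypothetical, and the target sum uses the commonly cited EIIP values as an assumption.

```python
# Assumption: sum of the commonly cited EIIP values (A, T, G, C).
EIIP_SUM = 0.1260 + 0.1335 + 0.0806 + 0.1340

def weights_from_variables(variables):
    """Squaring keeps every weight non-negative whatever the sign of the variable."""
    return [v * v for v in variables]

def normalize(weights, target_sum=EIIP_SUM):
    """Rescale the weights so that they sum to target_sum."""
    scale = target_sum / sum(weights)
    return [scale * w for w in weights]

# Hypothetical optimization variables after one iteration.
variables = [-0.3, 0.4, 0.2, 0.5]
weights = normalize(weights_from_variables(variables))
print(weights, sum(weights))
```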

16 Model for ROC Curves
ROC curves are not continuous but can be approximated using an exponential model.
The model parameters can be determined by minimizing an error function evaluated at points in the ROC plane.

17 Training Procedure (cont’d)
The minimization can be performed using a quasi-Newton algorithm as before.
Figure: Sample ROC curve and its approximation.

18 Results
Simulations were performed to optimize the nucleotide weights using a specific data set and then test the optimized weights on a nonoverlapping test set.
The data sets were chosen from the popular HMR195 database.
Of the 195 sequences in the database, we selected the 160 sequences that have been verified experimentally and divided them into two sets, the initial training set and a test set, of 80 sequences each.

19 Results (cont’d)
Termination tolerance:
Iterations for minimization of 1−AUC: 42
Iterations for exponential model: 20

20 Results (cont’d)
Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using the training set. Legend: pseudo-EIIP values, EIIP values.

21 Results (cont’d)
Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using a test set with no overlap with the training set. Legend: pseudo-EIIP values, EIIP values.

22 Conclusions
A method for obtaining optimized nucleotide weights, referred to as pseudo-EIIP values, has been proposed for use in filter-based exon location in DNA sequences.
The pseudo-EIIP values were found to yield improved exon location with respect to the training set as well as a nonoverlapping set of DNA sequences.
The pseudo-EIIP values make the filter-based exon-location technique a more useful computational tool that biologists can use as an alternative to expensive and laborious wet-lab experimental techniques.

23 Thank you for your attention.