
1 Optimized Numerical Mapping Scheme for Filter-Based Exon Location in DNA Using a Quasi-Newton Algorithm P. Ramachandran, W.-S. Lu, and A. Antoniou Department of Electrical Engineering, University of Victoria, BC, Canada. ISCAS 2010, Paris

2 DNA
The instructions to build and maintain a living organism are encoded in its DNA. DNA is composed of smaller components called nucleotides, namely, adenine, thymine, guanine, and cytosine (A, T, G, and C). DNA comprises a pair of strands.

3 DNA (cont’d)
Nucleotides pair up across the two strands. A always pairs with T and G always pairs with C.
Figure: Symbolic representation of a DNA sequence.
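The pairing rule above can be written as a lookup table. This is a minimal sketch; the names `PAIR` and `complement` are illustrative and do not come from the slides.

```python
# Base-pairing rule from the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the bases that pair with `strand`, position by position."""
    return "".join(PAIR[base] for base in strand)


print(complement("ATGC"))
```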

4 Genes
Regions in a genome that code for proteins are called genes.

5 Exons and Introns
Genes are further split into coding regions called exons and noncoding regions called introns.

6 Location of Exons
Accurate location of exons in genomes is very important for understanding life processes. The power spectra of DNA segments corresponding to exons exhibit a relatively strong component at $\omega = 2\pi/3$. This is known as the period-3 property. Thus, exons can be located by mapping the DNA characters into numbers and then tracking the strength of the period-3 component along the length of the DNA sequence of interest.
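One way to picture "tracking the strength of the period-3 component" is to evaluate a single Fourier component at 2π/3 over a sliding window. The sketch below is an illustration only, not the filter-based technique of the talk: it uses the EIIP values tabulated on slide 8 as the numerical map, and the function names and the default window length of 351 are assumptions.

```python
import cmath

# EIIP values as listed on slide 8 of the talk.
EIIP = {"A": 0.1260, "T": 0.1335, "G": 0.0806, "C": 0.1340}


def period3_power(segment):
    """Squared magnitude of the Fourier component at omega = 2*pi/3."""
    total = sum(
        EIIP[base] * cmath.exp(-2j * cmath.pi * n / 3)
        for n, base in enumerate(segment)
    )
    return abs(total) ** 2


def period3_track(seq, window=351):
    """Period-3 power for every window position (window length is an assumption)."""
    return [
        period3_power(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]
```

A segment that repeats every three bases, such as "GCA" repeated, gives a large value, while a constant segment whose length is a multiple of three gives a value close to zero.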

7 EIIP Values
Earlier, we used electron-ion interaction potential (EIIP) values in conjunction with a filtering technique for exon location. Here, we propose the use of an optimized set of nucleotide weights, which we refer to as pseudo-EIIP values, that significantly improve the accuracy of our exon-location technique.

8 Filter-Based Exon Location Technique
1. The DNA character sequence of interest is mapped onto a numerical sequence using EIIP values.

EIIP Values
Nucleotide   EIIP
Adenine      0.1260
Thymine      0.1335
Guanine      0.0806
Cytosine     0.1340

2. A narrowband bandpass digital filter with its passband centered at the period-3 frequency is used to filter the resulting numerical sequence.

9 Filter-Based Technique (cont’d)
3. The filtered output is an amplitude-modulated signal, which is demodulated by filtering its power using a lowpass filter. The exon locations are identified as distinct peaks.
Figure: Exon location system.
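Steps 1 to 3 can be sketched end to end as follows. The slides do not give the filter designs, so this sketch makes two assumptions: a second-order resonator with poles at radius `r` and angles ±2π/3 stands in for the narrowband bandpass filter, and a moving average stands in for the lowpass filter. All function names and the parameters `r` and `length` are hypothetical.

```python
import math

# EIIP values from step 1.
EIIP = {"A": 0.1260, "T": 0.1335, "G": 0.0806, "C": 0.1340}


def map_sequence(seq, weights=EIIP):
    """Step 1: map DNA characters onto numbers."""
    return [weights[base] for base in seq]


def bandpass_period3(x, r=0.99):
    """Step 2 (assumed design): resonator centered at omega = 2*pi/3.

    Zeros at z = +1 and z = -1 reject the DC and Nyquist components.
    """
    w0 = 2 * math.pi / 3
    a1 = 2 * r * math.cos(w0)
    a2 = -r * r
    y = []
    for n, xn in enumerate(x):
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn - x2 + a1 * y1 + a2 * y2)
    return y


def lowpass_power(y, length=31):
    """Step 3 (assumed design): moving average of the squared output."""
    power = [v * v for v in y]
    out, acc = [], 0.0
    for n, p in enumerate(power):
        acc += p
        if n >= length:
            acc -= power[n - length]
        out.append(acc / length)
    return out


def exon_indicator(seq, weights=EIIP):
    """Run the three steps; peaks in the result suggest exon locations."""
    return lowpass_power(bandpass_period3(map_sequence(seq, weights)))
```

Passing a different `weights` dictionary is how a set of pseudo-EIIP values would be substituted for the EIIP values in this sketch.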

10 Receiver Operating Characteristic (ROC) Technique
The ROC technique is a tool for evaluating prediction techniques in terms of their performance. It is based on metrics known as the true positive rate (TPR) and the false positive rate (FPR):
$$\mathrm{TPR} = \frac{TP}{TP + FN} \quad \text{and} \quad \mathrm{FPR} = \frac{FP}{FP + TN}$$
where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively, of the predicted exon locations relative to a set of known true locations.
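As a small worked example, the two rates can be computed from position-wise labels, where 1 marks an exon position and 0 a non-exon position. The function name `tpr_fpr` and the label encoding are assumptions made for illustration.

```python
def tpr_fpr(predicted, actual):
    """Return (TPR, FPR) for two equal-length lists of 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Two true exon positions, both found; one of three non-exon positions flagged.
print(tpr_fpr([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```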

11 ROC Technique (cont’d)
The TPR is plotted versus the FPR to obtain a point in the ROC plane as illustrated. Since the TPR and FPR range from 0 to 1, the total area of the ROC plane is unity.
Figure: ROC plane.

12 ROC Technique (cont’d)
The northwest corner, (0, 1), represents perfect prediction, and the goal of any prediction technique is to reach this point. The area under the ROC curve (AUC) is a good indicator of the overall performance of an exon-location technique: the greater the AUC, the better the performance.
Figure: ROC plane.
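The slides do not say how the ROC points are generated or how the area is integrated. A common approach, assumed here, is to sweep a decision threshold over the detector output to obtain (FPR, TPR) points and then apply the trapezoidal rule. The names `roc_points` and `auc` are illustrative, and the sketch assumes both classes are present in `labels`.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A detector that ranks every exon position above every non-exon position passes through (0, 1) and has an AUC of 1; one that ranks them in exactly the wrong order has an AUC of 0.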

13 Proposed Training Procedure
A better set of nucleotide weights can be obtained by maximizing the AUC corresponding to a training set of DNA sequences or, equivalently, by minimizing the quantity 1−AUC. A quasi-Newton algorithm based on the BFGS updating formula was found to give good results. Closed-form expressions for the objective function and its gradient are not available for this problem and, therefore, they are evaluated numerically.
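The slides do not state which numerical differentiation scheme was used. A central-difference approximation is one common choice and is sketched below; the function name `numerical_gradient` and the step size `h` are assumptions. In training, `f` would map the four variables to 1−AUC on the training set, and the resulting gradient would be passed to the BFGS update.

```python
def numerical_gradient(f, x, h=1e-4):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


# Check on a function with a known gradient: f = x0^2 + 3*x1, grad = (2*x0, 3).
print(numerical_gradient(lambda v: v[0] ** 2 + 3 * v[1], [1.0, 2.0]))
```

The step size matters in practice: an objective built from counts, such as 1−AUC, may not change at all under a very small perturbation of the weights.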

14 Training Procedure (cont’d)
For consistency between the optimized nucleotide weights and the EIIP values, we need to ensure that
• the four variables are always positive and
• their numerical values are normalized at the end of each iteration such that their sum is always equal to the sum of the EIIP values.

15 Training Procedure (cont’d)
• Positive values can be achieved by replacing each variable by its square in the objective function.
• The normalization can be achieved by using the following scaling factor in each iteration:
$$\frac{0.4741}{w_A + w_T + w_G + w_C}$$
where the constant 0.4741 is the sum of the actual EIIP values and the denominator variables (written here as $w_A$, $w_T$, $w_G$, and $w_C$) are the current optimized nucleotide weights.
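The two constraints can be sketched together as below. This is one reading of the slides, under the assumption that the weights are the squares of the unconstrained optimization variables and that the scaling is applied to those squares; the names `EIIP_SUM` and `to_weights` are illustrative.

```python
# Sum of the actual EIIP values: 0.1260 + 0.1335 + 0.0806 + 0.1340.
EIIP_SUM = 0.4741


def to_weights(variables):
    """Map unconstrained variables to positive, normalized nucleotide weights."""
    squares = [v * v for v in variables]      # squaring keeps every weight positive
    scale = EIIP_SUM / sum(squares)           # scaling factor from the slide
    return [scale * s for s in squares]


weights = to_weights([-0.3, 0.4, 0.2, 0.5])
print(weights, sum(weights))
```

Whatever the signs of the input variables, the returned weights are positive and sum to 0.4741, up to rounding.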

16 Model for ROC Curves
ROC curves are not continuous but can be approximated using an exponential model. The parameters of the model can be determined by minimizing an error function evaluated over points in the ROC plane.

17 Training Procedure (cont’d)
The minimization can be performed using a quasi-Newton algorithm as before.
Figure: Sample ROC curve and its approximation.

18 Results
Simulations were performed to optimize the nucleotide weights using a specific data set and then test the optimized weights on a nonoverlapping test set. The data sets were chosen from the popular HMR195 database. Of the 195 sequences in the database, we selected the 160 sequences that have been verified experimentally and divided them into two sets, the initial training set and a test set, of 80 sequences each.

19 Results (cont’d)
Termination tolerance: $10^{-6}$
Iterations for minimization of 1−AUC: 42
Iterations for exponential model: 20

20 Results (cont’d)
Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using the training set. Legend: pseudo-EIIP values; EIIP values.

21 Results (cont’d)
Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using a test set with no overlap with the training set. Legend: pseudo-EIIP values; EIIP values.

22 Conclusions
A method for obtaining optimized nucleotide weights, referred to as pseudo-EIIP values, has been proposed for use in filter-based exon location in DNA sequences. The pseudo-EIIP values were found to yield improved exon location with respect to the training set as well as a nonoverlapping set of DNA sequences. The pseudo-EIIP values render the filter-based exon location technique a more useful computational technique that can be used by biologists as an alternative to expensive and laborious wet experimental techniques.

23 Thank you for your attention.

