Methods of identification and localization of the DNA coding sequences Jacek Leluk Interdisciplinary Centre for Mathematical and Computational Modelling,


Methods of identification and localization of the DNA coding sequences
Jacek Leluk, Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University

Identification of coding/non-coding sequences in genome
Measures dependent on a model of coding DNA, based on:
- oligonucleotide counts: Codon usage, Amino acid usage, Codon preference, Hexamer usage
- base compositional bias between codon positions: Codon prototype
- dependence between nucleotide positions: Markov models
Measures independent of a model of coding DNA, based on:
- base compositional bias between codon positions: Position asymmetry
- periodic correlation between nucleotide positions: Periodic asymmetry index, Average mutual information, Fourier spectrum

The notation used
S – DNA sequence of length l; S_i (i = 1, ..., l) denotes the individual nucleotides
C – sequence of codons; C_j – the codon occupying position j in the sequence
C^i – the sequence of codons that results when the grouping of nucleotides from sequence S into codons starts at nucleotide i
C^i_j – the codon occupying position j in decomposition i of the sequence S
C[k] – the nucleotide occupying position k in the codon C
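The decomposition of S into codons can be sketched in a few lines of Python. This is only an illustration of the notation; the function name `codons` and the choice to drop incomplete trailing triplets are assumptions, not part of the slides.

```python
def codons(seq: str, frame: int) -> list[str]:
    """Return decomposition `frame` of seq: complete codons starting at nucleotide `frame`.

    `frame` is 1, 2 or 3, counting nucleotides from 1 as on the slide.
    Trailing nucleotides that do not fill a whole codon are dropped (an assumption).
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    start = frame - 1  # convert to a 0-based index
    return [seq[k:k + 3] for k in range(start, len(seq) - 2, 3)]


if __name__ == "__main__":
    s = "ATGGCCTAA"
    for i in (1, 2, 3):
        print(i, codons(s, i))
```

With this helper, the codon at position j of decomposition i is `codons(S, i)[j - 1]`, and nucleotide k of a codon `c` is `c[k - 1]`.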

Examples

The notation used
Measures based on a model of coding DNA
P^i(S) – probability of the sequence of nucleotides S, given that S is coding in frame i (i = 1, 2, 3)
P_0(S) – probability of S as a non-coding (randomly generated) DNA sequence
Likelihood ratio P^i(S) / P_0(S) – the ratio of the probability of finding the sequence of nucleotides S if S is coding in frame i to the probability of finding S if S is non-coding

The notation used
Measures based on a model of coding DNA
Log-likelihood ratio (the logarithm of the likelihood ratio) – the coding potential of sequence S in frame i, given the model of coding DNA
Positive value: the probability of the sequence of nucleotides S is higher assuming that S is coding in frame i than assuming that S is non-coding
Negative value: the probability of S is higher assuming that S is non-coding than assuming that S is coding in frame i
The log-likelihood ratio is computed for all three possible frames. If the sequence is coding, the log-likelihood ratio will be larger for one of the frames than for the other two.
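A minimal sketch of how the three per-frame log-likelihood ratios might be compared. The function names and the numeric probabilities are hypothetical placeholders; in practice the coding probabilities come from one of the models described on the following slides.

```python
import math


def log_likelihood_ratio(p_coding: float, p_noncoding: float) -> float:
    """Coding potential: log of P(S | coding in frame i) over P(S | non-coding)."""
    return math.log(p_coding / p_noncoding)


def best_frame(scores: dict[int, float]) -> int:
    """Return the frame (1, 2 or 3) with the largest log-likelihood ratio."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical probabilities of one sequence under the coding model, per frame.
    p_frame = {1: 2e-7, 2: 9e-6, 3: 1e-7}
    p_noncoding = 1e-6  # hypothetical probability under the non-coding model
    scores = {i: log_likelihood_ratio(p, p_noncoding) for i, p in p_frame.items()}
    print(scores, "best frame:", best_frame(scores))
```

A positive score in exactly one frame, as for frame 2 in this toy input, is the pattern the slide describes for a coding sequence.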

Codon usage
Measures based on a model of coding DNA; measures based on oligonucleotide counts
Codon usage table – the frequency (probability) of codon C in the genes of the considered species
P(C) – probability of finding the sequence of codons C, knowing that C codes for a protein
P_0(C) = (1/64)^m – probability of finding the non-coding sequence of m codons
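The codon usage measure can be sketched as follows, assuming that the probability of a codon sequence under the coding model is the product of the table frequencies of its codons (the transcript does not preserve the slide's formula, so this product form is an assumption that mirrors P_0(C) = (1/64)^m). The table values below are made up for illustration.

```python
import math


def codon_usage_score(codon_seq: list[str], usage: dict[str, float]) -> float:
    """Log-likelihood ratio of a codon sequence: coding model vs. uniform non-coding model.

    Assumes independent codons, so log P(C) is the sum of log frequencies,
    and log P_0(C) = m * log(1/64) for a sequence of m codons.
    """
    log_p_coding = sum(math.log(usage[c]) for c in codon_seq)
    log_p_noncoding = len(codon_seq) * math.log(1 / 64)
    return log_p_coding - log_p_noncoding


if __name__ == "__main__":
    toy_usage = {"ATG": 0.03, "GCC": 0.04, "TAA": 0.002}  # hypothetical frequencies
    print(codon_usage_score(["ATG", "GCC"], toy_usage))
```

Working with logarithms avoids numerical underflow, which a direct product of many small frequencies would cause for sequences of realistic length.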

Amino acid usage
Measures based on a model of coding DNA; measures based on oligonucleotide counts
The observed probability of the amino acid encoded by codon C in existing proteins. This value can be derived directly from a codon usage table by summing the probabilities of the synonymous codons (all codons c' synonymous to c).
The probability of finding the amino acid sequence that results from translating the sequence in the coding open reading frame.
The frequency of the "non-coding amino acids"; n_c – the number of codons synonymous to C.
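Summing synonymous codons might look like the sketch below. It assumes the standard genetic code and treats the three stop codons as one synonymous class, written `*`; both are assumptions of this example. The non-coding frequency is taken as n_c/64, which follows if every codon is equally likely in non-coding DNA.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_usage(codon_usage: dict[str, float]) -> dict[str, float]:
    """Sum the codon frequencies of all synonymous codons, per amino acid."""
    usage = defaultdict(float)
    for codon, freq in codon_usage.items():
        usage[GENETIC_CODE[codon]] += freq
    return dict(usage)


def noncoding_amino_acid_frequency(codon: str) -> float:
    """n_c / 64, where n_c is the number of codons synonymous to `codon`."""
    n_c = sum(1 for aa in GENETIC_CODE.values() if aa == GENETIC_CODE[codon])
    return n_c / 64


if __name__ == "__main__":
    uniform = {c: 1 / 64 for c in GENETIC_CODE}
    print(amino_acid_usage(uniform)["L"], noncoding_amino_acid_frequency("CTG"))
```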

Codon preference
Measures based on a model of coding DNA; measures based on oligonucleotide counts
The relative probability, in coding regions, of codon C among the codons synonymous to C.
The probability of the sequence S encoding the particular amino acid sequence in frame i.
The probability of codon C in non-coding DNA: in non-coding regions there is no preference between "synonymous codons", so each of the codons synonymous to C is equally probable.
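As a hedged illustration, the relative probability of a codon within its synonymous family can be computed by dividing its usage frequency by the summed frequency of the family. The function names, the toy frequencies and the use of the standard genetic code are assumptions of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_family(codon: str) -> list[str]:
    """All codons that encode the same amino acid as `codon` (including itself)."""
    return [c for c, aa in GENETIC_CODE.items() if aa == GENETIC_CODE[codon]]


def codon_preference(codon: str, codon_usage: dict[str, float]) -> float:
    """Frequency of `codon` relative to its synonymous family in coding regions.

    Codons missing from the table count as zero; a family with no observations
    raises ZeroDivisionError.
    """
    family_total = sum(codon_usage.get(c, 0.0) for c in synonymous_family(codon))
    return codon_usage.get(codon, 0.0) / family_total


def noncoding_codon_preference(codon: str) -> float:
    """With no preference among synonymous codons, each has probability 1/n_c."""
    return 1 / len(synonymous_family(codon))


if __name__ == "__main__":
    toy_usage = {"AAA": 0.03, "AAG": 0.01}  # hypothetical lysine codon frequencies
    print(codon_preference("AAA", toy_usage), noncoding_codon_preference("AAA"))
```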

Hexamer usage
Measures based on a model of coding DNA; measures based on oligonucleotide counts
This approach is based on the hexamer usage table, for i = 1, 2, 3, ...
In this case there are six reading frames to be analyzed.
The probability of a sequence of hexanucleotides in the coding frame of a coding sequence is computed from this table.

Codon prototype
Measures based on a model of coding DNA; measures based on base compositional bias between codon positions
Let f(b,r) be the probability of nucleotide b at codon position r, as estimated from known coding regions. Then P^1(S) = f(S_1,1) f(S_2,2) f(S_3,3) f(S_4,1) ...; P^2(S) and P^3(S) are computed in a similar way.
f(c[1],1) f(c[2],2) f(c[3],3) is the probability of codon c in coding regions, assuming independence between adjacent nucleotides.
1/64 is the probability of every triplet c in non-coding DNA.
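A small sketch of the codon prototype score in log form, assuming the positional table f(b,r) is given as a dictionary keyed by (nucleotide, codon position) and that non-coding DNA assigns 1/4 to every nucleotide (equivalent to 1/64 per triplet). The table values are invented for the example.

```python
import math


def codon_prototype_score(seq: str, f: dict[tuple[str, int], float], frame: int = 1) -> float:
    """Log-likelihood ratio of seq being coding in `frame` under the codon prototype model.

    Nucleotide number `frame` (1-based) is taken as codon position 1; every other
    nucleotide gets its codon position from its offset to that start.
    """
    start = frame - 1
    score = 0.0
    for idx, base in enumerate(seq):
        r = (idx - start) % 3 + 1          # codon position 1, 2 or 3
        score += math.log(f[(base, r)]) - math.log(0.25)
    return score


if __name__ == "__main__":
    # Hypothetical table: uniform except for a bias at codon position 1.
    f = {(b, r): 0.25 for b in "ACGT" for r in (1, 2, 3)}
    f.update({("A", 1): 0.30, ("C", 1): 0.20, ("G", 1): 0.35, ("T", 1): 0.15})
    for frame in (1, 2, 3):
        print(frame, codon_prototype_score("GCAGCTGAA", f, frame))
```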

Markov Models
Measures based on a model of coding DNA; measures based on dependence between nucleotide positions
In Markov models, the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it. The Markov model of order 1 is the simplest: the probability of a nucleotide depends only on the preceding nucleotide. In this case, the model of coding DNA is based on the probabilities of the four nucleotides at each codon position, conditional on the nucleotide occurring at the preceding codon position (technically called the transition probabilities). Thus, instead of one single matrix, as in Codon Prototype, three 4x4 matrices (the transition matrices) F_1, F_2 and F_3 are required, each one corresponding to a different codon position.
Markov models of order 1 to 5 are used.
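The three position-specific transition matrices of an order-1 model could be estimated and applied roughly as below. The pseudocount, the uniform probability for the first nucleotide, and all function names are assumptions of this sketch rather than details taken from the slides.

```python
import math

BASES = "ACGT"


def train_transitions(coding_seqs, pseudocount=1.0):
    """Estimate F[r][prev][cur]: probability of `cur` at codon position r given `prev`.

    Training sequences are assumed to be in frame (first nucleotide = codon position 1).
    """
    counts = {r: {p: {c: pseudocount for c in BASES} for p in BASES} for r in (1, 2, 3)}
    for seq in coding_seqs:
        for idx in range(1, len(seq)):
            r = idx % 3 + 1  # codon position of the current nucleotide
            counts[r][seq[idx - 1]][seq[idx]] += 1
    return {
        r: {p: {c: n / sum(row.values()) for c, n in row.items()} for p, row in rows.items()}
        for r, rows in counts.items()
    }


def markov_log_prob(seq, F, frame=1):
    """Log probability of seq under the order-1 coding model in the given frame."""
    start = frame - 1
    logp = math.log(0.25)  # assumption: uniform probability for the first nucleotide
    for idx in range(1, len(seq)):
        r = (idx - start) % 3 + 1
        logp += math.log(F[r][seq[idx - 1]][seq[idx]])
    return logp


if __name__ == "__main__":
    F = train_transitions(["ATGGCCAAGGCTTAA", "ATGGCTGCCAAATGA"])  # toy training set
    print({i: markov_log_prob("GCCAAGGCT", F, i) for i in (1, 2, 3)})
```

Higher-order models follow the same pattern, with the conditioning context extended from one preceding nucleotide to several.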

Position asymmetry
Measures independent of a model of coding DNA; measures based on base compositional bias between codon positions
The goal is to measure how asymmetric the distribution of nucleotides is at the three triplet positions in the sequence. The quantities used are:
- the relative frequency of nucleotide b at codon position r in the sequence S, as calculated from one of the three decompositions of S into codons (any of them)
- the average frequency of nucleotide b at the three codon positions
- the asymmetry in the distribution of nucleotide b

Position asymmetry (continued)
Measures independent of a model of coding DNA; measures based on base compositional bias between codon positions
Position Asymmetry of the sequence
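The slide's formulas are not preserved in the transcript. One common way to turn the three quantities into a number, used here as an assumption, is to take the asymmetry of nucleotide b as the sum of squared deviations of its three positional frequencies from their average, and the Position Asymmetry of the sequence as the sum over the four nucleotides.

```python
def position_asymmetry(seq: str) -> float:
    """Sum over nucleotides of the squared deviations of positional frequencies.

    Uses the decomposition starting at the first nucleotide; any trailing
    nucleotides that do not fill a whole codon are ignored.
    """
    n_codons = len(seq) // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one codon")
    total = 0.0
    for b in "ACGT":
        # f[r]: relative frequency of b at codon position r (0-based here)
        f = [sum(1 for k in range(r, 3 * n_codons, 3) if seq[k] == b) / n_codons
             for r in range(3)]
        mu = sum(f) / 3                         # average frequency of b
        total += sum((x - mu) ** 2 for x in f)  # asymmetry of b
    return total


if __name__ == "__main__":
    print(position_asymmetry("ACGACGACG"))  # strongly position-dependent composition
    print(position_asymmetry("AAAAAAAAA"))  # no positional bias
```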

Periodic asymmetry index
Measures independent of a model of coding DNA; measures based on periodic correlation between nucleotide positions
This approach considers three distinct probabilities:
- the probability P_in of finding pairs of the same nucleotide at distances k = 2, 5, 8, ...
- the probability P^1_out of finding pairs of the same nucleotide at distances k = 0, 3, 6, ...
- the probability P^2_out of finding pairs of the same nucleotide at distances k = 1, 4, 7, ...
The tendency to cluster homogeneous di-nucleotides in a 3-base periodic pattern can be measured by the Periodic Asymmetry Index.
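The index formula itself is missing from the transcript. The sketch below assumes it is the ratio of the largest to the smallest of the three probabilities, and that a distance k means k nucleotides lie between the two members of a pair (so in-frame pairs are 3, 6, 9, ... positions apart). Both points, and the cut-off `max_k`, are assumptions.

```python
import math


def same_pair_probability(seq: str, first_k: int, max_k: int) -> float:
    """Fraction of identical-nucleotide pairs over distances first_k, first_k + 3, ..."""
    same = total = 0
    for k in range(first_k, max_k + 1, 3):
        offset = k + 1  # assumption: k counts the nucleotides between the pair
        for j in range(len(seq) - offset):
            total += 1
            same += seq[j] == seq[j + offset]
    if total == 0:
        raise ValueError("sequence too short for the requested distances")
    return same / total


def periodic_asymmetry_index(seq: str, max_k: int = 8) -> float:
    """Assumed form: max(P_in, P1_out, P2_out) / min(P_in, P1_out, P2_out)."""
    probs = (
        same_pair_probability(seq, 2, max_k),  # P_in
        same_pair_probability(seq, 0, max_k),  # P^1_out
        same_pair_probability(seq, 1, max_k),  # P^2_out
    )
    return math.inf if min(probs) == 0 else max(probs) / min(probs)


if __name__ == "__main__":
    print(periodic_asymmetry_index("ACAACA", max_k=2))
```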

Average mutual information
Measures independent of a model of coding DNA; measures based on periodic correlation between nucleotide positions
The absolute number of times nucleotide i is followed by nucleotide j at a distance of k positions.
The probability that nucleotide i is followed by nucleotide j at a distance of k positions.
The correlation between nucleotides i and j at a distance of k positions, where p_i and p_j are the probabilities of occurrence of nucleotides i and j in sequence S.

Average mutual information (continued)
Measures independent of a model of coding DNA; measures based on periodic correlation between nucleotide positions
The Mutual Information function quantifies the amount of information that can be obtained from one nucleotide about another nucleotide at a distance k.
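The mutual information at distance k can be sketched with the standard information-theoretic definition, the sum over nucleotide pairs of p_k(i,j) log2(p_k(i,j) / (p_i p_j)). Whether the slide used exactly this form and base-2 logarithms is an assumption, as is the convention that k counts the nucleotides between the pair.

```python
import math
from collections import Counter


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (in bits) between nucleotides separated by k positions."""
    offset = k + 1  # assumption: k counts the nucleotides between the pair
    n_pairs = len(seq) - offset
    if n_pairs <= 0:
        raise ValueError("sequence too short for this distance")
    # p[b]: probability of nucleotide b in the whole sequence
    p = {b: n / len(seq) for b, n in Counter(seq).items()}
    # pair_counts[(i, j)]: number of times i is followed by j at this distance
    pair_counts = Counter((seq[j], seq[j + offset]) for j in range(n_pairs))
    return sum(
        (n / n_pairs) * math.log2((n / n_pairs) / (p[i] * p[j]))
        for (i, j), n in pair_counts.items()
    )


if __name__ == "__main__":
    s = "ACG" * 10
    print(mutual_information(s, 2))  # in-frame distance of a perfectly periodic sequence
```

Because the single-nucleotide probabilities are estimated from the whole sequence while the pair probabilities come from a slightly shorter window, values for short sequences are only approximate.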

Average mutual information (continued)
Measures independent of a model of coding DNA; measures based on periodic correlation between nucleotide positions
The Average Mutual Information is defined from the in-frame mutual information, at distances k = 2, 5, 8, ..., and the out-frame mutual information, at distances k = 0, 1, 3, 4, ...

Fourier analysis
Measures independent of a model of coding DNA; measures based on periodic correlation between nucleotide positions
The partial spectrum of a DNA sequence S of length l corresponding to nucleotide b is defined in terms of the indicator U_b(S_j), where U_b(S_j) = 1 if S_j = b and 0 otherwise, and f is the discrete frequency, f = k/l, for k = 1, 2, ..., l/2.
DNA coding regions reveal the characteristic periodicity of 3 as a distinct peak at frequency f = 1/3. No such "peak" is apparent for non-coding sequences.
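A sketch of the partial spectrum, assuming the usual discrete Fourier form: the squared modulus of the sum of U_b(S_j) exp(2πi f j) over j = 1, ..., l, divided by l^2. The 1/l^2 normalisation and the helper `total_spectrum` are assumptions, since the slide's formula is not preserved.

```python
import cmath


def partial_spectrum(seq: str, base: str, k: int) -> float:
    """Partial spectrum of `seq` for nucleotide `base` at discrete frequency f = k / l."""
    l = len(seq)
    f = k / l
    # Only positions where S_j == base contribute, since U_b(S_j) is 0 elsewhere.
    total = sum(
        cmath.exp(2j * cmath.pi * f * j)
        for j, s in enumerate(seq, start=1)
        if s == base
    )
    return abs(total) ** 2 / l ** 2  # assumed 1 / l^2 normalisation


def total_spectrum(seq: str, k: int) -> float:
    """Sum of the four partial spectra (hypothetical helper)."""
    return sum(partial_spectrum(seq, b, k) for b in "ACGT")


if __name__ == "__main__":
    s = "ACG" * 10                      # perfectly 3-periodic toy sequence, l = 30
    print(total_spectrum(s, 10))        # f = 1/3: strong peak
    print(total_spectrum(s, 5))         # f = 1/6: essentially zero
```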

Summary of results

List of Gene Identification programs and Internet access (part 1)

List of Gene Identification programs and Internet access (part 2)

Thank you for your attention