# 1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.

## Presentation on theme: "1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008."— Presentation transcript:

1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008

2 Outline: 1. Markov Chain 2. DNA and Modeling 3. Markovian Models for DNA Sequences 4. Hidden Markov Models (HMM) 5. HMM for DNA Sequences 6. Future Works 7. References

3 1.Markov Chain : Alphabet: are called states, and S is the state space Notation > Sequence of random variables: A sequence of random variables is called a Markov Chain, (MC), if for all n>=1 and The conditional probability of a future event depends only upon the immediate past event

4 1.Markov Chain (cont.) Conditional Probability: Transition Matrix Property: Higher-Order Markov Chains: – Second order MC:

5 2.DNA and Modeling: Bases: {A,T,C,G} Complementary strands > sequence of bases in a single strand Sequences are always read from 5’ to 3’ end. DNA  mRNA  proteins (transcription and translation) Codons: Triples of bases which code for amino acids 61 + 3 ‘stop’ codons Specific sequence of codons  gene  Chromosomes  genome exons: coding portion of genes introns: non-coding regions Goal: To determine the nucleotide sequence of entire genomes

6 3.Markov Chains for DNA Sequences Nucleotides are chained linearly one by one  local dependence between the bases and their neighbors Markov chains offer computationally effective ways of expressing the various frequencies and local dependencies Alphabet of bases = {A,T,C,G}  not uniformly distributed in any sequence and the composition vary within and between sequences The probability of finding a particular base at one position can depend not only on the immediate adjacent bases, but also on several more distant bases upstream or downstream  higher order Markov model, (heterogeneous) Gene finding: Markov models of coding and non-coding regions to classify segments as either exons or introns. Segmentation for decomposing DNA sequences into homogeneous regions  Hidden Markov Models

7 4.Hidden Markov Models (HMM) Stochastic process generated by two interrelated probabilistic mechanisms Underlying Markov chain with a finite number of states and a set of random functions, each associated with its respective state Changing the states: according to transition matrix Only the output of the random functions can be seen Advantage: HMM allow for local characteristics of molecular sequences to be modeled and predicted within a rigorous statistical framework, and also allow the knowledge from prior investigations to be incorporated into analysis.

8 5.HMM for DNA Sequences Every nucleotide in a DNA belongs to either a “Normal” region (N), or a GC-rich region (R). No random distribution: Larger regions of (N) sequence Example of such a sequence: NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN States of HMM: {N,R} Possible DNA sequence with this underlying collection: TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA No typical random collection of nucleotides: GC in R regions: 83% vs. 23% in N regions HMM: Identify these types of feature in sequences Ability to capture both the patchiness of N and R and different compositional frequencies within the categories

9 6.Future work… Better and deeper understanding of HMM Different applications of HMM, such as, Segmentation of DNA Sequence and Gene Finding Build an automata for a simple case 7.References Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001. Birney, E.. "Hidden Markov models in biological sequence analysis". July 2001: Haussler, David. David Kulp, Martin Reese Frank Eeckman "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA". Boufounos, Petros, Sameh El-Difrawy, Dan Ehrlich. "HIDDEN MARKOV MODELS FOR DNA SEQUENCING".