Modeling Biological Sequence using Hidden Markov Models


Modeling Biological Sequence using Hidden Markov Models
6.047/6.878 - Computational Biology: Genomes, Networks, Evolution
Lecture 6, Sept 23, 2008

Challenges in Computational Biology (course overview figure): genome assembly, sequence alignment, database search, gene finding, gene expression analysis and cluster discovery, regulatory motif discovery and Gibbs sampling, evolutionary theory, comparative genomics, protein network analysis, regulatory network inference, emerging network properties.

Course Outline (Lec / Date / Topic / Homework)
1. Thu Sep 04: Intro: Biology, Algorithms, Machine Learning
   Recitation 1: Probability, Statistics, Biology (PS1 out)
2. Tue Sep 09: Global/local alignment, Dynamic Programming (PS1 due 9/15)
3. Thu Sep 11: String Search / BLAST / Database Search
   Recitation 2: Affine gap alignment / Hashing with combs
4. Tue Sep 16: Clustering Basics / Gene Expression / Sequence Clustering (PS2 out)
5. Thu Sep 18: Classification / Feature Selection / ROC / SVM (PS2 due 9/22)
   Recitation 3: Microarrays
6. Tue Sep 23: HMMs 1: Evaluation / Parsing
7. Thu Sep 25: HMMs 2: Posterior Decoding / Learning
   Recitation 4: Posterior decoding review, Baum-Welch learning (PS3 out)
8. Tue Sep 30: Generalized HMMs and Gene Prediction (PS3 due 10/06)
9. Thu Oct 02: Regulatory Motifs / Gibbs Sampling / EM
   Recitation 5: Entropy, Information, Background models
10. Tue Oct 07: Gene evolution: Phylogenetic algorithms, NJ, ML, Parsimony
11. Thu Oct 09: Molecular Evolution / Coalescence / Selection / MKS / Ka-Ks
   Recitation 6: Gene Trees, Species Trees, Reconciliation (PS4 out)
12. Tue Oct 14: Population Genomics: Fundamentals (PS4 due 10/20)
13. Thu Oct 16: Population Genomics: Association Studies
   Recitation 7: Population genomics
14. Tue Oct 21: MIDTERM (review session Mon Oct 20 at 7pm)
15. Thu Oct 23: Genome Assembly / Euler Graphs
   Recitation 8: Brainstorming for Final Projects (MIT); final project assigned
16. Tue Oct 28: Comparative Genomics 1: Biological Signal Discovery / Evolutionary Signatures (Project Phase I)
17. Thu Oct 30: Comparative Genomics 2: Phylogenomics, Gene and Genome Duplication (Phase I due 11/06)
18. Tue Nov 04: Conditional Random Fields / Gene Finding / Feature Finding
19. Thu Nov 06: Regulatory Networks / Bayesian Networks
   Tue Nov 11: Veterans Day Holiday
20. Thu Nov 13: Inferring Biological Networks / Graph Isomorphism / Network Motifs (Phase II)
21. Tue Nov 18: Metabolic Modeling 1: Dynamic systems modeling (Phase II due 11/24)
22. Thu Nov 20: Metabolic Modeling 2: Flux balance analysis & metabolic control analysis
23. Tue Nov 25: Systems Biology (Uri Alon)
   Thu Nov 27: Thanksgiving Break
24. Tue Dec 02: Module Networks (Aviv Regev)
25. Thu Dec 04: Synthetic Biology (Tom Knight) (Phase III)
26. Tue Dec 09: Final Presentations Part I, 11am (Phase III due 12/08)
27. Final Presentations Part II, 5pm

What have we learned so far? (Problem set 1)
- String searching and counting: brute-force algorithm, W-mer indexing
- Sequence alignment: dynamic programming, the duality between paths and alignments, global/local alignment, general gap penalties
- String comparison: exact string match, semi-numerical matching
- Rapid database search: exact matching (hashing, BLAST), inexact matching (neighborhood search, projections)

So, you find a new piece of DNA... what do you do?
…GTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGTT…
- Align it to things we know about
- Align it to things we don't know about
- Stare at it: non-standard nucleotide composition? interesting k-mer frequencies? recurring patterns?
- Model it: make some hypotheses about it, build a 'generative model' to describe it, find sequences of similar type

This week: Modeling biological sequences (a.k.a. what to do with a huge chunk of DNA)
Example annotation of a genomic region: intergenic, CpG island, promoter, first exon, intron, other exons, intron.
Example sequence from the figure: GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACAGG
- Ability to emit DNA sequences of a certain type: not an exact alignment to a previously known gene, but preserving the 'properties' of the type rather than the identical sequence
- Ability to recognize DNA sequences of a certain type (state): which (hidden) state is most likely to have generated the observations; find the set of states and transitions that generated a long sequence
- Ability to learn the distinguishing characteristics of each state: training our generative models on large datasets; learning to classify unlabelled data

Why Probabilistic Sequence Modeling?
- Biological data is noisy
- Probability provides a calculus for manipulating models
- Not limited to yes/no answers: can provide "degrees of belief"
- Many common computational tools are based on probabilistic models
- Our tools: Markov Chains and Hidden Markov Models (HMMs)

Definition: Markov Chain
A Markov chain is a triplet (Q, p, A), where:
- Q is a finite set of states; each state corresponds to a symbol in the alphabet Σ
- p is the vector of initial state probabilities
- A is the matrix of state transition probabilities, denoted ast for each s, t in Q: ast ≡ P(xi = t | xi-1 = s)
Output: the output of the model is the state at each instant of time, i.e. the states are observable.
Property (first-order Markov): the probability of each symbol xi depends only on the value of the preceding symbol xi-1: P(xi | xi-1, …, x1) = P(xi | xi-1)
Probability of a sequence: P(x) = P(xL, xL-1, …, x1) = P(xL | xL-1) P(xL-1 | xL-2) … P(x2 | x1) P(x1)
Lecture notes: a classic example is a Markov model of the weather, observed once a day, with states Rain/Snow, Cloudy, Sunny. From a training set one can estimate the transition matrix and then compute the probability of a pattern such as sunny-sunny-rain-rain-sunny-cloudy-sunny. The first-order assumption is of course simplistic. We now apply the same model to the CpG-island problem.
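As a concrete illustration of the sequence-probability formula above, here is a minimal sketch that scores a DNA sequence under a first-order Markov chain; the initial and transition probabilities below are illustrative placeholders, not parameters from the lecture.

```python
# Minimal sketch: scoring a DNA sequence under a first-order Markov chain.
# The probabilities below are made-up placeholders, not lecture parameters.

initial = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}            # p
transition = {                                                     # A = {a_st}
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.30, "G": 0.25, "T": 0.20},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def markov_chain_probability(x, p=initial, a=transition):
    """P(x) = P(x1) * prod_i P(x_i | x_{i-1})  (first-order Markov property)."""
    prob = p[x[0]]
    for prev, curr in zip(x, x[1:]):
        prob *= a[prev][curr]
    return prob

print(markov_chain_probability("GCACGT"))
```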

Definition: HMM (Hidden Markov Model)
An HMM is a 5-tuple (Q, V, p, A, E), where:
- Q is a finite set of states, |Q| = N
- V is a finite set of observation symbols per state, |V| = M
- p is the vector of initial state probabilities
- A is the matrix of state transition probabilities, denoted ast for each s, t in Q: ast ≡ P(πi = t | πi-1 = s)
- E is the emission probability matrix: esk ≡ P(vk at time t | qt = s)
Output: only the emitted symbols are observable, not the underlying random walk between states, hence "hidden".
Property: emissions and transitions depend only on the current state, not on the past.
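A hypothetical container for this 5-tuple, instantiated here with the two-state promoter/background model that appears on the following slides; the field names are illustrative, and the uniform initial distribution is an assumption since the slides do not give one.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    states: List[str]                        # Q
    symbols: List[str]                       # V
    initial: Dict[str, float]                # p:  P(pi_1 = s)
    transition: Dict[str, Dict[str, float]]  # A:  a_st = P(pi_i = t | pi_{i-1} = s)
    emission: Dict[str, Dict[str, float]]    # E:  e_s(v) = P(x_i = v | pi_i = s)

promoter_background = HMM(
    states=["B", "P"],
    symbols=["A", "C", "G", "T"],
    initial={"B": 0.5, "P": 0.5},            # assumption: not given on the slides
    transition={"B": {"B": 0.85, "P": 0.15},
                "P": {"B": 0.25, "P": 0.75}},
    emission={"B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
              "P": {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}},
)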

The six algorithmic settings for HMMs (one path vs. all paths)
Scoring:
1. Scoring x, one path: P(x,π), the probability of one path and its emissions
2. Scoring x, all paths: P(x) = Σπ P(x,π), the probability of the emissions summed over all paths
Decoding:
3. Viterbi decoding (one path): π* = argmaxπ P(x,π), the most likely path
4. Posterior decoding (all paths): π^ = {πi | πi = argmaxk Σπ P(πi = k | x)}, the path containing the most likely state at each position
Learning:
5. Supervised learning, given π: Λ* = argmaxΛ P(x,π|Λ)
6. Unsupervised learning: Viterbi training (best path), Λ* = argmaxΛ maxπ P(x,π|Λ); or Baum-Welch training (all paths), Λ* = argmaxΛ Σπ P(x,π|Λ)

Example 1: Finding GC-rich regions
- Promoter regions frequently have higher counts of Gs and Cs
- Model the genome as nucleotides drawn from two distributions: Background (B) and Promoter (P)
- Emission probabilities are based on the nucleotide composition of each region type
- Transition probabilities are based on relative abundance and average length of each region type
Emissions P(xi|B): A 0.25, T 0.25, G 0.25, C 0.25
Emissions P(xi|P): A 0.15, T 0.13, G 0.30, C 0.42
Transitions: B→B 0.85, B→P 0.15, P→P 0.75, P→B 0.25
Example sequence:
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

HMM as a Generative Model
A hidden state path L (B or P at each position) emits an observed sequence S, e.g.
L: B B B B B ... P P P ...
S: G C A G C ...
Transition probabilities P(Li+1 | Li): B→B 0.85, B→P 0.15, P→B 0.25, P→P 0.75
Emission probabilities: P(S|B): A 0.25, T 0.25, G 0.25, C 0.25; P(S|P): A 0.15, T 0.13, G 0.30, C 0.42
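A minimal generative run of this two-state model, using the emission and transition values from the slide; starting in the background state B is an assumption, since the slide does not specify initial probabilities.

```python
import random

transition = {"B": {"B": 0.85, "P": 0.15},
              "P": {"B": 0.25, "P": 0.75}}
emission = {"B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
            "P": {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

def draw(dist):
    """Sample one key of a {value: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def generate(n, start="B"):   # start state is an assumption
    """Alternate emit / transition, returning the hidden path and the sequence."""
    state, path, seq = start, [], []
    for _ in range(n):
        path.append(state)
        seq.append(draw(emission[state]))
        state = draw(transition[state])
    return "".join(path), "".join(seq)

path, seq = generate(60)
print(path)
print(seq)
```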

Sequence Classification
PROBLEM: given a sequence, is it a promoter region? We can calculate P(S|MP), but what counts as a sufficiently high probability?
SOLUTION: compare against a null model and compute a log-likelihood ratio, e.g. against the background DNA distribution B.
Score matrix (log2 ratio of promoter emission probabilities A 0.15, T 0.13, G 0.30, C 0.42 over the uniform background): A: -0.73, T: -0.94, G: 0.26, C: 0.74
The same idea applies to other island-versus-background problems, e.g. pathogenicity islands vs. background DNA.
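A small sketch of this log-likelihood ratio scoring, using the promoter and background emission probabilities from the previous slides; the example windows at the end are made up.

```python
from math import log2

promoter   = {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# score(b) = log2( P(b|P) / P(b|B) ); roughly A -0.74, T -0.94, G 0.26, C 0.75,
# close to the values quoted on the slide.
score_matrix = {b: log2(promoter[b] / background[b]) for b in "ACGT"}

def llr(window):
    """Positive totals favour the promoter model, negative the background."""
    return sum(score_matrix[b] for b in window)

print(llr("GCCGCGCTT"))   # GC-rich window: scores high
print(llr("ATTATAAAT"))   # AT-rich window: scores low
```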

Finding GC-rich regions
- Could use the log-likelihood ratio on windows of fixed size
- Downside: have to evaluate all islands of all lengths repeatedly
- Need: a way to easily find the transitions between region types
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

Probability of a sequence if the path is all promoter
S: G    C    A    A    A    T    G    C
L: P    P    P    P    P    P    P    P
Emissions under P: 0.30, 0.42, 0.15, 0.15, 0.15, 0.13, 0.30, 0.42; each P→P transition: 0.75
P(x,π) = a0P · eP(G) · aPP · eP(C) · aPP · eP(A) · aPP · eP(A) · aPP · eP(A) · aPP · eP(T) · aPP · eP(G) · aPP · eP(C)
       = a0P · (0.75)^7 · (0.15)^3 · (0.13)^1 · (0.30)^2 · (0.42)^2
       ≈ 9.3 × 10^-7
Why is this so small? It is a product of many probabilities, each less than 1.

Probability of the same sequence if the path is all background
S: G    C    A    A    A    T    G    C
L: B    B    B    B    B    B    B    B
Emissions under B: 0.25 each; each B→B transition: 0.85
P(x,π) = a0B · (0.85)^7 · (0.25)^8 ≈ 4.9 × 10^-6
Compare relative probabilities: the all-background path is about 5-fold more likely!
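The two hand calculations above can be reproduced with a short joint-probability sketch; treating the initial-state factor a0 as 1 is an assumption, since the slides leave it unspecified.

```python
# Reproducing the calculations for x = GCAAATGC under two hidden paths.

transition = {"B": {"B": 0.85, "P": 0.15},
              "P": {"B": 0.25, "P": 0.75}}
emission = {"B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
            "P": {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

def joint_probability(seq, path, a0=1.0):   # a0 = 1.0 is an assumption
    """P(x, pi) = a0 * e_{pi_1}(x_1) * prod_{i>1} a_{pi_{i-1}, pi_i} * e_{pi_i}(x_i)."""
    p = a0 * emission[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= transition[path[i - 1]][path[i]] * emission[path[i]][seq[i]]
    return p

x = "GCAAATGC"
print(joint_probability(x, "PPPPPPPP"))   # ~9.3e-07 (all promoter)
print(joint_probability(x, "BBBBBBBB"))   # ~4.9e-06 (all background, ~5x higher)
```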

Probability of the same sequence under a mixed path
S: G C A A A T G C
L: some mixture of P and B states, using the corresponding transition probabilities (0.85, 0.75, 0.25, 0.15) and emissions
Should we try all possibilities? What is the most likely path?

The six algorithmic settings for HMMs (one path vs. all paths)
Scoring:
1. Scoring x, one path: P(x,π), the probability of one path and its emissions
2. Scoring x, all paths: P(x) = Σπ P(x,π), the probability of the emissions summed over all paths
Decoding:
3. Viterbi decoding (one path): π* = argmaxπ P(x,π), the most likely path
4. Posterior decoding (all paths): π^ = {πi | πi = argmaxk Σπ P(πi = k | x)}, the path containing the most likely state at each position
Learning:
5. Supervised learning, given π: Λ* = argmaxΛ P(x,π|Λ)
6. Unsupervised learning: Viterbi training (best path), Λ* = argmaxΛ maxπ P(x,π|Λ); or Baum-Welch training (all paths), Λ* = argmaxΛ Σπ P(x,π|Λ)

3. DECODING: What was the sequence of hidden states?
Given: model parameters ei(.), aij
Given: sequence of emissions x
Find: the sequence of hidden states π

Finding the optimal path
- We can now evaluate any path through the hidden states, given the emitted sequence
- How do we find the best path? Optimal substructure! The best path through a given state is the best path to the previous state, plus the best transition from that state to this state, plus the best path on to the end state → Viterbi algorithm
- Define Vk(i) = probability of the most likely path for x1…xi that ends in state k
- Compute Vk(i+1) as a function of maxk' { Vk'(i) }: Vk(i+1) = ek(xi+1) · maxj ajk Vj(i) → dynamic programming

Finding the most likely path
Find the path π* that maximizes the total joint probability P(x, π):
P(x, π) = a0π1 · Πi eπi(xi) · aπiπi+1   (start, then emission and transition at each position)

Calculate maximum P(x,) recursively … … ajk k Vk(i) hidden states Vj(i-1) j ek … xi-1 xi observations Assume we know Vj for the previous time step (i-1) Calculate Vk(i) = ek(xi) * maxj ( Vj(i-1)  ajk ) max ending in state j at step i Transition from state j current max this emission all possible previous states j

The Viterbi Algorithm
Input: x = x1…xN
Initialization: V0(0) = 1; Vk(0) = 0 for all k > 0
Iteration: Vk(i) = ek(xi) · maxj ajk Vj(i-1)
Termination: P(x, π*) = maxk Vk(N)
Traceback: follow the max pointers back (similar to aligning states to the sequence)
In practice: use log scores for the computation
Running time and space: time O(K²N), space O(KN)
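A sketch of this algorithm in log space for the two-state promoter/background model from the earlier slides; the uniform initial state distribution is an assumption.

```python
import math

states = ["B", "P"]
initial = {"B": 0.5, "P": 0.5}                        # assumption: not on the slides
transition = {"B": {"B": 0.85, "P": 0.15},
              "P": {"B": 0.25, "P": 0.75}}
emission = {"B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
            "P": {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

def viterbi(x):
    """Return (log P(x, pi*), pi*) for the most likely hidden path pi*."""
    # Initialization
    V = [{k: math.log(initial[k]) + math.log(emission[k][x[0]]) for k in states}]
    ptr = [{}]
    # Iteration: O(K^2 N)
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for k in states:
            # Best previous state j for ending in state k at position i
            best_j = max(states, key=lambda j: V[i - 1][j] + math.log(transition[j][k]))
            ptr[i][k] = best_j
            V[i][k] = (math.log(emission[k][x[i]])
                       + V[i - 1][best_j] + math.log(transition[best_j][k]))
    # Termination and traceback: follow the max pointers back
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return V[-1][last], "".join(reversed(path))

log_p, best_path = viterbi("GGCGCGCCGTATATATTATA")
print(best_path, log_p)
```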

The six algorithmic settings for HMMs (one path vs. all paths)
Scoring:
1. Scoring x, one path: P(x,π), the probability of one path and its emissions
2. Scoring x, all paths: P(x) = Σπ P(x,π), the probability of the emissions summed over all paths
Decoding:
3. Viterbi decoding (one path): π* = argmaxπ P(x,π), the most likely path
4. Posterior decoding (all paths): π^ = {πi | πi = argmaxk Σπ P(πi = k | x)}, the path containing the most likely state at each position
Learning:
5. Supervised learning, given π: Λ* = argmaxΛ P(x,π|Λ)
6. Unsupervised learning: Viterbi training (best path), Λ* = argmaxΛ maxπ P(x,π|Λ); or Baum-Welch training (all paths), Λ* = argmaxΛ Σπ P(x,π|Λ)

2. EVALUATION: how well does our model capture the world?
Given: model parameters ei(.), aij
Given: sequence of emissions x
Find: P(x|M), summed over all possible paths π

Simple: given the model, generate some sequence x
Given an HMM, we can generate a sequence of length n as follows:
- Start at state π1 according to probability a0π1
- Emit letter x1 according to probability eπ1(x1)
- Go to state π2 according to probability aπ1π2
- … and so on, until emitting xn
We then have a sequence x that the model can emit, and we can calculate the likelihood of that particular run. However, in general many different paths may emit this same sequence x. How do we find the total probability of generating a given x, over any path?

Complex: given x, was it generated by the model?
Given a sequence x, what is the probability that x was generated by the model (using any path)?
P(x) = Σπ P(x,π)
Challenge: there is an exponential number of paths.

Calculate the probability of the emissions over all paths
- Each path has an associated probability; some paths are likely, others unlikely: sum them all up → return the total probability that the emissions are observed, summed over all paths
- The Viterbi path is the most likely one, but how much 'probability mass' does it contain?
- (Cheap) alternative: calculate the probability over the maximum (Viterbi) path π* only; a good approximation if the Viterbi path carries the highest density, BUT not the correct (exact) answer
- Instead, calculate the exact sum P(x) = Σπ P(x,π) iteratively, using dynamic programming

The Forward Algorithm – derivation
Define the forward probability:
fl(i) = P(x1…xi, πi = l)
      = Σπ1…πi-1 P(x1…xi-1, π1, …, πi-2, πi-1, πi = l) · el(xi)
      = Σk Σπ1…πi-2 P(x1…xi-1, π1, …, πi-2, πi-1 = k) · akl · el(xi)
      = Σk fk(i-1) · akl · el(xi)
      = el(xi) · Σk fk(i-1) · akl

Calculate the total probability Σπ P(x,π) recursively
- Assume we know fj for the previous time step (i-1), for every possible previous state j
- Then fk(i) = ek(xi) · Σj ( fj(i-1) · ajk ): the updated sum ending in state k at step i is this emission times the sum, over every possible previous state j, of (sum ending in state j at step i-1) times (transition from state j to k)

The Forward Algorithm
Input: x = x1…xN
Initialization: f0(0) = 1; fk(0) = 0 for all k > 0
Iteration: fk(i) = ek(xi) · Σj ajk fj(i-1)
Termination: P(x) = Σk fk(N)
In practice: sums of probabilities are awkward in log space → use a log-sum approximation or rescale the probabilities
Running time and space: time O(K²N), space O(KN)
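A matching sketch of the forward recursion for the same two-state model, using a small log-sum-exp helper to keep the sums numerically stable; the uniform initial distribution is again an assumption.

```python
import math

states = ["B", "P"]
initial = {"B": 0.5, "P": 0.5}                        # assumption: not on the slides
transition = {"B": {"B": 0.85, "P": 0.15},
              "P": {"B": 0.25, "P": 0.75}}
emission = {"B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
            "P": {"A": 0.15, "C": 0.42, "G": 0.30, "T": 0.13}}

def logsumexp(vals):
    """log(sum(exp(v))) computed stably."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward(x):
    """Return log P(x) = log sum over all paths pi of P(x, pi)."""
    # Initialization
    f = {k: math.log(initial[k]) + math.log(emission[k][x[0]]) for k in states}
    # Iteration: f_k(i) = e_k(x_i) * sum_j a_jk f_j(i-1), in log space
    for xi in x[1:]:
        f = {k: math.log(emission[k][xi])
                + logsumexp([f[j] + math.log(transition[j][k]) for j in states])
             for k in states}
    # Termination: sum over end states
    return logsumexp(list(f.values()))

x = "GCAAATGC"
print(math.exp(forward(x)))   # total probability of the toy x over all paths
```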

The six algorithmic settings for HMMs (one path vs. all paths)
Scoring:
1. Scoring x, one path: P(x,π), the probability of one path and its emissions
2. Scoring x, all paths: P(x) = Σπ P(x,π), the probability of the emissions summed over all paths
Decoding:
3. Viterbi decoding (one path): π* = argmaxπ P(x,π), the most likely path
4. Posterior decoding (all paths): π^ = {πi | πi = argmaxk Σπ P(πi = k | x)}, the path containing the most likely state at each position
Learning:
5. Supervised learning, given π: Λ* = argmaxΛ P(x,π|Λ)
6. Unsupervised learning: Viterbi training (best path), Λ* = argmaxΛ maxπ P(x,π|Λ); or Baum-Welch training (all paths), Λ* = argmaxΛ Σπ P(x,π|Λ)

Introducing memory
- So far, emissions and transitions depend only on the current state
- But how do you count di-nucleotide frequencies? Examples: CpG islands, codon triplets, di-codon frequencies
- Introducing memory to the system = expanding the number of states

Example 2: CpG islands: incorporating memory
Use one state per nucleotide inside a CpG island: A+, C+, G+, T+. Each state emits its own nucleotide with probability 1 (e.g. A+ emits A: 1, C: 0, G: 0, T: 0), and the dinucleotide preferences live in the transition probabilities such as aAT, aAC, aGT, aGC.
Markov Chain: Q (states), p (initial state probabilities), A (transition probabilities)
HMM: Q (states), V (observations), p (initial state probabilities), A (transition probabilities), E (emission probabilities)

Counting nucleotide transitions: Markov chain vs. HMM
The same four-state model (A+, C+, G+, T+, each emitting its own nucleotide deterministically) can be viewed either as a Markov chain over nucleotides or as an HMM whose emissions happen to be deterministic:
Markov Chain: Q (states), p (initial state probabilities), A (transition probabilities)
HMM: Q (states), V (observations), p (initial state probabilities), A (transition probabilities), E (emission probabilities)
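As a sketch of how such transition probabilities could be estimated in practice, the following counts dinucleotide transitions in training sequences to build the transition matrix of a '+' (CpG-island) model; the training sequences here are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical CpG-island training data, for illustration only.
training = ["CGCGGCGCCGCGTACGCG", "GCGCGCAACGCGGCGCGC"]

# Count dinucleotide transitions s -> t across all training sequences.
counts = defaultdict(lambda: defaultdict(int))
for seq in training:
    for prev, curr in zip(seq, seq[1:]):
        counts[prev][curr] += 1

# Normalize each row to get a_st; fall back to uniform if a state was never seen.
transition_plus = {}
for s in "ACGT":
    total = sum(counts[s].values())
    transition_plus[s] = {t: (counts[s][t] / total if total else 0.25) for t in "ACGT"}

print(transition_plus["C"]["G"])   # a_CG: high inside CpG islands, low outside
```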

What have we learned?
- Modeling sequential data: recognizing a type of sequence, whether genomic, oral, verbal, visual, etc.
- Definitions: Markov Chains, Hidden Markov Models (HMMs)
- Simple examples: recognizing GC-rich regions, recognizing CpG dinucleotides
- Our first computations:
  - Running the model: know the model → generate a sequence of a given 'type'
  - Evaluation: know the model, emissions, and states → probability?
  - Viterbi: know the model and emissions → find the optimal path
  - Forward: know the model and emissions → total probability over all paths
Next time: posterior decoding, supervised learning, unsupervised learning (Baum-Welch, Viterbi training)