Hidden Markov Models—Variants; Conditional Random Fields

Hidden Markov Models—Variants
Conditional Random Fields

[Figure: HMM trellis with K states per column, one column per emitted symbol x1, x2, x3, …]

Two learning scenarios

1. Estimation when the "right answer" is known
   Examples:
   GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
   GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

2. Estimation when the "right answer" is unknown
   Examples:
   GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
   GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)

1. When the "true" parse is known

Given x = x1…xN for which the true parse π = π1…πN is known, simply count up the number of times each transition and each emission occurs.

Define:
A_kl = # times the k → l transition occurs in π
E_k(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters θ (maximizing P(x | θ)) are:

a_kl = A_kl / Σ_i A_ki        e_k(b) = E_k(b) / Σ_c E_k(c)
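
A minimal sketch of this counting estimator in Python (not from the slides; the names A, E, a, e mirror A_kl, E_k(b), a_kl, e_k(b) above, and the dishonest-casino example at the end is purely illustrative):

```python
# Maximum-likelihood HMM parameters when the true parse is known.
from collections import defaultdict

def mle_parameters(x, pi, states, alphabet):
    """x: observed sequence, pi: true state path of the same length."""
    A = defaultdict(float)   # A[k, l] = # of k -> l transitions in pi
    E = defaultdict(float)   # E[k, b] = # of times state k emits b
    for i in range(len(x)):
        E[pi[i], x[i]] += 1
        if i + 1 < len(x):
            A[pi[i], pi[i + 1]] += 1
    # Normalize counts into probabilities (each row sums to 1; no pseudocounts,
    # so unseen states simply get zero rows in this sketch).
    a = {(k, l): A[k, l] / max(sum(A[k, m] for m in states), 1)
         for k in states for l in states}
    e = {(k, b): E[k, b] / max(sum(E[k, c] for c in alphabet), 1)
         for k in states for b in alphabet}
    return a, e

# Example: dishonest-casino setting, states Fair/Loaded emitting die rolls.
a, e = mle_parameters("1626365145", "FFFLLLLFFF",
                      states=("F", "L"), alphabet="123456")
```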

2. When the "true parse" is unknown

Baum-Welch algorithm: compute the expected number of times each transition and each emission is taken.

Initialization:
Pick a best guess for the model parameters (or arbitrary ones)

Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b), given θ_CURRENT
4. Calculate the new model parameters θ_NEW: a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ_NEW)
   (guaranteed to be higher, by Expectation-Maximization)

Until P(x | θ) does not change much
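
A compact Baum-Welch sketch for a discrete-emission HMM, following the five steps above (an illustration under simplifying assumptions, not the course's reference code: it runs a fixed number of iterations instead of monitoring the log-likelihood, does not re-estimate the initial distribution, and omits the scaling needed for long sequences):

```python
import numpy as np

def baum_welch(x, K, M, n_iter=50, seed=0):
    """x: sequence of symbol indices 0..M-1; K states, M symbols."""
    rng = np.random.default_rng(seed)
    a = rng.dirichlet(np.ones(K), size=K)        # transition probabilities
    e = rng.dirichlet(np.ones(M), size=K)        # emission probabilities
    pi0 = np.full(K, 1.0 / K)                    # fixed initial distribution
    N = len(x)
    for _ in range(n_iter):
        # 1. Forward
        f = np.zeros((N, K)); f[0] = pi0 * e[:, x[0]]
        for i in range(1, N):
            f[i] = (f[i - 1] @ a) * e[:, x[i]]
        # 2. Backward
        b = np.zeros((N, K)); b[-1] = 1.0
        for i in range(N - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        Px = f[-1].sum()
        # 3. Expected counts A_kl and E_k(b) under theta_CURRENT
        A = np.zeros((K, K)); E = np.zeros((K, M))
        for i in range(N - 1):
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / Px
        for i in range(N):
            E[:, x[i]] += f[i] * b[i] / Px
        # 4. New parameters theta_NEW (tiny pseudocounts avoid empty rows)
        A += 1e-12; E += 1e-12
        a = A / A.sum(axis=1, keepdims=True)
        e = E / E.sum(axis=1, keepdims=True)
        # 5. In a full implementation, recompute log P(x | theta_NEW) here
        #    and stop when it no longer improves.
    return a, e
```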

Variants of HMMs

Higher-order HMMs

How do we model "memory" longer than one time point?

P(π_{i+1} = l | π_i = k) = a_kl
P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl
…

A second-order HMM with K states is equivalent to a first-order HMM with K² states.

[Figure: two-state example. Second-order transitions between states H and T, a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T), become first-order transitions among the paired states HH, HT, TH, TT: a_HHT, a_HTT, a_HTH, a_THH, a_THT, a_TTH]
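
The equivalence can be made concrete by pairing states. Below is a small sketch (an assumed representation, not code from the slides) that turns second-order transition probabilities a_jkl into first-order transitions over paired states (prev, cur):

```python
from itertools import product

def second_to_first_order(states, a2):
    """a2[(j, k, l)] = P(pi_{i+1} = l | pi_i = k, pi_{i-1} = j)."""
    paired = list(product(states, states))               # K^2 paired states
    a1 = {}
    for (j, k), (k2, l) in product(paired, paired):
        # A transition (j, k) -> (k2, l) is only consistent when k2 == k;
        # all other pairs get probability 0, which is why the state space
        # grows to K^2.
        a1[(j, k), (k2, l)] = a2[(j, k, l)] if k2 == k else 0.0
    return paired, a1

# Toy example with coin states H/T; the numbers are arbitrary but each
# conditional distribution over the next state sums to 1.
states = ("H", "T")
a2 = {(j, k, l): 0.25 if l == j else 0.75
      for j, k, l in product(states, repeat=3)}
paired, a1 = second_to_first_order(states, a2)
```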

Modeling the duration of states

[Figure: two states X and Y; X has self-loop probability p, Y has self-loop probability q, and they exchange with probabilities 1 – p and 1 – q]

Length distribution of region X: P(l_X = r) = (1 – p) p^{r–1}, a geometric distribution with mean E[l_X] = 1/(1 – p).

This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.
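
A quick simulation sketch (not from the slides) that checks the claim: the time spent in a state with self-loop probability p is geometric with mean 1/(1 – p):

```python
import random

def sample_duration(p, rng):
    d = 1
    while rng.random() < p:      # stay in X with probability p at each step
        d += 1
    return d

rng = random.Random(0)
p = 0.9
durations = [sample_duration(p, rng) for _ in range(100_000)]
print(sum(durations) / len(durations), 1 / (1 - p))   # both are about 10
```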

Example: exon lengths in genes

Solution 1: Chain several states

[Figure: several copies of state X chained in a row before Y; each copy has self-loop probability p, and Y has self-loop probability q]

l_X = C + geometric with mean 1/(1 – p)

Disadvantage: still very inflexible

Solution 2: Negative binomial distribution

[Figure: n copies of state X, X(1) … X(n), each with self-loop probability p and probability 1 – p of advancing to the next copy, and finally to Y]

Duration in X: m turns, where
- during the first m – 1 turns, exactly n – 1 arrows to the next state are followed
- during the m-th turn, an arrow to the next state is followed

P(l_X = m) = C(m – 1, n – 1) (1 – p)^{n–1+1} p^{(m–1)–(n–1)} = C(m – 1, n – 1) (1 – p)^n p^{m–n}
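
A small sketch (not from the slides) that evaluates this distribution and sanity-checks it; the loop bound of 2000 is an arbitrary truncation of the infinite sum:

```python
from math import comb

def p_duration(m, n, p):
    """P(l_X = m) = C(m-1, n-1) * (1-p)**n * p**(m-n), defined for m >= n."""
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

# Sanity checks: probabilities sum to ~1 and the mean is n / (1 - p).
n, p = 3, 0.8
print(sum(p_duration(m, n, p) for m in range(n, 2000)))           # ~1.0
print(sum(m * p_duration(m, n, p) for m in range(n, 2000)))       # ~15 = n/(1-p)
```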

Example: genes in prokaryotes

EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
Durations modeled with a negative binomial with n = 3

Solution 3: Duration modeling

Upon entering a state:
1. Choose a duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities

[Figure: state F with duration distribution P_f, d < D_f, emitting a block x_i…x_{i+d–1}]

Disadvantage: increased complexity of Viterbi
Time: O(D) increase; Space: O(1) increase, where D = maximum duration of a state
(Warning: Rabiner's tutorial claims O(D²) and O(D) increases)

Viterbi with duration modeling

Recall the original iteration:
V_l(i) = max_k V_k(i – 1) a_kl e_l(x_i)

New iteration:
V_l(i) = max_k max_{d=1…D_l} V_k(i – d) P_l(d) a_kl Π_{j=i–d+1…i} e_l(x_j)

Precompute cumulative emission values so the product over each block of d letters costs O(1).

[Figure: states F and L with duration distributions P_f and P_l, transitions between them, each emitting a block of letters x_i…x_{i+d–1} / x_j…x_{j+d–1}]
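
A log-space sketch of this recurrence (my own illustration under simplifying assumptions: the first segment is scored without a transition term, and no traceback is kept):

```python
import numpy as np

def viterbi_durations(x, log_a, log_e, log_P, D):
    """x: symbol indices; log_a[k, l], log_e[l, b]: log transition/emission
    probabilities; log_P[l, d-1]: log prob that state l lasts d steps;
    D: maximum duration."""
    N, K = len(x), log_a.shape[0]
    V = np.full((N + 1, K), -np.inf)
    # Cumulative log-emission sums: a block x_{i-d+1..i} then costs O(1).
    cum = np.zeros((N + 1, K))
    for i in range(N):
        cum[i + 1] = cum[i] + log_e[:, x[i]]
    for i in range(1, N + 1):
        for l in range(K):
            for d in range(1, min(D, i) + 1):
                block = cum[i, l] - cum[i - d, l]          # sum_j log e_l(x_j)
                # Best predecessor k; the very first segment has none.
                prev = 0.0 if i == d else np.max(V[i - d] + log_a[:, l])
                V[i, l] = max(V[i, l], prev + log_P[l, d - 1] + block)
    return V[N].max()        # log-score of the best duration-annotated parse
```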

Conditional Random Fields

A brief description of a relatively new kind of graphical model

Let's look at an HMM again

Why are HMMs convenient to use? Because we can do dynamic programming with them!

The "best" state sequence for 1…i interacts with the "best" sequence for i+1…N through only K² arrows:

V_l(i+1) = e_l(x_{i+1}) max_k V_k(i) a_kl
         = max_k ( V_k(i) + [ e(l, i+1) + a(k, l) ] )    (where e(.,.) and a(.,.) are logs)

The total likelihood of all state sequences for 1…i+1 can likewise be calculated from the total likelihood for 1…i by summing over only K² arrows.
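
The log-space recursion on this slide, as a minimal sketch (assumes a uniform start distribution and omits the traceback):

```python
import numpy as np

def viterbi(x, log_a, log_e):
    """x: symbol indices; log_a[k, l] = log a_kl; log_e[k, b] = log e_k(b)."""
    N, K = len(x), log_a.shape[0]
    V = np.zeros((N, K))
    V[0] = -np.log(K) + log_e[:, x[0]]           # start uniformly in any state
    for i in range(1, N):
        # V_l(i) = e_l(x_i) + max_k [ V_k(i-1) + a_kl ]   (all in log space)
        V[i] = log_e[:, x[i]] + np.max(V[i - 1][:, None] + log_a, axis=0)
    return V[-1].max()                           # log-prob of the best parse
```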

Let's look at an HMM again

Some shortcomings of HMMs:
- Can't model state duration
  Solution: explicit duration models (semi-Markov HMMs)
- Unfortunately, state π_i cannot "look" at any letter other than x_i!
  Strong independence assumption: P(π_i | x_1…x_{i–1}, π_1…π_{i–1}) = P(π_i | π_{i–1})

Let's look at an HMM again

Another way to put this: the features used in the objective function P(x, π) are
a_kl and e_k(b), where b ∈ Σ.
At position i, all K² a_kl features and all K e_l(x_i) features play a role.

OK, forget the probabilistic interpretation for a moment:
"Given that the previous state is k and the current state is l, how much is the current score?"

V_l(i) = V_k(i – 1) + (a(k, l) + e(l, i)) = V_k(i – 1) + g(k, l, x_i)

Let's generalize g:   V_k(i – 1) + g(k, l, x, i)

"Features" that depend on many positions in x

What do we put in g(k, l, x, i)?
The "higher" g(k, l, x, i) is, the more we like going from k to l at position i.

Richer models using this additional power. Examples:
- The casino player looks at the previous 100 positions; if there are > 50 sixes, he likes to go to Fair:
  g(Loaded, Fair, x, i) += 1[x_{i–100}, …, x_{i–1} has > 50 sixes] × w_DON'T_GET_CAUGHT
- Genes are close to CpG islands; for any state k:
  g(k, exon, x, i) += 1[x_{i–1000}, …, x_{i+1000} has > 1/16 CpG] × w_CG_RICH_REGION

"Features" that depend on many positions in x

Conditional Random Fields—Features

1. Define a set of features that you think are important
   - All features should be functions of the current state, the previous state, x, and the position i
   - Example:
     Old features: transition k → l, emission of b from state k
     Plus new features: the previous 100 letters have > 50 sixes
   - Number the features 1…n: f_1(k, l, x, i), …, f_n(k, l, x, i)
     (features are indicator true/false variables)
   - Find appropriate weights w_1, …, w_n for when each feature is true
     (the weights are the parameters of the model)
2. Let's assume for now that each feature f_j has a weight w_j
   Then g(k, l, x, i) = Σ_{j=1…n} f_j(k, l, x, i) × w_j
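
A toy sketch of this setup (the particular feature functions and weight values are made up for illustration; only the casino "> 50 sixes" idea comes from the slides):

```python
def f_transition_LF(k, l, x, i):
    """Old-style HMM feature: transition Loaded -> Fair."""
    return int(k == "Loaded" and l == "Fair")

def f_many_sixes(k, l, x, i):
    """New-style feature: > 50 sixes in the previous 100 positions."""
    window = x[max(0, i - 100):i]
    return int(k == "Loaded" and l == "Fair" and window.count("6") > 50)

features = [f_transition_LF, f_many_sixes]       # f_1 ... f_n
weights = [-1.2, 3.0]                            # w_1 ... w_n (hypothetical)

def g(k, l, x, i):
    """g(k, l, x, i) = sum_j f_j(k, l, x, i) * w_j."""
    return sum(w * f(k, l, x, i) for f, w in zip(features, weights))

print(g("Loaded", "Fair", "6" * 120, 110))       # both features fire: 1.8
```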

"Features" that depend on many positions in x

Define V_k(i): the optimal score of "parsing" x_1…x_i and ending in state k.

Then, assuming V_k(i) is optimal for every k at position i, it follows that
V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]

Why? Even though at position i+1 we "look" at arbitrary positions in x, we are only "affected" by the choice of ending state k.

Therefore, the Viterbi algorithm again finds the optimal (highest-scoring) parse for x_1…x_N.
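
The same Viterbi recursion over a generalized g, sketched with g passed in as a callable and a dummy previous state for the first position (an assumption about how the start is handled, which the slides do not specify):

```python
def crf_viterbi(x, states, g, start_state=None):
    N = len(x)
    # V[k] = best score of a parse of x[0..i] ending in state k
    V = {k: g(start_state, k, x, 0) for k in states}
    back = [dict.fromkeys(states, None)]
    for i in range(1, N):
        newV, ptr = {}, {}
        for l in states:
            # V_l(i) = max_k [ V_k(i-1) + g(k, l, x, i) ]
            best_k = max(states, key=lambda k: V[k] + g(k, l, x, i))
            newV[l], ptr[l] = V[best_k] + g(best_k, l, x, i), best_k
        V, back = newV, back + [ptr]
    # Trace back the highest-scoring parse.
    last = max(states, key=V.get)
    path = [last]
    for i in range(N - 1, 0, -1):
        path.append(back[i][path[-1]])
    return V[last], path[::-1]
```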

"Features" that depend on many positions in x

The score of a parse depends on all of x at each position.
We can still do Viterbi because state π_i only "looks" at the previous state π_{i–1} and the constant sequence x.

[Figure: HMM vs. CRF. In the HMM, each state π_i emits only x_i; in the CRF, each state π_i is connected to π_{i–1} and to the whole sequence x_1…x_6…]

How many parameters are there, in general?

Arbitrarily many parameters!
For example, let f_j(k, l, x, i) depend on x_{i–5}, x_{i–4}, …, x_{i+5}.
Then we would have up to K² × |Σ|^11 parameters!

Advantage: a powerful, expressive model
- Example: "if there are more than 50 sixes in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 sixes, this is evidence we are in the Fair state"
  Interpretation: the casino player is afraid of being caught, so he switches to Fair when he sees too many sixes
- Example: "if there are any CG-rich regions in the vicinity (window of 2000 positions), then favor predicting lots of genes in this region"

Question: how do we train these parameters?

Conditional Training

Hidden Markov Model training:
- Given a training sequence x and the "true" parse π
- Maximize P(x, π)

Disadvantage: P(x, π) = P(π | x) P(x)
- P(π | x): the quantity we care about, so as to get a good parse
- P(x): a quantity we don't care so much about, because x is always given

Conditional Training

P(x, π) = P(π | x) P(x)
P(π | x) = P(x, π) / P(x)

Recall F(j, x, π) = # times feature f_j occurs in (x, π)
                  = Σ_{i=1…N} f_j(π_{i–1}, π_i, x, i)    (count f_j in (x, π))

In HMMs, let's denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))

Then:
HMM:  P(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]
CRF:  Score(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]

Conditional Training

In HMMs:
P(π | x) = P(x, π) / P(x)
P(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]
P(x) = Σ_π exp[ Σ_{j=1…n} w_j × F(j, x, π) ] =: Z

Then, in a CRF we can do the same to normalize Score(x, π) into a probability:
P_CRF(π | x) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ] / Z

QUESTION: Why is this a probability?
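
A brute-force sketch of this normalization for a tiny example (enumerate every path; the feature functions and weights are illustrative assumptions). In practice Z comes from a forward-style sum-of-paths algorithm, as the next slide says:

```python
from itertools import product
from math import exp

states = ("F", "L")

def features(prev, cur, x, i):
    """f_1: stay in the same state; f_2: state L while emitting a six."""
    return [int(prev == cur), int(cur == "L" and x[i] == "6")]

weights = [0.5, 1.0]                           # w_1, w_2 (hypothetical)

def score(x, pi):
    """Score(x, pi) = exp( sum_j w_j * F(j, x, pi) )."""
    s = 0.0
    for i in range(len(x)):
        prev = pi[i - 1] if i > 0 else None
        s += sum(w * f for w, f in zip(weights, features(prev, pi[i], x, i)))
    return exp(s)

x = "162666"
Z = sum(score(x, pi) for pi in product(states, repeat=len(x)))
best = max(product(states, repeat=len(x)), key=lambda pi: score(x, pi))
print(Z, score(x, best) / Z)     # P_CRF(best parse | x); all such P's sum to 1
```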

Conditional Training

1. We need to be given a set of sequences x and "true" parses π
2. Calculate Z by a sum-of-paths algorithm similar to the HMM forward algorithm
   (we can then easily calculate P(π | x))
3. Calculate the partial derivative of log P(π | x) with respect to each parameter w_j
   (not covered here; akin to forward/backward)
   Update each parameter with gradient ascent (equivalently, gradient descent on –log P(π | x)); see the sketch below
4. Continue until convergence to the optimal set of weights

–log P(π | x) = log Z – Σ_{j=1…n} w_j × F(j, x, π)   is convex in the weights, so this converges to the global optimum.
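
One gradient step for conditional training, as a brute-force sketch on a tiny example (the toy features match the earlier sketch and are my own; real implementations get the expected feature counts from a forward/backward-style algorithm rather than by enumerating paths):

```python
# Gradient of log P(pi | x) w.r.t. w_j:
#   F(j, x, pi_true) - E_{P(pi|x)}[ F(j, x, pi) ]
from itertools import product
from math import exp

states = ("F", "L")

def F(x, pi, j):
    """Total count of feature j along the parse: f_1 = same state as before,
    f_2 = state L emitting a six."""
    total = 0
    for i in range(len(x)):
        prev = pi[i - 1] if i > 0 else None
        f = [int(prev == pi[i]), int(pi[i] == "L" and x[i] == "6")]
        total += f[j]
    return total

def grad_step(x, pi_true, w, lr=0.1):
    paths = list(product(states, repeat=len(x)))
    scores = [exp(sum(w[j] * F(x, pi, j) for j in range(len(w)))) for pi in paths]
    Z = sum(scores)
    for j in range(len(w)):
        expected = sum(s * F(x, pi, j) for s, pi in zip(scores, paths)) / Z
        w[j] += lr * (F(x, pi_true, j) - expected)   # one gradient-ascent step
    return w

w = [0.0, 0.0]
print(grad_step("162666", ("F", "F", "F", "L", "L", "L"), w))
```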

Conditional Random Fields—Summary

1. Ability to incorporate complicated, non-local feature sets
   - Does away with some of the independence assumptions of HMMs
   - Parsing is still equally efficient
2. Conditional training
   - Trains the parameters that are best for parsing, not for modeling
   - Needs labeled examples: sequences x and "true" parses π
     (One can train on unlabeled sequences, but it is unreasonable to train too many parameters this way)
   - Training is significantly slower: many iterations of forward/backward