1 Hidden Markov Models Achim Tresch MPI for Plant Breeding Research & University of Cologne

2 Recap: Probabilistic Clustering. Observations X_1, ..., X_N; hidden variables ("states") H_1, ..., H_N, each indicating the cluster that generated the corresponding observation. Parameters: cluster frequencies and emission distributions.

3 The Hidden Markov Model. As in probabilistic clustering, the hidden variables ("states") H_1, ..., H_N indicate which cluster generated each observation X_1, ..., X_N, but the hidden states now become dependent: they form a Markov chain. Parameters: the emission distributions are kept, while the cluster frequencies are replaced (see below).

4 Markov Chain. Goal: factorize the joint probability of H_1, ..., H_N into a product of "smaller" terms that depend merely on a few variables. The chain-rule factorization P(H_1, ..., H_N) = ∏_j P(H_j | H_1, ..., H_{j-1}) always holds, but it is not useful, since the last term (j = N) still contains all variables. Markov assumption ("memoryless process"): P(H_j | H_1, ..., H_{j-1}) = P(H_j | H_{j-1}).

5 Markov Chain. Under the Markov assumption, the joint distribution factorizes into P(H_1, ..., H_N) = P(H_1) ∏_{j=2}^{N} P(H_j | H_{j-1}). Homogeneity assumption: the transition probabilities do not depend on the position j. The joint distribution of a homogeneous Markov chain is therefore determined by two parameters: the initial state distribution π = (π_r) and the transition probabilities A = (a_rs).
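A minimal sketch (not from the slides) of how this factorized joint probability can be evaluated; the variable names pi and A and the state coding 0, ..., K-1 are assumptions for illustration.

import numpy as np

def markov_chain_log_prob(states, pi, A):
    """Log-probability of a state path H_1, ..., H_N under a homogeneous
    Markov chain with initial distribution pi and transition matrix A.
    States are assumed to be coded as integers 0, ..., K-1."""
    logp = np.log(pi[states[0]])                     # P(H_1)
    for j in range(1, len(states)):                  # product over transitions
        logp += np.log(A[states[j - 1], states[j]])  # P(H_j | H_{j-1}) = a_rs
    return logp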

6 The Hidden Markov Model. The hidden states H_1, ..., H_N form a Markov chain, and each observation X_j is generated from the emission distribution of its hidden state H_j. Parameters of the hidden Markov model: initial state distribution π, transition probabilities A = (a_rs), and emission distributions. Joint distribution: P(H, X) = P(H_1) ∏_{j=2}^{N} P(H_j | H_{j-1}) ∏_{j=1}^{N} P(X_j | H_j).
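To make the generative picture concrete, a small sketch that samples states and observations from an HMM; the choice of Gaussian emissions and the names pi, A, means, sds are illustrative assumptions, not the slides' model.

import numpy as np

def sample_hmm(N, pi, A, means, sds, rng=np.random.default_rng(0)):
    """Sample a hidden state path H and observations X from an HMM
    with Gaussian emissions N(means[state], sds[state]^2)."""
    H = np.empty(N, dtype=int)
    X = np.empty(N)
    H[0] = rng.choice(len(pi), p=pi)               # H_1 ~ initial distribution
    for j in range(1, N):
        H[j] = rng.choice(len(pi), p=A[H[j - 1]])  # H_j ~ A[H_{j-1}, :]
    for j in range(N):
        X[j] = rng.normal(means[H[j]], sds[H[j]])  # X_j ~ emission of H_j
    return H, X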

7 HMM Parameter Estimation. We introduce the Baum-Welch algorithm, an Expectation-Maximization (EM) algorithm for the HMM: we iteratively maximize the lower-bound function Q(Θ; Θ_old), where Θ denotes the parameters, H the hidden state variables, and X the observations. We focus on learning the transition probabilities A = (a_rs). Learning the initial distribution is easier, and learning the emission distributions leads to exactly the same formulas as for mixture clustering.

8 HMM Parameter Estimation. Omitting A-independent terms leads to Q_A(Θ; Θ_old) = Σ_{j=2}^{N} Σ_{r,s} ζ_j(r, s) log a_rs, where ζ_j(r, s) = P(H_{j-1} = r, H_j = s | X, Θ_old) and each H_j is summed over all states 1, ..., K. These coefficients ζ_j(r, s) need to be calculated efficiently (see the forward-backward algorithm later).

13 HMM Parameter Estimation. We have to maximize Q(Θ; Θ_old) with respect to A = (a_rs) under the constraints Σ_s a_rs = 1 for every state r. How do we maximize a function under additional constraints? Reformulate the side conditions as zeros of functions: g_r(A) = Σ_s a_rs − 1 = 0. The task is now: maximize Q(Θ; Θ_old) under the constraints g_r(A) = 0.

14 Side Note: Method of Lagrange Multipliers. Let x* be a maximum of f(x) under the constraint g(x) = 0. The gradient of g is perpendicular to the hypersurface g(x) = 0 (red curve g(x_1, x_2) = 0 in the figure): moving along the hypersurface does not change g, so the gradient of g has no component parallel to it. The gradient of f at x* must also be perpendicular to the hypersurface: otherwise f(x) could be further maximized by moving along the gradient's component parallel to the hypersurface without violating the constraint. Therefore the gradients of f and g must be parallel to each other, or in other words, grad f(x*) = λ grad g(x*) for some λ.
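A short worked instance of the method, added as an illustration (not on the original slide), in exactly the form needed on the next slide: maximizing a weighted sum of logarithms under a normalization constraint. The symbols c_s and p_s are generic placeholders.

\text{Maximize } f(p) = \sum_{s} c_s \log p_s \quad \text{subject to} \quad g(p) = \sum_{s} p_s - 1 = 0.
\nabla f = \lambda \nabla g \;\Longrightarrow\; \frac{c_s}{p_s} = \lambda \;\Longrightarrow\; p_s = \frac{c_s}{\lambda}.
\sum_{s} p_s = 1 \;\Longrightarrow\; \lambda = \sum_{s'} c_{s'} \;\Longrightarrow\; p_s = \frac{c_s}{\sum_{s'} c_{s'}}.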

15 HMM Parameter Estimation. We have to maximize Q_A(Θ; Θ_old) = Σ_j Σ_{r,s} ζ_j(r, s) log a_rs under the constraints Σ_s a_rs = 1. Introducing Lagrange multipliers λ_r and taking partial derivatives with respect to a_rs leads to Σ_j ζ_j(r, s) / a_rs + λ_r = 0. Setting the derivatives to zero and solving for λ_r gives λ_r = −Σ_{s'} Σ_j ζ_j(r, s'). This leads to the update a_rs = Σ_j ζ_j(r, s) / Σ_{s'} Σ_j ζ_j(r, s').
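A sketch of this M-step update in code, assuming the pairwise posteriors have already been collected in an array zeta of shape (N-1, K, K), one K x K slice per transition; the array and function names are illustrative.

import numpy as np

def update_transition_matrix(zeta):
    """M-step: a_rs = sum_j zeta_j(r, s) / sum_{s'} sum_j zeta_j(r, s')."""
    counts = zeta.sum(axis=0)                         # expected transition counts, shape (K, K)
    return counts / counts.sum(axis=1, keepdims=True) # normalize each row r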

18 Forward-Backward Algorithm. Still to do: calculate the marginal posterior probabilities ζ_j(r, s) = P(H_{j-1} = r, H_j = s | X, Θ_old). We first calculate the univariate marginal posterior γ_j(r) = P(H_j = r | X, Θ_old). Define the forward probabilities α_j(r) = P(X_1, ..., X_j, H_j = r) and the backward probabilities β_j(r) = P(X_{j+1}, ..., X_N | H_j = r). Thus, γ_j(r) = α_j(r) β_j(r) / P(X) (small proof required, exercise).

19 Forward-Backward Algorithm. The forward and backward probabilities can be calculated recursively (forward-backward algorithm): α_1(r) = π_r P(X_1 | H_1 = r) and α_j(s) = P(X_j | H_j = s) Σ_r α_{j-1}(r) a_rs; β_N(r) = 1 and β_j(r) = Σ_s a_rs P(X_{j+1} | H_{j+1} = s) β_{j+1}(s).
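A minimal sketch of these recursions, assuming the emission probabilities have been precomputed in a matrix E with E[j, r] = P(X_j | H_j = r); for brevity it omits the scaling that a numerically robust implementation would use.

import numpy as np

def forward_backward(pi, A, E):
    """Unscaled forward/backward recursions.
    pi: (K,) initial distribution, A: (K, K) transitions,
    E: (N, K) with E[j, r] = P(X_j | H_j = r).
    Returns alpha, beta, gamma and the likelihood P(X)."""
    N, K = E.shape
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    alpha[0] = pi * E[0]                         # alpha_1(r) = pi_r P(X_1 | H_1 = r)
    for j in range(1, N):
        alpha[j] = E[j] * (alpha[j - 1] @ A)     # alpha_j(s) = P(X_j | s) sum_r alpha_{j-1}(r) a_rs
    beta[N - 1] = 1.0                            # beta_N(r) = 1
    for j in range(N - 2, -1, -1):
        beta[j] = A @ (E[j + 1] * beta[j + 1])   # beta_j(r) = sum_s a_rs P(X_{j+1} | s) beta_{j+1}(s)
    px = alpha[N - 1].sum()                      # P(X) = sum_r alpha_N(r)
    gamma = alpha * beta / px                    # gamma_j(r) = alpha_j(r) beta_j(r) / P(X)
    return alpha, beta, gamma, px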

20 Forward-Backward Algorithm. From α and β, we derive ζ: ζ_j(r, s) = α_{j-1}(r) a_rs P(X_j | H_j = s) β_j(s) / P(X).
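Continuing the same sketch, the pairwise posteriors ζ can be assembled from the quantities above; E again holds the precomputed emission probabilities, and all names are assumptions.

import numpy as np

def pairwise_posteriors(alpha, beta, A, E, px):
    """zeta[j-1, r, s] = alpha_{j-1}(r) a_rs P(X_j | H_j = s) beta_j(s) / P(X)."""
    N, K = E.shape
    zeta = np.zeros((N - 1, K, K))
    for j in range(1, N):
        zeta[j - 1] = alpha[j - 1][:, None] * A * (E[j] * beta[j])[None, :] / px
    return zeta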

25 Baum-Welch Algorithm
1. Start with some initial parameter guess Θ_old.
2. Calculate α, β and γ, ζ (forward-backward algorithm).
3. Update the parameters Θ, in particular a_rs = Σ_j ζ_j(r, s) / Σ_{s'} Σ_j ζ_j(r, s'); the emission parameter updates depend on the chosen emission distribution (see mixture clustering).
4. Set Θ_old = Θ and iterate steps 2 and 3 until convergence.
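Putting the pieces together, a sketch of the Baum-Welch loop for one-dimensional Gaussian emissions; it reuses the hypothetical helpers sketched above (forward_backward, pairwise_posteriors) and the usual Gaussian M-step from mixture clustering, again without scaling.

import numpy as np
from scipy.stats import norm

def baum_welch(X, pi, A, means, sds, n_iter=50):
    """EM for an HMM with 1-D Gaussian emissions (unscaled, for clarity)."""
    for _ in range(n_iter):
        # E-step: emission matrix, forward-backward, posteriors
        E = norm.pdf(X[:, None], loc=means[None, :], scale=sds[None, :])
        alpha, beta, gamma, px = forward_backward(pi, A, E)
        zeta = pairwise_posteriors(alpha, beta, A, E, px)
        # M-step: initial distribution, transitions, emission parameters
        pi = gamma[0] / gamma[0].sum()
        A = zeta.sum(axis=0)
        A /= A.sum(axis=1, keepdims=True)
        weights = gamma.sum(axis=0)                 # expected state occupancies
        means = (gamma * X[:, None]).sum(axis=0) / weights
        sds = np.sqrt((gamma * (X[:, None] - means[None, :]) ** 2).sum(axis=0) / weights)
    return pi, A, means, sds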

26 HMM Example (figure): observation tracks, e.g. genome-wide ChIP-chip occupancy along the genomic position.

28 HMM Example (figure): five hidden states (state 1 to state 5) with their typical occupancy vector(s), the 5 x 5 transition matrix, and the Viterbi path along the genome.

29 Viterbi Decoding. State annotation = Viterbi path (maximum likelihood path): H* = argmax_H Pr(X, H; Θ), where Pr(X, H; Θ) is the likelihood function and Θ are the HMM parameters.

30 Viterbi Decoding Viterbi decoding searches for the most probable hidden state path. It maximizes the joint posterior of H.

31 Viterbi Decoding. Let v_j(r) be the probability of the most probable path ending in state H_j = r with observations X_1, ..., X_j. (Figure: dynamic programming grid of positions 1, ..., j-1, j, ..., N versus states 1, ..., K.)

32 Viterbi Decoding. We find v_j(r) iteratively by dynamic programming, starting with j = 1 and ending with j = N. Suppose we have found v_{j-1}(r) for all r. Then v_j(s) = P(X_j | H_j = s) · max_r v_{j-1}(r) a_rs.

33 Viterbi Decoding. If we keep track of the previous state of the maximum-probability path, we can reconstruct the maximum likelihood path. Along with v_j(r), define the backtrack variable B_j(s) = argmax_r v_{j-1}(r) a_rs. Then we find the path by backtracking: H*_N = argmax_r v_N(r) and H*_{j-1} = B_j(H*_j).
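A sketch of this dynamic program in log space with a backtrack array B as on the slide; E[j, r] = P(X_j | H_j = r) is again an assumed precomputed emission matrix.

import numpy as np

def viterbi(pi, A, E):
    """Most probable state path argmax_H P(X, H) via dynamic programming."""
    N, K = E.shape
    logv = np.zeros((N, K))                 # logv[j, r] = log v_j(r)
    B = np.zeros((N, K), dtype=int)         # backtrack: best previous state
    logv[0] = np.log(pi) + np.log(E[0])
    for j in range(1, N):
        cand = logv[j - 1][:, None] + np.log(A)   # cand[r, s] = log v_{j-1}(r) + log a_rs
        B[j] = cand.argmax(axis=0)
        logv[j] = np.log(E[j]) + cand.max(axis=0)
    path = np.empty(N, dtype=int)
    path[N - 1] = logv[N - 1].argmax()            # H*_N = argmax_r v_N(r)
    for j in range(N - 1, 0, -1):
        path[j - 1] = B[j, path[j]]               # H*_{j-1} = B_j(H*_j)
    return path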

34 Posterior Decoding. Posterior decoding searches for the most probable hidden state at each individual position: it maximizes the marginal posterior for each H_j. The marginal posteriors were already calculated in the forward-backward algorithm (applied with the known parameter set Θ). Hence we have Ĥ_j = argmax_r γ_j(r) = argmax_r P(H_j = r | X, Θ).
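Posterior decoding then reduces to a per-position argmax over the marginals γ returned by the forward-backward sketch above; the names are assumptions.

import numpy as np

def posterior_decode(gamma):
    """For each position j, pick argmax_r P(H_j = r | X)."""
    return gamma.argmax(axis=1)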

35 Efficient Marginalization: Factor Graphs. A factor graph consists of (1) an undirected, bipartite graph with variable nodes representing random variables and factor nodes representing local functions, and (2) a set of local functions, one for each factor node; each local function (e.g. f_AB) is a function of its neighbouring variables. The neighbours ne(X) of a node X are the nodes that are directly adjacent to X. By definition, a factor graph encodes the function f(X_1, ..., X_n) = ∏_k f_k(ne(f_k)), where X_1, ..., X_n is the set of all variable nodes. (Figure: variable nodes A, B, C, D with factor nodes f_AB, f_BC, f_BD.)
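A small sketch (not from the slides) of how this definition translates into a data structure: discrete variables, factors with neighbour lists and tables, and the encoded function as the product of all local functions. The class and example tables are made up for illustration.

import numpy as np

class FactorGraph:
    """Discrete factor graph: variable nodes with a domain size,
    factor nodes with a list of neighbouring variables and a table."""
    def __init__(self):
        self.cardinality = {}      # variable name -> number of values
        self.factors = []          # list of (neighbour variables, table)

    def add_variable(self, name, k):
        self.cardinality[name] = k

    def add_factor(self, neighbours, table):
        self.factors.append((neighbours, np.asarray(table)))

    def evaluate(self, assignment):
        """Encoded function: product of all local functions at the given assignment."""
        value = 1.0
        for neighbours, table in self.factors:
            value *= table[tuple(assignment[v] for v in neighbours)]
        return value

# Example mirroring the slide's graph with factors f_AB, f_BC, f_BD (tables invented)
g = FactorGraph()
for v in "ABCD":
    g.add_variable(v, 2)
g.add_factor(["A", "B"], [[0.9, 0.1], [0.2, 0.8]])
g.add_factor(["B", "C"], [[0.5, 0.5], [0.3, 0.7]])
g.add_factor(["B", "D"], [[0.6, 0.4], [0.1, 0.9]])
print(g.evaluate({"A": 0, "B": 1, "C": 0, "D": 1}))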

36 Factor Graphs. Factor graphs are very general tools; e.g. Bayesian networks can be written as factor graphs. (Figure: a Bayesian network over A, B, C and possible representations as a factor graph, e.g. with a single factor f_ABC or with a finer factorization using factors such as f_A and f_AB.)

37 The Sum-Product Algorithm. "Marginalization" in factor graphs that are trees. Let X = {X_1, ..., X_n}. We want to calculate the marginal of X_j, i.e. the sum of the encoded function f(X_1, ..., X_n) over all variables except X_j. (Figure: variable nodes A, C, X_j, D with factors f_AC, f_CX_j, f_X_jD, illustrating ne(X_j), ne(f_CX_j), and the message from a factor f_k to X_j.)

38 The Sum-Product Algorithm. Objective: calculate the "marginals" (in our case, the marginal distributions of the hidden variables).

39 The Sum-Product algorithm

41 Example (figure): Step 1 (initialization), Step 2.

42 The Sum-Product algorithm

43 The Sum-Product Algorithm. Theorem: If the factor graph is a tree, the order in which messages are passed is irrelevant for the result. For graphs containing loops, the result depends on the message-passing scheme. Messages are usually passed until convergence, but convergence is not guaranteed. Therefore it is desirable to construct factor graphs that are trees.
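To connect this back to the HMM, a sketch of sum-product message passing specialized to a chain-structured factor graph (which is a tree); the unary and pairwise factor tables play the roles of the emission and transition factors on the following slides, and the messages reproduce the forward and backward probabilities up to normalization. All names are illustrative.

import numpy as np

def chain_sum_product(unary, pairwise):
    """Sum-product on a chain X_1 - X_2 - ... - X_N.
    unary[j]   : (K,) factor attached to X_j (e.g. emission, with P(X_1) folded into unary[0])
    pairwise[j]: (K, K) factor between X_j and X_{j+1} (e.g. transition)
    Returns the normalized marginals of all X_j."""
    N = len(unary)
    fwd = [None] * N                       # message arriving at X_j from the left
    bwd = [None] * N                       # message arriving at X_j from the right
    fwd[0] = np.ones_like(unary[0])
    for j in range(1, N):                  # pass messages left -> right
        fwd[j] = (fwd[j - 1] * unary[j - 1]) @ pairwise[j - 1]
    bwd[N - 1] = np.ones_like(unary[N - 1])
    for j in range(N - 2, -1, -1):         # pass messages right -> left
        bwd[j] = pairwise[j] @ (unary[j + 1] * bwd[j + 1])
    marginals = [fwd[j] * unary[j] * bwd[j] for j in range(N)]
    return [m / m.sum() for m in marginals]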

44 HMM Generalizations

45 Example: Phenotyping in Time-Lapse Imaging. Mitocheck database: >20,000 time-lapse movies of RNAi knock-downs (histone-GFP tagged HeLa cells; Neumann, Ellenberg et al., Nature 2010). Pipeline: generation of time-lapse movies, cell identification and tracking, feature extraction (CellProfiler; Carpenter et al., Genome Biol 2006).

46 Example: Phenotyping in Time-Lapse Imaging (figure): raw data and phenotype classes over time.

47 The Tree Hidden Factor Graph Model (figure): each cell is described by an observed feature vector (f_1, f_2, ..., f_n); the hidden state that generated it is unknown (?).

48 The Tree Hidden Factor Graph Model. Comparison of models (structure of hidden variables / model parameters beyond the emission distributions):
(Mixture) Clustering: empty graph / emission distributions only.
Hidden Markov Model: line graph / transitions.
treeHFM: tree or forest / higher-order transitions.
(CellCognition; Gerlich et al., Nat. Meth. 2012)

49 HMMs in Factor Graph Notation. X: hidden (cell) states X_1, X_2, X_3; D: observations (phenotypes) D_1, D_2, D_3. Pairwise factors Γ_X1X2, Γ_X2X3 connect consecutive hidden states, and emission factors Ψ_X1(D_1), Ψ_X2(D_2), Ψ_X3(D_3) connect each hidden state to its observation. Marginal likelihood: the sum over all hidden state configurations of the product of all factors.

50 HMMs in Factor Graph Notation. Factor graph representation of the HMM (Kschischang, Frey, Loeliger, IEEE Trans. Information Theory 2001): variable nodes X_1, X_2, X_3 and D_1, D_2, D_3, factor nodes G_1, G_2, G_3 and F_1, F_2, and the initial-distribution factor P(X_1).

51 Genealogies in Factor Graph Notation (figure): hidden cell states X_1, ..., X_4 arranged along a cell genealogy, each with an observation D_1, ..., D_4.

53 Genealogies in Factor Graph Notation (figure): variable nodes X_1, ..., X_4 and D_1, ..., D_4 with factor nodes G_1, ..., G_4 and F_1, F_2. Factor nodes F_j model transition probabilities. We need an extended transition parameter set.

54 Expectation-Maximization in the HFM. The EM algorithm iteratively maximizes the lower-bound function Q(Θ; Θ_old). For HMMs, this can be expressed in terms of forward and backward probabilities; for HFMs, it can be expressed in terms of messages passed on the factor graph. In the spirit of HMMs, we are then able to derive an analytic (and fast) update formula.

55 HFM parameters (figure): transition probabilities, split into sequential transition probabilities and division probabilities.

56 Summary Statistics (figure panels A-D): Viterbi trees, transition graph, morphology (PCA plot of classes), and cell cycle time distribution (frequency vs. relative cell cycle time).
