Published by Elyssa Bradham. Modified over 2 years ago.

1
Hidden Markov Model and some applications in handwriting recognition

2
Sequential Data
Often arise through measurement of time series:
- Rainfall measurements in Beer-Sheva.
- Daily values of a currency exchange rate.

3
First Order Markov Model
We have a stochastic process in time. The system has N states, $S_1, S_2, \ldots, S_N$, and the state of the system at time step t is $q_t$. For simplicity of calculation, we assume that the state of the system at time t+1 depends only on the state of the system at time t.

4
First Order Markov Model
Formal definition of the Markov property:
$P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N.$
That is, the state at the next time step of a Markov chain depends only on the state at the current time step. This is called the Markov property, or memoryless property.

5
First Order Markov Model
Formal definition (cont.): The transitions in the Markov chain are independent of time, so we can write:
$P[q_t = S_j \mid q_{t-1} = S_i] = a_{ij}, \quad 1 \le i, j \le N,$
with the following conditions:
1. $a_{ij} \ge 0$
2. $\sum_{j=1}^{N} a_{ij} = 1$

6
Example (Weather):
- Rain today: 40% rain tomorrow, 60% no rain tomorrow.
- Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
Stochastic finite state machine:
[Figure: two states, Rain and No rain, connected by the transitions listed above.]

7
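First Order Markov Model
Example (Weather continued):
- Rain today: 40% rain tomorrow, 60% no rain tomorrow.
- Not raining today: 20% rain tomorrow, 80% no rain tomorrow.
The transition matrix, with $S_1$ = rain and $S_2$ = no rain:
$A = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$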

8
First Order Markov Model
Example (Weather continued):
Question: Given that day 1 is sunny, what is the probability that the weather on day 1 and the next 3 days will be sun-rain-rain-sun?
Answer: We write the sequence of states as $O = \{S_2, S_1, S_1, S_2\}$ and compute:
$P(O \mid \text{Model}) = P[S_2] \cdot P[S_1 \mid S_2] \cdot P[S_1 \mid S_1] \cdot P[S_2 \mid S_1] = \pi_2 \cdot a_{21} \cdot a_{11} \cdot a_{12} = 1 \cdot 0.2 \cdot 0.4 \cdot 0.6 = 0.048,$
where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$, are the initial state probabilities.
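The same calculation can be checked with a few lines of code. This is a minimal sketch, assuming index 0 stands for rain ($S_1$) and index 1 for no rain ($S_2$); the function name `sequence_probability` is illustrative and not part of the lecture.

```python
# Weather chain from the example: index 0 = rain (S_1), index 1 = no rain (S_2).
A = [[0.4, 0.6],
     [0.2, 0.8]]

def sequence_probability(states, initial, transition):
    """P(q_1, ..., q_T) = pi[q_1] * product over t of a[q_{t-1}][q_t]."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob

# Day 1 is known to be sunny, so all initial mass sits on "no rain".
pi = [0.0, 1.0]
p = sequence_probability([1, 0, 0, 1], pi, A)  # sun, rain, rain, sun
print(p)
```

Only the transition matrix and the initial distribution are needed, which is what the first-order assumption buys us.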

9
First Order Markov Model
Example (Random Walk on Undirected Graphs): We have an undirected graph G = (V, E), and a particle is placed at vertex $v_i$ with probability $\pi_i$. At the next time point, it moves to one of its neighbors, each chosen with probability 1/d(i), where d(i) is the degree of $v_i$.
[Figure: undirected graph on the vertices $v_1, \ldots, v_7$.]

11
First Order Markov Model
Example (Random Walk on Undirected Graphs): It can be proven that for connected, non-bipartite graphs, $p_i$, the probability of being at vertex $v_i$, converges to $d(v_i) / (2|E|)$. That is, the initial probability does not matter.
[Figure: the graph on $v_1, \ldots, v_7$, annotated with $p_1 = 1/18$, $p_3 = 3/18$, $p_4 = 3/18$, $p_7 = 2/18$.]
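The convergence claim can be observed numerically. The sketch below does not use the graph from the slide (its edges are not recoverable from the text); it assumes a small hypothetical graph, a triangle 0-1-2 with a tail 2-3, which is connected and not bipartite. The distribution is pushed forward 200 steps and compared with $d(v) / (2|E|)$.

```python
# Hypothetical graph (not the one on the slide): triangle 0-1-2 plus a tail 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4
neighbors = [[] for _ in range(n)]
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

def step(p):
    """One step of the walk: the mass at u is split evenly among u's neighbors."""
    nxt = [0.0] * n
    for u in range(n):
        for v in neighbors[u]:
            nxt[v] += p[u] / len(neighbors[u])
    return nxt

p = [0.0, 0.0, 0.0, 1.0]  # start at vertex 3 with probability 1
for _ in range(200):
    p = step(p)

# Limit predicted by the theorem: degree / (2 * number of edges).
expected = [len(neighbors[v]) / (2 * len(edges)) for v in range(n)]
print(p, expected)
```

Starting from a different initial distribution gives the same limit, which is the sense in which the initial probability does not matter.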

12
First Order Markov Model
Example (Random Walk, some applications):
- In economics, the "random walk hypothesis" is used to model share prices and other factors.
- In physics, random walks are used as simplified models of the physical random movement of molecules in liquids and gases.
- In computer science, random walks are used to estimate the size of the Web (Bar-Yossef et al., 2006).
- In image segmentation, random walks are used to determine the label (i.e., "object" or "background") to associate with each pixel. This algorithm is typically referred to as the random walker segmentation algorithm.
[Figure: random walk in two dimensions.]

13
Introducing Hidden Variables
For each observation $O_n$, introduce a hidden variable $S_n$. The hidden variables form the Markov chain.
[Figure: a chain of hidden states $S_1, S_2, \ldots, S_i, \ldots, S_{L-1}, S_N$, each emitting one item of the observed data $O_1, O_2, \ldots, O_i, \ldots, O_{L-1}, O_N$.]

14
Hidden Markov Model
Example: Consider Bob, who lives in a foreign country. Bob posts his daily activity on his blog. Each day's activity is one of the following:
- Walking in the park (with probability 0.1 if it rains, and probability 0.6 otherwise).
- Shopping (with probability 0.4 if it rains, and probability 0.3 otherwise).
- Cleaning his apartment (with probability 0.5 if it rains, and probability 0.1 otherwise).
The choice of what Bob does depends exclusively on the weather on the given day. Bob's activities are the observations, while the weather is hidden from us. The entire system is a hidden Markov model (HMM).

15
Hidden Markov Model
Example (cont):
[Figure: state diagram with a Start node, the hidden states Sunny and Rainy, and the observations Clean, Shop and Walk.]

16
Elements of HMM
- N states, $S = \{S_1, S_2, \ldots, S_N\}$; we denote the state at time t as $q_t$.
- M distinct observation symbols per state, $V = \{v_1, v_2, \ldots, v_M\}$.
- The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$.
- The observation symbol probability distribution in state j, $B = \{b_j(k)\}$, where $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$, $1 \le j \le N$, $1 \le k \le M$.
- The initial distribution $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$.

17
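Hidden Markov Model
Example (cont): N = 2, M = 3.
- States: $S_1$ = Rainy, $S_2$ = Sunny.
- Observation symbols: $v_1$ = Walk, $v_2$ = Shop, $v_3$ = Clean.
- Initial distribution: $\pi_1 = 0.3$, $\pi_2 = 0.7$.
- Transition probabilities: $a_{11} = 0.4$, $a_{12} = 0.6$, $a_{21} = 0.2$, $a_{22} = 0.8$.
- Observation probabilities: $b_1(1) = 0.1$, $b_1(2) = 0.4$, $b_1(3) = 0.5$, $b_2(1) = 0.6$, $b_2(2) = 0.3$, $b_2(3) = 0.1$.
[Figure: state diagram with Start, Sunny, Rainy and the observations Clean, Shop, Walk, labeled with the values above.]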
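To make the generative story concrete, here is a minimal sketch that samples hidden states and observations from Bob's model. It assumes $S_1$ is the rainy state and $S_2$ the sunny state (the reading under which the observation probabilities match Bob's activities), with rain-to-rain probability 0.4 and sun-to-rain probability 0.2 as in the weather example. The function name `sample_hmm` and the fixed seed are illustrative choices.

```python
import random

states = ["rainy", "sunny"]           # S_1, S_2 (assumed order)
symbols = ["walk", "shop", "clean"]   # v_1, v_2, v_3
pi = [0.3, 0.7]
A = [[0.4, 0.6],
     [0.2, 0.8]]
B = [[0.1, 0.4, 0.5],
     [0.6, 0.3, 0.1]]

def sample_hmm(length, rng):
    """Draw a hidden state sequence and the observations it emits."""
    hidden, observed = [], []
    state = rng.choices(range(len(states)), weights=pi)[0]
    for _ in range(length):
        hidden.append(states[state])
        symbol = rng.choices(range(len(symbols)), weights=B[state])[0]
        observed.append(symbols[symbol])
        state = rng.choices(range(len(states)), weights=A[state])[0]
    return hidden, observed

hidden, observed = sample_hmm(10, random.Random(0))
print(hidden)
print(observed)
```

An observer of the blog sees only `observed`; the three basic problems are all about reasoning from that list back to the model or to `hidden`.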

18
The Three Basic Problems for HMMs
Problem 1: The Evaluation Problem. Given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we determine the probability that O was generated by that model?
Problem 2: The Decoding Problem. Given the observation sequence O, determine the most likely sequence of hidden states that led to the observations.
Problem 3: The Learning Problem. Given a coarse structure of the model (the number of states and symbols) but not the probabilities $a_{ij}$ and $b_j(k)$, determine these parameters.

19
Problem 1: The Evaluation Problem
We want the probability that the model produces the observation sequence $O = O_1 O_2 \ldots O_T$.
Naïve solution: We denote by Q a fixed state sequence, $Q = q_1 q_2 \ldots q_T$, and sum over all possible state sequences:
$P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda) \, P(Q \mid \lambda)$
Problem: Too expensive. The sum has $N^T$ terms, so the complexity is $O(N^T)$.
Outline of the solution: We use a recursive algorithm that computes the value of the forward variable $\alpha_t(i) = P(O_1 O_2 \ldots O_t, q_t = S_i \mid \lambda)$ from the values at the preceding time step, i.e., $\{\alpha_{t-1}(1), \alpha_{t-1}(2), \ldots, \alpha_{t-1}(N)\}$.
[Figure: hidden states and the observations they emit.]
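The recursion can be sketched in a few lines. The code below uses the parameters of Bob's model (assuming $S_1$ = rainy, $S_2$ = sunny and symbols ordered walk, shop, clean) and compares the forward computation with the naive sum over all $N^T$ state sequences. The function names are illustrative, and no scaling is applied, so this is only suitable for short sequences.

```python
from itertools import product

pi = [0.3, 0.7]
A = [[0.4, 0.6],
     [0.2, 0.8]]
B = [[0.1, 0.4, 0.5],
     [0.6, 0.3, 0.1]]

def forward(obs):
    """P(O | lambda) via alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(O_t)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def brute_force(obs):
    """The naive solution: enumerate every state sequence Q."""
    n = len(pi)
    total = 0.0
    for q in product(range(n), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

obs = [0, 1, 2]  # walk, shop, clean
print(forward(obs), brute_force(obs))
```

The forward pass does on the order of $N^2 T$ multiplications, while the enumeration grows exponentially in T.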

20
Problem 2: Decoding Problem
Given a sequence of observations O, the decoding problem is to find the most probable sequence of hidden states. We want to find the best state sequence $q_1^*, q_2^*, \ldots, q_T^*$ such that:
$(q_1^*, q_2^*, \ldots, q_T^*) = \arg\max_{q_1, q_2, \ldots, q_T} P[q_1, q_2, \ldots, q_T, O_1 O_2 \ldots O_T \mid \lambda]$
Viterbi Algorithm: A dynamic programming algorithm that computes the most probable sequence of states up to time step t, using the most probable sequences up to time step t-1.
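A minimal sketch of the Viterbi recursion on Bob's model follows, again assuming $S_1$ = rainy (index 0), $S_2$ = sunny (index 1) and symbols ordered walk, shop, clean. The function name `viterbi` and the test sequence are illustrative; probabilities are multiplied directly, which is fine for short sequences but would need logarithms for long ones.

```python
pi = [0.3, 0.7]
A = [[0.4, 0.6],
     [0.2, 0.8]]
B = [[0.1, 0.4, 0.5],
     [0.6, 0.3, 0.1]]

def viterbi(obs):
    """Return the most probable state path and its joint probability with obs."""
    n = len(pi)
    # delta[i]: probability of the best path that ends in state i at the current step.
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # For each target state j, remember the best predecessor.
        ptr = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        backpointers.append(ptr)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

path, prob = viterbi([0, 2, 2])  # walk, clean, clean
print(path, prob)
```

For the observations walk, clean, clean the best path is sunny, rainy, rainy: a walk is most likely on a sunny day, and two days of cleaning are best explained by rain.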

21
Viterbi Algorithm
Example (Matlab): Consider Bob and the weather example from before.
The state transition matrix (TRANS) is: [matrix not preserved in the transcript]
The observation (EMIS) matrix is: [matrix not preserved in the transcript]
The following command in Matlab:
    [observations,states] = hmmgenerate(10,TRANS,EMIS,...
        'Statenames',{'start','rain','sun'},...
        'Symbols',{'walk','shop','clean'})
generates a random sequence of length 10 of states and observation symbols.

22
Viterbi Algorithm
Example (Matlab, continued): Result:
    states       = 'sun' 'rain' 'sun' 'sun' 'sun' 'sun' 'sun' 'sun' 'rain' 'sun'
    observations = 'clean' 'clean' 'walk' 'clean' 'walk' 'walk' 'clean' 'walk' 'shop' 'walk'

23
Viterbi Algorithm
Example (Matlab, continued): The Matlab function hmmviterbi uses the Viterbi algorithm to compute the most likely sequence of states the model would go through to generate a given sequence of observations:
    [observations,states] = hmmgenerate(1000,TRANS,EMIS);
    likelystates = hmmviterbi(observations, TRANS, EMIS);
To test the accuracy of hmmviterbi, compute the fraction of the actual state sequence that agrees with the sequence likelystates:
    sum(states==likelystates)/1000
    ans =
In this case, the most likely sequence of states agrees with the random sequence 80% of the time.

24
Problem 3: Learning Problem
Goal: To determine the model parameters $a_{ij}$ and $b_j(k)$ from an ensemble of training samples (observations).
Outline of the Baum-Welch algorithm:
- Start with rough estimates of $a_{ij}$ and $b_j(k)$.
- Calculate improved estimates.
- Repeat until the change in the estimated parameter values is sufficiently small.
Problem: The algorithm converges to a local maximum, which is not necessarily the global one.
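The "calculate improved estimates" step can be sketched as one re-estimation pass of Baum-Welch. The code below is a minimal version for a single short observation sequence with no scaling; the observation sequence, the rough initial estimates, and all function names are hypothetical choices for illustration, not values from the lecture.

```python
def forward_backward(obs, pi, A, B):
    """Forward variables alpha, backward variables beta, and P(O | lambda)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return alpha, beta, sum(alpha[T - 1])

def baum_welch_step(obs, pi, A, B):
    """One re-estimation pass; also returns the likelihood of the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta, likelihood = forward_backward(obs, pi, A, B)
    # gamma[t][i]: probability of being in state i at time t, given O.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)] for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t, given O.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood

# Hypothetical training sequence (0 = walk, 1 = shop, 2 = clean) and rough estimates.
obs = [0, 0, 2, 1, 2, 2, 0, 0, 1, 0, 2, 2]
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]

likelihoods = []
for _ in range(5):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    likelihoods.append(likelihood)
print(likelihoods)
```

Each pass cannot decrease the likelihood of the training sequence, which is why the procedure settles at a local maximum; different initial estimates may settle at different ones.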

25
HMM Word Recognition
Two approaches:
1. Path-discriminant
- One HMM can model all possible words (suited to large lexicons).
- Each letter is associated with a sub-HMM that is connected to all others (clique topology).
- The Viterbi algorithm gives the most likely word.
2. Model-discriminant
- Separate HMMs are used to model each word (small lexicons).
- The evaluation problem gives the probability of the observations under each model.
- We choose the model with the highest probability.
[Figure: path-discriminant model with sub-HMMs for the letters a, …, z in a clique topology, including the sub-HMM of the i-th letter. Model-discriminant model with HMMs for word 1, word 2, …, word v, each feeding a probability computation, followed by a select-maximum step.]

26
HMM Word Recognition
Preliminaries (feature extraction):
Question: So far, the symbols we have seen could be represented as scalars (sun = 1, rain = 2). What are the symbols for a 2D image?
Answer: We extract from the image a vector of features, where each feature is a number representing a measurable property of the image.

27
HMM Word Recognition
Preliminaries (feature extraction, cont):
Example of a feature: the number of crossings between the skeleton of the letter and a line passing through the center of mass of the letter (for the letter shown, the value is 3).
[Figure: binarization, followed by skeletonization and center-of-mass computation.]

28
HMM Word Recognition
Preliminaries (Vector Quantization, K means, cont):
Problem: Working on a small lexicon of a few hundred words could generate thousands of symbols. Is there a way to use (many) fewer symbols?
Answer: There are two popular algorithms, vector quantization and K-means, that are used to map a set of vectors into a finite, smaller set of vectors (representatives) without losing too much information.
- Usually there is a distance function between the original vectors and their representatives. We wish to minimize the value of this function for each vector.
- The representatives are called centroids or codebooks.

29
HMM Word Recognition
Preliminaries (Vector Quantization, K means, cont):
An example of a vector quantizer in 2D, with 34 centroids. Each point in a cell is replaced by the corresponding Voronoi site. The distance function is the Euclidean distance.
[Figure: Voronoi diagram.]
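The quantization step can be sketched with a plain K-means loop (Lloyd's iterations) using the Euclidean distance. The six 2D feature vectors, the two starting centroids, and the function names are hypothetical; real systems work with higher-dimensional feature vectors and many more centroids.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing the means."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for pt in points:
            clusters[nearest(pt, centroids)].append(pt)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids

# Hypothetical 2D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids = kmeans(points, centroids=[points[0], points[3]])

# Each feature vector is replaced by the index of its centroid: the HMM symbol.
symbols = [nearest(pt, centroids) for pt in points]
print(centroids, symbols)
```

After quantization the HMM sees only the symbol indices, so the number of centroids fixes the size M of the observation alphabet.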

30
HMM Word Recognition
Segmentation, pros and cons:
Segmentation, in the context of this lecture, is the splitting of the word image into segments that relate to characters.
[Figure: example of segmentation (cusp at bottom).]
Pro: Segmentation-based methods that use the path-discriminant approach have great flexibility with respect to the size of the lexicon.
Con: Segmentation is hard and ambiguous. "To recognize a letter, one must know where it starts and where it ends; to isolate a letter, one must recognize it first." K. M. Sayre (1973).

31
HMM Word Recognition
Segmentation, pros and cons (cont):
Segmentation-free methods:
- In a segmentation-free method, one should find the best possible interpretation of an observation sequence derived from the word image, without performing a meaningful segmentation first.
- Segmentation-free methods are usually used with the model-discriminant approach.
- HMMs that realize segmentation-free methods do not attach any meaning to specific transitions with respect to character fractions.

32
HMM Word Recognition
Example (Segmentation-Free Recognition System):
- The model of Bunke et al. (1995) uses a fixed lexicon.
- The observations are based on the edges of the skeleton graph of the word image.
Definition: The pixels of the skeleton of a word are considered part of an edge if they have exactly two neighbors; otherwise they are considered nodes.
Four reference lines are also extracted: the lower line, lower baseline, upper baseline and upper line.
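The edge/node definition can be sketched directly on a set of skeleton pixel coordinates. The code below assumes 8-connectivity (the lecture does not say which neighborhood Bunke et al. use) and a hypothetical skeleton consisting of a straight stroke of five pixels.

```python
def classify_skeleton(pixels):
    """Label each skeleton pixel 'edge' (exactly two neighbors) or 'node'."""
    pixels = set(pixels)
    labels = {}
    for (r, c) in pixels:
        # Count skeleton pixels among the 8 surrounding positions (assumed neighborhood).
        count = sum((r + dr, c + dc) in pixels
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
        labels[(r, c)] = "edge" if count == 2 else "node"
    return labels

# Hypothetical skeleton: a horizontal stroke of five pixels in row 0.
stroke = [(0, c) for c in range(5)]
labels = classify_skeleton(stroke)
print(labels)
```

The two end pixels have a single neighbor and become nodes, while the three interior pixels form one edge between them.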

33
HMM Word Recognition
Example (Segmentation-Free Recognition System, cont):
Example of edges in the word "lazy": pixels that belong to the same edge are marked with the same letter.
[Figure: skeleton of the word "lazy" with the lower line, lower baseline, upper baseline and upper line.]

34
HMM Word Recognition
Example (Segmentation-Free Recognition System, cont):
Feature extraction:
- The authors extract 10 features for each edge.
- The first 4 features, $f_1, \ldots, f_4$, are based on the relation between the edge and the reference lines; e.g., $f_1$ is the percentage of pixels lying between the upper line and the upper baseline.
- The other features are related to the edges themselves. For example, $f_7$ is defined as the percentage of pixels lying above the top endpoint.
[Figure: an edge E with its top endpoint and its other endpoint marked; for this edge, $f_7 = 21/23$.]

35
HMM Word Recognition
Example (Segmentation-Free Recognition System, cont):
Model:
- The model-discriminant approach is used (lexicon size = 150 words).
- The vector quantization algorithm produced 28 codebooks.
- The number of states for each letter in the alphabet is set to the minimum number of edges that can be expected for that letter.
- The initial values of (A, B, π) are set to some fixed probabilities and are then improved using the Baum-Welch algorithm. For each word in the lexicon, approximately 60 samples were used for training.
- The recognition rate is reported to be 98%.

36
HMM Word Recognition
Example (Segmentation-Free Recognition System, cont):
Word model example:

37
HMM Word Recognition
Segmentation-Based Algorithm (Kundu et al., 1988):
- The authors assume that each letter can be segmented (a problematic assumption).
- The path-discriminant model is used, where each state corresponds to a letter.
- To compute the initial and transition probabilities, the authors used a statistical study of the English language.
- To compute the symbol probabilities, a training set of 2500 words was used.
- The vector quantization algorithm produced 90 codebooks.

38
HMM Word Recognition
Segmentation-Based Algorithm (Kundu et al., 1988, cont):
Feature extraction: From each letter, the authors extract 15 features. Examples of features:
- $f_{zh}$ = horizontal zero crossings. A horizontal line passing through the center of gravity is calculated, and $f_{zh}$ is assigned the number of crossings between the letter and the line.
- $f_x$ = number of x-joints. In the thinned image, an x-joint is counted if the central pixel of the 3x3 window is black and 4 (or more) of the neighboring pixels are black too.
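The horizontal zero-crossing feature can be sketched as follows. The code assumes a binary image given as strings of "0" (white) and "1" (black), takes the center-of-gravity row as the rounded mean row index of the black pixels, and counts one crossing per run of black pixels on that row. These details, the function name, and the O-shaped test letter are assumptions for illustration; the exact definition in Kundu et al. may differ.

```python
def horizontal_zero_crossings(image):
    """Count runs of black pixels along the row through the center of gravity."""
    rows_of_black = [r for r, row in enumerate(image) for px in row if px == "1"]
    center_row = round(sum(rows_of_black) / len(rows_of_black))
    crossings, previous = 0, "0"
    for px in image[center_row]:
        if px == "1" and previous == "0":  # white-to-black transition
            crossings += 1
        previous = px
    return crossings

# Hypothetical thinned letter shaped like an "O".
letter_o = ["01110",
            "10001",
            "10001",
            "10001",
            "01110"]
f_zh = horizontal_zero_crossings(letter_o)
print(f_zh)
```

For the O shape the line through the center of gravity meets the stroke on the left and on the right, giving two crossings.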

39
HMM Word Recognition
Example (cont): Model overview.

40
HMM Word Recognition
Example (Raid Saabni et al., 2010): Keyword Searching for Arabic Handwritten Documents
- The authors use the model-discriminant method.
- Arabic is written in a cursive style from right to left.
- The authors refer to a group of connected letters within a word as a word-part.
- The authors use the word-parts as the basic building blocks of their recognition system.
[Figure: example of a word that contains 7 letters but only 3 word-parts.]

41
HMM Word Recognition
Example (Raid Saabni, 2010):
In Arabic, a word-part can be divided into two components: the main component, which denotes the continuous body of the word-part, and the secondary component, which refers to any additional strokes.
[Figure: examples of word-parts with different numbers of additional strokes.]
In the scope of this lecture, we show only the recognition of the main component.

42
HMM Word Recognition
Example (Raid Saabni, 2010):
Feature extraction: The pixels on a component's contour form a 2D polygon. The authors simplify the contour polygon to a smaller number of representative vertices. The simplified polygon is then refined by adding k vertices from the original polygon, distributed nearly uniformly between each two consecutive vertices. The point sequence $P = [p_1, p_2, \ldots, p_n]$ includes all the vertices of the refined polygon.
The authors extract 2 features:
1. The angle between the 2 consecutive vectors $(p_{i-1}, p_i)$ and $(p_i, p_{i+1})$.
2. The angle between the vectors $(p_i, p_{i+1})$ and $(p_j, p_{j+1})$, where $p_j$ and $p_{j+1}$ are consecutive vertices in the simplified polygon, and $p_i$ is a vertex inserted between them by the refining process.
[Figure: the points $p_{i-1}$, $p_i$, $p_{i+1}$ and the simplified-polygon vertices $p_j$, $p_{j+1}$.]
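The first feature, the angle between two consecutive vectors, can be sketched with `math.atan2` applied to the cross and dot products. The code returns a signed angle in radians; whether the authors use signed or unsigned angles is not stated here, so that choice, the function name, and the sample polyline are assumptions.

```python
import math

def turning_angle(p_prev, p, p_next):
    """Signed angle from the vector (p_prev, p) to the vector (p, p_next)."""
    ux, uy = p[0] - p_prev[0], p[1] - p_prev[1]
    vx, vy = p_next[0] - p[0], p_next[1] - p[1]
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.atan2(cross, dot)

# Hypothetical point sequence: right, up, right.
points = [(0, 0), (1, 0), (1, 1), (2, 1)]
angles = [turning_angle(points[i - 1], points[i], points[i + 1])
          for i in range(1, len(points) - 1)]
print(angles)
```

The first turn is a left turn of a quarter circle and the second a right turn of the same size, so the two angles have equal magnitude and opposite sign. The second feature uses the same computation with the second vector taken from the simplified polygon.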

43
HMM Word Recognition
Example (Raid Saabni, 2010):
Matching: The authors manually extracted different occurrences of word-parts from the searched document and used them to train the HMMs. The search for a keyword is performed by searching for its word-parts, which are later combined into words (the keywords). For each processed word-part, an observation sequence is generated and fed to the trained HMM system to determine its proximity to each of the keyword's word-parts.

44
References
- A tutorial on hidden Markov models and selected applications in speech recognition. Rabiner (1989).
- Recognition of Handwritten Word: First and Second Order Hidden Markov Model Based Approach. Amlan Kundu, Yang He and Paramvir Bahl (1988).
- Off-Line Cursive Handwriting Recognition Using Hidden Markov Models. H. Bunke, M. Roth and E. G. Schukat-Talamazzini (1995).
- Offline cursive script word recognition: a survey. Tal Steinherz, Ehud Rivlin, Nathan Intrator (1999).
- A presentation on Sequential Data and Hidden Markov, taken from the course Introduction to Pattern Recognition by Sargur Srihari.
- Keyword Searching for Arabic Handwritten Documents. Raid Saabni, Jihad El-Sana (2010).
