Chapter 6: HIDDEN MARKOV AND MAXIMUM ENTROPY Heshaam Faili University of Tehran

2 Introduction
Hidden Markov Model (HMM), Maximum Entropy, and Maximum Entropy Markov Model (MEMM) are machine learning methods.
A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence.
A finite-state transducer is a non-probabilistic sequence classifier, e.g. for transducing from sequences of words to sequences of morphemes.
HMMs and MEMMs extend this notion by being probabilistic sequence classifiers.

3 Markov chain
Also called an observed Markov model: a weighted finite-state automaton.
Markov chain: a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
Because it can't represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.

4 Markov Chain

5 Formal Description
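The defining equations on this slide are an image in the source; in the standard formulation, a Markov chain is specified by the following components:

```latex
\begin{aligned}
Q &= q_1, q_2, \ldots, q_N & &\text{a set of } N \text{ states} \\
A &= \{a_{ij}\}, \quad a_{ij} = P(q_t{=}j \mid q_{t-1}{=}i) & &\text{transition probability matrix, } \textstyle\sum_{j=1}^{N} a_{ij} = 1 \\
q_0,\ q_F & & &\text{special start and final states (or, equivalently, an initial distribution } \pi)
\end{aligned}
```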

6 First-order Markov Chain: the probability of a particular state depends only on the previous state.
Markov Assumption: P(q_i | q_1 ... q_{i−1}) = P(q_i | q_{i−1})
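Under this assumption, the probability of a whole state sequence factorizes into a product of bigram transition probabilities:

```latex
P(q_1, q_2, \ldots, q_n) = P(q_1) \prod_{i=2}^{n} P(q_i \mid q_{i-1})
```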

7 Markov Chain example
Compute the probability of each of the following sequences: hot hot cold hot
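A minimal sketch of this computation in Python; the initial and transition probabilities below are illustrative placeholders, not the values from the slide's figure:

```python
# Markov-chain sequence probability under the first-order assumption.
# The numbers below are made-up placeholders, not the slide's actual parameters.
initial = {"hot": 0.8, "cold": 0.2}                 # P(q1)
trans = {
    "hot":  {"hot": 0.7, "cold": 0.3},              # P(q_i | q_{i-1} = hot)
    "cold": {"hot": 0.4, "cold": 0.6},              # P(q_i | q_{i-1} = cold)
}

def sequence_probability(states):
    """P(q1..qn) = P(q1) * prod_i P(q_i | q_{i-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability(["hot", "hot", "cold", "hot"]))
```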

8 Hidden Markov Model
In POS tagging we didn't observe POS tags in the world; we saw words, and had to infer the correct tags from the word sequence. We call the POS tags hidden because they are not observed.
A Hidden Markov Model (HMM) allows us to talk about both observed events (like words) and hidden events (like POS tags) that we think of as causal factors in our probabilistic model.

9 Jason Eisner (2002) example Imagine that you are a climatologist in the year 2799 studying the history of global warming. You cannot find any records of the weather in Baltimore, Maryland, for the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day Given a sequence of observations O, each observation an integer corresponding to the number of ice creams eaten on a given day, figure out the correct ‘hidden’ sequence Q of weather states (H or C) which caused Jason to eat the ice cream

10 Formal Description

11 Formal Description
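The defining equations on slides 10–11 are images in the source; in the standard formulation, an HMM λ = (A, B) is specified by the following components:

```latex
\begin{aligned}
Q &= q_1, q_2, \ldots, q_N & &\text{a set of } N \text{ hidden states} \\
A &= \{a_{ij}\}, \quad a_{ij} = P(q_t{=}j \mid q_{t-1}{=}i) & &\text{transition probabilities, } \textstyle\sum_{j=1}^{N} a_{ij} = 1 \\
O &= o_1, o_2, \ldots, o_T & &\text{a sequence of } T \text{ observations} \\
B &= \{b_i(o_t)\}, \quad b_i(o_t) = P(o_t \mid q_t{=}i) & &\text{observation (emission) likelihoods} \\
q_0,\ q_F & & &\text{special start and final states not associated with observations}
\end{aligned}
```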

12 HMM Example

13 Fully-connected (Ergodic) & Left-to-right (Bakis) HMM

14 Three fundamental problems
Problem 1 (Computing Likelihood): Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

15 COMPUTING LIKELIHOOD: THE FORWARD ALGORITHM
Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ).
For a Markov chain, where the observations are the states themselves, we could compute the probability of a sequence just by following the labeled states and multiplying the probabilities along the arcs.
For an HMM this does not work: we want to determine the probability of an ice-cream observation sequence like 3 1 3, but we don't know what the hidden state sequence is!
If we already knew the weather and wanted to predict how much ice cream Jason would eat, i.e. for a given hidden state sequence (e.g. hot hot cold), we could easily compute the output likelihood of 3 1 3.

16 THE FORWARD ALGORITHM

17 THE FORWARD ALGORITHM

18 THE FORWARD ALGORITHM
Dynamic programming, O(N²T) for N hidden states and an observation sequence of T observations.
α_t(j) represents the probability of being in state j after seeing the first t observations, given the automaton: α_t(j) = P(o_1, o_2, ..., o_t, q_t = j | λ).
Here q_t = j means "the t-th state in the sequence of states is state j".
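A compact sketch of this recursion in Python. The transition matrix A, emission matrix B, and initial distribution pi below are illustrative placeholders for an ice-cream HMM with states H/C and observations 1–3, not the actual numbers from the slides:

```python
import numpy as np

# alpha[t, j] = P(o_1 .. o_t, q_t = j | lambda): the forward probability.
states = ["H", "C"]
A = np.array([[0.7, 0.3],        # P(H->H), P(H->C)   (placeholder values)
              [0.4, 0.6]])       # P(C->H), P(C->C)
B = np.array([[0.2, 0.4, 0.4],   # P(1|H), P(2|H), P(3|H)
              [0.5, 0.4, 0.1]])  # P(1|C), P(2|C), P(3|C)
pi = np.array([0.8, 0.2])        # initial state distribution

def forward(obs, A, B, pi):
    """Return the observation likelihood P(O | lambda) in O(N^2 * T) time."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0] - 1]                      # initialization
    for t in range(1, T):                                 # recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t] - 1]
    return alpha[-1].sum()                                # termination

print(forward([3, 1, 3], A, B, pi))   # likelihood of the ice-cream sequence 3 1 3
```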

19

20 THE FORWARD ALGORITHM

21 THE FORWARD ALGORITHM

22 THE FORWARD ALGORITHM

23 DECODING: THE VITERBI ALGORITHM

24 DECODING: THE VITERBI ALGORITHM
v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q_0, q_1, ..., q_{t−1}, given the automaton.
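A matching sketch of Viterbi decoding; it reuses numpy and the illustrative states, A, B, pi defined in the forward-algorithm sketch above:

```python
def viterbi(obs, A, B, pi, states):
    """Return the most probable hidden state sequence for obs (Viterbi decoding)."""
    T, N = len(obs), A.shape[0]
    v = np.zeros((T, N))                   # v[t, j]: best path probability ending in state j
    back = np.zeros((T, N), dtype=int)     # backpointers to the best previous state
    v[0] = pi * B[:, obs[0] - 1]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A             # scores[i, j] = v[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)            # best previous state for each j
        v[t] = scores.max(axis=0) * B[:, obs[t] - 1]
    path = [int(v[-1].argmax())]                   # best final state
    for t in range(T - 1, 0, -1):                  # follow the backpointers
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

# Most probable weather sequence under the placeholder parameters above.
print(viterbi([3, 1, 3], A, B, pi, states))
```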

25 TRAINING HMMS: THE FORWARD-BACKWARD ALGORITHM
Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.
Ice-cream task: we would start with a sequence of observations O = {1, 3, 2, ...} and the set of hidden states H and C.
Part-of-speech tagging task: we would start with a sequence of observations O = {w1, w2, w3, ...} and a set of hidden states NN, NNS, VBD, IN, ...

26 forward-backward
The forward-backward or Baum-Welch algorithm (Baum, 1972) is a special case of the Expectation-Maximization (EM) algorithm.
Start with a plain Markov model: there are no emission probabilities B (alternatively, we could view a Markov chain as a degenerate Hidden Markov Model where all the b probabilities are 1.0 for the observed symbol and 0 for all other symbols).
In that case we only need to train the transition probabilities A.

27 forward-backward
For a Markov chain we could simply count the observed state transitions and estimate the matrix A from those counts.
For a Hidden Markov Model we cannot count these transitions directly, since the states are hidden.
The Baum-Welch algorithm uses two intuitions: iteratively estimate the counts, starting from an initial guess of the probabilities, and obtain those estimates by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.

28 backward probability.

29 backward probability.

30 backward probability.
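The equations on slides 28–30 are images in the source; the standard definition and recursion for the backward probability, mirroring the forward probability above, are:

```latex
\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda),
\qquad
\beta_T(i) = 1,
\qquad
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \quad (t < T)
```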

31 forward-backward

32 forward-backward

33 forward-backward

34 forward-backward
The probability of being in state j at time t, which we will call γ_t(j).
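The defining equation is an image in the source; in terms of the forward and backward probabilities defined above, the standard expression is:

```latex
\gamma_t(j) = P(q_t = j \mid O, \lambda) = \frac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \lambda)}
```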

35 forward-backward

36 forward-backward
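The re-estimation equations on slides 35–36 are likewise images in the source; in the standard Baum-Welch formulation they use γ_t(j) together with a transition quantity ξ_t(i, j), and update A and B as ratios of expected counts:

```latex
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)},
\qquad
\hat{b}_j(v_k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
```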

37

38 MAXIMUM ENTROPY MODELS
A machine learning framework called Maximum Entropy modeling (MaxEnt).
Used for classification: the task of classification is to take a single observation, extract some useful features describing the observation, and then, based on these features, classify the observation into one of a set of discrete classes.
A probabilistic classifier additionally gives the probability of the observation being in each class.
Non-sequential classification:
In text classification we might need to decide whether a particular email should be classified as spam or not.
In sentiment analysis we have to determine whether a particular sentence or document expresses a positive or negative opinion.
In sentence segmentation we need to classify a period character ('.') as either a sentence boundary or not.

39 MaxEnt
MaxEnt belongs to the family of classifiers known as exponential or log-linear classifiers.
MaxEnt works by extracting some set of features from the input, combining them linearly (meaning that we multiply each by a weight and then add them up), and then using this sum as an exponent.
Example (tagging): a feature for tagging might be "this word ends in -ing" or "the previous word was 'the'".

40 Linear Regression Two different names for tasks that map some input features into some output value: regression when the output is real-valued, and classification when the output is one of a discrete set of classes

41 Linear Regression, Example
price = w0 + w1 * Num_Adjectives

42 Multiple linear regression
price = w0 + w1 * Num_Adjectives + w2 * Mortgage_Rate + w3 * Num_Unsold_Houses
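A small sketch of this prediction as a dot product with an intercept term; the weight values below are made-up placeholders, not the coefficients from the slide:

```python
import numpy as np

# Multiple linear regression as a dot product; the weights are illustrative placeholders.
w = np.array([16500.0, 1000.0, -4000.0, -0.1])   # w0 (intercept), w1, w2, w3

def predict_price(num_adjectives, mortgage_rate, num_unsold_houses):
    """price = w0 + w1*Num_Adjectives + w2*Mortgage_Rate + w3*Num_Unsold_Houses"""
    features = np.array([1.0, num_adjectives, mortgage_rate, num_unsold_houses])
    return float(w @ features)

print(predict_price(num_adjectives=5, mortgage_rate=6.5, num_unsold_houses=10000))
```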

43 Learning in linear regression sum-squared error
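The cost function on this slide is an image in the source; the standard sum-squared error objective it names, over M training examples, is:

```latex
\mathrm{cost}(W) = \sum_{j=1}^{M} \bigl( y_{\mathrm{pred}}^{(j)} - y_{\mathrm{obs}}^{(j)} \bigr)^2,
\qquad
W^{*} = \operatorname*{argmin}_{W}\ \mathrm{cost}(W)
```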

44 Logistic regression
Classification in which the output y we are trying to predict takes on one of a small set of discrete values.
Binary classification: y is either 0 or 1.
Key notions: the odds of an outcome and the logit function.
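The equations on this slide are images in the source; the standard definitions of the odds and the logit for binary classification with feature vector f and weight vector w, and the logistic model they lead to, are:

```latex
\text{odds} = \frac{p(y{=}1 \mid x)}{1 - p(y{=}1 \mid x)},
\qquad
\operatorname{logit}\bigl(p(y{=}1 \mid x)\bigr) = \ln \frac{p(y{=}1 \mid x)}{1 - p(y{=}1 \mid x)} = w \cdot f
\;\;\Longrightarrow\;\;
p(y{=}1 \mid x) = \frac{e^{\,w \cdot f}}{1 + e^{\,w \cdot f}}
```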

45 Logistic regression

46 Logistic regression

47 Logistic regression: Classification hyperplane
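The decision rule on this slide is an image in the source; in the standard formulation, we choose y = 1 exactly when the weighted feature sum is positive, so w · f = 0 defines a separating hyperplane:

```latex
\text{choose } y = 1 \iff p(y{=}1 \mid x) > p(y{=}0 \mid x) \iff w \cdot f > 0
```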

48 Learning in logistic regression
Conditional maximum likelihood estimation: choose the weights that maximize the probability of the observed y values given the observed x values in the training data.

49 Learning in logistic regression Convex Optimization

50 MAXIMUM ENTROPY MODELING
MaxEnt is multinomial logistic regression.
Most of the time, classification problems that come up in language processing involve larger numbers of classes (such as the set of part-of-speech classes).
Here y takes on one of C different values, corresponding to classes c1, ..., cC.
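The model equation on this slide is an image in the source; the standard multinomial logistic regression (MaxEnt) form it refers to, with indicator features f_i(c, x) and class-specific weights w_{ci}, is:

```latex
P(c \mid x) = \frac{\exp\Bigl(\sum_{i=1}^{N} w_{ci}\, f_i(c, x)\Bigr)}
                   {\sum_{c' \in C} \exp\Bigl(\sum_{i=1}^{N} w_{c'i}\, f_i(c', x)\Bigr)}
```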

51 Maximum Entropy Modeling Indicator function: A feature that only takes on the values 0 and 1

52 Maximum Entropy Modeling Example Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
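The feature table for this example is an image in the source; indicator features of the kind used here (illustrative, not necessarily the exact features or weights on the slide) look like:

```latex
\begin{aligned}
f_1(c, x) &= 1 \ \text{if}\ \textit{suffix}(word_i) = \text{``ing''} \ \text{and}\ c = \text{VBG}, \ \text{else } 0 \\
f_2(c, x) &= 1 \ \text{if}\ word_i = \text{``race''} \ \text{and}\ c = \text{NN}, \ \text{else } 0 \\
f_3(c, x) &= 1 \ \text{if}\ t_{i-1} = \text{TO} \ \text{and}\ c = \text{VB}, \ \text{else } 0
\end{aligned}
```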

53 Maximum Entropy Modeling

54 Why do we call it Maximum Entropy?
Of all possible distributions, the equiprobable distribution has the maximum entropy.

55 Why do we call it Maximum Entropy?

56 Maximum Entropy
Of all the distributions that satisfy the feature constraints, the maximum entropy distribution turns out to be exactly the probability distribution of a multinomial logistic regression model whose weights W maximize the likelihood of the training data! Thus the exponential (log-linear) model is also the maximum entropy model.