1 LIN 6932 Spring 2007 LIN6932: Topics in Computational Linguistics Hana Filip Lecture 4: Part of Speech Tagging (II) - Introduction to Probability February 1, 2007

2 LIN 6932 Spring 2007 Outline Part of speech tagging Parts of speech What’s POS tagging good for anyhow? Tag sets 2 main types of tagging algorithms Rule-based Statistical Important Ideas –Training sets and test sets –Unknown words –Error analysis Examples of taggers –Rule-based tagging (Karlsson et al. 1995) EngCG –Transformation-Based tagging (Brill 1995) –HMM tagging - Stochastic (Probabilistic) taggers

3 LIN 6932 Spring 2007 3 methods for POS tagging (recap from last lecture) 1. Rule-based tagging Example: Karlsson (1995) EngCG tagger based on the Constraint Grammar architecture and ENGTWOL lexicon –Basic Idea: Assign all possible tags to words (a morphological analyzer is used) Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned) 2. Transformation-based tagging Example: Brill (1995) tagger - a combination of rule-based and stochastic (probabilistic) tagging methodologies –Basic Idea: Start with a tagged corpus + dictionary (with most frequent tags) Set the most probable tag for each word as a start value Change tags according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”, applied in a specific order (like rule-based taggers) Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach) 3. Stochastic (= Probabilistic) tagging Example: HMM (Hidden Markov Model) tagging - a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context

4 LIN 6932 Spring 2007 Today Probability Conditional Probability Independence Bayes Rule HMM tagging Markov Chains Hidden Markov Models

5 LIN 6932 Spring 2007 Introduction to Probability Experiment (trial) Repeatable procedure with well-defined possible outcomes Sample Space (S) –the set of all possible outcomes –finite or infinite Example –coin toss experiment –possible outcomes: S = {heads, tails} Example –die toss experiment –possible outcomes: S = {1,2,3,4,5,6}

6 LIN 6932 Spring 2007 Introduction to Probability Definition of sample space depends on what we are asking Sample Space (S): the set of all possible outcomes Example –die toss experiment for whether the number is even or odd –possible outcomes: {even,odd} –not {1,2,3,4,5,6}

7 LIN 6932 Spring 2007 More definitions Events an event is any subset of outcomes from the sample space Example die toss experiment let A represent the event such that the outcome of the die toss experiment is divisible by 3 A = {3,6} A is a subset of the sample space S= {1,2,3,4,5,6}

8 LIN 6932 Spring 2007 Introduction to Probability Some definitions Events –an event is a subset of the sample space –simple and compound events Example –deck of cards draw experiment –suppose sample space S = {heart,spade,club,diamond} (four suits) –let A represent the event of drawing a heart –let B represent the event of drawing a red card –A = {heart} (simple event) –B = {heart} ∪ {diamond} = {heart,diamond} (compound event) a compound event can be expressed as a set union of simple events Example –alternative sample space S = set of 52 cards –A and B would both be compound events

9 LIN 6932 Spring 2007 Introduction to Probability Some definitions Counting –suppose an operation oi can be performed in ni ways –a set of k operations o1, o2, ..., ok can then be performed in n1 × n2 × ... × nk ways Example –dice toss experiment, 6 possible outcomes –two dice are thrown at the same time –number of sample points in the sample space = 6 × 6 = 36

10 LIN 6932 Spring 2007 Definition of Probability The probability law assigns to an event A a nonnegative number Called P(A) Also called the probability of A That encodes our knowledge or belief about the collective likelihood of all the elements of A The probability law must satisfy certain properties

11 LIN 6932 Spring 2007 Probability Axioms Nonnegativity P(A) >= 0, for every event A Additivity If A and B are two disjoint events, then the probability of their union satisfies: P(A U B) = P(A) + P(B) Normalization The probability of the entire sample space S is equal to 1, i.e. P(S) = 1.

12 LIN 6932 Spring 2007 An example An experiment involving a single coin toss There are two possible outcomes, H and T Sample space S is {H,T} If coin is fair, should assign equal probabilities to 2 outcomes Since they have to sum to 1 P({H}) = 0.5 P({T}) = 0.5 P({H,T}) = P({H})+P({T}) = 1.0

13 LIN 6932 Spring 2007 Another example Experiment involving 3 coin tosses Outcome is a 3-long string of H or T S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT} Assume each outcome is equiprobable “Uniform distribution” What is probability of the event that exactly 2 heads occur? A = {HHT,HTH,THH} 3 events/outcomes P(A) = P({HHT})+P({HTH})+P({THH}) additivity - union of the probability of the individual events = 1/8 + 1/8 + 1/8 total 8 events/outcomes = 3/8
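
The same calculation as a short Python sketch (illustrative only; the variable names are made up, not from the lecture): it enumerates the eight equiprobable outcomes and applies the additivity axiom by summing over the outcomes in the event.

```python
from itertools import product

# Sample space: all 3-long strings of H/T (8 equiprobable outcomes)
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]
p_outcome = 1 / len(sample_space)          # uniform distribution: 1/8 each

# Event A: exactly two heads occur
event_a = [o for o in sample_space if o.count("H") == 2]

# Additivity: P(A) is the sum of the probabilities of A's outcomes
p_a = sum(p_outcome for _ in event_a)
print(event_a)        # ['HHT', 'HTH', 'THH']
print(p_a)            # 0.375  (= 3/8)
```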

14 LIN 6932 Spring 2007 Probability definitions In summary: for equally likely outcomes, P(A) = (number of outcomes in A) / (number of outcomes in the sample space S) Probability of drawing a spade from 52 well-shuffled playing cards: P(spade) = 13/52 = 1/4 = 0.25

15 LIN 6932 Spring 2007 Moving toward language What’s the probability of drawing a 2 from a deck of 52 cards with four 2s? What’s the probability of a random word (from a random dictionary page) being a verb?

16 LIN 6932 Spring 2007 Probability and part of speech tags What’s the probability of a random word (from a random dictionary page) being a verb? How to compute each of these: All words = just count all the words in the dictionary # of ways to get a verb: # of words which are verbs! If a dictionary has 50,000 entries, and 10,000 are verbs… P(V) is 10000/50000 = 1/5 = 0.20

17 LIN 6932 Spring 2007 Conditional Probability A way to reason about the outcome of an experiment based on partial information In a word guessing game the first letter for the word is a “t”. What is the likelihood that the second letter is an “h”? How likely is it that a person has a disease given that a medical test was negative? A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

18 LIN 6932 Spring 2007 More precisely Given an experiment, a corresponding sample space S, and a probability law Suppose we know that the outcome is some event B We want to quantify the likelihood that the outcome also belongs to some other event A We need a new probability law that gives us the conditional probability of A given B P(A|B)

19 LIN 6932 Spring 2007 An intuition Let’s say A is “it’s raining”. Let’s say P(A) in dry Florida is 0.01 Let’s say B is “it was sunny ten minutes ago” P(A|B) means “what is the probability of it raining now if it was sunny 10 minutes ago” P(A|B) is probably way less than P(A) Perhaps P(A|B) is 0.0001 Intuition: The knowledge about B should change our estimate of the probability of A.

20 LIN 6932 Spring 2007 Conditional Probability let A and B be events in the sample space S P(A|B) = the conditional probability of event A occurring given some fixed event B occurring definition: P(A|B) = P(A ∩ B) / P(B)

21 LIN 6932 Spring 2007 Conditional probability P(A|B) = P(A ∩ B) / P(B) Note: P(A,B) = P(A|B) · P(B) Also: P(A,B) = P(B,A)
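
As an illustrative sketch (not part of the original slides), the definition can be checked on the earlier die-toss events, taking A = “divisible by 3” and B = “the outcome is even”:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                     # sample space of a fair die
A = {x for x in S if x % 3 == 0}           # divisible by 3 -> {3, 6}
B = {x for x in S if x % 2 == 0}           # even           -> {2, 4, 6}

def prob(event):
    """P(E) for equally likely outcomes: |E| / |S|."""
    return Fraction(len(event), len(S))

p_a_and_b = prob(A & B)                    # P(A ∩ B) = 1/6
p_b = prob(B)                              # P(B)     = 1/2
p_a_given_b = p_a_and_b / p_b              # P(A|B)   = 1/3

# Note from the slide: P(A,B) = P(A|B) · P(B)
assert p_a_given_b * p_b == p_a_and_b
print(p_a_given_b)                         # 1/3
```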

22 LIN 6932 Spring 2007 Independence What is P(A,B) if A and B are independent? P(A,B) = P(A) · P(B) iff A,B independent. P(heads,tails) = P(heads) · P(tails) = 0.5 · 0.5 = 0.25 Note: P(A|B) = P(A) iff A,B independent Also: P(B|A) = P(B) iff A,B independent
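
A minimal sketch of the multiplication rule for independent events, assuming two fair coin flips (a made-up example in the spirit of the slide):

```python
from itertools import product

# Two independent fair coin flips; 4 equiprobable joint outcomes
outcomes = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ...]
p_joint = 1 / len(outcomes)

# A = first flip is heads, B = second flip is tails
p_a = sum(p_joint for o in outcomes if o[0] == "H")        # 0.5
p_b = sum(p_joint for o in outcomes if o[1] == "T")        # 0.5
p_ab = sum(p_joint for o in outcomes if o == ("H", "T"))   # 0.25

print(p_ab == p_a * p_b)                   # True: independence holds
```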

23 LIN 6932 Spring 2007 Bayes Theorem Idea: The probability of an event A conditional on another event B is generally different from the probability of B conditional on A. However, there is a definite relationship between the two.

24 LIN 6932 Spring 2007 Deriving Bayes Rule The probability of event A given event B is P(A|B) = P(A ∩ B) / P(B)

25 LIN 6932 Spring 2007 Deriving Bayes Rule The probability of event B given event A is P(B|A) = P(A ∩ B) / P(A)

26 LIN 6932 Spring 2007 Deriving Bayes Rule Rearranging the two definitions gives two expressions for the joint probability: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)

27 LIN 6932 Spring 2007 Deriving Bayes Rule Setting the two expressions equal and dividing by P(B) yields Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

28 LIN 6932 Spring 2007 Deriving Bayes Rule the theorem may be paraphrased as conditional/posterior probability = (LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT
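
A minimal sketch of this paraphrase as code (the function name and the die-toss numbers are illustrative assumptions); it recovers the same P(A|B) = 1/3 as the direct computation above:

```python
from fractions import Fraction

def bayes(likelihood, prior, normalizer):
    """Posterior = (likelihood * prior) / normalizing constant."""
    return likelihood * prior / normalizer

# Die example: A = divisible by 3 = {3,6}, B = even = {2,4,6}
p_a = Fraction(2, 6)          # prior P(A)
p_b = Fraction(3, 6)          # normalizing constant P(B)
p_b_given_a = Fraction(1, 2)  # likelihood P(B|A): of {3,6}, only 6 is even

print(bayes(p_b_given_a, p_a, p_b))   # 1/3, matching P(A|B) computed directly
```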

29 LIN 6932 Spring 2007 Hidden Markov Model (HMM) Tagging Using an HMM to do POS tagging HMM is a special case of Bayesian inference Foundational work in computational linguistics: n-tuple features used for OCR (Optical Character Recognition): W. W. Bledsoe and I. Browning, "Pattern Recognition and Reading by Machine," Proc. Eastern Joint Computer Conf., Dec. 1959. F. Mosteller and D. Wallace, “Inference and Disputed Authorship: The Federalist”: statistical methods applied to determine the authorship of the Federalist Papers (function words; Alexander Hamilton vs. James Madison) It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

30 LIN 6932 Spring 2007 POS tagging as a sequence classification task We are given a sentence (an “observation” or “sequence of observations”) Secretariat is expected to race tomorrow sequence of n words w1…wn. What is the best sequence of tags which corresponds to this sequence of observations? Probabilistic/Bayesian view: Consider all possible sequences of tags Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

31 LIN 6932 Spring 2007 Getting to HMM Let T = t1,t2,…,tn Let W = w1,w2,…,wn Goal: Out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn T̂ = argmax_T P(T|W) Hat ^ means “our estimate of the best = the most probable tag sequence” Argmax_x f(x) means “the x such that f(x) is maximized”; it maximizes our estimate of the best tag sequence

32 LIN 6932 Spring 2007 Getting to HMM This equation is guaranteed to give us the best tag sequence But how do we make it operational? How do we compute this value? Intuition of Bayesian classification: Use Bayes rule to transform it into a set of other probabilities that are easier to compute Thomas Bayes: British mathematician

33 LIN 6932 Spring 2007 Bayes Rule Breaks down any conditional probability P(x|y) into three other probabilities: P(x|y) = P(y|x) P(x) / P(y) P(x|y): The conditional probability of an event x assuming that y has occurred

34 LIN 6932 Spring 2007 Bayes Rule T̂ = argmax_T P(W|T) P(T) / P(W) We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

35 LIN 6932 Spring 2007 Bayes Rule T̂ = argmax_T P(W|T) P(T)

36 LIN 6932 Spring 2007 Likelihood and prior T̂ = argmax_T P(W|T) P(T), where P(W|T) = P(w1…wn | t1…tn) is the likelihood and P(T) = P(t1…tn) is the prior

37 LIN 6932 Spring 2007 Likelihood and prior Further Simplifications 1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it: P(W|T) ≈ ∏i=1..n P(wi|ti) 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag: P(T) ≈ ∏i=1..n P(ti|ti-1) 3. The most probable tag sequence estimated by the bigram tagger combines 1 and 2

38 LIN 6932 Spring 2007 Likelihood and prior Further Simplifications 1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it (Figure: the WORDS “the koala put the keys on the table” aligned with their TAGS DET, N, V, P)

39 LIN 6932 Spring 2007 Likelihood and prior Further Simplifications 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram. Bigrams are used as the basis for simple statistical analysis of text The bigram assumption is related to the first-order Markov assumption

40 LIN 6932 Spring 2007 Likelihood and prior Further Simplifications 3. The most probable tag sequence estimated by the bigram tagger: T̂ = argmax_T ∏i=1..n P(wi|ti) P(ti|ti-1) (using the bigram assumption)
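
A sketch of the bigram approximation as code, scoring one candidate tag sequence; the probability tables here are made-up toy values, not estimates from a real corpus:

```python
def score(words, tags, emit_p, trans_p, start="<s>"):
    """P(W,T) under the bigram tagger approximation:
    product over i of P(w_i | t_i) * P(t_i | t_{i-1})."""
    p = 1.0
    prev = start
    for w, t in zip(words, tags):
        p *= emit_p.get((w, t), 0.0) * trans_p.get((t, prev), 0.0)
        prev = t
    return p

# Toy, made-up probability tables (NOT estimated from any real corpus)
emit_p = {("the", "DT"): 0.5, ("race", "NN"): 0.001, ("race", "VB"): 0.0005}
trans_p = {("DT", "<s>"): 0.6, ("NN", "DT"): 0.5, ("VB", "DT"): 0.01}

words = ["the", "race"]
print(score(words, ["DT", "NN"], emit_p, trans_p))  # noun reading scores higher
print(score(words, ["DT", "VB"], emit_p, trans_p))
```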

41 LIN 6932 Spring 2007 Two kinds of probabilities (1) Tag transition probabilities p(ti|ti-1) Determiners are likely to precede adjectives and nouns –That/DT flight/NN –The/DT yellow/JJ hat/NN –So we expect P(NN|DT) and P(JJ|DT) to be high –But we expect P(DT|JJ) to be low

42 LIN 6932 Spring 2007 Two kinds of probabilities (1) Tag transition probabilities p(ti|ti-1) Compute P(NN|DT) by counting in a labeled corpus: P(NN|DT) = C(DT, NN) / C(DT), i.e., the # of times DT is followed by NN, divided by the # of times DT occurs

43 LIN 6932 Spring 2007 Two kinds of probabilities (2) Word likelihood probabilities p(wi|ti) P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is” Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, “is”) / C(VBZ) If we were expecting a third person singular verb, how likely is it that this verb would be is?
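
Both kinds of probabilities can be estimated by counting in a labeled corpus. A minimal sketch with a made-up two-sentence “corpus” (the counts and resulting probabilities are purely illustrative):

```python
from collections import Counter

# Toy labeled corpus: lists of (word, tag) pairs (not real Brown-corpus data)
corpus = [
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN"), ("is", "VBZ"), ("here", "RB")],
]

tag_count = Counter()        # C(t)
bigram_count = Counter()     # C(t_{i-1}, t_i)
emit_count = Counter()       # C(t_i, w_i)

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        tag_count[prev] += 1
        prev = tag
    tag_count[prev] += 1     # also count the sentence-final tag

# P(NN|DT) = C(DT, NN) / C(DT);   P(is|VBZ) = C(VBZ, is) / C(VBZ)
p_nn_given_dt = bigram_count[("DT", "NN")] / tag_count["DT"]
p_is_given_vbz = emit_count[("VBZ", "is")] / tag_count["VBZ"]
print(p_nn_given_dt, p_is_given_vbz)   # 0.5 1.0 on this toy corpus
```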

44 LIN 6932 Spring 2007 An Example: the verb “race” Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN How do we pick the right tag?

45 LIN 6932 Spring 2007 Disambiguating “race”

46 LIN 6932 Spring 2007 Disambiguating “race” P(NN|TO) = 0.00047, P(VB|TO) = 0.83 The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect a verb/noun given the previous tag TO?’ P(race|NN) = 0.00057, P(race|VB) = 0.00012 Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB. P(NR|VB) = 0.0027, P(NR|NN) = 0.0012 Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun P(VB|TO)P(NR|VB)P(race|VB) = 0.00000027 P(NN|TO)P(NR|NN)P(race|NN) = 0.00000000032 Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins
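
A quick arithmetic check of this comparison, plugging in the Brown-corpus figures quoted on the slide:

```python
# Figures quoted on the slide (Brown corpus estimates)
p_vb_given_to, p_nn_given_to = 0.83, 0.00047
p_race_given_vb, p_race_given_nn = 0.00012, 0.00057
p_nr_given_vb, p_nr_given_nn = 0.0027, 0.0012

verb_reading = p_vb_given_to * p_nr_given_vb * p_race_given_vb
noun_reading = p_nn_given_to * p_nr_given_nn * p_race_given_nn

print(f"VB: {verb_reading:.2e}  NN: {noun_reading:.2e}")  # VB ~ 2.7e-07, NN ~ 3.2e-10
print("verb wins" if verb_reading > noun_reading else "noun wins")
```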

47 LIN 6932 Spring 2007 Hidden Markov Models What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM) Let’s just spend a bit of time tying this into the model In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.

48 LIN 6932 Spring 2007 Definitions A weighted finite-state automaton adds probabilities to the arcs The probabilities on the arcs leaving any state must sum to one A Markov chain is a special case of a weighted FSA in which the input sequence uniquely determines which states the automaton will go through Markov chains can’t represent inherently ambiguous problems Useful for assigning probabilities to unambiguous sequences

49 LIN 6932 Spring 2007 Markov chain = “First-order observed Markov Model” a set of states Q = q1, q2 … qN; the state at time t is qt a set of transition probabilities: a set of probabilities A = a01, a02, … an1, … ann. Each aij represents the probability of transitioning from state i to state j The set of these is the transition probability matrix A Distinguished start and end states Special initial probability vector π: πi is the probability that the MM will start in state i; each πi expresses the probability p(qi|START)

50 LIN 6932 Spring 2007 Markov chain = “First-order observed Markov Model” Markov Chain for weather: Example 1 three types of weather: sunny, rainy, foggy we want to find the following conditional probabilities: P(qn|qn-1, qn-2, …, q1) i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories

51 LIN 6932 Spring 2007 Markov chain = “First-order observed Markov Model” Therefore, we make a simplifying assumption, called the (first-order) Markov assumption for a sequence of observations q1, … qn: the current state only depends on the previous state, P(qn|q1, …, qn-1) ≈ P(qn|qn-1) The joint probability of certain past and current observations then factors as P(q1, …, qn) = P(q1) ∏i=2..n P(qi|qi-1)

52 LIN 6932 Spring 2007 Markov chain = “First-order observable Markov Model”

53 LIN 6932 Spring 2007 Markov chain = “First-order observed Markov Model” Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy? Using the Markov assumption and the probabilities in table 1, this translates into: P(q2=sunny, q3=rainy | q1=sunny) = P(q3=rainy | q2=sunny) × P(q2=sunny | q1=sunny)

54 LIN 6932 Spring 2007 The weather figure: specific example Markov Chain for weather: Example 2

55 LIN 6932 Spring 2007 Markov chain for weather What is the probability of 4 consecutive rainy days? Sequence is rainy-rainy-rainy-rainy I.e., the state sequence is (3,3,3,3), with rainy as state 3 P(3,3,3,3) = π3 a33 a33 a33 = 0.2 × (0.6)³ = 0.0432
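
The same computation as a two-line sketch (initial probability and self-transition values as on the slide):

```python
pi_rainy = 0.2        # initial probability of starting in the rainy state
a_rr = 0.6            # self-transition probability rainy -> rainy

# 4 consecutive rainy days = start in rainy, then 3 rainy -> rainy transitions
p = pi_rainy * a_rr ** 3
print(p)              # ~0.0432 (up to floating-point rounding)
```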

56 LIN 6932 Spring 2007 Hidden Markov Model For Markov chains, the output symbols are the same as the states: if we see sunny weather, we’re in state sunny But in part-of-speech tagging (and other things) The output symbols are words But the hidden states are part-of-speech tags So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states. This means we don’t know which state we are in.

57 LIN 6932 Spring 2007 Markov chain for weather

58 LIN 6932 Spring 2007 Markov chain for words Observed events: words Hidden events: tags

59 LIN 6932 Spring 2007 Hidden Markov Models States Q = q1, q2 … qN Observations O = o1, o2 … oN; each observation is a symbol from a vocabulary V = {v1, v2, … vV} Transition probabilities (prior): transition probability matrix A = {aij} Observation likelihoods (likelihood): output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (emission probabilities) Special initial probability vector π: πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)
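
A sketch of these components as plain Python data structures, using toy weather/ice-cream values in the spirit of the Eisner example later in the lecture (the specific numbers are made up):

```python
# States (hidden) and vocabulary (observable symbols)
states = ["HOT", "COLD"]
vocab = [1, 2, 3]                     # number of ice creams eaten

# Initial probability vector pi: p(q_i | START)
pi = {"HOT": 0.8, "COLD": 0.2}

# Transition probability matrix A = {a_ij}: each row sums to 1
A = {
    "HOT":  {"HOT": 0.7, "COLD": 0.3},
    "COLD": {"HOT": 0.4, "COLD": 0.6},
}

# Observation likelihoods (emission matrix) B = {b_i(o_t)}
B = {
    "HOT":  {1: 0.2, 2: 0.4, 3: 0.4},
    "COLD": {1: 0.5, 2: 0.4, 3: 0.1},
}

# Sanity checks: each distribution sums to 1
assert abs(sum(pi.values()) - 1) < 1e-9
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in A.values())
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in B.values())
```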

60 LIN 6932 Spring 2007 Assumptions Markov assumption: the probability of a particular state depends only on the previous state Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

61 LIN 6932 Spring 2007 HMM for Ice Cream You are a climatologist in the year 2799 Studying global warming You can’t find any records of the weather in Boston, MA for summer of 2007 But you find Jason Eisner’s diary Which lists how many ice-creams Jason ate every date that summer Our job: figure out how hot it was

62 LIN 6932 Spring 2007 The task Given Ice Cream Observation Sequence: 1,2,3,2,2,2,3… (cp. with output symbols) Produce: Weather Sequence: C,C,H,C,C,C,H … (cp. with hidden states, causing states)

63 LIN 6932 Spring 2007 HMM for ice cream

64 LIN 6932 Spring 2007 Different types of HMM structure Bakis = left-to-right Ergodic = fully-connected

65 LIN 6932 Spring 2007 HMM Taggers Two kinds of probabilities A transition probabilities (PRIOR) (slide 36) B observation likelihoods (LIKELIHOOD) (slide 36) HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability: T̂ = argmax_T ∏i=1..n P(wi|ti) P(ti|ti-1)
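
A brute-force sketch of that maximization, enumerating every possible tag sequence (real taggers use the Viterbi algorithm previewed on the last slide); the probability tables below are toy values for illustration:

```python
from itertools import product

def best_tag_sequence(words, tagset, emit_p, trans_p, start="<s>"):
    """Return the tag sequence maximizing prod_i P(w_i|t_i) * P(t_i|t_{i-1})."""
    best, best_p = None, -1.0
    for tags in product(tagset, repeat=len(words)):
        p, prev = 1.0, start
        for w, t in zip(words, tags):
            p *= emit_p.get((w, t), 0.0) * trans_p.get((t, prev), 0.0)
            prev = t
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

# Toy tables (made up for illustration; the 'race' numbers echo the slide)
emit_p = {("to", "TO"): 1.0, ("race", "NN"): 0.00057, ("race", "VB"): 0.00012}
trans_p = {("TO", "<s>"): 0.1, ("NN", "TO"): 0.00047, ("VB", "TO"): 0.83}

tags, p = best_tag_sequence(["to", "race"], ["TO", "NN", "VB"], emit_p, trans_p)
print(tags, p)    # ('TO', 'VB') wins over ('TO', 'NN')
```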

66 LIN 6932 Spring 2007 Weighted FSM corresponding to hidden states of HMM, showing A probs

67 LIN 6932 Spring 2007 B observation likelihoods for POS HMM

68 LIN 6932 Spring 2007 HMM Taggers The probabilities are trained on hand-labeled training corpora (the training set) Combine different N-gram levels Evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard)

69 LIN 6932 Spring 2007 Next Time Minimum Edit Distance A “dynamic programming” algorithm A probabilistic version of this called “Viterbi” is a key part of the Hidden Markov Model!