Lecture 1, 7/21/2005, Natural Language Processing. CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 10, 16 August 2007.

1 CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 10, 16 August 2007

2 Hidden Markov Model (HMM) Tagging
- Using an HMM to do POS tagging
- HMM is a special case of Bayesian inference
- It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

3 Hidden Markov Model (HMM) Taggers
- Goal: maximize P(word|tag) x P(tag|previous n tags)
- P(word|tag): lexical information
  - word/lexical likelihood
  - probability that, given this tag, we have this word
  - NOT the probability that this word has this tag
  - modeled through language model (word-tag matrix)
- P(tag|previous n tags): syntagmatic information
  - tag sequence likelihood
  - probability that this tag follows these previous tags
  - modeled through language model (tag-tag matrix)

4 POS tagging as a sequence classification task
- We are given a sentence (an “observation” or “sequence of observations”), i.e. a sequence of n words w_1…w_n
  - Example: Secretariat is expected to race tomorrow
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1…w_n

5 Getting to HMM
- Let T = t_1, t_2, …, t_n
- Let W = w_1, w_2, …, w_n
- Goal: out of all sequences of tags t_1…t_n, get the most probable sequence of POS tags T underlying the observed sequence of words w_1, w_2, …, w_n
- Hat ^ means “our estimate of the best = the most probable tag sequence”
- argmax_x f(x) means “the x such that f(x) is maximized”; here it picks out our estimate of the best tag sequence

6 Getting to HMM
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification: use Bayes rule to transform it into a set of other probabilities that are easier to compute
- Thomas Bayes: British mathematician (1702-1761)

7 Bayes Rule
- Breaks down any conditional probability P(x|y) into three other probabilities: P(x|y) = P(y|x) P(x) / P(y)
- P(x|y): the conditional probability of an event x assuming that y has occurred

8 Bayes Rule
- We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

9 Bayes Rule
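The equations on slides 5 to 9 did not survive in the transcript. A hedged reconstruction of the standard derivation the bullets describe, using the symbols T, W and the hat notation defined on slide 5 (the exact layout of the original slides is an assumption):

```latex
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
        = \operatorname*{argmax}_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \operatorname*{argmax}_{T} \underbrace{P(W \mid T)}_{\text{likelihood}}\; \underbrace{P(T)}_{\text{prior}}
```

The last step drops P(W) because it is the same for every candidate tag sequence T.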

10 Likelihood and prior

11 Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
3. The most probable tag sequence estimated by the bigram tagger

12 Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it
[Figure: the WORDS “the koala put the keys on the table”, each linked to one of the TAGS N, V, P, DET]

13 Likelihood and prior: Further Simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
- Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram
- Bigrams are used as the basis for simple statistical analysis of text
- The bigram assumption is related to the first-order Markov assumption

14 Likelihood and prior: Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger (using the bigram assumption)
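The formulas for the three simplifications were images and are missing from the transcript. A sketch of how they are usually written, assuming the same T, W notation as above:

```latex
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
\qquad
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
\qquad
\hat{T} \approx \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```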

15 Two kinds of probabilities (1)
- Tag transition probabilities p(t_i|t_{i-1})
  - Determiners are likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - But what do we expect P(DT|JJ) to be?

16 Two kinds of probabilities (1)
- Tag transition probabilities p(t_i|t_{i-1})
  - Compute P(NN|DT) by counting in a labeled corpus: the number of times DT is followed by NN, divided by the number of times DT occurs, i.e. P(NN|DT) = C(DT, NN) / C(DT)

17 Two kinds of probabilities (2)
- Word likelihood probabilities p(w_i|t_i)
  - P(is|VBZ) = probability of VBZ (3sg present verb) being “is”
  - If we were expecting a third person singular verb, how likely is it that this verb would be “is”?
  - Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ)
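The two counting recipes above can be sketched in a few lines of Python. This is only an illustration: the six-token corpus and the function names `transition_prob` and `emission_prob` are made up here, and no smoothing is applied.

```python
from collections import Counter

# Toy hand-labeled corpus (hypothetical data, for illustration only).
tagged = [("that", "DT"), ("flight", "NN"), ("is", "VBZ"),
          ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]

tags = [t for _, t in tagged]
tag_count = Counter(tags)                    # C(tag)
bigram_count = Counter(zip(tags, tags[1:]))  # C(prev_tag, tag)
word_tag_count = Counter(tagged)             # C(tag, word)

def transition_prob(prev_tag, tag):
    """MLE estimate P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return bigram_count[(prev_tag, tag)] / tag_count[prev_tag]

def emission_prob(word, tag):
    """MLE estimate P(word | tag) = C(tag, word) / C(tag)."""
    return word_tag_count[(word, tag)] / tag_count[tag]

p_nn_dt = transition_prob("DT", "NN")
p_is_vbz = emission_prob("is", "VBZ")
```

On a real corpus the same counts would come from a treebank, and unseen pairs would need smoothing.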

18 An Example: the verb “race”
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

19 Disambiguating “race”

20 Disambiguating “race”
- P(NN|TO) = .00047
- P(VB|TO) = .83
  The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: ‘How likely are we to expect verb/noun given the previous tag TO?’
- P(race|NN) = .00057
- P(race|VB) = .00012
  Lexical likelihoods from the Brown corpus for ‘race’ given a POS tag NN or VB.
- P(NR|VB) = .0027
- P(NR|NN) = .0012
  Tag sequence probability for the likelihood of an adverb occurring given the previous tag verb or noun.
- P(VB|TO)P(NR|VB)P(race|VB) = .00000027
- P(NN|TO)P(NR|NN)P(race|NN) = .00000000032
  Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins.
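The comparison can be checked directly. The sketch below simply multiplies the six probabilities quoted on the slide; the variable names are ours.

```python
# Probabilities quoted on the slide (Brown corpus estimates).
p_vb_to, p_nr_vb, p_race_vb = 0.83, 0.0027, 0.00012
p_nn_to, p_nr_nn, p_race_nn = 0.00047, 0.0012, 0.00057

# Score of each reading of "race": transition in, transition out, lexical likelihood.
score_vb = p_vb_to * p_nr_vb * p_race_vb
score_nn = p_nn_to * p_nr_nn * p_race_nn

best = "VB" if score_vb > score_nn else "NN"
print(f"VB: {score_vb:.2e}  NN: {score_nn:.2e}  best: {best}")
```

The verb reading is larger by roughly three orders of magnitude, which is what the slide reports.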

21 Hidden Markov Models
- What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let’s just spend a bit of time tying this into the model
- In order to define the HMM, we will first introduce the Markov chain, or observable Markov model

22 Definitions
- A weighted finite-state automaton adds probabilities to the arcs
  - The probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can’t represent inherently ambiguous problems
  - Useful for assigning probabilities to unambiguous sequences

23 Markov chain = “First-order observed Markov Model”
- A set of states Q = q_1, q_2 … q_N; the state at time t is q_t
- A set of transition probabilities A = a_01 a_02 … a_n1 … a_nn
  - Each a_ij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
- Distinguished start and end states
- Special initial probability vector π
  - π_i is the probability that the MM will start in state i; each π_i expresses the probability p(q_i|START)

24 Markov chain = “First-order observed Markov Model”
Markov chain for weather: Example 1
- Three types of weather: sunny, rainy, foggy
- We want to find the following conditional probabilities: P(q_n | q_{n-1}, q_{n-2}, …, q_1)
  - I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
  - We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
- Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 3^5 = 243 past histories

25 Markov chain = “First-order observed Markov Model”
- Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q_1, …, q_n, the current state depends only on the previous state, P(q_n | q_{n-1}, …, q_1) = P(q_n | q_{n-1})
- The joint probability of certain past and current observations then factorizes as P(q_1, …, q_n) = ∏_i P(q_i | q_{i-1})

26 Markov chain = “First-order observable Markov Model”

27 Markov chain = “First-order observed Markov Model”
- Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
- Using the Markov assumption and the probabilities in table 1, this translates into:

28 Markov chain for weather
- What is the probability of 4 consecutive rainy days?
- Sequence is rainy-rainy-rainy-rainy
- I.e., state sequence is 3-3-3-3
- P(3,3,3,3) = π_3 a_33 a_33 a_33 = 0.2 x (0.6)^3 = 0.0432
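A minimal sketch of the same computation, assuming only the two numbers used on the slide (initial probability 0.2 for rainy and self-transition probability 0.6); the function name `chain_prob` and the dictionary layout are illustrative choices.

```python
def chain_prob(states, pi, a):
    """P(q_1..q_n) = pi[q_1] * prod_t a[q_{t-1}][q_t] under the first-order Markov assumption."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= a[prev][cur]
    return p

pi = {"rainy": 0.2}            # initial probability of the rainy state (from the slide)
a = {"rainy": {"rainy": 0.6}}  # rainy -> rainy transition (from the slide)

p_four_rainy = chain_prob(["rainy"] * 4, pi, a)
```

Four days involve one initial probability and three transitions, which is why the exponent is 3.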

29 Hidden Markov Model
- For Markov chains, the output symbols are the same as the states
  - See sunny weather: we’re in state sunny
- But in part-of-speech tagging (and other things)
  - The output symbols are words
  - But the hidden states are part-of-speech tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
- This means we don’t know which state we are in

30 Markov chain for words
- Observed events: words
- Hidden events: tags

31 Hidden Markov Models
- States Q = q_1, q_2 … q_N
- Observations O = o_1, o_2 … o_N; each observation is a symbol from a vocabulary V = {v_1, v_2, … v_V}
- Transition probabilities (prior): transition probability matrix A = {a_ij}
- Observation likelihoods (likelihood): output probability matrix B = {b_i(o_t)}, a set of observation likelihoods, each expressing the probability of an observation o_t being generated from a state i (emission probabilities)
- Special initial probability vector π: π_i is the probability that the HMM will start in state i; each π_i expresses the probability p(q_i|START)

32 Assumptions
- Markov assumption: the probability of a particular state depends only on the previous state
- Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

33 HMM for Ice Cream
- You are a climatologist in the year 2799
- Studying global warming
- You can’t find any records of the weather in Boston, MA for summer of 2007
- But you find Jason Eisner’s diary
- Which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was

34 Noam task
- Given: ice cream observation sequence 1,2,3,2,2,2,3… (cp. with output symbols)
- Produce: weather sequence C,C,H,C,C,C,H… (cp. with hidden states, causing states)

35 HMM for ice cream

36 Different types of HMM structure
- Bakis = left-to-right
- Ergodic = fully-connected

37 HMM Taggers
- Two kinds of probabilities
  - A: transition probabilities (PRIOR)
  - B: observation likelihoods (LIKELIHOOD)
- HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

38 Weighted FSM corresponding to hidden states of HMM, showing A probs

39 B observation likelihoods for POS HMM

40 The A matrix for the POS HMM

41 The B matrix for the POS HMM

42 HMM Taggers
- The probabilities are trained on hand-labeled training corpora (training set)
- Combine different N-gram levels
- Evaluated by comparing their output on a test set to human labels for that test set (gold standard)

43 The Viterbi Algorithm
- What is the best tag sequence for "John likes to fish in the sea"?
- Efficiently computes the most likely state sequence given a particular output sequence
- Based on dynamic programming

44 A smaller example
[Figure: two-state automaton with states q and r between start and end. Transitions: q→q 0.3, q→r 0.7, r→q 0.5, r→r 0.5. Emissions: q emits b with 0.6 and a with 0.4; r emits b with 0.8 and a with 0.2.]
- What is the best sequence of states for the input string “bbba”?
- Computing all possible paths and finding the one with the max probability is exponential

45 A smaller example (cont’d)
- For each state, store the most likely sequence that could lead to it (and its probability)
- Path probability matrix: an array of states versus time (tags versus words) that stores the probability of being at each state at each time in terms of the probability of being in each state at the preceding time

Best sequence per state, by input read so far:
ε --> b
  leading to q: ε --> q 0.6 (1.0x0.6)
  leading to r: ε --> r 0 (0x0.8)
b --> bb
  leading to q, coming from q: q --> q 0.108 (0.6x0.3x0.6)
  leading to q, coming from r: r --> q 0 (0x0.5x0.6)
  leading to r, coming from q: q --> r 0.336 (0.6x0.7x0.8)
  leading to r, coming from r: r --> r 0 (0x0.5x0.8)
bb --> bbb
  leading to q, coming from q: qq --> q 0.01944 (0.108x0.3x0.6)
  leading to q, coming from r: qr --> q 0.1008 (0.336x0.5x0.6)
  leading to r, coming from q: qq --> r 0.06048 (0.108x0.7x0.8)
  leading to r, coming from r: qr --> r 0.1344 (0.336x0.5x0.8)
bbb --> bbba
  leading to q, coming from q: qrq --> q 0.012096 (0.1008x0.3x0.4)
  leading to q, coming from r: qrr --> q 0.02688 (0.1344x0.5x0.4)
  leading to r, coming from q: qrq --> r 0.014112 (0.1008x0.7x0.2)
  leading to r, coming from r: qrr --> r 0.01344 (0.1344x0.5x0.2)
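The table can be reproduced with a short Viterbi sketch. The transition and emission values below are read off the factors in the table (for example 0.6x0.3x0.6 for q to q on “b”); the start distribution (always begin in q) and the function name `viterbi` are assumptions made for this illustration.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (best state path, its probability) for the observation string obs."""
    # v[s]: probability of the best path that ends in state s after the symbols read so far
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        prev_v, v, bp = v, {}, {}
        for s in states:
            # best predecessor of s: maximize previous path probability times transition
            best_prev = max(states, key=lambda r: prev_v[r] * trans[r][s])
            v[s] = prev_v[best_prev] * trans[best_prev][s] * emit[s][o]
            bp[s] = best_prev
        backptrs.append(bp)
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(backptrs):  # follow back-pointers from the last symbol to the first
        path.append(bp[path[-1]])
    return path[::-1], v[last]

states = ["q", "r"]
start = {"q": 1.0, "r": 0.0}  # assumed: the automaton always starts in q
trans = {"q": {"q": 0.3, "r": 0.7}, "r": {"q": 0.5, "r": 0.5}}
emit = {"q": {"b": 0.6, "a": 0.4}, "r": {"b": 0.8, "a": 0.2}}

path, prob = viterbi("bbba", states, start, trans, emit)
```

Only two numbers per column are kept, so the cost grows linearly with the input length instead of exponentially.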

46 Viterbi intuition: we are looking for the best ‘path’
[Figure: lattice over states S1, S2, S3, S4, S5]
Slide from Dekang Lin

47 The Viterbi Algorithm

48 Intuition
- The value in each cell is computed by taking the MAX over all paths that lead to this cell
- An extension of a path from state i at time t-1 is computed by multiplying:
  - the previous path probability from the previous cell, viterbi[t-1, i]
  - the transition probability a_ij from previous state i to current state j
  - the observation likelihood b_j(o_t) that current state j matches observation symbol o_t
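Written as a single recurrence, with v_t(j) standing for the cell viterbi[t, j] (this shorthand is ours, not the slide's), the three factors combine as:

```latex
v_t(j) = \max_{1 \le i \le N} \; v_{t-1}(i)\, a_{ij}\, b_j(o_t)
```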

49 Viterbi example

50 Smoothing of probabilities
- Data sparseness is a problem when estimating probabilities based on corpus data
- The “add one” smoothing technique: P = (C + 1) / (N + B), where C = absolute frequency, N = number of training instances, B = number of different types
- Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams: P(t_3|t_1,t_2) = λ_1 P(t_3) + λ_2 P(t_3|t_2) + λ_3 P(t_3|t_1,t_2)
- The lambda values are automatically determined using a variant of the Expectation Maximization algorithm
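Both techniques are one-liners in code. The sketch below assumes the usual forms of add-one smoothing and linear interpolation; the function names and the example numbers are illustrative, not taken from the lecture.

```python
def add_one(count, n, b):
    """Add-one (Laplace) estimate: (C + 1) / (N + B)."""
    return (count + 1) / (n + b)

def interpolate(p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation of unigram, bigram and trigram estimates.

    lambdas = (l1, l2, l3) are assumed to be non-negative and to sum to 1.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# An unseen event (count 0) still receives a small non-zero probability.
p_unseen = add_one(0, n=100, b=10)

# An unseen trigram (p_tri = 0) falls back on the lower-order estimates.
p_mixed = interpolate(0.1, 0.2, 0.0, (0.2, 0.3, 0.5))
```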

51 Viterbi for POS tagging
Let:
  n = number of words in the sentence to tag (number of input tokens)
  T = number of tags in the tag set (number of states)
  vit = path probability matrix (viterbi)
    vit[i,j] = probability of being at state (tag) j at word i
  path = matrix to recover the nodes of the best path (best tag sequence)
    path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1
// Initialization
vit[1,PERIOD] := 1.0      // pretend that there is a period before our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD

52 Viterbi for POS tagging (cont’d)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do          // for all words in the sentence
  for all tags t_j do              // for all possible tags
    // store the max prob of the path
    vit[i+1,t_j] := max_{1≤k≤T} (vit[i,t_k] x P(w_{i+1}|t_j) x P(t_j|t_k))
    // store the actual state
    path[i+1,t_j] := argmax_{1≤k≤T} (vit[i,t_k] x P(w_{i+1}|t_j) x P(t_j|t_k))
  end
end
// Termination and path read-out
bestState_{n+1} := argmax_{1≤j≤T} vit[n+1,j]
for j := n to 1 step -1 do         // for all the words in the sentence
  bestState_j := path[j+1, bestState_{j+1}]
end
P(bestState_1, …, bestState_n) := max_{1≤j≤T} vit[n+1,j]
Legend: vit[i,t_k] = probability of best path leading to state t_k at word i; P(w_{i+1}|t_j) = emission probability; P(t_j|t_k) = state transition probability
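A runnable Python sketch of the pseudocode on slides 51 and 52. The dictionary-based probability tables, the function name `viterbi_tag`, and the three-tag toy model are assumptions made for illustration; missing table entries are treated as probability 0, so no smoothing is involved.

```python
def viterbi_tag(words, tags, p_trans, p_emit, start_tag="PERIOD"):
    """Return (best tag sequence, its probability) for words.

    p_trans[(prev, tag)] = P(tag | prev); p_emit[(word, tag)] = P(word | tag).
    """
    # Initialization: pretend a PERIOD precedes the sentence.
    vit = [{t: 1.0 if t == start_tag else 0.0 for t in tags}]
    path = [{}]
    for w in words:  # induction: one new column per word
        col, back = {}, {}
        for tj in tags:
            scores = {tk: vit[-1][tk] * p_trans.get((tk, tj), 0.0) for tk in tags}
            back[tj] = max(scores, key=scores.get)  # best previous tag
            col[tj] = scores[back[tj]] * p_emit.get((w, tj), 0.0)
        vit.append(col)
        path.append(back)
    best = max(vit[-1], key=vit[-1].get)  # termination
    prob = vit[-1][best]
    seq = [best]
    for i in range(len(words), 1, -1):  # path read-out via back-pointers
        seq.append(path[i][seq[-1]])
    return seq[::-1], prob

# Hypothetical toy model, for illustration only.
tags = ["PERIOD", "DT", "NN"]
p_trans = {("PERIOD", "DT"): 0.8, ("PERIOD", "NN"): 0.2,
           ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
           ("NN", "NN"): 0.5, ("NN", "DT"): 0.5}
p_emit = {("the", "DT"): 0.7, ("the", "NN"): 0.0001, ("flight", "NN"): 0.01}

seq, prob = viterbi_tag(["the", "flight"], tags, p_trans, p_emit)
```

Because the emission term does not depend on the previous tag, the sketch takes the argmax over the transition-weighted scores first and multiplies by the emission probability afterwards; this gives the same result as the max in the pseudocode.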

53 Possible improvements
- In bigram POS tagging, we condition a tag only on the preceding tag
- Why not...
  - use more context (e.g. use a trigram model), which is more precise:
    - “is clearly marked” --> verb, past participle
    - “he clearly marked” --> verb, past tense
  - combine trigram, bigram, unigram models
  - condition on words too
- But with an n-gram approach, this is too costly (too many parameters to model)

54 Next Time
- Minimum Edit Distance
- A “dynamic programming” algorithm
- A probabilistic version of this called “Viterbi” is a key part of the Hidden Markov Model!

55 Further issues with Markov Model tagging
- Unknown words are a problem since we don’t have the required probabilities. Possible solutions:
  - Assign the word probabilities based on the corpus-wide distribution of POS
  - Use morphological cues (capitalization, suffix) to make a more informed guess
- Using higher order Markov models:
  - Using a trigram model captures more context
  - However, data sparseness is much more of a problem

56 TnT
- Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000
- Underlying model: trigram modelling
  - The probability of a POS only depends on its two preceding POS
  - The probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else

57 Training
- Maximum likelihood estimates:
- Smoothing: context-independent variant of linear interpolation

58 Smoothing algorithm
- Set λ_1 = λ_2 = λ_3 = 0
- For each trigram t_1 t_2 t_3 with f(t_1,t_2,t_3) > 0, depending on the max of the following three values:
  - Case (f(t_1,t_2,t_3) - 1) / (f(t_1,t_2) - 1): increment λ_3 by f(t_1,t_2,t_3)
  - Case (f(t_2,t_3) - 1) / (f(t_2) - 1): increment λ_2 by f(t_1,t_2,t_3)
  - Case (f(t_3) - 1) / (N - 1): increment λ_1 by f(t_1,t_2,t_3)
- Normalize the λ_i
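A sketch of this smoothing algorithm (deleted interpolation) in Python. It assumes that a ratio with a zero denominator counts as 0 and that ties go to the lower-order model; both are choices made for this illustration, as are the function name and the toy tag sequence.

```python
from collections import Counter

def deleted_interpolation(tag_seq):
    """Estimate [lambda1, lambda2, lambda3] from a single sequence of tags."""
    uni = Counter(tag_seq)
    bi = Counter(zip(tag_seq, tag_seq[1:]))
    tri = Counter(zip(tag_seq, tag_seq[1:], tag_seq[2:]))
    n = len(tag_seq)

    def ratio(num, den):
        return num / den if den > 0 else 0.0  # assumed: zero denominator counts as 0

    lam = [0.0, 0.0, 0.0]  # weights for the unigram, bigram and trigram models
    for (t1, t2, t3), f in tri.items():
        cases = [
            ratio(uni[t3] - 1, n - 1),             # unigram evidence  -> lambda1
            ratio(bi[(t2, t3)] - 1, uni[t2] - 1),  # bigram evidence   -> lambda2
            ratio(f - 1, bi[(t1, t2)] - 1),        # trigram evidence  -> lambda3
        ]
        lam[cases.index(max(cases))] += f          # ties go to the lower-order model
    total = sum(lam)
    return [x / total for x in lam] if total else lam

# Hypothetical tag sequence, for illustration only.
lambdas = deleted_interpolation(["DT", "NN", "VB", "DT", "NN", "VB", "DT", "NN"])
```

Subtracting 1 from each count removes the trigram being examined from the evidence, so each weight reflects how well a model predicts events it has not seen.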

59 Evaluation of POS taggers
- Compared with gold standard of human performance
- Metric: accuracy = % of tags that are identical to the gold standard
- Most taggers: ~96-97% accuracy
- Must compare accuracy to:
  - ceiling (best possible results)
    - how do human annotators score compared to each other? (96-97%)
    - so systems are not bad at all!
  - baseline (worst possible results)
    - what if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%)
    - so anything less is really bad

60 More on tagger accuracy
- Is 95% good?
  - that’s 5 mistakes every 100 words
  - if, on average, a sentence is 20 words, that’s 1 mistake per sentence
- When comparing tagger accuracy, beware of:
  - size of training corpus: the bigger, the better the results
  - difference between training & testing corpora (genre, domain…): the closer, the better the results
  - size of tag set: prediction versus classification
  - unknown words: the more unknown words (not in dictionary), the worse the results

61 Error Analysis
- Look at a confusion matrix (contingency table)
- E.g. 4.4% of the total errors caused by mistagging VBD as VBN
- See what errors are causing problems
  - Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
  - Adverb (RB) vs Particle (RP) vs Prep (IN)
  - Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
- ERROR ANALYSIS IS ESSENTIAL!!!
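Accuracy and the error counts behind a confusion matrix can be collected in one pass. The sketch below uses a made-up five-token example; the function name `evaluate` is ours.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Return (accuracy, Counter of (gold_tag, predicted_tag) confusions)."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    accuracy = 1 - sum(errors.values()) / len(gold)
    return accuracy, errors

# Hypothetical gold and system tags, for illustration only.
gold = ["DT", "NN", "VBD", "VBN", "IN"]
pred = ["DT", "NN", "VBN", "VBN", "RP"]

acc, errors = evaluate(gold, pred)
worst = errors.most_common(1)  # the most frequent confusion pair
```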

62 Tag indeterminacy

63 Major difficulties in POS tagging
- Unknown words (proper names)
  - because we do not know the set of tags a word can take, and knowing this takes you a long way (cf. baseline POS tagger)
  - possible solutions:
    - assign all possible tags with a probability distribution identical to that of the lexicon as a whole
    - use morphological cues to infer possible tags, e.g. words ending in -ed are likely to be past tense verbs or past participles
- Frequently confused tag pairs
  - preposition vs particle: a hill (prep) / a bill (particle)
  - verb, past tense vs. past participle vs. adjective

64 Unknown Words
- Most-frequent-tag approach
- What about words that don’t appear in the training set?
- Suffix analysis: the probability distribution for a particular suffix is generated from all words in the training set that share the same suffix
- Suffix estimation: calculate the probability of a tag t given the last i letters of an n-letter word
- Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)
- Use a morphological analyzer to get the restriction on the possible tags
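A simplified sketch of suffix analysis. Note the substitution: instead of the smoothed successive abstraction mentioned above, it uses plain back-off from the longest known suffix to shorter ones. The function names and the four training words are made up for illustration.

```python
from collections import Counter, defaultdict

def suffix_tag_counts(tagged_words, max_len=3):
    """Count tags for every word-final suffix of up to max_len letters."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for i in range(1, min(max_len, len(word)) + 1):
            counts[word[-i:]][tag] += 1
    return counts

def guess_tag_dist(word, counts, max_len=3):
    """Return P(tag | longest known suffix of word), or {} if no suffix is known."""
    for i in range(min(max_len, len(word)), 0, -1):
        c = counts.get(word[-i:])
        if c:
            total = sum(c.values())
            return {t: n / total for t, n in c.items()}
    return {}

# Hypothetical training data, for illustration only.
train = [("walked", "VBD"), ("talked", "VBD"), ("wicked", "JJ"), ("table", "NN")]
counts = suffix_tag_counts(train)
dist = guess_tag_dist("jumped", counts)  # "ped" is unseen, so it backs off to "ed"
```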

65 Unknown words

66 Alternative graphical models for part of speech tagging

67 Different Models for POS tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

68 Hidden Markov Model (HMM): Generative Modeling
[Figure: a source model P(Y) generates y, which passes through a noisy channel P(X|Y) to produce x]

69 Dependency (1st order)

70 Disadvantage of HMMs (1)
- No rich feature information
  - Rich information is required
    - when x_k is complex
    - when data for x_k is sparse
- Example: POS tagging
  - How to evaluate P(w_k|t_k) for unknown words w_k?
  - Useful features
    - Suffix, e.g., -ed, -tion, -ing, etc.
    - Capitalization
- Generative model
  - Parameter estimation: maximize the joint likelihood of training examples

71 Generative Models
- Hidden Markov models (HMMs) and stochastic grammars
  - Assign a joint probability to paired observation and label sequences
  - The parameters are typically trained to maximize the joint likelihood of training examples

72 Generative Models (cont’d)
- Difficulties and disadvantages
  - Need to enumerate all possible observation sequences
  - Not practical to represent multiple interacting features or long-range dependencies of the observations
  - Very strict independence assumptions on the observations

73 Better Approach
- Discriminative model which models P(y|x) directly
- Maximize the conditional likelihood of training examples

74 Maximum Entropy modeling
- N-gram model: probabilities depend on the previous few tokens
- We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is “to”, whether one of the last 5 words is a preposition, etc.)
- Maxent combines these features in a probabilistic model
- The given features provide a constraint on the model
- We would like to have a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy these constraints

75 Maximum Entropy Markov Model
- Discriminative sub-models
  - Unify the two parameters in the generative model into one conditional model
    - Two parameters in the generative model: the parameter in the source model and the parameter in the noisy channel
    - Unified conditional model
  - Employ the maximum entropy principle
- Result: Maximum Entropy Markov Model

76 General Maximum Entropy Principle
- Model
  - Model the distribution P(Y|X) with a set of features {f_1, f_2, …, f_l} defined on X and Y
- Idea
  - Collect information about the features from training data
- Principle
  - Model what is known
  - Assume nothing else
    - Flattest distribution
    - Distribution with the maximum entropy

77 Example
- (Berger et al., 1996) example
  - Model the translation of the word “in” from English to French
    - Need to model P(word_French)
    - Constraints
      - 1: Possible translations: dans, en, à, au cours de, pendant
      - 2: “dans” or “en” used 30% of the time
      - 3: “dans” or “à” used 50% of the time

78 Features
- Features: 0-1 indicator functions
  - 1 if (x, y) satisfies a predefined condition
  - 0 if not
- Example: POS tagging
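The POS tagging example itself is not preserved in the transcript. As a hedged illustration of what such 0-1 indicator features might look like, the two functions below are invented for this sketch; the context dictionary `x` with a `"word"` key is an assumed representation.

```python
def f_suffix_ing_vbg(x, y):
    """1 if the current word ends in -ing and the proposed tag is VBG, else 0."""
    return int(x["word"].endswith("ing") and y == "VBG")

def f_capitalized_nnp(x, y):
    """1 if the current word is capitalized and the proposed tag is NNP, else 0."""
    return int(x["word"][:1].isupper() and y == "NNP")

# Evaluate the features on two hypothetical (context, tag) pairs.
fired = [f_suffix_ing_vbg({"word": "running"}, "VBG"),
         f_capitalized_nnp({"word": "running"}, "NNP")]
```

Each feature looks at the pair (x, y) jointly, which is what lets a maximum entropy model tie observation properties to specific tags.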

79 Constraints
- Empirical information: statistics from the training data T
- Expected value: from the distribution P(Y|X) we want to model
- Constraints: the expected value of each feature under the model must match its empirical value

80 Maximum Entropy: Objective
- Entropy
- Maximization problem

81 Dual Problem
- Dual problem
  - Conditional model
  - Maximum likelihood of conditional data
- Solution
  - Improved iterative scaling (IIS) (Berger et al. 1996)
  - Generalized iterative scaling (GIS) (McCallum et al. 2000)
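The formulas on slides 79 to 81 are missing from the transcript. The conditional model that solves the dual problem is usually written in the following exponential form; the weight symbols λ_i and the normalizer Z_λ(x) are the conventional notation and an assumption about what the slides showed:

```latex
P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_{i=1}^{l} \lambda_i f_i(x, y) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{l} \lambda_i f_i(x, y') \Big)
```

Maximizing entropy subject to the feature constraints and maximizing the conditional likelihood of this family lead to the same solution.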

82 Maximum Entropy Markov Model
- Use the maximum entropy approach to model 1st order dependencies
- Features
  - Basic features (like parameters in HMM)
    - Bigram (1st order) or trigram (2nd order) in source model
    - State-output pair feature (X_k = x_k, Y_k = y_k)
  - Advantage: incorporate other advanced features on (x_k, y_k)

83 HMM vs MEMM (1st order)
[Figure: graphical models of an HMM and a Maximum Entropy Markov Model (MEMM)]

84 Performance in POS Tagging
- POS tagging
  - Data set: WSJ
  - Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
  - 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
  - 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy

85 ME applications
- Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
  - P(POS tag | context)
  - Information sources
    - Word window (4)
    - Word features (prefix, suffix, capitalization)
    - Previous POS tags

86 ME applications
- Abbreviation expansion (Pakhomov, 2002)
  - Information sources
    - Word window (4)
    - Document title
- Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)
  - Information sources
    - Word window (4)
    - Structurally related words (4)
- Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)
  - Information sources
    - Token features (prefix, suffix, capitalization, abbreviation)
    - Word window (2)

87 Solution
- Global optimization
  - Optimize parameters in a global model simultaneously, not in sub-models separately
- Alternatives
  - Conditional random fields
  - Application of the perceptron algorithm

88 Why ME?
- Advantages
  - Combine multiple knowledge sources
    - Local
      - Word prefix, suffix, capitalization (POS, Ratnaparkhi, 1996)
      - Word POS, POS class, suffix (WSD, Chao & Dyer, 2002)
      - Token prefix, suffix, capitalization, abbreviation (sentence boundary, Reynar & Ratnaparkhi, 1997)
    - Global
      - N-grams (Rosenfeld, 1997)
      - Word window
      - Document title (Pakhomov, 2002)
      - Structurally related words (Chao & Dyer, 2002)
      - Sentence length, conventional lexicon (Och & Ney, 2002)
  - Combine dependent knowledge sources

89 Why ME?
- Advantages
  - Add additional knowledge sources
  - Implicit smoothing
- Disadvantages
  - Computational
    - Expected value at each iteration
    - Normalizing constant
  - Overfitting
    - Feature selection
      - Cutoffs
      - Basic Feature Selection (Berger et al., 1996)

90 Conditional Models
- Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  - Specify the probability of possible label sequences given an observation sequence
- Allow arbitrary, non-independent features on the observation sequence X
- The probability of a transition between labels may depend on past and future observations
  - Relax strong independence assumptions in generative models

91 Discriminative Models: Maximum Entropy Markov Models (MEMMs)
- Exponential model
- Given a training set X with label sequence Y:
  - Train a model θ that maximizes P(Y|X, θ)
  - For a new data sequence x, the predicted label y maximizes P(y|x, θ)
  - Notice the per-state normalization

92 MEMMs (cont’d)
- MEMMs have all the advantages of conditional models
- Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)
- Subject to the label bias problem
  - Bias toward states with fewer outgoing transitions

93 Label Bias Problem
Consider this MEMM:
- P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
- In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
- Per-state normalization does not allow the required expectation
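The effect can be seen with a toy per-state softmax. The scores below are arbitrary numbers chosen for illustration; the point is only that a state with a single successor assigns it probability 1 whatever the observation score is.

```python
import math

def next_state_probs(scores):
    """Per-state normalization: softmax over the successors of one state."""
    z = sum(math.exp(s) for s in scores.values())
    return {state: math.exp(s) / z for state, s in scores.items()}

# State 1 has a single successor (state 2), so the observation score is irrelevant.
p_good_fit = next_state_probs({2: 5.0})   # observation fits the transition well
p_bad_fit = next_state_probs({2: -5.0})   # observation fits the transition badly

# A state with two successors can still react to the observation.
p_two_way = next_state_probs({2: 5.0, 3: -5.0})
```

Both single-successor cases come out identical, which is exactly the behaviour the derivation above describes.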

94 Solve the Label Bias Problem
- Change the state-transition structure of the model
  - Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
  - Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)

95 Random Field

96 Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
  - An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
  - A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allow some transitions to “vote” more strongly than others depending on the corresponding observations

97 Definition of CRFs
- X is a random variable over data sequences to be labeled
- Y is a random variable over corresponding label sequences

98 Example of CRFs

99 Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical models of an HMM, an MEMM and a CRF]

100 Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:
- x is a data sequence
- y is a label sequence
- v is a vertex from the vertex set V = set of label random variables
- e is an edge from the edge set E over V
- f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
- k is the number of features
- λ_k and μ_k are parameters to be estimated
- y|_e is the set of components of y defined by edge e
- y|_v is the set of components of y defined by vertex v

101 Conditional Distribution (cont’d)
- CRFs use the observation-dependent normalization Z(x) for the conditional distributions
- Z(x) is a normalization over the data sequence x
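The distribution itself was an image and is missing here. A reconstruction in the notation of slide 100, following the usual formulation of linear-chain and tree-structured CRFs (treat the exact form on the original slide as an assumption):

```latex
p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big(
    \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x)
  + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(
    \sum_{e \in E,\, k} \lambda_k f_k(e, y'|_e, x)
  + \sum_{v \in V,\, k} \mu_k g_k(v, y'|_v, x) \Big)
```

Because Z(x) sums over whole label sequences rather than over the successors of a single state, score mass is not forced to be conserved locally.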

102 Parameter Estimation for CRFs
- The paper provided iterative scaling algorithms
- These turn out to be very inefficient
- Prof. Dietterich’s group applied a gradient descent algorithm, which is quite efficient

103 Training of CRFs (From Prof. Dietterich)
- First, we take the log of the equation
- Then, take the derivative of the above equation
- For training, the first two terms are easy to get
  - For example, for each k, f_k is a sequence of Boolean numbers, such as 00101110100111; the corresponding sum is just the total number of 1’s in the sequence
- The hardest thing is how to calculate Z(x)

104 Training of CRFs (From Prof. Dietterich) (cont’d)
[Figure: chain y_1, y_2, y_3, y_4 with maximal cliques c_1, c_2, c_3]

105 POS tagging Experiments

106 POS tagging Experiments (cont’d)
- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- oov = out-of-vocabulary (not observed in the training set)

107 Summary
- Discriminative models with per-state normalization, such as MEMMs, are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well, and demonstrate good performance

