
1 CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 27 – SMT Assignment; HMM recap; Probabilistic Parsing contd.) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 17th March 2011

2 CMU Pronunciation Dictionary Assignment

3 Data: The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. The current phoneme set contains 39 phonemes.
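For concreteness, here is a minimal sketch (not part of the original slides) of loading the dictionary into a Python map, assuming the standard cmudict plain-text format in which comment lines start with ";;;" and each entry line is a word followed by its phoneme sequence:

    from typing import Dict, List

    def load_cmudict(path: str) -> Dict[str, List[str]]:
        """Load cmudict entries as word -> phoneme list, e.g. "HUT" -> ["HH", "AH", "T"]."""
        pronunciations = {}
        # The distributed dictionary file is not pure ASCII; latin-1 is assumed here.
        with open(path, encoding="latin-1") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(";;;"):   # skip blanks and comment lines
                    continue
                word, *phones = line.split()
                pronunciations[word] = phones
        return pronunciations

    # Example (the file name is hypothetical): entries = load_cmudict("cmudict-0.7b")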

4 "Parallel" Corpus

    Phoneme   Example   Translation
    -------   -------   -----------
    AA        odd       AA D
    AE        at        AE T
    AH        hut       HH AH T
    AO        ought     AO T
    AW        cow       K AW
    AY        hide      HH AY D
    B         be        B IY

5 "Parallel" Corpus (contd.)

    Phoneme   Example   Translation
    -------   -------   -----------
    CH        cheese    CH IY Z
    D         dee       D IY
    DH        thee      DH IY
    EH        Ed        EH D
    ER        hurt      HH ER T
    EY        ate       EY T
    F         fee       F IY
    G         green     G R IY N
    HH        he        HH IY
    IH        it        IH T
    IY        eat       IY T
    JH        gee       JH IY

6 The tasks
- First obtain the Carnegie Mellon University Pronouncing Dictionary.
- Create the phrase table using GIZA++.
- Use SRILM for language modelling.
- Use Moses for decoding.
- Calculate precision, recall and F-score (a scoring sketch follows below).
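The slides do not fix an evaluation script; the following is a minimal sketch, assuming phoneme-level scoring in which each predicted transcription is compared to the reference as a multiset of phonemes (other conventions, e.g. position-sensitive matching, are equally possible):

    from collections import Counter

    def phoneme_prf(reference, hypothesis):
        """Phoneme-level precision, recall and F-score for one word.

        reference, hypothesis: lists of phoneme symbols, e.g. ["HH", "AH", "T"].
        Matches are counted on multisets of phonemes (position-insensitive);
        this is only one possible scoring convention.
        """
        ref_counts, hyp_counts = Counter(reference), Counter(hypothesis)
        overlap = sum((ref_counts & hyp_counts).values())
        precision = overlap / len(hypothesis) if hypothesis else 0.0
        recall = overlap / len(reference) if reference else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall > 0 else 0.0)
        return precision, recall, f_score

    # Example: phoneme_prf(["HH", "AH", "T"], ["HH", "AH", "D"]) -> (0.667, 0.667, 0.667)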

7 Probabilistic Parsing

8 Bridging Classical and Probabilistic Parsing
The bridge between probabilistic parsing and classical parsing is the concept of domination.
Frequency: P(NP -> DT NN) = 0.5 means that, in the corpus, 50% of noun phrases are composed of a determiner followed by a noun.
Phenomenon: P(NP -> DT NN) is actually P(DT NN | NP), i.e., the joint probability of domination by DT and NN giving rise to domination by NP.
The concept of domination is thus the bridge between frequency (probabilistic parsing) and phenomenon (classical parsing).

9 Calculating Probability of a Sentence
We can calculate P(s = w_1m) either with a naive n-gram based approach or by summing the probabilities of the parse trees of the sentence, P(s) = Σ_t P(s, t). Which approach to choose?
Example: "The velocity of waves rises near the shore." A plural noun immediately followed by a singular verb ("waves rises") is unlikely in the corpus, so the n-gram model gives the sentence a low probability value.

10 Parse Tree
[Figure: the parse tree of "The velocity of waves rises near the shore", with S → NP VP, the NP covering "the velocity of waves" (DT NN PP) and the VP covering "rises near the shore" (V PP).] No other parse tree is possible for the sentence.

11 Various ways to calculate the probability of a sentence
- Naive n-gram based
- Syntactic level (parse tree)
- Semantic level
- Pragmatics
- Discourse

12 Probabilistic Context Free Grammars

    S   → NP VP      1.0
    NP  → DT NN      0.5
    NP  → NNS        0.3
    NP  → NP PP      0.2
    PP  → P NP       1.0
    VP  → VP PP      0.6
    VP  → VBD NP     0.4
    DT  → the        1.0
    NN  → gunman     0.5
    NN  → building   0.5
    VBD → sprayed    1.0
    NNS → bullets    1.0

13 Example Parse t1
The gunman sprayed the building with bullets. (PP "with bullets" attached to the VP)
[S_1.0 [NP_0.5 [DT_1.0 The] [NN_0.5 gunman]] [VP_0.6 [VP_0.4 [VBD_1.0 sprayed] [NP_0.5 [DT_1.0 the] [NN_0.5 building]]] [PP_1.0 [P_1.0 with] [NP_0.3 [NNS_1.0 bullets]]]]]
P(t1) = 1.0 * 0.5 * 1.0 * 0.5 * 0.6 * 0.4 * 1.0 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0045 (the product of the rule probabilities used in the tree)

14 Another Parse t2
The gunman sprayed the building with bullets. (PP "with bullets" attached to the object NP "the building")
[S_1.0 [NP_0.5 [DT_1.0 The] [NN_0.5 gunman]] [VP_0.4 [VBD_1.0 sprayed] [NP_0.2 [NP_0.5 [DT_1.0 the] [NN_0.5 building]] [PP_1.0 [P_1.0 with] [NP_0.3 [NNS_1.0 bullets]]]]]]
P(t2) = 1.0 * 0.5 * 1.0 * 0.5 * 0.4 * 1.0 * 0.2 * 0.5 * 1.0 * 0.5 * 1.0 * 1.0 * 0.3 * 1.0 = 0.0015
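As a quick check, here is a short sketch (not from the slides) that recomputes these two parse probabilities by multiplying the probabilities of the rules used in each tree; the lexical rule P → with is assumed to have probability 1.0, as in the trees above:

    import math

    # PCFG rule probabilities from slide 12, plus the lexical rule P -> with
    # (assumed to be 1.0, as used in both parse trees).
    RULE_PROB = {
        ("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.5,
        ("NP", ("NNS",)): 0.3, ("NP", ("NP", "PP")): 0.2,
        ("PP", ("P", "NP")): 1.0, ("VP", ("VP", "PP")): 0.6,
        ("VP", ("VBD", "NP")): 0.4, ("DT", ("the",)): 1.0,
        ("NN", ("gunman",)): 0.5, ("NN", ("building",)): 0.5,
        ("VBD", ("sprayed",)): 1.0, ("NNS", ("bullets",)): 1.0,
        ("P", ("with",)): 1.0,
    }

    def tree_probability(rules_used):
        """Multiply the probabilities of all rules used in a derivation."""
        return math.prod(RULE_PROB[r] for r in rules_used)

    # Rules shared by both trees, then the attachment-specific rules.
    COMMON = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
              ("NN", ("gunman",)), ("VBD", ("sprayed",)), ("NP", ("DT", "NN")),
              ("DT", ("the",)), ("NN", ("building",)), ("PP", ("P", "NP")),
              ("P", ("with",)), ("NP", ("NNS",)), ("NNS", ("bullets",))]
    t1 = COMMON + [("VP", ("VP", "PP")), ("VP", ("VBD", "NP"))]   # PP attached to VP
    t2 = COMMON + [("VP", ("VBD", "NP")), ("NP", ("NP", "PP"))]   # PP attached to NP

    print(tree_probability(t1))  # approximately 0.0045
    print(tree_probability(t2))  # approximately 0.0015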

15 Probability of a parse tree (contd.)
[Figure: parse tree with root S_{1,l}; children NP_{1,2} (→ DT_{1,1} w_1, N_{2,2} w_2) and VP_{3,l} (→ V_{3,3} w_3, PP_{4,l} (→ P_{4,4} w_4, NP_{5,l} w_{5..l})).]
P(t | s) = P(t | S_{1,l})
= P(NP_{1,2}, DT_{1,1}, w_1, N_{2,2}, w_2, VP_{3,l}, V_{3,3}, w_3, PP_{4,l}, P_{4,4}, w_4, NP_{5,l}, w_{5..l} | S_{1,l})
= P(NP_{1,2}, VP_{3,l} | S_{1,l}) * P(DT_{1,1}, N_{2,2} | NP_{1,2}) * P(w_1 | DT_{1,1}) * P(w_2 | N_{2,2}) * P(V_{3,3}, PP_{4,l} | VP_{3,l}) * P(w_3 | V_{3,3}) * P(P_{4,4}, NP_{5,l} | PP_{4,l}) * P(w_4 | P_{4,4}) * P(w_{5..l} | NP_{5,l})
(using the chain rule, context-freeness and ancestor-freeness)

16 HMM ↔ PCFG
- O, observed sequence ↔ w_1m, sentence
- X, state sequence ↔ t, parse tree
- μ, model ↔ G, grammar
Three fundamental questions:

17 HMM ↔ PCFG
- How likely is a certain observation given the model? ↔ How likely is a sentence given the grammar?
- How to choose a state sequence which best explains the observations? ↔ How to choose a parse which best supports the sentence?

18 HMM ↔ PCFG
- How to choose the model parameters that best explain the observed data? ↔ How to choose rule probabilities which maximize the probabilities of the observed sentences?

19 Recap of HMM

20 HMM Definition
- Set of states S, where |S| = N
- Start state S_0 /* P(S_0) = 1 */
- Output alphabet O, where |O| = M
- Transition probabilities A = {a_ij} /* from state i to state j */
- Emission probabilities B = {b_j(o_k)} /* probability of emitting or absorbing o_k from state j */
- Initial state probabilities Π = {p_1, p_2, p_3, ..., p_N}, where each p_i = P(o_0 = ε, S_i | S_0)

21 Markov Processes: Properties
- Limited horizon: given the previous k states, the state at time t is independent of the earlier states (states 0 to t-k-1): P(X_t = i | X_{t-1}, X_{t-2}, ..., X_0) = P(X_t = i | X_{t-1}, X_{t-2}, ..., X_{t-k}) for an order-k Markov process.
- Time invariance (shown for k = 1): P(X_t = i | X_{t-1} = j) = P(X_1 = i | X_0 = j) = ... = P(X_n = i | X_{n-1} = j)

22 Three basic problems (contd.)
- Problem 1: Likelihood of a sequence — Forward procedure, Backward procedure
- Problem 2: Best state sequence — Viterbi algorithm
- Problem 3: Re-estimation — Baum-Welch (forward-backward algorithm)

23 Probabilistic Inference
O: observation sequence; S: state sequence.
Given O, find S* = argmax_S P(S | O). This is called probabilistic inference: inferring the "hidden" from the "observed".
How is this inference different from logical inference based on propositional or predicate calculus?

24 Essentials of the Hidden Markov Model
1. Markov assumption + naive Bayes assumption
2. Uses both transition and observation probabilities
3. Effectively makes the Hidden Markov Model a Finite State Machine (FSM) with probabilities

25 Probability of the Observation Sequence
P(O) = Σ_S P(O, S), summed over all state sequences S. Without any restriction, the search space size is |S|^|O|.

26 Continuing with the Urn example: colored ball choosing
- Urn 1: 30 Red, 50 Green, 20 Blue
- Urn 2: 10 Red, 40 Green, 50 Blue
- Urn 3: 60 Red, 10 Green, 30 Blue

27 Example (contd.)
Transition probabilities A (row = current urn, column = next urn; the missing U2 entry is completed so that the row sums to 1):

         U1    U2    U3
    U1   0.1   0.4   0.5
    U2   0.6   0.2   0.2
    U3   0.3   0.4   0.3

Observation/output probabilities B:

         R     G     B
    U1   0.3   0.5   0.2
    U2   0.1   0.4   0.5
    U3   0.6   0.1   0.3

Given the observation sequence RRGGBRGR, what is the corresponding state sequence?

28 Diagrammatic representation (1/2)
[Figure: the three urns U1, U2, U3 drawn as states, with the transition probabilities from table A on the arcs between them and the output probabilities for R, G, B from table B attached to each state.]

29 Diagrammatic representation (2/2)
[Figure: an equivalent diagram in which the colour probabilities are moved onto the arcs: the arc from U_i to U_j carries, for each colour c, the combined probability a_ij * b_j(c), e.g. R 0.02, G 0.08, B 0.10 on one arc and R 0.18, G 0.03, B 0.09 on another.]

30 Probabilistic FSM
[Figure: a two-state probabilistic FSM with states S1 and S2; each arc is labelled with an output symbol and its probability, e.g. (a1 : 0.3), (a2 : 0.4), (a1 : 0.2), (a2 : 0.3), (a1 : 0.1), (a2 : 0.2), (a1 : 0.3), (a2 : 0.2).]
The question here is: what is the most likely state sequence given the output sequence seen?

31 Developing the tree
[Figure: the tree of state sequences for the output symbols a1, a2, ... Starting from Start with probabilities 1.0 (S1) and 0.0 (S2); after a1 the surviving path probabilities are 1 * 0.1 = 0.1 (S1) and 0.3 (S2), and after a2 the candidate products are 0.1 * 0.2 = 0.02, 0.1 * 0.4 = 0.04, 0.3 * 0.3 = 0.09 and 0.3 * 0.2 = 0.06, and so on.]
Choose the winning sequence per state per iteration.

32 Tree structure (contd.)
[Figure: the tree continued for the next symbols; for example, the surviving path of probability 0.09 is extended as 0.09 * 0.1 = 0.009, and the deepest level shows leaf probabilities 0.0081, 0.0054, 0.0048 and 0.0024.]
The problem being addressed by this tree: given that a1-a2-a1-a2 is the output sequence and μ the model (or the machine), find the best state sequence.

33 Viterbi Algorithm for the Urn problem (first two symbols)
[Figure: the Viterbi trellis for the first two observations ε and R. From the start state S0 the probabilities of moving to U1, U2, U3 are 0.5, 0.3, 0.2; after observing R, products are computed along each arc (values such as 0.03, 0.08, 0.15, 0.06, 0.02, 0.18, 0.24 appear in the trellis) and the winning sequence per state is marked with *.]
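A compact sketch (not from the slides) of Viterbi decoding on this urn HMM, assuming the initial distribution Π = (0.5, 0.3, 0.2) read off the trellis above and the A table as reconstructed on slide 27 (U2 row completed to sum to 1):

    # Viterbi decoding for the 3-urn HMM.
    STATES = ["U1", "U2", "U3"]
    PI = {"U1": 0.5, "U2": 0.3, "U3": 0.2}
    A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
         "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
         "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
    B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
         "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
         "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

    def viterbi(observations):
        """Return (best probability, best state sequence) for the observation list."""
        # delta[s] = probability of the best path ending in state s; psi stores back-pointers.
        delta = {s: PI[s] * B[s][observations[0]] for s in STATES}
        back_pointers = []
        for obs in observations[1:]:
            psi, new_delta = {}, {}
            for s in STATES:
                best_prev = max(STATES, key=lambda p: delta[p] * A[p][s])
                psi[s] = best_prev
                new_delta[s] = delta[best_prev] * A[best_prev][s] * B[s][obs]
            delta = new_delta
            back_pointers.append(psi)
        # Trace back the winning sequence from the best final state.
        last = max(STATES, key=lambda s: delta[s])
        path = [last]
        for psi in reversed(back_pointers):
            path.append(psi[path[-1]])
        return delta[last], list(reversed(path))

    print(viterbi(list("RRGGBRGR")))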

34 Markov process of order > 1 (say 2)
The same theory works.
Obs:   ε   R   R   G   G   B   R   G   R
       O_0 O_1 O_2 O_3 O_4 O_5 O_6 O_7 O_8
State: S_0 S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9
P(S) . P(O|S) = P(O_0|S_0) . P(S_1|S_0) . [P(O_1|S_1) . P(S_2|S_1 S_0)] . [P(O_2|S_2) . P(S_3|S_2 S_1)] . [P(O_3|S_3) . P(S_4|S_3 S_2)] . [P(O_4|S_4) . P(S_5|S_4 S_3)] . [P(O_5|S_5) . P(S_6|S_5 S_4)] . [P(O_6|S_6) . P(S_7|S_6 S_5)] . [P(O_7|S_7) . P(S_8|S_7 S_6)] . [P(O_8|S_8) . P(S_9|S_8 S_7)]
We introduce S_0 and S_9 as the initial and final states respectively. After S_8 the next state is S_9 with probability 1, i.e., P(S_9|S_8 S_7) = 1. O_0 is the ε-transition.

35 Adjustments
- The transition probability table will have state tuples on the rows and states on the columns.
- The output probability table will remain the same.
- In the Viterbi tree, the second-order Markov process will take effect from the 3rd input symbol (ε R R).
- There will be 27 leaves, out of which only 9 will remain.
- Sequences ending in the same tuple will be compared.
- Instead of U1, U2 and U3, the states are U1U1, U1U2, U1U3, U2U1, U2U2, U2U3, U3U1, U3U2, U3U3 (a small sketch of this pair-state construction follows below).
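A minimal illustration (not from the slides) of this adjustment: a second-order transition table, indexed by a state pair and a next state, is rewritten as a first-order table over the nine pair-states. The table A2 below is a hypothetical placeholder, not estimated from any data:

    from itertools import product

    STATES = ["U1", "U2", "U3"]

    # Hypothetical second-order transition table: A2[(s_prev2, s_prev1)][s_next].
    # In practice these values would be estimated from data; here they are uniform placeholders.
    A2 = {pair: {s: 1.0 / len(STATES) for s in STATES}
          for pair in product(STATES, STATES)}

    # Equivalent first-order table over pair-states:
    # a pair-state (x, y) can only move to a pair-state (y, z), with probability A2[(x, y)][z].
    PAIR_STATES = list(product(STATES, STATES))
    A_pairs = {}
    for (x, y) in PAIR_STATES:
        A_pairs[(x, y)] = {(y, z): A2[(x, y)][z] for z in STATES}

    # Ordinary first-order Viterbi can now be run over the 9 pair-states;
    # the emission for pair-state (x, y) is the emission of its most recent state y.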

36 Forward and Backward Probability Calculation

37 Forward probability F(k, i)
Define F(k, i) = probability of being in state S_i having seen o_0 o_1 o_2 ... o_k, i.e., F(k, i) = P(o_0 o_1 o_2 ... o_k, S_i).
With m as the length of the observed sequence:
P(observed sequence) = P(o_0 o_1 o_2 ... o_m) = Σ_{p=0..N} P(o_0 o_1 o_2 ... o_m, S_p) = Σ_{p=0..N} F(m, p)

38 Forward probability (contd.)
F(k, q) = P(o_0 o_1 o_2 ... o_k, S_q)
= P(o_0 o_1 o_2 ... o_{k-1}, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 ... o_{k-1}, S_p, o_k, S_q)
= Σ_{p=0..N} P(o_0 o_1 o_2 ... o_{k-1}, S_p) . P(o_k, S_q | o_0 o_1 o_2 ... o_{k-1}, S_p)
= Σ_{p=0..N} F(k-1, p) . P(o_k, S_q | S_p)
= Σ_{p=0..N} F(k-1, p) . P(S_p --o_k--> S_q)
[Figure: observations O_0 O_1 O_2 O_3 ... O_k O_{k+1} ... O_{m-1} O_m aligned with states S_0 S_1 S_2 S_3 ... S_p S_q ... S_m S_final; the arc from S_p to S_q emits o_k.]

39 Backward probability B(k, i)
Define B(k, i) = probability of seeing o_k o_{k+1} o_{k+2} ... o_m given that the state was S_i, i.e., B(k, i) = P(o_k o_{k+1} o_{k+2} ... o_m | S_i).
With m as the length of the observed sequence:
P(observed sequence) = P(o_0 o_1 o_2 ... o_m) = P(o_0 o_1 o_2 ... o_m | S_0) = B(0, 0)

40 Backward probability (contd.)
B(k, p) = P(o_k o_{k+1} o_{k+2} ... o_m | S_p)
= P(o_{k+1} o_{k+2} ... o_m, o_k | S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} ... o_m, o_k, S_q | S_p)
= Σ_{q=0..N} P(o_k, S_q | S_p) . P(o_{k+1} o_{k+2} ... o_m | o_k, S_q, S_p)
= Σ_{q=0..N} P(o_{k+1} o_{k+2} ... o_m | S_q) . P(o_k, S_q | S_p)
= Σ_{q=0..N} B(k+1, q) . P(S_p --o_k--> S_q)
[Figure: same observation/state alignment as before, with the arc from S_p to S_q emitting o_k.]
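A short sketch (not from the slides) of both recurrences on the urn HMM. For simplicity the ε-step out of S_0 is folded into the initial distribution, and the backward function uses the common convention B'(k, p) = P(o_{k+1} ... o_m | S_p), which differs from the slides' B(k, p) by one observation; either way the same total P(O) is recovered:

    # Forward/backward sketch on the urn HMM (same PI, A, B as in the Viterbi sketch).
    STATES = ["U1", "U2", "U3"]
    PI = {"U1": 0.5, "U2": 0.3, "U3": 0.2}
    A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
         "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
         "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
    B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
         "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
         "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

    def forward(obs):
        # F[k][q] = P(o_0 .. o_k, state q at time k)
        F = [{q: PI[q] * B[q][obs[0]] for q in STATES}]
        for o in obs[1:]:
            F.append({q: sum(F[-1][p] * A[p][q] for p in STATES) * B[q][o] for q in STATES})
        return F

    def backward(obs):
        # Bk[k][p] = P(o_{k+1} .. o_m | state p at time k)
        Bk = [{p: 1.0 for p in STATES}]
        for o in reversed(obs[1:]):
            Bk.insert(0, {p: sum(A[p][q] * B[q][o] * Bk[0][q] for q in STATES) for p in STATES})
        return Bk

    obs = list("RRGGBRGR")
    F, Bk = forward(obs), backward(obs)
    total_forward = sum(F[-1].values())
    total_backward = sum(PI[p] * B[p][obs[0]] * Bk[0][p] for p in STATES)
    print(total_forward, total_backward)  # both values equal P(O) and should match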

41 Back to PCFG

42 Interesting Probabilities
The gunman sprayed the building with bullets (word positions 1 2 3 4 5 6 7).
- Inside probability: what is the probability of having an NP at this position (spanning positions 4-5) such that it will derive "the building"?
- Outside probability: what is the probability of starting from N^1 and deriving "The gunman sprayed", an NP, and "with bullets"?

43 Interesting Probabilities
Random variables to be considered:
- The non-terminal being expanded, e.g., NP.
- The word-span covered by the non-terminal, e.g., (4, 5) refers to the words "the building".
While calculating probabilities, consider:
- The rule to be used for expansion, e.g., NP → DT NN.
- The probabilities associated with the RHS non-terminals, e.g., the DT subtree's and the NN subtree's inside/outside probabilities.

44 Outside Probability
α_j(p, q): the probability of beginning with N^1 and generating the non-terminal N^j_pq and all the words outside w_p ... w_q, i.e., α_j(p, q) = P(w_1 ... w_{p-1}, N^j_pq, w_{q+1} ... w_m | G).
[Figure: N^1 at the root spanning w_1 ... w_m, with N^j covering w_p ... w_q.]

45 Inside Probabilities
β_j(p, q): the probability of generating the words w_p ... w_q starting with the non-terminal N^j_pq, i.e., β_j(p, q) = P(w_p ... w_q | N^j_pq, G).
[Figure: N^1 at the root; N^j covers w_p ... w_q.]

46 Outside & Inside Probabilities: example
For "The gunman sprayed the building with bullets" (positions 1-7), with an NP spanning "the building" (positions 4-5):
- Inside: β_NP(4, 5) = P(the building | NP_{4,5}, G)
- Outside: α_NP(4, 5) = P(The gunman sprayed, NP_{4,5}, with bullets | G)

47 Calculating Inside probabilities β_j(p, q)
Base case: β_j(k, k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G).
The base case is used for rules which derive the words (terminals) directly. E.g., if N^j = NN is being considered and NN → building is one of the rules with probability 0.5, then β_NN(5, 5) = 0.5.

48 Induction Step (assuming the grammar is in Chomsky Normal Form)
β_j(p, q) = Σ_{r,s} Σ_{d=p..q-1} P(N^j → N^r N^s) * β_r(p, d) * β_s(d+1, q)
That is:
- Consider different splits of the words, indicated by the split point d (e.g., splitting "the huge building" after the first or after the second word).
- Consider the different non-terminals that can be used in the rule: NP → DT NN and NP → DT NNS are available options.
- Sum over all of these.

49 The Bottom-Up Approach
The idea of induction: consider "the gunman".
Base cases, applying the unary (lexical) rules:
- DT → the, prob = 1.0
- NN → gunman, prob = 0.5
Induction: the probability that an NP covers these two words
= P(NP → DT NN) * P(DT deriving the word "the") * P(NN deriving the word "gunman")
= 0.5 * 1.0 * 0.5 = 0.25
[Figure: NP_0.5 over DT_1.0 ("The") and NN_0.5 ("gunman").] A CYK-style sketch of this bottom-up computation follows below.
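A bottom-up (CYK-style) sketch of the inside-probability computation for the example grammar; it is an illustration, not the slides' code. The lexical rule P → with is assumed to have probability 1.0, and the unary chain NP → NNS → bullets is folded into a single lexical entry NP → bullets with probability 0.3 * 1.0:

    from collections import defaultdict

    LEXICAL = {                       # word -> list of (non-terminal, probability)
        "the": [("DT", 1.0)],
        "gunman": [("NN", 0.5)],
        "building": [("NN", 0.5)],
        "sprayed": [("VBD", 1.0)],
        "with": [("P", 1.0)],         # assumed lexical rule
        "bullets": [("NNS", 1.0), ("NP", 0.3)],   # NP -> NNS -> bullets folded in
    }
    BINARY = [                        # (LHS, left child, right child, probability)
        ("S", "NP", "VP", 1.0),
        ("NP", "DT", "NN", 0.5),
        ("NP", "NP", "PP", 0.2),
        ("PP", "P", "NP", 1.0),
        ("VP", "VP", "PP", 0.6),
        ("VP", "VBD", "NP", 0.4),
    ]

    def inside_probabilities(words):
        """beta[(p, q)][N] = P(words p..q | N), with 1-based inclusive spans."""
        n = len(words)
        beta = defaultdict(lambda: defaultdict(float))
        for k, w in enumerate(words, start=1):            # base case: lexical rules
            for nt, prob in LEXICAL[w.lower()]:
                beta[(k, k)][nt] += prob
        for span in range(2, n + 1):                      # induction on span length
            for p in range(1, n - span + 2):
                q = p + span - 1
                for d in range(p, q):                     # split: left p..d, right d+1..q
                    for lhs, left, right, prob in BINARY:
                        beta[(p, q)][lhs] += prob * beta[(p, d)][left] * beta[(d + 1, q)][right]
        return beta

    words = "The gunman sprayed the building with bullets".split()
    beta = inside_probabilities(words)
    print(beta[(1, 2)]["NP"])   # 0.25, matching the hand computation above
    print(beta[(1, 7)]["S"])    # approximately 0.006 = P(t1) + P(t2), the sentence probability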

50 Parse Triangle
A parse triangle is constructed for calculating β_j(p, q).
Probability of a sentence using β_j(p, q): P(w_1m | G) = β_1(1, m), the inside probability of the start symbol N^1 over the whole sentence.

51 Parse Triangle
[Table: a 7 x 7 upper-triangular chart whose rows and columns are indexed by the word positions The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7).] Fill the diagonal cells with the base-case values β_j(k, k).

52 Parse Triangle
[Same chart.] Calculate the remaining cells β_j(p, q) using the induction formula.

53 Example Parse t1
The gunman sprayed the building with bullets.
[Same tree as slide 13: the PP attaches to the VP.] The rule used here is VP → VP PP.

54 Another Parse t2
The gunman sprayed the building with bullets.
[Same tree as slide 14: the PP attaches to the object NP.] The rule used here is VP → VBD NP.

55 Parse Triangle
[The filled 7 x 7 chart for The(1) gunman(2) sprayed(3) the(4) building(5) with(6) bullets(7).]

56 Different Parses
Consider:
- Different splitting points, e.g., the 5th and the 3rd position.
- Different rules for the VP expansion, e.g., VP → VP PP and VP → VBD NP.
Different parses for the VP "sprayed the building with bullets" can be constructed this way.

57 Outside Probabilities α_j(p, q)
Base case: α_1(1, m) = 1 (the start symbol spans the whole sentence) and α_j(1, m) = 0 for j ≠ 1.
Inductive step: sum, over all parents N^f, sibling non-terminals N^g and end points e, the configurations in which N^j_pq is a child of N^f. For the configuration shown in the figure, where N^f_pe → N^j_pq N^g_(q+1)e, the contribution is α_f(p, e) * P(N^f → N^j N^g) * β_g(q+1, e); the symmetric configuration with N^j as the right child contributes α_f(e, q) * P(N^f → N^g N^j) * β_g(e, p-1).
[Figure: N^1 over w_1 ... w_m, with N^f_pe expanding to N^j_pq (over w_p ... w_q) and N^g_(q+1)e (over w_{q+1} ... w_e).]

58 Probability of a Sentence
The joint probability of a sentence w_1m and the existence of a constituent spanning words w_p to w_q is given as P(w_1m, N^j_pq | G) = α_j(p, q) * β_j(p, q).
Example: "The gunman sprayed the building with bullets" (positions 1-7), with an NP spanning positions 4-5 ("the building").

