
1 NLP

2 Introduction to NLP

3 Sequence of random variables that aren’t independent
Examples
–weather reports
–text

4 Limited horizon: P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t)
Time invariant (stationary): P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1)
Definition: in terms of a transition matrix A and initial state probabilities π

5 [Figure: Markov chain over states a–f with a start state; arcs labeled with transition probabilities. The next slide's example uses P(X_1 = d) = 1.0, P(a|d) = 0.7, and P(b|a) = 0.8 from this diagram.]

6 P(X_1, …, X_T) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) … P(X_T|X_1, …, X_{T-1})
= P(X_1) P(X_2|X_1) P(X_3|X_2) … P(X_T|X_{T-1})
P(d, a, b) = P(X_1 = d) P(X_2 = a | X_1 = d) P(X_3 = b | X_2 = a) = 1.0 × 0.7 × 0.8 = 0.56
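To make the chain-rule computation concrete, here is a minimal Python sketch; the nested-dict layout and the names start_prob, trans_prob, and sequence_probability are illustrative choices, and only the probabilities used in the worked example are filled in.

    # Markov chain sequence probability; only the values from the worked example are filled in.
    start_prob = {"d": 1.0}                              # P(X_1 = d) = 1.0
    trans_prob = {"d": {"a": 0.7}, "a": {"b": 0.8}}      # P(a|d) = 0.7, P(b|a) = 0.8

    def sequence_probability(states):
        """P(X_1, ..., X_T) = P(X_1) * product over t of P(X_t | X_{t-1})."""
        p = start_prob.get(states[0], 0.0)
        for prev, cur in zip(states, states[1:]):
            p *= trans_prob.get(prev, {}).get(cur, 0.0)
        return p

    print(sequence_probability(["d", "a", "b"]))         # 1.0 * 0.7 * 0.8 = 0.56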

7 Motivation
–Observing a sequence of symbols
–The sequence of states that led to the generation of the symbols is hidden
Definition
–Q = sequence of states
–O = sequence of observations, drawn from a vocabulary
–q_0, q_f = special (start, final) states
–A = state transition probabilities
–B = symbol emission probabilities
–π = initial state probabilities
–µ = (A, B, π) = complete probabilistic model

8 Uses
–part of speech tagging
–speech recognition
–gene sequencing

9 Can be used to model state sequences and observation sequences
Example:
–P(s, w) = Π_i P(s_i | s_{i-1}) P(w_i | s_i)
[Figure: state chain S_0 → S_1 → S_2 → S_3 → … → S_n with emitted words W_1, W_2, W_3, …, W_n]
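As a concrete reading of the product above, a small sketch that multiplies transition and emission probabilities along a tagged sentence; the tags, words, and probability values are hypothetical, chosen only to show the bookkeeping.

    # Joint probability P(s, w) = product over i of P(s_i | s_{i-1}) * P(w_i | s_i)
    # The tags, words, and probability values below are hypothetical.
    trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.5}      # P(s_i | s_{i-1})
    emit  = {("DT", "the"): 0.4, ("NN", "dog"): 0.01}    # P(w_i | s_i)

    def joint_probability(tags, words):
        p, prev = 1.0, "<s>"                             # "<s>" plays the role of S_0
        for tag, word in zip(tags, words):
            p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
            prev = tag
        return p

    print(joint_probability(["DT", "NN"], ["the", "dog"]))   # 0.6*0.4 * 0.5*0.01 = 0.0012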

10 Pick start state from π
For t = 1..T
–Move to another state based on A
–Emit an observation based on B
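A minimal sketch of this generative procedure, assuming π, A, and B are stored as {outcome: probability} dicts; the helper sample_from and the function name generate are illustrative.

    import random

    def sample_from(dist):
        """Draw one outcome from a {outcome: probability} dict."""
        outcomes, weights = zip(*dist.items())
        return random.choices(outcomes, weights=weights)[0]

    def generate(pi, A, B, T):
        state = sample_from(pi)                          # pick start state from pi
        observations = []
        for _ in range(T):
            state = sample_from(A[state])                # move to another state based on A
            observations.append(sample_from(B[state]))   # emit an observation based on B
        return observations

With the G/H model defined a few slides below (π giving all of its mass to G), generate(pi, A, B, 2) would produce a two-symbol sequence such as ['y', 'x'].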

11 [Figure: two-state HMM with states G and H and a start state; transitions labeled 0.8, 0.2, 0.6, 0.4 (spelled out on slide 13)]

12 P(O_t = k | X_t = s_i, X_{t+1} = s_j) = b_{ijk}

      x    y    z
G   0.7  0.2  0.1
H   0.3  0.5  0.2

13 Initial
–P(G|start) = 1.0, P(H|start) = 0.0
Transition
–P(G|G) = 0.8, P(G|H) = 0.6, P(H|G) = 0.2, P(H|H) = 0.4
Emission
–P(x|G) = 0.7, P(y|G) = 0.2, P(z|G) = 0.1
–P(x|H) = 0.3, P(y|H) = 0.5, P(z|H) = 0.2

14 Starting in state G (or H), P(yz) = ?
Possible sequences of states:
–GG
–GH
–HG
–HH
P(yz) = P(yz, GG) + P(yz, GH) + P(yz, HG) + P(yz, HH)
= .8 × .2 × .8 × .1 + .8 × .2 × .2 × .2 + .2 × .5 × .6 × .1 + .2 × .5 × .4 × .2
= .0128 + .0064 + .0060 + .0080
= .0332
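The same total can be checked by brute-force enumeration over state sequences, using the probabilities from slide 13 and the slide's transition-then-emit convention; the dict layout and function name are just one way to organize it.

    from itertools import product

    # Transition and emission probabilities from slide 13
    A = {"G": {"G": 0.8, "H": 0.2},
         "H": {"G": 0.6, "H": 0.4}}
    B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
         "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

    def observation_probability(obs, start_state="G"):
        """Brute-force sum of P(obs, state sequence): from the current state,
        transition, then emit, for each observed symbol."""
        total = 0.0
        for states in product("GH", repeat=len(obs)):
            p, prev = 1.0, start_state
            for state, symbol in zip(states, obs):
                p *= A[prev][state] * B[state][symbol]
                prev = state
            total += p
        return total

    print(observation_probability("yz"))   # 0.0128 + 0.0064 + 0.0060 + 0.0080 = 0.0332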

15 The states encode the most recent history
The transitions encode likely sequences of states
–e.g., Adj-Noun or Noun-Verb
–or perhaps Art-Adj-Noun
Use MLE to estimate the transition probabilities

16 Estimating the emission probabilities
–Harder than transition probabilities
–There may be novel uses of word/POS combinations
Suggestions
–It is possible to use standard smoothing
–As well as heuristics (e.g., based on the spelling of the words)
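A rough sketch of how the transition and emission probabilities might be estimated from a labeled corpus with MLE plus add-one smoothing (add-one standing in for whichever standard smoothing method is chosen); the corpus format of (word, tag) pairs, the start symbol <s>, and all names are assumptions.

    from collections import Counter

    def train(tagged_sentences, vocab, tagset):
        """MLE with add-one smoothing over sentences of (word, tag) pairs."""
        trans_counts, prev_counts = Counter(), Counter()
        emit_counts, tag_counts = Counter(), Counter()
        for sentence in tagged_sentences:
            prev = "<s>"                                 # assumed sentence-start symbol
            for word, tag in sentence:
                trans_counts[(prev, tag)] += 1
                prev_counts[prev] += 1
                emit_counts[(tag, word)] += 1
                tag_counts[tag] += 1
                prev = tag

        def p_trans(prev, tag):                          # smoothed P(tag | prev)
            return (trans_counts[(prev, tag)] + 1) / (prev_counts[prev] + len(tagset))

        def p_emit(tag, word):                           # smoothed P(word | tag)
            return (emit_counts[(tag, word)] + 1) / (tag_counts[tag] + len(vocab))

        return p_trans, p_emit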

17 The observer can only see the emitted symbols
Observation likelihood
–Given the observation sequence S and the model µ = (A, B, π), what is the probability P(S|µ) that the sequence was generated by that model?
Being able to compute the probability of the observation sequence turns the HMM into a language model

18 Tasks
–Given µ = (A, B, π), find P(O|µ)
–Given O and µ, what is (X_1, …, X_{T+1})?
–Given O and a space of all possible µ, find the model that best describes the observations
Decoding – most likely sequence
–tag each token with a label
Observation likelihood
–classify sequences
Learning
–train models to fit empirical data

19 Find the most likely sequence of tags, given the sequence of words
–t* = argmax_t P(t|w)
Given the model µ, it is possible to compute P(t|w) for all values of t
In practice, there are way too many combinations
Possible solution:
–Use beam search (partial hypotheses), as sketched below
–At each state, only keep the k best hypotheses so far
–May not work
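A rough sketch of the beam idea: after each word, keep only the k best partial tag sequences, scored with the HMM factors P(t|t_prev)·P(w|t). It is written to plug into the p_trans/p_emit sketch above; all names are illustrative, and, as the slide warns, pruning can discard the path that would eventually have scored best.

    import heapq

    def beam_search(words, tags, p_trans, p_emit, k=3):
        """Keep only the k best partial hypotheses (score, tag sequence) after each word."""
        beam = [(1.0, ["<s>"])]
        for word in words:
            candidates = []
            for score, seq in beam:
                for tag in tags:
                    new_score = score * p_trans(seq[-1], tag) * p_emit(tag, word)
                    candidates.append((new_score, seq + [tag]))
            beam = heapq.nlargest(k, candidates, key=lambda c: c[0])   # prune to the k best
        best_score, best_seq = max(beam, key=lambda c: c[0])
        return best_seq[1:], best_score

Larger k trades speed for a lower chance of pruning away the best path; a small k is fast but only approximate.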

20 Find the best path up to observation i and state s (a code sketch follows the trellis walkthrough below)
Characteristics
–Uses dynamic programming
–Memoization
–Backpointers

21 [Figure: trellis with start and end nodes and a column of G and H states for each time step; edges labeled with transition probabilities such as P(H|G) and emission probabilities such as P(y|G)]

22 [Figure: trellis over the observation sequence y, z with the cell P(G, t=1) highlighted]
P(G, t=1) = P(start) × P(G|start) × P(y|G)

23 [Figure: trellis with the cell P(H, t=1) highlighted]
P(H, t=1) = P(start) × P(H|start) × P(y|H)

24 [Figure: trellis with the cell P(H, t=2) highlighted]
P(H, t=2) = max(P(G, t=1) × P(H|G) × P(z|H), P(H, t=1) × P(H|H) × P(z|H))

25 [Figure: trellis with the value of P(H, t=2) filled in]

26 [Figure: trellis, continuation of the walkthrough]

27 [Figure: trellis with the cell P(end, t=3) highlighted]

28 [Figure: trellis with the cell P(end, t=3) highlighted]
P(end, t=3) = max(P(G, t=2) × P(end|G), P(H, t=2) × P(end|H))

29 [Figure: trellis with the completed path into P(end, t=3)]
P(end, t=3) = max(P(G, t=2) × P(end|G), P(H, t=2) × P(end|H))
P(end, t=3) = best score for the sequence
Use the backpointers to find the sequence of states.
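What slides 20–29 walk through is Viterbi decoding. Below is a minimal dict-based sketch of the same recurrence with backpointers, following the trellis conventions above (score at t=1 is P(s|start)·P(o_1|s), with a final transition into the end state). The end probabilities end_p, i.e. P(end|G) and P(end|H), are not given on the slides and must be supplied; all names are illustrative.

    def viterbi(obs, states, start_p, trans_p, emit_p, end_p):
        """Best path with backpointers, following the trellis on the slides:
        score(s, t=1) = P(s|start) * P(o_1|s)
        score(s, t)   = max over prev of score(prev, t-1) * P(s|prev) * P(o_t|s)"""
        V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]   # best scores per time step
        back = [{s: None for s in states}]                          # backpointers
        for t in range(1, len(obs)):
            V.append({}); back.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t-1][p] * trans_p[p][s])
                V[t][s] = V[t-1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
                back[t][s] = prev
        # final transition into the special end state
        last = max(states, key=lambda s: V[-1][s] * end_p[s])
        best_score = V[-1][last] * end_p[last]
        # follow the backpointers to recover the state sequence
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path)), best_score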

30 Given multiple HMMs
–e.g., for different languages
–Which one is the most likely to have generated the observation sequence?
Naïve solution
–try all possible state sequences

31 Dynamic programming method
–Computing a forward trellis that encodes all possible state paths
–Based on the Markov assumption that the probability of being in any state at a given time point only depends on the probabilities of being in all states at the previous time point
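A minimal sketch of the forward trellis: structurally the same as the Viterbi sketch above but summing over previous states instead of taking a max, which yields the observation likelihood P(O|µ). Names are illustrative, the end-state transition is omitted for brevity, and a real implementation would work with scaled or log probabilities.

    def forward(obs, states, start_p, trans_p, emit_p):
        """Forward trellis: alpha[t][s] = P(o_1..o_t, X_t = s); returns P(O | model)."""
        alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        for t in range(1, len(obs)):
            alpha.append({
                s: sum(alpha[t-1][p] * trans_p[p][s] for p in states) * emit_p[s][obs[t]]
                for s in states
            })
        return sum(alpha[-1][s] for s in states)

Choosing among several HMMs (e.g., one per language) then amounts to running this on each model and picking the one with the highest value.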

32 Supervised
–Training sequences are labeled
Unsupervised
–Training sequences are unlabeled
–Known number of states
Semi-supervised
–Some training sequences are labeled

33 Estimate the state transition probabilities using MLE
Estimate the observation probabilities using MLE
Use smoothing

34 Given:
–observation sequences
Goal:
–build the HMM
Use EM (Expectation Maximization) methods
–forward-backward (Baum-Welch) algorithm
–Baum-Welch finds an approximate solution for maximizing P(O|µ)

35 Algorithm
–Randomly set the parameters of the HMM
–Until the parameters converge, repeat:
 E step – determine the probability of the various state sequences for generating the observations
 M step – reestimate the parameters based on these probabilities
Notes
–the algorithm guarantees that at each iteration the likelihood of the data P(O|µ) increases
–it can be stopped at any point and give a partial solution
–it converges to a local maximum
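A compact single-sequence sketch of the Baum-Welch updates, with the convergence test replaced by a fixed iteration count; it ignores the details a practical implementation needs (scaled or log probabilities, multiple training sequences, smoothing of zero expected counts). The dict-based layout and all names are illustrative.

    def baum_welch(obs, states, vocab, pi, A, B, iterations=10):
        """Single-sequence Baum-Welch (forward-backward EM) for a discrete HMM."""
        T = len(obs)
        for _ in range(iterations):                    # fixed count instead of a convergence test
            # E step: forward and backward probabilities
            alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]
            for t in range(1, T):
                alpha.append({s: sum(alpha[t-1][p] * A[p][s] for p in states) * B[s][obs[t]]
                              for s in states})
            beta = [dict() for _ in range(T)]
            beta[T-1] = {s: 1.0 for s in states}
            for t in range(T - 2, -1, -1):
                beta[t] = {s: sum(A[s][n] * B[n][obs[t+1]] * beta[t+1][n] for n in states)
                           for s in states}
            likelihood = sum(alpha[T-1][s] for s in states)     # P(O | current parameters)

            # expected state-occupancy and transition counts
            gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
                     for t in range(T)]
            xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / likelihood
                   for i in states for j in states}
                  for t in range(T - 1)]

            # M step: reestimate pi, A, B from the expected counts
            pi = {s: gamma[0][s] for s in states}
            A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                     for j in states} for i in states}
            B = {s: {k: sum(g[s] for t, g in enumerate(gamma) if obs[t] == k) /
                        sum(g[s] for g in gamma)
                     for k in vocab} for s in states}
        return pi, A, B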

36 NLP

