
1 MACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING

2 Outline
Some Sample NLP Tasks [Noah Smith]
Structured Prediction For NLP
Structured Prediction Methods
– Conditional Random Fields
– Structured Perceptron
Discussion

3 MOTIVATING STRUCTURED-OUTPUT PREDICTION FOR NLP

4 Types of machine learning [copyright © Andrew W. Moore]
– Regressor: input attributes → prediction of real-valued output
– Density Estimator: input attributes → probability
– Classifier: input attributes → prediction of categorical output or class (one of a few discrete values)

12 Outline
Some Sample NLP Tasks [Noah Smith]
– What do these have in common?
– We're making many related predictions...
Structured Prediction For NLP
Structured Prediction Methods
– Structured Perceptron
– Conditional Random Fields
Deep learning and NLP

14 NER via begin-inside-outside (BIO) token tagging: each token gets a tag such as beginGPE, beginGF, inGF, or out, and the pattern beginX inX* delimits an entity of type X.

15 beginGF via begin-inside-outside token tagging (beginX inX* delimits entity X): predicting the tag of a single token is a classification task. Features for the token "english": thisToken = english, thisTokenShape = Xx+, prevToken = the, prevPrevToken = across, nextToken = channel, nextNextToken = Monday, nextTokenShape = Xx+, prevTokenShape = x+, .... As a binary feature vector: thisToken_english = 1, thisTokenShape_Xx+ = 1, prevToken_the = 1, prevPrevToken_across = 1, ...; label: beginGF.
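Below is a minimal sketch of how such token features might be extracted. The example sentence, the shape function, and the helper names are illustrative assumptions, not code from the slides.

```python
import re

def token_shape(tok):
    """Rough word-shape feature: 'English' -> 'Xx+', 'the' -> 'x+', '1993' -> 'd+'."""
    shape = "".join("X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
                    for c in tok)
    return re.sub(r"(.)\1+", r"\1+", shape)  # collapse repeated characters: 'Xxxxxxx' -> 'Xx+'

def token_features(tokens, i):
    """Binary features for classifying the BIO tag of tokens[i]."""
    feats = {f"thisToken_{tokens[i]}": 1,
             f"thisTokenShape_{token_shape(tokens[i])}": 1}
    if i > 0:
        feats[f"prevToken_{tokens[i-1]}"] = 1
        feats[f"prevTokenShape_{token_shape(tokens[i-1])}"] = 1
    if i > 1:
        feats[f"prevPrevToken_{tokens[i-2]}"] = 1
    if i + 1 < len(tokens):
        feats[f"nextToken_{tokens[i+1]}"] = 1
        feats[f"nextTokenShape_{token_shape(tokens[i+1])}"] = 1
    if i + 2 < len(tokens):
        feats[f"nextNextToken_{tokens[i+2]}"] = 1
    return feats

# Illustrative sentence reconstructed loosely from the slide's feature values:
tokens = ["went", "across", "the", "English", "Channel", "Monday"]
print(token_features(tokens, 3))  # features for "English"
```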

16 inGF via begin-inside-outside token tagging: each token in the sequence gets its own feature vector and label, e.g.
– "the" → out: thisToken_the = 1, thisTokenShape_x+ = 1, prevToken_across = 1, ...
– "english" → beginGF: thisToken_english = 1, thisTokenShape_Xx+ = 1, prevToken_the = 1, prevPrevToken_across = 1, ...
– "channel" → inGF: thisToken_channel = 1, thisTokenShape_Xx+ = 1, prevToken_english = 1, prevPrevToken_the = 1, ...
A problem: the instances are not i.i.d., nor are the classes: inX usually happens after beginX or inX, and beginX usually happens after out, ...

17 Structured Output Prediction: definition
– a classifier where the output is structured, and the input might be structured too
– example: predict a sequence of begin-in-out tags from a sequence of tokens (represented by a sequence of feature vectors)

18 NER via begin-inside-outside token tagging: x = the token sequence, y = the tag sequence (beginGPE, beginGF, inGF, out, ...). A problem: with K begin-inside-outside tags and N words, we have K^N possible structured labels. How can we learn to choose among so many possible outputs?

19 HMMS AND NER

20 HMMs map a sequence of tokens (x) to a sequence of outputs, the hidden states (y): tags such as beginGF, beginGPE, beginPer, inGPE, inGF, inPer, out.

21 Borkar et al.: HMMs for segmentation
– Example: addresses, bibliographic records
– Problem: some DBs may split records up differently (e.g. no "mail stop" field, combined address and apt #, ...) or not at all
– Solution: learn to segment the textual form of records
Example record, segmented into Author / Year / Title / Journal / Volume / Page: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

22 IE with Hidden Markov Models: must learn transition and emission probabilities. [Figure: an HMM over states such as Author, Year, Title, Journal, with transition probabilities on the edges (0.9, 0.5, 0.8, 0.2, 0.1, ...) and per-state emission probabilities, e.g. a Title state emitting "Learning", "Convex", ... (0.06, 0.03, ...), a Journal state emitting "Comm.", "Trans.", "Chemical" (0.04, 0.02, 0.004), an Author state emitting "Smith", "Cohen", "Jordan" (0.01, 0.05, 0.3), and a Year state emitting digit patterns dddd, dd (0.8, 0.2).]
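As a concrete toy sketch of such a model, here is a tiny HMM with hand-set transition and emission tables and Viterbi decoding of the most likely state sequence. All states, probabilities, and tokens are illustrative assumptions, not the values from the figure.

```python
import math

# Toy HMM for bibliographic segmentation; every number here is illustrative.
states = ["Author", "Year", "Title"]
start = {"Author": 0.8, "Year": 0.1, "Title": 0.1}
trans = {
    "Author": {"Author": 0.5, "Year": 0.4, "Title": 0.1},
    "Year":   {"Author": 0.1, "Year": 0.1, "Title": 0.8},
    "Title":  {"Author": 0.1, "Year": 0.1, "Title": 0.8},
}
emissions = {
    "Author": {"Smith": 0.3, "Cohen": 0.3},
    "Year":   {"1993": 0.9},
    "Title":  {"Protein": 0.2, "Engineering": 0.2},
}

def emit(state, token):
    """Emission probability with a small floor for unseen tokens (crude smoothing)."""
    return emissions[state].get(token, 1e-4)

def viterbi(tokens):
    """Most likely hidden state sequence (computed in log space to avoid underflow)."""
    V = [{s: math.log(start[s]) + math.log(emit(s, tokens[0])) for s in states}]
    back = []
    for tok in tokens[1:]:
        scores, ptrs = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            scores[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit(s, tok))
            ptrs[s] = prev
        V.append(scores)
        back.append(ptrs)
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

print(viterbi(["Smith", "Cohen", "1993", "Protein", "Engineering"]))
```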

23 BIO tagging as classification, again: the token "english" (label beginGF) is represented by features thisToken = english, thisTokenShape = Xx+, prevToken = the, prevPrevToken = across, nextToken = channel, nextNextToken = Monday, nextTokenShape = Xx+, prevTokenShape = x+, .... HMM limitation: HMMs are not well-suited to modeling a set of features for each token.

24 What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: the identity of the word (e.g. "Wisniewski"), whether it ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in a hyperlink anchor, .... [Figure: states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...

25 Outline
Some Sample NLP Tasks [Noah Smith]
Structured Prediction For NLP
Structured Prediction Methods
– HMMs
– Conditional Random Fields
– Structured Perceptron
Discussion

26 CONDITIONAL RANDOM FIELDS

27 What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: the identity of the word (e.g. "Wisniewski"), whether it ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in a hyperlink anchor, .... [Figure: states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...

28 Inference for linear-chain MRFs. Example sentence: "When will prof Cohen post the notes ...". Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them, e.g.
– (y(i)==B or y(i)==I) and (token(i) is capitalized)
– (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
– (y(i)==B and y(i-1)==B), e.g. "tell Ziv William is on the way"
Idea 2: construct a graph where each path is a possible sequence labeling.

29 Inference for a linear-chain MRF. [Trellis: a column of candidate labels B/I/O for each token of "When will prof Cohen post the notes ...".] Inference: find the highest-weight path. Learning: optimize the feature weights so that this highest-weight path is correct relative to the data.

30 Inference for a linear-chain MRF (continued): features plus weights define the edge weights of the trellis, e.g.
weight(y_i, y_{i+1}) = 2*[(y_i = B or I) and isCap(x_i)] + 1*[y_i = B and isFirstName(x_i)] - 5*[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
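A minimal sketch of this trellis view, under the assumption that the weighted indicator features above are the whole model: edge weights are computed from adjacent tokens and their label pair, and the highest-weight path is found by Viterbi-style dynamic programming. The isFirstName gazetteer and the initialization of the first column are illustrative choices, not from the slides.

```python
LABELS = ["B", "I", "O"]
FIRST_NAMES = {"Cohen", "William", "Ziv"}   # stand-in gazetteer, purely illustrative

def edge_weight(y_i, y_next, x_i, x_next):
    """Weighted indicator features over an adjacent label pair, as on the slide."""
    w = 0.0
    if y_i in ("B", "I") and x_i[:1].isupper():
        w += 2.0
    if y_i == "B" and x_i in FIRST_NAMES:
        w += 1.0
    if y_next != "B" and x_i.islower() and x_next[:1].isupper():
        w -= 5.0
    return w

def best_path(tokens):
    """Highest-weight label sequence over the trellis (additive Viterbi)."""
    n = len(tokens)
    score = [{y: 0.0 for y in LABELS}]      # first column initialized to zero
    back = []
    for i in range(n - 1):
        nxt, ptr = {}, {}
        for y2 in LABELS:
            y1 = max(LABELS, key=lambda y: score[i][y] + edge_weight(y, y2, tokens[i], tokens[i + 1]))
            nxt[y2] = score[i][y1] + edge_weight(y1, y2, tokens[i], tokens[i + 1])
            ptr[y2] = y1
        score.append(nxt)
        back.append(ptr)
    last = max(LABELS, key=lambda y: score[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(best_path("When will prof Cohen post the notes".split()))
```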

31 CRF learning – from Sha & Pereira

33 Something like forward-backward. Idea: define a matrix of y,y' "affinities" at stage i: M_i[y,y'] = the "unnormalized probability" of a transition from y to y' at stage i; then M_i * M_{i+1} gives the "unnormalized probability" of any path through stages i and i+1.
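A small numpy sketch of this idea (toy affinity matrices, illustrative sizes): multiplying the per-stage matrices sums the unnormalized scores of every label path, which is how the partition function of a linear-chain model is computed; a brute-force enumeration confirms the result.

```python
import itertools
import numpy as np

K = 3          # number of labels, e.g. B, I, O
T = 4          # number of transitions (a sequence of T + 1 positions)
rng = np.random.default_rng(0)

# M[i][y, y2] = unnormalized "affinity" of moving from label y to y2 at stage i (toy values).
M = [np.exp(rng.normal(size=(K, K))) for _ in range(T)]

# Multiplying the matrices sums over all intermediate labels, so summing the entries
# of the product gives the total unnormalized score of all K**(T+1) paths (Z).
prod = M[0]
for Mi in M[1:]:
    prod = prod @ Mi
Z = prod.sum()

# Brute-force check over every label path.
Z_brute = sum(
    np.prod([M[i][path[i], path[i + 1]] for i in range(T)])
    for path in itertools.product(range(K), repeat=T + 1)
)
assert np.isclose(Z, Z_brute)
print(Z)
```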

34 Sha & Pereira results: CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron.

35 Sha & Pereira results. [Table: training times in minutes, 375k examples.]

36 THE STRUCTURED PERCEPTRON

37 Review: the perceptron (as an online game between A and B). A sends B an instance x_i; B computes ŷ_i = v_k · x_i and predicts with its sign; if B makes a mistake, it updates v_{k+1} = v_k + y_i x_i.
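A minimal sketch of that mistake-driven update (numpy; the toy data and epoch count are illustrative):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Classic perceptron: predict sign(v . x); on a mistake, add y_i * x_i to v."""
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(v @ x_i) != y_i:     # mistake (sign 0 counts as a mistake)
                v = v + y_i * x_i
    return v

# Toy linearly separable data: label = sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
v = perceptron(X, y)
print(v, np.sign(X @ v))
```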

38 [Figure: perceptron proof geometry around the target vector u with margin 2γ. (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2. (3b) The guess v_2 after one positive and one negative example: v_2 = v_1 - x_2.]

39 [Figure: same geometry, repeated.]

41 The voted perceptron for ranking. A sends B instances x_1, x_2, x_3, x_4, ...; B computes v_k · x_i and returns the index b* of the "best" (highest-scoring) x_i; if b* is not the correct index b, B updates v_{k+1} = v_k + x_b - x_{b*}.
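A minimal sketch of one round of this ranking variant (numpy; the candidate vectors and the correct index are illustrative):

```python
import numpy as np

def ranking_perceptron_step(v, candidates, b):
    """One round: score every candidate, guess the best, and on a mistake
    move v toward the correct candidate and away from the guess."""
    scores = candidates @ v
    b_star = int(np.argmax(scores))          # B's guess: highest-scoring candidate
    if b_star != b:                          # mistake
        v = v + candidates[b] - candidates[b_star]
    return v, b_star

# Toy round: four candidate feature vectors, the correct one is index 2.
v = np.zeros(3)
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.5, 0.5, 1.0],
                       [0.2, 0.1, 0.0]])
v, guess = ranking_perceptron_step(v, candidates, b=2)
print(v, guess)
```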

42 [Figure: ranking some x's with the target vector u; margin γ.]

43 [Figure: ranking some x's with some guess vector v, part 1.]

44 [Figure: ranking some x's with some guess vector v, part 2. The purple-circled x is x_{b*}, the one the learner has chosen to rank highest; the green-circled x is x_b, the right answer.]

45 [Figure: correcting v by adding x_b - x_{b*}.]

46 [Figure: correcting v by adding x_b - x_{b*}, part 2: the update moves v_k to v_{k+1}.]

47 [Figure: the guess v_2 after the two positive examples: v_2 = v_1 + x_2.]

48 [Figure: same, continued.]

49 [Figure: same, continued.]

50 [Figure: same, continued.] Notice this doesn't depend at all on the number of x's being ranked; neither proof depends on the dimension of the x's.

51 The voted perceptron for ranking (as before): A sends B instances x_1, x_2, x_3, x_4, ...; B computes v_k · x_i, returns the index b* of the "best" x_i, and on a mistake updates v_{k+1} = v_k + x_b - x_{b*}. Change number one is notation: replace x with g.

52 The voted perceptron for NER: the same ranking protocol with instances g_1, g_2, g_3, g_4, ...; B computes v_k · g_i, returns the index b* of the "best" g_i, and on a mistake sets v_{k+1} = v_k + g_b - g_{b*}.
1. A sends B the Sha & Pereira paper, some feature functions, and instructions for creating the instances g: A sends a word vector x_i. B could then create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), ..., but instead B just returns the y* that gives the best score for the dot product v_k · F(x_i, y*), found using Viterbi.
2. A sends B the correct label sequence y_i.
3. On errors, B sets v_{k+1} = v_k + g_b - g_{b*} = v_k + F(x_i, y_i) - F(x_i, y*).

53 The voted perceptron for NER (same algorithm, instances now called z_1, z_2, z_3, z_4, ...):
1. A sends a word vector x_i.
2. B just returns the y* that gives the best score for v_k · F(x_i, y*).
3. A sends B the correct label sequence y_i.
4. On errors, B sets v_{k+1} = v_k + z_b - z_{b*} = v_k + F(x_i, y_i) - F(x_i, y*).
So this algorithm can also be viewed as an approximation to the CRF learning algorithm, where we're using a Viterbi approximation to the expectations and stochastic gradient descent to optimize the likelihood.
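A minimal sketch of that protocol as a training loop. The joint feature map F(x, y) here counts simple emission and transition features, and the argmax over tag sequences is done by brute-force enumeration for clarity (the slide's version uses Viterbi); the tag set, feature map, and tiny training pair are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

TAGS = ["B", "I", "O"]

def joint_features(tokens, tags):
    """F(x, y): emission features (tag, token) and transition features (tag, next tag)."""
    f = defaultdict(float)
    for tok, tag in zip(tokens, tags):
        f[("emit", tag, tok)] += 1.0
    for t1, t2 in zip(tags, tags[1:]):
        f[("trans", t1, t2)] += 1.0
    return f

def score(v, f):
    return sum(v.get(k, 0.0) * val for k, val in f.items())

def decode_best(v, tokens):
    """argmax_y of v . F(x, y); brute force over all tag sequences for clarity."""
    best, best_s = None, float("-inf")
    for tags in product(TAGS, repeat=len(tokens)):
        s = score(v, joint_features(tokens, tags))
        if s > best_s:
            best, best_s = list(tags), s
    return best

def structured_perceptron(data, epochs=5):
    v = defaultdict(float)
    for _ in range(epochs):
        for tokens, gold in data:
            pred = decode_best(v, tokens)
            if pred != gold:                      # mistake: v += F(x, y) - F(x, y*)
                for k, val in joint_features(tokens, gold).items():
                    v[k] += val
                for k, val in joint_features(tokens, pred).items():
                    v[k] -= val
    return v

# Tiny illustrative training pair.
data = [(["across", "the", "english", "channel"], ["O", "O", "B", "I"])]
v = structured_perceptron(data)
print(decode_best(v, ["across", "the", "english", "channel"]))
```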

54 And back to the paper: Collins, EMNLP 2002 (best paper).

55 Collins’ Experiments POS tagging (with MXPOST features) NP Chunking (words and POS tags from Brill’s tagger as features) and BIO output tags Compared Maxent Tagging/MEMM’s (with iterative scaling) and “Voted Perceptron trained HMM’s” –With and w/o averaging –With and w/o feature selection (count>5)

56 Collins’ results

57 Outline
Some Sample NLP Tasks [Noah Smith]
Structured Prediction For NLP
Structured Prediction Methods
– Conditional Random Fields
– Structured Perceptron
Discussion

58 Structured prediction is just one part of NLP
– There is lots of work on learning and NLP
– It would be much easier to summarize non-learning-related NLP
CRFs/HMMs/structured perceptrons are just one part of structured output prediction
– Chris Dyer's course ....
Hot topic now: representation learning and deep learning for NLP
– NAACL 2013: Deep Learning for NLP, Manning

