1 CS546: Machine Learning and Natural Language. Multi-Class and Structured Prediction Problems. Slides from Taskar and Klein are used in this lecture.

2 Outline
– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification (example: POS tagging)

3 Multiclass problems
– Most of the machinery we discussed so far was focused on binary classification problems, e.g., the SVMs covered earlier
– However, most problems we encounter in NLP are either:
  MultiClass: e.g., text categorization
  Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?

4 Binary linear classification

5 Multiclass classification

6 Perceptron

7 Structured Perceptron. Joint feature representation; algorithm:
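The formulas themselves are not reproduced in this transcript. Below is a minimal sketch of the structured perceptron update, assuming a joint feature map phi(x, y) that returns sparse feature counts and a decoder argmax_decode that finds the highest-scoring structure under the current weights; both names are illustrative, not the slide's own.

```python
# Sketch of the structured perceptron; phi(x, y) and argmax_decode(...) are
# assumed (hypothetical) helpers: a joint feature map returning sparse feature
# counts, and a decoder computing argmax_y w . phi(x, y) under the current weights.
def structured_perceptron(data, phi, argmax_decode, epochs=10):
    w = {}  # sparse weight vector: feature name -> weight
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = argmax_decode(x, w, phi)  # decode with current weights
            if y_hat != y_gold:
                # Additive update: toward the gold structure, away from the prediction
                for k, v in phi(x, y_gold).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in phi(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
    return w
```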

8 Perceptron

9 Binary Classification Margin

10 Generalize to MultiClass

11 Converting to MultiClass SVM

12 Max Margin = Min Norm. As before, these are equivalent formulations:
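The formulations are not included in the transcript; the following is a hedged reconstruction in standard multiclass max-margin notation (phi is a joint feature map over input and label; the slide's own symbols may differ).

```latex
% Max-margin form: fix ||w|| = 1 and maximize the margin gamma
\max_{\gamma,\, \|\mathbf{w}\| = 1} \;\gamma
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\phi(x_i, y_i) - \mathbf{w}^{\top}\phi(x_i, y) \ge \gamma
\qquad \forall i,\; \forall y \ne y_i

% Min-norm form: fix the margin to 1 and minimize the norm
\min_{\mathbf{w}} \;\tfrac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\phi(x_i, y_i) - \mathbf{w}^{\top}\phi(x_i, y) \ge 1
\qquad \forall i,\; \forall y \ne y_i
```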

13 Problems: requires separability. What if we have noise in the data? What if we have only a limited, simple feature space?

14 Non-separable case

15 Non-separable case

16 Compare with MaxEnt

17 Loss Comparison

18 So far we considered multiclass classification with 0-1 losses l(y,y'). What if what we want to predict is: sequences of POS tags, syntactic trees, translations? Multiclass -> Structured
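To make the contrast concrete, here is a small sketch comparing the 0-1 loss with a Hamming-style structured loss over tag sequences; the function names and tag strings are illustrative, not from the slides.

```python
def zero_one_loss(y, y_prime):
    # 0-1 loss: any mismatch costs 1, regardless of how close the structures are.
    return 0 if y == y_prime else 1

def hamming_loss(y, y_prime):
    # Structured loss for sequences: counts positions where the predicted tag differs.
    assert len(y) == len(y_prime)
    return sum(1 for a, b in zip(y, y_prime) if a != b)

# Example: two tag sequences that differ in a single position
gold = ["DT", "NN", "VBZ"]
pred = ["DT", "NN", "VBD"]
print(zero_one_loss(gold, pred))   # 1
print(hamming_loss(gold, pred))    # 1 (out of 3 positions)
```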

19 Predicting word alignments

20 Predicting Syntactic Trees

21 Structured Models

22 Parsing

23 Max Margin Markov Networks (M3Ns). Taskar et al., 2003; similar: Tsochantaridis et al., 2004

24 Max Margin Markov Networks (M3Ns)

25 MultiClass Classification: solving MultiClass with binary learning
MultiClass classifier: a function f : R^d -> {1, 2, 3, ..., k}
Decompose into binary problems; caveats:
– not always possible to learn
– different scales across the binary classifiers
– no theoretical justification

26 MultiClass Classification: learning via One-Versus-All (OvA)
Assumption: find v_r, v_b, v_g, v_y in R^n such that
– v_r.x > 0 iff y = red
– v_b.x > 0 iff y = blue
– v_g.x > 0 iff y = green
– v_y.x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i.x
(Figure: individual classifiers and decision regions.) H = R^{kn}
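A minimal sketch of the OvA scheme with the argmax decision rule; a perceptron-style mistake-driven update stands in for the binary learner (the slide does not commit to a particular one), and integer class labels 0..k-1 are assumed.

```python
import numpy as np

def train_ova(X, y, num_classes, epochs=10):
    """One-vs-All: one weight vector per class, each trained as 'class r vs. rest'."""
    V = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            for r in range(num_classes):
                target = 1 if y_i == r else -1
                if target * V[r].dot(x_i) <= 0:   # mistake-driven update
                    V[r] += target * x_i
    return V

def predict_ova(V, x):
    # f(x) = argmax_r v_r . x
    return int(np.argmax(V.dot(x)))
```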

27 MultiClass Classification: learning via All-Versus-All (AvA)
Assumption: find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that
– v_rb.x > 0 if y = red, < 0 if y = blue
– v_rg.x > 0 if y = red, < 0 if y = green
– ... (for all pairs)
(Figure: individual classifiers and decision regions.) H = R^{kkn}
How to classify?

28 Classifying with AvA: Tree, Majority Vote, Tournament
Example: 1 red, 2 yellow, 2 green -> ?
All are applied post-learning and can produce inconsistent decisions.
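A sketch of the majority-vote option; the pairwise weight vectors are assumed to have been trained already, keyed by class pairs (r, s) with the convention that a positive score votes for r.

```python
import numpy as np

def predict_ava_majority(pairwise, x, num_classes):
    """pairwise: dict mapping (r, s) with r < s to a weight vector v_rs,
    where v_rs . x > 0 votes for class r and <= 0 votes for class s."""
    votes = np.zeros(num_classes)
    for (r, s), v_rs in pairwise.items():
        if v_rs.dot(x) > 0:
            votes[r] += 1
        else:
            votes[s] += 1
    return int(np.argmax(votes))  # majority vote; ties broken by lowest class index
```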

29 POS Tagging: English tags

30 POS Tagging, examples from WSJ (from McCallum)

31 POS Tagging
Ambiguity: not a trivial task
Useful for other tasks: important features for later steps are based on POS, e.g., using POS as input to a parser

32 But still, why so popular?
– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers, both for sequence models and for independent per-word classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (for other languages)

33 HMM (reminder)

34 HMM (reminder) – transitions

35 Transition Estimates

36 Emission Estimates
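The estimation formulas themselves are not in the transcript; the following is a count-based sketch of both the transition and emission estimates, with add-alpha smoothing as an illustrative choice (the slides may use a different smoothing scheme).

```python
from collections import Counter

def estimate_hmm(tagged_sentences, alpha=1.0):
    """tagged_sentences: list of [(word, tag), ...]; returns smoothed count-based estimates."""
    trans, emit = Counter(), Counter()
    prev_counts, tag_counts = Counter(), Counter()
    tags, vocab = set(), set()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_counts[prev] += 1
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
        trans[(prev, "</s>")] += 1
        prev_counts[prev] += 1

    def p_trans(prev, tag):
        # P(tag | prev) = count(prev, tag) / count(prev), with add-alpha smoothing
        return (trans[(prev, tag)] + alpha) / (prev_counts[prev] + alpha * (len(tags) + 1))

    def p_emit(tag, word):
        # P(word | tag) = count(tag, word) / count(tag), with add-alpha smoothing
        return (emit[(tag, word)] + alpha) / (tag_counts[tag] + alpha * len(vocab))

    return p_trans, p_emit
```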

37 MaxEnt (reminder)

38 Decoding: HMM vs MaxEnt
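A compact Viterbi sketch for HMM decoding in log space, assuming the p_trans / p_emit estimators from the sketch above; for a locally normalized MaxEnt/MEMM-style model the same dynamic program applies once the per-position conditional scores are swapped in.

```python
import math

def viterbi(words, tags, p_trans, p_emit):
    """Find the highest-scoring tag sequence for an HMM, in log space to avoid underflow."""
    # best[i][t] = best log-score of any tag sequence for words[:i+1] ending in tag t
    best = [{t: math.log(p_trans("<s>", t)) + math.log(p_emit(t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            scores = {tp: best[i - 1][tp] + math.log(p_trans(tp, t)) for tp in tags}
            tp_best = max(scores, key=scores.get)
            best[i][t] = scores[tp_best] + math.log(p_emit(t, words[i]))
            back[i][t] = tp_best
    # Final transition to the end-of-sentence state, then follow back-pointers
    final = {t: best[-1][t] + math.log(p_trans(t, "</s>")) for t in tags}
    t = max(final, key=final.get)
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    return list(reversed(path))
```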

39 Accuracies overview

40 Accuracies overview

41 SVMs for tagging
– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% on WSJ
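The exact feature set behind the 97.16% figure is not shown here, so the following window-feature extractor is only illustrative of the general approach: each word position becomes one multiclass example described by its +/- window context.

```python
def window_features(words, i, window=2):
    """String features for tagging words[i] from a +/- window context (illustrative set)."""
    feats = {f"w0={words[i].lower()}"}
    for k in range(1, window + 1):
        left = words[i - k].lower() if i - k >= 0 else "<s>"
        right = words[i + k].lower() if i + k < len(words) else "</s>"
        feats.add(f"w-{k}={left}")
        feats.add(f"w+{k}={right}")
    # Simple word-shape features, useful for unknown words
    feats.add(f"suffix3={words[i][-3:].lower()}")
    if words[i][0].isupper():
        feats.add("is_capitalized")
    return feats

# Any multiclass scheme (e.g., one-vs-all SVMs over these features) can then be
# trained on the per-position examples, with no sequence-level modeling.
```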

42 SVMs for tagging (from Jimenez & Marquez)

43 No sequence modeling

44 CRFs and other global models

45 CRFs and other global models

46 Compare:
CRFs – no local normalization
MEMMs – note: after each step t the remaining probability mass cannot be reduced; it can only be distributed across the possible state transitions
HMMs
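A small sketch of where the normalization happens in each conditional model, assuming a local score function score(prev_tag, tag, x, i); the CRF partition function is written naively (enumerating all tag sequences) purely to make the contrast with the MEMM's per-step normalization explicit.

```python
import math
from itertools import product

def memm_log_prob(tags_seq, x, tagset, score):
    """MEMM: normalize locally at every step, so each step's mass must sum to 1."""
    logp, prev = 0.0, "<s>"
    for i, t in enumerate(tags_seq):
        Z_i = sum(math.exp(score(prev, tp, x, i)) for tp in tagset)  # local partition
        logp += score(prev, t, x, i) - math.log(Z_i)
        prev = t
    return logp

def crf_log_prob(tags_seq, x, tagset, score):
    """CRF: one global normalization over all complete tag sequences (naive version)."""
    def total(seq):
        prev, s = "<s>", 0.0
        for i, t in enumerate(seq):
            s += score(prev, t, x, i)
            prev = t
        return s
    log_Z = math.log(sum(math.exp(total(seq)) for seq in product(tagset, repeat=len(x))))
    return total(tags_seq) - log_Z
```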

47 Label Bias (based on a slide from Joe Drish)

48 Label Bias
Recall transition-based parsing – Nivre's algorithm (with beam search)
At each step we can observe only local features (limited look-ahead)
If we later see that the following word is impossible, we can only distribute probability uniformly across all (im)possible decisions
If there is only a small number of such decisions, we cannot decrease the probability dramatically
So, label bias is likely to be a serious problem if:
– there are non-local dependencies
– states have a small number of possible outgoing transitions

49 POS Tagging Experiments
– "+" is an extended feature set (hard to integrate in a generative model)
– oov = out-of-vocabulary

50 Supervision
– So far we considered the supervised case: the training set is labeled
– However, we can try to induce word classes without supervision: unsupervised tagging
– We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  Seed tags
  Small labeled dataset
  Parallel corpus
  ....

51 Why not predict POS + parse trees simultaneously?
– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers, e.g., on non-grammatical sentences or in different domains
– It is more expensive and does not help...

52 Questions
Why is there no label-bias problem for a generative model (e.g., an HMM)?
How would you integrate word features into a generative model (e.g., HMMs for POS tagging)? E.g., if the word has: -ing, -s, -ed, -d, -ment, ... or post-, de-, ...

53 “CRFs” for more complex structured output problems
We considered sequence labeling problems, where the structure of dependencies is fixed
What if we do not know the structure but would like to have interactions that respect the structure?

54 “CRFs” for more complex structured output problems
Recall, we had the MST algorithm (McDonald and Pereira, 05)

55 “CRFs” for more complex structured output problems
Complex inference: e.g., arbitrary 2nd-order dependency parsing models are not tractable in the non-projective case; NP-complete (McDonald & Pereira, EACL 06)
Recently, conditional models for constituent parsing: (Finkel et al., ACL 08), (Carreras et al., CoNLL 08), ...

56 Back to MultiClass
– Let us review how to decompose a multiclass problem into binary classification problems

57 Summary
Margin-based methods for multiclass classification and structured prediction
CRFs vs. HMMs vs. MEMMs for POS tagging

58 Conclusions
All approaches use a linear representation. The differences are:
– Features
– How the weights are learned
– Training paradigms:
  Global training (CRF, global perceptron)
  Modular training (PMM, MEMM, ...) – these are easier to train, but may require additional mechanisms to enforce global constraints

