Presentation on theme: "EMNLP’02 11/11/2002 — ML: Classical methods from AI: Decision-Tree induction, Exemplar-based Learning, Rule Induction, TBEDL" — Presentation transcript:

1 EMNLP’02 11/11/2002. ML: Classical methods from AI – Decision-Tree induction – Exemplar-based Learning – Rule Induction – TBEDL

2 Decision Trees
Decision trees are a way to represent rules underlying training data, with hierarchical sequential structures that recursively partition the data. They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with purposes including description, classification, and generalization.
From a machine-learning perspective, decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
Acquisition: Top-Down Induction of Decision Trees (TDIDT).
Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95).

3 An Example. [Figure: a generic decision tree with attribute nodes (A1, A2, A3, A5), branch values (v1–v7) and class leaves (C1, C2, C3), alongside a concrete decision tree testing SIZE (small/big), COLOR (red/blue) and SHAPE (circle/triangle) to classify examples as pos/neg.]

4 Learning Decision Trees. [Diagram: learning — Training Set + TDIDT → DT; test — Example + DT → Class.]

5 General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var: tree1, tree2: decision-tree; X’: set-of-examples; A’: set-of-features end-var
  if (stopping_criterion (X)) then
    tree1 := create_leaf_tree (X)
  else
    a_max := feature_selection (X, A);
    tree1 := create_tree (X, a_max);
    for-all val in values (a_max) do
      X’ := select_examples (X, a_max, val);
      A’ := A \ {a_max};
      tree2 := TDIDT (X’, A’);
      tree1 := add_branch (tree1, tree2, val)
    end-for
  end-if
  return (tree1)
end-function
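The TDIDT skeleton above translates almost line-for-line into Python. The sketch below is a minimal illustration, not any system's actual implementation; the data layout (lists of `(feature_dict, label)` pairs) and the pluggable `feature_selection` callback are assumptions:

```python
from collections import Counter

def majority_class(examples):
    # examples: list of (feature_dict, class_label) pairs
    return Counter(y for _, y in examples).most_common(1)[0][0]

def tdidt(examples, features, feature_selection):
    """Top-Down Induction of Decision Trees, following the slide's skeleton.

    feature_selection(examples, features) plays the role of choosing a_max
    (e.g. by information gain); its exact form is left to the caller.
    """
    labels = {y for _, y in examples}
    # stopping criterion: pure node, or no features left to split on
    if len(labels) == 1 or not features:
        return ("leaf", majority_class(examples))
    a_max = feature_selection(examples, features)
    branches = {}
    for val in {x[a_max] for x, _ in examples}:   # for-all val in values(a_max)
        subset = [(x, y) for x, y in examples if x[a_max] == val]
        branches[val] = tdidt(subset, features - {a_max}, feature_selection)
    return ("node", a_max, branches)

def classify(tree, x):
    # walk from the root to a leaf following x's feature values
    while tree[0] == "node":
        _, feat, branch = tree
        tree = branch[x[feat]]
    return tree[1]
```

A trivial selector (alphabetically first feature) is enough to exercise the recursion; real systems plug in one of the selection criteria discussed on the next slides.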

7 Feature Selection Criteria
– Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86)
– Functions derived from distance measures: Gini Diversity Index (Breiman et al. 84), RLM (López de Mántaras 91)
– Statistically based: Chi-square test (Sestito & Dillon 94), Symmetrical Tau (Zhou & Dillon 91)
– RELIEFF-IG: a variant of RELIEFF (Kononenko 94)

8 Information Gain (Quinlan 79)

9 Information Gain (2) (Quinlan 79)

10 Gain Ratio (Quinlan 86)
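The formulas on these slides were images and are not in the transcript. As a reminder of what they compute: information gain is the reduction in class entropy produced by a split, and gain ratio normalises it by the entropy of the split itself, to penalise attributes with many values. A small sketch (the example data layout and function names are illustrative, not from the tutorial):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_c p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    # examples: list of (feature_dict, class_label) pairs
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for val in {x[feature] for x, _ in examples}:
        sub = [y for x, y in examples if x[feature] == val]
        gain -= len(sub) / len(examples) * entropy(sub)   # weighted child entropy
    return gain

def gain_ratio(examples, feature):
    # Quinlan's normalisation: divide by the "split information",
    # i.e. the entropy of the partition induced by the feature itself
    values = [x[feature] for x, _ in examples]
    split_info = entropy(values)
    return information_gain(examples, feature) / split_info if split_info else 0.0
```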

11 RELIEF (Kira & Rendell, 1992)
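The RELIEF formula itself did not survive the transcript; the idea of the basic two-class algorithm is to reward features that agree with an instance's nearest hit (same class) and disagree with its nearest miss (other class). A sketch under that reading (the sampling scheme and data layout here are assumptions, not the slide's exact formulation):

```python
import random

def diff(f, e1, e2):
    # nominal features: 0 if the two examples agree on f, 1 otherwise
    return 0.0 if e1[0][f] == e2[0][f] else 1.0

def relief(examples, features, m, rng=random):
    """Two-class RELIEF: estimate a relevance weight per feature by sampling
    m instances and comparing each to its nearest hit and nearest miss."""
    w = {f: 0.0 for f in features}
    for _ in range(m):
        x = rng.choice(examples)
        hits = [e for e in examples if e[1] == x[1] and e is not x]
        misses = [e for e in examples if e[1] != x[1]]
        dist = lambda e: sum(diff(f, x, e) for f in features)
        near_hit = min(hits, key=dist)
        near_miss = min(misses, key=dist)
        for f in features:
            # relevant features differ on the miss but not on the hit
            w[f] += (diff(f, x, near_miss) - diff(f, x, near_hit)) / m
    return w
```

RELIEFF (next slide) extends this to k nearest hits/misses and to multi-class problems.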

12 RELIEFF (Kononenko, 1994)

13 RELIEFF-IG (Màrquez, 1999). Like RELIEFF, except that the distance measure used for calculating the nearest hits/misses does not treat all attributes equally: it weights the attributes according to the IG measure.

14 Extensions of DTs (Murthy 95): (pre/post) pruning; minimizing the effect of the greedy approach: lookahead; non-linear splits; combination of multiple models; etc.

15 Decision Trees and NLP
– Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
– POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
– Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
– Parsing (Magerman 95, 96; Haruno et al. 98, 99)
– Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
– Text summarization (Mani & Bloedorn 98)
– Dialogue act tagging (Samuel et al. 98)

16 Decision Trees and NLP
– Noun phrase coreference (Aone & Benett 95; McCarthy & Lehnert 95)
– Discourse analysis in information extraction (Soderland & Lehnert 94)
– Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
– Verb classification in machine translation (Tanaka 96; Siegel 97)
More recent applications of DTs to NLP combine them in a boosting framework (we will see this in following sessions).

17 Example: POS Tagging using DTs (The Wall Street Journal Corpus). "He was shot in the hand as he chased the robbers in the back street" — six of the words are ambiguous, with candidate tag pairs NN/VB, NN/VB, JJ/VB, JJ/VB, NN/VB, NN/VB.

18 POS Tagging using Decision Trees (Màrquez, PhD 1999). [Diagram: Raw text → Morphological analysis → POS tagging (Language Model + Disambiguation Algorithm) → Tagged text.]

19 POS Tagging using Decision Trees (Màrquez, PhD 1999). [Same pipeline; the Language Model component is a set of decision trees.]

20 POS Tagging using Decision Trees (Màrquez, PhD 1999). [Same pipeline; the Disambiguation Algorithm is one of RTT, STT, or RELAX.]

21 DT-based Language Modelling. The "preposition-adverb" tree: from the root (P(IN)=0.81, P(RB)=0.19), a Word Form test on "As"/"as" leads to a tag(+1) test (other word forms reach a leaf with P(IN)=0.83, P(RB)=0.17); the tag(+1)=RB branch leads to a tag(+2) test (P(IN)=0.13, P(RB)=0.87); the tag(+2)=IN branch ends in a leaf with P(IN)=0.013, P(RB)=0.987. Statistical interpretation: the estimates P̂(RB | word="As"/"as" & tag(+1)=RB & tag(+2)=IN) = 0.987 and P̂(IN | word="As"/"as" & tag(+1)=RB & tag(+2)=IN) = 0.013 are read off the corresponding leaf.
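The leaf look-up that the slide interprets statistically can be made concrete. Below, the "preposition-adverb" tree is encoded as nested dicts; the probabilities are the slide's, but the exact branch layout and all helper names are assumptions reconstructed from the picture:

```python
# The "preposition-adverb" tree from the slide, encoded as nested dicts.
# Interior nodes test a feature; leaves carry class-probability estimates.
pa_tree = {
    "test": "word",
    "branches": {
        ("As", "as"): {
            "test": "tag(+1)",
            "branches": {
                "RB": {
                    "test": "tag(+2)",
                    "branches": {"IN": {"leaf": {"IN": 0.013, "RB": 0.987}}},
                    "default": {"leaf": {"IN": 0.13, "RB": 0.87}},
                },
            },
            "default": {"leaf": {"IN": 0.83, "RB": 0.17}},
        },
    },
    "default": {"leaf": {"IN": 0.81, "RB": 0.19}},
}

def leaf_probs(tree, context):
    """Follow the tree for a given context and return the leaf's P(tag | path)."""
    while "leaf" not in tree:
        val = context[tree["test"]]
        for branch_vals, subtree in tree["branches"].items():
            vals = branch_vals if isinstance(branch_vals, tuple) else (branch_vals,)
            if val in vals:
                tree = subtree
                break
        else:
            tree = tree["default"]
    return tree["leaf"]
```

For example, `leaf_probs(pa_tree, {"word": "as", "tag(+1)": "RB", "tag(+2)": "IN"})` reaches the bottom leaf, reproducing the slide's estimate P̂(RB | ...) = 0.987.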

22 DT-based Language Modelling. The "preposition-adverb" tree (continued): collocations captured by the tag(+2)=IN leaf include "as_RB much_RB as_IN", "as_RB well_RB as_IN", "as_RB soon_RB as_IN".

23 Language Modelling using DTs
Algorithm: Top-Down Induction of Decision Trees (TDIDT); supervised learning — CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.
Attributes: local context, (-3,+2) tokens.
Particular implementation: branch-merging; CART post-pruning; smoothing; attributes with many values; several functions for attribute selection — minimizing the effect of over-fitting, data fragmentation & sparseness.
Granularity? Ambiguity-class level: adjective-noun, adjective-noun-verb, etc.

24 Model Evaluation. The Wall Street Journal (WSJ) annotated corpus: 1,170,000 words; tagset size: 45 tags; noise: 2-3% of mistagged words; 49,000 word-form frequency lexicon (manual filtering of the 200 most frequent entries); 36.4% ambiguous words; 2.44 (1.52) average tags per word; 243 ambiguity classes.

25 Model Evaluation. The Wall Street Journal (WSJ) annotated corpus. [Charts: number of ambiguity classes that cover x% of the training corpus; arity of the classification problems.]

26 Ambiguity Classes. The selected ambiguity classes cover 57.90% of the ambiguous occurrences! Experimental setting: 10-fold cross-validation.

27 N-fold Cross Validation
Divide the training set S into a partition of n equal-size disjoint subsets: s_1, s_2, …, s_n.
for i := 1 to n do
  learn and test a classifier using:
    training_set := U s_j for all j different from i
    validation_set := s_i
end-for
return: the average accuracy from the n experiments.
Which is a good value for n? ( ) Extreme case (n = training set size): leave-one-out.
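The procedure above translates directly into a short Python sketch. The `train_and_test` callback (returning an accuracy in [0, 1]) and the striding fold split are assumptions of this illustration, not the slide's:

```python
def n_fold_cross_validation(examples, n, train_and_test):
    """N-fold cross-validation as on the slide: partition the data into n
    equal-size disjoint subsets, train on the union of n-1 of them, validate
    on the held-out one, and average the n accuracies.

    train_and_test(training_set, validation_set) -> accuracy in [0, 1].
    """
    folds = [examples[i::n] for i in range(n)]            # s_1 ... s_n
    accuracies = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [e for j, fold in enumerate(folds) if j != i
                        for e in fold]                    # U s_j, j != i
        accuracies.append(train_and_test(training_set, validation_set))
    return sum(accuracies) / n
```

Setting `n = len(examples)` gives the leave-one-out extreme mentioned on the slide.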

28 Size: Number of Nodes. Average size reduction: 51.7%, 46.5%; 74.1% (total).

29 Accuracy. No loss in accuracy (at least).

30 Feature Selection Criteria: statistically equivalent.

31 DT-based POS Taggers
– Tree base = statistical component: RTT, a reductionistic tree-based tagger (Màrquez & Rodríguez 97); STT, a statistical tree-based tagger (Màrquez & Rodríguez 99).
– Tree base = compatibility constraints: RELAX, a relaxation-labelling based tagger (Màrquez & Padró 97).

32 RTT (Màrquez & Rodríguez 97). [Diagram: Raw text → Morphological analysis → Disambiguation (Classify → Filter → Update, looping until "stop?", driven by the Language Model) → Tagged text.]

33 STT (Màrquez & Rodríguez 99). N-grams (trigrams).

34 STT (Màrquez & Rodríguez 99). Contextual probabilities, estimated using decision trees.

35 STT (Màrquez & Rodríguez 99). [Diagram: Raw text → Morphological analysis → Disambiguation via the Viterbi algorithm, using a Language Model of lexical probabilities + contextual probabilities → Tagged text.]

36 STT+ (Màrquez & Rodríguez 99). [Same pipeline; the Language Model combines lexical probabilities, contextual probabilities, and N-grams in the Viterbi algorithm.]
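The Viterbi decoding that STT and STT+ rely on can be sketched as follows. This is a generic bigram-tagger version, not the dissertation's code; the `lex_prob` and `ctx_prob` callbacks (the latter standing in for the n-gram or decision-tree contextual model) are illustrative assumptions:

```python
import math

def viterbi(words, tags, lex_prob, ctx_prob):
    """Viterbi decoding for a bigram tagger.

    lex_prob(word, tag)     -> P(word | tag)      (lexical probabilities)
    ctx_prob(prev_tag, tag) -> P(tag | prev_tag)  (contextual probabilities;
                               prev_tag is None at sentence start)
    """
    # delta[t] = best log-probability of any tag sequence ending in tag t
    delta = {t: math.log(ctx_prob(None, t)) + math.log(lex_prob(words[0], t))
             for t in tags}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + math.log(ctx_prob(p, t)))
            delta[t] = (prev[best] + math.log(ctx_prob(best, t))
                        + math.log(lex_prob(w, t)))
            ptr[t] = best
        back.append(ptr)
    # follow back-pointers from the best final tag
    t = max(delta, key=delta.get)
    seq = [t]
    for ptr in reversed(back):
        t = ptr[t]
        seq.append(t)
    return list(reversed(seq))
```

With a toy lexicon where "the" strongly prefers DT and "dog" strongly prefers NN, the decoder recovers the expected tag sequence even under uniform transitions.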

37 DT-based POS Taggers (recap)
– Tree base = statistical component: RTT (Màrquez & Rodríguez 97); STT (Màrquez & Rodríguez 99).
– Tree base = compatibility constraints: RELAX (Màrquez & Padró 97).

38 RELAX (Màrquez & Padró 97). Relaxation labelling (Padró 96). [Diagram: Raw text → Morphological analysis → Disambiguation via relaxation labelling, using a Language Model whose set of constraints combines linguistic rules + N-grams + tree constraints → Tagged text.]

39 RELAX (Màrquez & Padró 97). Translating trees into constraints: from the "preposition-adverb" tree, the tag(+2)=IN leaf (P(IN)=0.013, P(RB)=0.987) yields a negative constraint (IN) (0 "as" "As") (1 RB) (2 IN) and a positive constraint 2.37 (RB) (0 "as" "As") (1 RB) (2 IN). Compatibility values are estimated using mutual information.

40 Experimental Evaluation. Using the WSJ annotated corpus. Training set: 1,121,776 words; test set: 51,990 words; closed vocabulary assumption. Base of 194 trees: covering 99.5% of the ambiguous occurrences; storage requirement: 565 Kb; acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation).

41 Experimental Evaluation. RTT results: 67.52% error reduction with respect to MFT; accuracy = 94.45% (ambiguous), 97.29% (overall) — comparable to the best state-of-the-art automatic POS taggers; recall = 98.22%, precision = 95.73% (1.08 tags/word). RTT allows stating a trade-off between precision and recall.

42 Experimental Evaluation. STT results: comparable to those of RTT. STT allows the incorporation of N-gram information, so some problems of sparseness and coherence of the resulting tag sequence can be alleviated. STT+ results: better than those of RTT and STT.

43 Experimental Evaluation. Including trees into RELAX: translation of 44 representative trees covering 84% of the examples = 8,473 constraints. Addition of: bigrams (2,808 binary constraints); trigrams (52,161 ternary constraints); linguistically-motivated manual constraints (20).

44 Accuracy of RELAX. MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints", H = set of 20 hand-written linguistic rules.

45 Decision Trees: Summary
Advantages:
– Acquires symbolic knowledge in an understandable way
– Very well studied ML algorithms and variants
– Can be easily translated into rules
– Availability of software: C4.5, C5.0, etc.
– Can be easily integrated into an ensemble

46 Decision Trees: Summary
Drawbacks:
– Computationally expensive when scaling to large natural-language domains: training examples, features, etc.
– Data sparseness and data fragmentation: the problem of the small disjuncts => probability estimation
– DTs are a model with high variance (unstable)
– Tendency to overfit the training data: pruning is necessary
– Requires quite a big effort in tuning the model

