
1 Transformation-Based Learning Advanced Statistical Methods in NLP Ling 572 March 1, 2012

2 Roadmap Transformation-based (Error-driven) learning Basic TBL framework Design decisions TBL Part-of-Speech Tagging Rule templates Effectiveness HW #9

5 Transformation-Based Learning: Overview Developed by Eric Brill (~1992) Basic approach: Apply initial labeling Perform cascade of transformations to reduce error Applications: Classification Sequence labeling Part-of-speech tagging, dialogue act tagging, chunking, handwriting segmentation, referring expression generation Structure extraction: parsing

6 TBL Architecture

13 Key Process Goal: Correct errors made by initial labeling Approach: transformations Two components: Rewrite rule: e.g., MD → NN Triggering environment: e.g., prevT=DT Example transformation: MD → NN if prevT=DT The/DT can/MD rusted/VB → The/DT can/NN rusted/VB
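The rewrite rule and triggering environment above can be sketched in a few lines of Python (an illustrative toy, not Brill's implementation):

```python
# Apply one TBL transformation: retag MD -> NN when the previous tag is DT.
def apply_transform(tags, from_tag="MD", to_tag="NN", prev_tag="DT"):
    """Return a new tag sequence with the rewrite rule applied."""
    out = list(tags)
    for i in range(1, len(out)):
        # Triggering environment: the preceding tag matches prev_tag.
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# "The/DT can/MD rusted/VB" -> "The/DT can/NN rusted/VB"
print(apply_transform(["DT", "MD", "VB"]))  # ['DT', 'NN', 'VB']
```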

17 TBL: Training 1. Apply initial labeling procedure 2. Consider all possible transformations, select transformation with best score on training data 3. Add rule to end of rule cascade 4. Iterate over steps 2,3 until no more improvement
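Steps 1-4 amount to a greedy loop over candidate transformations. A toy sketch (the rule representation and error-reduction scoring are simplified assumptions, not Brill's code):

```python
def tbl_train(initial_tags, gold_tags, candidate_rules, min_gain=1):
    """Greedily learn an ordered cascade of transformations.

    initial_tags:    labels produced by the initial-state annotator (step 1)
    gold_tags:       correct labels for the training data
    candidate_rules: functions mapping a tag sequence to a new one
    """
    tags = list(initial_tags)
    cascade = []
    while True:
        # Step 2: score each candidate by its error reduction on training data.
        def gain(rule):
            new = rule(tags)
            return sum(n == g for n, g in zip(new, gold_tags)) - \
                   sum(t == g for t, g in zip(tags, gold_tags))
        best = max(candidate_rules, key=gain)
        if gain(best) < min_gain:   # Step 4: stop when no more improvement
            break
        cascade.append(best)        # Step 3: add rule to end of cascade
        tags = best(tags)           # commit the transformation, then iterate
    return cascade
```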

18 Error Reduction

21 TBL: Testing For each test instance: 1. Apply initial labeling 2. Apply, in sequence, all matching transformations
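Decoding is just replay: the initial labeling followed by the learned cascade, applied in the order the rules were selected during training. A sketch, with a toy annotator and rule as illustrative assumptions:

```python
def tbl_decode(words, initial_annotator, cascade):
    tags = initial_annotator(words)
    for rule in cascade:
        tags = rule(tags)   # every matching transformation fires, in sequence
    return tags

# Toy initial annotator and a single "learned" rule (not from a real model):
lexicon = {"the": "DT", "can": "MD", "rusted": "VBD"}
annotate = lambda ws: [lexicon.get(w, "NN") for w in ws]
md_to_nn = lambda tags: ["NN" if i > 0 and t == "MD" and tags[i-1] == "DT"
                         else t for i, t in enumerate(tags)]
print(tbl_decode(["the", "can", "rusted"], annotate, [md_to_nn]))
# ['DT', 'NN', 'VBD']
```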

31 Core Design Decisions Choose initial-state annotator Many possibilities: Simple heuristics, complex learned classifiers Specify legal transformations: ‘Transformation templates’: instantiate based on data Can be simple and local or complex and long-distance Define evaluation function for selecting transformation Evaluation measures: accuracy, f-measure, parse score Stopping criterion: no improvement, threshold, ...

35 More Design Decisions Issue: Rule application can affect downstream application Approach: Constraints on transformation application: Specify order of transformation application Decide if transformations applied immediately or after whole corpus is checked for triggering environments

41 Ordering Example Example sequence: A A A A A Transformation: if prevT==A, A → B If check whole corpus for contexts before application: A B B B B If apply immediately, left-to-right: A B A B A If apply immediately, right-to-left: A B B B B
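The three application regimes are easy to reproduce in code. A sketch of the rule "A → B if prevT == A" on the sequence A A A A A:

```python
def apply_corpus_first(tags):
    # Find all triggering positions against the ORIGINAL tags, then rewrite.
    hits = [i for i in range(1, len(tags)) if tags[i] == "A" and tags[i-1] == "A"]
    out = list(tags)
    for i in hits:
        out[i] = "B"
    return out

def apply_immediate(tags, positions):
    # Apply immediately in the given traversal order, so earlier rewrites
    # can block (or feed) later triggers.
    out = list(tags)
    for i in positions:
        if out[i] == "A" and out[i-1] == "A":
            out[i] = "B"
    return out

seq = ["A"] * 5
print(apply_corpus_first(seq))                      # A B B B B
print(apply_immediate(seq, range(1, 5)))            # A B A B A (left-to-right)
print(apply_immediate(seq, reversed(range(1, 5))))  # A B B B B (right-to-left)
```

Left-to-right immediate application differs because rewriting position 1 to B destroys the trigger for position 2.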

48 Comparisons Decision trees: Similar in applying a cascade of tests for classification Similar in interpretability TBL: More general: Can convert any DT to TBL Selection criterion: more focused: accuracy vs InfoGain More powerful: Can perform arbitrary output transform Can refine arbitrary existing labeling Multipass processing More expensive: Transformations tested/applied to all training data

49 Case Study: POS Tagging

51 TBL Part-of-Speech Tagging Design decisions: Select initial labeling Define transformation templates Select evaluation criterion Determine application order

59 POS Initial Labeling & Transformations Initial-state annotation? Heuristic: Most likely tag for word in training corpus What about unknown words? (In testing) Heuristic: Caps → NNP; lowercase → NN Transformation templates: Rewrite rules: TagA → TagB Trigger contexts: From 3 word windows Non-lexicalized Lexicalized
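The initial-state annotator described above is a sketch-friendly heuristic: tag each known word with its most frequent training-corpus tag, and guess NNP for unknown capitalized words, NN otherwise. A minimal Python version:

```python
from collections import Counter, defaultdict

def build_initial_annotator(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    most_likely = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def annotate(words):
        # Known word: most likely training tag; unknown: capitalization heuristic.
        return [most_likely.get(w, "NNP" if w[:1].isupper() else "NN")
                for w in words]
    return annotate

annotate = build_initial_annotator([("the", "DT"), ("can", "MD"),
                                    ("can", "NN"), ("can", "MD")])
print(annotate(["the", "can", "Rusted", "rusted"]))  # ['DT', 'MD', 'NNP', 'NN']
```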

63 Non-lexicalized Templates No word references, only tags t-1 == z t+1 == z t-2 == z t-1 == z or t-2 == z t+1 == z or t+2 == z or t+3 == z t-1 == z and t+1 == w t-1 == z and t-2 == w …
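Templates are instantiated from the data: each error context observed in the corpus yields one concrete candidate rule. A sketch for the single template "t-1 == z" (the generalization to the other templates follows the same pattern; this is an illustrative assumption, not Brill's code):

```python
def instantiate_prev_tag_template(current_tags, gold_tags):
    """Collect candidate rules (from_tag, to_tag, prev_tag) from tagging errors."""
    candidates = set()
    for i in range(1, len(current_tags)):
        if current_tags[i] != gold_tags[i]:
            # The rule "from_tag -> to_tag if t-1 == z" could fix this error.
            candidates.add((current_tags[i], gold_tags[i], current_tags[i - 1]))
    return candidates

print(instantiate_prev_tag_template(["DT", "MD", "VB"], ["DT", "NN", "VB"]))
# {('MD', 'NN', 'DT')}
```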

68 Lexicalized Templates Add word references as well as tags Use all non-lexicalized templates, plus w-1 == w w-2 == w w-1 == w or w-2 == w w0 == w and w-1 == x w0 == w and t-1 == z w0 == w w-1 == w and t-1 == z …

72 Remaining Design Decisions Optimization criterion: Greatest reduction in tagging error Stops when no reduction beyond a threshold Ordering: Processing is left-to-right Application: Find all triggering environments first Then apply transformations to all environments found

73 Top Non-Lexicalized Transformations

75 Example Lexicalized Transformations if w+2 == as, IN → RB PTB coding says: as tall as → as/RB tall/JJ as/IN Initial tagging: as/IN tall/JJ as/IN if w-1 == n’t or w-2 == n’t, VBP → VB Example: we do n’t eat or we do n’t usually sleep

78 TBL Accuracy Performance evaluated on different corpora: Penn WSJ: 96.6% Penn Brown: 96.3% Orig Brown: 96.5% Baseline: 92.4% Lexicalized vs non-lexicalized transformations Non-lexicalized WSJ: 96.3% Lexicalized WSJ: 96.6% Rules: 447 contextual rules (+ 243 for unknown words)

79 TBL Summary Builds on an initial-state annotation Defines transformation templates for task Learns ordered list of transformations on labels By iteratively selecting best transformation Classifies by applying initial-state and applicable transformations in order

83 Analysis Contrasts: Distinctive learning strategy Improves initial classification outputs Exploits current tagging state (previous, current, post) tags Can apply to classification, sequence learning, structure Potentially global feature use – short vs long distance Arbitrary transformations, rules

88 TBL: Advantages Pros: Optimization: Directly optimizes accuracy Goes beyond classification to sequence & structure: parsing, etc. Meta-classification: Use initial state annotation Explicitly manipulates full labeled output Output of transformation is visible to other transformations

91 TBL: Disadvantages Cons: Computationally expensive: Learns by iterating all possible template instantiations Over all instances in the training data Modifications have been proposed to accelerate No probabilistic interpretation Can’t produce ranked n-best lists

92 HW #9

93 TBL For Text Classification Transformations: If (feature is present) and (classlabel == class1), classlabel → class2

94 Q1 Build a TBL trainer Input: training_data file Output: model_file Lines: featureName from_class to_class gain Parameter: Gain threshold Example: guns talk guns mideast 89

95 Q2 TBL decoder Input: test_file model_file Output: output_file Parameter: N = # of transformations Output_file: InstanceID goldLabel outputLabel rule1 rule2 … Rule: feature from_class to_class Ex: inst1 guns mideast we guns misc talk misc mideast

