Transformation-Based Learning
Advanced Statistical Methods in NLP
Ling 572, March 1, 2012

Roadmap
- Transformation-based (error-driven) learning
- Basic TBL framework
- Design decisions
- TBL part-of-speech tagging
  - Rule templates
  - Effectiveness
- HW #9

Transformation-Based Learning: Overview
- Developed by Eric Brill (~1992)
- Basic approach: apply an initial labeling, then a cascade of transformations to reduce error
- Applications:
  - Classification
  - Sequence labeling: part-of-speech tagging, dialogue act tagging, chunking, handwriting segmentation, referring expression generation
  - Structure extraction: parsing

TBL Architecture

Key Process
- Goal: correct errors made by the initial labeling
- Approach: transformations, each with two components:
  - Rewrite rule, e.g., MD -> NN
  - Triggering environment, e.g., prevT=DT
- Example transformation: MD -> NN if prevT=DT
  The/DT can/MD rusted/VB -> The/DT can/NN rusted/VB
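
A transformation is just data: a rewrite rule plus a trigger. A minimal sketch, assuming the single-trigger form on this slide (hypothetical names; Brill's tagger uses richer, template-derived triggers):

    from dataclasses import dataclass

    @dataclass
    class Transformation:
        from_tag: str   # rewrite rule source, e.g. "MD"
        to_tag: str     # rewrite rule target, e.g. "NN"
        prev_tag: str   # triggering environment, e.g. prevT=DT

        def triggers(self, tags, i):
            # Fires when the current tag matches the rule's source tag
            # and the preceding tag matches the triggering environment.
            return i > 0 and tags[i] == self.from_tag and tags[i - 1] == self.prev_tag

    rule = Transformation("MD", "NN", "DT")        # MD -> NN if prevT=DT
    tags = ["DT", "MD", "VB"]                      # The/DT can/MD rusted/VB
    tags = [rule.to_tag if rule.triggers(tags, i) else t for i, t in enumerate(tags)]
    print(tags)                                    # ['DT', 'NN', 'VB']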

TBL: Training
1. Apply the initial labeling procedure.
2. Consider all possible transformations; select the transformation with the best score on the training data.
3. Add that rule to the end of the rule cascade.
4. Re-tag the training data with the selected rule, then repeat steps 2-3 until there is no more improvement.
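
A sketch of this greedy loop, under assumed interfaces (initial_label maps words to tags; each candidate rule has an apply(words, tags) method returning a new tag list) rather than Brill's actual implementation:

    def tbl_train(words, gold, initial_label, candidate_rules, threshold=1):
        tags = initial_label(words)                          # step 1: initial labeling

        def gain(rule, tags):
            # Net error reduction if this rule were applied now.
            new = rule.apply(words, tags)
            return sum(n == g for n, g in zip(new, gold)) - \
                   sum(t == g for t, g in zip(tags, gold))

        cascade = []
        while True:
            best = max(candidate_rules, key=lambda r: gain(r, tags))   # step 2
            if gain(best, tags) < threshold:                 # step 4: stop
                return cascade
            cascade.append(best)                             # step 3: extend cascade
            tags = best.apply(words, tags)                   # re-tag, then iterate

Note that each iteration scores every candidate against all of the training data; this is exactly the cost flagged under "Disadvantages" below.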

Error Reduction

TBL: Testing
For each test instance:
1. Apply the initial labeling.
2. Apply, in sequence, all matching transformations.
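
Decoding is then just a replay of the learned cascade (same assumed interfaces as the training sketch above):

    def tbl_decode(words, initial_label, cascade):
        tags = initial_label(words)          # 1. initial labeling
        for rule in cascade:                 # 2. apply all transformations,
            tags = rule.apply(words, tags)   #    in the order they were learned
        return tags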

Core Design Decisions
- Choose the initial-state annotator. Many possibilities, from simple heuristics to complex learned classifiers.
- Specify the legal transformations. 'Transformation templates' are instantiated from the data; they can be simple and local or complex and long-distance.
- Define the evaluation function for selecting a transformation. Measures: accuracy, F-measure, parse score.
- Choose a stopping criterion: no improvement, a gain threshold, etc.

More Design Decisions
- Issue: applying one transformation can change the triggering environments seen by later (downstream) applications.
- Approach: constrain how transformations are applied:
  - Specify the order of transformation application.
  - Decide whether transformations are applied immediately, or only after the whole corpus is checked for triggering environments.

Ordering Example
Example sequence: A A A A A
Transformation: if prevT == A, A -> B
- If the whole corpus is checked for contexts before application: A B B B B
- If applied immediately, left-to-right: A B A B A
- If applied immediately, right-to-left: A B B B B
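
All three behaviors can be reproduced directly; a small sketch (hypothetical helpers) of the "if prevT == A, A -> B" rule under each policy:

    def apply_delayed(tags):
        # Scan the whole sequence for triggering contexts first,
        # then rewrite every triggered position at once.
        hits = [i for i in range(1, len(tags))
                if tags[i - 1] == "A" and tags[i] == "A"]
        return ["B" if i in hits else t for i, t in enumerate(tags)]

    def apply_immediate(tags, order):
        # Rewrite as soon as a trigger is seen, so earlier rewrites
        # can destroy (or create) contexts for later positions.
        tags = list(tags)
        for i in order:
            if i > 0 and tags[i - 1] == "A" and tags[i] == "A":
                tags[i] = "B"
        return tags

    seq = ["A"] * 5
    print(apply_delayed(seq))                              # A B B B B
    print(apply_immediate(seq, range(5)))                  # A B A B A
    print(apply_immediate(seq, reversed(range(5))))        # A B B B B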

Comparisons
Decision trees are similar in applying a cascade of tests for classification, and similar in interpretability. TBL is:
- More general: any decision tree can be converted to a TBL rule list.
- More focused in its selection criterion: accuracy rather than information gain.
- More powerful:
  - Can perform arbitrary transformations on the output
  - Can refine an arbitrary existing labeling
  - Multipass processing
- More expensive: every candidate transformation is tested against, and applied to, all of the training data.

Case Study: POS Tagging

TBL Part-of-Speech Tagging
Design decisions:
- Select the initial labeling
- Define the transformation templates
- Select the evaluation criterion
- Determine the application order

POS Initial Labeling & Transformations
- Initial-state annotation? Heuristic: the most likely tag for the word in the training corpus.
- Unknown words (in testing)? Heuristic: capitalized -> NNP; lowercase -> NN.
- Transformation templates:
  - Rewrite rules: TagA -> TagB
  - Trigger contexts drawn from 3-word windows, either non-lexicalized or lexicalized
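
A sketch of that initial annotator (assumed function names; the unknown-word heuristic is the one on this slide):

    from collections import Counter, defaultdict

    def train_initial_annotator(tagged_corpus):
        # tagged_corpus: iterable of (word, tag) pairs
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        best = {w: c.most_common(1)[0][0] for w, c in counts.items()}

        def annotate(words):
            # Known word: its most frequent training tag.
            # Unknown word: NNP if capitalized, else NN.
            return [best.get(w, "NNP" if w[:1].isupper() else "NN")
                    for w in words]
        return annotate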

Non-lexicalized Templates
No word references, only tags:
- t-1 == z
- t+1 == z
- t-2 == z
- t-1 == z or t-2 == z
- t+1 == z or t+2 == z or t+3 == z
- t-1 == z and t+1 == w
- t-1 == z and t-2 == w
- ...
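
Each template is instantiated over the tags seen in the data; a sketch for the first template alone (hypothetical rule representation), which makes the size of the candidate space concrete:

    def instantiate_prev_tag_template(tagset):
        # 'Change a to b if t-1 == z', for all distinct tag pairs (a, b)
        # and all trigger tags z -- one template yields O(|T|^3) candidates.
        return [(a, b, ("t-1", z))
                for a in tagset for b in tagset if a != b
                for z in tagset]

    print(len(instantiate_prev_tag_template({"DT", "MD", "NN", "VB"})))  # 48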

Lexicalized Templates
Add word references as well as tags. Use all non-lexicalized templates, plus:
- w-1 == w
- w-2 == w
- w-1 == w or w-2 == w
- w0 == w and w-1 == x
- w0 == w and t-1 == z
- w0 == w
- w-1 == w and t-1 == z
- ...

Remaining Design Decisions
- Optimization criterion: greatest reduction in tagging error; training stops when no rule reduces error beyond a threshold.
- Ordering: processing is left-to-right.
- Application: find all triggering environments first, then apply the transformation to every environment found.

Top Non-Lexicalized Transformations

Example Lexicalized Transformations
- IN -> RB if w+2 == "as"
  PTB coding: as tall as -> as/RB tall/JJ as/IN
  Initial tagging: as/IN tall/JJ as/IN
- VBP -> VB if w-1 == "n't" or w-2 == "n't"
  Examples: we do n't eat; we do n't usually sleep
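
A sketch of how the first rule fires, with triggers collected before any rewriting (per the application policy above):

    def in_to_rb_if_w2_as(words, tags):
        # IN -> RB when the word two positions to the right is "as"
        hits = [i for i in range(len(words) - 2)
                if tags[i] == "IN" and words[i + 2] == "as"]
        return ["RB" if i in hits else t for i, t in enumerate(tags)]

    words = ["as", "tall", "as"]
    tags = ["IN", "JJ", "IN"]                 # initial tagging
    print(in_to_rb_if_w2_as(words, tags))     # ['RB', 'JJ', 'IN'] -- matches PTB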

TBL Accuracy
Performance evaluated on different corpora:
- Penn WSJ: 96.6%
- Penn Brown: 96.3%
- Original Brown: 96.5%
- Baseline: 92.4%
Lexicalized vs. non-lexicalized transformations:
- Non-lexicalized, WSJ: 96.3%
- Lexicalized, WSJ: 96.6%
Rules learned: 447 contextual rules (plus 243 for unknown words)

TBL Summary
- Builds on an initial-state annotation
- Defines transformation templates for the task
- Learns an ordered list of transformations on labels, by iteratively selecting the best transformation
- Classifies by applying the initial-state annotator and then the applicable transformations, in order

Analysis
Contrasts:
- Distinctive learning strategy: improves initial classification outputs
- Exploits the current tagging state (previous, current, and following tags)
- Can apply to classification, sequence learning, and structure
- Potentially global feature use: short- vs. long-distance triggers
- Arbitrary transformations and rules

TBL: Advantages
Pros:
- Optimization: directly optimizes accuracy
- Goes beyond classification to sequences and structures (parsing, etc.)
- Meta-classification: uses an initial-state annotation and explicitly manipulates the full labeled output; the output of one transformation is visible to later transformations

TBL: Disadvantages
Cons:
- Computationally expensive: learns by iterating over all possible template instantiations, over all instances in the training data. Modifications have been proposed to accelerate training.
- No probabilistic interpretation: can't produce ranked n-best lists.

HW #9

TBL for Text Classification
Transformations have the form:
  if (feature is present) and (classLabel == class1), then classLabel -> class2
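
So each transformation inspects one feature and rewrites the whole-document label; a minimal sketch of applying an ordered rule list:

    def classify(features, init_class, rules):
        # features: set of feature names present in the document
        # rules: ordered (feature, from_class, to_class) triples
        label = init_class
        for feat, src, dst in rules:
            if feat in features and label == src:
                label = dst
        return label

    print(classify({"we", "talk"}, "guns", [("talk", "guns", "mideast")]))
    # -> mideast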

Q1: Build a TBL Trainer
- Input: a training_data file
- Output: a model_file whose rule lines have the form: featureName from_class to_class gain
- Parameter: gain threshold
- Example (initial class followed by a rule line): guns / talk guns mideast 89
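
One way the greedy selection might look, assuming documents have already been read into (goldLabel, featureSet) pairs and that the initial class is the training majority class (both assumptions, not the assignment spec):

    from collections import Counter

    def train_tbl(docs, gain_threshold=1):
        labels = {gold for gold, _ in docs}
        init = Counter(gold for gold, _ in docs).most_common(1)[0][0]
        current = [init] * len(docs)
        rules = []
        while True:
            gains = Counter()
            for k, (gold, feats) in enumerate(docs):
                for feat in feats:
                    for dst in labels - {current[k]}:
                        # +1 if this rewrite fixes the doc, -1 if it breaks it
                        gains[(feat, current[k], dst)] += \
                            (dst == gold) - (current[k] == gold)
            if not gains:
                break
            rule, gain = gains.most_common(1)[0]
            if gain < gain_threshold:
                break
            rules.append((rule, gain))        # featureName from_class to_class gain
            feat, src, dst = rule
            current = [dst if feat in feats and cur == src else cur
                       for (gold, feats), cur in zip(docs, current)]
        return init, rules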

Q2: Build a TBL Decoder
- Input: test_file, model_file
- Output: output_file
- Parameter: N = the number of transformations to apply
- Output file lines: instanceID goldLabel outputLabel rule1 rule2 ...
  where each rule is: feature from_class to_class
- Example line: inst1 guns mideast we guns misc talk misc mideast
  (two rules fired: "we guns misc", then "talk misc mideast")
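
And the decoder replays the first N model rules per instance, recording the ones that fire (same assumed document representation as the trainer sketch above):

    def decode_tbl(instances, init_class, rules, n):
        # instances: (instanceID, goldLabel, featureSet) triples;
        # rules: (feature, from_class, to_class) triples from the model file
        lines = []
        for inst_id, gold, feats in instances:
            label, fired = init_class, []
            for feat, src, dst in rules[:n]:
                if feat in feats and label == src:
                    label = dst
                    fired.append(f"{feat} {src} {dst}")
            lines.append(" ".join([inst_id, gold, label] + fired))
        return lines

    print(decode_tbl([("inst1", "guns", {"we", "talk"})], "guns",
                     [("we", "guns", "misc"), ("talk", "misc", "mideast")], 2))
    # ['inst1 guns mideast we guns misc talk misc mideast']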