Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Presentation transcript:

Boosting Applied to Tagging and PP Attachment By Aviad Barzilai

Outline: Introduction to AdaBoost | AdaBoost & NLP | Tagging & PP Attachment

Introduction to AdaBoost: binary classification. A weak classifier is a classifier that does better than chance; a strong classifier is a classifier with a low error rate. Weak classifiers can be much easier to find than strong ones.

Introduction to AdaBoost: notation for binary classification. X: the examples; Y: the labels (in {-1, +1}); D: the weights (a distribution over examples); h(x): the classifiers. Goal: build a strong classifier using weak classifiers.

Introduction to AdaBoost: the boosting loop. Find the "best" weak classifier, update the weights to reflect its errors, and repeat. (A formal definition of "best" will be given later.)

Introduction to AdaBoost: finally, create a strong classifier by combining the selected weak classifiers in a weighted vote.
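
The loop just described can be written down compactly. Below is a minimal, illustrative sketch of confidence-rated AdaBoost for binary labels in {-1, +1}; the function name fit_adaboost and the idea of passing a fixed pool of candidate rules are my own simplifications, not the article's implementation.

```python
import math

def fit_adaboost(examples, labels, candidate_rules, rounds=10):
    """Confidence-rated AdaBoost sketch for binary labels in {-1, +1}.

    candidate_rules: callables mapping an example to a real number; the sign
    is the predicted class, the magnitude the confidence.
    Returns f(x) = sum_t h_t(x); the strong classifier is sign(f(x)).
    """
    n = len(examples)
    weights = [1.0 / n] * n        # importance weights, initially uniform
    chosen = []                     # weak hypotheses selected so far

    for _ in range(rounds):
        # Find the "best" weak hypothesis: the one with the smallest
        # normalization factor Z = sum_i D(i) * exp(-y_i * h(x_i)).
        def z_value(rule):
            return sum(w * math.exp(-y * rule(x))
                       for w, x, y in zip(weights, examples, labels))
        best = min(candidate_rules, key=z_value)
        z = z_value(best)
        chosen.append(best)

        # Update the weights to reflect the errors: examples the chosen
        # hypothesis got wrong (or predicted with low confidence) gain weight.
        weights = [w * math.exp(-y * best(x)) / z
                   for w, x, y in zip(weights, examples, labels)]

    # The final hypothesis is a confidence-weighted vote of the weak ones;
    # predict +1 if it is non-negative, -1 otherwise.
    def confidence_weighted_vote(x):
        return sum(h(x) for h in chosen)
    return confidence_weighted_vote
```

A candidate rule could be as simple as `lambda x: 0.5 if x.get("prev_word") == "the" else -0.5`, echoing the rules of thumb described in the next section; sign(f(x)) gives the final prediction.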

Introduction to AdaBoost | AdaBoost & NLP | Tagging & PP Attachment

AdaBoost & NLP: The idea of boosting is to combine many simple "rules of thumb", such as "the current word is a noun if the previous word is the." Any one such rule often gives incorrect classifications; the main idea of boosting is to combine many such rules in a principled manner to produce a single, highly accurate classification rule. These rules of thumb are called weak hypotheses.

AdaBoost & NLP: A weak hypothesis h_t (e.g. "the current word is a noun if the previous word is the") maps each example x to a real number h_t(x). The sign of this number is interpreted as the predicted class (-1 or +1) of example x, and the magnitude |h_t(x)| is interpreted as the level of confidence in the prediction.

AdaBoost & NLP: AdaBoost outputs a final hypothesis that makes predictions using a simple vote of the weak hypotheses' predictions, taking into account the varying confidences of the different predictions.

AdaBoost & NLP: the training loop. Start with a training set and importance weights (initially uniform); get a weak hypothesis; update the importance weights; repeat; finally output the final hypothesis. "In our experiments, we used cross validation to choose the number of rounds T."

AdaBoost & NLP: For each candidate hypothesis, W_{b,s} is the sum of the weights of the examples for which y_i = s and the result of the predicate is b. Choose the "best" hypothesis (the one with minimal error).
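
For reference, the quantities involved can be written out as follows, following Schapire and Singer (1998); writing the predicate as φ is my notation, not the slide's:

```latex
% W_{b,s}: total weight of examples with label s whose predicate value is b.
W_{b,s} \;=\; \sum_{i:\,\phi(x_i)=b,\; y_i=s} D(i),
\qquad b \in \{0,1\},\; s \in \{-1,+1\}
% The hypothesis's normalization factor, which the weak learner tries to minimize:
Z \;=\; 2\sum_{b \in \{0,1\}} \sqrt{W_{b,+1}\, W_{b,-1}}
```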

AdaBoost & NLP: Our default approach to multiclass problems is to use Schapire and Singer's (1998) AdaBoost.MH algorithm. The main idea of this algorithm is to regard each example with its multiclass label as several binary-labeled examples. Suppose that the possible classes are 1,…,k. We map each original example x with label y to k binary-labeled derived examples (x, 1),…,(x, k), where example (x, c) is labeled +1 if c = y and -1 otherwise.
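
A small sketch of this reduction; the helper name expand_multiclass is hypothetical.

```python
def expand_multiclass(examples, labels, classes):
    """Map each multiclass example to k binary-labeled derived examples.

    (x, c) is labeled +1 if c equals x's true label y, and -1 otherwise,
    as in the AdaBoost.MH reduction.
    """
    derived = []
    for x, y in zip(examples, labels):
        for c in classes:
            derived.append(((x, c), +1 if c == y else -1))
    return derived

# Example: one word token labeled "NN" among the tag set {"NN", "VB", "DT"}
pairs = expand_multiclass(["president"], ["NN"], ["NN", "VB", "DT"])
# -> [(("president", "NN"), 1), (("president", "VB"), -1), (("president", "DT"), -1)]
```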

Introduction to AdaBoost | AdaBoost & NLP | Tagging & PP Attachment

Tagging (with AdaBoost): "In corpus linguistics, part-of-speech tagging is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc."

Tagging (with AdaBoost): The authors used the UPenn Treebank corpus (Marcus et al., 1993). The corpus uses 80 labels, which comprise 45 parts of speech properly so-called and 35 indeterminate tags representing annotator uncertainty. The authors introduced an 81st label, ##, for paragraph separators.

Tagging (with AdaBoost): The straightforward way of applying boosting to tagging is to use AdaBoost.MH: each word token represents an example, and the classes are the 81 part-of-speech tags.

Tagging (with AdaBoost): Weak hypotheses are identified with "attribute = value" predicates. The article uses three types of attributes: lexical, contextual, and morphological.

Tagging (with AdaBoost): Lexical attributes: the current word as a downcased string (S), its capitalization (C), and its most-frequent tag in training (T). Contextual attributes: the string (LS), capitalization (LC), and most-frequent tag (LT) of the preceding word, and similarly for the following word (RS, RC, RT). Morphological attributes: the inflectional suffix (I) of the current word, as provided by an automatic stemmer, and the last two (S2) and last three (S3) letters of the current word.
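
To make the attribute inventory concrete, here is a hedged sketch of how one token's attributes might be collected; the function name, the most-frequent-tag lookup, and the stemmer stand-in are placeholders of mine, not the authors' code.

```python
def token_attributes(words, most_frequent_tag, i, suffix_of):
    """Collect the lexical, contextual and morphological attributes of word i.

    words: list of word strings in the sentence, padded at both ends so that
        i-1 and i+1 are always valid.
    most_frequent_tag: dict mapping a downcased word to its most-frequent
        training tag; suffix_of: stand-in for an automatic stemmer.
    """
    w, left, right = words[i], words[i - 1], words[i + 1]

    def cap(word):                       # crude capitalization class
        return "cap" if word[:1].isupper() else "lower"

    return {
        # Lexical attributes of the current word
        "S": w.lower(), "C": cap(w), "T": most_frequent_tag.get(w.lower(), "??"),
        # Contextual attributes of the neighbouring words
        "LS": left.lower(), "LC": cap(left), "LT": most_frequent_tag.get(left.lower(), "??"),
        "RS": right.lower(), "RC": cap(right), "RT": most_frequent_tag.get(right.lower(), "??"),
        # Morphological attributes
        "I": suffix_of(w), "S2": w[-2:], "S3": w[-3:],
    }
```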

Tagging (with AdaBoost): The authors conducted four experiments on tagging the UPenn Treebank corpus using AdaBoost.

Tagging (with AdaBoost): The 3.28% error rate is not significantly different (at p = 0.05) from that of the best-known single tagger, Ratnaparkhi's Maxent tagger, which achieves 3.11% error on the same data. The results are not as good as those achieved by Brill and Wu's voting scheme, yet the experiments described in the article use very simple features.

Introduction to AdaBoost | AdaBoost & NLP | Tagging & PP Attachment

Tagging (with MaxEnt): "According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default."

Tagging (with MaxEnt): In MaxEnt we "model all that is known and assume nothing about what is unknown." Model all that is known: satisfy a set of constraints that must hold. Assume nothing about what is unknown: choose the most "uniform" distribution, i.e. the one with maximum entropy.

Tagging (with MaxEnt): A feature (a.k.a. feature function or indicator function) is a binary-valued function on events, defined over A × B, where A is the set of possible classes and B is the space of contexts (e.g. neighboring words/tags).
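
A hedged example of such a feature function; the specific feature (tag NN after the word "the") is my illustration, echoing the rule of thumb used earlier.

```latex
% A feature is a binary-valued function f : A \times B \to \{0, 1\}, e.g.
f(a, b) \;=\;
\begin{cases}
  1 & \text{if } a = \text{NN and the previous word in context } b \text{ is ``the''},\\
  0 & \text{otherwise.}
\end{cases}
```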


Tagging (with MaxEnt): The constraints equate the observed expectation of each feature (computed with the observed probability) with its model expectation (computed with the model's probability). The task is to find p*, the maximum-entropy distribution satisfying these constraints; the solution is obtained iteratively.
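
As a sketch of what "observed expectation = model expectation" means formally, the standard MaxEnt formulation (with a a class and b a context) is:

```latex
% Constraint for each feature f_j: model expectation matches observed expectation.
\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_j(a,b) \;=\; \sum_{a,b} \tilde{p}(a,b)\, f_j(a,b)
% Subject to these constraints, the maximum-entropy solution p^* has the
% log-linear (exponential) form, with weights \lambda_j found iteratively
% (e.g. by Generalized Iterative Scaling):
p^*(a \mid b) \;=\; \frac{1}{Z(b)} \exp\Big(\sum_j \lambda_j f_j(a,b)\Big),
\qquad Z(b) = \sum_{a'} \exp\Big(\sum_j \lambda_j f_j(a',b)\Big)
```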

Tagging (with MaxEnt): "The 3.28% error rate is not significantly different (at p = 0.05) from the error rate of the best-known single tagger, Ratnaparkhi's Maxent tagger, which achieves 3.11% error on our data."

PP attachment (prepositional phrase attachment with AdaBoost): "In grammar, a preposition is a part of speech that introduces a prepositional phrase. For example, in the sentence 'The cat sleeps on the sofa', the word 'on' is a preposition, introducing the prepositional phrase 'on the sofa'. Simply put, a preposition indicates a relation between things mentioned in a sentence."

PP attachment: Example. "Congress accused the president of peccadillos" is classified according to the attachment site of the prepositional phrase. Attachment to N: accused [the president of peccadillos]. Attachment to V: accused [the president] [of peccadillos].

PP attachment: The cases of PP attachment that the article addresses define a binary classification problem. Examples have binary labels: positive represents attachment to noun, and negative represents attachment to verb.

PP attachment: The authors used the same training and test data as Collins and Brooks (1995).

PP attachment: Each PP-attachment example is represented by its values for four attributes: the main verb (V), the head word of the direct object (N1), the preposition (P), and the head word of the object of the preposition (N2). For the example above, V = accused, N1 = president, P = of, and N2 = peccadillos. Examples have binary labels: positive represents attachment to noun, negative represents attachment to verb.

PP attachment: The weak hypotheses used correspond to "attribute = value" predicates and conjunctions thereof. There are 16 predicates considered for each example. For the previous example, three of these 16 predicates are (V = accused ∧ N1 = president ∧ N2 = peccadillos), (P = of), and (V = accused ∧ P = of).
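
A sketch of how such conjunction predicates can be enumerated. Reading "16 predicates per example" as the 2^4 subsets of the four attributes (including the empty, always-true conjunction) is my assumption; the helper below simply builds one attribute=value conjunction per subset.

```python
from itertools import combinations

ATTRS = ("V", "N1", "P", "N2")

def conjunction_predicates(example):
    """Build one 'attribute = value' conjunction per subset of the attributes.

    example: dict such as {"V": "accused", "N1": "president",
                           "P": "of", "N2": "peccadillos"}.
    Returns 2**4 = 16 predicates, each a frozenset of (attribute, value)
    pairs; the empty set is the always-true predicate.
    (Assumption: this is how the article's 16 predicates arise.)
    """
    predicates = []
    for size in range(len(ATTRS) + 1):
        for subset in combinations(ATTRS, size):
            predicates.append(frozenset((a, example[a]) for a in subset))
    return predicates

def holds(predicate, example):
    """A conjunction holds iff every attribute = value pair matches the example."""
    return all(example.get(a) == v for a, v in predicate)
```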


PP attachment: After 20,000 rounds of boosting, the test error was down to 14.6 ± 0.6%. This is indistinguishable from the best known result for this problem, namely 14.5 ± 0.6%, reported by Collins and Brooks on exactly the same data.

FIN

Introduction to AdaBoost | AdaBoost & NLP

The article uses very simple weak hypotheses that test the value of a Boolean predicate and make a prediction based on that value. The predicates used are of the form "a = v", for a an attribute and v a value; for example, "PreviousWord = the". If, on a given example x, the predicate holds, the weak hypothesis outputs prediction p_1; otherwise it outputs p_0.

Schapire and Singer (1998) prove that the training error of the final hypothesis is at most ∏_t Z_t, where Z_t = Σ_i D_t(i) exp(−y_i h_t(x_i)). This suggests that the training error can be greedily driven down by designing a weak learner which, on round t of boosting, attempts to find a weak hypothesis h_t that minimizes Z_t.

Given a predicate (e.g. "PreviousWord = the"), we choose p_0 and p_1 to minimize Z. Schapire and Singer (1998) show that Z is minimized when p_b = ½ ln(W_{b,+1} / W_{b,-1}), where W_{b,s} is the sum of the weights of the examples for which y_i = s and the result of the predicate is b; the resulting value is Z = 2 Σ_b √(W_{b,+1} · W_{b,-1}).

In practice, very large values of p_0 and p_1 can cause numerical problems and may also lead to overfitting. Therefore, we usually "smooth" these values, e.g. p_b = ½ ln((W_{b,+1} + ε) / (W_{b,-1} + ε)) for a small ε > 0.
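
Putting the last few slides together, here is a small sketch of how a weak learner could score one predicate: accumulate W_{b,s}, smooth, and return the predictions p_0, p_1 together with Z. The function name and the choice of epsilon are illustrative, not the article's.

```python
import math

def score_predicate(predicate, examples, labels, weights, eps=1e-4):
    """Compute smoothed predictions (p0, p1) and the value Z for one predicate.

    W[b][s] is the total weight of examples whose label is s (+1/-1) and
    whose predicate value is b (True/False), as on the slides.
    """
    W = {True: {+1: 0.0, -1: 0.0}, False: {+1: 0.0, -1: 0.0}}
    for x, y, w in zip(examples, labels, weights):
        W[predicate(x)][y] += w

    # Smoothed, confidence-rated predictions, in the Schapire & Singer style.
    p = {b: 0.5 * math.log((W[b][+1] + eps) / (W[b][-1] + eps))
         for b in (True, False)}

    # Z = 2 * sum_b sqrt(W_{b,+1} * W_{b,-1}); the weak learner prefers
    # the predicate with the smallest Z.
    z = 2.0 * sum(math.sqrt(W[b][+1] * W[b][-1]) for b in (True, False))
    return p[False], p[True], z   # p0 (predicate false), p1 (predicate true), Z
```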

For each hypothesis, W_{b,s} is the sum of the weights of the examples for which y_i = s and the result of the predicate is b. Choose the "best" hypothesis: the one with maximal confidence, i.e. minimal Z.


Our default approach to multiclass problems is to use Schapire and Singer's (1998) AdaBoost.MH algorithm. The main idea of this algorithm is to regard each example with its multiclass label as several binary-labeled examples. Suppose that the possible classes are 1,…,k. We map each original example x with label y to k binary-labeled derived examples (x, 1),…,(x, k), where example (x, c) is labeled +1 if c = y and -1 otherwise.

We maintain a distribution over pairs (x, c), treating each such pair as a separate example. Weak hypotheses are identified with predicates over (x, c) pairs, though they now ignore c. The prediction weights p_{c,0} and p_{c,1}, however, are chosen separately for each class c.

An alternative is to use binary AdaBoost to train a separate discriminator (binary classifier) for each class and combine their outputs by choosing the class c that maximizes f_c(x), where f_c(x) is the final confidence-weighted prediction of the discriminator for class c. This approach is called AdaBoost.MI (multiclass, independent discriminators).

AdaBoost.MI differs from AdaBoost.MH in that predicates are selected independently for each class; we do not require that the weak hypothesis at round t be the same for all classes.
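
A compact sketch of the AdaBoost.MI recipe: one independent confidence-rated discriminator per class, combined by picking the class whose discriminator is most confident. It reuses the hypothetical fit_adaboost trainer sketched earlier, which is assumed to return the confidence-weighted vote f_c(x) of its chosen weak hypotheses.

```python
def fit_adaboost_mi(examples, labels, classes, candidate_rules, rounds=10):
    """AdaBoost.MI sketch: an independent binary discriminator per class."""
    scorers = {}
    for c in classes:
        # One-vs-rest relabeling for class c, trained independently so that
        # each class can pick its own predicates at each round.
        binary_labels = [+1 if y == c else -1 for y in labels]
        scorers[c] = fit_adaboost(examples, binary_labels, candidate_rules, rounds)

    def classify(x):
        # Choose the class whose discriminator gives the highest confidence f_c(x).
        return max(classes, key=lambda c: scorers[c](x))
    return classify
```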