
1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006

2 Tagging methods
Hand-coded
Statistical taggers
Brill (transformation-based) tagger

3 nltk_lite tag package
Types of taggers: tag.Default(), tag.Regexp(), tag.Affix(), tag.Unigram(), tag.Bigram(), tag.Trigram()
Actions: tag.tag(), tag.tagsents(), tag.untag(), tag.train(), tag.accuracy(), tag.tag2tuple(), tag.string2words(), tag.string2tags()
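
The nltk_lite API above is from 2006 and has since been folded into the main nltk package under slightly different names. A minimal sketch of the rough correspondence, assuming a current NLTK install with the Brown corpus downloaded (the class names, corpus slices, and 'NN' default below are my choices, not the lecture's):

```python
import nltk
from nltk.corpus import brown  # requires nltk.download('brown')

sent = "The race for outer space".split()

# tag.Default() -> nltk.DefaultTagger: tags every token with one fixed tag.
default = nltk.DefaultTagger('NN')
print(default.tag(sent))

# tag.Affix() -> nltk.AffixTagger: tags by word suffix, learned from tagged data.
affix = nltk.AffixTagger(brown.tagged_sents(categories='news')[:500],
                         affix_length=-3, backoff=default)
print(affix.tag(sent))

# tag.accuracy() -> tagger.accuracy() (called evaluate() in older NLTK releases).
print(affix.accuracy(brown.tagged_sents(categories='news')[500:600]))
```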

4 Hand-coded Tagger
Make up some regexp rules that make use of morphology
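
A minimal sketch of such a hand-coded tagger using the current nltk.RegexpTagger; the particular patterns below are illustrative guesses, not the rules from the lecture:

```python
import nltk

# Each (pattern, tag) pair is tried in order; the first matching pattern wins.
patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds: running, tagging
    (r'.*ed$', 'VBD'),                  # simple past: walked, tagged
    (r'.*ly$', 'RB'),                   # adverbs: quickly
    (r'.*ould$', 'MD'),                 # modals: would, could, should
    (r'.*s$', 'NNS'),                   # plural nouns: tables
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers: 42, 3.14
    (r'.*', 'NN'),                      # everything else: default to noun
]

regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag("The old dogs were barking loudly".split()))
```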

5 Compare to Brown tags

6 Training and Testing of Learning Algorithms
Algorithms that “learn” from data see a set of examples and try to generalize from them.
Training set: the examples the algorithm is trained on
Test set: also called held-out data or unseen data
Use this for evaluating your algorithm
Must be separate from the training set
–Otherwise, you cheated!
“Gold” standard: a test set that a community has agreed on and uses as a common benchmark.

7 Cross-Validation of Learning Algorithms
Cross-validation set: part of the training set, used for tuning parameters of the algorithm without “polluting” (tuning to) the test data.
Train on x% of the training data, then cross-validate on the remaining (100-x)%
–E.g., train on 90% of the training data, cross-validate (test) on the remaining 10%
–Repeat several times with different splits
This allows you to choose the best settings to then use on the real test set.
–Evaluate on the test set only at the very end, after you’ve gotten your algorithm as good as possible on the cross-validation set.
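
A minimal sketch of this repeated-split procedure for a unigram tagger, assuming current NLTK and the Brown news category as training data (the 10-fold split and the 'NN' default are my choices):

```python
import nltk
from nltk.corpus import brown

tagged = list(brown.tagged_sents(categories='news'))
k = 10                        # number of splits
fold = len(tagged) // k

scores = []
for i in range(k):
    dev = tagged[i * fold:(i + 1) * fold]                 # held-out slice for this split
    train = tagged[:i * fold] + tagged[(i + 1) * fold:]   # the rest is training data
    tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))
    scores.append(tagger.accuracy(dev))   # evaluate() in older NLTK releases

print(sum(scores) / k)        # average cross-validation accuracy
```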

8 Strong Baselines
When designing NLP algorithms, you need to evaluate them by comparing to others.
Baseline algorithm: an algorithm that is relatively simple but can be expected to do well.
Should get the best score possible by doing the somewhat obvious thing.

9 A Tagging Baseline
Find the most likely tag for the most frequent words
Frequent words are ambiguous
You’re likely to see frequent words in any collection
–Will always see “to” but might not see “armadillo”
How to do this?
First find the most likely words and their tags in the training data
Train a tagger that looks up these results in a table
–Note that the tag.Lookup() tagger type is not defined in this version of nltk_lite, so we’ll write our own.

10 Find the most frequent words and their tags
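
The original slide showed this step as a code screenshot. A minimal sketch of one way to do it with current NLTK, using a frequency distribution over words and a conditional frequency distribution over their tags (the 100-word cutoff and the news category are assumptions):

```python
import nltk
from nltk.corpus import brown

tagged_words = brown.tagged_words(categories='news')

# The 100 most frequent word types in the training data.
word_freq = nltk.FreqDist(word for word, tag in tagged_words)
most_frequent = [word for word, count in word_freq.most_common(100)]

# For each of those words, the tag it is seen with most often.
cfd = nltk.ConditionalFreqDist(tagged_words)
likely_tags = {word: cfd[word].max() for word in most_frequent}
print(list(likely_tags.items())[:10])
```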

11 Subclassing a Python Class
The Lookup module isn’t in our version of nltk_lite
Let’s make a subclass of the tag.Unigram class that has this functionality.
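
The subclass itself appeared on the following slides as screenshots. In current NLTK the same lookup-table behavior is available without subclassing, because UnigramTagger accepts a precomputed word-to-tag dict as its model; a minimal sketch, where likely_tags is the dict built in the previous sketch:

```python
import nltk

# A lookup tagger: consult the table of likely tags for frequent words,
# and fall back to 'NN' for everything else.
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.tag("The jury said it did not believe the report".split()))
```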

12

13

14 Details from the Unigram class

15 Define our own tagger class

16 Use our own tagger class

17 N-Grams
The N stands for how many terms are used
Unigram: 1 term (0th order)
Bigram: 2 terms (1st order)
Trigram: 3 terms (2nd order)
–Usually don’t go beyond this
You can use different kinds of terms, e.g.:
Character-based n-grams
Word-based n-grams
POS-based n-grams
Ordering: often adjacent, but not required
We use n-grams to help determine the context in which some linguistic phenomenon happens.
E.g., look at the words before and after the period to see if it is the end of a sentence or not.
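
A quick sketch of extracting word and character n-grams with current NLTK utility functions (the sample sentence is arbitrary):

```python
import nltk

tokens = "to see if it is the end of a sentence".split()

print(list(nltk.bigrams(tokens)))        # word bigrams (1st order)
print(list(nltk.ngrams(tokens, 3)))      # word trigrams (2nd order)
print(list(nltk.ngrams("sentence", 3)))  # character trigrams of one word
```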

18 Tagging with lexical frequencies (modified from Massimo Poesio’s lecture)
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to “race” given its lexical frequency
Solution: choose the tag with the greater lexical likelihood, i.e., compare P(race|VB) with P(race|NN)
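
These lexical likelihoods can be estimated directly from a tagged corpus; a minimal sketch against the Brown corpus with current NLTK (lowercasing and the corpus choice are my assumptions):

```python
import nltk
from nltk.corpus import brown

# Estimate P(word | tag) as a relative frequency within each tag.
cfd = nltk.ConditionalFreqDist((tag, word.lower())
                               for word, tag in brown.tagged_words())

print(cfd['VB'].freq('race'))  # estimate of P(race | VB)
print(cfd['NN'].freq('race'))  # estimate of P(race | NN)
```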

19 Unigram Tagger
Train on a set of sentences
Keep track of how many times each word is seen with each tag.
After training, associate with each word its most likely tag.
Problem: many words never seen in the training data.
Solution: have a default tag to “backoff” to.

20 Unigram tagger with Backoff
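
The code for this was shown on the slide as a screenshot; a minimal equivalent sketch with current NLTK (the 4000-sentence split and the 'NN' default are my choices):

```python
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:4000]
test_sents = brown.tagged_sents(categories='news')[4000:]

# Unigram tagger that backs off to a default 'NN' tag for unseen words.
default_tagger = nltk.DefaultTagger('NN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)

print(unigram_tagger.tag("The armadillo raced across the road".split()))
print(unigram_tagger.accuracy(test_sents))  # evaluate() in older NLTK releases
```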

21 What’s wrong with unigram?
The most frequent tag isn’t always right!
Need to take the context into account
Which sense of “to” is being used?
Which sense of “like” is being used?

22 N-gram tagger
Uses the preceding N-1 predicted tags
Also uses the unigram estimate for the current word

23 N-gram taggers in nltk_lite
Constructs a frequency distribution describing how often each word is tagged with each tag in different contexts.
The context considered consists of the word to be tagged and the tags of the n-1 previous words.
After training, tags each word by assigning it the tag with the maximum frequency given its context.
Assigns the “None” tag if it sees a word in a context for which it has no data (i.e., a context it has not seen).
Tuning parameters:
“cutoff” is the minimal number of times that the context must have been seen in training in order to be incorporated into the statistics
Default cutoff is 1
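
A minimal sketch of the cutoff behavior with the current NLTK BigramTagger; the cutoff value of 2 and the absence of a backoff tagger are deliberate choices to make the None tags visible, and the default cutoff in current NLTK may differ from the nltk_lite default mentioned above:

```python
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:4000]

# With cutoff=2, a (previous tag, word) context is only used if it was seen
# often enough in training; with no backoff tagger, everything else gets None.
bigram_strict = nltk.BigramTagger(train_sents, cutoff=2)
print(bigram_strict.tag("The race for outer space".split()))
```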

24 Bigram Tagging (modified from Diane Litman’s version of Steve Bird’s notes)
For tagging, in addition to considering the token’s type, the context also considers the tags of the n preceding tokens
–What is the most likely tag for word n, given word n and tag n-1?
The tagger picks the tag which is most likely for that context.

25 Reading the Bigram table
The current word
The predicted POS
The previously seen tag

26 Combining Taggers using Backoff (modified from Diane Litman’s version of Steve Bird’s notes)
Use more accurate algorithms when we can, back off to wider coverage when needed.
Try tagging the token with the 1st-order (bigram) tagger.
If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order (unigram) tagger.
If the 0th-order tagger is also unable to find a tag, use the default tagger.
Important point: bigram and trigram taggers use the previous tag context to assign new tags. If they see a tag of “None” in the previous context, they will output “None” too.

27 Demonstrating the n-gram taggers
Trained on brown.tagged(‘a’), tested on brown.tagged(‘b’)
Backs off to a default of ‘nn’
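
The demonstration itself was shown as screenshots. A minimal sketch of the same backoff chain with current NLTK, where the old Brown section 'a' corresponds roughly to the news category and section 'b' to editorial (that mapping, and the uppercase 'NN' default, are my assumptions):

```python
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')      # roughly the old section 'a'
test_sents = brown.tagged_sents(categories='editorial')  # roughly the old section 'b'

# Backoff chain: trigram -> bigram -> unigram -> default 'NN'.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

for tagger in (t0, t1, t2, t3):
    # accuracy() is called evaluate() in older NLTK releases.
    print(type(tagger).__name__, tagger.accuracy(test_sents))
```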

28 Demonstrating the n-gram taggers

29 Combining Taggers
The bigram backoff tagger did worse than the unigram! Why?
Why does it get better again with trigrams?
How can we improve these scores?

30 Rule-Based Tagger: The Linguistic Complaint (modified from Diane Litman’s version of Steve Bird’s notes)
Where is the linguistic knowledge of a tagger?
Just a massive table of numbers
Aren’t there any linguistic insights that could emerge from the data?
Could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.

31 The Brill tagger (slide modified from Massimo Poesio’s)
An example of Transformation-Based Learning
Basic idea: do a quick job first (using frequency), then revise it using contextual rules.
Painting metaphor from the readings
Very popular (freely available, works fairly well)
A supervised method: requires a tagged corpus

32 Brill Tagging: In more detail
Start with simple (less accurate) rules… learn better ones from a tagged corpus
Tag each word initially with its most likely POS
Examine the set of transformations to see which most improves tagging decisions compared to the tagged corpus
Re-tag the corpus using the best transformation
Repeat until, e.g., performance doesn’t improve
Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text
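
Current NLTK ships a transformation-based learner in nltk.tag.brill and nltk.tag.brill_trainer; a minimal sketch of the loop above, assuming the standard fntbl37 rule templates, a unigram initial tagger, and a small rule budget (all of those choices are mine, not the lecture's):

```python
import nltk
from nltk.corpus import brown
from nltk.tag import brill, brill_trainer

train_sents = brown.tagged_sents(categories='news')[:4000]
test_sents = brown.tagged_sents(categories='news')[4000:4500]

# Step 1: the quick initial job -- most-likely-tag tagging with an 'NN' default.
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

# Step 2: repeatedly pick the transformation that best fixes the current errors.
trainer = brill_trainer.BrillTaggerTrainer(initial, brill.fntbl37(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print(brill_tagger.accuracy(test_sents))  # evaluate() in older NLTK releases
for rule in brill_tagger.rules()[:5]:     # a few of the learned transformations
    print(rule)
```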

33 An example (slide modified from Massimo Poesio’s)
Examples: “They are expected to race tomorrow.” “The race for outer space.”
Tagging algorithm:
1. Tag all uses of “race” as NN (the most likely tag in the Brown corpus):
They are expected to race/NN tomorrow
the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
They are expected to race/VB tomorrow
the race/NN for outer space

34 Example Rule Transformations

35 Sample Final Rules

36 Error Analysis
To improve your algorithm, examine where it fails on the cross-validation set
It’s often useful to characterize in detail which examples it fails on and which it succeeds on.
Make fixes, and then re-train on the training set, again using cross-validation

37 Error Analysis

38 Error Analysis

39 Assignment + Next Time
I’ve posted an assignment, due in a week
Work in pairs, but only if you work together
In your writeup, make clear who did what, and what you did together
Next week: shallow parsing