I256 Applied Natural Language Processing, Fall 2009. Lecture 6: Introduction to Graphical Models and Part-of-Speech Tagging. Barbara Rosario

Graphical Models
Within the machine learning framework: probability theory plus graph theory. Widely used in:
– NLP
– Speech recognition
– Error-correcting codes
– Systems diagnosis
– Computer vision
– Filtering (Kalman filters)
– Bioinformatics

(Quick intro to) Graphical Models
– Nodes are random variables
– Edges are annotated with conditional probabilities
– Absence of an edge between nodes implies conditional independence
– A "probabilistic database"
[Figure: example graph over nodes A, B, C, D, annotated with the local probabilities P(A), P(D), P(B|A), P(C|A,D)]

Graphical Models
Define a joint probability distribution: P(X_1, ..., X_N) = ∏_i P(X_i | Par(X_i))
For the example graph: P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate the parameters P(A), P(D), P(B|A), P(C|A,D)

Graphical Models
Define a joint probability distribution: P(X_1, ..., X_N) = ∏_i P(X_i | Par(X_i)); here P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate P(A), P(B|A), P(D), P(C|A,D)
Inference: compute conditional probabilities, e.g. P(A|B,D) or P(C|D)
Inference = probabilistic queries; general inference algorithms exist (e.g. Junction Tree)
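As a concrete illustration (mine, not from the slides), here is a minimal Python sketch that encodes this factorization with made-up conditional probability tables and answers a query such as P(A | B, D) by enumeration:

```python
# Hand-made (hypothetical) conditional probability tables for the example graph.
p_A = {True: 0.3, False: 0.7}
p_D = {True: 0.6, False: 0.4}
p_B_given_A = {True: {True: 0.9, False: 0.1},              # P(B | A)
               False: {True: 0.2, False: 0.8}}
p_C_given_AD = {(True, True):  {True: 0.8, False: 0.2},    # P(C | A, D)
                (True, False): {True: 0.5, False: 0.5},
                (False, True): {True: 0.3, False: 0.7},
                (False, False): {True: 0.1, False: 0.9}}

def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return p_A[a] * p_D[d] * p_B_given_A[a][b] * p_C_given_AD[(a, d)][c]

def posterior_A(b, d):
    """Inference by enumeration: P(A | B=b, D=d), summing out C."""
    scores = {a: sum(joint(a, b, c, d) for c in (True, False)) for a in (True, False)}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

print(posterior_A(True, True))
```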

Naïve Bayes models
A simple graphical model in which the features x_i depend on the class Y
Naïve Bayes assumption: all x_i are conditionally independent given Y
Currently used for text classification and spam detection
[Figure: class node Y with children x_1, x_2, x_3]

Naïve Bayes models
Naïve Bayes for document classification: the topic generates the words
[Figure: topic node with children w_1, w_2, ..., w_n]
Inference task: P(topic | w_1, w_2, ..., w_n)

Naïve Bayes for WSD (word sense disambiguation)
[Figure: sense node s_k with context-feature children v_1, v_2, v_3]
Recall the general joint probability distribution: P(X_1, ..., X_N) = ∏_i P(X_i | Par(X_i))
Here: P(s_k, v_1, ..., v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k)

Naïve Bayes for WSD
[Figure: sense node s_k with children v_1, v_2, v_3]
P(s_k, v_1, ..., v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k)
Estimation (training): given data, estimate P(s_k), P(v_1 | s_k), P(v_2 | s_k), P(v_3 | s_k)

Naïve Bayes for WSD
[Figure: sense node s_k with children v_1, v_2, v_3]
P(s_k, v_1, ..., v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k)
Estimation (training): given data, estimate P(s_k), P(v_1 | s_k), P(v_2 | s_k), P(v_3 | s_k)
Inference (testing): compute the conditional probabilities of interest, e.g. P(s_k | v_1, v_2, v_3)
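A minimal sketch of this estimation/inference cycle (my own illustration, not course code) on a toy disambiguation task, using relative-frequency estimates with crude add-one smoothing:

```python
from collections import Counter, defaultdict
from math import prod

# Toy (hypothetical) training data: (sense, context features) pairs for "bank".
train = [("financial", ["money", "loan", "interest"]),
         ("financial", ["account", "money", "deposit"]),
         ("river",     ["water", "shore", "fishing"])]

# Estimation (training): relative-frequency estimates of P(s_k) and P(v | s_k).
sense_counts = Counter(s for s, _ in train)
feat_counts = defaultdict(Counter)
for s, feats in train:
    feat_counts[s].update(feats)
vocab = {v for _, feats in train for v in feats}

def p_sense(s):
    return sense_counts[s] / sum(sense_counts.values())

def p_feat(v, s):
    # Add-one smoothing so an unseen feature does not zero out a sense.
    return (feat_counts[s][v] + 1) / (sum(feat_counts[s].values()) + len(vocab))

# Inference (testing): P(s_k | v_1..v_n) is proportional to P(s_k) * prod_i P(v_i | s_k).
def disambiguate(feats):
    scores = {s: p_sense(s) * prod(p_feat(v, s) for v in feats) for s in sense_counts}
    z = sum(scores.values())
    return {s: sc / z for s, sc in scores.items()}

print(disambiguate(["money", "water"]))
```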

Graphical Models
Given a graphical model:
– Do estimation (find the parameters from data)
– Do inference (compute conditional probabilities)
But how do I choose the model structure (i.e. the edges)?

How to choose the model structure?
[Figure: four candidate graph structures over s_k and v_1, v_2, v_3, differing in which edges are present and in the direction of the arrows]

Model structure
Learn it: structure learning
– Difficult, and needs a lot of data
Or use knowledge of the domain and of the relationships between the variables
– Heuristics
– The fewer dependencies (edges) we can have, the "better": more edges mean more parameters and thus more data needed (sparsity, next class)
– The direction of the arrows also matters
[Figure: graph over s_k, v_1, v_2, v_3 whose extra edges require estimating P(v_3 | s_k, v_1, v_2)]

Generative vs. discriminative
Generative model [Figure: arrows from s_k to v_1, v_2, v_3]:
– P(s_k, v_1, ..., v_3) = P(s_k) ∏_i P(v_i | Par(v_i)) = P(s_k) P(v_1 | s_k) P(v_2 | s_k) P(v_3 | s_k)
– Estimation (training): given data, estimate P(s_k), P(v_1 | s_k), P(v_2 | s_k) and P(v_3 | s_k)
– Inference (testing): compute P(s_k | v_1, v_2, v_3) (there are algorithms to find these conditional probabilities, not covered here)
Discriminative model [Figure: arrows from v_1, v_2, v_3 to s_k]:
– P(s_k, v_1, ..., v_3) = P(v_1) P(v_2) P(v_3) P(s_k | v_1, v_2, v_3)
– The conditional probability of interest, P(s_k | v_1, v_2, v_3), is "ready", i.e. modeled directly
– Estimation (training): given data, estimate P(v_1), P(v_2), P(v_3), and P(s_k | v_1, v_2, v_3)
In short: the generative model must do inference to find the probability of interest; the discriminative model models it directly.

Generative vs. discriminative Don’t worry…. You can use both models If you are interested, let me know But in short: –If the Naive Bayes assumption made by the generative method is not met (conditional independencies not true), the discriminative method can have an edge –But the generative model may converge faster –Generative learning can sometimes be more efficient than discriminative learning; at least when the number of features is large compared to the number of samples

Graphical Models
– Provide a convenient framework for visualizing conditional independence
– Provide general inference algorithms
Next, we'll see a GM (the Hidden Markov Model) for POS tagging

Part-of-speech (English) From Dan Klein’s cs 288 slides

Terminology (modified from Diane Litman's version of Steve Bird's notes)
– Tagging: the process of associating labels with each token in a text
– Tags: the labels; syntactic word classes
– Tag set: the collection of tags used

Example
Typically a tagged text is a sequence of white-space-separated base/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.

Part-of-speech (English) From Dan Klein’s cs 288 slides

POS tagging vs. WSD
Similar tasks: assign a POS tag vs. assign a word sense
– You should butter your toast
– Bread and butter
Using a word as a noun or as a verb involves a different meaning, as in WSD
In practice the two topics, POS tagging and WSD, have been kept distinct, because of their different nature and also because the methods used are different
– Nearby structures are most useful for POS (e.g. is the preceding word a determiner?) but are of little use for word sense
– Conversely, quite distant content words are very effective for determining the semantic sense, but not the POS

Part-of-Speech Ambiguity
From Dan Klein's cs 288 slides
[Figure: examples of a word ambiguous between particle, preposition, and adverb readings]

Part-of-Speech Ambiguity Words that are highly ambiguous as to their part of speech tag

Sources of information
Syntagmatic: the tags of the other words
– AT JJ NN is common
– AT JJ VBP is impossible (or unlikely)
Lexical: look at the words themselves
– The → AT
– Flour → more likely to be a noun than a verb
– A tagger that always chooses the most common tag for each word is 90% correct (often used as a baseline)
Most taggers use both sources.
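A minimal sketch of that most-frequent-tag baseline (my own illustration, using the Brown news section that ships with NLTK):

```python
import nltk
from collections import Counter, defaultdict

# nltk.download('brown')  # needed once
tagged = nltk.corpus.brown.tagged_words(categories='news')

# For each word form, count how often it appears with each tag.
counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word.lower()][tag] += 1

# Baseline: always assign the most common tag observed for the word.
most_common_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(tokens, default='NN'):
    return [(t, most_common_tag.get(t.lower(), default)) for t in tokens]

print(baseline_tag(['The', 'flour', 'is', 'cheap']))
```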

What does tagging do? (modified from Diane Litman's version of Steve Bird's notes)
1. Collapses distinctions: lexical identity may be discarded, e.g. all personal pronouns tagged with PRP
2. Introduces distinctions: ambiguities may be resolved, e.g. deal tagged with NN or VB
3. Helps in classification and prediction

Why POS? (modified from Diane Litman's version of Steve Bird's notes)
A word's POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), of pronunciations for text-to-speech (e.g. OBject the noun vs. obJECT the verb; similarly record), or both (wind)
– Helps in stemming: saw[v] → see, saw[n] → saw
– Limits the range of following words
– Can help select nouns from a document for summarization
– Basis for partial parsing (chunked parsing)

Why POS? From Dan Klein’s cs 288 slides

Choosing a tagset (slide modified from Massimo Poesio's)
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between:
– Getting better information about context
– Making it possible for classifiers to do their job

Some of the best-known tagsets (slide modified from Massimo Poesio's)
– Brown corpus: 87 tags (more when tags are combined)
– Penn Treebank: 45 tags
– Lancaster UCREL C5 (used to tag the BNC): 61 tags
– Lancaster C7: 145 tags!

NLTK Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
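For example (a small sketch of my own; the exact tag-mapping option depends on the NLTK version):

```python
import nltk

# Any tagged corpus exposes tagged_words() (and usually tagged_sents()).
print(nltk.corpus.brown.tagged_words()[:5])
# e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ...]

print(nltk.corpus.treebank.tagged_words()[:3])
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]

# Recent NLTK versions can also map native tags onto a simplified/universal tagset:
print(nltk.corpus.brown.tagged_words(tagset='universal')[:5])
```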

Tagging methods
– Hand-coded
– Statistical taggers: n-gram tagging, HMM, (maximum entropy)
– Brill (transformation-based) tagger

Hand-coded Tagger The Regular Expression Tagger
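The slide's code did not survive in the transcript; here is a minimal sketch along the lines of the NLTK book's regular-expression tagger:

```python
import nltk

# Patterns are tried in order; the first matching regex supplies the tag.
patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*es$', 'VBZ'),                  # 3rd person singular present
    (r'.*ould$', 'MD'),                 # modals
    (r'.*\'s$', 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # default: everything else is a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)

sent = 'The little cat could be chasing mice'.split()
print(regexp_tagger.tag(sent))
```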

Unigram Tagger Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token. –For example, it will assign the tag JJ to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent word) more often than it is used as a verb (e.g. I frequent this cafe).

Unigram Tagger We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger. The training process involves inspecting the tag of each word and storing the most likely tag for any word in a dictionary, stored inside the tagger. We must be careful not to test it on the same data. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10% (or 75% and 25%), and calculate performance on the previously unseen text. Note: this is the general procedure for learning systems.
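A minimal sketch of that procedure (my own, assuming the Brown news section as data; evaluate() is called accuracy() in the newest NLTK):

```python
import nltk

tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Training: store the most likely tag for each word seen in train_sents.
unigram_tagger = nltk.UnigramTagger(train_sents)

# Test on held-out data only.
print(unigram_tagger.evaluate(test_sents))           # typically a bit over 0.8 here
print(unigram_tagger.tag(['The', 'jury', 'said', 'it', 'did']))
```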

N-Gram Tagging
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens. A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.
[Figure: the context used by a trigram tagger]
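A small sketch (mine) of an NLTK bigram tagger on the same 90/10 split, which also previews the problem discussed next:

```python
import nltk

tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Context = current word plus the tag assigned to the previous token.
bigram_tagger = nltk.BigramTagger(train_sents)

sent = [w for (w, _) in test_sents[0]]
print(bigram_tagger.tag(sent))
# Once it hits a context unseen in training it assigns None and then cannot
# recover, so accuracy on held-out text is very low without backoff:
print(bigram_tagger.evaluate(test_sents))
```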

N-Gram Tagging Why not 10-gram taggers?

N-Gram Tagging Why not 10-gram taggers? As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off) Next week: sparsity
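One standard remedy (from the NLTK book, sketched here) is to combine taggers so that a more specific tagger backs off to a more general one when its context was not seen in training:

```python
import nltk

tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Most specific first; each tagger defers to its backoff for unseen contexts.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.evaluate(test_sents))   # clearly better than any single n-gram tagger alone
```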

Markov Model Tagger
Bigram tagger. Assumptions:
– Words are independent of each other
– A word's identity depends only on its tag
– A tag depends only on the previous tag
What does a GM with these assumptions look like?

Markov Model Tagger
[Figure: HMM chain of tags t_1 → t_2 → ... → t_n, with each tag t_i emitting its word w_i]

Markov Model Tagger: Training
For all tags t_j do
– For all tags t_i do
– C(t_j, t_i) = number of occurrences of t_j followed by t_i
For all tags t_i do
– For all words w_i do
– C(w_i, t_i) = number of occurrences of w_i that are labeled as t_i
The transition and emission probabilities are then estimated from these counts, e.g. P(t_i | t_j) = C(t_j, t_i) / C(t_j) and P(w_i | t_i) = C(w_i, t_i) / C(t_i).
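A compact sketch of that counting step (my own illustration) over an NLTK tagged corpus, with no smoothing:

```python
import nltk
from collections import Counter, defaultdict

tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')

trans = defaultdict(Counter)    # C(t_j, t_i): tag t_j followed by tag t_i
emit = defaultdict(Counter)     # C(w_i, t_i): word w_i labeled as t_i
tag_counts = Counter()

for sent in tagged_sents:
    prev = '<s>'                # sentence-start pseudo-tag
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word.lower()] += 1
        tag_counts[tag] += 1
        prev = tag

# Maximum-likelihood estimates from the counts.
def p_trans(t_i, t_j):
    return trans[t_j][t_i] / sum(trans[t_j].values())

def p_emit(w, t):
    return emit[t][w.lower()] / tag_counts[t]

print(p_trans('NN', 'AT'), p_emit('jury', 'NN'))
```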

Markov Model Tagger
Goal: find the optimal tag sequence for a given sentence
– The Viterbi algorithm
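A self-contained Viterbi sketch (my own; the tag set and the p_trans / p_emit functions are assumed, e.g. the estimates from the previous sketch plus smoothing for unseen words):

```python
import math

def viterbi(words, tags, p_trans, p_emit, start='<s>'):
    """Most probable tag sequence under a bigram HMM.
    p_trans(t, prev) = P(t | prev); p_emit(w, t) = P(w | t)."""
    # delta[i][t] = best log-prob of any tag sequence for words[:i+1] ending in t
    delta = [{t: math.log(p_trans(t, start) + 1e-12) +
                 math.log(p_emit(words[0], t) + 1e-12) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[i-1][p] + math.log(p_trans(t, p) + 1e-12))
            delta[i][t] = (delta[i-1][best_prev] +
                           math.log(p_trans(t, best_prev) + 1e-12) +
                           math.log(p_emit(words[i], t) + 1e-12))
            back[i][t] = best_prev
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# e.g. with the counts sketched above:
# print(viterbi('the jury said'.split(), list(tag_counts), p_trans, p_emit))
```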

Sequence free tagging? From Dan Klein’s cs 288 slides

Sequence-free tagging?
Solution: maximum entropy sequence models (MEMMs, maximum entropy Markov models; CRFs, conditional random fields)
From Dan Klein's cs 288 slides

Rule-Based Tagger (modified from Diane Litman's version of Steve Bird's notes)
The linguistic complaint:
– Where is the linguistic knowledge of a tagger? Just massive tables of numbers
– Aren't there any linguistic insights that could emerge from the data?
– Could thus use handcrafted sets of rules to tag input sentences; for example, if the input follows a determiner, tag it as a noun

The Brill tagger (transformation-based tagger) (slide modified from Massimo Poesio's)
An example of transformation-based learning
– Basic idea: do a quick job first (using frequency), then revise it using contextual rules
Very popular (freely available, works fairly well)
– Probably the most widely used tagger (esp. outside NLP)
– ... but not the most accurate: 96.6% / 82.0%
A supervised method: requires a tagged corpus

Brill Tagging: In more detail
Start with simple (less accurate) rules... learn better ones from a tagged corpus:
– Tag each word initially with its most likely POS
– Examine the set of candidate transformations to see which most improves the tagging compared to the gold corpus
– Re-tag the corpus using the best transformation
– Repeat until, e.g., performance doesn't improve
– Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text (see the sketch below)
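A toy sketch of that greedy loop (mine, not Brill's code), using a single transformation template: change tag A to B when the previous tag is Z.

```python
def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag)."""
    a, b, prev = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

def score(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))

def learn_transformations(current, gold, max_rules=10):
    """Greedily pick the rule that most improves accuracy, re-tag, repeat."""
    learned = []
    tagset = set(gold)
    for _ in range(max_rules):
        base = score(current, gold)
        candidates = [(a, b, p) for a in tagset for b in tagset for p in tagset if a != b]
        best = max(candidates, key=lambda r: score(apply_rule(current, r), gold))
        if score(apply_rule(current, best), gold) <= base:
            break                       # no candidate improves the tagging
        learned.append(best)
        current = apply_rule(current, best)
    return learned

# Tiny toy example: initial most-likely tags vs. the gold tags.
gold    = ['PRP', 'VBP', 'VBN', 'TO', 'VB', 'NN']   # they are expected to race tomorrow
initial = ['PRP', 'VBP', 'VBN', 'TO', 'NN', 'NN']   # "race" mis-tagged as NN
print(learn_transformations(initial, gold))          # learns e.g. ('NN', 'VB', 'TO')
```

(NLTK also ships a full Brill tagger trainer with richer rule templates.)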

An example (slide modified from Massimo Poesio's)
Examples:
– They are expected to race tomorrow.
– The race for outer space.
Tagging algorithm:
1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus):
   They are expected to race/NN tomorrow; the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
   They are expected to race/VB tomorrow; the race/NN for outer space

What gets learned? [from Brill 95]
[Figure: tables of tag-triggered transformations and morphology-triggered transformations]
Rules are linguistically interpretable

Tagging accuracies (overview) From Dan Klein’s cs 288 slides

Tagging accuracies From Dan Klein’s cs 288 slides

Tagging accuracies
Taggers are already pretty good on WSJ journal text... what we need is taggers that work on other text!
Performance depends on several factors:
– The amount of training data
– The tag set (the larger, the harder the task)
– The difference between the training and testing corpus (for example, technical domains)
– Unknown words

Common Errors From Dan Klein’s cs 288 slides

Next week
What happens when an event we need was never seen in training (zero counts)? Sparsity
Methods to deal with it
– For example, back-off: if the higher-order estimate (e.g. a trigram probability) is unavailable, use a lower-order one (e.g. the bigram) instead
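For instance, a back-off estimate for tag transitions might look like this sketch (hypothetical count dictionaries; real back-off schemes also discount the higher-order estimates so the probabilities stay normalized):

```python
def backoff_estimate(tri_counts, bi_counts, uni_counts, t1, t2, t3):
    """P(t3 | t1, t2), backing off to lower-order estimates when counts are zero."""
    if tri_counts.get((t1, t2, t3), 0) > 0:
        return tri_counts[(t1, t2, t3)] / bi_counts[(t1, t2)]
    if bi_counts.get((t2, t3), 0) > 0:
        return bi_counts[(t2, t3)] / uni_counts[t2]
    return uni_counts.get(t3, 0) / sum(uni_counts.values())

# Example with toy counts:
print(backoff_estimate({}, {('AT', 'NN'): 3, ('NN', 'AT'): 2}, {'AT': 5, 'NN': 4}, 'VB', 'AT', 'NN'))
```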

Administrivia
– Assignment 2 is out, due September 22
– Grades and "best" solutions to assignment 1 coming soon
Reading for next class: Chapter 6 of Statistical NLP