Lecture 10: NLTK POS Tagging, Part 3
CSCE 771 Natural Language Processing
February 18, 2013

Topics: Taggers; Rule-Based Taggers; Probabilistic Taggers; Transformation-Based Taggers (Brill); Supervised Learning
Readings: Chapter 5.4-?

– 2 – CSCE 771 Spring 2011
Overview
Last Time: Overview of POS Tags
Today: Part of Speech Tagging; Parts of Speech; Rule-Based Taggers; Stochastic Taggers; Transformational Taggers
Readings: Chapter ?

– 3 – CSCE 771 Spring 2011

brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

VN V VD ADJ DET ADV P , CNJ . TO VBZ VG WH
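The code above is NLTK 2 / Python 2 era (simplify_tags=True, nltk.ibigrams). A rough equivalent for NLTK 3, assuming the universal tagset as the stand-in for the old simplified tags (so the tag names in the output will differ):

import nltk
from nltk.corpus import brown

# NLTK 3: tagset='universal' replaces the old simplify_tags=True,
# and nltk.bigrams replaces nltk.ibigrams.
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()   # prints a table of the tags that follow 'often', with counts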

– 4 – CSCE 771 Spring 2011
Highly ambiguous words

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...

– 5 – CSCE 771 Spring 2011 Tag Package

– 6 – CSCE 771 Spring 2011
Python's Dictionary Methods
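A small illustrative sketch of the dictionary operations most used when building word-to-tag lookups (the example words and tags here are made up, not from the slide):

# A few dict operations commonly used for word -> tag lookup tables
# (illustrative example only; the entries are invented for this sketch).
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V'}

pos['furiously'] = 'ADV'          # add or update an entry
print pos.keys()                  # all words
print pos.values()                # all tags
print pos.items()                 # (word, tag) pairs
print pos.get('green', 'UNK')     # lookup with a default for unknown words
print 'ideas' in pos              # membership test

# building a dict from (word, tag) pairs, as done later for the lookup tagger
likely = dict((w, t) for (w, t) in [('the', 'AT'), ('of', 'IN')])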

– 7 – CSCE 771 Spring 2011
5.4 Automatic Tagging: training set / test set

### setup
import nltk, re, pprint
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

– 8 – CSCE 771 Spring 2011
Default tagger → NN

tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print nltk.FreqDist(tags).max()

raw = 'I do not like green eggs and ham, I …Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
print default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), …
print default_tagger.evaluate(brown_tagged_sents)
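Since the default tagger assigns 'NN' to every token, its accuracy is just the share of tokens whose gold tag is NN (the 0.13 in the table on slide 13); a quick check in the same Python 2 style:

from nltk.corpus import brown

# A tagger that labels every token 'NN' can only be right on the tokens
# whose gold-standard tag actually is NN.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nn_fraction = float(tags.count('NN')) / len(tags)
print nn_fraction   # roughly 0.13, matching default_tagger.evaluate(...)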

– 9 – CSCE 771 Spring 2011
Tagger 2: regexp_tagger

patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                     # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)

– 10 – CSCE 771 Spring 2011
Evaluate regexp_tagger

regexp_tagger = nltk.RegexpTagger(patterns)
print regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), …
print regexp_tagger.evaluate(brown_tagged_sents)

– 11 – CSCE 771 Spring 2011
Unigram tagger: 100 most frequent words, each with its most likely tag

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.keys()[:100]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print baseline_tagger.evaluate(brown_tagged_sents)
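To see what the lookup model actually stores, a small sketch in the same style (the exact words printed depend on the corpus ordering):

# Peek at the learned word -> most-likely-tag table from the slide above.
for word in most_freq_words[:5]:
    print word, '->', likely_tags[word]

# Any word outside the 100-word model gets tag None from this tagger
# (no backoff yet at this point).
print baseline_tagger.tag(['the', 'blogosphere'])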

– 12 – CSCE 771 Spring 2011
likely_tags; backoff to NN

sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
[('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), …

baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print baseline_tagger.tag(sent)
[('Only', 'NN'), ('a', 'AT'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'IN'), …

print baseline_tagger.evaluate(brown_tagged_sents)

– 13 – CSCE 771 Spring 2011
Performance of Easy Taggers

Tagger                          Performance   Comment
NN (default) tagger             0.13
Regexp tagger
Most freq tag (lookup)          0.46
Likely_tags; backoff to NN      0.58

– 14 – CSCE 771 Spring 2011

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,
                                         backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

– 15 – CSCE 771 Spring 2011
Display

def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Perf. vs Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

– 16 – CSCE 771 Spring 2011
Error !?

Traceback (most recent call last):
  File "C:/Users/mmm/Documents/Courses/771/Python771/ch05.4.py", line 70, in <module>
    import pylab
ImportError: No module named pylab

Fix: pylab is provided by matplotlib (which sits on top of numpy in the scipy stack), so installing matplotlib makes "import pylab" work.
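Once matplotlib is installed, import pylab succeeds and display() runs unchanged; a hedged alternative using the now-preferred pyplot interface (same logic as display(); the name display_plt is chosen here, not from the slides):

import nltk
from nltk.corpus import brown
import matplotlib.pyplot as plt
import numpy as np

def display_plt():
    # performance(cfd, wordlist) is the function defined on slide 14
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** np.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    plt.plot(sizes, perfs, '-bo')
    plt.title('Lookup Tagger Perf. vs Model Size')
    plt.xlabel('Model Size')
    plt.ylabel('Performance')
    plt.show()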

– 17 – CSCE 771 Spring 2011
5.5 N-gram Tagging

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), …
print unigram_tagger.evaluate(brown_tagged_sents)

– 18 – CSCE 771 Spring 2011
Dividing into Training/Test Sets

size = int(len(brown_tagged_sents) * 0.9)
print size
4160
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
print unigram_tagger.evaluate(test_sents)

– 19 – CSCE 771 Spring 2011
bigram_tagger, first try

bigram_tagger = nltk.BigramTagger(train_sents)
print "bigram_tagger.tag-2007", bigram_tagger.tag(brown_sents[2007])
bigram_tagger.tag-2007 [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), …

unseen_sent = brown_sents[4203]
print "bigram_tagger.tag-4203", bigram_tagger.tag(unseen_sent)
bigram_tagger.tag-4203 [('The', 'AT'), ('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None), …

Once the tagger meets a word in a context it never saw during training, it assigns None, and every later word in the sentence also gets None, because its context now contains a None tag.

print bigram_tagger.evaluate(test_sents)
not too good

– 20 – CSCE 771 Spring 2011
Backoff: bigram → unigram → NN

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print t2.evaluate(test_sents)

– 21 – CSCE 771 Spring 2011
Your turn: trigram → bigram → unigram → NN (one possible sketch follows)
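One possible answer, offered as a sketch rather than the intended solution (train_sents and test_sents are from slide 18):

# Extend the backoff chain by one more level: trigram -> bigram -> unigram -> NN.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print t3.evaluate(test_sents)   # compare with t2.evaluate(test_sents)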

– 22 – CSCE 771 Spring 2011
Tagging Unknown Words

Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items?

A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in Section 5.3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb. A sketch of the UNK replacement step follows.
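A minimal sketch of the UNK replacement described above, in the slides' Python 2 / NLTK 2 style (the 1000-word cutoff and the helper name rewrite are choices made here, not from the original):

from nltk.corpus import brown

# Replace every word outside the n most frequent with the pseudo-word 'UNK',
# then train the usual backoff chain on the rewritten corpus.
vocab = nltk.FreqDist(brown.words(categories='news'))
common = set(vocab.keys()[:1000])     # NLTK 2: keys() are sorted by frequency

def rewrite(tagged_sent):
    return [(word if word in common else 'UNK', tag) for (word, tag) in tagged_sent]

# train_sents and test_sents are from slide 18
unk_train = [rewrite(sent) for sent in train_sents]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(unk_train, backoff=t0)   # learns that UNK is usually a noun
t2 = nltk.BigramTagger(unk_train, backoff=t1)    # learns contexts like TO + UNK -> verb
print t2.evaluate([rewrite(sent) for sent in test_sents])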

– 23 – CSCE 771 Spring 2011
Serialization = pickle

Saving (object serialization):
from cPickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()

Loading:
from cPickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
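Under Python 3, cPickle no longer exists; a hedged equivalent with the standard pickle module (same file name as on the slide, t2 from slide 20):

import pickle

# Saving
with open('t2.pkl', 'wb') as output:
    pickle.dump(t2, output, pickle.HIGHEST_PROTOCOL)

# Loading
with open('t2.pkl', 'rb') as f:
    tagger = pickle.load(f)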

– 24 – CSCE 771 Spring 2011 Performance Limitations

– 25 – CSCE 771 Spring 2011

text = """The board's action shows what free enterprise
is up against in our complex maze of regulatory laws."""
tokens = text.split()
tagger.tag(tokens)

cfd = nltk.ConditionalFreqDist(
          ((x[1], y[1], z[0]), z[1])
          for sent in brown_tagged_sents
          for x, y, z in nltk.trigrams(sent))
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
print sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()
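The last line estimates the share of tokens whose context (the two preceding gold tags plus the word itself) still allows more than one tag; to make that concrete, a short sketch that prints a few such contexts (same session, names from the slide):

# Inspect a handful of ambiguous contexts: even with the two previous
# gold tags and the word itself, more than one tag occurs in the corpus.
for context in ambiguous_contexts[:5]:
    print context, cfd[context].items()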

– 26 – CSCE 771 Spring 2011
Confusion Matrix

test_tags = [tag for sent in brown.sents(categories='editorial')
                 for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print nltk.ConfusionMatrix(gold_tags, test_tags)

overwhelming output
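One way to tame the overwhelming output is to collapse all but a handful of tags before building the matrix; a hedged sketch (the tag set chosen here is arbitrary; gold_tags and test_tags are from the slide above):

# Keep only a few tags of interest and lump everything else into 'OTHER'
# so the confusion matrix stays small enough to read.
focus = set(['NN', 'NNS', 'JJ', 'VB', 'VBD', 'IN'])
squash = lambda t: t if t in focus else 'OTHER'
print nltk.ConfusionMatrix([squash(t) for t in gold_tags],
                           [squash(t) for t in test_tags])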

– 27 – CSCE 771 Spring 2011

– 28 – CSCE 771 Spring 2011
nltk.tag.brill.demo()

Loading tagged data...
Done loading.
Training unigram tagger: [accuracy: ]
Training bigram tagger: [accuracy: ]
Training Brill tagger on 1600 sentences...
Finding initial useful rules...
Found 9757 useful rules.

– 29 – CSCE 771 Spring 2011

Columns: Score | Fixed | Broken | Other | Rule
  Score  = Fixed - Broken
  Fixed  = num tags changed incorrect -> correct
  Broken = num tags changed correct -> incorrect
  Other  = num tags changed incorrect -> incorrect

WDT -> IN  if the tag of words i+1...i+2 is 'DT'
IN  -> RB  if the text of the following word is 'well'
WDT -> IN  if the tag of the preceding word is 'NN', and the tag of the following word is 'NNP'
RBR -> JJR if the tag of words i+1...i+2 is 'NNS'
WDT -> IN  if the tag of words i+1...i+2 is 'NNS'

– 30 – CSCE 771 Spring 2011

WDT -> IN  if the tag of the preceding word is 'NN', and the tag of the following word is 'PRP'
WDT -> IN  if the tag of words i+1...i+3 is 'VBG'
RB  -> IN  if the tag of the preceding word is 'NN', and the tag of the following word is 'DT'
RBR -> JJR if the tag of the following word is 'NN'
VBP -> VB  if the tag of words i-3...i-1 is 'MD'
NNS -> NN  if the text of the preceding word is 'one'
RP  -> RB  if the text of words i-3...i-1 is 'were'
VBP -> VB  if the text of words i-2...i-1 is "n't"

Brill accuracy:
Done; rules and errors saved to rules.yaml and errors.out.

– 31 – CSCE 771 Spring 2011