Word Bi-grams and PoS Tags

School of Computing, Faculty of Engineering
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)

Reminder
FreqDist counts of tokens and their distribution can be useful
  e.g. find main characters in Gutenberg texts
  e.g. compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
Auto-generate a story ... (but it gets stuck in a local maximum)
Grammatical trends: modal verb distribution predicts genre
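The counting ideas in this reminder can be sketched with the standard library alone. This is a minimal illustration, not the lecture's code: collections.Counter stands in for NLTK's FreqDist, the sentence is an invented toy text rather than a Gutenberg book, and treating capitalised tokens as character names is only a rough assumption.

```python
from collections import Counter

# Toy text standing in for a Gutenberg book (invented for this sketch).
text = "Alice saw the Rabbit and the Rabbit saw Alice near the river"
tokens = text.split()

# Token counts: the same idea as a frequency distribution over words.
word_counts = Counter(tokens)

# Assumption: capitalised tokens are a crude cue for character names.
name_counts = Counter(t for t in tokens if t.istitle())

# Distribution of word lengths, the kind of profile one could compare
# across languages.
length_counts = Counter(len(t) for t in tokens)

print(word_counts.most_common(3))
print(name_counts.most_common(2))
print(sorted(length_counts.items()))
```

On a real book the name heuristic would also pick up sentence-initial words, so it would need filtering before it says much about main characters.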

Why do puns make us groan?
He drove his expensive car into a tree and found out how the Mercedes bends.
Isn't the Grand Canyon just gorges?
Time flies like an arrow. Fruit flies like a banana.

Predicting Next Words
One reason puns make us groan is that they play on our assumptions of what the next word will be: human language processing involves predicting the most probable next word
They also exploit
  homonymy (here, homophony): same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)
  polysemy: same spelling, different meaning
NLP programs can also make use of word-sequence modeling
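Predicting the most probable next word from counts can be sketched as below. This is an illustrative sketch under stated assumptions: the function names bigram_counts and most_probable_next are hypothetical, the token list is a toy built from the pun above, and a dictionary of Counters plays the role that NLTK's ConditionalFreqDist plays in the course.

```python
from collections import Counter, defaultdict

def bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def most_probable_next(following, word):
    """Return the most frequent follower of word, or None if word is unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

# Toy corpus (invented for this sketch).
tokens = ("time flies like an arrow and fruit flies like a banana "
          "and time flies").split()
model = bigram_counts(tokens)

print(most_probable_next(model, "flies"))
print(most_probable_next(model, "time"))
```

After "like" the toy counts are tied between "an" and "a", which is the situation a pun exploits: the model, like the listener, has no strong basis for choosing.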

Auto-generate a Story
How to fix getting stuck in a local maximum? Use a random number generator.

Auto-generate a Story
The choice() function from the random module chooses one item randomly from a list (from random import choice)
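A random walk over bigram followers might look like the following. This is a sketch only, not the code from the slide: followers and generate are hypothetical helper names, the corpus is a toy sentence, and the generator is seeded purely so that repeated runs of the sketch agree.

```python
import random
from collections import defaultdict

def followers(tokens):
    """Map each word to the list of words seen after it (repeats kept)."""
    table = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev].append(nxt)
    return table

def generate(table, start, length, rng):
    """Random walk over the bigram table, beginning at start."""
    words = [start]
    for _ in range(length - 1):
        options = table.get(words[-1])
        if not options:  # dead end: no follower was ever observed
            break
        # choice() picks uniformly from the list; because repeats are
        # kept, frequent followers are proportionally more likely.
        words.append(rng.choice(options))
    return words

# Toy corpus (invented for this sketch).
corpus = "the cat sat on the mat and the dog sat on the cat".split()
table = followers(corpus)
rng = random.Random(0)  # seeded only to make the sketch repeatable
story = generate(table, "the", 8, rng)
print(" ".join(story))
```

Always taking the single most frequent follower would revisit the same words in a cycle; sampling from the follower list is what lets the walk leave that loop.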

Part-of-Speech Tagging: Terminology
Tagging
  The process of associating labels with each token in a text, using an algorithm to select a tag for each word, e.g.
    Hand-coded rules
    Statistical taggers
    Brill (transformation-based) tagger
    Hybrid tagger: combination, e.g. by “vote”
Tags
  The labels
Tag Set
  The collection of tags used for a particular task, e.g. Brown or LOB tagset
Modified from Diane Litman's version of Steve Bird's notes

Example from the GENIA corpus
Typically a tagged text is a sequence of white-space separated word/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG CD28/NN costimulatory/NN pathway/NN ./.
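Reading the word/tag format back into (word, tag) pairs takes only a few lines. The helper below is a hypothetical stand-in written for this sketch; NLTK provides nltk.tag.str2tuple for the same job. Splitting at the last slash is an assumption that keeps tokens such as ./. and words containing a slash intact.

```python
def str_to_tuple(token, sep="/"):
    """Split 'word/TAG' at the last separator into (word, TAG)."""
    word, found, tag = token.rpartition(sep)
    if not found:  # no separator present: treat the token as untagged
        return (token, None)
    return (word, tag)

tagged_text = "These/DT findings/NNS should/MD be/VB useful/JJ ./."
pairs = [str_to_tuple(t) for t in tagged_text.split()]
print(pairs)
```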

What does Tagging do?
Collapses Distinctions
  Lexical identity may be discarded
  e.g. all personal pronouns tagged with PRP
Introduces Distinctions
  Ambiguities may be resolved
  e.g. deal tagged with NN or VB
Helps in classification and prediction
Modified from Diane Litman's version of Steve Bird's notes

Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
  Limits the range of meanings (deal), pronunciation (OBject vs obJECT) or both (wind)
  Helps in stemming
  Limits the range of following words
  Can help select nouns from a document for summarization
  Basis for partial parsing (chunked parsing)
  Parsers can build trees directly on the POS tags instead of maintaining a lexicon
Modified from Diane Litman's version of Steve Bird's notes

Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
  Getting better information about context (favours more, finer-grained tags)
  Making it possible for classifiers to do their job (favours fewer tags)
Slide modified from Massimo Poesio's

Some of the best-known Tagsets
Brown corpus: 87 tags (more when tags are combined, e.g. isn’t)
LOB corpus: 132 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags
Slide modified from Massimo Poesio's

The Brown Corpus
An early digital corpus (1961)
  Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long
  From American books, newspapers, magazines
  Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore
Modified from Diane Litman's version of Steve Bird's notes

help(nltk.corpus.brown)
 |  paras(self, fileids=None, categories=None)
 |
 |  raw(self, fileids=None, categories=None)
 |  sents(self, fileids=None, categories=None)
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |  words(self, fileids=None, categories=None)
(Output from NLTK 2; in NLTK 3 the simplify_tags parameter was replaced by tagset.)

nltk.corpus.brown
>>> nltk.corpus.brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …

Penn Treebank
First large syntactically annotated corpus
1 million words from Wall Street Journal
Part-of-speech tags and syntax trees
Modified from Diane Litman's version of Steve Bird's notes

help(nltk.corpus.treebank)
 |  parsed(*args, **kwargs)
 |      @deprecated: Use .parsed_sents() instead.
 |
 |  parsed_sents(self, files=None)
 |  raw(self, files=None)
 |  read(*args, **kwargs)
 |      @deprecated: Use .raw() or .sents() or .tagged_sents() or
 |      .parsed_sents() instead.
 |  sents(self, files=None)
 |  tagged(*args, **kwargs)
 |      @deprecated: Use .tagged_sents() instead.
 |  tagged_sents(self, files=None)
 |  tagged_words(self, files=None)

How hard is POS tagging?
In the Brown corpus,
  12% of word types ambiguous
  40% of word tokens ambiguous

Number of tags         1      2     3    4   5   6   7
Number of word types   35340  3760  264  61  12

Slide modified from Massimo Poesio's
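The difference between type ambiguity and token ambiguity can be made concrete with a small calculation. The sketch below is illustrative only: ambiguity_stats is a hypothetical helper, the tagged tokens are a toy list rather than the Brown corpus, and lower-casing words before counting is an assumption of this sketch. It expects a non-empty list.

```python
from collections import Counter, defaultdict

def ambiguity_stats(tagged_tokens):
    """Return (type ambiguity rate, token ambiguity rate, tags-per-type histogram)."""
    tags_per_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_per_word[word.lower()].add(tag)  # assumption: ignore case

    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_per_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_per_word)

    # Token rate: share of running words whose type is ambiguous.
    hits = sum(1 for word, _ in tagged_tokens if word.lower() in ambiguous)
    token_rate = hits / len(tagged_tokens)

    # How many word types have 1 tag, 2 tags, and so on.
    histogram = Counter(len(tags) for tags in tags_per_word.values())
    return type_rate, token_rate, histogram

# Toy tagged text (invented for this sketch).
toy = [("the", "DT"), ("race", "NN"), ("to", "TO"),
       ("race", "VB"), ("the", "DT"), ("dog", "NN")]
type_rate, token_rate, histogram = ambiguity_stats(toy)
print(type_rate, token_rate, sorted(histogram.items()))
```

Token ambiguity exceeds type ambiguity whenever the ambiguous words are also the frequent ones, which is the pattern the Brown figures show.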

Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater probability
  P(race|VB)
  P(race|NN)
Actual estimate from the Switchboard corpus:
  P(race|NN) = .00041
  P(race|VB) = .00003
This suggests we should always tag race/NN (correct 41/44=93%)
Modified from Massimo Poesio's lecture
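A tagger that uses lexical frequency alone can be sketched as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: train_lexical_tagger and tag are hypothetical names, the training pairs are a toy list, and falling back to NN for unseen words is an arbitrary choice made here. The last lines reproduce the slide's 41/44 arithmetic from the two Switchboard estimates.

```python
from collections import Counter, defaultdict

def train_lexical_tagger(tagged_tokens):
    """Learn, for each word, the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(tokens, model, default="NN"):
    """Tag each token by lookup; unseen words get the default tag (an assumption)."""
    return [(t, model.get(t, default)) for t in tokens]

# Toy training data (invented for this sketch).
training = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
            ("the", "DT"), ("race", "NN"), ("for", "IN"), ("space", "NN")]
model = train_lexical_tagger(training)
print(tag("to race for the race".split(), model))

# The slide's arithmetic: share of the two estimates that falls on NN.
p_race_nn, p_race_vb = 0.00041, 0.00003
share_nn = p_race_nn / (p_race_nn + p_race_vb)
print(round(share_nn, 2))
```

Because context is ignored, the sketch tags race as NN even directly after to, which is exactly the error in the Secretariat sentence that a purely lexical tagger cannot avoid.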

Reminder
Puns play on our assumptions of the next word…
  … e.g. they present us with an unexpected homonym (bends)
ConditionalFreqDist() counts word-pairs: word bigrams
  Used for story generation, speech recognition, …
Parts of Speech: groups words into grammatical categories
  … and separates different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: choose by lexical probability (only)
Better PoS-Taggers: to come…