February 2007, CSA3050: Tagging I
CSA2050: Natural Language Processing
Tagging 1
- Tagging
- POS and Tagsets
- Ambiguities
- NLTK

Tagging 1
Lecture slides based on:
- Mike Rosner and Marti Hearst notes
- Diane Litman’s version of Steven Bird’s notes
- Additions from NLTK tutorials

Tagging
Mr. Sherlock Holmes, who was usually very X, …
What is the part of speech of X?

Tagging
Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y
What is the part of speech of Y?

Tagging
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table

Tagging Terminology
- Tagging: the process of associating labels with each token in a text
- Tags: the labels
- Tag set: the collection of tags used for a particular task

Tagging Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:

The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
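As a rough illustration of the base/tag format, the sketch below splits a tagged string into (word, tag) pairs using only the Python standard library. The function name `parse_tagged_text` is an assumption made for this example and is not part of NLTK; the sample string is the second sentence of the slide.

```python
def parse_tagged_text(text):
    """Split whitespace-separated base/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that tokens such as './.' or ',/,' work.
        word, slash, tag = token.rpartition("/")
        if not slash:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


tagged = (
    "Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp "
    "diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at "
    "floor/nn to/in the/at ceiling/nn ./."
)

pairs = parse_tagged_text(tagged)
nouns = [word for word, tag in pairs if tag == "nn"]

print(pairs[:3])  # [('Its', 'pp'), ('rotunda', 'nn'), ('forms', 'vbz')]
print(nouns)      # ['rotunda', 'circle', 'diameter', 'height', 'floor', 'ceiling']
```

Once the text is in this form, selecting all tokens with a given tag (here the nouns) is a one-line filter.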

What does tagging do?
1. Collapses some distinctions
   - Lexical identity may be discarded
   - e.g. all personal pronouns tagged with PRP
2. …but introduces others
   - Ambiguities may be removed
   - e.g. deal tagged with NN or VB
   - e.g. deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction

Parts of Speech (POS)
A word’s POS tells us a lot about the word and its neighbors:
- Limits the range of meanings (deal), pronunciation (OBject as a noun vs. obJECT as a verb), or both (wind)
- Helps in stemming
- Limits the range of following words for speech recognition
- Can help select nouns from a document for IR
- Basis for partial parsing (chunked parsing)
- Parsers can build trees directly on the POS tags instead of maintaining a lexicon

POS and Tagsets
The choice of tagset greatly affects the difficulty of the problem.
Need to strike a balance between:
- Getting better information about context (best: introduce more distinctions)
- Making it possible for classifiers to do their job (need to minimize distinctions)

Common Tagsets
- Brown corpus: 87 tags
- Penn Treebank: 45 tags
- Lancaster UCREL C5 (used to tag the British National Corpus, BNC): 61 tags
- Lancaster C7: 145 tags

Brown Corpus
- The first digital corpus (1961)
  - Francis and Kucera, Brown University
- Contents: 500 texts, each 2000 words long
  - From American books, newspapers, magazines
  - Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank
- First syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees

Penn Treebank
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB   DT   NN     .
Book that flight .

VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?

Penn Treebank

Penn Treebank – Important Tags

Penn Treebank – Verb Tags

Penn Treebank Example

(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))
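One way to see how the part-of-speech tags sit inside the bracketed tree is to pull out every innermost (TAG word) pair. The sketch below does this with a regular expression from the standard library rather than a treebank reader; the pattern and variable names are assumptions made for this example, and the tree string is a retyped copy of the Senate example.

```python
import re

tree = """
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))
"""

# A pre-terminal is a bracket that holds exactly a tag and a word,
# with no nested brackets inside it.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

leaves = PRETERMINAL.findall(tree)  # list of (tag, word) pairs

# -NONE- marks an empty element (a trace), not a word of the sentence.
sentence = " ".join(word for tag, word in leaves if tag != "-NONE-")

print(leaves)
print(sentence)  # The Senate plans to take up the measure quickly .
```

Phrase labels such as NP-SBJ-1 or VP are skipped because their brackets contain further brackets, so only the word-level tags remain.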

Tagging
- Typically the set of tags is larger than the basic parts of speech
- Tags often contain some morphological information
- Often referred to as “morphosyntactic labels”
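To make the point about morphological information concrete, the sketch below collapses Penn Treebank tags to coarse word classes by prefix, so that VBD and VBZ both become VERB and NN and NNS both become NOUN. The coarse labels and the `coarse_pos` helper are assumptions chosen for illustration, not a standard mapping; the sentence is the grand jury example from the Penn Treebank slide.

```python
def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse class by prefix (illustrative only)."""
    prefixes = (("VB", "VERB"), ("NN", "NOUN"), ("JJ", "ADJ"), ("RB", "ADV"))
    for prefix, coarse in prefixes:
        if tag.startswith(prefix):
            return coarse
    return tag  # leave every other tag unchanged


tagged = (
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
    "number/NN of/IN other/JJ topics/NNS ./."
)
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]

fine_tags = sorted({tag for _, tag in pairs})
coarse_tags = sorted({coarse_pos(tag) for _, tag in pairs})

print(fine_tags)    # ['.', 'DT', 'IN', 'JJ', 'NN', 'NNS', 'VBD']
print(coarse_tags)  # ['.', 'ADJ', 'DT', 'IN', 'NOUN', 'VERB']
```

The fine-grained set keeps number (NN vs. NNS) and tense (VBD), which is exactly the morphological information the coarse classes throw away.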

Tagging Ambiguities

N      N-V    V-IN  DT  N
FRUIT  FLIES  LIKE  A   BANANA

Interpretation 1

(S (NP (N FRUIT) (N FLIES))
   (VP (V LIKE)
       (NP (DT A) (N BANANA))))

Interpretation 2

(S (NP (N FRUIT))
   (VP (V FLIES)
       (PP (IN LIKE)
           (NP (DT A) (N BANANA)))))
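The two interpretations are two of the tag sequences allowed by the per-word options on the "Tagging Ambiguities" slide (FLIES is N or V, LIKE is V or IN). A minimal sketch that enumerates every candidate sequence is given here; the variable names are assumptions made for this example.

```python
from itertools import product

# Per-word tag options, copied from the "Tagging Ambiguities" slide.
options = [
    ("FRUIT", ["N"]),
    ("FLIES", ["N", "V"]),
    ("LIKE", ["V", "IN"]),
    ("A", ["DT"]),
    ("BANANA", ["N"]),
]

words = [word for word, _ in options]
tag_sequences = list(product(*(tags for _, tags in options)))

for tags in tag_sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in zip(words, tags)))

interpretation_1 = ("N", "N", "V", "DT", "N")   # fruit flies / like / a banana
interpretation_2 = ("N", "V", "IN", "DT", "N")  # fruit / flies / like a banana
```

Two ambiguous words with two tags each give 2 x 2 = 4 candidate sequences, of which only the two above correspond to the parses shown. A tagger has to choose among such candidates.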

Lots of ambiguities…
1. He can can a can.
2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

Lots of ambiguities…
In the Brown Corpus:
- 11.5% of word types are ambiguous
- 40% of word tokens are ambiguous
Most words in English are unambiguous.
Many of the most common words are ambiguous.
Typically ambiguous tags are not equally probable.
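The gap between the type figure and the token figure arises because ambiguous words tend to be frequent ones. The sketch below shows the two measurements on a toy example: the sentence "He can can a can." with tags assigned by hand for this illustration (an assumption, not Brown Corpus data).

```python
from collections import defaultdict

# Hand-tagged toy sentence; the tags are an assumption made for this example.
toy = "He/PRP can/MD can/VB a/DT can/NN ./."
pairs = [tuple(token.rsplit("/", 1)) for token in toy.split()]

# Collect the set of tags observed for each word type.
tags_by_type = defaultdict(set)
for word, tag in pairs:
    tags_by_type[word.lower()].add(tag)

ambiguous_types = {word for word, tags in tags_by_type.items() if len(tags) > 1}

type_rate = len(ambiguous_types) / len(tags_by_type)
token_rate = sum(word.lower() in ambiguous_types for word, _ in pairs) / len(pairs)

print(f"ambiguous types:  {type_rate:.0%}")   # 25%
print(f"ambiguous tokens: {token_rate:.0%}")  # 50%
```

Only one of the four types ("can") is ambiguous, but it accounts for half of the tokens, which mirrors the pattern reported for the Brown Corpus at a much smaller scale.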

Lots of ambiguities…
Brown Corpus word types by number of tags (Table: DeRose, 1988)

Unambiguous (1 tag)     35,340
Ambiguous (2-7 tags)     4,100
  2 tags                 3,760
  3 tags                   264
  4 tags                    61
  5 tags                    12
  6 tags                     2
  7 tags                     1

Approaches to Tagging
1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)
2. Stochastic Tagger: HMM-based Tagger
3. Transformation-Based Tagger: Brill Tagger (Brill 1995)

NLTK
- Natural Language Toolkit (NLTK)
- Please download and install!
- Runs on Python

NLTK Introduction
The Natural Language Toolkit (NLTK) provides:
- Basic classes for representing data relevant to natural language processing.
- Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
- Standard implementations of each task, which can be combined to solve complex problems.
Two versions: NLTK and NLTK-Lite

NLTK Modules
- nltk.token: processing individual elements of text, such as words or sentences.
- nltk.probability: modeling frequency distributions and probabilistic systems.
- nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.
- nltk.parser: high-level interface for parsing texts.
- nltk.chartparser: a chart-based implementation of the parser interface.
- nltk.chunkparser: a regular-expression based surface parser.

Python for NLP
Python is a great language for NLP:
- Simple
- Easy to debug:
  - Exceptions
  - Interpreted language
- Easy to structure:
  - Modules
  - Object-oriented programming
- Powerful string manipulation

Python Modules and Packages
- Python modules “package program code and data for reuse.” (Lutz)
  - Similar to a library in C or a package in Java.
- Python packages are hierarchical modules (i.e., modules that contain other modules).
- Three commands for accessing modules:
  1. import
  2. from…import
  3. reload

Import Command
The import command loads a module:
    # Load the regular expression module
    >>> import re
To access the contents of a module, use dotted names:
    # Use the search function from the re module
    >>> re.search(r'\w+', 'Book that flight.')
To list the contents of a module, use dir:
    >>> dir(re)
    ['DOTALL', 'I', 'IGNORECASE', …]

from...import
The from…import command loads individual functions and objects from a module:
    # Load the search function from the re module
    >>> from re import search
Once an individual function or object is loaded with from…import, it can be used directly:
    # Use the search function without the module prefix
    >>> search(r'\w+', 'Book that flight.')

Import vs. from...import
import:
- Keeps module functions separate from user functions.
- Requires the use of dotted names.
- Works with reload.
from…import:
- Puts module functions and user functions together.
- More convenient names.
- Does not work with reload.

Reload
If you edit a module, you must use the reload command before the changes become visible in a running Python session:
    >>> import mymodule
    ...
    >>> reload(mymodule)
In Python 3, reload is no longer a built-in; load it first with: from importlib import reload
The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
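The difference between a dotted name and a directly bound name after a reload can be checked with a small experiment. The sketch below assumes Python 3 (where reload lives in importlib) and writes a throwaway module to a temporary directory; the module name `mymodule_reload_demo` and its `version` function are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name used only for this demonstration.
MODULE_NAME = "mymodule_reload_demo"

# Skip bytecode caching so the edited source is always re-read on reload.
old_flag = sys.dont_write_bytecode
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / f"{MODULE_NAME}.py"
    source.write_text("def version():\n    return 'old'\n")

    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        # Same effect as: import mymodule_reload_demo as mymodule
        mymodule = importlib.import_module(MODULE_NAME)
        # Same binding that "from mymodule_reload_demo import version" creates.
        version = mymodule.version
        before = mymodule.version()

        # Simulate editing the module on disk, then reload it.
        source.write_text("def version():\n    return 'new version'\n")
        importlib.reload(mymodule)

        after_dotted = mymodule.version()  # dotted name sees the new code
        after_direct = version()           # old binding still runs the old code
    finally:
        sys.path.remove(tmp_dir)
        sys.modules.pop(MODULE_NAME, None)
        sys.dont_write_bytecode = old_flag

print(before, "|", after_dotted, "|", after_direct)  # old | new version | old
```

The directly bound name keeps pointing at the function object that existed before the reload, which is why names obtained through from...import have to be imported again after reloading.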


Next Sessions…
- Rule-Based Tagging
- Stochastic Tagging
- Hidden Markov Models (HMMs)
- N-Grams
Read Jurafsky and Martin, Chapter 4 (PDF)
Install NLTK