February 2007CSA3050: Tagging I1 CSA2050: Natural Language Processing Tagging 1 Tagging POS and Tagsets Ambiguities NLTK.

1 CSA2050: Natural Language Processing
Tagging 1
– Tagging
– POS and Tagsets
– Ambiguities
– NLTK
February 2007, CSA3050: Tagging I

2 Tagging 1
Lecture slides based on:
– Mike Rosner and Marti Hearst notes
– Diane Litman's version of Steven Bird's notes
– Additions from NLTK tutorials

3 Tagging
Mr. Sherlock Holmes, who was usually very X, …
What is the part of speech of X?

4 Tagging
Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y
What is the part of speech of Y?

5 Tagging
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table

6 Tagging Terminology
Tagging
– The process of associating labels with each token in a text
Tags
– The labels
Tag Set
– The collection of tags used for a particular task

7 Tagging Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./.
Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
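A minimal sketch of how such whitespace-separated base/tag tokens could be split into pairs, using only the Python standard library. The function name `parse_tagged` is an assumption made for illustration; it is not taken from the slides or from NLTK. Splitting on the last slash keeps tokens such as `,/,` intact, and the sketch assumes every token contains a slash.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs.

    Hypothetical helper: assumes every token contains at least one '/'.
    """
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so ',/,' becomes (',', ',')
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn"
for word, tag in parse_tagged(sample):
    print(f"{word:12} {tag}")
```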

8 What does tagging do?
1. Collapses some distinctions
– Lexical identity may be discarded
– e.g. all personal pronouns tagged with PRP
2. …but introduces others
– Ambiguities may be removed
– e.g. deal tagged with NN or VB
– e.g. deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction

9 Parts of Speech (POS)
A word’s POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation (object vs object) or both (wind)
– Helps in stemming
– Limits the range of following words for Speech Recognition
– Can help select nouns from a document for IR
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags instead of maintaining a lexicon

10 POS and Tagsets
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
– Getting better information about context (best: introduce more distinctions)
– Making it possible for classifiers to do their job (need to minimize distinctions)

11 Common Tagsets
– Brown corpus: 87 tags
– Penn Treebank: 45 tags
– Lancaster UCREL C5 (used to tag the British National Corpus, BNC): 61 tags
– Lancaster C7: 145 tags

12 Brown Corpus
The first digital corpus (1961)
– Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long
– From American books, newspapers, magazines
– Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

13 Penn Treebank
– First syntactically annotated corpus
– 1 million words from Wall Street Journal
– Part of speech tags and syntax trees

14 Penn Treebank
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB    DT    NN      .
Book  that  flight  .

VBZ   DT    NN      VB     NN      ?
Does  that  flight  serve  dinner  ?

15 Penn Treebank

16 Penn Treebank – Important Tags

17 Penn Treebank – Verb Tags

18 Penn Treebank Example
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADVP-TMP (RB quickly))))))
   (. .))
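A bracketed tree in this style can be read with a few lines of standard-library Python. The sketch below is only an illustration under stated assumptions: `parse_bracketed` and `tagged_leaves` are hypothetical names (not NLTK functions), the input is assumed to be well-formed, and the example string is a shortened tree rather than the full one from the slide.

```python
def parse_bracketed(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Hypothetical helper; assumes balanced parentheses and a single root.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_leaves(node):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_leaves(child))
    return pairs


tree = parse_bracketed("(S (NP-SBJ (DT The) (NNP Senate)) (VP (VBZ plans)) (. .))")
print(tagged_leaves(tree))
```

Reading the leaves this way shows that the POS tags are simply the labels directly above the words, while the higher labels carry the syntactic structure.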

19 Tagging
– Typically the set of tags is larger than basic parts of speech
– Tags often contain some morphological information
– Often referred to as “morphosyntactic labels”
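As a small illustration of tags carrying morphological information, the sketch below splits each Penn Treebank verb tag into a basic part of speech plus a morphological detail. The glosses follow the Penn Treebank tagset; the dictionary layout and the `describe` function are assumptions made for this example only.

```python
# Penn Treebank verb tags and the morphological information each one adds.
PENN_VERB_TAGS = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "non-3rd person singular present",
    "VBZ": "3rd person singular present",
}


def describe(tag):
    """Return (basic part of speech, morphological detail) for a verb tag.

    Hypothetical helper: tags outside this small table yield (None, None).
    """
    if tag in PENN_VERB_TAGS:
        return ("verb", PENN_VERB_TAGS[tag])
    return (None, None)


for word, tag in [("plans", "VBZ"), ("take", "VB"), ("seated", "VBN")]:
    pos, detail = describe(tag)
    print(f"{word}/{tag}: {pos}, {detail}")
```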

20 Tagging Ambiguities
FRUIT   FLIES   LIKE   A    BANANA
N       N-V     V-IN   DT   N

21 Interpretation 1
[S [NP [N FRUIT] [N FLIES]] [VP [V LIKE] [NP [DT A] [N BANANA]]]]

22 Interpretation 2
[S [NP [N FRUIT]] [VP [V FLIES] [PP [IN LIKE] [NP [DT A] [N BANANA]]]]]
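The ambiguity in this sentence can be made concrete by enumerating every tag sequence that the per-word tag options allow. This is a minimal sketch, not a tagger: the names `AMBIGUITY_CLASSES` and `candidate_taggings` are assumptions for illustration, and the tag options are the ones listed on the Tagging Ambiguities slide.

```python
from itertools import product

# Possible tags for each word, as on the Tagging Ambiguities slide.
AMBIGUITY_CLASSES = [
    ("fruit", ["N"]),
    ("flies", ["N", "V"]),
    ("like", ["V", "IN"]),
    ("a", ["DT"]),
    ("banana", ["N"]),
]


def candidate_taggings(classes):
    """Return every combination of per-word tags as a list of (word, tag) lists."""
    words = [word for word, _ in classes]
    options = [tags for _, tags in classes]
    return [list(zip(words, choice)) for choice in product(*options)]


candidates = candidate_taggings(AMBIGUITY_CLASSES)
for tagging in candidates:
    print(" ".join(f"{word}/{tag}" for word, tag in tagging))
```

Two of the enumerated sequences correspond to the two interpretations (N N V DT N and N V IN DT N); a tagger's job is to pick among such candidates.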

23 Lots of ambiguities…
1. He can can a can.
2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

24 Lots of ambiguities…
In the Brown Corpus
– 11.5% of word types are ambiguous
– 40% of word tokens are ambiguous
Most words in English are unambiguous.
Many of the most common words are ambiguous.
Typically ambiguous tags are not equally probable.
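The difference between type ambiguity and token ambiguity can be sketched on a toy tagged sentence. Everything here is illustrative: `ambiguity_rates` is a hypothetical helper, and the six-token sample (with Penn-style tags chosen for this example) is far too small to reproduce the Brown Corpus figures.

```python
from collections import defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous word tokens).

    A word type counts as ambiguous if it occurs with more than one tag.
    """
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)

    ambiguous = {word for word, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_rate = sum(
        1 for word, _ in tagged_tokens if word.lower() in ambiguous
    ) / len(tagged_tokens)
    return type_rate, token_rate


# "He can can a can ." with illustrative tags (modal, verb, noun).
toy = [("He", "PRP"), ("can", "MD"), ("can", "VB"), ("a", "DT"), ("can", "NN"), (".", ".")]
type_rate, token_rate = ambiguity_rates(toy)
print(f"ambiguous types: {type_rate:.0%}, ambiguous tokens: {token_rate:.0%}")
```

Only one of the four word types is ambiguous, yet it accounts for half of the tokens, which mirrors the point that the most common words tend to be the ambiguous ones.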

25 Lots of ambiguities…
Brown Corpus (Table: DeRose, 1988)
Unambiguous (1 tag):    35,340 types
Ambiguous (2-7 tags):    4,100 types
  2 tags    3,760
  3 tags      264
  4 tags       61
  5 tags       12
  6 tags        2
  7 tags        1

26 Approaches to Tagging
1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)
2. Stochastic Tagger: HMM-based Tagger
3. Transformation-Based Tagger: Brill Tagger (Brill 1995)

27 NLTK
Natural Language Toolkit (NLTK)
http://nltk.sourceforge.net/
Please download and install!
Runs on Python

28 NLTK Introduction
The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural language processing.
– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
– Standard implementations of each task, which can be combined to solve complex problems.
Two versions: NLTK and NLTK-Lite

29 NLTK Modules
– nltk.token: processing individual elements of text, such as words or sentences.
– nltk.probability: modeling frequency distributions and probabilistic systems.
– nltk.tagger: tagging tokens with supplemental information, such as parts of speech or wordnet sense tags.
– nltk.parser: high-level interface for parsing texts.
– nltk.chartparser: a chart-based implementation of the parser interface.
– nltk.chunkparser: a regular-expression based surface parser.

30 Python for NLP
Python is a great language for NLP:
– Simple
– Easy to debug: exceptions, interpreted language
– Easy to structure: modules, object oriented programming
– Powerful string manipulation

31 Python Modules and Packages
Python modules “package program code and data for reuse.” (Lutz)
– Similar to library in C, package in Java.
Python packages are hierarchical modules (i.e., modules that contain other modules).
Three commands for accessing modules:
1. import
2. from…import
3. reload

32 Import Command
The import command loads a module:
# Load the regular expression module
>>> import re
To access the contents of a module, use dotted names:
# Use the search function from the re module (text is any string)
>>> re.search(r'\w+', text)
To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', …]

33 from...import
The from…import command loads individual functions and objects from a module:
# Load the search function from the re module
>>> from re import search
Once an individual function or object is loaded with from…import, it can be used directly:
# Use the search function from the re module (text is any string)
>>> search(r'\w+', text)

34 Import vs. from...import
import
– Keeps module functions separate from user functions.
– Requires the use of dotted names.
– Works with reload.
from…import
– Puts module functions and user functions together.
– More convenient names.
– Does not work with reload.
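The two import styles can be compared side by side in a short runnable sketch. The variable names (`text`, `via_module`, `via_name`) are chosen for this example only; both calls reach the same function object in the standard `re` module.

```python
import re                  # binds the name "re"; contents need dotted names
from re import search      # binds the name "search" directly

text = "Tagging with NLTK"

via_module = re.search(r"\w+", text)   # dotted name, module stays separate
via_name = search(r"\w+", text)        # direct name, shorter to type

print(via_module.group())   # first run of word characters in text
print(search is re.search)  # both names refer to the same function object
```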

35 Reload
If you edit a module, you must use the reload command before the changes become visible in Python:
>>> import mymodule
...
>>> reload(mymodule)
The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.
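The slide shows the built-in reload of the Python version current at the time; in Python 3 the same operation is available as `importlib.reload`. The sketch below is a self-contained illustration under that assumption: it writes a throwaway module (the name `mymodule_demo` and its `GREETING` variable are hypothetical) to a temporary directory, edits it, and reloads it.

```python
import importlib
import os
import sys
import tempfile

# Avoid cached bytecode so the edited source is always recompiled.
sys.dont_write_bytecode = True

# Create a throwaway module on disk (hypothetical name, for illustration).
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "mymodule_demo.py")
with open(module_path, "w") as handle:
    handle.write("GREETING = 'hello'\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()

import mymodule_demo                  # loaded with import
from mymodule_demo import GREETING   # copied with from...import

# "Edit" the module, then reload it.
with open(module_path, "w") as handle:
    handle.write("GREETING = 'goodbye, world'\n")
importlib.reload(mymodule_demo)

print(mymodule_demo.GREETING)  # the module attribute reflects the edit
print(GREETING)                # the name bound by from...import does not
```

The last two lines show the limitation described on the slide: the module object is refreshed, but a name copied earlier with from...import still points at the old value.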

37 Next Sessions…
– Rule-Based Tagging
– Stochastic Tagging
– Hidden Markov Models (HMMs)
– N-Grams
– Read Jurafsky and Martin Chapter 4 (PDF)
– Install NLTK

