School of Computing, Faculty of Engineering: Word Bi-grams and PoS Tags. COMP3310 Natural Language Processing. Eric Atwell


1 School of Computing, Faculty of Engineering. Word Bi-grams and PoS Tags. COMP3310 Natural Language Processing. Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

2 Reminder: FreqDist counts of tokens and their distribution can be useful, e.g. to find the main characters in Gutenberg texts, or to compare word lengths in different languages. Humans can predict the next word… N-gram models are based on counts in a large corpus. Auto-generate a story… (but it gets stuck in a local maximum). Grammatical trends: modal verb distribution predicts genre.
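The token-counting idea can be sketched in plain Python. This is a minimal illustration that uses collections.Counter as a stand-in for a frequency distribution; the sample sentence and the capitalised-token heuristic for spotting character names are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

def freq_dist(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)

# Hypothetical toy text, not taken from a Gutenberg corpus.
text = "Alice saw the rabbit and the rabbit saw Alice and then Alice ran"
tokens = text.split()
fd = freq_dist(tokens)

# Crude proxy for "main characters": count only capitalised tokens.
names = Counter(t for t in tokens if t[0].isupper())

# Distribution of word lengths, the quantity compared across languages.
lengths = Counter(len(t) for t in tokens)

print(fd.most_common(3))
print(names.most_common(1))
print(sorted(lengths.items()))
```

On a real corpus the capitalisation heuristic would also pick up sentence-initial words, so it is only a first approximation.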

3 Why do puns make us groan? He drove his expensive car into a tree and found out how the Mercedes bends. Isn't the Grand Canyon just gorges? Time flies like an arrow. Fruit flies like a banana.

4 Predicting Next Words: One reason puns make us groan is that they play on our assumptions of what the next word will be: human language processing involves predicting the most probable next word. They also exploit homonymy (same sound, different spelling and meaning: bends, Benz; gorges, gorgeous) and polysemy (same spelling, different meaning). NLP programs can also make use of word-sequence modeling.
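Predicting the most probable next word from counts can be sketched as follows. This is a minimal bigram example in plain Python; the function names and the toy token list are assumptions for illustration, and a real model would be trained on a large corpus.

```python
from collections import Counter, defaultdict

def bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

def most_probable_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

# Toy training text built from the pun on the earlier slide.
tokens = "time flies like an arrow and fruit flies like a banana and time flies".split()
followers = bigram_counts(tokens)

print(most_probable_next(followers, "flies"))
```

A pun works against exactly this kind of expectation: the model (and the reader) expects the frequent continuation and gets a different one.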

5 Auto-generate a Story How to fix this? Use a random number generator.

6 Auto-generate a Story: The choice() function in Python's random module chooses one item at random from a list (from random import *)
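The difference between always taking the most frequent next word and choosing one at random can be sketched like this. It is an illustrative reconstruction, not the code from the slide: the function names, the toy corpus, and the rng parameter are assumptions. Note that rng.choice() here picks uniformly among the distinct followers rather than weighting them by frequency.

```python
import random
from collections import Counter, defaultdict

def build_followers(tokens):
    """Map each word to a Counter of the words seen directly after it."""
    followers = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        followers[a][b] += 1
    return followers

def generate(followers, start, length, rng=None):
    """Generate up to `length` words starting from `start`.

    With rng=None the most frequent follower is always taken (greedy).
    Otherwise a follower is picked with rng.choice().
    """
    words = [start]
    for _ in range(length - 1):
        options = followers.get(words[-1])
        if not options:
            break  # dead end: the last word was never followed by anything
        if rng is None:
            nxt = options.most_common(1)[0][0]
        else:
            nxt = rng.choice(list(options))
        words.append(nxt)
    return words

# Hypothetical toy corpus.
tokens = "the cat saw the dog and the cat saw the bird and the cat ran home".split()
followers = build_followers(tokens)

greedy = generate(followers, "the", 7)                       # repeats "the cat saw"
sample = generate(followers, "the", 7, rng=random.Random(0))  # seeded for repeatability
print(" ".join(greedy))
print(" ".join(sample))
```

The greedy version cycles through the same three words, which is the "stuck" behaviour; the random version can leave the cycle because less frequent followers are also reachable.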

7 Part-of-Speech Tagging: Terminology (modified from Diane Litman's version of Steve Bird's notes). Tagging: the process of associating labels with each token in a text, using an algorithm to select a tag for each word, e.g. hand-coded rules; statistical taggers; the Brill (transformation-based) tagger; a hybrid tagger (a combination, e.g. by “vote”). Tags: the labels. Tag set: the collection of tags used for a particular task, e.g. the Brown or LOB tagset.

8 Example from the GENIA corpus. Typically a tagged text is a sequence of white-space separated word/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
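Reading this word/tag format back into (word, tag) pairs takes only a few lines. The sketch below is a plain-Python illustration; parse_tagged is a hypothetical helper name, and it assumes the tag follows the last "/" in each token so that a token such as ./. is split correctly.

```python
def parse_tagged(text, sep="/"):
    """Split white-space separated word/tag tokens into (word, tag) pairs.

    The split is made at the LAST separator, so words that themselves
    contain the separator keep it. Tokens without a separator get tag None.
    """
    pairs = []
    for token in text.split():
        if sep in token:
            word, _, tag = token.rpartition(sep)
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs

# The start of the GENIA example sentence, plus the final full stop.
pairs = parse_tagged("These/DT findings/NNS should/MD be/VB useful/JJ ./.")
print(pairs)
```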

9 What does Tagging do? (modified from Diane Litman's version of Steve Bird's notes) Collapses distinctions: lexical identity may be discarded, e.g. all personal pronouns are tagged with PRP. Introduces distinctions: ambiguities may be resolved, e.g. deal tagged with NN or VB. Helps in classification and prediction.

10 Significance of Parts of Speech (modified from Diane Litman's version of Steve Bird's notes). A word’s POS tells us a lot about the word and its neighbors: it limits the range of meanings (deal), pronunciation (object vs object) or both (wind); helps in stemming; limits the range of following words; can help select nouns from a document for summarization; is a basis for partial parsing (chunked parsing). Parsers can build trees directly on the POS tags instead of maintaining a lexicon.

11 Choosing a tagset (slide modified from Massimo Poesio's). The choice of tagset greatly affects the difficulty of the problem. We need to strike a balance between getting better information about context and making it possible for classifiers to do their job.

12 Some of the best-known Tagsets (slide modified from Massimo Poesio's). Brown corpus: 87 tags (more when tags are combined, e.g. isn’t); LOB corpus: 132 tags; Penn Treebank: 45 tags; Lancaster UCREL C5 (used to tag the BNC): 61 tags; Lancaster C7: 145 tags.

13 NLTK corpus data-files
If you're running tcsh ("echo $SHELL" will tell you), the command is:
    % setenv NLTK_DATA /home/csunix/nlplib/nltk_data
If it's bash:
    % export NLTK_DATA=/home/csunix/nlplib/nltk_data
You can simply type those at the command line to get them recognised by the current shell, meaning you can then invoke Python, etc. If you wish to make them permanent, put the corresponding line in ~/.cshrc or ~/.bashrc (respectively) and launch a new shell.

14 NLTK in Python
cslin-gps% echo $SHELL
/bin/tcsh
cslin-gps% ls /home/csunix/nlplib/nltk_data
corpora grammars stemmers taggers tokenizers
cslin-gps% setenv NLTK_DATA /home/csunix/nlplib/nltk_data
cslin-gps% python
Python 2.6.2 (r262:71600, Aug 21 2009, 12:23:57)
[GCC 4.4.1 20090818 (Red Hat 4.4.1-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk

15 help(nltk) to explore…
>>> help(nltk)
Help on package nltk:
NAME
    nltk
FILE
    /usr/lib64/python2.6/site-packages/nltk/__init__.py
DESCRIPTION
    NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing.
    @version: 2.0b9
PACKAGE CONTENTS
    app (package)
    book
    …

16 The Brown Corpus (modified from Diane Litman's version of Steve Bird's notes). An early digital corpus (1961), by Francis and Kucera, Brown University. Contents: 500 texts, each 2000 words long, from American books, newspapers and magazines. Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore.

17 help(nltk.corpus.brown)
>>> help(nltk.corpus.brown)
 |  paras(self, fileids=None, categories=None)
 |
 |  raw(self, fileids=None, categories=None)
 |
 |  sents(self, fileids=None, categories=None)
 |
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  words(self, fileids=None, categories=None)
 |

18 nltk.corpus.brown
>>> nltk.corpus.brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
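The three views shown here (words, tagged words, tagged sentences) are related by simple list operations. The sketch below illustrates that relationship in plain Python without loading the corpus; flatten and untag are hypothetical helper names, and the data is only the part of the first sentence that is visible in the output above, so it is incomplete.

```python
from collections import Counter

# Only the (word, tag) pairs visible in the truncated output; the real sentence is longer.
sample_tagged_sents = [[
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
    ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
    ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'),
]]

def flatten(tagged_sents):
    """Turn a list of tagged sentences into one flat list of (word, tag) pairs."""
    return [pair for sent in tagged_sents for pair in sent]

def untag(tagged_words):
    """Drop the tags, keeping only the words."""
    return [word for word, _ in tagged_words]

sample_tagged_words = flatten(sample_tagged_sents)
sample_words = untag(sample_tagged_words)
tag_counts = Counter(tag for _, tag in sample_tagged_words)

print(sample_words)
print(tag_counts.most_common(2))
```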

19 Penn Treebank (modified from Diane Litman's version of Steve Bird's notes). First large syntactically annotated corpus; 1 million words from the Wall Street Journal; part-of-speech tags and syntax trees.

20 help(nltk.corpus.treebank)
 |  parsed(*args, **kwargs)
 |      @deprecated: Use .parsed_sents() instead.
 |
 |  parsed_sents(self, files=None)
 |
 |  raw(self, files=None)
 |
 |  read(*args, **kwargs)
 |      @deprecated: Use .raw() or .sents() or .tagged_sents() or
 |      .parsed_sents() instead.
 |
 |  sents(self, files=None)
 |
 |  tagged(*args, **kwargs)
 |      @deprecated: Use .tagged_sents() instead.
 |
 |  tagged_sents(self, files=None)
 |
 |  tagged_words(self, files=None)

21 How hard is POS tagging? (slide modified from Massimo Poesio's)
Number of tags:         1      2     3    4   5   6  7
Number of word types:   35340  3760  264  61  12  2  1
In the Brown corpus, 12% of word types are ambiguous and 40% of word tokens are ambiguous.
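The two ambiguity figures measure different things: the share of distinct words (types) that occur with more than one tag, and the share of running words (tokens) that belong to such a type. A minimal sketch of both measurements, on a hypothetical toy tagged list rather than the Brown corpus:

```python
from collections import defaultdict

def ambiguity(tagged_words):
    """Return (type_rate, token_rate) for a list of (word, tag) pairs.

    type_rate:  fraction of distinct words seen with more than one tag.
    token_rate: fraction of all tokens whose word is such an ambiguous type.
    """
    tags_for = defaultdict(set)
    for word, tag in tagged_words:
        tags_for[word].add(tag)
    ambiguous_types = {w for w, tags in tags_for.items() if len(tags) > 1}
    type_rate = len(ambiguous_types) / len(tags_for)
    ambiguous_tokens = sum(1 for w, _ in tagged_words if w in ambiguous_types)
    token_rate = ambiguous_tokens / len(tagged_words)
    return type_rate, token_rate

# Hypothetical toy data: only "race" appears with two different tags.
toy = [("the", "DT"), ("race", "NN"), ("to", "TO"),
       ("race", "VB"), ("the", "DT"), ("deal", "NN")]
type_rate, token_rate = ambiguity(toy)
print(type_rate, token_rate)
```

Even in this tiny example the token rate is higher than the type rate, because ambiguous words tend to be frequent ones; the same effect explains the gap between the two Brown corpus percentages.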

22 Help on package nltk.tag in nltk:
NAME
    nltk.tag
FILE
    /usr/lib64/python2.6/site-packages/nltk/tag/__init__.py
DESCRIPTION
    Classes and interfaces for tagging each token of a sentence with supplementary information, such as its part of speech. This task, which is known as X{tagging}, is defined by the L{TaggerI} interface.
PACKAGE CONTENTS
    api
    brill
    crf
    hmm
    hunpos
    sequential
    simplify
    tnt
    util

23 Tagging with lexical frequencies (modified from Massimo Poesio's lecture)
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency.
Solution: choose the tag that has the greater probability, comparing P(race|VB) with P(race|NN).
Actual estimates from the Switchboard corpus: P(race|NN) = 0.00041, P(race|VB) = 0.00003.
This suggests we should always tag race/NN (correct 41/44 = 93% of the time).
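A tagger that uses lexical frequency alone can be sketched as below. This is an illustrative baseline, not the lecture's code: it picks the tag seen most often with each word in the training data, whereas the slide states its figures as P(word|tag). The function names, the default tag for unseen words, and the toy training pairs are assumptions.

```python
from collections import Counter, defaultdict

def train_lexical_tagger(tagged_words):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(model, words, default="NN"):
    """Tag each word with its most frequent tag; unseen words get `default` (an assumption)."""
    return [(w, model.get(w, default)) for w in words]

# Hypothetical toy training data in which race is mostly a noun.
training = [("the", "DT"), ("race", "NN"), ("the", "DT"), ("race", "NN"),
            ("a", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB")]
model = train_lexical_tagger(training)
tagged = tag_words(model, ["to", "race", "tomorrow"])
print(tagged)

# The slide's arithmetic: NN's share of the two estimates is 41/44.
nn_share = 0.00041 / (0.00041 + 0.00003)
print(round(nn_share, 2))
```

The output tags race as NN even after to, which is the wrong answer for "expected to race tomorrow": a tagger that ignores context repeats the majority tag everywhere, which is why it is right only about 93% of the time for this word.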

24 Reminder: Puns play on our assumptions of the next word, e.g. they present us with an unexpected homonym (bends). ConditionalFreqDist() counts word pairs (word bigrams), used for story generation, speech recognition, … Parts of speech group words into grammatical categories, and separate the different functions of a word. In English, many words are ambiguous: they have 2 or more PoS tags. Very simple tagger: choose by lexical probability (only). Better PoS taggers: to come…

