Presentation on theme: "Word Bi-grams and PoS Tags"— Presentation transcript:
1 Word Bi-grams and PoS Tags
School of Computing, FACULTY OF ENGINEERING
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
2 Reminder
FreqDist counts of tokens and their distribution can be useful
  Eg find main characters in Gutenberg texts
  Eg compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
  Auto-generate a story ... (but gets stuck in a local maximum)
Grammatical trends: modal verb distribution predicts genre
3 Why do puns make us groan?
He drove his expensive car into a tree and found out how the Mercedes bends.
Isn't the Grand Canyon just gorges?
Time flies like an arrow.
Fruit flies like a banana.
4 Predicting Next Words
One reason puns make us groan is that they play on our assumptions of what the next word will be: human language processing involves predicting the most probable next word
They also exploit
  homonymy: same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)
  polysemy: same spelling, different meaning
NLP programs can also make use of word-sequence modeling
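Predicting the most probable next word can be sketched with plain bigram counts. This is a minimal illustration on a made-up token list; the names bigram_counts and most_probable_next are hypothetical, and the standard library's Counter is used here as a stand-in for NLTK's ConditionalFreqDist.

```python
from collections import Counter, defaultdict

def bigram_counts(tokens):
    """Count, for each word, how often each other word directly follows it."""
    counts = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        counts[first][second] += 1
    return counts

def most_probable_next(counts, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (an assumption for illustration, not course data).
tokens = "time flies like an arrow and fruit flies like a banana".split()
counts = bigram_counts(tokens)
print(most_probable_next(counts, "flies"))   # like
print(most_probable_next(counts, "banana"))  # None (never followed by anything)
```

With a corpus this small the prediction is trivial; the same counting scheme only becomes informative on a large corpus.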
5 Auto-generate a Story
How to fix the generator getting stuck (always picking the most likely next word leads to a loop)? Use a random number generator.
6 Auto-generate a Story
The choice() function (from random import *) chooses one item at random from a list
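A rough sketch of random story generation with choice() is given here. It assumes a tiny invented token list, and the helper names followers and generate are hypothetical; the lecture's own code is not shown in the transcript, so this is only one way it might look.

```python
import random
from collections import defaultdict

def followers(tokens):
    """Map each word to the list of words seen directly after it."""
    table = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        table[first].append(second)
    return table

def generate(table, start, length, rng):
    """Generate up to `length` words, picking each next word at random."""
    words = [start]
    while len(words) < length:
        options = table.get(words[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        words.append(rng.choice(options))
    return words

# Toy corpus (made up for illustration).
tokens = "the cat sat on the mat and the dog sat on the cat".split()
rng = random.Random(0)  # seeded only so that runs are repeatable
story = generate(followers(tokens), "the", 8, rng)
print(" ".join(story))
```

Because repeated followers are kept in each list, choice() picks a frequent follower more often than a rare one, while still allowing the rarer continuations that break a loop.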
7 Part-of-Speech Tagging: Terminology
Tagging
  The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg
    Hand-coded rules
    Statistical taggers
    Brill (transformation-based) tagger
    Hybrid tagger: combination, eg by “vote”
Tags
  The labels
Tag Set
  The collection of tags used for a particular task, eg Brown or LOB tagset
Modified from Diane Litman's version of Steve Bird's notes
8 Example from the GENIA corpus
Typically a tagged text is a sequence of white-space separated word/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG CD28/NN costimulatory/NN pathway/NN ./.
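Reading the word/tag format into (word, tag) pairs takes only a few lines. The sketch assumes every token contains at least one slash, and parse_tagged is a hypothetical helper name; NLTK offers nltk.tag.str2tuple for converting a single token in a similar way.

```python
def parse_tagged(text):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word that itself contains '/' stays whole.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

sample = "These/DT findings/NNS should/MD be/VB useful/JJ ./."
tagged = parse_tagged(sample)
print(tagged[:2])                                   # [('These', 'DT'), ('findings', 'NNS')]
print([word for word, tag in tagged if tag == "JJ"])  # ['useful']
```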
9 What does Tagging do?
Collapses Distinctions
  Lexical identity may be discarded
  e.g., all personal pronouns tagged with PRP
Introduces Distinctions
  Ambiguities may be resolved
  e.g. deal tagged with NN or VB
Helps in classification and prediction
Modified from Diane Litman's version of Steve Bird's notes
10 Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
  Limits the range of meanings (deal), pronunciation (noun OBject vs verb obJECT) or both (wind)
  Helps in stemming
  Limits the range of following words
  Can help select nouns from a document for summarization
  Basis for partial parsing (chunked parsing)
  Parsers can build trees directly on the POS tags instead of maintaining a lexicon
Modified from Diane Litman's version of Steve Bird's notes
11 Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
  Getting better information about context
  Making it possible for classifiers to do their job
Slide modified from Massimo Poesio's
12 Some of the best-known Tagsets
Brown corpus: 87 tags (more when tags are combined, eg isn’t)
LOB corpus: 132 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags
Slide modified from Massimo Poesio's
13 The Brown Corpus
An early digital corpus (1961)
Francis and Kucera, Brown University
Contents: 500 texts, each about 2000 words long
  From American books, newspapers, magazines
  Representing genres including:
    Science fiction, romance fiction, press reportage, scientific writing, popular lore
Modified from Diane Litman's version of Steve Bird's notes
16 Penn Treebank
First large syntactically annotated corpus
1 million words from Wall Street Journal
Part-of-speech tags and syntax trees
Modified from Diane Litman's version of Steve Bird's notes
17 help(nltk.corpus.treebank)
 |  parsed(*args, **kwargs)
 |      @deprecated: Use .parsed_sents() instead.
 |
 |  parsed_sents(self, files=None)
 |  raw(self, files=None)
 |  read(*args, **kwargs)
 |      @deprecated: Use .raw() or .sents() or .tagged_sents() or
 |      parsed_sents() instead.
 |  sents(self, files=None)
 |  tagged(*args, **kwargs)
 |      @deprecated: Use .tagged_sents() instead.
 |  tagged_sents(self, files=None)
 |  tagged_words(self, files=None)
18 How hard is POS tagging?
In the Brown corpus,
  12% of word types are ambiguous
  40% of word tokens are ambiguous

Number of tags          1      2     3    4   5   6   7
Number of word types    35340  3760  264  61  12

Slide modified from Massimo Poesio's
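The difference between ambiguous word types and ambiguous word tokens can be made concrete with a small count. This sketch uses invented toy data rather than the Brown corpus, and the helper names tags_per_type and ambiguity_table are hypothetical.

```python
from collections import Counter, defaultdict

def tags_per_type(tagged_words):
    """Map each word type (lowercased) to the set of tags it occurs with."""
    tags = defaultdict(set)
    for word, tag in tagged_words:
        tags[word.lower()].add(tag)
    return tags

def ambiguity_table(tagged_words):
    """Count how many word types have 1 tag, 2 tags, and so on."""
    return Counter(len(tagset) for tagset in tags_per_type(tagged_words).values())

# Toy data (made up for illustration).
tagged_words = [
    ("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
    ("the", "DT"), ("deal", "NN"), ("deal", "VB"), ("space", "NN"),
]
types = tags_per_type(tagged_words)
table = ambiguity_table(tagged_words)
print(sorted(table.items()))  # [(1, 3), (2, 2)]

# Share of ambiguous types versus share of ambiguous tokens.
ambiguous_types = sum(1 for tagset in types.values() if len(tagset) > 1)
ambiguous_tokens = sum(1 for word, _ in tagged_words if len(types[word.lower()]) > 1)
print(ambiguous_types / len(types))          # 0.4
print(ambiguous_tokens / len(tagged_words))  # 0.5
```

Even in this toy sample the token figure is higher than the type figure, because each ambiguous type is counted once per occurrence.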
19 Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater probability
  P(race|VB)
  P(race|NN)
Actual estimate from the Switchboard corpus:
  P(race|NN) =
  P(race|VB) =
This suggests we should always tag race/NN (correct 41/44=93%)
Modified from Massimo Poesio's lecture
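A tagger that uses lexical frequency alone can be sketched as "always pick the tag seen most often with this word". The counts in the sketch are an assumed reading of the slide's 41/44 (44 occurrences of race, 41 tagged NN and 3 tagged VB); the names train and tag_word, and the NN fallback for unseen words, are hypothetical choices rather than anything given in the lecture.

```python
from collections import Counter, defaultdict

def train(tagged_words):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts

def tag_word(counts, word, default="NN"):
    """Return the most frequent tag for `word`; unseen words get `default`."""
    if word not in counts:
        return default  # assumption: fall back to a default tag
    return counts[word].most_common(1)[0][0]

# Assumed split behind "41/44": 41 noun uses and 3 verb uses of "race".
training = [("race", "NN")] * 41 + [("race", "VB")] * 3
counts = train(training)

predicted = tag_word(counts, "race")
correct = sum(1 for word, gold in training if tag_word(counts, word) == gold)
accuracy = correct / len(training)
print(predicted, round(accuracy, 2))  # NN 0.93
```

The accuracy here is measured on the same data used for counting, so it only restates the slide's arithmetic: every verb use of race, as in "to/TO race/VB", is tagged wrongly.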
20 Reminder
Puns play on our assumptions of the next word …
  … eg they present us with an unexpected homonym (bends)
ConditionalFreqDist() counts word-pairs: word bigrams
  Used for story generation, speech recognition, …
Parts of Speech: group words into grammatical categories …
  … and separate different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: choose by lexical probability (only)
Better PoS-Taggers: to come …