LING 388: Language and Computers Sandiway Fong Lecture 23: 11/15.

1 LING 388: Language and Computers Sandiway Fong Lecture 23: 11/15

2 Part-of-Speech (POS) Tagging
Basic Idea:
–assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word
–useful for shallow parsing
–or as first stage of a deeper/more sophisticated system
Question:
–Is it a hard task? i.e. can’t we just look the words up in a dictionary?
Answer:
–Yes. Ambiguity.
–No. POS tagging programs typically claim 95%+ accuracy

3 POS Tagging
Task:
–assign the right part-of-speech tag to a word in context
–not always easy
Example: walk
–the walk : noun (I took …)
–I walk : verb (2 miles every day)
Example: still: noun, adjective, adverb, verb
–the still of the night, a glass still
–still waters
–stand still
–still struggling
–Still, I didn’t give way
–still your fear of the dark (transitive)
–the bubbling waters stilled (intransitive)

4 POS Tagging
Issues/Questions:
–What are the parts of speech and subclasses that we might want to tag?
–What does a typical tagset look like?
–What methods can we use to assign tags?

5 Parts-of-Speech
Divide words into classes based on grammatical function
–nouns (open-class: unlimited set)
  referential items (denoting objects/concepts etc.)
  –proper nouns: John
  –pronouns: he, him, she, her, it
  –anaphors: himself, herself (reflexives)
  –common nouns: dog, dogs, water
    »number: dog (singular), dogs (plural)
    »count-mass distinction: many dogs, *many waters
  –eventive nouns: dismissal, concert, playback, destruction (deverbal)
  nonreferential items
  –it as in it is important to study
  –there as in there seems to be a problem
  –some languages don’t have these: e.g. Japanese
  open-class
  –factoid, email, bush-ism

6 Parts-of-Speech
Pronouns:
1. it 2. I 3. he 4. you 5. his 6. they 7. this 8. that 9. she 10. her 11. we 12. all 13. which 14. their 15. what

7 Parts-of-Speech
Divide words into classes based on grammatical function
–verbs (closed-class: fixed set)
  auxiliaries
  –be (passive, progressive)
  –have (perfect; pluperfect with had)
  –do (What did John buy?, Did Mary win?)
  –modals: can, could, would, will, may
  Irregular:
  –is, was, were, does, did

8 Parts-of-Speech
Divide words into classes based on grammatical function
–verbs (open-class: unlimited set)
  Intransitive
  –unaccusatives: arrive (achievement)
  –unergatives: run, jog (activities)
  Transitive
  –actions: hit (semelfactive: hit the ball for an hour)
  –actions: eat, destroy (accomplishment)
  –psych verbs: frighten (x frightens y), fear (y fears x)
  Ditransitive
  –put (x put y on z, *x put y)
  –give (x gave y z, *x gave y, x gave z to y)
  –load (x loaded y (on z), x loaded z (with y))
–Open-class: reaganize, email, fax

9 Parts-of-Speech
Divide words into classes based on grammatical function
–adjectives (open-class: unlimited set)
  modify nouns
  black, white, open, closed, sick, well
  attributive: black (black car, car is black), main (main street, *street is main), atomic
  predicative: afraid (*afraid child, the child is afraid)
  stage-level: drunk (there is a man drunk in the pub)
  individual-level: clever, short, tall (*there is a man tall in the bar)
  object-taking: proud (proud of him, *well of him)
  intersective: red (red car: intersection of the set of red things and the set of cars)
  non-intersective: former (former architect), atomic (atomic scientist)
  comparative, superlative: blacker, blackest, *opener, *openest
–open-class: hackable, spammable

10 Parts-of-Speech
Divide words into classes based on grammatical function
–adverbs (open-class: unlimited set)
  modify verbs (adjectives and other adverbs)
  manner: slowly (moved slowly)
  degree: slightly, more (more clearly), very (very bad), almost
  sentential: unfortunately, suddenly
  question: how
  temporal: when, soon, yesterday (noun?)
  location: sideways, here (John is here)
–open-class: spam-wise

11 Parts-of-Speech
Divide words into classes based on grammatical function
–prepositions (closed-class: fixed set)
–come before an object and assign it a semantic function (from Mars, *Mars from)
  head-final languages: postpositions (Japanese: amerika-kara)
–location: on, in, by
–temporal: by, until

12 POS Tagging
Task:
–assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
POS taggers
–need to be fast in order to process large corpora
  should take no more than time linear in the size of the corpora
–full parsing is slow
  e.g. context-free grammar parsing takes time proportional to n^3, where n is the length of the sentence
–POS taggers try to assign the correct tag without actually parsing the sentence

13 POS Tagging
Components:
–Dictionary of words
  Exhaustive list of closed class items
  –Examples:
    »the, a, an: determiner
    »from, to, of, by: preposition
    »and, or: coordinating conjunction
  Large set of open class (e.g. nouns, verbs, adjectives) items with frequency information
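
The dictionary component above can be sketched in Python as two lookup tables. This is only an illustration: the closed-class entries come from the slide, while the open-class counts, the tag names for walk and still, and the function name lookup are assumptions made up for the example.

```python
from collections import Counter

# Closed-class items from the slide: a small, exhaustive, hand-written list.
CLOSED_CLASS = {
    "the": "determiner", "a": "determiner", "an": "determiner",
    "from": "preposition", "to": "preposition",
    "of": "preposition", "by": "preposition",
    "and": "coordinating conjunction", "or": "coordinating conjunction",
}

# Open-class items with frequency information.
# The counts below are invented for illustration only.
OPEN_CLASS = {
    "walk": Counter({"verb": 7, "noun": 3}),
    "still": Counter({"adverb": 8, "adjective": 3, "noun": 2, "verb": 1}),
}


def lookup(word):
    """Return the possible tags for a word, most frequent first ([] if unknown)."""
    w = word.lower()
    if w in CLOSED_CLASS:
        return [CLOSED_CLASS[w]]
    if w in OPEN_CLASS:
        return [tag for tag, _ in OPEN_CLASS[w].most_common()]
    return []
```

A word missing from both tables, such as reaganize, comes back with an empty list, which is where a separate unknown-word mechanism has to take over.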

14 POS Tagging
Components:
–Mechanism to assign tags
  Context-free: by frequency
  Context: bigram, trigram, HMM, hand-coded rules
  –Example:
    »Det Noun/*Verb the walk…
–Mechanism to handle unknown words (extra-dictionary)
  Capitalization
  Morphology: -ed, -tion
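
A rough Python sketch of the two mechanisms on this slide: one hand-coded context rule (Det Noun/*Verb) and an unknown-word guesser based on capitalization and the suffixes -ed and -tion. The function names, the tag labels, and the fall-back to noun are assumptions for illustration, not part of the lecture.

```python
def guess_unknown(word, sentence_initial=False):
    """Guess a tag for an out-of-dictionary word from its surface form."""
    # Capitalization: a capitalized word that is not sentence-initial
    # is probably a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "proper noun"
    # Morphology: suffixes mentioned on the slide.
    if word.endswith("tion"):
        return "noun"
    if word.endswith("ed"):
        return "verb"
    # Assumed default: treat any other unknown word as a noun.
    return "noun"


def tag_after_determiner(prev_tag, candidates):
    """Context rule: after a determiner, prefer Noun over Verb.

    candidates is a non-empty list of tags, most frequent first.
    """
    if prev_tag == "determiner" and "noun" in candidates:
        return "noun"
    return candidates[0]
```

With walk listed as verb-first by frequency, tag_after_determiner("determiner", ["verb", "noun"]) still picks noun, which matches the slide's "the walk" example.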

15 How Hard is Tagging?
Brown Corpus (Francis & Kucera, 1982):
–1 million words
–39K distinct words
–35K words with only 1 tag
–4K with multiple tags (DeRose, 1988)
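
The figures above say that roughly one distinct word in ten (4K of 39K) is ambiguous. The kind of count behind them can be sketched as follows; the function name and the six-token toy corpus are made up for illustration and are not Brown Corpus data.

```python
from collections import defaultdict


def ambiguity_counts(tagged_tokens):
    """Return (words with exactly one tag, words with multiple tags)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    unambiguous = sum(1 for tags in tags_by_word.values() if len(tags) == 1)
    return unambiguous, len(tags_by_word) - unambiguous


# Toy tagged corpus (hypothetical), using Penn Treebank style tags.
toy_corpus = [
    ("the", "DT"), ("walk", "NN"), ("I", "PRP"),
    ("walk", "VBP"), ("still", "RB"), ("the", "DT"),
]
print(ambiguity_counts(toy_corpus))
```

In the toy corpus only walk carries more than one tag, so three of the four distinct words are unambiguous.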

16 How Hard is Tagging?
Easy task to do well on:
–naïve algorithm: assign tag by frequency
–90% accuracy (Charniak et al., 1993)
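
A minimal sketch of the naïve frequency algorithm, assuming a training set of (word, tag) pairs. The function names, the default tag NN for unseen words, and the tiny training set are illustrative assumptions; the 90% figure on the slide comes from real corpora, not from this toy data.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, default="NN"):
    """Tag each word with its most frequent tag, ignoring context."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical training data: walk is seen twice as VBP, once as NN.
train = [
    ("the", "DT"), ("walk", "NN"), ("I", "PRP"),
    ("walk", "VBP"), ("they", "PRP"), ("walk", "VBP"),
]
model = train_baseline(train)
```

Because the baseline ignores context, it tags walk as VBP even in "the walk", which is exactly the kind of error the context mechanisms are meant to repair.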

17 Penn TreeBank Tagset
48-tag simplification of Brown Corpus tagset
Examples:
1. CC Coordinating conjunction
3. DT Determiner
7. JJ Adjective
11. MD Modal
12. NN Noun (singular, mass)
13. NNS Noun (plural)
27. VB Verb (base form)
28. VBD Verb (past)

18 Penn TreeBank Tagset www.ldc.upenn.edu/doc/treebank2/cl93.html

19 Penn TreeBank Tagset www.ldc.upenn.edu/doc/treebank2/cl93.html $

20 Penn TreeBank Tagset
How many tags?
–Tag criterion
  Distinctness with respect to grammatical behavior?
–Make tagging easier?
Punctuation tags
–Penn Treebank numbers 37-48
Trivial computational task

21 Penn TreeBank Tagset
Simplifications:
–Tag TO : infinitival marker, preposition
  I want to win
  I went to the store
–Tag IN : preposition: that, when, although
  I know that I should have stopped, although…
  I stopped when I saw Bill

22 Penn TreeBank Tagset
Simplifications:
–Tag DT : determiner: any, some, these, those
  any man
  these *man/men
–Tag VBP : verb, present: am, are, walk
  Am I here?
  *Walked I here?/Did I walk here?

23 Hard to Tag Items
Syntactic Function
–Example: resultative
  I saw the man tired from running
Examples (from Brown Corpus Manual)
–Hyphenation: long-range, high-energy, shirt-sleeved, signal-to-noise
–Foreign words: mens sana in corpore sano

24 Rule-Based POS Tagging
Example Systems
–ENGCG (1,100 rules) http://www.lingsoft.fi/cgi-bin/engcg
–ENGCG-2 (4000 rules) http://www.connexor.com/demos/tagger_en.html
Core Components
–English morphological analyzer based on two-level morphology
  see last lecture
–56K word stems
–processing
  apply morphological engine
  get all possible tags for each word
  apply rules

25 Rule-Based POS Tagging Example: –Pavlov had shown that salivation can be a conditioned reflex

26 Rule-Based POS Tagging
Examples of tags:
–PCP2 past participle
–SV subject verb
–SVOO subject verb object object

27 Rule-Based POS Tagging
Example:
–it isn’t that:adv odd
Rule:
–given input “that”
–if (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
–then eliminate non-ADV tags
–else eliminate ADV tag
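
One way to read the rule above as code, offered as a sketch rather than as ENGCG's actual implementation. Each word carries a set of candidate tags; the rule inspects the next word (+1), the word after that (+2), and the previous word (-1). The data layout, the function name, the choice to treat positions past the sentence edge as SENT-LIM, and the tags PRON, DET, CS and V are assumptions for illustration.

```python
def apply_that_rule(readings, i):
    """Reduce the tag set of the word at index i if that word is 'that'.

    readings: list of (word, set_of_candidate_tags) for one sentence.
    Returns the reduced tag set for position i.
    """
    def tags_at(j):
        if 0 <= j < len(readings):
            return readings[j][1]
        # Assumption: anything outside the sentence is a sentence limit.
        return {"SENT-LIM"}

    word, tags = readings[i]
    if word.lower() != "that":
        return set(tags)

    next_is_modifiable = bool(tags_at(i + 1) & {"A", "ADV", "QUANT"})  # +1
    then_sentence_limit = "SENT-LIM" in tags_at(i + 2)                 # +2
    prev_not_svoc_a = "SVOC/A" not in tags_at(i - 1)                   # NOT -1

    if next_is_modifiable and then_sentence_limit and prev_not_svoc_a:
        kept = tags & {"ADV"}   # eliminate non-ADV tags
    else:
        kept = tags - {"ADV"}   # eliminate the ADV tag
    # Never remove the last remaining reading.
    return kept or set(tags)


sentence = [
    ("it", {"PRON"}), ("isn't", {"V"}),
    ("that", {"ADV", "DET", "PRON", "CS"}), ("odd", {"A"}),
]
print(apply_that_rule(sentence, 2))
```

For "it isn't that odd" the conditions all hold, so only the ADV reading of that survives. If the preceding verb carried the SVOC/A tag instead, as in a hypothetical "I consider that odd", the ADV reading would be the one removed.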

28 Rule-Based POS Tagging Now ENGCG-2 (4000 rules) –http://www.connexor.com/demos/tagger_en.html

29 Rule-Based POS Tagging Now ENGCG-2 (4000 rules) –http://www.connexor.com/demos/tagger_en.html

30 Rule-Based POS Tagging Best performance of all systems: 99.7%

31 Next Time Look at statistical techniques …

