1 LING 438/538 Computational Linguistics Sandiway Fong Lecture 20: 11/2

2 Today's Topics
– Conclude the n-gram section (Chapter 6)
– Start the Part-of-Speech (POS) Tagging section (Chapter 8)

3 N-grams for Spelling Correction
Spelling Errors (see Chapter 5; Kukich, 1992):
– Non-word error detection (easiest): graffe (giraffe)
– Isolated-word (context-free) error correction: graffe (giraffe, …), graffed (gaffed, …); by definition this cannot correct an error word that is itself a valid word
– Context-dependent error detection and correction (hardest): your an idiot → you're an idiot (Microsoft Word corrects this by default)
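The isolated-word step is usually implemented by generating nearby strings and filtering them against a dictionary. A minimal Python sketch (not from the lecture; the tiny dictionary is an illustrative assumption):

```python
import string

def edits1(word):
    """All strings one deletion, transposition, substitution, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

# Toy dictionary; a real corrector would use a full word list with frequencies.
dictionary = {"giraffe", "gaffe", "graft", "gaffed", "grafted"}

print(edits1("graffe") & dictionary)   # {'giraffe', 'gaffe'}
```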

4 N-grams for Spelling Correction
Context-sensitive spelling correction:
– real-word error detection: when the mistyped word happens to be a real word
  – 15% (Peterson, 1986); 25%-40% (Kukich, 1992)
– examples (local): was conducted mainly be John Black; leaving in about fifteen minuets; all with continue smoothly
– examples (global): Won't they heave if next Monday at that time?; the system has been operating system with all four units on-line

5 N-grams for Spelling Correction
Given a word sequence:
– W = w1 … wk … wn
Suppose wk is misspelled
Suppose the candidate corrections are wk^1, wk^2, etc.
– wk^1, wk^2, etc. can be generated via edit distance operations
Find the most likely sequence:
– w1 … wk^i … wn
– i.e. maximize p(w1 … wk^i … wn)
Chain Rule:
– p(w1 w2 w3 … wn) = p(w1) p(w2|w1) p(w3|w1 w2) … p(wn|w1 … wn-1)

6 N-grams for Spelling Correction
Use an n-gram language model for P(W)
Bigram:
– p(w1 w2 w3 … wn) = p(w1) p(w2|w1) p(w3|w1 w2) … p(wn|w1 … wn-1)
– p(w1 w2 w3 … wn) ≈ p(w1) p(w2|w1) p(w3|w2) … p(wn|wn-1)
Trigram:
– p(w1 w2 w3 w4 … wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) … p(wn|w1 … wn-1)
– p(w1 w2 w3 … wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) … p(wn|wn-2 wn-1)
Problem:
– we need to estimate n-grams containing the "misspelled" items
– where do we find the data?
– acute sparse-data problem
Possible solution ("class-based n-grams"):
– use a part-of-speech n-gram model
– more data
– p(c1 c2 c3 … cn) ≈ p(c1) p(c2|c1) p(c3|c2) … p(cn|cn-1) (bigram)
– ci = category label (N, V, A, P, Adv, Conj, etc.)
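As a sketch of how a bigram model could rank candidate corrections in context (an illustration, not the lecture's system; the tiny training corpus and the add-one smoothing are assumptions):

```python
import math
from collections import Counter

# Toy training corpus; a real model would be trained on millions of words.
corpus = "i took a long walk to the park and i took a walk yesterday".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def bigram_logprob(prev, word):
    # Add-one smoothing so unseen bigrams do not zero out the whole sequence.
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))

def sequence_logprob(words):
    return sum(bigram_logprob(p, w) for p, w in zip(words, words[1:]))

# Rank candidate corrections for the misspelled slot by P(W) under the bigram model.
context = "i took a long ___ to the park".split()
for candidate in ["walk", "wall"]:
    words = [candidate if w == "___" else w for w in context]
    print(candidate, sequence_logprob(words))   # "walk" scores higher here
```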

7 Entropy
Uncertainty measure (Shannon)
– given a random variable x with r possible outcomes, pi = probability that the outcome is i (for a coin, r = 2)
– entropy = H(x) = -Σi pi lg pi = Shannon uncertainty
– lg = log2 (log base 2)
– Biased coin (0.8/0.2): -0.8 * lg 0.8 - 0.2 * lg 0.2 = 0.258 + 0.464 = 0.722
– Unbiased coin: -2 * 0.5 * lg 0.5 = 1
Perplexity
– (average) branching factor
– weighted average number of choices a random variable has to make
– Formula: 2^H, directly related to the entropy value H
Examples
– Biased coin: 2^0.722 ≈ 1.65
– Unbiased coin: 2^1 = 2
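A quick numerical check of the coin examples (a minimal sketch using Python's math.log2):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p_i * log2 p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for name, probs in [("biased coin", [0.8, 0.2]), ("unbiased coin", [0.5, 0.5])]:
    h = entropy(probs)
    print(f"{name}: H = {h:.3f} bits, perplexity 2^H = {2 ** h:.2f}")
# biased coin: H = 0.722 bits, perplexity 2^H = 1.65
# unbiased coin: H = 1.000 bits, perplexity 2^H = 2.00
```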

8 Entropy and Word Sequences
Given a word sequence:
– W = w1 … wn
Entropy for word sequences of length n in language L:
– H(w1 … wn) = -Σ p(w1 … wn) log p(w1 … wn)
– summed over all sequences of length n in language L
Entropy rate for word sequences of length n:
– 1/n H(w1 … wn) = -1/n Σ p(w1 … wn) log p(w1 … wn)
Entropy rate of the language:
– H(L) = lim n→∞ -1/n Σ p(w1 … wn) log p(w1 … wn)
– n is the number of words in the sequence
Shannon-McMillan-Breiman theorem:
– H(L) = lim n→∞ -1/n log p(w1 … wn)
– select a sufficiently large n
– it is then possible to take a single sequence instead of summing over all possible w1 … wn (a long sequence will contain many shorter sequences)

9 Cross-Entropy
Evaluate competing models:
– compare two probabilistic models m1 and m2, i.e. approximations of the true distribution p
Compute the cross-entropy of mi on p:
– H(p, mi) = lim n→∞ -1/n Σ p(w1 … wn) log mi(w1 … wn)
Shannon-McMillan-Breiman version:
– H(p, mi) = lim n→∞ -1/n log mi(w1 … wn)
H(p) ≤ H(p, mi)
– the true entropy is a lower bound
– the mi with the lowest H(p, mi) is the more accurate model
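A minimal sketch of model comparison by cross-entropy, using the Shannon-McMillan-Breiman approximation H(p, mi) ≈ -1/n log2 mi(w1 … wn); the two toy unigram models and the test words are invented for illustration:

```python
import math

def cross_entropy(model, test_words, floor=1e-6):
    """Per-word cross-entropy in bits: -1/n * log2 m(w1 ... wn) for a unigram m."""
    n = len(test_words)
    return -sum(math.log2(model.get(w, floor)) for w in test_words) / n

# Two toy unigram approximations of the "true" source p.
m1 = {"the": 0.4, "cat": 0.3, "sat": 0.3}
m2 = {"the": 0.6, "cat": 0.2, "sat": 0.2}

test = ["the", "cat", "sat", "the", "cat"]
for name, m in [("m1", m1), ("m2", m2)]:
    print(name, round(cross_entropy(m, test), 3))
# The model with the lower cross-entropy is the better approximation of p.
```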

10 Perplexity of Language Models [see p. 228, section 6.7]
n-gram models
– corpus: 38 million words from the WSJ (Wall Street Journal)
– compute the perplexity of each model on a test set of 1.5 million words
– perplexity defined as 2^H(p, mi)
Results:
– unigram 962
– bigram 170
– trigram 109
– the lower the perplexity, the more closely the trained model follows the data

11 Entropy of English
Shannon Experiment
– given a (hidden) sequence of characters
– ask a speaker of the language to predict what the next character might be
– record the number of guesses taken to get the right character
– H(English) = -1/n Σ p(guess = character) log p(guess = character)
  – guess ranges over all characters (letters and space); n is 27
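One common way to summarize such a guessing experiment is to take the distribution over how many guesses were needed and compute its entropy. A sketch with invented guess counts (not Shannon's data):

```python
import math
from collections import Counter

# Invented example: number of guesses needed for each of 20 hidden characters.
guesses_needed = [1, 1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 5, 1, 2, 1, 1, 4, 1, 1]

n = len(guesses_needed)
counts = Counter(guesses_needed)
# Entropy of the guess-number distribution, in bits per character.
h = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(f"estimated entropy: {h:.2f} bits per character")   # about 1.4 here
```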

12 Entropy of English
Shannon Experiment
– 1.3 bits (recorded)
– URL: http://www.math.ucsd.edu/~crypto/java/ENTROPY/
– "We should also mention that in a classroom of about 60 students, with everybody venturing guesses for each next letter, we consistently obtained a value of about 1.6 bits for the estimate of the entropy."

13 Entropy of English
Word-based method
– train a very good stochastic model m of English on a large corpus
– use it to assign a log-probability to a very long sequence
Shannon-McMillan-Breiman formula:
– H(p, m) = lim n→∞ -1/n log m(w1 … wn)
– H(English) ≤ H(p, m)
Result: 1.75 bits per character (Brown et al.)
– 583-million-word corpus to train model m
– test sequence was the Brown Corpus (1 million words)

14 Next Topic Chapter 8: –Word Classes and Part-of-Speech Tagging

15 Parts-of-Speech
Divide words into classes based on grammatical function
– nouns (open-class: unlimited set)
  referential items (denoting objects/concepts etc.)
  – proper nouns: John
  – pronouns: he, him, she, her, it
  – anaphors: himself, herself (reflexives)
  – common nouns: dog, dogs, water
    » number: dog (singular), dogs (plural)
    » count-mass distinction: many dogs, *many waters
  – eventive nouns: dismissal, concert, playback, destruction (deverbal)
  nonreferential items
  – it as in it is important to study
  – there as in there seems to be a problem
  – some languages don't have these: e.g. Japanese
  open-class examples: factoid, email, bush-ism

16 Parts-of-Speech
Pronouns (figure 8.4): it, I, he, you, his, they, this, that, she, her, we, all, which, their, what

17 Parts-of-Speech
Divide words into classes based on grammatical function
– auxiliary verbs (closed-class: fixed set)
  – be (passive, progressive)
  – have (pluperfect tense)
  – do (what did John buy?, Did Mary win?)
  – modals: can, could, would, will, may
  – irregular forms: is, was, were, does, did
(figure 8.5)

18 Parts-of-Speech
Divide words into classes based on grammatical function
– verbs (open-class: unlimited set)
  Intransitive
  – unaccusatives: arrive (achievement)
  – unergatives: run, jog (activities)
  Transitive
  – actions: hit (semelfactive: hit the ball for an hour)
  – actions: eat, destroy (accomplishment)
  – psych verbs: frighten (x frightens y), fear (y fears x)
  Ditransitive
  – put (x put y on z, *x put y)
  – give (x gave y z, *x gave y, x gave z to y)
  – load (x loaded y (on z), x loaded z (with y))
  open-class examples: reaganize, email, fax

19 Parts-of-Speech
Divide words into classes based on grammatical function
– adjectives (open-class: unlimited set)
  modify nouns: black, white, open, closed, sick, well
  – attributive: black (black car, car is black), main (main street, *street is main), atomic
  – predicative: afraid (*afraid child, the child is afraid)
  – stage-level: drunk (there is a man drunk in the pub)
  – individual-level: clever, short, tall (*there is a man tall in the bar)
  – object-taking: proud (proud of him, *well of him)
  – intersective: red (red car: intersection of the set of red things and the set of cars)
  – non-intersective: former (former architect), atomic (atomic scientist)
  – comparative, superlative: blacker, blackest, *opener, *openest
  open-class examples: hackable, spammable

20 Parts-of-Speech
Divide words into classes based on grammatical function
– adverbs (open-class: unlimited set)
  modify verbs (also adjectives and other adverbs)
  – manner: slowly (moved slowly)
  – degree: slightly, more (more clearly), very (very bad), almost
  – sentential: unfortunately, suddenly
  – question: how
  – temporal: when, soon, yesterday (noun?)
  – location: sideways, here (John is here)
  open-class examples: spam-wise

21 Parts-of-Speech
Divide words into classes based on grammatical function
– prepositions (closed-class: fixed set)
  – come before an object and assign it a semantic function (from Mars, *Mars from)
  – head-final languages have postpositions instead (Japanese: amerika-kara)
  – location: on, in, by
  – temporal: by, until
(figure 8.1)

22 Parts-of-Speech
Divide words into classes based on grammatical function
– particles (closed-class: fixed set)
  – resemble a preposition or adverb; often combine with a verb to form a phrasal verb
  – went on, finish up
  – throw sleep off (throw off sleep)
  – single-word particles (Quirk, 1985): figure 8.2

23 Parts-of-Speech
Divide words into classes based on grammatical function
– conjunctions (closed-class: fixed set)
  – used to join two phrases, clauses, or sentences
  – coordinating conjunctions: and, or, but
  – subordinating conjunctions: that (complementizer)
(figure 8.3)

24 Part-of-Speech (POS) Tagging
Idea:
– assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word
– useful for shallow parsing, or as the first stage of a deeper/more sophisticated system
Question:
– Is it a hard task? i.e. can't we just look the words up in a dictionary?
Answer:
– Yes: ambiguity
– No: POS taggers typically claim 95%+ accuracy

25 Part-of-Speech (POS) Tagging
Example:
– walk: noun, verb
  – the walk: noun (I took …)
  – I walk: verb (… 2 miles every day)
– as a shallow parsing tool: can we do this without fully parsing the sentence?
Example:
– still: noun, adjective, adverb, verb
  – noun: the still of the night, a glass still
  – adjective: still waters
  – adverb: stand still, still struggling, Still, I didn't give way
  – verb: still your fear of the dark (transitive), the bubbling waters stilled (intransitive)

26 POS Tagging
Task:
– assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
POS taggers:
– need to be fast in order to process large corpora; should take no more than time linear in the size of the corpus
– full parsing is slow: e.g. context-free grammar parsing takes O(n³) time, where n is the length of the sentence
– POS taggers try to assign the correct tag without actually parsing the sentence

27 POS Tagging
Components:
– Dictionary of words
  – exhaustive list of closed-class items, for example:
    » the, a, an: determiner
    » from, to, of, by: preposition
    » and, or: coordinating conjunction
  – large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information

28 POS Tagging
Components:
– Mechanism to assign tags
  – context-free: by frequency
  – context: bigram, trigram, HMM, hand-coded rules
  – example: Det Noun/*Verb (the walk…)
– Mechanism to handle unknown words (words not in the dictionary)
  – capitalization
  – morphology: -ed, -tion
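A hedged sketch of the unknown-word heuristics above (capitalization plus a few suffixes such as -ed and -tion); the suffix-to-tag table is an illustrative guess, not the lecture's:

```python
def guess_tag(word):
    """Guess a Penn Treebank tag for a word that is not in the dictionary."""
    if word[0].isupper():
        return "NNP"                      # capitalized: likely a proper noun
    if word.endswith("ed"):
        return "VBD"                      # past-tense verb (or past participle)
    if word.endswith(("tion", "ment", "ness")):
        return "NN"                       # common noun-forming suffixes
    if word.endswith("ly"):
        return "RB"                       # adverb
    return "NN"                           # default guess: singular noun

for w in ["Fong", "graffed", "globalization", "slowly"]:
    print(w, guess_tag(w))
```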

29 How Hard is Tagging? Brown Corpus (Francis & Kucera, 1982): –1 million words –39K distinct words –35K words with only 1 tag –4K with multiple tags (DeRose, 1988) figure 8.7

30 How Hard is Tagging?
Easy task to do well on:
– naïve algorithm: assign tag by (unigram) frequency
– 90% accuracy (Charniak et al., 1993)
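A minimal version of that naïve baseline (a sketch, with a tiny hand-made training list standing in for the Brown Corpus): tag every word with its most frequent training tag, defaulting to NN for unseen words.

```python
from collections import Counter, defaultdict

# Tiny stand-in for a tagged training corpus such as Brown.
training = [("the", "DT"), ("walk", "NN"), ("was", "VBD"), ("long", "JJ"),
            ("i", "PRP"), ("took", "VBD"), ("a", "DT"), ("walk", "NN"),
            ("i", "PRP"), ("walk", "VBP"), ("to", "TO"), ("school", "NN")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def unigram_tag(word):
    """Most frequent tag seen in training; NN if the word is unknown."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NN"

print([(w, unigram_tag(w)) for w in "i walk the dog".split()])
# "walk" is tagged NN even after "i": a unigram tagger cannot use context,
# which is exactly where the bigram/trigram/HMM mechanisms come in.
```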

31 Penn TreeBank Tagset
48-tag simplification of the Brown Corpus tagset
Examples:
– CC: coordinating conjunction
– DT: determiner
– JJ: adjective
– MD: modal
– NN: noun (singular, mass)
– NNS: noun (plural)
– VB: verb (base form)
– VBD: verb (past tense)

32 Penn TreeBank Tagset www.ldc.upenn.edu/doc/treebank2/cl93.html

33 Penn TreeBank Tagset www.ldc.upenn.edu/doc/treebank2/cl93.html

34 Penn TreeBank Tagset
How many tags?
– tag criterion: distinctness with respect to grammatical behavior?
– or: make tagging easier?
Punctuation tags
– Penn Treebank tags 37-48
– assigning them is a trivial computational task

35 Penn TreeBank Tagset
Simplifications:
– Tag TO: infinitival marker and preposition
  – I want to win
  – I went to the store
– Tag IN: preposition, including that, when, although
  – I know that I should have stopped, although…
  – I stopped when I saw Bill

36 Penn TreeBank Tagset
Simplifications:
– Tag DT: determiner: any, some, these, those
  – any man, these *man/men
– Tag VBP: verb, present tense: am, are, walk
  – Am I here? *Walked I here?/Did I walk here?

37 Hard-to-Tag Items
Syntactic function
– example (resultative): I saw the man tired from running
Examples (from the Brown Corpus Manual)
– hyphenation: long-range, high-energy, shirt-sleeved, signal-to-noise
– foreign words: mens sana in corpore sano

