
1  6. N-GRAMs
Artificial Intelligence Lab, Pusan National University (최성자)

2  Word Prediction
- "I'd like to make a collect ..." (next word: call, telephone, or person-to-person?)
- Spelling error detection
- Augmentative communication
- Context-sensitive spelling error correction

3  Language Model
- Language Model (LM): a statistical model of word sequences
- n-gram: use the previous n-1 words to predict the next word

4  Applications
- Context-sensitive spelling error detection and correction
  - "He is trying to fine out."
  - "The design an construction will take a year."
- Machine translation

5  Counting Words in Corpora
- Corpora: on-line text collections
- Which words to count
  - What we are going to count
  - Where we are going to find the things to count

6  Brown Corpus
- 1 million words, 500 texts
- Varied genres (newspaper, novels, non-fiction, academic, etc.)
- Assembled at Brown University in 1963-64
- The first large on-line text collection used in corpus-based NLP research

7  Issues in Word Counting
- Punctuation symbols (. , ? !)
- Capitalization ("He" vs. "he", "Bush" vs. "bush")
- Inflected forms ("cat" vs. "cats")
  - Wordforms: cat, cats, eat, eats, ate, eating, eaten
  - Lemmas (stems): cat, eat

8  Types vs. Tokens
- Tokens (N): total number of running words
- Types (B): number of distinct words in a corpus (size of the vocabulary)
- Example: "They picnicked by the pool, then lay back on the grass and looked at the stars."
  - 16 word tokens, 14 word types (not counting punctuation)
- Note: "types" here means wordform types, not lemma types, and punctuation marks are generally counted as words
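As a small illustration of the token/type distinction (my own sketch, not part of the original slides), the example sentence can be counted with a few lines of Python:

```python
import re

# The example sentence from this slide.
sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Keep only word characters, so punctuation is not counted.
tokens = re.findall(r"[A-Za-z']+", sentence)
types = set(tokens)   # distinct wordforms; "They" and "the" remain separate

print(len(tokens))  # 16 word tokens
print(len(types))   # 14 word types ("the" occurs three times)
```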

9  How Many Words in English?
- Shakespeare's complete works: 884,647 wordform tokens, 29,066 wordform types
- Brown Corpus: 1 million wordform tokens, 61,805 wordform types, 37,851 lemma types

10  Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- First attempt: no corpus available
  - Use a uniform distribution over word types
  - Assume there are V word types (e.g., V = 100,000), so each word gets probability 1/V

11  Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- Second attempt: a corpus is available
  - Assume the corpus has N word tokens and w appears C(w) times
  - Relative-frequency estimate: P(w) = C(w) / N
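A minimal sketch of this relative-frequency (MLE) unigram estimate, using a toy corpus of my own rather than anything from the slides:

```python
from collections import Counter

# A toy corpus (illustrative only, not from the slides).
corpus = "the cat sat on the mat the cat ate".split()

counts = Counter(corpus)   # C(w): how often each word type appears
N = len(corpus)            # number of word tokens

def unigram_prob(w):
    """Unsmoothed (relative-frequency) unigram estimate P(w) = C(w) / N."""
    return counts[w] / N

print(unigram_prob("the"))  # 3/9 = 0.333...
print(unigram_prob("dog"))  # 0.0 -- unseen words get zero probability (motivates smoothing)
```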

12  Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- Third attempt: a corpus is available
  - Assume each word depends only on its n-1 previous words

13  Simple (Unsmoothed) N-grams

14  Simple (Unsmoothed) N-grams
- n-gram approximation: w_k depends only on its previous n-1 words
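The formula on this slide did not survive extraction; the standard chain rule and the n-gram approximation it names are usually written as:

```latex
% Chain rule over a word sequence, and the n-gram approximation
P(w_1^{k}) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1})
\qquad
P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-n+1}^{k-1})
```

For a bigram model (n = 2) this reduces to P(w_k | w_1^{k-1}) ≈ P(w_k | w_{k-1}).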

15  Bigram Approximation
- Example: P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
- <s> is a special word marking the start of a sentence

16  Note on a Practical Problem
- Multiplying many probabilities yields a very small number and can cause numerical underflow
- Use log probabilities (logprobs) in the actual computation: add logs instead of multiplying probabilities
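A short sketch of the logprob trick, using made-up bigram probabilities (only P(eat|to) ≈ 0.26 is tied to the counts on slide 19):

```python
import math

# Hypothetical bigram probabilities for the sentence on slide 15 (illustrative values only).
bigram_probs = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

def sentence_logprob(words):
    """Sum log P(w_i | w_{i-1}): adding logs replaces multiplying probabilities."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(bigram_probs[(prev, w)])
        prev = w
    return logp

lp = sentence_logprob(["I", "want", "to", "eat", "British", "food"])
print(lp)            # a comfortably sized negative number
print(math.exp(lp))  # the raw product is tiny and invites underflow for long texts
```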

17  Estimating N-gram Probability
- Maximum Likelihood Estimate (MLE)

18

19  Estimating Bigram Probability
- Example: C(to eat) = 860, C(to) = 3256
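The accompanying formula slide was lost in extraction; applying the MLE definition to these counts gives:

```latex
% MLE bigram estimate applied to the counts on slide 19
P(\text{eat} \mid \text{to})
  = \frac{C(\text{to eat})}{C(\text{to})}
  = \frac{860}{3256} \approx 0.26
```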

20

21  Two Important Facts
- N-gram models become more accurate as we increase the value of N
- They depend very strongly on their training corpus (in particular its genre and its size in words)

22  Smoothing
- Any particular training corpus is finite
- This causes a sparse-data problem: some n-grams get zero counts
- Smoothing deals with these zero probabilities

23  Smoothing
- Smoothing: re-evaluating zero-probability n-grams and assigning them non-zero probability
- Also called discounting: lowering non-zero n-gram counts in order to assign some probability mass to the zero-count n-grams

24  Add-One Smoothing for Bigrams
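The formulas on slides 24-26 are missing from the transcript; the standard add-one (Laplace) bigram estimate that the title refers to is:

```latex
% Add-one (Laplace) smoothed bigram estimate; V = vocabulary size
P_{\text{add-1}}(w_n \mid w_{n-1})
  = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
```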

25

26

27  Things Seen Once
- Use the count of things seen once to help estimate the count of things never seen

28  Witten-Bell Discounting

29  Witten-Bell Discounting for Bigrams

30  Witten-Bell Discounting for Bigrams
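The Witten-Bell formulas on slides 28-30 are missing from the transcript; the usual textbook formulation for bigrams (e.g., in Jurafsky and Martin), which these slides appear to follow, is:

```latex
% Witten-Bell discounting for bigrams (standard presentation, given as an assumption)
% T(w_x) = number of distinct word types seen after w_x
% N(w_x) = C(w_x) = number of bigram tokens starting with w_x
% Z(w_x) = V - T(w_x) = number of word types never seen after w_x
P^{*}(w_i \mid w_x) =
  \begin{cases}
    \dfrac{C(w_x w_i)}{C(w_x) + T(w_x)} & \text{if } C(w_x w_i) > 0 \\[2ex]
    \dfrac{T(w_x)}{Z(w_x)\,\bigl(C(w_x) + T(w_x)\bigr)} & \text{if } C(w_x w_i) = 0
  \end{cases}
```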

31  Seen counts vs. unseen counts

32

33  Good-Turing Discounting for Bigrams
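The Good-Turing formula slide was lost; the standard re-estimated count it refers to is:

```latex
% Good-Turing discounting: N_c = number of n-grams that occur exactly c times,
% N = total number of observed n-gram tokens
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}
\qquad
P_{\text{GT}}(\text{unseen n-gram}) = \frac{N_1}{N}
```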

34

35  Backoff

36  Backoff
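The backoff formulas on slides 35-36 are missing; a common Katz-style formulation for trigrams, shown here as an assumed reconstruction rather than the slides' exact equations, is:

```latex
% Katz-style backoff for trigrams (common formulation; an assumed reconstruction)
% P^* are discounted estimates; the alphas distribute the left-over probability mass
\hat{P}(w_i \mid w_{i-2} w_{i-1}) =
  \begin{cases}
    P^{*}(w_i \mid w_{i-2} w_{i-1}) & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\
    \alpha_1\, P^{*}(w_i \mid w_{i-1}) & \text{else, if } C(w_{i-1} w_i) > 0 \\
    \alpha_2\, P^{*}(w_i) & \text{otherwise}
  \end{cases}
```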

37  Entropy
- A measure of uncertainty
- Used to evaluate the quality of n-gram models (how well a language model matches a given language)
- Entropy H(X) of a random variable X is measured in bits
- It is the number of bits needed to encode the information in the optimal coding scheme
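The definition that followed on the slide is the standard one:

```latex
% Entropy of a discrete random variable X, in bits
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_{2} p(x)
```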

38  Example 1

39  Example 2

40  Perplexity
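The slide's formula is not in the transcript; the standard relation between perplexity, entropy, and sentence probability is:

```latex
% Perplexity of a word sequence W = w_1 ... w_N under a model,
% where H(W) is the per-word (cross-)entropy
PP(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-1/N}
```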

41  Entropy of a Sequence

42  Entropy of a Language
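The formulas for slides 41-42 are missing from the transcript; the standard definitions of the per-word entropy of a sequence and the entropy rate of a language L are:

```latex
% Per-word entropy of sequences of length n drawn from language L
\frac{1}{n} H(w_1^{n}) = -\frac{1}{n} \sum_{w_1^{n} \in L} p(w_1^{n}) \log p(w_1^{n})

% Entropy of the language L: the limit as sequences grow infinitely long
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1^{n})
```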

43  Cross Entropy
- Used for comparing two language models
- p: the actual probability distribution that generated the data
- m: a model of p (an approximation to p)
- Cross entropy of m on p: see the formula below
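The formula that followed on the slide is missing; the standard definition of the cross entropy of a model m on the true distribution p over word sequences is:

```latex
% Cross entropy of model m with respect to the true distribution p
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^{n}} p(w_1^{n}) \log m(w_1^{n})
```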

44  Cross Entropy
- By the Shannon-McMillan-Breiman theorem, the cross entropy can be estimated from a single sufficiently long sample of the language (see the formula below)
- Property of cross entropy: H(p, m) is at least H(p)
- The difference between H(p, m) and H(p) is a measure of how accurate model m is
- The more accurate the model, the lower its cross entropy
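The Shannon-McMillan-Breiman form that this slide refers to (the equation itself was lost) is standardly written as:

```latex
% Shannon-McMillan-Breiman: cross entropy from a single long sample w_1^n,
% together with the property that the true entropy is a lower bound
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1^{n})
\qquad\text{and}\qquad
H(p) \le H(p, m)
```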

