Published by Lewis Cameron. Modified about 1 year ago.

1
Naïve Bayes
Advanced Statistical Methods in NLP
Ling572
January 19, 2012

2
Roadmap
- Naïve Bayes
  - Multi-variate Bernoulli event model (recap)
  - Multinomial event model
  - Analysis
- HW#3

3
Naïve Bayes Models in Detail (McCallum & Nigam, 1998)
- Alternate models for Naïve Bayes text classification
- Multivariate Bernoulli event model
  - Binary independence model
  - Features treated as binary; counts ignored
- Multinomial event model
  - Unigram language model

4
Multivariate Bernoulli Event Text Model
- Each document: result of |V| independent Bernoulli trials
  - I.e., for each word in the vocabulary, does the word appear in the document?
- From the general Naïve Bayes perspective
  - Each word corresponds to two variables, $w_t$ and $\bar{w}_t$
  - In each document, either $w_t$ or $\bar{w}_t$ appears
  - Always have |V| elements in a document

5
Training & Testing
- Laplace-smoothed training: [formulas not preserved]
- MAP decision rule classification: P(c) [rest of formula not preserved]
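The training and decision formulas on this slide are missing from the transcript. As a minimal sketch, assuming the Laplace-smoothed estimate of McCallum & Nigam (1998), $P(w_t \mid c) = (1 + \text{number of class-}c\text{ documents containing } w_t) / (2 + \text{number of class-}c\text{ documents})$, and an unsmoothed relative-frequency prior (the slide may have smoothed it), a multivariate Bernoulli classifier could look like this. The function names `train_bernoulli` and `classify_bernoulli` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bernoulli(docs, labels, vocab):
    """docs: list of token lists; labels: parallel list of class names."""
    n_docs = Counter(labels)                     # number of documents per class
    df = defaultdict(Counter)                    # df[c][w] = class-c docs containing w
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens) & set(vocab))   # presence only; counts are ignored
    # Assumption: class prior is the plain relative frequency (no smoothing).
    prior = {c: n_docs[c] / len(docs) for c in n_docs}
    # Laplace smoothing: add 1 to the numerator, 2 (two outcomes) to the denominator.
    cond = {c: {w: (1 + df[c][w]) / (2 + n_docs[c]) for w in vocab} for c in n_docs}
    return prior, cond

def classify_bernoulli(tokens, prior, cond, vocab):
    """MAP decision: argmax_c log P(c) + sum over ALL vocabulary words."""
    present = set(tokens)
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in vocab:                          # absent words contribute log(1 - P(w|c))
            p = cond[c][w]
            s += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)
```

Note that every vocabulary word contributes to the score, whether or not it appears in the document; this is what "always have |V| elements in a document" means in practice.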

6
Multinomial Event Model

11
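Multinomial Distribution
- Trial: select a word according to its probability
- Possible outcomes: $\{w_1, w_2, \ldots, w_{|V|}\}$
- Document is viewed as the result of one trial for each position
  - $P(\text{word} = w_i) = p_i$
  - $\sum_i p_i = 1$
- $P(X_1 = x_1, X_2 = x_2, \ldots, X_{|V|} = x_{|V|}) = \frac{n!}{x_1! \cdots x_{|V|}!} \prod_i p_i^{x_i}$, where $n = \sum_i x_i$ is the number of positions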
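To make the distribution concrete, here is a small sketch of the standard multinomial probability mass function, $\frac{n!}{x_1! \cdots x_{|V|}!} \prod_i p_i^{x_i}$. The function name `multinomial_pmf` is an illustrative choice, not something from the slides.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_|V| = x_|V|) for n = sum(counts) independent trials.

    counts[i] is x_i (how often outcome i occurred); probs[i] is p_i.
    """
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)       # n! / (x_1! ... x_|V|!), exact at every step
    return coeff * prod(p ** x for p, x in zip(probs, counts))
```

For a three-word vocabulary and a two-word document, `multinomial_pmf([1, 1, 0], [p1, p2, p3])` evaluates to `2 * p1 * p2`.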

15
Example (due to F. Xia)
- Consider a vocabulary V with only three words: a, b, c
- Document $d_i$ contains only 2 word instances
- For each position: $P(w=a) = p_1$, $P(w=b) = p_2$, $P(w=c) = p_3$
- What is the probability that we see ‘a’ once and ‘b’ once in $d_i$?

22
Example (cont’d) (due to F. Xia)
- How many possible sequences? $3^2 = 9$
  - Sequences: aa, ab, ac, bb, ba, bc, ca, cb, cc
- How many sequences with one ‘a’ and one ‘b’? $n!/(x_1! \cdots x_{|V|}!) = 2!/(1!\,1!\,0!) = 2$
- Probability of the sequence ‘ab’: $p_1 p_2$
- Probability of the sequence ‘ba’: $p_2 p_1$
- So the probability of seeing ‘a’ once and ‘b’ once is $2 p_1 p_2$
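The counting argument can be checked by brute force. The sketch below enumerates all two-word sequences over {a, b, c} and sums the probabilities of those with one ‘a’ and one ‘b’. The numeric values for $p_1, p_2, p_3$ are assumed purely for illustration; the slides leave them symbolic.

```python
from itertools import product
from math import prod

# Assumed example values for p_1, p_2, p_3 (any values summing to 1 would do).
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# All 3^2 = 9 sequences of length 2 over the vocabulary.
sequences = ["".join(s) for s in product("abc", repeat=2)]

# Sequences containing exactly one 'a' and one 'b': 'ab' and 'ba'.
matching = [s for s in sequences if sorted(s) == ["a", "b"]]

# Each sequence's probability is the product of its per-position probabilities.
total = sum(prod(p[w] for w in s) for s in matching)

print(len(sequences), matching, total)
```

The summed probability agrees with $2 p_1 p_2$, which is the multinomial coefficient times the probability of any single ordering.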

28
Multinomial Event Model
- Document is a sequence of word events drawn from vocabulary V
  - Assume document length is independent of class
  - Assume (Naïve Bayes) words are independent of context
- Define $N_{it}$ = number of occurrences of $w_t$ in document $d_i$
- Then under the multinomial event model: [formula not preserved]
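The formula itself is missing from the transcript. McCallum & Nigam (1998) give the multinomial document likelihood as $P(d_i \mid c_j) = P(|d_i|)\,|d_i|! \prod_t \frac{P(w_t \mid c_j)^{N_{it}}}{N_{it}!}$, and the slide most likely showed this or an equivalent form. The sketch below computes it in log space under that assumption, omitting the length term $P(|d_i|)$ because it is assumed independent of class. The name `log_doc_likelihood` is hypothetical.

```python
import math
from collections import Counter

def log_doc_likelihood(tokens, word_probs):
    """log of |d|! * prod_t P(w_t|c)^N_it / N_it!   (length term P(|d|) omitted).

    tokens: the document as a list of words; word_probs[w] = P(w|c) for one class c.
    """
    counts = Counter(tokens)                    # N_it for each word type in the document
    log_p = math.lgamma(len(tokens) + 1)        # log |d_i|!  (lgamma(n + 1) = log n!)
    for w, n in counts.items():
        log_p += n * math.log(word_probs[w]) - math.lgamma(n + 1)
    return log_p
```

Only word types that actually occur in the document enter the loop, since a word with $N_{it} = 0$ contributes a factor of 1.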

33
Training
- $P(c_j \mid d_i) = 1$ if document $d_i$ is of class $c_j$, and 0 otherwise
- So, [formulas not preserved]
- Contrast this with multivariate Bernoulli
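The estimates that followed "So," are missing from the transcript. A minimal sketch, assuming the Laplace-smoothed multinomial estimate from McCallum & Nigam (1998), $P(w_t \mid c_j) = \frac{1 + \sum_i N_{it} P(c_j \mid d_i)}{|V| + \sum_s \sum_i N_{is} P(c_j \mid d_i)}$: with hard labels, the sums reduce to word counts over the documents of class $c_j$. The relative-frequency prior and the name `train_multinomial` are assumptions.

```python
from collections import Counter, defaultdict

def train_multinomial(docs, labels, vocab):
    """docs: list of token lists; labels: parallel class names; vocab: set of words."""
    word_counts = defaultdict(Counter)   # word_counts[c][w] = sum of N_it over docs in c
    for tokens, c in zip(docs, labels):
        word_counts[c].update(t for t in tokens if t in vocab)
    n_docs = Counter(labels)
    # Assumption: class prior is the plain relative frequency (no smoothing).
    prior = {c: n_docs[c] / len(docs) for c in n_docs}
    cond = {}
    for c in n_docs:
        # Normalization denominator, computed once per class.
        denom = len(vocab) + sum(word_counts[c].values())
        cond[c] = {w: (1 + word_counts[c][w]) / denom for w in vocab}
    return prior, cond
```

The contrast with the multivariate Bernoulli model is in what gets counted: here every occurrence of a word adds to the numerator, and the denominator adds $|V|$ rather than 2.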

35
Testing
- To classify a document $d_i$, compute:
  - $\arg\max_c P(c)\,P(d_i \mid c)$
  - $= \arg\max_c P(c)$ [rest of expression not preserved]
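The second expression is cut off in the transcript. Under the multinomial model, the class-independent factors (document length and factorials) drop out of the argmax, so a natural reading is $\arg\max_c P(c) \prod_t P(w_t \mid c)^{N_{it}}$. The sketch below implements that reading in log space. The function name `classify_multinomial` and the choice to skip out-of-vocabulary words are assumptions.

```python
import math
from collections import Counter

def classify_multinomial(tokens, prior, cond):
    """Return (best class, log score), where prior[c] = P(c) and cond[c][w] = P(w|c)."""
    counts = Counter(tokens)                      # N_it for the test document
    best_class, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for w, n in counts.items():
            if w in cond[c]:                      # assumption: unknown words are skipped
                score += n * math.log(cond[c][w]) # exponent N_it becomes a multiplier
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```

Unlike the Bernoulli decision rule, the loop touches only the words that occur in the document.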

38
Two Naïve Bayes Models
- Multi-variate Bernoulli event model:
  - Models binary presence/absence of word features
- Multinomial event model:
  - Models counts of word features; unigram models
- In experiments on a range of different text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli model (McCallum & Nigam, 1998)

42
Thinking about Performance
- Naïve Bayes: conditional independence assumption
  - Clearly unrealistic, but performance is often good
  - Why?
    - Classification is based on sign, not magnitude
    - Direction of classification is usually right
- Multivariate Bernoulli vs. multinomial
  - Why does multinomial perform better?
    - Captures additional information: presence/absence + frequency
  - What if we wanted to include other types of features?
    - Multivariate Bernoulli: just another Bernoulli trial
    - Multinomial can’t mix distributions

50
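Model Comparison

            Multivariate Bernoulli       Multinomial Event
Features    Binary                       # of occurrences
Trial       Each word in vocabulary      Each position in document
P(c)        [formula not preserved]      [formula not preserved]
P(w|c)      [formula not preserved]      [formula not preserved]
Testing     [formula not preserved]      [formula not preserved]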

52
Naïve Bayes: Strengths
- Advantages:
  - Simplicity (conceptual)
  - Training efficiency
  - Testing efficiency
  - Scales fairly well to large data
  - Performs multiclass classification
  - Can provide n-best outputs

53
Naïve Bayes: Weaknesses
- Disadvantages:
  - Theoretical foundation weak: ragingly inaccurate independence assumption
  - Decent accuracy, but outperformed by more sophisticated

55
HW#3
- Naïve Bayes classification:
  - Experiment with the Mallet Naïve Bayes learner
  - Implement multivariate Bernoulli event model
  - Implement multinomial event model
    - Compare with binary variables
  - Analyze results

56
Notes
- Use add-delta smoothing (vs. add-one)
- Beware numerical underflow
  - Log probs are your friend
  - Also converts exponents to multipliers
- Look out for repeated computation
  - Precompute normalization denominators
  - E.g., for multinomial P(w|c), compute once for each c
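These notes can be illustrated with a short sketch. It assumes the usual multinomial form of add-delta smoothing, $P(w \mid c) = (\delta + \text{count}(w, c)) / (\delta |V| + \sum_{w'} \text{count}(w', c))$. The helper name `add_delta_probs` and the default `delta=0.1` are illustrative choices, not values from the assignment.

```python
import math

def add_delta_probs(counts, vocab, delta=0.1):
    """Multinomial P(w|c) with add-delta smoothing; counts[w] = count of w in class c."""
    # The normalization denominator is computed once per class, not once per word.
    denom = delta * len(vocab) + sum(counts.get(w, 0) for w in vocab)
    return {w: (delta + counts.get(w, 0)) / denom for w in vocab}

# Underflow demonstration: 200 word positions, each with P(w|c) = 0.001.
probs = [1e-3] * 200
naive = math.prod(probs)                       # 1e-600 is below the double range: 0.0
log_score = sum(math.log(p) for p in probs)    # finite, and still comparable across classes
```

Working in log space also turns each exponent $N_{it}$ into a multiplier, since $\log P(w_t \mid c)^{N_{it}} = N_{it} \log P(w_t \mid c)$.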

57
Efficiency
- MVB:
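The body of this slide is missing from the transcript, so what follows is an assumption about the kind of saving it had in mind, not a reconstruction. One well-known trick for the multivariate Bernoulli model rewrites the score as $\log P(c) + \sum_{w \in V} \log(1 - P(w \mid c)) + \sum_{w \in d} \left[\log P(w \mid c) - \log(1 - P(w \mid c))\right]$. The first two terms can be precomputed once per class, so scoring a document only touches the words it contains. The function names are hypothetical.

```python
import math

def precompute_bernoulli(prior, cond):
    """cond[c][w] = P(w|c). Returns per-class base scores and per-word log-odds."""
    base, log_odds = {}, {}
    for c in prior:
        # log P(c) plus the "all words absent" score, computed once per class.
        base[c] = math.log(prior[c]) + sum(math.log(1.0 - p) for p in cond[c].values())
        # Correction applied for each word that is actually present.
        log_odds[c] = {w: math.log(p) - math.log(1.0 - p) for w, p in cond[c].items()}
    return base, log_odds

def score_bernoulli(tokens, c, base, log_odds):
    """log P(c) + log P(d|c), touching only the distinct words in the document."""
    return base[c] + sum(log_odds[c][w] for w in set(tokens) if w in log_odds[c])
```

This reduces per-document work from a pass over all |V| vocabulary words to a pass over the distinct words in the document, while giving the same score.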
