# Naïve Bayes: Advanced Statistical Methods in NLP, Ling572, January 19, 2012


## Roadmap

- Naïve Bayes
  - Multivariate Bernoulli event model (recap)
  - Multinomial event model
  - Analysis
- HW#3

## Naïve Bayes Models in Detail

(McCallum & Nigam, 1998)

Alternate models for Naïve Bayes text classification:

- Multivariate Bernoulli event model
  - Binary independence model
  - Features treated as binary; counts ignored
- Multinomial event model
  - Unigram language model

## Multivariate Bernoulli Event Text Model

- Each document is the result of $|V|$ independent Bernoulli trials
  - I.e., for each word in the vocabulary, does the word appear in the document?
- From the general Naïve Bayes perspective:
  - Each word corresponds to two variables, $w_t$ and $\bar{w}_t$ (the word is absent)
  - In each document, either $w_t$ or $\bar{w}_t$ appears
  - A document always has $|V|$ elements

## Training & Testing

- Laplace-smoothed training
- MAP decision rule classification: $\arg\max_c P(c)\,P(d_i \mid c)$
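The training and classification steps named on this slide can be sketched in Python. This is a minimal illustration, not the course's reference solution: it assumes the standard add-one estimate for the multivariate Bernoulli model, $P(w \mid c) = (1 + \text{docs in } c \text{ containing } w) / (2 + \text{docs in } c)$, and an unsmoothed class prior. The function names and the toy data in the usage note are hypothetical.

```python
import math

def train_bernoulli_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = sorted({w for tokens in docs for w in tokens})
    classes = sorted(set(labels))
    n_docs = {c: 0 for c in classes}
    doc_freq = {c: {w: 0 for w in vocab} for c in classes}
    for tokens, c in zip(docs, labels):
        n_docs[c] += 1
        for w in set(tokens):          # presence only; counts are ignored
            doc_freq[c][w] += 1
    # Assumption: unsmoothed relative-frequency prior.
    prior = {c: n_docs[c] / len(docs) for c in classes}
    # Laplace (add-one) smoothing over the two outcomes present/absent.
    cond = {c: {w: (1 + doc_freq[c][w]) / (2 + n_docs[c]) for w in vocab}
            for c in classes}
    return vocab, prior, cond

def classify_bernoulli_nb(tokens, vocab, prior, cond):
    """MAP decision rule, computed with log probabilities."""
    present = set(tokens)
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:                # every vocabulary word is one trial
            p = cond[c][w]
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, training on two hypothetical "sports" documents and two "politics" documents and then calling `classify_bernoulli_nb(["goal", "team"], vocab, prior, cond)` scores every vocabulary word, including the ones absent from the test document.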

## Multinomial Event Model

## Multinomial Distribution

- Trial: select a word according to its probability
- Possible outcomes: $\{w_1, w_2, \ldots, w_{|V|}\}$
- A document is viewed as the result of one trial for each position
  - $P(\text{word} = w_i) = p_i$
  - $\sum_i p_i = 1$
- With $X_i$ the number of times $w_i$ is selected in $n = \sum_i x_i$ trials:

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{|V|} = x_{|V|}) = \frac{n!}{x_1!\, x_2! \cdots x_{|V|}!} \prod_{i=1}^{|V|} p_i^{x_i}$$
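As a small sketch of the multinomial probability of a count vector, the following Python function multiplies the multinomial coefficient $n!/(x_1! \cdots x_{|V|}!)$ by $\prod_i p_i^{x_i}$. The function name and the probabilities in the usage note are illustrative assumptions.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing counts[i] occurrences of outcome i,
    given per-trial outcome probabilities probs[i] (which sum to 1)."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)          # each partial quotient is an integer
    return coef * prod(p ** x for p, x in zip(probs, counts))
```

With hypothetical probabilities `[0.5, 0.3, 0.2]`, the call `multinomial_pmf([1, 1, 0], [0.5, 0.3, 0.2])` evaluates $2 \cdot 0.5 \cdot 0.3$, which is approximately 0.3.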

## Example

(Due to F. Xia)

- Consider a vocabulary $V$ with only three words: a, b, c
- Document $d_i$ contains only 2 word instances
- For each position: $P(w = a) = p_1$, $P(w = b) = p_2$, $P(w = c) = p_3$
- What is the probability that we see ‘a’ once and ‘b’ once in $d_i$?

## Example (cont’d)

(Due to F. Xia)

- How many possible sequences? $3^2 = 9$
  - Sequences: aa, ab, ac, ba, bb, bc, ca, cb, cc
- How many sequences with one ‘a’ and one ‘b’? $n!/(x_1! \cdots x_{|V|}!) = 2!/(1!\,1!\,0!) = 2$
- Probability of the sequence ‘ab’: $p_1 p_2$
- Probability of the sequence ‘ba’: $p_2 p_1$
- So the probability of seeing ‘a’ once and ‘b’ once is $2\,p_1 p_2$
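The counting argument in this example can be checked by brute force. The sketch below enumerates every sequence of the given length and sums the probabilities of those with the target counts. The function name is hypothetical, and the numeric values for $p_1, p_2, p_3$ in the usage note are made up for illustration.

```python
from itertools import product

def prob_of_counts_by_enumeration(target_counts, probs, length):
    """Sum the probabilities of all word sequences of the given length
    whose per-word counts match target_counts (missing words mean 0)."""
    vocab = sorted(probs)
    total = 0.0
    for seq in product(vocab, repeat=length):      # 3^2 = 9 sequences here
        if all(seq.count(w) == target_counts.get(w, 0) for w in vocab):
            p = 1.0
            for w in seq:                          # positions are independent
                p *= probs[w]
            total += p
    return total
```

With `probs = {"a": 0.5, "b": 0.3, "c": 0.2}`, the call `prob_of_counts_by_enumeration({"a": 1, "b": 1}, probs, 2)` adds the contributions of ‘ab’ and ‘ba’, matching $2\,p_1 p_2$.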

## Multinomial Event Model

- A document is a sequence of word events drawn from vocabulary $V$
- Assume document length is independent of class
- Assume (Naïve Bayes) that words are independent of context
- Define $N_{it}$ = number of occurrences of $w_t$ in document $d_i$
- Then, under the multinomial event model (McCallum & Nigam, 1998):

$$P(d_i \mid c_j) = P(|d_i|)\, |d_i|! \prod_{t=1}^{|V|} \frac{P(w_t \mid c_j)^{N_{it}}}{N_{it}!}$$

## Training

- $P(c_j \mid d_i) = 1$ if document $d_i$ is of class $c_j$, and 0 otherwise
- So, with Laplace smoothing and $|D|$ training documents (following McCallum & Nigam, 1998):

$$P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N_{it}\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is}\, P(c_j \mid d_i)}$$

$$P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|}$$

- Contrast this with the multivariate Bernoulli model, whose estimates count documents containing $w_t$ rather than occurrences of $w_t$
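A minimal Python sketch of multinomial training follows. It assumes hard class labels (so $P(c_j \mid d_i)$ is 0 or 1), an unsmoothed class prior, and add-delta smoothing of $P(w \mid c)$ with `delta=1.0` giving Laplace smoothing. The function name and the `delta` parameter are illustrative, not taken from the slides.

```python
from collections import Counter

def train_multinomial_nb(docs, labels, delta=1.0):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = sorted({w for tokens in docs for w in tokens})
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)  # every occurrence counts, not just presence
    prior = {c: doc_counts[c] / len(docs) for c in classes}
    cond = {}
    for c in classes:
        # The normalization denominator is computed once per class.
        denom = delta * len(vocab) + sum(word_counts[c].values())
        cond[c] = {w: (delta + word_counts[c][w]) / denom for w in vocab}
    return vocab, prior, cond
```

Unlike the multivariate Bernoulli estimate, the numerator here grows with every repeated occurrence of a word, and each class's distribution over the vocabulary sums to 1.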

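## Testing

To classify a document $d_i$, compute:

$$\arg\max_c P(c)\, P(d_i \mid c) = \arg\max_c P(c) \prod_{t=1}^{|V|} P(w_t \mid c)^{N_{it}}$$

Terms that do not depend on the class, such as the document-length probability and the multinomial coefficient, drop out of the argmax.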
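The decision rule can be sketched in log space, where each exponent $N_{it}$ becomes a multiplier. The function name, the input layout (`prior[c]` and `cond[c][w]` as dictionaries), and the choice to skip out-of-vocabulary words are all assumptions made for illustration.

```python
import math
from collections import Counter

def classify_multinomial_nb(tokens, prior, cond):
    """Return (best_class, scores), where scores[c] = log P(c) + sum_t N_t * log P(w_t|c).
    prior[c] = P(c); cond[c][w] = P(w|c), assumed smoothed so it is never 0."""
    counts = Counter(tokens)           # N_t for the test document
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w, n in counts.items():
            if w in cond[c]:           # assumption: ignore out-of-vocabulary words
                score += n * math.log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Only words that occur in the test document contribute, so the cost is proportional to the document length rather than to $|V|$.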

## Two Naïve Bayes Models

- Multivariate Bernoulli event model: models binary presence/absence of word features
- Multinomial event model: models counts of word features (unigram models)
- In experiments on a range of text classification corpora, the multinomial model usually outperforms the multivariate Bernoulli model (McCallum & Nigam, 1998)

## Thinking about Performance

- Naïve Bayes: conditional independence assumption
  - Clearly unrealistic, but performance is often good. Why?
  - Classification is based on sign, not magnitude
  - The direction of classification is usually right
- Multivariate Bernoulli vs. multinomial
  - Why does multinomial perform better?
    - It captures additional information: presence/absence plus frequency
  - What if we wanted to include other types of features?
    - Multivariate Bernoulli: just another Bernoulli trial
    - Multinomial: can’t mix distributions

## Model Comparison

| | Multivariate Bernoulli | Multinomial Event |
|---|---|---|
| Features | Binary | # of occurrences |
| Trial | Each word in vocabulary | Each position in document |
| P(c) | Estimated from class frequencies in the training documents | Estimated from class frequencies in the training documents |
| P(w\|c) | Smoothed fraction of class-$c$ documents that contain $w$ | Smoothed fraction of word occurrences in class-$c$ documents that are $w$ |
| Testing | $\arg\max_c P(c) \prod_{w \in d_i} P(w \mid c) \prod_{w \notin d_i} (1 - P(w \mid c))$ | $\arg\max_c P(c) \prod_t P(w_t \mid c)^{N_{it}}$ |

## Naïve Bayes: Strengths

Advantages:

- Simplicity (conceptual)
- Training efficiency
- Testing efficiency
- Scales fairly well to large data
- Performs multiclass classification
- Can provide n-best outputs

## Naïve Bayes: Weaknesses

Disadvantages:

- Theoretical foundation is weak: ragingly inaccurate independence assumption
- Decent accuracy, but outperformed by more sophisticated methods

## HW#3

Naïve Bayes classification:

- Experiment with the Mallet Naïve Bayes learner
- Implement the multivariate Bernoulli event model
- Implement the multinomial event model
  - Compare with binary variables
- Analyze results

## Notes

- Use add-delta smoothing (vs. add-one)
- Beware numerical underflow
  - Log probabilities are your friend
  - Logs also convert exponents to multipliers
- Look out for repeated computation
  - Precompute normalization denominators
  - E.g., for multinomial $P(w \mid c)$, compute the denominator once for each $c$
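The underflow warning can be illustrated with a short Python sketch. The function names and the probability values in the usage note are hypothetical; the point is only that a long product of small probabilities collapses to zero in floating point, while the equivalent sum of logs stays finite.

```python
import math

def naive_product(probs):
    """Multiply probabilities directly; prone to underflow for long documents."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Sum log probabilities instead; the result stays finite and comparable."""
    return sum(math.log(p) for p in probs)

def log_score_from_counts(count_prob_pairs):
    """With counts, the exponent N in p**N becomes the multiplier in N * log(p)."""
    return sum(n * math.log(p) for n, p in count_prob_pairs)
```

For instance, with 100 word probabilities of `1e-5` each, `naive_product` returns `0.0` because the true value is far below the smallest positive double, whereas `log_score` returns a finite negative number that can still be compared across classes.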

## Efficiency

MVB (multivariate Bernoulli):
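One well-known way to make multivariate Bernoulli scoring cheaper is sketched below. It is offered as an assumption about the kind of optimization at issue, not as a reconstruction of the slide. The idea is to rewrite $\prod_{w \in d} P(w \mid c) \prod_{w \notin d} (1 - P(w \mid c))$ as $\prod_{w \in V} (1 - P(w \mid c)) \cdot \prod_{w \in d} \frac{P(w \mid c)}{1 - P(w \mid c)}$, so the product over the whole vocabulary is precomputed once per class. All function names are hypothetical.

```python
import math

def precompute_bernoulli(cond):
    """cond[c][w] = P(w|c), strictly between 0 and 1 (e.g. after smoothing).
    Returns the per-class baseline sum_w log(1 - P(w|c)) and per-word log-odds."""
    base, odds = {}, {}
    for c, pw in cond.items():
        base[c] = sum(math.log(1.0 - p) for p in pw.values())
        odds[c] = {w: math.log(p) - math.log(1.0 - p) for w, p in pw.items()}
    return base, odds

def fast_log_likelihood(tokens, c, base, odds):
    """log P(d|c), touching only the words present in the document."""
    return base[c] + sum(odds[c][w] for w in set(tokens) if w in odds[c])

def slow_log_likelihood(tokens, c, cond):
    """Reference version: loops over the entire vocabulary for every document."""
    present = set(tokens)
    return sum(math.log(p if w in present else 1.0 - p)
               for w, p in cond[c].items())
```

Both functions compute the same quantity; the fast version does work proportional to the number of distinct words in the document instead of $|V|$.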