
1 Information Retrieval (5) Prof. Dragomir R. Radev radev@umich.edu

2 IR Winter 2010 … 9. Text classification Naïve Bayesian classifiers Decision trees …

3 Introduction
Text classification: assigning documents to predefined categories (topics, languages, users)
– A given set of classes C
– Given x, determine its class in C
– Hierarchical vs. flat
– Overlapping (soft) vs. non-overlapping (hard)

4 Introduction
Ideas: manual classification using rules, e.g.:
– Columbia AND University → Education
– Columbia AND "South Carolina" → Geography
Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
– Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
– Discriminative: model p(y|x) directly
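Such rules are easy to prototype in code. A minimal sketch of the two rules above (the function name is mine; production rule-based classifiers maintain far larger hand-curated rule sets):

```python
# Minimal sketch of manual rule-based classification, mirroring the
# slide's two example rules; substring matching keeps it simple.
def classify_by_rules(text):
    t = text.lower()
    if "columbia" in t and "university" in t:
        return "Education"
    if "columbia" in t and "south carolina" in t:
        return "Geography"
    return "Unknown"

print(classify_by_rules("Columbia University admissions"))    # Education
print(classify_by_rules("Columbia, South Carolina weather"))  # Geography
```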

5 Bayes formula
Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
Full probability: P(x) = Σ_y' P(x|y') P(y'), giving P(y|x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')

6 Example (performance-enhancing drug)
Drug (D) with values y/n; Test (T) with values +/−
P(D=y) = 0.001
P(T=+|D=y) = 0.8
P(T=+|D=n) = 0.01
Given: an athlete tests positive
P(D=y|T=+) = P(T=+|D=y) P(D=y) / (P(T=+|D=y) P(D=y) + P(T=+|D=n) P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
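A quick way to check the arithmetic is to code Bayes' rule directly. A minimal sketch with the slide's numbers (variable names are mine):

```python
# Posterior probability of drug use given a positive test (slide 6 numbers).
p_d = 0.001       # P(D=y): prior probability of drug use
p_pos_d = 0.8     # P(T=+|D=y): test sensitivity
p_pos_nd = 0.01   # P(T=+|D=n): false positive rate

# Full probability of a positive test, then Bayes' rule.
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 3))  # 0.074: a positive test is still weak evidence
```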

7 Naïve Bayesian classifiers
Naïve Bayesian classifier: pick the class c* = argmax_c P(c) Π_i P(f_i|c)
Assumes statistical independence of the features given the class
Features are typically words (or phrases)

8 Example
p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
– p(sneeze|well) = 0.1
– p(sneeze|cold) = 0.9
– p(sneeze|allergy) = 0.9
– p(cough|well) = 0.1
– p(cough|cold) = 0.8
– p(cough|allergy) = 0.7
– p(fever|well) = 0.01
– p(fever|cold) = 0.7
– p(fever|allergy) = 0.4
Example from Ray Mooney

9 Example (cont'd)
Features: sneeze, cough, no fever (the absent feature contributes 1 − p(fever|class))
P(well|e) = (.9)(.1)(.1)(.99) / P(e) = 0.0089 / P(e)
P(cold|e) = (.05)(.9)(.8)(.3) / P(e) = 0.01 / P(e)
P(allergy|e) = (.05)(.9)(.7)(.6) / P(e) = 0.019 / P(e)
P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well|e) = .23, P(cold|e) = .26, P(allergy|e) = .50
Example from Ray Mooney
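The same computation in code. A minimal sketch of the slide's calculation; note that exact arithmetic gives posteriors of about .23/.28/.49, since the slide rounds the joint terms (0.0108 → 0.01, 0.0189 → 0.019) before normalizing:

```python
# Naive Bayes posterior for evidence e = {sneeze, cough, no fever}:
# multiply the prior by each feature's conditional, using
# 1 - P(fever|class) for the absent fever feature, then normalize.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(feature | class), from the previous slide
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

joint = {c: priors[c] * cond[c]["sneeze"] * cond[c]["cough"]
            * (1 - cond[c]["fever"]) for c in priors}
p_e = sum(joint.values())  # = 0.0386 with unrounded joints
posterior = {c: joint[c] / p_e for c in joint}
print({c: round(p, 2) for c, p in posterior.items()})
# {'well': 0.23, 'cold': 0.28, 'allergy': 0.49}
```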

10 Issues with NB
Where do we get the values? Use maximum likelihood estimation: P(c_i) = N_i / N
Same for the conditionals – these are based on a multinomial generator, and the MLE estimate is P(w_j|c_i) = T_ji / Σ_j' T_j'i
Smoothing is needed – why? An unseen word would get probability 0 and wipe out the whole product
Laplace smoothing: P(w_j|c_i) = (T_ji + 1) / (Σ_j' (T_j'i + 1)) = (T_ji + 1) / (Σ_j' T_j'i + |V|)
Implementation: how to avoid floating point underflow – sum log probabilities instead of multiplying probabilities
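A minimal end-to-end sketch tying these points together: MLE priors (N_i/N), Laplace-smoothed multinomial conditionals, and log-probabilities to dodge underflow. The toy data and function names are mine, not the lecture's:

```python
import math
from collections import Counter, defaultdict

# Toy training corpus: (document text, class label).
docs = [("buy cheap meds now", "spam"),
        ("meeting agenda attached", "ham"),
        ("cheap cheap offer buy", "spam"),
        ("project meeting notes", "ham")]

class_counts = Counter(label for _, label in docs)   # N_i
word_counts = defaultdict(Counter)                   # T_ji
for text, label in docs:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def log_posterior(text, c):
    # log P(c) + sum_j log P(w_j|c); summing logs avoids the underflow
    # that multiplying many small probabilities would cause.
    lp = math.log(class_counts[c] / len(docs))       # MLE prior N_i/N
    total = sum(word_counts[c].values())
    for w in text.split():
        # Laplace smoothing: (T_ji + 1) / (sum_j' T_j'i + |V|), so an
        # unseen word gets a small nonzero probability instead of 0.
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

print(max(class_counts, key=lambda c: log_posterior("cheap meds meeting", c)))
# spam
```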

11 Spam recognition
Return-Path:
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima"
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

12 SpamAssassin
http://spamassassin.apache.org/
http://spamassassin.apache.org/tests_3_1_x.html

13 Feature selection: The χ2 test
For a term t: C = class, I_t = feature indicator for t
Testing for independence: P(C=0, I_t=0) should equal P(C=0) P(I_t=0)
– P(C=0) = (k00 + k01)/n
– P(C=1) = 1 − P(C=0) = (k10 + k11)/n
– P(I_t=0) = (k00 + k10)/n
– P(I_t=1) = 1 − P(I_t=0) = (k01 + k11)/n
Contingency table:
        I_t=0   I_t=1
C=0     k00     k01
C=1     k10     k11

14 Feature selection: The χ2 test
High values of χ2 indicate lower belief in independence.
In practice, compute χ2 for all words and pick the top k among them.
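A minimal sketch computing the statistic from the contingency counts above, using the standard closed form for 2×2 tables (the example counts are mine):

```python
# Chi-square statistic for a term's 2x2 class/feature table, where
# k00..k11 follow the layout on the previous slide (rows = class C,
# columns = indicator I_t). Uses the 2x2 shortcut formula
# n(k00*k11 - k01*k10)^2 / ((k00+k01)(k10+k11)(k00+k10)(k01+k11)).
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k00 * k11 - k01 * k10) ** 2
    den = (k00 + k01) * (k10 + k11) * (k00 + k10) * (k01 + k11)
    return num / den

# Hypothetical term: occurs in 49 of 50 class-1 docs but 3 of 950 others.
print(round(chi_square(k00=947, k01=3, k10=1, k11=49), 1))
# 919.5: far above the independence threshold, so keep the term
```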

15 Feature selection: mutual information
I(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]
X = word; Y = class
No document length scaling is needed
Documents are assumed to be generated according to the multinomial model
Measures amount of information: if the distribution is the same as the background distribution, then MI = 0
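A minimal sketch of the MI computation from the same 2×2 counts used for χ2 (log base 2, so the score is in bits; the counts are hypothetical):

```python
import math

# Mutual information I(X;Y) = sum_x sum_y p(x,y) log[p(x,y)/(p(x)p(y))]
# between a word indicator and the class, from a 2x2 count table
# k[c][i] (rows = class, columns = word indicator).
def mutual_information(k):
    n = sum(sum(row) for row in k)
    mi = 0.0
    for c in (0, 1):
        for i in (0, 1):
            p_ci = k[c][i] / n
            p_c = sum(k[c]) / n
            p_i = (k[0][i] + k[1][i]) / n
            if p_ci > 0:
                mi += p_ci * math.log2(p_ci / (p_c * p_i))
    return mi

# A word distributed exactly like the background gives MI = 0 ...
print(mutual_information([[250, 250], [250, 250]]))       # 0.0
# ... while a word concentrated in one class scores high.
print(round(mutual_information([[947, 3], [1, 49]]), 3))  # 0.259
```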

16 Well-known datasets
20 newsgroups
– http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Reuters-21578
– http://www.daviddlewis.com/resources/testcollections/reuters21578/
– Cats: grain, acquisitions, corn, crude, wheat, trade, …
WebKB
– http://www-2.cs.cmu.edu/~webkb/
– Classes: course, student, faculty, staff, project, dept, other
– NB performance (2000), per class:
– P = 26, 43, 18, 6, 13, 2, 94
– R = 83, 75, 77, 9, 73, 100, 35

17 Evaluation of text classification
Microaveraging – pool the per-class contingency tables into one table, then compute the measure (dominated by frequent classes)
Macroaveraging – compute the measure separately for each class, then average over classes (treats all classes equally)
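The difference is easiest to see in code. A minimal sketch for precision with hypothetical per-class counts; recall and F1 follow the same pattern:

```python
# Micro- vs. macroaveraged precision from per-class (TP, FP) counts.
# Microaveraging pools the counts into one table before dividing, so
# frequent classes dominate; macroaveraging averages per-class scores,
# so every class counts equally. Toy numbers, not the WebKB results.
counts = {"grain": (80, 20),   # class -> (true positives, false positives)
          "corn":  (5, 15)}

macro_p = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)
tp_all = sum(tp for tp, _ in counts.values())
fp_all = sum(fp for _, fp in counts.values())
micro_p = tp_all / (tp_all + fp_all)
print(round(macro_p, 3), round(micro_p, 3))  # 0.525 0.708
```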

18 Readings: MRS18; MRS17, MRS19

