Information Retrieval Search Engine Technology (5&6), Prof. Dragomir R. Radev — Presentation transcript:

1 Information Retrieval Search Engine Technology (5&6) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev radev@umich.edu

2 Final projects
Two formats:
–A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
–A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
Deliverables:
–System (code + documentation + examples) or paper (+ code, data)
–Poster (to be presented in class)
–Web page that describes the project

3 SET/IR – W/S 2009
…
9. Text classification
–Naïve Bayesian classifiers
–Decision trees
…

4 Introduction
Text classification: assigning documents to predefined categories (topics, languages, users)
Given a set of classes C and a document x, determine the class of x in C
Hierarchical vs. flat
Overlapping (soft) vs. non-overlapping (hard)

5 Introduction
Ideas: manual classification using rules, e.g.,
–Columbia AND University → Education
–Columbia AND "South Carolina" → Geography
Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
–Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
–Discriminative: model p(y|x) directly

6 Bayes formula
P(A|B) = P(B|A) P(A) / P(B)
Full probability: P(B) = Σ_i P(B|A_i) P(A_i), so P(A|B) = P(B|A) P(A) / Σ_i P(B|A_i) P(A_i)

7 Example (performance-enhancing drug)
Drug (D) with values y/n; test (T) with values +/-
P(D=y) = 0.001
P(T=+|D=y) = 0.8
P(T=+|D=n) = 0.01
Given: an athlete tests positive.
P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 x 0.001)/(0.8 x 0.001 + 0.01 x 0.999) ≈ 0.074
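A quick check of this computation in Python (a minimal sketch; the variable names are mine):

# Bayes' rule for the drug-test example
p_d = 0.001               # prior: P(D=y)
p_pos_given_d = 0.8       # P(T=+|D=y)
p_pos_given_not_d = 0.01  # P(T=+|D=n)
# full probability: P(T=+) = P(+|y)P(y) + P(+|n)P(n)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
print(p_pos_given_d * p_d / p_pos)  # ~0.074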

8 Naïve Bayesian classifiers
Naïve Bayesian classifier: choose the class c that maximizes P(c|d) ∝ P(c) Π_i P(w_i|c)
Assuming statistical independence of the features
Features = words (or phrases), typically

9 Example
p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
–p(sneeze|well) = 0.1, p(sneeze|cold) = 0.9, p(sneeze|allergy) = 0.9
–p(cough|well) = 0.1, p(cough|cold) = 0.8, p(cough|allergy) = 0.7
–p(fever|well) = 0.01, p(fever|cold) = 0.7, p(fever|allergy) = 0.4
Example from Ray Mooney

10 Example (cont'd)
Features: sneeze, cough, no fever
P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e)
P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e)
P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e)
p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well|e) ≈ .23, P(cold|e) ≈ .26, P(allergy|e) ≈ .50
Example from Ray Mooney
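The same arithmetic in Python (a sketch; the slide rounds intermediate values, so full precision gives slightly different posteriors, roughly .23/.28/.49):

priors = {'well': 0.9, 'cold': 0.05, 'allergy': 0.05}
cond = {  # p(symptom | class), from the previous slide
    'sneeze': {'well': 0.1, 'cold': 0.9, 'allergy': 0.9},
    'cough':  {'well': 0.1, 'cold': 0.8, 'allergy': 0.7},
    'fever':  {'well': 0.01, 'cold': 0.7, 'allergy': 0.4},
}
# evidence e = (sneeze, cough, no fever): use 1 - p(fever|c) for "no fever"
joint = {c: priors[c] * cond['sneeze'][c] * cond['cough'][c] * (1 - cond['fever'][c])
         for c in priors}
p_e = sum(joint.values())
print({c: round(p / p_e, 2) for c, p in joint.items()})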

11 Issues with NB
Where do we get the prior values? Use maximum likelihood estimation: P(c_i) = N_i/N
Same for the conditionals: these are based on a multinomial generator and the MLE estimator is P(w_j|c_i) = T_ji / Σ_j T_ji
Smoothing is needed: a word unseen in a class would otherwise zero out the whole product. Laplace smoothing: (T_ji + 1) / (Σ_j T_ji + |V|)
Implementation: sum log probabilities to avoid floating point underflow
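A minimal multinomial NB sketch showing both fixes, Laplace smoothing at training time and log-space scoring at classification time (the function names and data layout are mine, not code from the lecture):

import math
from collections import Counter

def train_nb(docs_by_class, vocab):
    """docs_by_class: {class: list of token lists}."""
    logprior, loglik = {}, {}
    n_docs = sum(len(d) for d in docs_by_class.values())
    for c, docs in docs_by_class.items():
        logprior[c] = math.log(len(docs) / n_docs)   # log N_c / N
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        # Laplace smoothing: (T_ct + 1) / (sum_t' T_ct' + |V|)
        loglik[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                     for t in vocab}
    return logprior, loglik

def classify(tokens, logprior, loglik):
    # summing logs instead of multiplying probabilities avoids underflow
    return max(logprior, key=lambda c: logprior[c] +
               sum(loglik[c][t] for t in tokens if t in loglik[c]))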

12 Spam recognition
Return-Path:
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima"
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

13 SpamAssassin
http://spamassassin.apache.org/
http://spamassassin.apache.org/tests_3_1_x.html

14 Feature selection: the χ2 test
For a term t: C = class, I_t = feature; counts k_ij from n documents:
         I_t=0   I_t=1
C=0      k_00    k_01
C=1      k_10    k_11
Testing for independence: P(C=0, I_t=0) should equal P(C=0) P(I_t=0)
–P(C=0) = (k_00 + k_01)/n
–P(C=1) = 1 - P(C=0) = (k_10 + k_11)/n
–P(I_t=0) = (k_00 + k_10)/n
–P(I_t=1) = 1 - P(I_t=0) = (k_01 + k_11)/n

15 Feature selection: the χ2 test
χ2 = Σ_ij (k_ij - E_ij)^2 / E_ij, where E_ij is the count expected under independence
High values of χ2 indicate lower belief in independence.
In practice, compute χ2 for all words and pick the top k among them.
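Computed directly from the 2x2 counts above (a sketch; k00..k11 as in the table):

def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for independence of class C and feature I_t."""
    n = k00 + k01 + k10 + k11
    rows = (k00 + k01, k10 + k11)   # class marginals
    cols = (k00 + k10, k01 + k11)   # feature marginals
    observed = ((k00, k01), (k10, k11))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # count expected under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2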

16 Feature selection: mutual information
MI(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]
No document length scaling is needed
Documents are assumed to be generated according to the multinomial model
Measures amount of information: if the word's distribution is the same as the background distribution, then MI = 0
X = word; Y = class
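The same 2x2 counts give the MI score (a sketch; log base 2, with the 0 log 0 = 0 convention):

import math

def mutual_information(k00, k01, k10, k11):
    """MI between word occurrence and class; zero under independence."""
    n = k00 + k01 + k10 + k11
    rows = (k00 + k01, k10 + k11)
    cols = (k00 + k10, k01 + k11)
    mi = 0.0
    for (i, j), k in {(0, 0): k00, (0, 1): k01,
                      (1, 0): k10, (1, 1): k11}.items():
        if k:  # skip zero cells: 0 * log 0 = 0 by convention
            mi += (k / n) * math.log2(k * n / (rows[i] * cols[j]))
    return mi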

17 Well-known datasets
20 newsgroups
–http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Reuters-21578
–http://www.daviddlewis.com/resources/testcollections/reuters21578/
–Categories: grain, acquisitions, corn, crude, wheat, trade…
WebKB
–http://www-2.cs.cmu.edu/~webkb/
–Classes: course, student, faculty, staff, project, dept, other
–NB performance (2000):
P = 26, 43, 18, 6, 13, 2, 94
R = 83, 75, 77, 9, 73, 100, 35

18 Evaluation of text classification
Macroaveraging – average the per-class measures over classes
Microaveraging – compute one measure from the pooled contingency table
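A sketch of the difference for F1, given per-class (TP, FP, FN) counts (the helper and data layout are mine):

def micro_macro_f1(tables):
    """tables: one (tp, fp, fn) tuple per class."""
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(f1(*t) for t in tables) / len(tables)  # average per-class scores
    tp, fp, fn = map(sum, zip(*tables))                # pool counts, score once
    return f1(tp, fp, fn), macro                       # (micro, macro)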

19 Vector space classification
[Figure: documents from topic1 and topic2 plotted in a two-dimensional feature space (x1, x2)]

20 Decision surfaces
[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) space]

21 Decision trees
[Figure: axis-parallel decision boundaries induced by a decision tree over (x1, x2)]

22 Classification using decision trees
Expected information: I(s_1, s_2, …, s_m) = -Σ_i p_i log2(p_i)
s = number of data samples, m = number of classes, p_i = s_i/s

23 Training data:
RID  Age     Income  Student  Credit     Buys?
1    <=30    High    No       Fair       No
2    <=30    High    No       Excellent  No
3    31..40  High    No       Fair       Yes
4    >40     Medium  No       Fair       Yes
5    >40     Low     Yes      Fair       Yes
6    >40     Low     Yes      Excellent  No
7    31..40  Low     Yes      Excellent  Yes
8    <=30    Medium  No       Fair       No
9    <=30    Low     Yes      Fair       Yes
10   >40     Medium  Yes      Fair       Yes
11   <=30    Medium  Yes      Excellent  Yes
12   31..40  Medium  No       Excellent  Yes
13   31..40  High    Yes      Fair       Yes
14   >40     Medium  No       Excellent  No

24 Decision tree induction
I(s_1, s_2) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

25 Entropy and information gain
E(A) = Σ_j [(s_1j + … + s_mj)/s] I(s_1j, …, s_mj)
Entropy = expected information based on the partitioning into subsets by attribute A
Gain(A) = I(s_1, s_2, …, s_m) - E(A)

26 Entropy
Age <= 30: s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
Age in 31..40: s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
Age > 40: s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971

27 Entropy (cont'd)
E(age) = 5/14 I(s_11, s_21) + 4/14 I(s_12, s_22) + 5/14 I(s_13, s_23) = 0.694
Gain(age) = I(s_1, s_2) - E(age) = 0.940 - 0.694 = 0.246
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
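The gain computation for the age attribute, in Python (a sketch; full-precision arithmetic gives 0.247 where the slide's rounded intermediates give 0.246):

import math

def info(*counts):
    # expected information I(s_1,...,s_m) = -sum p_i log2 p_i
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

partitions = [(2, 3), (4, 0), (3, 2)]  # (yes, no) counts per age range
s = sum(sum(p) for p in partitions)    # 14 samples
e_age = sum(sum(p) / s * info(*p) for p in partitions)
print(round(info(9, 5) - e_age, 3))    # gain(age) ~0.247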

28 Final decision tree
age?
–<= 30: split on student (no → no, yes → yes)
–31..40: yes
–> 40: split on credit (excellent → no, fair → yes)

29 Other techniques
Bayesian classifiers
X: age <= 30, income = medium, student = yes, credit = fair
P(yes) = 9/14 = 0.643
P(no) = 5/14 = 0.357

30 Example
P(age <= 30 | yes) = 2/9 = 0.222
P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444
P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667
P(student = yes | no) = 1/5 = 0.200
P(credit = fair | yes) = 6/9 = 0.667
P(credit = fair | no) = 2/5 = 0.400

31 Example (cont'd)
P(X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | yes) P(yes) = 0.044 x 0.643 = 0.028
P(X | no) P(no) = 0.019 x 0.357 = 0.007
Answer: yes (0.028 > 0.007)
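The whole example end to end, using the table from slide 23 (a sketch; the rows are transcribed from the slide):

data = [  # (age, income, student, credit, buys)
    ('<=30', 'high', 'no', 'fair', 'no'), ('<=30', 'high', 'no', 'excellent', 'no'),
    ('31..40', 'high', 'no', 'fair', 'yes'), ('>40', 'medium', 'no', 'fair', 'yes'),
    ('>40', 'low', 'yes', 'fair', 'yes'), ('>40', 'low', 'yes', 'excellent', 'no'),
    ('31..40', 'low', 'yes', 'excellent', 'yes'), ('<=30', 'medium', 'no', 'fair', 'no'),
    ('<=30', 'low', 'yes', 'fair', 'yes'), ('>40', 'medium', 'yes', 'fair', 'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'), ('31..40', 'medium', 'no', 'excellent', 'yes'),
    ('31..40', 'high', 'yes', 'fair', 'yes'), ('>40', 'medium', 'no', 'excellent', 'no'),
]
x = ('<=30', 'medium', 'yes', 'fair')

scores = {}
for label in ('yes', 'no'):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                  # prior P(label)
    for i, value in enumerate(x):                  # times P(attr = value | label)
        score *= sum(r[i] == value for r in rows) / len(rows)
    scores[label] = score
print(scores)  # yes ~0.028, no ~0.007 -> predict yes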

32 SET/IR – W/S 2009
…
10. Linear classifiers
–Kernel methods
–Support vector machines
…

33 Linear boundary
[Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) space]

34 Vector space classifiers
Using centroids
Boundary = the line equidistant from the two centroids (the perpendicular bisector of the segment connecting them)

35 Generative models: kNN
K-nearest neighbors: assign each element to the majority class among its k closest training examples
Very easy to program
Tessellation; nonlinearity
Issues: choosing k, b?
Demo:
–http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

36 Linear separators
Two-dimensional line: w_1 x_1 + w_2 x_2 = b is the linear separator
w_1 x_1 + w_2 x_2 > b for the positive class
In n-dimensional spaces: Σ_i w_i x_i = b, i.e., w · x = b

37 Example 1
[Figure: two linearly separable classes, topic1 and topic2, in the (x1, x2) space]

38 Example 2
Classifier for "interest" in Reuters-21578, b = 0:
 w_i    x_i (term)    w_i    x_i (term)
 0.70   prime        -0.71   dlrs
 0.67   rate         -0.35   world
 0.63   interest     -0.33   sees
 0.60   rates        -0.25   year
 0.46   discount     -0.24   group
 0.43   bundesbank   -0.24   dlr
If the document is "rate discount dlrs world", its score will be 0.67·1 + 0.46·1 + (-0.71)·1 + (-0.35)·1 = 0.07 > 0
Example from MSR
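Scoring the example document with these weights (a sketch):

weights = {'prime': 0.70, 'rate': 0.67, 'interest': 0.63, 'rates': 0.60,
           'discount': 0.46, 'bundesbank': 0.43, 'dlrs': -0.71, 'world': -0.35,
           'sees': -0.33, 'year': -0.25, 'group': -0.24, 'dlr': -0.24}
b = 0
# sum the weights of the terms present in the document
score = sum(weights.get(t, 0.0) for t in 'rate discount dlrs world'.split())
print(round(score, 2), score > b)  # 0.07 True -> assign to "interest"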

39 Example: perceptron algorithm
Input: training examples (x_i, y_i) with y_i ∈ {-1, +1}
Algorithm: start with w = 0; for each misclassified example, i.e., y_i (w · x_i) <= 0, update w ← w + η y_i x_i; repeat until no errors remain
Output: the weight vector w of a separating hyperplane (if the data are separable)
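A sketch of the algorithm in NumPy (labels in {-1, +1}; the learning rate eta and epoch cap are my additions):

import numpy as np

def perceptron(X, y, epochs=100, eta=1.0):
    """X: (n, d) array of examples, y: labels in {-1, +1}.
    Converges if the data are linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # misclassified
                w += eta * yi * xi       # rotate hyperplane toward xi
                b += eta * yi
                errors += 1
        if errors == 0:                  # converged: all examples correct
            break
    return w, b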

40 [Slide from Chris Bishop]

41 Linear classifiers
What is the major shortcoming of a perceptron?
How to determine the dimensionality of the separator?
–Bias-variance tradeoff (example)
How to deal with multiple classes?
–Any-of: build a separate classifier for each class
–One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring

42 Support vector machines
Introduced by Vapnik in the early 90s.
Find the separator that maximizes the margin: minimize ||w||^2 / 2 subject to y_i (w · x_i + b) >= 1 for all i

43 Issues with SVM
Soft margins (handle inseparable data via slack variables)
Kernels (introduce non-linearity)

44 The kernel idea
[Figure: the same data before and after mapping into a higher-dimensional feature space, where they become linearly separable]

45 Example (mapping to a higher-dimensional space)
A classic example: φ(x_1, x_2) = (x_1^2, √2 x_1 x_2, x_2^2), for which φ(x) · φ(z) = (x · z)^2, so the dot product in the new space can be computed without constructing it

46 The kernel trick
Polynomial kernel: K(x, z) = (x · z + 1)^d
Sigmoid kernel: K(x, z) = tanh(κ (x · z) + c)
RBF kernel: K(x, z) = exp(-||x - z||^2 / (2σ^2))
Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
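The three kernels as functions (a sketch; the parameter defaults are mine):

import numpy as np

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d                # (x . z + 1)^d

def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
    return np.tanh(kappa * (x @ z) + c)    # tanh(kappa x . z + c)

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))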

47 SVM (cont'd)
Evaluation:
–SVM > kNN > decision tree > NB
Implementation:
–Quadratic optimization
–Use a toolkit (e.g., Thorsten Joachims's SVMlight)

48 Semi-supervised learning EM Co-training Graph-based

49 Exploiting hyperlinks: co-training
Each document instance has two alternate views (Blum and Mitchell 1998):
–terms in the document, x1
–terms in the hyperlinks that point to the document, x2
Each view alone is sufficient to determine the class of the instance:
–the labeling function that classifies examples is the same whether applied to x1 or x2
–x1 and x2 are conditionally independent, given the class
[Slide from Pierre Baldi]

50 Co-training algorithm
Labeled data are used to infer two Naïve Bayes classifiers, one for each view
Each classifier will:
–examine the unlabeled data
–pick its most confidently predicted positive and negative examples
–add these to the labeled examples
The classifiers are then retrained on the augmented set of labeled examples (see the sketch below)
[Slide from Pierre Baldi]
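A sketch of the loop in scikit-learn style (the fit/predict_proba interface, binary labels in {0, 1}, and the one-pick-per-view simplification are my assumptions; Blum and Mitchell grow the labeled set with p positives and n negatives per classifier):

import numpy as np

def co_train(clf1, clf2, X1, X2, y, U1, U2, rounds=10):
    """(X1, X2): two views of the labeled data with labels y in {0, 1};
    (U1, U2): the same two views of the unlabeled pool."""
    X1, X2 = np.asarray(X1), np.asarray(X2)
    U1, U2, y = np.asarray(U1), np.asarray(U2), np.asarray(y)
    for _ in range(rounds):
        if len(U1) == 0:
            break
        clf1.fit(X1, y)                           # retrain each view's classifier
        clf2.fit(X2, y)
        chosen = {}                               # pool index -> predicted label
        for clf, U in ((clf1, U1), (clf2, U2)):
            proba = clf.predict_proba(U)
            i = int(proba.max(axis=1).argmax())   # most confident example
            chosen[i] = int(proba[i].argmax())    # assumes classes_ == [0, 1]
        idx = list(chosen)
        # move the picked examples (both views!) into the labeled set
        X1 = np.vstack([X1, U1[idx]])
        X2 = np.vstack([X2, U2[idx]])
        y = np.concatenate([y, [chosen[i] for i in idx]])
        keep = [j for j in range(len(U1)) if j not in chosen]
        U1, U2 = U1[keep], U2[keep]
    return clf1, clf2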

51 Conclusion
SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
NB is also good in many circumstances

52 Readings: MRS18; MRS17, MRS19

