1
Information Retrieval: Search Engine Technology (5&6) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev radev@umich.edu
2
Final projects
Two formats:
– A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
– A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
Deliverables:
– System (code + documentation + examples) or paper (+ code, data)
– Poster (to be presented in class)
– Web page that describes the project
3
SET/IR – W/S 2009 … 9. Text classification Naïve Bayesian classifiers Decision trees …
4
Introduction
Text classification: assigning documents to predefined categories (topics, languages, users). Given a set of classes C and an object x, determine the class of x in C.
– Hierarchical vs. flat
– Overlapping (soft) vs. non-overlapping (hard)
5
Introduction
Ideas: manual classification using rules, e.g.:
– Columbia AND University → Education
– Columbia AND “South Carolina” → Geography
Popular techniques: generative (Naïve Bayes) vs. discriminative (kNN, SVM, regression). Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x). Discriminative: model p(y|x) directly.
6
Bayes' formula: P(A|B) = P(B|A) P(A) / P(B)
Full (total) probability: P(B) = Σ_i P(B|A_i) P(A_i)
7
Example (performance-enhancing drug). Drug (D) with values y/n; test (T) with values +/−. P(D=y) = 0.001, P(T=+|D=y) = 0.8, P(T=+|D=n) = 0.01. Given: an athlete tests positive. P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
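The computation above can be sketched in a few lines (variable names are my own; the numbers are the slide's):

```python
# Bayes' rule for the drug-test example: P(D=y | T=+).
p_d = 0.001             # prior: athlete uses the drug
p_pos_given_d = 0.8     # test sensitivity, P(T=+ | D=y)
p_pos_given_not = 0.01  # false-positive rate, P(T=+ | D=n)

numerator = p_pos_given_d * p_d
evidence = numerator + p_pos_given_not * (1 - p_d)  # full probability of T=+
posterior = numerator / evidence                    # ≈ 0.074
```

Despite the positive test, the posterior stays low because the prior P(D=y) is tiny relative to the false-positive rate.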
8
Naïve Bayesian classifiers. Assuming statistical independence of the features: c* = argmax_c P(c) Π_i P(f_i|c). Features = words (or phrases) typically.
9
Example
p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
– p(sneeze|well) = 0.1
– p(sneeze|cold) = 0.9
– p(sneeze|allergy) = 0.9
– p(cough|well) = 0.1
– p(cough|cold) = 0.8
– p(cough|allergy) = 0.7
– p(fever|well) = 0.01
– p(fever|cold) = 0.7
– p(fever|allergy) = 0.4
Example from Ray Mooney
10
Example (cont’d)
Features: sneeze, cough, no fever
P(well|e) = (.9)(.1)(.1)(.99) / P(e) = 0.0089/P(e)
P(cold|e) = (.05)(.9)(.8)(.3) / P(e) = 0.01/P(e)
P(allergy|e) = (.05)(.9)(.7)(.6) / P(e) = 0.019/P(e)
P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well|e) = .23, P(cold|e) = .26, P(allergy|e) = .50
Example from Ray Mooney
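The same computation, written out (the probability tables are the slide's; the structure is a minimal sketch):

```python
# Naive Bayes scoring for the diagnosis example: evidence e =
# (sneeze=yes, cough=yes, fever=no), assuming feature independence.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

def joint(c):
    # P(c) * P(sneeze|c) * P(cough|c) * P(no fever|c)
    return priors[c] * cond[c]["sneeze"] * cond[c]["cough"] * (1 - cond[c]["fever"])

scores = {c: joint(c) for c in priors}
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}
best = max(posteriors, key=posteriors.get)  # "allergy"
```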
11
Issues with NB
Where do the values come from? Use maximum likelihood estimation for the priors: P(c_j) = N_j/N. Same for the conditionals: these are based on a multinomial generator, and the MLE estimator is P(w_i|c_j) = T_ji / Σ_i T_ji. Smoothing is needed (why? a single unseen word would otherwise zero out the whole product). Laplace smoothing: P(w_i|c_j) = (T_ji + 1) / (Σ_i T_ji + |V|). Implementation: avoid floating-point underflow by summing log probabilities instead of multiplying probabilities.
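A minimal sketch of both implementation points (Laplace smoothing and log-space scoring); the two-class toy vocabulary is invented for illustration:

```python
import math
from collections import Counter

def train_class(token_counts, vocab):
    # Laplace-smoothed multinomial: P(w|c) = (T_cw + 1) / (sum_w' T_cw' + |V|)
    total = sum(token_counts.values())
    return {w: (token_counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}

def log_score(doc, prior, cond_probs):
    # Summing logs avoids the underflow of multiplying many small probabilities.
    return math.log(prior) + sum(math.log(cond_probs[w]) for w in doc if w in cond_probs)

vocab = {"ball", "game", "election", "vote"}
sports = train_class(Counter({"ball": 3, "game": 2}), vocab)
politics = train_class(Counter({"election": 2, "vote": 2}), vocab)
doc = ["ball", "game", "game"]
pred = "sports" if log_score(doc, 0.5, sports) > log_score(doc, 0.5, politics) else "politics"
```

Note that thanks to smoothing, every word in the vocabulary gets a nonzero conditional probability in both classes.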
12
Spam recognition Return-Path: X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
13
SpamAssassin
http://spamassassin.apache.org/
http://spamassassin.apache.org/tests_3_1_x.html
14
Feature selection: the χ² test
For a term t: C = class, I_t = term indicator. Testing for independence: P(C=0, I_t=0) should equal P(C=0) P(I_t=0).
– P(C=0) = (k_00 + k_01)/n
– P(C=1) = 1 − P(C=0) = (k_10 + k_11)/n
– P(I_t=0) = (k_00 + k_10)/n
– P(I_t=1) = 1 − P(I_t=0) = (k_01 + k_11)/n
Contingency table:
        I_t=0   I_t=1
C=0     k_00    k_01
C=1     k_10    k_11
15
Feature selection: the χ² test
High values of χ² indicate lower belief in independence. In practice, compute χ² for all words and pick the top k among them.
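The per-term statistic can be computed directly from the 2×2 counts; a sketch (cell names k_00…k_11 as in the table above):

```python
def chi_square(k00, k01, k10, k11):
    # Chi-square statistic for one term's 2x2 contingency table
    # (rows: class C = 0/1, columns: indicator I_t = 0/1).
    n = k00 + k01 + k10 + k11
    score = 0.0
    for obs, row, col in [(k00, k00 + k01, k00 + k10),
                          (k01, k00 + k01, k01 + k11),
                          (k10, k10 + k11, k00 + k10),
                          (k11, k10 + k11, k01 + k11)]:
        expected = row * col / n  # expected count under independence
        score += (obs - expected) ** 2 / expected
    return score
```

To select features, compute this score for every term in the vocabulary and keep the top k.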
16
Feature selection: mutual information
No document-length scaling is needed. Documents are assumed to be generated according to the multinomial model. Measures the amount of information: if the word's distribution is the same as the background distribution, then MI = 0. X = word, Y = class: I(X;Y) = Σ_x Σ_y P(x,y) log [ P(x,y) / (P(x)P(y)) ]
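The MI formula above can be sketched directly from a joint distribution table (the table layout is my own convention):

```python
import math

def mutual_information(joint):
    # joint[x][y] = P(X=x, Y=y); returns I(X;Y) in bits.
    px = {x: sum(row.values()) for x, row in joint.items()}
    py = {}
    for row in joint.values():
        for y, p in row.items():
            py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for x, row in joint.items():
        for y, p in row.items():
            if p > 0:  # 0 * log 0 is taken as 0
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi
```

When X and Y are independent, every ratio inside the log is 1, so MI = 0, matching the statement above.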
17
Well-known datasets
20 Newsgroups
– http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Reuters-21578
– http://www.daviddlewis.com/resources/testcollections/reuters21578/
– Categories: grain, acquisitions, corn, crude, wheat, trade…
WebKB
– http://www-2.cs.cmu.edu/~webkb/
– Classes: course, student, faculty, staff, project, dept, other
– NB performance (2000), per class:
  P = 26, 43, 18, 6, 13, 2, 94
  R = 83, 75, 77, 9, 73, 100, 35
18
Evaluation of text classification
Macroaveraging: average the per-class measures over classes. Microaveraging: pool the per-class contingency tables and compute the measures from the pooled table.
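The difference is easiest to see in code; a sketch using precision, with hypothetical per-class (tp, fp) counts for one frequent and one rare class:

```python
def macro_precision(tables):
    # Average of per-class precision values.
    return sum(tp / (tp + fp) for tp, fp in tables) / len(tables)

def micro_precision(tables):
    # Precision computed from the pooled contingency table.
    tp_sum = sum(tp for tp, _ in tables)
    fp_sum = sum(fp for _, fp in tables)
    return tp_sum / (tp_sum + fp_sum)

tables = [(90, 10), (1, 9)]  # hypothetical: a frequent class and a rare class
macro = macro_precision(tables)  # (0.9 + 0.1) / 2 = 0.5
micro = micro_precision(tables)  # 91 / 110 ≈ 0.83
```

Microaveraging is dominated by the large classes, macroaveraging weighs every class equally; here they disagree sharply.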
19
Vector space classification [Figure: documents from topic1 and topic2 plotted in the (x1, x2) plane]
20
Decision surfaces [Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]
21
Decision trees [Figure: axis-parallel splits separating topic1 from topic2 in the (x1, x2) plane]
22
Classification using decision trees
Expected information need: I(s_1, s_2, …, s_m) = −Σ_i p_i log(p_i), where s_i = the number of data samples in class i, s = the total number of samples, p_i = s_i/s, and m = the number of classes.
23
RID  Age     Income  Student  Credit     Buys?
1    <=30    High    No       Fair       No
2    <=30    High    No       Excellent  No
3    31..40  High    No       Fair       Yes
4    >40     Medium  No       Fair       Yes
5    >40     Low     Yes      Fair       Yes
6    >40     Low     Yes      Excellent  No
7    31..40  Low     Yes      Excellent  Yes
8    <=30    Medium  No       Fair       No
9    <=30    Low     Yes      Fair       Yes
10   >40     Medium  Yes      Fair       Yes
11   <=30    Medium  Yes      Excellent  Yes
12   31..40  Medium  No       Excellent  Yes
13   31..40  High    Yes      Fair       Yes
14   >40     Medium  No       Excellent  No
24
Decision tree induction
I(s_1, s_2) = I(9,5) = −9/14 log(9/14) − 5/14 log(5/14) = 0.940
25
Entropy and information gain
E(A) = Σ_j [(s_1j + … + s_mj)/s] · I(s_1j, …, s_mj)
Entropy = the expected information based on the partitioning into subsets by attribute A.
Gain(A) = I(s_1, s_2, …, s_m) − E(A)
26
Entropy
Age <= 30: s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
Age in 31..40: s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
Age > 40: s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971
27
Entropy (cont’d)
E(age) = 5/14 I(s_11, s_21) + 4/14 I(s_12, s_22) + 5/14 I(s_13, s_23) = 0.694
Gain(age) = I(s_1, s_2) − E(age) = 0.246
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
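These numbers can be reproduced from the 14-row training table; a sketch (attribute order: age, income, student, credit; the last field is the class):

```python
import math

data = [  # (age, income, student, credit, buys)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def entropy(rows):
    # I(s_1, ..., s_m) over the class labels in the last field.
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr_index):
    # Gain(A) = I(...) - E(A), splitting on attribute attr_index.
    values = {r[attr_index] for r in rows}
    expected = sum(
        len(subset) / len(rows) * entropy(subset)
        for v in values
        for subset in [[r for r in rows if r[attr_index] == v]]
    )
    return entropy(rows) - expected
```

Age has the highest gain, so it becomes the root of the tree.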
28
Final decision tree:
age?
– <= 30 → student?
  – no → no
  – yes → yes
– 31..40 → yes
– > 40 → credit?
  – fair → yes
  – excellent → no
29
Other techniques: Bayesian classifiers
X: age <= 30, income = medium, student = yes, credit = fair
P(yes) = 9/14 = 0.643; P(no) = 5/14 = 0.357
30
Example
P(age <= 30 | yes) = 2/9 = 0.222
P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444
P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667
P(student = yes | no) = 1/5 = 0.200
P(credit = fair | yes) = 6/9 = 0.667
P(credit = fair | no) = 2/5 = 0.400
31
Example (cont’d)
P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
P(X | no) P(no) = 0.019 × 0.357 = 0.007
Answer: yes (0.028 > 0.007)
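The computation written out (the fractions are the conditional counts from the 14-row table):

```python
# Naive Bayes prediction for X = (age<=30, income=medium, student=yes,
# credit=fair) over the 14-row training table.
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)  # P(X | yes) ≈ 0.044
p_x_no = (3/5) * (2/5) * (1/5) * (2/5)   # P(X | no)  ≈ 0.019
score_yes = p_x_yes * (9/14)             # ≈ 0.028
score_no = p_x_no * (5/14)               # ≈ 0.007
answer = "yes" if score_yes > score_no else "no"
```

Since 0.028 > 0.007, the classifier predicts that this customer buys.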
32
SET/IR – W/S 2009 … 10. Linear classifiers Kernel methods Support vector machines …
33
Linear boundary [Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) plane]
34
Vector space classifiers
Using centroids: the boundary is the line that is equidistant from the two class centroids.
35
Instance-based models: kNN
Assign each element to the class of its nearest neighbors (k-nearest neighbors). Very easy to program. Decision boundary: a tessellation; nonlinear. Issues: choosing k, b? Demo:
– http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
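A minimal kNN sketch (Euclidean distance, majority vote; the toy training points are invented), illustrating how little code the method needs:

```python
from collections import Counter

def knn_predict(train, x, k=3):
    # train: list of (vector, label) pairs; classify x by majority vote
    # among its k nearest neighbors (squared Euclidean distance).
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "topic1"), ((0, 1), "topic1"), ((1, 0), "topic1"),
         ((5, 5), "topic2"), ((5, 6), "topic2"), ((6, 5), "topic2")]
```

All the work happens at query time; there is no training step beyond storing the examples.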
36
Linear separators
Two-dimensional case: the line w_1 x_1 + w_2 x_2 = b is the linear separator; w_1 x_1 + w_2 x_2 > b for the positive class. In n-dimensional spaces: w · x = b, i.e., Σ_i w_i x_i = b.
37
Example 1 [Figure: a linear separator between topic1 and topic2 in the (x1, x2) plane]
38
Example 2: a classifier for “interest” in Reuters-21578 (b = 0). Weights:
w_i      x_i           w_i      x_i
0.70     prime         −0.71    dlrs
0.67     rate          −0.35    world
0.63     interest      −0.33    sees
0.60     rates         −0.25    year
0.46     discount      −0.24    group
0.43     bundesbank    −0.24    dlr
If the document is “rate discount dlrs world”, its score will be 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0
Example from MSR
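The scoring step can be sketched as follows (the weights are those listed above; the document is assumed to contain “dlrs”, whose weight −0.71 matches the arithmetic shown):

```python
# Scoring a document against the "interest" weight vector, with b = 0.
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(doc_terms, w, b=0.0):
    # Linear classifier: sum the weights of the terms present, compare to b.
    return sum(w.get(t, 0.0) for t in doc_terms) - b

s = score(["rate", "discount", "dlrs", "world"], weights)  # ≈ 0.07
label = "interest" if s > 0 else "not interest"
```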
39
Example: the perceptron algorithm
Input: a training set {(x_i, y_i)} with y_i ∈ {−1, +1}
Algorithm: start with w = 0, b = 0; for each misclassified example, i.e., y_i (w · x_i + b) <= 0, update w ← w + y_i x_i and b ← b + y_i; repeat until no mistakes are made
Output: the weights (w, b) of a separating hyperplane (if the data are linearly separable)
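A sketch of the standard perceptron update rule (the four training points are invented, and linearly separable):

```python
def train_perceptron(data, n_features, epochs=100):
    # data: list of (x, y) pairs with y in {-1, +1}.
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # converged: every point is classified correctly
            break
    return w, b

data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), 1), ((2.0, 3.0), 1)]
w, b = train_perceptron(data, 2)
```

On separable data the loop terminates with a separating hyperplane; on inseparable data it simply runs out of epochs, which is one of the method's shortcomings discussed on the next slide.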
40
[Slide from Chris Bishop]
41
Linear classifiers
What is the major shortcoming of a perceptron? How to determine the dimensionality of the separator?
– Bias-variance tradeoff (example)
How to deal with multiple classes?
– Any-of: build a separate classifier for each class
– One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring
42
Support vector machines
Introduced by Vapnik in the early 1990s.
43
Issues with SVM Soft margins (inseparability) Kernels – non-linearity
44
The kernel idea [Figure: the data before and after mapping into the feature space]
45
Example (mapping to a higher-dimensional space)
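The slide's figure is not reproduced here; a standard example of such a mapping (an assumption on my part, not necessarily the one on the original slide) is the quadratic feature map, whose inner product equals a polynomial kernel:

```latex
\Phi(x_1, x_2) = \left( x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2 \right),
\qquad
\Phi(\mathbf{x}) \cdot \Phi(\mathbf{z})
  = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
  = (\mathbf{x} \cdot \mathbf{z})^2
```

Data that are only quadratically separable in the original two dimensions become linearly separable in the three-dimensional feature space, and the kernel computes the feature-space inner product without ever constructing Φ explicitly.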
46
The kernel trick
Polynomial kernel: K(x, z) = (x · z + c)^d
Sigmoid kernel: K(x, z) = tanh(κ x · z + θ)
RBF kernel: K(x, z) = exp(−‖x − z‖² / 2σ²)
Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
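The three named kernels written out as a sketch (these are common parameterizations; the constants c, d, κ, θ, σ are free parameters, and my defaults are arbitrary):

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, c=1.0, d=2):
    # Polynomial kernel: (x . z + c)^d
    return (dot(x, z) + c) ** d

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    # Sigmoid kernel: tanh(kappa * x . z + theta)
    return math.tanh(kappa * dot(x, z) + theta)

def rbf_kernel(x, z, sigma=1.0):
    # RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))
```

Each replaces the plain dot product in the SVM's decision function, implicitly working in a higher-dimensional space.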
47
SVM (cont’d)
Evaluation:
– SVM > kNN > decision tree > NB
Implementation:
– Quadratic optimization
– Use a toolkit (e.g., Thorsten Joachims's SVMlight)
48
Semi-supervised learning EM Co-training Graph-based
49
Exploiting Hyperlinks – Co-training
Each document instance has two alternate views (Blum and Mitchell 1998):
– the terms in the document, x1
– the terms in the hyperlinks that point to the document, x2
Each view is sufficient to determine the class of the instance:
– the labeling function that classifies examples is the same whether applied to x1 or x2
– x1 and x2 are conditionally independent, given the class
[Slide from Pierre Baldi]
50
Co-training Algorithm
Labeled data are used to infer two Naïve Bayes classifiers, one for each view. Each classifier will:
– examine unlabeled data
– pick the most confidently predicted positive and negative examples
– add these to the labeled examples
The classifiers are then retrained on the augmented set of labeled examples.
[Slide from Pierre Baldi]
51
Conclusion
SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters. NB is also good in many circumstances.
52
Readings MRS18 MRS17, MRS19