Information Retrieval Search Engine Technology (5&6), Prof. Dragomir R. Radev — Presentation transcript:

1 Information Retrieval Search Engine Technology (5&6) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev radev@umich.edu

2 Final projects
Two formats:
–A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
–A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
Deliverables:
–System (code + documentation + examples) or paper (+ code, data)
–Poster (to be presented in class)
–Web page that describes the project

3 SET/IR – W/S 2009
…
9. Text classification
–Naïve Bayesian classifiers
–Decision trees
…

4 Introduction
Text classification: assigning documents to predefined categories (topics, languages, users)
Given a set of classes C and a document x, determine the class of x in C
Hierarchical vs. flat
Overlapping (soft) vs. non-overlapping (hard)

5 Introduction
Ideas: manual classification using rules, e.g.,
–Columbia AND University → Education
–Columbia AND "South Carolina" → Geography
Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
–Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
–Discriminative: model p(y|x) directly

6 Bayes formula
P(A|B) = P(B|A) P(A) / P(B)
Full probability: P(B) = Σ_i P(B|A_i) P(A_i), so P(A|B) = P(B|A) P(A) / Σ_i P(B|A_i) P(A_i)

7 Example (performance-enhancing drug)
Drug (D) with values y/n; test (T) with values +/-
P(D=y) = 0.001
P(T=+|D=y) = 0.8
P(T=+|D=n) = 0.01
Given: an athlete tests positive.
P(D=y|T=+) = P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)) = (0.8 x 0.001)/(0.8 x 0.001 + 0.01 x 0.999) ≈ 0.074
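A quick check of this computation in Python (a minimal sketch; the variable names are mine):

# Bayes' rule for the drug-test example
p_d = 0.001               # prior: P(D=y)
p_pos_given_d = 0.8       # P(T=+|D=y)
p_pos_given_not_d = 0.01  # P(T=+|D=n)
# full probability: P(T=+) = P(+|y)P(y) + P(+|n)P(n)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
print(p_pos_given_d * p_d / p_pos)  # ~0.074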

8 Naïve Bayesian classifiers
Naïve Bayesian classifier: choose the class c that maximizes P(c|d) ∝ P(c) Π_i P(w_i|c)
Assuming statistical independence of the features
Features = words (or phrases), typically

9 Example
p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
–p(sneeze|well) = 0.1, p(sneeze|cold) = 0.9, p(sneeze|allergy) = 0.9
–p(cough|well) = 0.1, p(cough|cold) = 0.8, p(cough|allergy) = 0.7
–p(fever|well) = 0.01, p(fever|cold) = 0.7, p(fever|allergy) = 0.4
Example from Ray Mooney

10 Example (cont'd)
Features: sneeze, cough, no fever
P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e)
P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e)
P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e)
p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well|e) ≈ .23, P(cold|e) ≈ .26, P(allergy|e) ≈ .50
Example from Ray Mooney
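The same arithmetic in Python (a sketch; the slide rounds intermediate values, so full precision gives slightly different posteriors, roughly .23/.28/.49):

priors = {'well': 0.9, 'cold': 0.05, 'allergy': 0.05}
cond = {  # p(symptom | class), from the previous slide
    'sneeze': {'well': 0.1, 'cold': 0.9, 'allergy': 0.9},
    'cough':  {'well': 0.1, 'cold': 0.8, 'allergy': 0.7},
    'fever':  {'well': 0.01, 'cold': 0.7, 'allergy': 0.4},
}
# evidence e = (sneeze, cough, no fever): use 1 - p(fever|c) for "no fever"
joint = {c: priors[c] * cond['sneeze'][c] * cond['cough'][c] * (1 - cond['fever'][c])
         for c in priors}
p_e = sum(joint.values())
print({c: round(p / p_e, 2) for c, p in joint.items()})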

11 Issues with NB
Where do we get the prior values? Use maximum likelihood estimation: P(c_i) = N_i/N
Same for the conditionals: these are based on a multinomial generator and the MLE estimator is P(w_j|c_i) = T_ji / Σ_j T_ji
Smoothing is needed: a word unseen in a class would otherwise zero out the whole product. Laplace smoothing: (T_ji + 1) / (Σ_j T_ji + |V|)
Implementation: sum log probabilities to avoid floating point underflow
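A minimal multinomial NB sketch showing both fixes, Laplace smoothing at training time and log-space scoring at classification time (the function names and data layout are mine, not code from the lecture):

import math
from collections import Counter

def train_nb(docs_by_class, vocab):
    """docs_by_class: {class: list of token lists}."""
    logprior, loglik = {}, {}
    n_docs = sum(len(d) for d in docs_by_class.values())
    for c, docs in docs_by_class.items():
        logprior[c] = math.log(len(docs) / n_docs)   # log N_c / N
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        # Laplace smoothing: (T_ct + 1) / (sum_t' T_ct' + |V|)
        loglik[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                     for t in vocab}
    return logprior, loglik

def classify(tokens, logprior, loglik):
    # summing logs instead of multiplying probabilities avoids underflow
    return max(logprior, key=lambda c: logprior[c] +
               sum(loglik[c][t] for t in tokens if t in loglik[c]))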

12 Spam recognition
Return-Path:
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima"
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

13 SpamAssassin
http://spamassassin.apache.org/
http://spamassassin.apache.org/tests_3_1_x.html

14 Feature selection: the χ2 test
For a term t: C = class, I_t = feature; counts k_ij from n documents:
         I_t=0   I_t=1
C=0      k_00    k_01
C=1      k_10    k_11
Testing for independence: P(C=0, I_t=0) should equal P(C=0) P(I_t=0)
–P(C=0) = (k_00 + k_01)/n
–P(C=1) = 1 - P(C=0) = (k_10 + k_11)/n
–P(I_t=0) = (k_00 + k_10)/n
–P(I_t=1) = 1 - P(I_t=0) = (k_01 + k_11)/n

15 Feature selection: the χ2 test
χ2 = Σ_ij (k_ij - E_ij)^2 / E_ij, where E_ij is the count expected under independence
High values of χ2 indicate lower belief in independence.
In practice, compute χ2 for all words and pick the top k among them.
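Computed directly from the 2x2 counts above (a sketch; k00..k11 as in the table):

def chi_square(k00, k01, k10, k11):
    """Chi-square statistic for independence of class C and feature I_t."""
    n = k00 + k01 + k10 + k11
    rows = (k00 + k01, k10 + k11)   # class marginals
    cols = (k00 + k10, k01 + k11)   # feature marginals
    observed = ((k00, k01), (k10, k11))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # count expected under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2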

16 Feature selection: mutual information
MI(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / (p(x) p(y)) ]
No document length scaling is needed
Documents are assumed to be generated according to the multinomial model
Measures amount of information: if the word's distribution is the same as the background distribution, then MI = 0
X = word; Y = class
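The same 2x2 counts give the MI score (a sketch; log base 2, with the 0 log 0 = 0 convention):

import math

def mutual_information(k00, k01, k10, k11):
    """MI between word occurrence and class; zero under independence."""
    n = k00 + k01 + k10 + k11
    rows = (k00 + k01, k10 + k11)
    cols = (k00 + k10, k01 + k11)
    mi = 0.0
    for (i, j), k in {(0, 0): k00, (0, 1): k01,
                      (1, 0): k10, (1, 1): k11}.items():
        if k:  # skip zero cells: 0 * log 0 = 0 by convention
            mi += (k / n) * math.log2(k * n / (rows[i] * cols[j]))
    return mi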

17 Well-known datasets
20 newsgroups
–http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Reuters-21578
–http://www.daviddlewis.com/resources/testcollections/reuters21578/
–Categories: grain, acquisitions, corn, crude, wheat, trade…
WebKB
–http://www-2.cs.cmu.edu/~webkb/
–Classes: course, student, faculty, staff, project, dept, other
–NB performance (2000):
P = 26, 43, 18, 6, 13, 2, 94
R = 83, 75, 77, 9, 73, 100, 35

18 Evaluation of text classification
Macroaveraging – average the per-class measures over classes
Microaveraging – compute one measure from the pooled contingency table
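A sketch of the difference for F1, given per-class (TP, FP, FN) counts (the helper and data layout are mine):

def micro_macro_f1(tables):
    """tables: one (tp, fp, fn) tuple per class."""
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(f1(*t) for t in tables) / len(tables)  # average per-class scores
    tp, fp, fn = map(sum, zip(*tables))                # pool counts, score once
    return f1(tp, fp, fn), macro                       # (micro, macro)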

19 Vector space classification
[Figure: documents from topic1 and topic2 plotted in a two-dimensional feature space (x1, x2)]

20 Decision surfaces
[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) space]

21 Decision trees
[Figure: axis-parallel decision boundaries induced by a decision tree over (x1, x2)]

22 Classification using decision trees
Expected information: I(s_1, s_2, …, s_m) = -Σ_i p_i log2(p_i)
s = number of data samples, m = number of classes, p_i = s_i/s

23 Training data:
RID  Age     Income  Student  Credit     Buys?
1    <=30    High    No       Fair       No
2    <=30    High    No       Excellent  No
3    31..40  High    No       Fair       Yes
4    >40     Medium  No       Fair       Yes
5    >40     Low     Yes      Fair       Yes
6    >40     Low     Yes      Excellent  No
7    31..40  Low     Yes      Excellent  Yes
8    <=30    Medium  No       Fair       No
9    <=30    Low     Yes      Fair       Yes
10   >40     Medium  Yes      Fair       Yes
11   <=30    Medium  Yes      Excellent  Yes
12   31..40  Medium  No       Excellent  Yes
13   31..40  High    Yes      Fair       Yes
14   >40     Medium  No       Excellent  No

24 Decision tree induction
I(s_1, s_2) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

25 Entropy and information gain
E(A) = Σ_j [(s_1j + … + s_mj)/s] I(s_1j, …, s_mj)
Entropy = expected information based on the partitioning into subsets by attribute A
Gain(A) = I(s_1, s_2, …, s_m) - E(A)

26 Entropy
Age <= 30: s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
Age in 31..40: s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
Age > 40: s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971

27 Entropy (cont'd)
E(age) = 5/14 I(s_11, s_21) + 4/14 I(s_12, s_22) + 5/14 I(s_13, s_23) = 0.694
Gain(age) = I(s_1, s_2) - E(age) = 0.940 - 0.694 = 0.246
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
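The gain computation for the age attribute, in Python (a sketch; full-precision arithmetic gives 0.247 where the slide's rounded intermediates give 0.246):

import math

def info(*counts):
    # expected information I(s_1,...,s_m) = -sum p_i log2 p_i
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

partitions = [(2, 3), (4, 0), (3, 2)]  # (yes, no) counts per age range
s = sum(sum(p) for p in partitions)    # 14 samples
e_age = sum(sum(p) / s * info(*p) for p in partitions)
print(round(info(9, 5) - e_age, 3))    # gain(age) ~0.247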

28 Final decision tree
age?
–<= 30: split on student (no → no, yes → yes)
–31..40: yes
–> 40: split on credit (excellent → no, fair → yes)

29 Other techniques
Bayesian classifiers
X: age <= 30, income = medium, student = yes, credit = fair
P(yes) = 9/14 = 0.643
P(no) = 5/14 = 0.357

30 Example
P(age <= 30 | yes) = 2/9 = 0.222
P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444
P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667
P(student = yes | no) = 1/5 = 0.200
P(credit = fair | yes) = 6/9 = 0.667
P(credit = fair | no) = 2/5 = 0.400

31 Example (cont'd)
P(X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | yes) P(yes) = 0.044 x 0.643 = 0.028
P(X | no) P(no) = 0.019 x 0.357 = 0.007
Answer: yes (0.028 > 0.007)
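The whole example end to end, using the table from slide 23 (a sketch; the rows are transcribed from the slide):

data = [  # (age, income, student, credit, buys)
    ('<=30', 'high', 'no', 'fair', 'no'), ('<=30', 'high', 'no', 'excellent', 'no'),
    ('31..40', 'high', 'no', 'fair', 'yes'), ('>40', 'medium', 'no', 'fair', 'yes'),
    ('>40', 'low', 'yes', 'fair', 'yes'), ('>40', 'low', 'yes', 'excellent', 'no'),
    ('31..40', 'low', 'yes', 'excellent', 'yes'), ('<=30', 'medium', 'no', 'fair', 'no'),
    ('<=30', 'low', 'yes', 'fair', 'yes'), ('>40', 'medium', 'yes', 'fair', 'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'), ('31..40', 'medium', 'no', 'excellent', 'yes'),
    ('31..40', 'high', 'yes', 'fair', 'yes'), ('>40', 'medium', 'no', 'excellent', 'no'),
]
x = ('<=30', 'medium', 'yes', 'fair')

scores = {}
for label in ('yes', 'no'):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                  # prior P(label)
    for i, value in enumerate(x):                  # times P(attr = value | label)
        score *= sum(r[i] == value for r in rows) / len(rows)
    scores[label] = score
print(scores)  # yes ~0.028, no ~0.007 -> predict yes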

32 SET/IR – W/S 2009
…
10. Linear classifiers
–Kernel methods
–Support vector machines
…

33 Linear boundary
[Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) space]

34 Vector space classifiers
Using centroids
Boundary = the line equidistant from the two centroids (the perpendicular bisector of the segment connecting them)

35 Generative models: kNN
K-nearest neighbors: assign each element to the majority class among its k closest training examples
Very easy to program
Tessellation; nonlinearity
Issues: choosing k, b?
Demo:
–http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

36 Linear separators
Two-dimensional line: w_1 x_1 + w_2 x_2 = b is the linear separator
w_1 x_1 + w_2 x_2 > b for the positive class
In n-dimensional spaces: Σ_i w_i x_i = b, i.e., w · x = b

37 Example 1
[Figure: two linearly separable classes, topic1 and topic2, in the (x1, x2) space]

38 Example 2
Classifier for "interest" in Reuters-21578, b = 0:
 w_i    x_i (term)    w_i    x_i (term)
 0.70   prime        -0.71   dlrs
 0.67   rate         -0.35   world
 0.63   interest     -0.33   sees
 0.60   rates        -0.25   year
 0.46   discount     -0.24   group
 0.43   bundesbank   -0.24   dlr
If the document is "rate discount dlrs world", its score will be 0.67·1 + 0.46·1 + (-0.71)·1 + (-0.35)·1 = 0.07 > 0
Example from MSR
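Scoring the example document with these weights (a sketch):

weights = {'prime': 0.70, 'rate': 0.67, 'interest': 0.63, 'rates': 0.60,
           'discount': 0.46, 'bundesbank': 0.43, 'dlrs': -0.71, 'world': -0.35,
           'sees': -0.33, 'year': -0.25, 'group': -0.24, 'dlr': -0.24}
b = 0
# sum the weights of the terms present in the document
score = sum(weights.get(t, 0.0) for t in 'rate discount dlrs world'.split())
print(round(score, 2), score > b)  # 0.07 True -> assign to "interest"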

39 Example: perceptron algorithm
Input: training examples (x_i, y_i) with y_i ∈ {-1, +1}
Algorithm: start with w = 0; for each misclassified example, i.e., y_i (w · x_i) <= 0, update w ← w + η y_i x_i; repeat until no errors remain
Output: the weight vector w of a separating hyperplane (if the data are separable)
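A sketch of the algorithm in NumPy (labels in {-1, +1}; the learning rate eta and epoch cap are my additions):

import numpy as np

def perceptron(X, y, epochs=100, eta=1.0):
    """X: (n, d) array of examples, y: labels in {-1, +1}.
    Converges if the data are linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:  # misclassified
                w += eta * yi * xi       # rotate hyperplane toward xi
                b += eta * yi
                errors += 1
        if errors == 0:                  # converged: all examples correct
            break
    return w, b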

40 [Slide from Chris Bishop]

41 Linear classifiers
What is the major shortcoming of a perceptron?
How to determine the dimensionality of the separator?
–Bias-variance tradeoff (example)
How to deal with multiple classes?
–Any-of: build a separate classifier for each class
–One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring

42 Support vector machines
Introduced by Vapnik in the early 90s.
Find the separator that maximizes the margin: minimize ||w||^2 / 2 subject to y_i (w · x_i + b) >= 1 for all i

43 Issues with SVM
Soft margins (handle inseparable data via slack variables)
Kernels (introduce non-linearity)

44 The kernel idea
[Figure: the same data before and after mapping into a higher-dimensional feature space, where they become linearly separable]

45 Example (mapping to a higher-dimensional space)
A classic example: φ(x_1, x_2) = (x_1^2, √2 x_1 x_2, x_2^2), for which φ(x) · φ(z) = (x · z)^2, so the dot product in the new space can be computed without constructing it

46 The kernel trick
Polynomial kernel: K(x, z) = (x · z + 1)^d
Sigmoid kernel: K(x, z) = tanh(κ (x · z) + c)
RBF kernel: K(x, z) = exp(-||x - z||^2 / (2σ^2))
Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
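The three kernels as functions (a sketch; the parameter defaults are mine):

import numpy as np

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d                # (x . z + 1)^d

def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
    return np.tanh(kappa * (x @ z) + c)    # tanh(kappa x . z + c)

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))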

47 SVM (cont'd)
Evaluation:
–SVM > kNN > decision tree > NB
Implementation:
–Quadratic optimization
–Use a toolkit (e.g., Thorsten Joachims's SVMlight)

48 Semi-supervised learning EM Co-training Graph-based

49 Exploiting hyperlinks: co-training
Each document instance has two alternate views (Blum and Mitchell 1998):
–terms in the document, x1
–terms in the hyperlinks that point to the document, x2
Each view alone is sufficient to determine the class of the instance:
–the labeling function that classifies examples is the same whether applied to x1 or x2
–x1 and x2 are conditionally independent, given the class
[Slide from Pierre Baldi]

50 Co-training algorithm
Labeled data are used to infer two Naïve Bayes classifiers, one for each view
Each classifier will:
–examine the unlabeled data
–pick its most confidently predicted positive and negative examples
–add these to the labeled examples
The classifiers are then retrained on the augmented set of labeled examples (see the sketch below)
[Slide from Pierre Baldi]
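A sketch of the loop in scikit-learn style (the fit/predict_proba interface, binary labels in {0, 1}, and the one-pick-per-view simplification are my assumptions; Blum and Mitchell grow the labeled set with p positives and n negatives per classifier):

import numpy as np

def co_train(clf1, clf2, X1, X2, y, U1, U2, rounds=10):
    """(X1, X2): two views of the labeled data with labels y in {0, 1};
    (U1, U2): the same two views of the unlabeled pool."""
    X1, X2 = np.asarray(X1), np.asarray(X2)
    U1, U2, y = np.asarray(U1), np.asarray(U2), np.asarray(y)
    for _ in range(rounds):
        if len(U1) == 0:
            break
        clf1.fit(X1, y)                           # retrain each view's classifier
        clf2.fit(X2, y)
        chosen = {}                               # pool index -> predicted label
        for clf, U in ((clf1, U1), (clf2, U2)):
            proba = clf.predict_proba(U)
            i = int(proba.max(axis=1).argmax())   # most confident example
            chosen[i] = int(proba[i].argmax())    # assumes classes_ == [0, 1]
        idx = list(chosen)
        # move the picked examples (both views!) into the labeled set
        X1 = np.vstack([X1, U1[idx]])
        X2 = np.vstack([X2, U2[idx]])
        y = np.concatenate([y, [chosen[i] for i in idx]])
        keep = [j for j in range(len(U1)) if j not in chosen]
        U1, U2 = U1[keep], U2[keep]
    return clf1, clf2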

51 Conclusion
SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
NB is also good in many circumstances

52 Readings: MRS18; MRS17, MRS19

