Information Retrieval

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

Text Categorization.
Chapter 5: Introduction to Information Retrieval
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
Lecture 13-1: Text Classification & Naive Bayes
Presented by Zeehasham Rasheed
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
CS347 Lecture 9 May 11, 2001 ©Prabhakar Raghavan.
Overview of Search Engines
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Classifiers, Part 3 Week 1, Video 5 Classification  There is something you want to predict (“the label”)  The thing you want to predict is categorical.

Information Retrieval and Web Search Introduction to Text Classification (Note: slides in this set have been adapted from the course taught by Chris Manning.

How to classify reading passages into predefined categories ASH.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Special topics on text mining [ Part I: text classification ] Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor.
Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li Presented by: Rick Knowles 7 April 2005.
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
IR Homework #3 By J. H. Wang May 4, Programming Exercise #3: Text Classification Goal: to classify each document into predefined categories Input:
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris Manning,
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Information Retrieval Lecture 4 Introduction to Information Retrieval (Manning et al. 2007) Chapter 13 For the MSc Computer Science Programme Dell Zhang.
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
Text Categorization With Support Vector Machines: Learning With Many Relevant Features By Thornsten Joachims Presented By Meghneel Gore.
Class Imbalance in Text Classification
Information Retrieval and Organisation Chapter 14 Vector Space Classification Dell Zhang Birkbeck, University of London.
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.
KNN & Naïve Bayes Hongning Wang
IR Homework #2 By J. H. Wang May 9, Programming Exercise #2: Text Classification Goal: to classify each document into predefined categories Input:
Information Retrieval in Practice
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Information Organization: Overview
Instance Based Learning
Lecture 3 : Classification
Information Retrieval
Information Retrieval
Reading: Pedro Domingos: A Few Useful Things to Know about Machine Learning source: /cacm12.pdf reading.
Introduction to Data Science Lecture 7 Machine Learning Overview
Lecture 15: Text Classification & Naive Bayes
Prepared by: Mahmoud Rafeek Al-Farra
Machine Learning: Lecture 3
Text Categorization Rong Jin.
Text Categorization Assigning documents to a fixed set of categories
Instance Based Learning
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models
CSE 635 Multimedia Information Retrieval
Michal Rosen-Zvi University of California, Irvine
Authors: Wai Lam and Kon Fan Low Announcer: Kyu-Baek Hwang
Model generalization Brief summary of methods
Information Organization: Overview
Naïve Bayes Text Classification
NAÏVE BAYES CLASSIFICATION
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

Information Retrieval Introduction to Information Retrieval Introduction to Information Retrieval Text Classification and Naïve Bayes Chris Manning, Pandu Nayak and Prabhakar Raghavan

Text Classification - Topics to Do Introduction to Information Retrieval Text Classification - Topics to Do Ch. 13 Need for Text Classification Importance of Classification Text Classification Problem Classification Methods Naïve Bayes (NB) Classification Variants of NB Feature Selection with Methods Evaluation of Text Classification

Text Classification The path from IR to text classification: Introduction to Information Retrieval Text Classification Ch. 13 The path from IR to text classification: You have an information need to monitor, say: Developments in Sensor Networks You want to rerun an appropriate query periodically to find new news items on this topic You will be sent new documents that are found I.e., it’s not ranking but classification (relevant vs. not relevant) Standing queries – Periodically executed on collection over time

Importance of Classification Sentiment Detection Email Sorting Vertical Search Engine

Text Categorization/Classification Problem Introduction to Information Retrieval Text Categorization/Classification Problem Sec. 13.1 Given: A representation of a document d Issue: how to represent text documents. Usually some type of high-dimensional space – bag of words A fixed set of classes: C = {c1, c2,…, cJ} A training Set: D->(d,c) where (d,c) Classification Function: Learning Method:

Example Test Data: (AI) (Programming) (HCI) Classes: ML Planning Introduction to Information Retrieval Example Sec. 13.1 “planning language proof intelligence” Test Data: (AI) (Programming) (HCI) Classes: ML Planning Semantics Garb.Coll. Multimedia GUI Training Data: learning intelligence algorithm reinforcement network... planning temporal reasoning plan language... programming semantics language proof... garbage collection memory optimization region... ... ...

Classification Methods (1) Introduction to Information Retrieval Classification Methods (1) Ch. 13 Manual classification Used by the original Yahoo! Directory Looksmart, about.com, ODP, PubMed Accurate when job is done by experts Consistent when the problem size and team is small Difficult and expensive to scale Means we need automatic classification methods for big problems

Classification Methods (2) Introduction to Information Retrieval Classification Methods (2) Ch. 13 Hand-coded rule-based classifiers One technique used by new agencies, intelligence agencies, etc. Widely deployed in government and enterprise Vendors provide “IDE” for writing such rules

Classification Methods (2) Introduction to Information Retrieval Classification Methods (2) Ch. 13 Hand-coded rule-based classifiers Commercial systems have complex query languages Accuracy is can be high if a rule has been carefully refined over time by a subject expert Building and maintaining these rules is expensive

Classification Methods (3): Supervised learning Introduction to Information Retrieval Classification Methods (3): Supervised learning Sec. 13.1 Given: A document d A fixed set of classes: C = {c1, c2,…, cJ} A training set D of documents each with a label in C Determine: A learning method or algorithm which will enable us to learn a classifier γ For a test document d, we assign it the class γ(d) ∈ C

Classification Methods (3) Introduction to Information Retrieval Classification Methods (3) Ch. 13 Supervised learning Naive Bayes (simple, common) – see video k-Nearest Neighbors (simple, powerful) Support-vector machines (new, generally more powerful) … plus many other methods Many commercial systems use a mixture of methods

Naïve Bayes (NB) Classification

Variants of NB Classifier

Feature Selection: Why? Introduction to Information Retrieval Feature Selection: Why? Sec.13.5 Text collections have a large number of features 10,000 – 1,000,000 unique words … and more Selection may make a particular classifier feasible Some classifiers can’t deal with 1,000,000 features Reduces training time Training time for some methods is quadratic or worse in the number of features Makes runtime models smaller and faster Can improve generalization (performance) Eliminates noise features Avoids overfitting

Feature Selection: Frequency Introduction to Information Retrieval Feature Selection: Frequency The simplest feature selection method: Just use the commonest terms No particular foundation But it make sense why this works They’re the words that can be well-estimated and are most often available as evidence In practice, this is often 90% as good as better methods Smarter feature selection – future lecture

Evaluating Categorization Introduction to Information Retrieval Evaluating Categorization Sec.13.6 Evaluation must be done on test data that are independent of the training data Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data) Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set) Measures: precision, recall, F1, classification accuracy Classification accuracy: r/n where n is the total number of test docs and r is the number of test docs correctly classified