Rainbow Tool Kit Matt Perry Global Information Systems Spring 2003

Outline
1. Introduction to Rainbow
2. Description of the Bow Library
3. Description of Rainbow methods
   1. Naïve Bayes
   2. TFIDF/Rocchio
   3. K Nearest Neighbor
   4. Probabilistic Indexing
4. Demonstration of Rainbow (20 newsgroups example)

What is Rainbow?
- A publicly available executable program that performs document classification
- Part of the Bow (or libbow) library
  - A library of C code useful for writing statistical text analysis, language modeling, and information retrieval programs
  - Developed by Andrew McCallum of Carnegie Mellon University

About the Bow Library
Provides facilities for:
- Recursively descending directories, finding text files
- Finding 'document' boundaries when there are multiple documents per file
- Tokenizing a text file, according to several different methods
- Including N-grams among the tokens
- Mapping strings to integers and back again, very efficiently
- Building a sparse matrix of document/token counts
- Pruning vocabulary by word counts or by information gain
- Building and manipulating word vectors
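To make two of the items above concrete, here is a minimal Python sketch (not the Bow C API, just an illustration) of a string-to-integer vocabulary map and a sparse matrix of document/token counts:

    # Illustrative sketch only; Bow implements these structures in C.
    from collections import Counter

    class Vocabulary:
        """Maps token strings to integer ids and back again."""
        def __init__(self):
            self.word_to_id = {}
            self.id_to_word = []

        def add(self, word):
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.id_to_word)
                self.id_to_word.append(word)
            return self.word_to_id[word]

    def count_matrix(documents, vocab):
        """Sparse document/token counts: one {token_id: count} dict per document."""
        return [Counter(vocab.add(tok) for tok in doc) for doc in documents]

    # toy usage
    docs = ["the cat sat".split(), "the dog sat and sat".split()]
    vocab = Vocabulary()
    matrix = count_matrix(docs, vocab)
    print(matrix[1])               # Counter({2: 2, 0: 1, 3: 1, 4: 1})
    print(vocab.id_to_word[2])     # 'sat'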

About the Bow Library
Provides facilities for:
- Setting word vector weights according to Naive Bayes, TFIDF, and several other methods
- Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing
- Scoring queries for retrieval or classification
- Writing all data structures to disk in a compact format
- Reading the document/token matrix from disk in an efficient, sparse fashion
- Performing test/train splits and automatic classification tests
- Operating in server mode, receiving and answering queries over a socket

About the Bow Library
The library does not:
- Have English parsing or part-of-speech tagging facilities
- Do smoothing across N-gram models
- Claim to be finished
- Have good documentation
- Claim to be bug-free
- Run on a Windows machine

About the Bow Library
In addition to Rainbow, Bow contains three other executable programs:
- Crossbow - does document clustering
- Arrow - does document retrieval (TFIDF)
- Archer - does document retrieval; supports AltaVista-type queries (+, -, "", etc.)

Back to Rainbow
Classification methods used by Rainbow:
- Naïve Bayes (Rainbow was mostly designed for this)
- TFIDF/Rocchio
- K-Nearest Neighbor
- Probabilistic Indexing

Description of Naïve Bayes
Bayesian reasoning provides a probabilistic approach to learning. The idea of Naïve Bayes classification is to assign a new instance the most probable target value, given the attribute values of that instance. How?

Description of Naïve Bayes
Based on Bayes' theorem. Notation:
- P(h) = probability that a hypothesis h holds
  Ex. Pr(document 1 fits the sports category)
- P(D) = probability that training data D will be observed
  Ex. Pr(we will encounter document 1)

Description of Naïve Bayes
Notation, continued:
- P(D|h) = probability of observing data D given that hypothesis h holds
  Ex. the probability that we will observe document 1 given that document 1 is about sports
- P(h|D) = probability that h holds given training data D; this is what we want
  Ex. the probability that document 1 is a sports document given the training data D

Description of Naïve Bayes Bayes Theorem
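The formula shown on this slide does not survive in the transcript; the theorem itself, in the notation defined above, is:

    P(h | D) = P(D | h) P(h) / P(D)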

Description of Naïve Bayes
Bayes' theorem provides a way to calculate P(h|D) from P(h), together with P(D) and P(D|h).
- P(h|D) increases with P(D|h) and with P(h)
- P(h|D) decreases with P(D): the more probable it is to observe D independently of h, the less evidence D provides in support of h

Description of Naïve Bayes Approach: Assign the most probable target value given the attributes

Description of Naïve Bayes Simplification based on Bayes Theorem

Description of Naïve Bayes Naïve Bayes assumes (incorrectly) that the attribute values are conditionally independent given the target value
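The formulas on the last three slides are also missing from the transcript. In the standard notation of the cited Mitchell text (which appears to be what the slides follow), the most probable (MAP) target value and its Naïve Bayes simplification are:

    v_MAP = argmax_{v_j in V} P(v_j | a_1, ..., a_n) = argmax_{v_j in V} P(a_1, ..., a_n | v_j) P(v_j)
    v_NB  = argmax_{v_j in V} P(v_j) * Π_i P(a_i | v_j)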

Rainbow Algorithm (Naïve Bayes)
Let P(c_j) = the probability that a document belongs to class c_j.
Let P(w_k | c_j) = the probability that a randomly drawn word from a document in class c_j will be the word w_k.

Rainbow Algorithm Estimate
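The estimation formulas on this slide are not preserved in the transcript. The standard (Laplace-smoothed) estimates from Mitchell's text-classification algorithm, which is likely what the slide showed, are:

    P(c_j) = |docs_j| / |D|
    P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|)

where docs_j is the set of training documents in class c_j, n is the total number of word positions in those documents, and n_k is the number of times word w_k occurs in them.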

Rainbow Algorithm (Naïve Bayes)
1. Collect all words, punctuation, and other tokens that occur in the training examples (the vocabulary).
2. Calculate the required P(c_j) and P(w_k | c_j) probability terms.
3. Return the estimated target value (class) for the document Doc.
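A minimal Python sketch of this multinomial Naïve Bayes procedure (an illustration of the algorithm above, not the Bow/Rainbow C implementation; the toy documents and labels are made up):

    import math
    from collections import Counter, defaultdict

    def train(docs):
        """docs: list of (list_of_tokens, class_label) pairs."""
        vocab = set()
        class_docs = defaultdict(int)              # number of documents per class
        class_word_counts = defaultdict(Counter)   # token counts per class
        for tokens, c in docs:
            class_docs[c] += 1
            class_word_counts[c].update(tokens)
            vocab.update(tokens)
        total_docs = len(docs)
        priors = {c: n / total_docs for c, n in class_docs.items()}   # P(c_j)
        return vocab, priors, class_word_counts

    def classify(tokens, vocab, priors, class_word_counts):
        """Return argmax_c  log P(c) + sum_k log P(w_k | c), with Laplace smoothing."""
        best_class, best_score = None, float("-inf")
        for c, prior in priors.items():
            n = sum(class_word_counts[c].values())   # total word positions in class c
            score = math.log(prior)
            for w in tokens:
                n_k = class_word_counts[c][w]
                score += math.log((n_k + 1) / (n + len(vocab)))   # P(w | c)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # toy usage
    train_docs = [("the game went to overtime".split(), "sports"),
                  ("stocks fell on the news".split(), "business")]
    model = train(train_docs)
    print(classify("overtime game".split(), *model))   # -> 'sports'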

TFIDF/Rocchio
The most important component of the Rocchio algorithm is the TFIDF (term frequency / inverse document frequency) word weighting scheme.
- TF(w, d) (term frequency) is the number of times word w occurs in document d.
- DF(w) (document frequency) is the number of documents in which word w occurs at least once.

TFIDF/Rocchio The inverse document frequency is calculated as
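The formula itself is not in the transcript; the usual definition, with |D| the total number of documents, is:

    IDF(w) = log( |D| / DF(w) )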

TFIDF/Rocchio
Based on these word-weight heuristics, a word w_i is an important indexing term for a document d if it occurs frequently in that document. However, words that occur frequently in many documents spanning many categories are rated as less important.

TFIDF/Rocchio
Each document D is represented as a vector within a given vector space V, with one component d(i) per word w_i in the vocabulary.

TFIDF/Rocchio
The value d(i) of feature w_i for a document d is calculated as the product of the term frequency and the inverse document frequency; d(i) is called the weight of the word w_i in the document d.
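In symbols, the product the slide refers to is:

    d(i) = TF(w_i, d) * IDF(w_i)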

TFIDF/Rocchio
Documents that are "close together" in vector space talk about the same things.
[Figure: documents d1-d5 as vectors in a space with term axes t1, t2, t3; angles θ and φ between document vectors.]

TFIDF/Rocchio
The distance between vectors d_1 and d_2 is captured by the cosine of the angle θ between them. Note: this is a similarity, not a distance.
[Figure: two document vectors d1 and d2 over term axes t1, t2, t3, with angle θ between them.]

TFIDF/Rocchio
The cosine of the angle between two vectors: the denominator involves the lengths of the vectors, so the cosine measure is also known as the normalized inner product.
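The formula is not preserved in the transcript; the standard cosine similarity between document vectors d_1 and d_2 is:

    cos θ = (d_1 · d_2) / (||d_1|| ||d_2||) = Σ_i d_1(i) d_2(i) / ( sqrt(Σ_i d_1(i)^2) · sqrt(Σ_i d_2(i)^2) )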

TFIDF/Rocchio
A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length. This maps vectors onto the unit sphere, so longer documents don't get more weight. For normalized vectors, the cosine is simply the dot product.
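In symbols:

    d_norm = d / ||d||,   where ||d|| = sqrt( Σ_i d(i)^2 );   for normalized vectors, cos θ = d_norm_1 · d_norm_2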

Rainbow Algorithm (TFIDF/Rocchio)
1. Construct a set of prototype vectors, one vector for each class; this serves as the learned model.
2. Use the model to classify a new document D: D is assigned to the class whose prototype vector is most similar to D's vector.
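A minimal Python sketch of this prototype-vector (Rocchio) classifier, under the TFIDF weighting described above. This is illustrative only; the function names, the averaging of normalized document vectors into prototypes, and the toy data are assumptions, not Bow code:

    import math
    from collections import Counter, defaultdict

    def tfidf_vectors(docs):
        """docs: list of (tokens, label).  Returns per-document TFIDF vectors and the IDF table."""
        df = Counter()
        for tokens, _ in docs:
            df.update(set(tokens))
        n_docs = len(docs)
        idf = {w: math.log(n_docs / df[w]) for w in df}
        vectors = []
        for tokens, label in docs:
            tf = Counter(tokens)
            vectors.append(({w: tf[w] * idf[w] for w in tf}, label))
        return vectors, idf

    def normalize(v):
        length = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {w: x / length for w, x in v.items()}

    def train_prototypes(docs):
        vectors, idf = tfidf_vectors(docs)
        prototypes = defaultdict(Counter)
        for vec, label in vectors:
            prototypes[label].update(normalize(vec))   # sum the class's normalized doc vectors
        return {c: normalize(p) for c, p in prototypes.items()}, idf

    def classify(tokens, prototypes, idf):
        tf = Counter(tokens)
        d = normalize({w: tf[w] * idf[w] for w in tf if w in idf})
        # cosine similarity = dot product of normalized vectors
        return max(prototypes, key=lambda c: sum(d[w] * prototypes[c][w] for w in d))

    # toy usage
    train_docs = [("goal match team".split(), "sports"),
                  ("market shares fell".split(), "business")]
    prototypes, idf = train_prototypes(train_docs)
    print(classify("team match".split(), prototypes, idf))   # -> 'sports'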

K Nearest Neighbor
Features:
- All instances correspond to points in an n-dimensional Euclidean space
- Classification is delayed until a new instance arrives
- Classification is done by comparing feature vectors of the different points
- The target function may be discrete or real-valued

K Nearest Neighbor 1 Nearest Neighbor

K Nearest Neighbor
An arbitrary instance x is represented by the feature vector (a_1(x), a_2(x), ..., a_n(x)), where a_i(x) denotes the i-th feature of x.
The Euclidean distance between two instances is

    d(x_i, x_j) = sqrt( Σ_{r=1..n} (a_r(x_i) - a_r(x_j))^2 )

Find the k nearest neighbors whose distance from the test case falls within a threshold p. If x of those k nearest neighbors are in category c_i, then assign the test case to c_i; otherwise it is unmatched.

Rainbow Algorithm (KNN)
1. Construct a model of points in n-dimensional space for each category (one point per training document).
2. Classify a new document D based on the categories of the k nearest points.
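A minimal Python sketch of this k-nearest-neighbor classification step (illustrative only; documents are assumed to already be numeric feature vectors, and the toy data is made up):

    import math
    from collections import Counter

    def euclidean(a, b):
        """Euclidean distance between two equal-length feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(test_point, training_points, k=3):
        """training_points: list of (feature_vector, category) pairs."""
        neighbors = sorted(training_points, key=lambda p: euclidean(test_point, p[0]))[:k]
        votes = Counter(category for _, category in neighbors)
        return votes.most_common(1)[0][0]

    # toy usage with 2-dimensional points
    train = [((0.0, 0.1), "sports"), ((0.2, 0.0), "sports"), ((1.0, 1.1), "business")]
    print(knn_classify((0.1, 0.1), train, k=3))   # -> 'sports'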

Probabilistic Indexing
Idea:
- A quantitative model for automatic indexing based on some statistical assumptions about word distribution.
- Two types of words: function words and specialty words.
- Function words = words with no importance for defining classes (the, it, etc.)
- Specialty words = words that are important in defining classes (war, terrorist, etc.)

Probabilistic Indexing
Idea:
- Function words follow a Poisson distribution over the set of all documents.
- Specialty words do not follow a Poisson distribution over the set of all documents.
- A specialty word's distribution can be described by a Poisson process within its class.
- Specialty words distinguish more than one class of documents.
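For reference (not shown in the transcript), the Poisson model referred to here assigns the probability of seeing a word k times in a document as

    P(k occurrences) = (λ^k e^{-λ}) / k!

where λ is the word's average number of occurrences per document; specialty words are those whose observed distribution over all documents deviates from this fit.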

Rainbow Method (Probabilistic Indexing)
The goal is to estimate P(C | s_i, d_m):
- the probability that the assignment of indexing term s_i to the document d_m is correct.
Once terms have been identified, assign a Form Of Occurrence (FOC):
- the certainty that the term is correctly identified
- the significance of the term

Rainbow Method (Probabilistic Indexing)
If a term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator. The set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c.

Rainbow Demonstration
20 newsgroups example
References:
- Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997.

Rainbow Commands
Create a model for the classes:
    rainbow -d ~/model --index training directory
Classifying documents:
- Pick a method (naivebayes, knn, tfidf, prind):
    rainbow -d ~/model --method=tfidf --test=1
- Automatic test:
    rainbow -d ~/model --test-set=0.4 --test=3
- Test one at a time:
    rainbow -d ~/model --query test file

Rainbow Demonstration
Can also run as a server:
    rainbow -d ~/model --query-server=port
- Use telnet to classify new documents
Diagnostics:
- List the words with the highest mutual information:
    rainbow -d ~/model -I 10
- Perl script for printing stats:
    rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats.pl