Rocchio’s Algorithm

Motivation
Naïve Bayes is unusual as a learner:
– Only one pass through data
– Order doesn’t matter

Rocchio’s algorithm
Rocchio, J. (1971). “Relevance Feedback in Information Retrieval.” In The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall.

Rocchio’s algorithm
Many variants of these formulae exist… as long as u(w,d) = 0 for words not in d!
Store only the non-zeros in u(d), so its size is O(|d|).
But the size of u(y) is O(|V|): dense over the whole vocabulary.
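
The formulas referred to above appear only as images in the transcript. One common variant, written in the slide’s notation (a sketch of a typical choice, not necessarily the exact formulas shown), is:

u(w,d) = log(TF(w,d) + 1) · log(|D| / DF(w))      (TF-IDF weight of word w in document d)
v(d) = u(d) / ||u(d)||_2                          (length-normalized document vector)
u(y) = α · (1/|C_y|) · Σ_{d in C_y} v(d)  -  β · (1/|D - C_y|) · Σ_{d not in C_y} v(d)
f(d) = argmax_y  v(y) · v(d)                      (classify by the most similar class centroid)

where D is the corpus, C_y is the set of training documents with label y, and v(y) is u(y) normalized the same way as u(d). Note that u(w,d) = 0 whenever TF(w,d) = 0, which is what makes the sparse representation work.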

Rocchio’s algorithm
Given a table mapping w to DF(w), we can compute v(d) from the words in d… and the rest of the learning algorithm is just adding…
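
As a concrete sketch of that statement, here is in-memory Python that builds v(d) from a DF table and then accumulates the class vectors by simple addition. The names (doc_vector, train_rocchio, df, num_docs, alpha) are illustrative, every word is assumed to appear in the DF table, and this simple version keeps only the positive-class (α) term of the formula.

import math
from collections import defaultdict

def doc_vector(words, df, num_docs):
    """Sparse TF-IDF vector v(d), L2-normalized; u(w,d) = 0 for words not in d."""
    tf = defaultdict(int)
    for w in words:
        tf[w] += 1
    u = {w: math.log(c + 1.0) * math.log(num_docs / df[w]) for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in u.values())) or 1.0
    return {w: x / norm for w, x in u.items()}

def train_rocchio(labeled_docs, df, num_docs, alpha=1.0):
    """labeled_docs: iterable of (label, list-of-words). Returns one centroid v(y) per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for y, words in labeled_docs:
        counts[y] += 1
        for w, x in doc_vector(words, df, num_docs).items():
            sums[y][w] += x                       # "the rest of the learning algorithm is just adding"
    return {y: {w: alpha * s / counts[y] for w, s in vec.items()} for y, vec in sums.items()}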

Rocchio v Bayes
(Diagram: for Naïve Bayes, the training data, one row of id, label y, and words w per document, is turned into event counts such as C[X=w1,1 ∧ Y=sports]=5245, C[X=w2,1 ∧ Y=…]=1054, ….)
Recall the Naïve Bayes test process? Imagine a similar process, but for labeled documents…

Rocchio….
(Diagram: from the training data we first compute the DF counts, e.g. for aardvark, agent, …, and then a weight v(w,d) for every word occurrence in every training document: v(w1,1, id1), v(w1,2, id1), …, v(w2,1, id2), ….)

Rocchio….
(Diagram: combining the DF counts with each training document gives one TF-IDF vector per document: v(id1), v(id2), ….)

Rocchio….
(Diagram: each training row id_i, y_i, w_i,1 … w_i,k is replaced by its document vector, e.g. v(w1,1 w1,2 w1,3 … w1,k1) is the document vector for id1, and v(w2,1 w2,2 w2,3 …) = v(w2,1,d), v(w2,2,d), ….)
For each (y, v), go through the non-zero values in v, one for each w in the document d, and increment a counter for that dimension of v(y):
Message: increment v(y1)’s weight for w1,1 by α·v(w1,1,d)/|C_y|
Message: increment v(y1)’s weight for w1,2 by α·v(w1,2,d)/|C_y|
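
Those messages can be made literal in the stream-and-sort style used earlier in the course. A hypothetical sketch: emit one ((y, w), increment) message per non-zero dimension, sort by key, and add; the names, and the assumption that the class sizes |C_y| are known up front, are illustrative.

from itertools import groupby

def emit_messages(labeled_doc_vectors, class_sizes, alpha=1.0):
    """Yield ((y, w), alpha * v(w,d) / |C_y|) messages, one per non-zero dimension."""
    for y, v in labeled_doc_vectors:              # v is a sparse dict {w: v(w,d)}
        for w, x in v.items():
            yield (y, w), alpha * x / class_sizes[y]

def sum_sorted_messages(messages):
    """After sorting by key, add up the increments for each (y, w) dimension of v(y)."""
    for key, group in groupby(sorted(messages), key=lambda kv: kv[0]):
        yield key, sum(val for _, val in group)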

Rocchio at Test Time
(Diagram: training produces, for every word, its weight in each class centroid, e.g. v(y1,w)=0.013, v(y2,w)=…; a test document is scored by looking up these weights for each of its words, e.g. id1 yields v(w1,1,y1), …, v(w1,k1,yk), ….)
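
A minimal sketch of that test-time scoring, assuming sparse centroids like those produced by the training sketch above; the names are illustrative.

def classify(v_d, centroids):
    """v_d: sparse TF-IDF vector of the test document (as built by doc_vector above).
    centroids: dict label -> sparse centroid vector v(y). Returns the highest-scoring label."""
    scores = {y: sum(v_y.get(w, 0.0) * x for w, x in v_d.items())
              for y, v_y in centroids.items()}
    return max(scores, key=scores.get)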

Rocchio Summary
– Compute DF: one scan thru the docs
– Compute v(id_i) for each document: output size O(n)
– Add up the vectors to get v(y)
– Classification ≈ disk NB
Time (n = corpus size):
– computing DF is O(n), like NB event-counting
– computing the v(id_i)’s is O(n): one scan if the DFs fit in memory, like the first part of the NB test procedure otherwise
– adding up the v(y)’s is O(n): one scan if the v(y)’s fit in memory, like NB training otherwise

Rocchio results…
Joachims ’98, “A Probabilistic Analysis of the Rocchio Algorithm…”
Variant TF and IDF formulas; Rocchio’s method (with linear TF)
(The results themselves are an image in the transcript.)


Rocchio results…
Schapire, Singer & Singhal, “Boosting and Rocchio Applied to Text Filtering”, SIGIR ’98
Reuters, all classes (not just the frequent ones)
(The results themselves are an image in the transcript.)

A hidden agenda
Part of machine learning is a good grasp of theory.
Part of ML is a good grasp of what hacks tend to work.
These are not always the same
– Especially in big-data situations
Catalog of useful tricks so far:
– Brute-force estimation of a joint distribution
– Naive Bayes
– Stream-and-sort, request-and-answer patterns
– BLRT and KL-divergence (and when to use them)
– TF-IDF weighting, especially IDF: it’s often useful even when we don’t understand why

One more Rocchio observation
Rennie et al., ICML 2003, “Tackling the Poor Assumptions of Naïve Bayes Text Classifiers”
NB + cascade of hacks

One more Rocchio observation
Rennie et al., ICML 2003, “Tackling the Poor Assumptions of Naïve Bayes Text Classifiers”
“In tests, we found the length normalization to be most useful, followed by the log transform… these transforms were also applied to the input of SVM.”
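
As an illustration of the two transforms the quote singles out (a log transform of the raw counts followed by L2 length normalization), here is a small Python sketch; it shows the idea only and is not Rennie et al.’s code.

import math

def log_and_length_normalize(counts):
    """counts: dict word -> raw term frequency. Returns the transformed sparse vector."""
    logged = {w: math.log(c + 1.0) for w, c in counts.items()}     # log transform
    norm = math.sqrt(sum(x * x for x in logged.values())) or 1.0
    return {w: x / norm for w, x in logged.items()}                # length normalization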

One? more Rocchio observation
(Diagram: split the documents/labels into subsets (1, 2, 3), compute partial DF counts (DFs-1, DFs-2, DFs-3) on each subset, then sort and add the counts to get the global DFs.)
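
Computing the DFs this way is just a word count over per-document-deduplicated words. A hypothetical sketch of the two phases; the splitting into subsets and the external sort are assumed to happen outside these functions.

from collections import Counter

def map_dfs(doc_subset):
    """Mapper: for each (doc_id, words) pair, emit each distinct word once."""
    for doc_id, words in doc_subset:
        for w in set(words):
            yield w, 1

def reduce_dfs(partial_df_streams):
    """'Sort and add counts': merge the partial DF counts from each subset."""
    total = Counter()
    for partial in partial_df_streams:
        for w, c in partial:
            total[w] += c
    return total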

One?? more Rocchio observation
(Diagram: the same pattern computes the class vectors: split the documents/labels into subsets, compute partial v(y)’s (v-1, v-2, v-3) on each subset using the DFs, then sort and add the vectors to get the final v(y)’s.)

O(1) more Rocchio observation
(Diagram: as above, but every subset’s computation of the partial v(y)’s also reads the shared DF table.)
We have shared access to the DFs, but only shared read access; we don’t need to share write access. So we only need to copy the DF information out to the different processes.
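
Since the DF table is read-only, each process can simply load its own copy as side data shipped with the job, rather than coordinating writes. A hypothetical Hadoop-streaming-style mapper illustrating this; the input format, the file name dfs.tsv, and the num_docs value are all assumptions.

import math, sys

def load_dfs(path):
    """Read-only side data: one 'word <tab> df' line per word, copied to every worker."""
    dfs = {}
    with open(path) as f:
        for line in f:
            w, df = line.rstrip("\n").split("\t")
            dfs[w] = int(df)
    return dfs

def mapper(stdin, dfs, num_docs):
    """Stream 'docid <tab> label <tab> w1 w2 ...' lines; emit (label, word, weight) triples."""
    for line in stdin:
        doc_id, label, text = line.rstrip("\n").split("\t", 2)
        words = text.split()
        for w in set(words):
            weight = math.log(words.count(w) + 1.0) * math.log(num_docs / dfs[w])
            print("\t".join([label, w, str(weight)]))

if __name__ == "__main__":
    mapper(sys.stdin, load_dfs("dfs.tsv"), num_docs=100000)   # file name and count are illustrative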

Abstract Implementation: TFIDF (1/2)
data = pairs (docid, term), where term is a word that appears in the document with id docid
operators: DISTINCT, MAP, JOIN, GROUP BY … [RETAINING …] REDUCING TO (a reduce step)

docFreq = DISTINCT data
  | GROUP BY λ(docid,term):term REDUCING TO count                   /* (term, df) */
docIds = MAP data BY λ(docid,term):docid | DISTINCT
numDocs = GROUP docIds BY λ docid:1 REDUCING TO count               /* (1, numDocs) */
dataPlusDF = JOIN data BY λ(docid,term):term, docFreq BY λ(term,df):term
  | MAP λ((docid,term),(term,df)):(docid,term,df)                   /* (docid, term, document-freq) */
unnormalizedDocVecs = JOIN dataPlusDF BY λ row:1, numDocs BY λ row:1
  | MAP λ((docid,term,df),(dummy,numDocs)):(docid,term,log(numDocs/df))
                                                   /* (docid, term, weight-before-normalizing) : u */

(The slide also shows small example tables: the raw (docid, term) pairs, e.g. (d123, found), (d123, aardvark), and their grouped key/value form, e.g. found → [(d123, found), (d134, found), …].)

Abstract Implementation: TFIDF (2/2)
normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w):docid
  RETAINING λ(docid,term,w): w^2
  REDUCING TO sum                                                   /* (docid, sum-of-squared-weights) */
docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w):docid, normalizers BY λ(docid,norm):docid
  | MAP λ((docid,term,w),(docid,norm)):(docid,term,w/sqrt(norm))    /* (docid, term, weight) */

(The slide also shows small example tables, e.g. d1234 → (d1234, found, 1.542), (d1234, aardvark, 13.23), …, the per-document normalizers, and the resulting docVec rows.)
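
The same two-stage dataflow written as ordinary in-memory Python, so the intermediate tables (docFreq, numDocs, normalizers, docVec) can be inspected directly; this is a sketch of the abstract pipeline above, not the actual distributed implementation.

import math
from collections import Counter, defaultdict

def tfidf(data):
    """data: list of (docid, term) pairs, one per term occurrence, as in the pipeline above."""
    doc_freq = Counter(term for _, term in set(data))               # docFreq: (term, df)
    num_docs = len({docid for docid, _ in data})                    # numDocs
    unnormalized = [(docid, term, math.log(num_docs / doc_freq[term]))
                    for docid, term in data]                        # unnormalizedDocVecs
    normalizers = defaultdict(float)                                # (docid, sum of w^2)
    for docid, _, w in unnormalized:
        normalizers[docid] += w * w
    return [(docid, term, w / math.sqrt(normalizers[docid]))        # docVec: (docid, term, weight)
            for docid, term, w in unnormalized]

# Example: tfidf([("d123", "found"), ("d123", "aardvark"), ("d134", "found")])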

GuineaPig: demo
Pure Python (< 1500 lines)
Streams Python data structures
– strings, numbers, tuples (a,b), lists [a,b,c]
– No records: operations defined functionally
Compiles to Hadoop streaming pipeline
– Optimizes sequences of MAPs
Runs locally without Hadoop
– compiles to stream-and-sort pipeline
– intermediate results can be viewed
Can easily run parts of a pipeline
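
For flavor, a word-count plan in roughly the style of the GuineaPig tutorial. This is reconstructed from memory, so treat the class names (Planner, ReadLines, Flatten, Group, ReduceToCount) and their exact signatures as assumptions rather than verified API.

import sys
from guineapig import *   # assumption: exposes Planner, ReadLines, Flatten, Group, ReduceToCount

def tokens(line):
    # turn one line of text into a stream of lowercase tokens
    for tok in line.split():
        yield tok.lower()

class WordCount(Planner):
    # a "view": read lines, flatten to tokens, group identical tokens, reduce to a count
    wc = ReadLines('corpus.txt') | Flatten(by=tokens) | Group(by=lambda x: x, reducingTo=ReduceToCount())

if __name__ == "__main__":
    WordCount().main(sys.argv)   # run locally, or compile to a Hadoop streaming pipeline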

Actual Implementation
(Image-only slide: the code is not reproduced in the transcript.)

Full Implementation
(Image-only slide: the code is not reproduced in the transcript.)