iSchool, Cloud Computing Class Talk, Oct 6th 2008. Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

Presentation transcript:

iSchool, Cloud Computing Class Talk, Oct 6th. Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

Overview
- Abstract Problem
- Trivial Solution
- MapReduce Solution
- Efficiency Tricks

Abstract Problem
Applications:
- Clustering
- Coreference resolution
- "more-like-that" queries

Similarity of Documents (d_i, d_j)
- Simple inner product
- Cosine similarity
- Term weights
  - Standard problem in IR
  - tf-idf, BM25, etc.
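The two similarity measures named on this slide can be sketched as follows. This is a minimal illustration and not the authors' code: documents are assumed to be sparse dictionaries mapping a term to its weight, and the weights shown are hypothetical raw term frequencies (tf-idf or BM25 weights would slot in the same way).

```python
import math

def inner_product(d_i, d_j):
    """Sum of w[t, d_i] * w[t, d_j] over terms present in both documents."""
    # Iterate over the shorter vector; a term missing from either side contributes 0.
    if len(d_i) > len(d_j):
        d_i, d_j = d_j, d_i
    return sum(w * d_j[t] for t, w in d_i.items() if t in d_j)

def cosine(d_i, d_j):
    """Inner product normalized by the lengths of the two vectors."""
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return inner_product(d_i, d_j) / (norm_i * norm_j)

# Hypothetical term weights (raw term frequencies), for illustration only.
d1 = {"clinton": 2.0, "obama": 1.0}
d3 = {"clinton": 1.0, "barack": 1.0, "obama": 1.0}

print(inner_product(d1, d3))  # 3.0
print(cosine(d1, d3))
```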

Trivial Solution
- Load each vector O(N) times
- Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections

Better Solution
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both d_i and d_j
- Allows efficiency tricks

Decomposition into MapReduce
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
- Each term contributes only if it appears in both d_i and d_j
[Diagram: the parts of the computation are labeled index, map, and reduce.]
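The decomposition can be written out as a short derivation. The notation is an assumption for illustration: w_{t,d} is the weight of term t in document d, V is the vocabulary, and df_t is the number of documents containing t. A term absent from either document has weight zero there, so only shared terms survive in the sum, and a term with df_t postings yields one partial score per pair of documents in its posting list.

```latex
\begin{align}
\mathrm{sim}(d_i, d_j)
  &= \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
   = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j} \\
\text{partial scores from } t
  &= \binom{df_t}{2} = \frac{df_t\,(df_t - 1)}{2} = O(df_t^2)
\end{align}
```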

MapReduce Framework
- (a) Map: input (k1, v1) -> [(k2, v2)]
- (b) Shuffle: group values by keys, giving (k2, [v2])
- (c) Reduce: (k2, [v2]) -> output [(k3, v3)]
- The framework handles low-level details transparently
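The three phases can be mimicked in a single process to make the data flow concrete. This is only a sketch: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and nothing here is distributed or fault tolerant the way a real framework such as Hadoop is.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job."""
    # (a) Map: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    # Emit (token, 1) for every token in the document.
    for token in text.split():
        yield token, 1

def count_reducer(token, counts):
    # Sum the grouped counts for one token.
    yield token, sum(counts)

records = [(1, "a b a"), (2, "b c")]
result = run_mapreduce(records, count_mapper, count_reducer)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```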

Standard Indexing
- (a) Map: tokenize each doc
- (b) Shuffle: group values by terms
- (c) Reduce: combine into posting lists

Indexing (3-doc toy collection)
[Figure: indexing three toy documents ("Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama") into posting lists for the terms Clinton, Barack, Cheney, and Obama.]
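The indexing step on a three-document toy collection might look like the sketch below. The document texts are read off the flattened figure and should be treated as an assumption, as should the use of raw term frequency as the posting weight; `index_map` and `build_index` are hypothetical names.

```python
from collections import Counter, defaultdict

# Assumed toy collection (doc id -> text), reconstructed from the slide residue.
docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}

def index_map(doc_id, text):
    # Map: tokenize one document and emit (term, (doc_id, tf)).
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def build_index(docs):
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            postings[term].append(posting)  # Shuffle: group postings by term.
    # Reduce: combine each group into a posting list sorted by doc id.
    return {term: sorted(plist) for term, plist in postings.items()}

index = build_index(docs)
print(index["Clinton"])  # [(1, 2), (2, 1), (3, 1)]
```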

Pairwise Similarity
- (a) Generate pairs
- (b) Group pairs
- (c) Sum pairs
[Figure: worked example on the posting lists for Clinton, Barack, Cheney, and Obama.]

Pairwise Similarity (abstract)
- (a) Generate pairs: multiply the weights within each term's postings
- (b) Group pairs: shuffling groups values by pairs
- (c) Sum pairs: sum the partial scores to get the similarity
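The generate, group, and sum steps can be sketched in memory as follows. The posting lists and their tf weights are a hypothetical toy index, and `generate_pairs` and `pairwise_similarity` are illustrative names, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy index: term -> posting list of (doc_id, weight).
index = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Obama": [(1, 1), (3, 1)],
    "Barack": [(3, 1)],
    "Cheney": [(2, 1)],
}

def generate_pairs(postings):
    # (a) Generate pairs: one partial score per pair of documents sharing the term.
    for (d_i, w_i), (d_j, w_j) in combinations(sorted(postings), 2):
        yield (d_i, d_j), w_i * w_j

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for postings in index.values():
        for pair, partial in generate_pairs(postings):
            grouped[pair].append(partial)  # (b) Group pairs: key is the doc pair.
    # (c) Sum pairs: add up the partial scores of each pair.
    return {pair: sum(partials) for pair, partials in grouped.items()}

sims = pairwise_similarity(index)
print(sims)  # {(1, 2): 2, (1, 3): 3, (2, 3): 1}
```

Terms with a single posting (Barack and Cheney here) generate no pairs at all, which is why only shared terms contribute.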

Experimental Setup
- Open source MapReduce implementation
- Cluster of 19 machines, each w/ two processors (single core)
- Aquaint-2 collection: 906K documents
- Okapi BM25
- Subsets of collection
Elsayed, Lin, and Oard, ACL 2008

Efficiency (disk space)
- 8 trillion intermediate pairs
- Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk
- Aquaint-2 collection, ~906k docs

Terms: Zipfian Distribution
[Plot: doc freq (df) against term rank.]
- Each term t contributes O(df_t^2) partial results
- Very few terms dominate the computations:
  - most frequent term ("said"): 3%
  - most frequent 10 terms: 15%
  - most frequent 100 terms: 57%
  - most frequent 1000 terms: 95%
- ~0.1% of total terms (99.9% df-cut)
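The dominance of frequent terms follows directly from the quadratic pair count. The sketch below uses a synthetic Zipf-like df distribution (df at rank r is about N/r), which is an assumption chosen only to show the effect; it is not meant to reproduce the percentages measured on Aquaint-2.

```python
def pair_count(df):
    """Number of document pairs (partial scores) generated by a term with this df."""
    return df * (df - 1) // 2

def share_of_top_terms(dfs, k):
    """Fraction of all partial scores produced by the k terms with the most pairs."""
    counts = sorted((pair_count(df) for df in dfs), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0

# Synthetic Zipf-like document frequencies for 10,000 terms (hypothetical data).
N = 100_000
dfs = [max(1, N // rank) for rank in range(1, 10_001)]

for k in (1, 10, 100, 1000):
    print(k, round(share_of_top_terms(dfs, k), 3))
```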

Efficiency (disk space)
- 8 trillion intermediate pairs
- 0.5 trillion intermediate pairs
- Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk
- Aquaint-2 collection, ~906k docs

Effectiveness (recent work)
- Drop 0.1% of terms
- "Near-linear" growth
- Fit on disk
- Cost 2% in effectiveness
- Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk

Implementation Issues
- BM25s similarity model
  - TF, IDF
  - Document length
- DF-cut
  - Build a histogram
  - Pick the absolute df for the % df-cut
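The two df-cut steps (build a histogram, then pick the absolute df for a given percentage) could be sketched like this. The function name `df_cut_threshold`, the `keep_fraction` parameter, and the sample df values are all hypothetical; the slide does not say how ties are handled, so this sketch keeps or drops all terms with the same df together.

```python
from collections import Counter

def df_cut_threshold(dfs, keep_fraction):
    """Largest df such that the terms with df <= threshold make up
    at most keep_fraction of all terms."""
    histogram = Counter(dfs)           # df value -> number of terms with that df
    target = keep_fraction * len(dfs)  # number of terms we are allowed to keep
    kept = 0
    threshold = 0
    for df in sorted(histogram):
        if kept + histogram[df] > target:
            break
        kept += histogram[df]
        threshold = df
    return threshold

# Hypothetical document frequencies for ten terms.
dfs = [1, 1, 1, 1, 2, 2, 3, 5, 8, 100]
threshold = df_cut_threshold(dfs, 0.9)  # a 90% df-cut on this toy data
kept_terms = [df for df in dfs if df <= threshold]
print(threshold, len(kept_terms))  # 8 9
```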

Other Approximation Techniques?

Other Approximation Techniques (2): Absolute df
- Consider only terms that appear in at least n (or %) documents
- An absolute lower bound on df, instead of just removing the % most-frequent terms

Other Approximation Techniques (3): tf-Cut
- Consider only documents (in posting list) with tf > T; T = 1 or 2
- Or: consider only the top N documents based on tf for each term
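Both tf-cut variants amount to pruning a posting list before pairs are generated. A minimal sketch, assuming postings are (doc_id, tf) tuples; `tf_cut` and `top_n_by_tf` are hypothetical helper names.

```python
import heapq

def tf_cut(postings, min_tf):
    """Keep only postings whose term frequency is strictly greater than min_tf."""
    return [(doc_id, tf) for doc_id, tf in postings if tf > min_tf]

def top_n_by_tf(postings, n):
    """Keep only the n postings with the highest term frequency."""
    return heapq.nlargest(n, postings, key=lambda posting: posting[1])

# Hypothetical posting list for one term.
postings = [(1, 2), (2, 1), (3, 1), (4, 5)]

print(tf_cut(postings, 1))       # [(1, 2), (4, 5)]
print(top_n_by_tf(postings, 2))  # [(4, 5), (1, 2)]
```

Since a posting list of length df produces on the order of df^2 pairs, shortening the list reduces the intermediate output quadratically.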

Other Approximation Techniques (4): Similarity Threshold
- Consider only partial scores > Sim_T

Other Approximation Techniques (5): Ranked List
- Keep only the most similar N documents
  - In the reduce phase
- Good for ad-hoc retrieval and "more-like this" queries
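Keeping a ranked list of the N most similar documents might be sketched as below. The input scores are hypothetical, and `top_n_neighbors` is an illustrative name; in an actual job this truncation would happen inside the reducer rather than over a finished dictionary.

```python
import heapq
from collections import defaultdict

def top_n_neighbors(sims, n):
    """For each document, keep only its n most similar documents."""
    neighbors = defaultdict(list)
    for (d_i, d_j), score in sims.items():
        # Similarity is symmetric, so record the score on both sides.
        neighbors[d_i].append((score, d_j))
        neighbors[d_j].append((score, d_i))
    return {doc: heapq.nlargest(n, scored) for doc, scored in neighbors.items()}

# Hypothetical pairwise scores keyed by (doc_i, doc_j).
sims = {(1, 2): 2.0, (1, 3): 3.0, (2, 3): 1.0}
ranked = top_n_neighbors(sims, 1)
print(ranked)  # {1: [(3.0, 3)], 2: [(2.0, 1)], 3: [(3.0, 1)]}
```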

Space-Saving Tricks (1): Stripes
- Stripes instead of pairs
- Group by doc-id, not pairs
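One way to read "stripes instead of pairs" is that the mapper emits, per document, a small map from neighbor document to partial score, so the shuffle key is a single doc id. The sketch below follows that reading, which is an assumption since the slide gives no detail; the toy index and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy index: term -> posting list of (doc_id, weight).
index = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Obama": [(1, 1), (3, 1)],
    "Barack": [(3, 1)],
    "Cheney": [(2, 1)],
}

def generate_stripes(postings):
    # Map: one stripe {d_j: partial score} per document d_i, instead of one record per pair.
    postings = sorted(postings)
    for pos, (d_i, w_i) in enumerate(postings):
        stripe = {d_j: w_i * w_j for d_j, w_j in postings[pos + 1:]}
        if stripe:
            yield d_i, stripe

def merge_stripes(index):
    merged = defaultdict(Counter)
    for postings in index.values():
        for d_i, stripe in generate_stripes(postings):
            merged[d_i].update(stripe)  # Reduce: element-wise sum of stripes per doc id.
    return {doc: dict(stripe) for doc, stripe in merged.items()}

stripes = merge_stripes(index)
print(stripes)  # {1: {2: 2, 3: 3}, 2: {3: 1}}
```

Compared with emitting pairs, each key is written once per stripe rather than once per partial score, which is where the space saving would come from.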

Space-Saving Tricks (2): Blocking
- No need to generate the whole similarity matrix at once
- Generate different blocks of the matrix at different steps
- Limits the max space required for intermediate results
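Blocking can be sketched by restricting each step to a slice of the matrix rows. The row-wise partitioning, the `block_size` parameter, and the toy index are assumptions for illustration; the slide does not specify how blocks are chosen.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy index: term -> posting list of (doc_id, weight).
index = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Obama": [(1, 1), (3, 1)],
    "Barack": [(3, 1)],
    "Cheney": [(2, 1)],
}

def block_similarity(index, row_block):
    """Compute only the matrix rows whose first doc id falls in row_block."""
    sims = defaultdict(float)
    for postings in index.values():
        for (d_i, w_i), (d_j, w_j) in combinations(sorted(postings), 2):
            if d_i in row_block:
                sims[(d_i, d_j)] += w_i * w_j
    return dict(sims)

def similarity_in_blocks(index, doc_ids, block_size):
    """Yield one block of the similarity matrix per step."""
    doc_ids = sorted(doc_ids)
    for start in range(0, len(doc_ids), block_size):
        block = set(doc_ids[start:start + block_size])
        yield block_similarity(index, block)

blocks = list(similarity_in_blocks(index, [1, 2, 3], block_size=1))
print(blocks)  # [{(1, 2): 2.0, (1, 3): 3.0}, {(2, 3): 1.0}, {}]
```

Only the pairs of the current block are held at any one time, at the cost of scanning the index once per block.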