iSchool, Cloud Computing Class Talk, Oct 6th 2008
Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard

Presentation transcript:

Slide 1: Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
iSchool, Cloud Computing Class Talk, Oct 6th 2008

Slide 2: Overview
- Abstract Problem
- Trivial Solution
- MapReduce Solution
- Efficiency Tricks
- Identity Resolution in Email

Slide 3: Abstract Problem
Compute the similarity between every pair of documents in a collection.
Applications:
- Clustering
- Coreference resolution
- "more-like-that" queries

Slide 4: Similarity of Documents
- Simple inner product of documents d_i and d_j
- Cosine similarity
- Term weights: a standard problem in IR (tf-idf, BM25, etc.)
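As an illustrative sketch (not the talk's implementation), the inner product and cosine similarity over sparse term-weight vectors look like this; the function names and dict-based representation are assumptions for the example:

```python
import math

def inner_product(di, dj):
    """Inner product of two sparse term-weight vectors (dicts: term -> weight).
    Only terms appearing in both documents contribute."""
    return sum(w * dj[t] for t, w in di.items() if t in dj)

def cosine(di, dj):
    """Cosine similarity: inner product normalized by vector lengths."""
    norm = lambda d: math.sqrt(sum(w * w for w in d.values()))
    return inner_product(di, dj) / (norm(di) * norm(dj))
```

The `if t in dj` guard is exactly the observation the next slides exploit: a term contributes only when it appears in both documents.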

Slide 5: Trivial Solution
- Load each vector o(N) times
- Load each term o(df_t^2) times
Goal: a scalable and efficient solution for large collections

Slide 6: Better Solution
- Load weights for each term once
- Each term contributes o(df_t^2) partial scores
- Each term contributes only if it appears in both documents
- Allows efficiency tricks

Slide 7: Decomposition into MapReduce
- Load weights for each term once: the index (map)
- Each term contributes o(df_t^2) partial scores (reduce)
- Each term contributes only if it appears in both documents

Slide 8: MapReduce Framework
- Map: (k1, v1) -> [(k2, v2)]
- Shuffle: group values by key, yielding (k2, [v2])
- Reduce: (k2, [v2]) -> [(k3, v3)]
The framework transparently handles low-level details.

Slide 9: Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by term
(c) Reduce: combine postings into one posting list per term

Slide 10: Indexing (3-doc toy collection)
A toy collection of three documents over the terms Clinton, Barack, Cheney, and Obama is indexed into one posting list per term.
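The indexing step can be simulated in a few lines; `map_index`, `reduce_index`, and the single-machine driver below are illustrative names sketching the map -> shuffle -> reduce flow, not the talk's code:

```python
from collections import defaultdict

def map_index(doc_id, text):
    """Map: tokenize a document and emit (term, (doc_id, tf)) pairs."""
    tf = defaultdict(int)
    for term in text.lower().split():
        tf[term] += 1
    for term, count in tf.items():
        yield term, (doc_id, count)

def reduce_index(term, postings):
    """Reduce: combine one term's postings into a sorted posting list."""
    return term, sorted(postings)

def build_index(docs):
    """Driver simulating map -> shuffle -> reduce on one machine."""
    shuffled = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_index(doc_id, text):
            shuffled[term].append(posting)
    return dict(reduce_index(t, p) for t, p in shuffled.items())
```

On a toy collection like `{"d1": "clinton obama clinton", "d2": "clinton cheney", "d3": "barack obama"}`, the index maps "clinton" to `[("d1", 2), ("d2", 1)]`.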

Slide 11: Pairwise Similarity
(a) Generate pairs
(b) Group pairs
(c) Sum pairs
(Illustrated on the toy collection with terms Clinton, Barack, Cheney, and Obama.)

Slide 12: Pairwise Similarity (abstract)
(a) Generate pairs: map over each term's posting list, multiplying weights
(b) Group pairs: shuffle groups values by document pair
(c) Sum pairs: reduce sums the partial scores into a similarity
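A minimal single-machine sketch of this generate/group/sum decomposition, assuming an index mapping each term to (doc_id, weight) postings (names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(term, postings):
    """Map: for each term, multiply weights for every pair of docs
    in its posting list, emitting a partial score per pair."""
    for (di, wi), (dj, wj) in combinations(postings, 2):
        yield (di, dj), wi * wj

def reduce_pairs(pair, partial_scores):
    """Reduce: sum one pair's partial scores into its final similarity."""
    return pair, sum(partial_scores)

def pairwise_similarity(index):
    """Driver: map over posting lists, shuffle by pair, reduce to similarities."""
    shuffled = defaultdict(list)
    for term, postings in index.items():
        for pair, score in map_pairs(term, postings):
            shuffled[pair].append(score)
    return dict(reduce_pairs(p, s) for p, s in shuffled.items())
```

One design point: postings must be kept in a canonical (e.g. sorted) order so that the same pair is always emitted as (d1, d2) and never as (d2, d1); otherwise the shuffle would fail to group the partial scores together.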

Slide 13: Experimental Setup
- Hadoop (open-source MapReduce implementation)
- Cluster of 19 machines, each with two processors (single core)
- Aquaint-2 collection: 906K documents
- Okapi BM25 term weights
- Subsets of the collection
Elsayed, Lin, and Oard, ACL 2008

Slide 14: Efficiency (disk space)
- 8 trillion intermediate pairs
- Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
- Aquaint-2 collection, ~906K docs

Slide 15: Terms: Zipfian Distribution
Doc freq (df) vs. term rank: each term t contributes o(df_t^2) partial results, so very few terms dominate the computation:
- most frequent term ("said"): 3%
- 10 most frequent terms: 15%
- 100 most frequent terms: 57%
- 1000 most frequent terms: 95% (~0.1% of total terms, i.e., a 99.9% df-cut)

Slide 16: Efficiency (disk space)
- 8 trillion intermediate pairs without the df-cut; 0.5 trillion with it
- Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk
- Aquaint-2 collection, ~906K docs

Slide 17: Effectiveness (recent work)
Dropping 0.1% of terms yields "near-linear" growth and lets intermediate results fit on disk, at a cost of only 2% in effectiveness.
Hadoop, 19 PCs, each with 2 single-core processors, 4 GB memory, 100 GB disk.

Slide 18: Implementation Issues
- BM25 similarity model: TF, IDF, document length
- DF-cut: build a histogram, then pick the absolute df corresponding to the chosen % df-cut
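Picking the absolute df for a percentage df-cut can be sketched as follows; the helper names are hypothetical and tie handling is simplified:

```python
def df_cut_threshold(df_by_term, cut_fraction=0.001):
    """Given term -> df, return the absolute df threshold for a %-based
    df-cut (cut_fraction=0.001 drops the 0.1% most frequent terms,
    i.e. a 99.9% df-cut)."""
    dfs = sorted(df_by_term.values(), reverse=True)  # df histogram by rank
    n_drop = int(len(dfs) * cut_fraction)            # how many terms to drop
    return dfs[n_drop] if n_drop < len(dfs) else 0   # keep terms with df <= this

def apply_df_cut(index, threshold):
    """Drop posting lists whose df (posting-list length) exceeds the threshold."""
    return {t: p for t, p in index.items() if len(p) <= threshold}
```

Since each term costs o(df_t^2) partial scores, dropping the few highest-df terms removes the bulk of the intermediate pairs while touching almost none of the vocabulary.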

Slide 19: Other Approximation Techniques?

Slide 20: Other Approximation Techniques (2): Absolute df
- Consider only terms that appear in at least n (or n%) documents
- An absolute lower bound on df, instead of just removing the % most-frequent terms

Slide 21: Other Approximation Techniques (3): tf-Cut
- Consider only documents (in the posting list) with tf > T; T = 1 or 2
- OR: consider only the top N documents by tf for each term
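Both tf-cut variants are one-line filters over a posting list of (doc_id, tf) pairs; this is an illustrative sketch with assumed names:

```python
def tf_cut(postings, T=1):
    """Keep only postings whose tf exceeds the cutoff T."""
    return [(doc, tf) for doc, tf in postings if tf > T]

def top_n_by_tf(postings, N):
    """Alternative: keep only the N documents with the highest tf for the term."""
    return sorted(postings, key=lambda p: p[1], reverse=True)[:N]
```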

Slide 22: Other Approximation Techniques (4): Similarity Threshold
- Consider only partial scores > Sim_T

Slide 23: Other Approximation Techniques (5): Ranked List
- Keep only the N most similar documents (in the reduce phase)
- Good for ad-hoc retrieval and "more-like-this" queries

Slide 24: Space-Saving Tricks (1): Stripes
- Stripes instead of pairs
- Group by doc-id, not by pairs
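The stripes trick, sketched under the same assumptions as before (postings are (doc_id, weight) pairs; function names are illustrative): the map emits one associative array per document instead of one key-value record per document pair, so the shuffle groups by doc-id and moves far fewer records.

```python
from collections import defaultdict
from itertools import combinations

def map_stripes(term, postings):
    """Map: emit one stripe (dict of partial scores) per doc-id,
    instead of one (pair, score) record per co-occurring pair."""
    stripes = defaultdict(dict)
    for (di, wi), (dj, wj) in combinations(postings, 2):
        stripes[di][dj] = stripes[di].get(dj, 0.0) + wi * wj
    for doc_id, stripe in stripes.items():
        yield doc_id, stripe  # keyed by doc-id, not by pair

def reduce_stripes(doc_id, stripes):
    """Reduce: element-wise sum of all stripes for one document."""
    total = defaultdict(float)
    for stripe in stripes:
        for other, score in stripe.items():
            total[other] += score
    return doc_id, dict(total)
```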

Slide 25: Space-Saving Tricks (2): Blocking
- No need to generate the whole similarity matrix at once
- Generate different blocks of the matrix at different steps
- Limits the maximum space required for intermediate results

Slide 26: Identity Resolution in Email
- Topical Similarity
- Social Similarity
- Joint Resolution of Mentions

Slide 27: Basic Problem
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled
"Did Sheila want Scott to participate? Looks like the call will be too late for him."
Who is "Sheila"?

Slide 28: Enron Collection
An example thread (From: Sager, Elizabeth; Subject: Shhhh.... it's a SURPRISE!; Mon, 30 Jul 2001) signed "Liza" (Elizabeth Sager), with a reply from "Shari": "Hope all is well. Count me in for the group present. See ya next week if not earlier. Please call me (713)"
There are 55 Sheilas in the collection! Rank the candidates: weisman, pardo, glover, rich, jones, breeden, huckaby, tweed, mcintyre, chadwick, birmingham, kahanek, foraker, tasman, fisher, petitt, dombo, robbins, chang, jarnot, kirby, knudsen, boehringer, lutz, glover, wollam, jortner, neylon, whanger, nagel, graves, mclaughlin, venville, rappazzo, miller, swatek, hollis, maynes, nacey, ferrarini, dey, macleod, howard, darling, watson, perlick, advani, hester, kenner, lewis, walton, whitman, berggren, osowski, kelly

Slide 29: Generative Model
1. Choose a "person" c to mention: p(c)
2. Choose an appropriate "context" X in which to mention c: p(X | c)
3. Choose a "mention" l: p(l | X, c)
Example: the mention "sheila" in the context of a GE conference call.

Slide 30: 3-Step Solution
(1) Identity Modeling
(2) Context Reconstruction
(3) Mention Resolution: posterior distribution

Slide 31: Contextual Space
- Local Context
- Conversational Context
- Topical Context

Slide 32: Topical Context
Email 1 (Fri Dec 15 05:33:00 EST 2000; To: vince j kaminski; Cc: sheila walton; Subject: Re: Grant Masson): "Great news. Lets get this moving along. Sheila, can you work out GE letter? Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone."
Email 2 (Wed Dec 20 08:57:00 EST 2000; From: Kay Mann; To: Suzanne Adams; Subject: Re: GE Conference Call has be rescheduled): "Did Sheila want Scott to participate? Looks like the call will be too late for him."
Shared topical cues: "Sheila", "call", "GE".

Slide 33: Contextual Space
- Local Context
- Conversational Context
- Topical Context
- Social Context

Slide 34: Social Context
Email 1 (Wed Dec 20 08:57:00 EST 2000; From: Kay Mann; To: Suzanne Adams; Subject: Re: GE Conference Call has be rescheduled): "Did Sheila want Scott to participate? Looks like the call will be too late for him."
Email 2 (Tue, 19 Dec 2000; Subject: ESA Option Execution): "Kay, Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland. Thanks, Rebecca"
The shared social context resolves "Sheila" to Sheila Tweed.

Slide 35: Contextual Space (mentions)
The mentions "Sheila", "Sheila Tweed", "sheila", "sg", and "Sheila Walton" are linked by social, conversational, and topical edges: joint resolution of mentions.

Slide 36: Topical Expansion
- Each email is a document
- Index all (bodies of) emails: remove all signature and salutation lines
- Use temporal constraints: need an email-to-date/time mapping; check the constraint for each pair of documents

Slide 37: Social Expansion
Can we use the same technique? For each email, the list of participating email addresses comprises the document.
Example: MessageID: 3563; Date: Wed Dec 20 08:57:00 EST 2000; From: Kay Mann; To: Suzanne Adams; Subject: Re: GE Conference Call has be rescheduled.
Index the new "social documents" and apply the same topical expansion process.

Slide 38: Social Similarity Models
- Intersection size
- Jaccard coefficient
- Boolean
All given temporal constraints.
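The three models, sketched over lists of participant addresses (illustrative names; the temporal-constraint check is omitted):

```python
def intersection_size(a, b):
    """Social similarity as the number of shared participants."""
    return len(set(a) & set(b))

def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def boolean_overlap(a, b):
    """Boolean model: do the two emails share any participant at all?"""
    return bool(set(a) & set(b))
```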

Slide 39: Joint Resolution
The mentions "Sheila", "Sheila Tweed", "sheila", "sg", and "Sheila Walton", linked by social, conversational, and topical edges.

Slide 40: Joint Resolution
On the mention graph: spread the current resolution, combine context info, update the resolution.

Slide 41: Joint Resolution
The mention graph is processed with MapReduce (map, shuffle, reduce). Work in progress!

Slide 42: System Design
Pipeline: emails and threads -> mention recognition -> local, conversational, topical, and social expansion -> local, conversational, topical, and social contexts -> identity models (prior resolution) -> context-free resolution -> merging contexts -> joint resolution -> posterior resolution.

Slide 43: Iterative Joint Resolution
Input: context graph + prior resolution
Mapper (considers one mention):
- Takes: (1) out-edges and context info, (2) prior resolution
- Spreads context info and prior resolution to all mentions in context
Reducer (considers one mention):
- Takes: (1) in-edges and context info, (2) prior resolution
- Computes posterior resolution
Multiple iterations.
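One iteration of this mapper/reducer pair can be sketched as generic message passing over the mention graph. This is an assumption-laden toy (edge-weighted spreading, additive combination with renormalization), not the talk's actual model:

```python
from collections import defaultdict

def map_spread(mention, out_edges, prior):
    """Map: spread this mention's prior resolution along its out-edges."""
    for neighbor, weight in out_edges.items():
        # message: edge-weighted copy of the distribution over identities
        yield neighbor, {ident: p * weight for ident, p in prior.items()}

def reduce_update(mention, messages, prior):
    """Reduce: combine incoming messages with the prior into a
    (renormalized) posterior resolution."""
    posterior = defaultdict(float, prior)
    for msg in messages:
        for ident, p in msg.items():
            posterior[ident] += p
    total = sum(posterior.values())
    return mention, {ident: p / total for ident, p in posterior.items()}
```

Running map_spread over all mentions, shuffling messages by target mention, and calling reduce_update gives one iteration; the posterior becomes the next iteration's prior.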

Slide 44: Conclusion
- Simple and efficient MapReduce solution: applied to both topical and social expansion in "Identity Resolution in Email"; different tricks for approximation
- Shuffling is critical
- The df-cut controls the efficiency vs. effectiveness tradeoff: a 99.9% df-cut achieves 98% relative accuracy

Slide 45: Thank You!

Slide 46: Algorithm
- The similarity matrix must fit in memory: works for small collections
- Otherwise: disk-access optimization