ACL, June 2008. Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard, University of Maryland


Presentation transcript:

ACL, June 2008
Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab

Abstract Problem
Applications:
- Clustering
- Coreference resolution
- “more-like-that” queries

Trivial Solution
- Load each vector $O(N)$ times
- Load each term $O(df_t^2)$ times
Goal: a scalable and efficient solution for large collections

Better Solution
- Load weights for each term once
- Each term contributes $O(df_t^2)$ partial scores
- Each term contributes only if it appears in …
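The contrast between the trivial and the better solution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: similarity is taken to be an inner product of term weights, and the function names and the three toy vectors are hypothetical, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def trivial_similarity(vectors):
    """Compare every pair of document vectors directly.

    Each vector is read once per partner document, i.e. O(N) times.
    """
    scores = {}
    for (a, vec_a), (b, vec_b) in combinations(sorted(vectors.items()), 2):
        score = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
        if score:
            scores[(a, b)] = score
    return scores


def index_similarity(vectors):
    """Invert the vectors, then read each term's weights exactly once."""
    postings = defaultdict(dict)
    for doc, vec in vectors.items():
        for term, w in vec.items():
            postings[term][doc] = w

    scores = defaultdict(int)
    for term, docs in postings.items():
        # A term with df_t postings contributes df_t * (df_t - 1) / 2 partial scores.
        for (a, w_a), (b, w_b) in combinations(sorted(docs.items()), 2):
            scores[(a, b)] += w_a * w_b
    return dict(scores)


# Hypothetical toy vectors: document id -> {term: weight}.
vectors = {
    "d1": {"x": 2, "y": 1},
    "d2": {"x": 1, "z": 3},
    "d3": {"y": 4, "z": 1},
}
trivial_scores = trivial_similarity(vectors)
index_scores = index_similarity(vectors)
```

Both functions return the same scores; the second one only touches document pairs that share at least one term.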

MapReduce Framework
- (a) Map: input (k1, v1) produces [(k2, v2)]
- (b) Shuffle: group values by key, giving (k2, [v2])
- (c) Reduce: (k2, [v2]) produces output [(k3, v3)]
- The framework transparently handles low-level details
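The three phases can be imitated in memory with a short Python sketch. It is not Hadoop's API; `run_mapreduce`, `count_mapper`, and `sum_reducer` are hypothetical names, and word counting is used only as a familiar stand-in job.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory map / shuffle / reduce pass over (key, value) records."""
    # (a) Map: each (k1, v1) produces a list of (k2, v2).
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group values by key, giving (k2, [v2]).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) produces a list of (k3, v3).
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    """Emit (word, 1) for every token in the document."""
    return [(word, 1) for word in text.split()]


def sum_reducer(word, counts):
    """Sum the counts collected for one word."""
    return [(word, sum(counts))]


word_counts = run_mapreduce(
    [("d1", "a b a"), ("d2", "b c")], count_mapper, sum_reducer
)
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; the sketch keeps only the data flow.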

Decomposition
- Load weights for each term once
- Each term contributes $O(df_t^2)$ partial scores
- Each term contributes only if it appears in …
[Figure: the computation annotated with its map and reduce parts.]

Standard Indexing
- (a) Map: tokenize each doc
- (b) Shuffle: group values by terms
- (c) Reduce: combine into a posting list per term

Indexing (3-doc toy collection)
[Figure: a three-document toy collection over the terms Clinton, Barack, Cheney, and Obama, and the posting lists produced by indexing.]
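The indexing job can be sketched as a map function that tokenizes and a reduce function that combines postings. The three documents below are hypothetical, built from the four terms on the slide; the slide's actual documents and weights are not reproduced here, and raw term frequency is assumed as the weight.

```python
from collections import Counter, defaultdict


def index_map(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, term_frequency))."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def index_reduce(term, postings):
    """Combine all postings of one term into a posting list sorted by doc id."""
    return term, sorted(postings)


def build_index(docs):
    """Run map, an in-memory shuffle grouped by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(term, postings) for term, postings in grouped.items())


# Hypothetical toy collection (an assumption for illustration).
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}
index = build_index(docs)
```

With these documents, the posting list for "clinton" covers all three documents, while "cheney" and "barack" each have a single posting.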

Pairwise Similarity
- (a) Generate pairs
- (b) Group pairs
- (c) Sum pairs
[Figure: the toy example over the terms Clinton, Barack, Cheney, and Obama.]

Pairwise Similarity (abstract)
- (a) Generate pairs (map): multiply the weights within each term's postings
- (b) Group pairs (shuffle): group values by pairs
- (c) Sum pairs (reduce): sum the grouped values into the similarity
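The generate, group, and sum steps map onto code roughly as follows. This is a sketch, assuming a posting list is a list of (doc id, weight) tuples and that a partial score is the product of two weights; the function names and the toy index are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairs_map(term, postings):
    """(a) Generate pairs: one partial score per document pair in a posting list."""
    for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), w_a * w_b


def pairs_reduce(pair, partial_scores):
    """(c) Sum pairs: add up the partial scores collected for one document pair."""
    return pair, sum(partial_scores)


def pairwise_job(index):
    """Run map, an in-memory shuffle grouped by document pair, then reduce."""
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)  # (b) Group pairs
    return dict(pairs_reduce(pair, partials) for pair, partials in grouped.items())


# Hypothetical posting lists: term -> [(doc_id, weight)].
toy_index = {
    "clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "obama": [("d1", 1), ("d3", 1)],
    "cheney": [("d2", 1)],
    "barack": [("d3", 1)],
}
similarity = pairwise_job(toy_index)
```

Terms that occur in a single document ("cheney" and "barack" here) emit no pairs at all, so they add nothing to the shuffle.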

Experimental Setup
- Open source MapReduce implementation (Hadoop)
- Cluster of 19 machines
  - Each with two single-core processors
- Aquaint-2 collection
  - 906K documents
- Okapi BM25
- Subsets of collection

Efficiency (disk space)
- 8 trillion intermediate pairs
Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk. Aquaint-2 collection, ~906k docs.

Terms: Zipfian Distribution
[Plot: doc freq (df) against term rank.]
- Each term t contributes $O(df_t^2)$ partial results
- Very few terms dominate the computations
  - Most frequent term (“said”): 3%
  - Most frequent 10 terms: 15%
  - Most frequent 100 terms: 57%
  - Most frequent 1000 terms: 95%
- The 1000 most frequent terms are ~0.1% of total terms (99.9% df-cut)
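The effect of a df-cut on the number of intermediate pairs can be sketched as below. The document frequencies are synthetic (a hypothetical Zipf-like curve, not the Aquaint-2 statistics), the helper names are made up for illustration, and the pair count uses the exact df_t * (df_t - 1) / 2 unordered pairs per term as an assumption.

```python
def intermediate_pairs(doc_freqs):
    """Count the partial scores generated: df * (df - 1) / 2 per term."""
    return sum(df * (df - 1) // 2 for df in doc_freqs.values())


def apply_df_cut(doc_freqs, drop_fraction):
    """Drop the given fraction of terms with the highest document frequency."""
    ranked = sorted(doc_freqs, key=doc_freqs.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return {term: doc_freqs[term] for term in ranked[n_drop:]}


# Synthetic Zipf-like document frequencies: the term at rank r has df = 1000 // r.
doc_freqs = {f"t{rank}": max(1, 1000 // rank) for rank in range(1, 1001)}

pairs_before = intermediate_pairs(doc_freqs)
kept = apply_df_cut(doc_freqs, drop_fraction=0.001)  # a 99.9% df-cut
pairs_after = intermediate_pairs(kept)
```

In this synthetic setting the cut removes only the single most frequent term, yet that one term accounts for 499,500 of the intermediate pairs.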

Efficiency (disk space)
- 8 trillion intermediate pairs
- 0.5 trillion intermediate pairs
Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk. Aquaint-2 collection, ~906k docs.

Effectiveness (recent work)
- Drop 0.1% of terms
- “Near-Linear” growth
- Fit on disk
- Cost 2% in effectiveness
Setup: Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk.

Ivory
- Open source implementation
- Java 1.5
- Available soon …

Conclusion
- Simple and efficient MapReduce solution
  - Many HLT problems can also be “hadoopified”, e.g., statistical MT (see paper in StatMT workshop)
- Shuffling is critical
  - df-cut controls the efficiency vs. effectiveness tradeoff
  - 99.9% df-cut achieves 98% relative accuracy

Future work
- Apply to larger collections!
- Develop analytical model
- Measure effectiveness for different applications

Thank You!

Algorithm
- Matrix must fit in memory
  - Works for small collections
- Otherwise: disk access optimization