Ivory: Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Doug Oard
Laboratory for Computational Linguistics and Information Processing (CLIP Lab)
UM Institute for Advanced Computer Studies (UMIACS)
Problem
Compute the similarity between every pair of documents in a large collection
Applications:
"more-like-that" queries
Clustering, e.g., co-reference resolution
Solutions
Trivial: for each pair of vectors, compute the inner product
Loads each vector O(N) times
Better: a term contributes to a pair's similarity only if it appears in both documents
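The "better" observation above can be sketched with sparse term-weight maps (the vectors here are hypothetical examples, using raw term frequencies as weights):

```python
# Sketch of the sparse inner product: only terms that appear in
# BOTH vectors contribute, so we iterate over the smaller one.
def inner_product(u, v):
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

# Hypothetical term-frequency vectors for two short documents.
d1 = {"A": 2, "B": 1, "C": 1}
d2 = {"B": 1, "D": 2}
print(inner_product(d1, d2))  # only B appears in both: 1*1 = 1
```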
Algorithm
Loads each posting only once
But the similarity matrix must fit in memory
Works for small collections
Otherwise: disk access optimization
Hadoopify: 2-Step Solution
1) Indexing (one MapReduce step): build the term posting file
2) Pairwise Similarity (another MapReduce step): compute each term's contribution for all possible pairs
Generates ½ · df · (df − 1) intermediate contributions per term
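The ½ · df · (df − 1) count per term can be checked on a toy postings list (hypothetical docids): it is just the number of unordered document pairs sharing that term.

```python
from itertools import combinations

# A term with document frequency df contributes one partial product
# for every unordered pair of documents it appears in:
# df * (df - 1) / 2 intermediate tuples.
def num_contributions(df):
    return df * (df - 1) // 2

postings_B = ["d1", "d2", "d3"]            # term B appears in 3 docs
pairs = list(combinations(postings_B, 2))
print(len(pairs))                          # 3
print(num_contributions(len(postings_B)))  # 3 = ½·3·(3−1)
```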
Indexing
Input documents: d1 = {A:2, B:1, C:1}, d2 = {B:1, D:2}, d3 = {A:1, B:2, E:1}
map: emit (term, (docid, tf)) for each term in each document:
(A,(d1,2)) (B,(d1,1)) (C,(d1,1)) (B,(d2,1)) (D,(d2,2)) (A,(d3,1)) (B,(d3,2)) (E,(d3,1))
shuffle: group postings by term
reduce: emit the postings list for each term:
(A, [(d1,2), (d3,1)])
(B, [(d1,1), (d2,1), (d3,2)])
(C, [(d1,1)])
(D, [(d2,2)])
(E, [(d3,1)])
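The indexing step above can be simulated in memory with the same toy documents (the real system runs this as a Hadoop job; this sketch only mimics the map/shuffle/reduce flow):

```python
from collections import defaultdict

# Toy simulation of the indexing step. Map emits (term, (docid, tf));
# shuffle groups by term; reduce concatenates into a postings list.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}

def map_index(docid, text):
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, count in tf.items():
        yield term, (docid, count)

shuffled = defaultdict(list)               # the "shuffle" phase
for docid, text in docs.items():
    for term, posting in map_index(docid, text):
        shuffled[term].append(posting)

postings = {t: sorted(ps) for t, ps in shuffled.items()}
print(postings["B"])  # [('d1', 1), ('d2', 1), ('d3', 2)]
```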
Pairwise Similarity
map: for each term's postings list, emit a partial product for every pair of documents:
(A, [(d1,2), (d3,1)]) → ((d1,d3), 2)
(B, [(d1,1), (d2,1), (d3,2)]) → ((d1,d2), 1), ((d1,d3), 2), ((d2,d3), 2)
shuffle: group partial products by document pair:
((d1,d2), [1]) ((d1,d3), [2, 2]) ((d2,d3), [2])
reduce: sum the partial products:
((d1,d2), 1) ((d1,d3), 4) ((d2,d3), 2)
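The pairwise-similarity step can likewise be simulated in memory, starting from the postings built in the indexing step:

```python
from collections import defaultdict
from itertools import combinations

# Toy simulation of the pairwise-similarity step. Map emits a partial
# product for every pair of documents sharing a term; shuffle groups
# the partials by document pair; reduce sums them.
postings = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

partials = defaultdict(list)               # shuffle: group by doc pair
for term, plist in postings.items():
    for (di, wi), (dj, wj) in combinations(plist, 2):
        partials[(di, dj)].append(wi * wj)

similarities = {pair: sum(vals) for pair, vals in partials.items()}
print(similarities[("d1", "d3")])  # 2 (from A) + 2 (from B) = 4
```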
Implementation Issues
df-cut: drop common terms
Intermediate tuples are dominated by very high-df terms
Trades efficiency vs. effectiveness
Space-saving tricks:
Common doc + stripes
Blocking
Compression
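A df-cut can be sketched as keeping only the rarest fraction of the vocabulary, ranked by document frequency (the df values below are made up for illustration; the actual cut policy is an assumption of this sketch):

```python
# Sketch of a df-cut with hypothetical df values: drop the most
# common terms, whose long postings lists dominate the number of
# intermediate tuples emitted in the pairwise-similarity step.
dfs = {"the": 900_000, "of": 850_000, "hadoop": 1_200, "okapi": 40}

def apply_df_cut(dfs, cut):
    # Keep the bottom `cut` fraction of terms ranked by df
    # (rarest first); the remaining high-df terms are dropped.
    ranked = sorted(dfs, key=dfs.get)
    return set(ranked[: int(len(ranked) * cut)])

kept = apply_df_cut(dfs, cut=0.5)  # drop the top 50% here (toy)
print(sorted(kept))                # ['hadoop', 'okapi']
```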
Experimental Setup
Hadoop cluster of 19 nodes (with dual processors)
AQUAINT-2 collection: 906K documents
Okapi BM25 term weights
Subsets of the collection
Efficiency (running time), 99% df-cut
Efficiency (disk usage)
Effectiveness (recent)
Conclusion
Simple and efficient MapReduce solution
~2 hours (using 38 nodes, 99% df-cut) for a ~million-document collection
Play tricks for I/O-bound jobs
Effective linear-time-scaling approximation
A 99.9% df-cut achieves 98% relative accuracy
The df-cut controls the efficiency vs. effectiveness tradeoff
Future work
Bigger collections!
More investigation of the df-cut and other techniques
Analytical model
Compression techniques (e.g., bitwise)
More effectiveness experiments
Joint resolution of personal names
Co-reference resolution of names and organizations
MapReduce IR research platform
Batch query processing
Thank You!
MapReduce Framework
input → map → shuffle → reduce → output
Shuffling: groups values by key