Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. Association for Computational Linguistics, 2008.

Similar presentations
Information Retrieval in Practice

MapReduce Simplified Data Processing on Large Clusters
MapReduce.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.
University of Minnesota CG_Hadoop: Computational Geometry in MapReduce Ahmed Eldawy* Yuan Li* Mohamed F. Mokbel*$ Ravi Janardan* * Department of Computer.
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
Felix Halim, Roland H.C. Yap, Yongzheng Wu
Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1.
Cloud Computing Lecture #3 More MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, September 10, 2008 This work is licensed under a Creative.
IMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University,
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Jimmy Lin The iSchool University of Maryland Wednesday, April 15, 2009
ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,
ACL, June Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard University of Maryland,
Sorting, Searching, and Simulation in the MapReduce Framework Michael T. Goodrich Dept. of Computer Science.
Ivory : Ivory : Pairwise Document Similarity in Large Collection with MapReduce Tamer Elsayed, Jimmy Lin, and Doug Oard Laboratory for Computational Linguistics.
Map-Reduce and Parallel Computing for Large-Scale Media Processing Youjie Zhou.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
A Search-based Method for Forecasting Ad Impression in Contextual Advertising Defense.
MapReduce Simplified Data Processing On large Clusters Jeffery Dean and Sanjay Ghemawat.
Large-Scale Content-Based Image Retrieval Project Presentation CMPT 880: Large Scale Multimedia Systems and Cloud Computing Under supervision of Dr. Mohamed.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Tamer Elsayed Qatar University On Large-Scale Retrieval Tasks with Ivory and MapReduce Nov 7 th, 2012.
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
Terrier: TERabyte RetRIevER An Introduction By: Kavita Ganesan (Last Updated April 21 st 2009)
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Charu Aggarwal + * Department of Computer Science, University of Texas at Dallas + IBM T. J. Watson.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
Appraisal and Data Mining of Large Size Complex Documents Rob Kooper, William McFadden and Peter Bajcsy National Center for Supercomputing Applications.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
HAMS Technologies 1
Bayesian Sets Zoubin Ghahramani and Kathertine A. Heller NIPS 2005 Presented by Qi An Mar. 17 th, 2006.
Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Michael Baron + * Department of Computer Science, University of Texas at Dallas + Department of Mathematical.
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
Clustering Personalized Web Search Results Xuehua Shen and Hong Cheng.
Large-scale Incremental Processing Using Distributed Transactions and Notifications Daniel Peng and Frank Dabek Google, Inc. OSDI Feb 2012 Presentation.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
 Used MapReduce algorithms to process a corpus of web pages and develop required index files  Inverted Index evaluated using TREC measures  Used Hadoop.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
HAMA: An Efficient Matrix Computation with the MapReduce Framework Sangwon Seo, Edward J. Woon, Jaehong Kim, Seongwook Jin, Jin-soo Kim, Seungryoul Maeng.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval Ronan Cummins, Colm O’Riordan Digital Enterprise Research Institute SIGIR.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Andrew.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
Date : 2016/08/09 Advisor : Jia-ling Koh Speaker : Yi-Yui Lee
A Simple Approach for Author Profiling in MapReduce
A Straightforward Author Profiling Approach in MapReduce
Optimizing Parallel Algorithms for All Pairs Similarity Search
Implementation Issues & IR Systems
Introduction to Spark.
Cse 344 May 4th – Map/Reduce.
MapReduce Basics, Chapter 2, Lin and Dyer
Group 15 Swathi Gurram Prajakta Purohit
Learning with Hadoop – A case study on MapReduce based Data Mining
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Presentation transcript:

Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. Association for Computational Linguistics, 2008. Presented by Kyung-Bin Lim, May 15, 2014.

2 / 19 Outline
- Introduction
- Methodology
- Results
- Conclusion

3 / 19 Pairwise Similarity of Documents
- PubMed: "More like this"
- Similar blog posts
- Google: Similar pages

4 / 19 Abstract Problem
- Compute the similarity of every pair of documents in a collection
- Applications: clustering, "more-like-this" queries

5 / 19 Outline
- Introduction
- Methodology
- Results
- Conclusion

6 / 19 Trivial Solution
- Load each document vector O(N) times
- Compute O(N^2) dot products
Goal: a scalable and efficient solution for large collections
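To make the baseline concrete, here is a minimal sketch in plain Python (not from the paper) that materializes all O(N^2) dot products over sparse term-weight vectors. The three toy documents are the ones used in the indexing example later in the talk, with raw term frequencies standing in for real weights.

```python
from itertools import combinations

def dot(u, v):
    # Sparse dot product: iterate over the smaller vector.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def trivial_all_pairs(vectors):
    # N documents -> O(N^2) dot products; each vector is loaded O(N) times.
    return {(a, b): dot(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}

# Toy collection (term -> weight), matching the example documents d1, d2, d3.
docs = {
    "d1": {"A": 2, "B": 1, "C": 1},   # "A A B C"
    "d2": {"B": 1, "D": 2},           # "B D D"
    "d3": {"A": 1, "B": 2, "E": 1},   # "A B B E"
}
print(trivial_all_pairs(docs))
# {('d1', 'd2'): 1, ('d1', 'd3'): 4, ('d2', 'd3'): 2}
```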

7 / 19 Better Solution
- Load the weights for each term only once
- Each term t contributes O(df_t^2) partial scores
- A term contributes to a document pair only if it appears in both documents

8 / 19 Better Solution
- A term contributes to each pair of documents that contain it
- An inverted index gives, for each term, the list of documents that contain it
- For example, if term t1 appears in documents x, y, z, then t1 contributes to the pairs (x, y), (x, z), and (y, z)
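The pair generation for a single postings list is just the set of 2-combinations, as this two-line Python sketch shows:

```python
from itertools import combinations

postings = ["x", "y", "z"]  # documents containing term t1
print(list(combinations(postings, 2)))  # [('x', 'y'), ('x', 'z'), ('y', 'z')]
```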

9 / 19 Algorithm

10 / 19 MapReduce Programming
- Framework that supports distributed computing on clusters of computers
- Introduced by Google in 2004
- Map step
- Reduce step
- Combine step (optional)
- Applications
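To make the model concrete before applying it, here is a single-process Python simulation of MapReduce (an illustration of the programming model only, not Hadoop), run on the classic word-count task. The run_mapreduce helper is reused in the job sketches below.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map step: emit intermediate (key, value) pairs.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce step: combine all values that share a key.
    yield word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # Reduce each group independently (in parallel on a real cluster).
    return [out for k, vs in sorted(groups.items()) for out in reducer(k, vs)]

corpus = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
print(run_mapreduce(corpus, map_fn, reduce_fn))
# [('A', 3), ('B', 4), ('C', 1), ('D', 2), ('E', 1)]
```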

11 / 19 MapReduce Model

12 / 19 Computation Decomposition
- Load the weights for each term only once; a term contributes to a pair only if it appears in both documents
- map: for each term, generate a partial score for every pair of documents in its postings list, O(df_t^2) partial scores per term
- reduce: sum the partial scores for each document pair

13 / 19 MapReduce Jobs
- Job 1: inverted index computation
- Job 2: pairwise similarity

14 / 19 Job 1: Inverted Index
Example documents: d1 = "A A B C", d2 = "B D D", d3 = "A B B E".
- map: emit (term, (doc, tf)) tuples, e.g. (A,(d1,2)), (B,(d1,1)), (C,(d1,1)) from d1; (B,(d2,1)), (D,(d2,2)) from d2; (A,(d3,1)), (B,(d3,2)), (E,(d3,1)) from d3
- shuffle: group the tuples by term
- reduce: output each term's postings list: (A,[(d1,2),(d3,1)]), (B,[(d1,1),(d2,1),(d3,2)]), (C,[(d1,1)]), (D,[(d2,2)]), (E,[(d3,1)])
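A sketch of Job 1 on the example documents, reusing the run_mapreduce helper and corpus from the word-count sketch above (raw term frequencies stand in for the BM25 weights used in the paper):

```python
def index_map(doc_id, text):
    # Emit (term, (doc_id, tf)) for each distinct term in the document.
    tf = {}
    for word in text.split():
        tf[word] = tf.get(word, 0) + 1
    for term, count in tf.items():
        yield term, (doc_id, count)

def index_reduce(term, postings):
    # A term's postings list: every (doc_id, weight) tuple that mentions it.
    yield term, sorted(postings)

index = run_mapreduce(corpus, index_map, index_reduce)
print(index)
# [('A', [('d1', 2), ('d3', 1)]), ('B', [('d1', 1), ('d2', 1), ('d3', 2)]),
#  ('C', [('d1', 1)]), ('D', [('d2', 2)]), ('E', [('d3', 1)])]
```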

15 / 19 Job 2: Pairwise Similarity
- map: for each postings list, emit a partial score for every pair of documents in it: A yields ((d1,d3),2); B yields ((d1,d2),1), ((d1,d3),2), ((d2,d3),2)
- shuffle: group the partial scores by document pair: ((d1,d2),[1]), ((d1,d3),[2,2]), ((d2,d3),[2])
- reduce: sum the partial scores: ((d1,d2),1), ((d1,d3),4), ((d2,d3),2)
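A matching sketch of Job 2, again reusing run_mapreduce and the index built above (an illustration of the algorithm, not the paper's Hadoop code):

```python
from itertools import combinations

def similarity_map(term, postings):
    # Each term contributes w_i * w_j to every pair of documents
    # in its postings list.
    for (doc_i, w_i), (doc_j, w_j) in combinations(postings, 2):
        yield (doc_i, doc_j), w_i * w_j

def similarity_reduce(pair, partial_scores):
    # The similarity of a pair is the sum of its per-term partial scores.
    yield pair, sum(partial_scores)

print(run_mapreduce(index, similarity_map, similarity_reduce))
# [(('d1', 'd2'), 1), (('d1', 'd3'), 4), (('d2', 'd3'), 2)]
```

The output matches the trivial solution's dot products, but each postings list, and hence each term's weights, is read only once.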

16 / 19 Implementation Issues
- df-cut: drop the most common terms
- The intermediate tuples are dominated by very high-df terms
- A 99% df-cut was implemented
- The cut trades efficiency against effectiveness
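A sketch of a df-cut in the toy setting (one plausible reading: rank terms by document frequency and drop the most frequent ones; the exact threshold semantics in the paper may differ):

```python
def apply_df_cut(index, keep_fraction=0.99):
    # Rank terms by document frequency (ascending) and keep only the
    # lowest-df fraction, dropping the most common terms.
    ranked = sorted(index, key=lambda entry: len(entry[1]))
    return ranked[:int(len(ranked) * keep_fraction)]

pruned = apply_df_cut(index, keep_fraction=0.8)  # drops B, the highest-df term
print([term for term, postings in pruned])  # ['C', 'D', 'E', 'A']
```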

17 / 19 Outline
- Introduction
- Methodology
- Results
- Conclusion

18 / 19 Experimental Setup
- Hadoop
- Cluster of 19 machines, each with two single-core processors
- AQUAINT-2 collection: 2.5 GB of text, 906k documents
- Okapi BM25 term weights
- Experiments on subsets of the collection

19 / 19 Running Time of Pairwise Similarity Comparisons

20 / 19 Number of Intermediate Pairs

21 / 19 Outline
- Introduction
- Methodology
- Results
- Conclusion

22 / 19 Conclusion
- Simple and efficient MapReduce solution: about two hours for a roughly million-document collection
- Effective approximation with linear-time scaling: a 99.9% df-cut achieves 98% relative accuracy
- The df-cut controls the efficiency vs. effectiveness trade-off