Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presenter: Siyuan Hua.
Outline:
- Applications and motivation
- Algorithm
- Google story
- Q&A

Corpora where near-duplicates arise:
- Web documents
- Files in a file system
- E-mails
- Domain-specific corpora

Applications:
- Web mirrors
- Clustering for "related documents"
- Data extraction
- Plagiarism
- Spam detection
- Duplicates in domain-specific corpora

Simhash maps each document to an f-bit value, where each bit summarizes a distinct aspect of the document's features.

Properties of the simhash value:
- The fingerprint of a document is a "hash" of its features
- Similar documents have similar (close in Hamming distance) hash values
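To make the idea concrete, here is a minimal simhash sketch in Python. The choice of MD5 as the per-feature hash, the dict-of-weights input format, and f = 64 are illustrative assumptions, not the paper's exact implementation:

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from weighted features.

    `features` maps feature strings (e.g. tokens) to weights. Each feature
    is hashed to f bits; bit i of the fingerprint is set iff the weighted
    vote over all features (+w where the feature's hash has a 1 in bit i,
    -w where it has a 0) is positive.
    """
    v = [0] * f
    mask = (1 << f) - 1
    for feat, weight in features.items():
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16) & mask
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(f):
        if v[i] > 0:
            fp |= 1 << i
    return fp
```

Because each output bit aggregates votes from every feature, changing a few features flips only a few bits, which is exactly the "similar documents have similar hash values" property.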

Definition:
- Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version there is a set of query fingerprints instead of a single query fingerprint.)

Simple solution:
- Linear search: O(mn) time for m queries against n fingerprints

Scale problem:
- 1M query documents against 8 billion existing web pages in 100 seconds
- The simple solution requires roughly 10^6 x 8x10^9 ~ 10^16 comparisons (impossible in 100 seconds!)
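The O(mn) baseline from the definition above can be sketched in a few lines of Python (fingerprints represented as plain integers):

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")

def linear_search(query, fingerprints, k):
    """O(n) scan: is some existing fingerprint within k bits of the query?"""
    return any(hamming_distance(query, fp) <= k for fp in fingerprints)
```

Running this for every one of m queries gives the O(mn) cost that makes the approach infeasible at web scale.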

Observation (two straightforward approaches, both impractical):
- Pre-compute all fingerprints F' such that the Hamming distance between F' and F is at most k, and probe a sorted table of existing fingerprints. For f = 64 and k = 3 there are C(64,1) + C(64,2) + C(64,3), roughly 43,000, such F' per query. Too much time!
- Pre-compute all F' such that some existing fingerprint is at most Hamming distance k away from F', blowing up the stored set by the same factor. Too much space!

Their solution:
- Setup: build t tables T_1, ..., T_t. Associated with table T_i are two quantities: an integer p_i and a permutation pi_i over the f bit-positions. T_i stores all existing fingerprints permuted by pi_i, sorted by their top p_i bits.
- Given fingerprint F and an integer k, probe these tables in parallel:
- Step 1: identify all permuted fingerprints in T_i whose top p_i bit-positions match the top p_i bit-positions of pi_i(F).
- Step 2: for each permuted fingerprint identified in Step 1, check whether it differs from pi_i(F) in at most k bit-positions.

Example:
- A 64-bit fingerprint divided into 6 blocks: choosing which 3 blocks form the top bit-positions gives C(6,3) = 20 tables.
- Space: reasonable! Time: awesome!
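The 6-block, 20-table example can be sketched as follows. This is a simplified variant, assuming exact hash-table lookups on the chosen blocks rather than the paper's sorted permuted tables; the correctness argument is the same pigeonhole one: k = 3 differing bits can dirty at most 3 of the 6 blocks, so every true match agrees exactly with the query on some choice of 3 blocks:

```python
from itertools import combinations

F_BITS = 64
NUM_BLOCKS = 6     # split 64 bits into 6 blocks (sizes 11,11,11,11,10,10)
BLOCK_CHOICE = 3   # one table per choice of 3 blocks: C(6,3) = 20 tables

def block_bounds(f=F_BITS, n=NUM_BLOCKS):
    """Bit ranges (lo, hi) of the n blocks."""
    sizes = [f // n + (1 if i < f % n else 0) for i in range(n)]
    bounds, start = [], 0
    for s in sizes:
        bounds.append((start, start + s))
        start += s
    return bounds

def block_key(fp, blocks, bounds):
    """Concatenate the chosen blocks of fp into one integer table key."""
    key = 0
    for b in blocks:
        lo, hi = bounds[b]
        key = (key << (hi - lo)) | ((fp >> lo) & ((1 << (hi - lo)) - 1))
    return key

def build_tables(fingerprints):
    """Build the 20 tables, each keyed on a different 3-block combination."""
    bounds = block_bounds()
    tables = []
    for blocks in combinations(range(NUM_BLOCKS), BLOCK_CHOICE):
        table = {}
        for fp in fingerprints:
            table.setdefault(block_key(fp, blocks, bounds), []).append(fp)
        tables.append((blocks, table))
    return tables, bounds

def query(tables, bounds, q, k=3):
    """Step 1: match on the chosen blocks; Step 2: verify Hamming distance."""
    for blocks, table in tables:
        for fp in table.get(block_key(q, blocks, bounds), []):
            if bin(fp ^ q).count("1") <= k:
                return fp
    return None
```

Each table lookup inspects only the fingerprints that agree with the query on roughly 31-33 bits, so Step 2 checks very few candidates, which is the source of the speedup over both brute force and the query-expansion approach.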

Exploration of design parameters:
- (1) A small set of permutations, to avoid a blowup in space requirements
- (2) Large values for the various p_i, to avoid checking too many fingerprints in Step 2

Tradeoff:
- Increasing the number of tables increases the p_i and hence reduces query time; decreasing the number of tables reduces storage requirements, but reduces the p_i and hence increases query time.

Story:
- Assume the existing fingerprints are stored in file F and the batch of query fingerprints in file Q. With 8B 64-bit fingerprints, file F occupies 64 GB.
- They use GFS files, which are broken into 64-MB chunks. Each chunk is replicated on three (almost) randomly chosen machines in a cluster and stored as a file in the local file system.
- F is divided into 64-MB chunks, while Q is shipped in its entirety to every task.
- A MapReduce computation finds all the near-duplicates in parallel.
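The batch-mode structure can be sketched as below. This is a single-process stand-in for the MapReduce job, assuming the brute-force per-chunk scan for clarity (in the paper each chunk would instead be probed with the table-based algorithm, and the chunks would be processed by separate tasks):

```python
def scan_chunk(chunk, queries, k=3):
    """Map phase: emit (query, existing) near-duplicate pairs for one chunk of F."""
    hits = []
    for q in queries:          # Q is small enough to ship to every task whole
        for fp in chunk:
            if bin(q ^ fp).count("1") <= k:
                hits.append((q, fp))
    return hits

def batch_dedup(existing, queries, chunk_size, k=3):
    """Split F into fixed-size chunks and scan each against all of Q.

    The 'reduce' phase here is simply concatenating the per-chunk hit lists;
    in the real system the chunks are 64 MB and the scans run in parallel.
    """
    chunks = [existing[i:i + chunk_size]
              for i in range(0, len(existing), chunk_size)]
    results = []
    for chunk in chunks:
        results.extend(scan_chunk(chunk, queries, k))
    return results
```

The key design point survives even in this toy form: F is partitioned and streamed chunk-by-chunk, while Q travels whole to every worker, so no cross-chunk communication is needed until the final concatenation.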