Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma

Presentation transcript:

Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma Presented By Venkatesh Katari

Overview
- Why do we care?
- Purpose of the paper
- Proposed solution for finding near-duplicates
- Pros
- Cons
- Future research

Why Do We Care?
Why do we want to detect near-duplicates?
- Save storage
- Search quality
- Web mirrors
- Clustering for “related documents” queries
- Data extraction
- Plagiarism
- Spam detection
- Duplicates in domain-specific corpora

Purpose of the Paper
This paper addresses the following issues:
- Finding near-duplicates on the web
- Handling the scale of the web: tens of billions of documents indexed, millions of pages crawled every day
- Which features to select when detecting duplicates
- Algorithms for single-query and batch processing
- Survey of other techniques in this field

What are Near-Duplicates?
Documents with identical content that differ only in a small portion, such as:
- Advertisements
- Counters
- Timestamps

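Simplified Crawl Architecture
[Figure: crawl pipeline. The crawler traverses links on the Web and fetches one HTML document at a time. Each newly-crawled document is compared against the entire index. Near-duplicate? Yes: trash. No: insert into the Web Index.]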

Feature-set per document
- Shingles from page content
- Connectivity information
- Anchor text, anchor window
- Phrases
- Document vector from page content, built by:
  - case-folding
  - stop-word removal
  - stemming
  - computing term frequencies and weighting each term by its inverse document frequency

Simhash
- Dimensionality-reduction technique
- Obtain an f-bit fingerprint for each document
- A pair of documents is a near-duplicate if and only if their fingerprints are at most k bits apart
- Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
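
The near-duplicate test on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `hamming_distance` and `is_near_duplicate` are made up for this example and do not come from the paper; fingerprints are assumed to be plain Python integers.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
    """True if the two fingerprints are at most k bits apart.

    k = 3 follows the value quoted on the slide for f = 64.
    """
    return hamming_distance(fp_a, fp_b) <= k
```

For example, `is_near_duplicate(0xFFFF, 0xFFF8)` holds because the two values differ in exactly three bits, while `0xFFFF` and `0xFFF0` differ in four bits and are rejected.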

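Simhash (worked example with a 6-bit fingerprint)
[Figure: each feature of the document is hashed; its weight is added at bit positions where the hash is 1 and subtracted where the hash is 0.]
feature weight   hash     per-bit contribution
w1               100110   +w1 -w1 -w1 +w1 +w1 -w1
w2               110000   +w2 +w2 -w2 -w2 -w2 -w2
wn               001001   -wn -wn +wn -wn -wn +wn
add:  13, 108, -22, -5, -32, 55
sign: 110001 (fingerprint)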
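
The add-then-take-sign procedure in the figure could be written roughly as follows. This is a sketch under assumptions, not the authors' implementation: the feature hash (`feature_hash`, built on MD5 here) is an arbitrary stand-in, the `hash_fn` parameter is a hypothetical hook added so a different hash can be plugged in, and bit position 0 is taken to be the least significant bit.

```python
import hashlib


def feature_hash(feature: str, f: int = 64) -> int:
    """Assumed stand-in hash: low f bits of the MD5 digest of the feature."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weighted_features: dict, f: int = 64, hash_fn=feature_hash) -> int:
    """Compute an f-bit fingerprint from a {feature: weight} mapping."""
    v = [0.0] * f  # one running sum per bit position
    for feature, weight in weighted_features.items():
        h = hash_fn(feature, f)
        for i in range(f):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # The fingerprint has a 1 wherever the running sum is positive.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because heavily weighted features dominate each per-bit sum, documents that share most of their weighted features tend to produce fingerprints that differ in only a few bits.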

Method One
- Fingerprints in S are pre-sorted
- Given a 64-bit query Q, generate all Q’ with hd(Q,Q’) ≤ k = 3 and do an exact probe for each
- C(64, 3) = 41,664 probes!
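
To make the cost of Method One concrete, here is a small sketch that enumerates every probe for a query. The names `probes_within` and `find_near_duplicates` are hypothetical, and a Python `set` stands in for the pre-sorted fingerprint list with exact lookups.

```python
from itertools import combinations
from math import comb


def probes_within(q: int, f: int = 64, k: int = 3):
    """Yield every f-bit value at Hamming distance <= k from q."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            probe = q
            for p in positions:
                probe ^= 1 << p  # flip one of the chosen bits
            yield probe


def find_near_duplicates(q: int, fingerprints: set, f: int = 64, k: int = 3) -> list:
    """Exact-probe the fingerprint set once per generated value."""
    return [p for p in probes_within(q, f, k) if p in fingerprints]


# Probes at distance exactly 3 for 64-bit fingerprints.
PROBES_AT_DISTANCE_3 = comb(64, 3)
```

`comb(64, 3)` evaluates to 41,664; counting distances 0 through 3 together gives 43,745 probes per query, which is why this method is too expensive for online queries.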

Method Two
- Fingerprints in S
- S’: all fingerprints at most k bits away from a fingerprint in S, precomputed and sorted
- Given a 64-bit query Q, do one exact probe into S’
- But |S’| ≈ |S| · C(64, 3)

Final implementation
Observation 1: Consider 2^d f-bit fingerprints in sorted order.
- Most of the 2^d combinations of the d most significant bits exist
- Can quickly do an exact probe on the first d’ (≤ d) bits
Observation 2: Even when hd(Q,Q’) = 3, the part of Q’ that does not contain the differing bits is an exact match with Q.

Example
[Figure: exact search on 16 bits. Labels: 64-bit Q; 16-bit pieces Q1, Q2, Q3; pieces A, B, C, D; fingerprints in S.]

Example: Analysis
- 64 bits split into 4 pieces
- 4 tables with permuted fingerprints
- Exact search on 16 bits
- If there are 2^34 (about 17 billion) fingerprints, each probe returns about 2^(34-16) = 2^18 fingerprints
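
The four-table scheme above can be sketched as follows. This is an illustrative reading of the slide, not the paper's code: bit rotation is assumed as the permutation (it preserves Hamming distance), each table is a sorted Python list searched with `bisect`, and the names `build_tables` and `query` are made up for the example.

```python
from bisect import bisect_left

F = 64                     # fingerprint width in bits
BLOCKS = 4                 # number of pieces / tables
BLOCK_BITS = F // BLOCKS   # 16 bits searched exactly per table
MASK = (1 << F) - 1


def rotate_left(x: int, r: int) -> int:
    """Rotate an F-bit value left by r bits."""
    r %= F
    return ((x << r) | (x >> (F - r))) & MASK


def build_tables(fingerprints) -> list:
    """Table t holds every fingerprint rotated so piece t sits in the top 16 bits."""
    return [sorted(rotate_left(fp, t * BLOCK_BITS) for fp in fingerprints)
            for t in range(BLOCKS)]


def query(tables: list, q: int, k: int = 3) -> set:
    """Return all stored fingerprints within Hamming distance k of q."""
    matches = set()
    shift = F - BLOCK_BITS
    for t, table in enumerate(tables):
        rq = rotate_left(q, t * BLOCK_BITS)
        prefix = rq >> shift
        # Exact search on the top 16 bits: a contiguous run in the sorted table.
        lo = bisect_left(table, prefix << shift)
        hi = bisect_left(table, (prefix + 1) << shift)
        for cand in table[lo:hi]:
            if bin(cand ^ rq).count("1") <= k:
                # Undo the rotation to recover the original fingerprint.
                matches.add(rotate_left(cand, F - t * BLOCK_BITS))
    return matches
```

With k = 3 and four pieces, at most three pieces can contain a differing bit, so at least one piece matches exactly and the near-duplicate shows up in that piece's table. Each probe then only has to check the candidates sharing those 16 bits, roughly 2^18 of them for 2^34 evenly spread fingerprints.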

Batch Algorithm
- Tens of billions of pages indexed
- Millions of pages crawled each day
- Goal: quickly find all new pages that have a near-duplicate in the index

MapReduce Framework
- MapReduce framework is used within Google
- Map phase: massively parallel; operates individually on a set of objects
- Reduce phase: aggregates the results of the mapped objects

Batch Algorithm
- Suppose 8B existing fingerprints (~32GB after compression): file F
- 1M batch query fingerprints (~8MB): file B
- F is stored in GFS, chunked into pieces of roughly 64MB, each replicated at 3 random nodes
- B is stored with a much higher replication factor

Batch Algorithm (continued)
Map phase: duplicate detection between each chunk Fi and the whole of B
- Build multiple tables for B (in memory)
- Scan Fi and probe into B
- Output near-duplicates in B
Reduce phase:
- Merge outputs
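
The map and reduce steps can be mimicked on a single machine to show the data flow. Everything here is a simplified sketch under assumptions: chunks are plain Python lists rather than GFS chunks, the in-memory "tables" for B are a dictionary keyed by (piece index, 16-bit piece value), and the names `build_batch_index`, `map_chunk`, and `reduce_outputs` are invented for the example.

```python
from collections import defaultdict

F, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F // BLOCKS


def blocks(fp: int) -> list:
    """Split a fingerprint into (piece index, 16-bit piece value) keys."""
    mask = (1 << BLOCK_BITS) - 1
    return [(i, (fp >> (i * BLOCK_BITS)) & mask) for i in range(BLOCKS)]


def build_batch_index(batch) -> dict:
    """In-memory lookup structure for the batch file B."""
    index = defaultdict(list)
    for fp in batch:
        for key in blocks(fp):
            index[key].append(fp)
    return index


def map_chunk(chunk, batch_index) -> set:
    """Map task: scan one chunk Fi and probe each fingerprint into B."""
    found = set()
    for existing in chunk:
        for key in blocks(existing):
            for q in batch_index.get(key, ()):
                if bin(existing ^ q).count("1") <= K:
                    found.add(q)  # q in B has a near-duplicate in the index
    return found


def reduce_outputs(outputs) -> set:
    """Reduce task: merge the per-chunk outputs."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged
```

A run over two chunks would look like `reduce_outputs(map_chunk(c, index) for c in chunks)`, where `index = build_batch_index(batch)`. Keeping B small and fully in memory is what lets every map task scan its chunk of F sequentially.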

Pros
- Addressed near-duplicate detection in a web-crawling system
- Proposed algorithms for single and batch cases
- Experiments to validate the suitability of simhash
- Mini-survey of near-duplicate detection techniques in the paper

Cons
- Weight selection for the feature set
- Handling of continuously changing IDF
- How to find near-duplicates when data is present in different formats
- Inadequate results

References
- G. Manku, A. Jain, A. Das Sarma. Detecting near-duplicates for web crawling. WWW 2007, pp. 141-150, 2007.
- M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.
- Articles from Wikipedia etc.

Future Research
- Considering document size while detecting near-duplicates
- Pruning the space of existing fingerprints
- Categorizing web pages
- Removal of portions of web pages with ads and timestamps

Q & A