Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma


1 Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
Presented By Venkatesh Katari

2 Overview
- Why do we care?
- Purpose of the paper
- Proposed solution for finding near-duplicates
- Pros
- Cons
- Future research

3 Why Do We Care?
Why do we want to detect near-duplicates?
- Save storage
- Search quality
- Web mirrors
- Clustering for “related documents” query
- Data extraction
- Plagiarism
- Spam detection
- Duplicates in domain-specific corpora

4 Purpose of The Paper?
This paper addresses the following issues:
- Finding near-duplicates on the web
- Handling the scale of the web
  - Tens of billions of documents indexed
  - Millions of pages crawled every day
- Which features to select when detecting duplicates
- Algorithms for single-query and batch processing
- Survey of other techniques in this field

5 What are Near-Duplicates?
Documents whose content is identical except for a small portion of the page, such as:
- Advertisements
- Counters
- Timestamps

6 Simplified Crawl Architecture
[Figure: the crawler traverses links on the Web and fetches one HTML document at a time. Each newly-crawled document is checked against the entire Web index: is it a near-duplicate? Yes: trash it. No: insert it into the index.]

7 Feature-set per document
- Shingles from page content
- Connectivity information
- Anchor text, anchor window
- Phrases
- Document vector from page content
  - case-folding
  - stop-word removal
  - stemming
  - computing term frequencies and weighting each term by its inverse document frequency
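Two of the feature types listed above, shingles and an IDF-weighted document vector, can be sketched as follows. This is a minimal illustration rather than the paper's pipeline: the function names, the 3-word shingle size, whitespace tokenization, and the smoothed IDF formula are all assumptions, and stop-word removal and stemming are left out.

```python
import math
from collections import Counter

def shingles(text, k=3):
    """Return the set of k-word shingles of a text after case-folding.

    k=3 and whitespace tokenization are illustrative assumptions.
    """
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def tfidf_weights(doc_tokens, corpus):
    """Weight each term of one document by term frequency times IDF.

    corpus is a list of token lists; the +1 smoothing is an assumption.
    """
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = count * math.log((1 + n_docs) / (1 + df))
    return weights
```

A term that occurs in every document of the corpus gets weight 0 under this formula, so boilerplate words contribute nothing to the feature vector.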

8 Simhash
- Dimensionality-reduction technique
- Obtain an f-bit fingerprint for each document
- Two documents are near-duplicates if and only if their fingerprints differ in at most k bits
- Experimental results show that f = 64 and k = 3 work well for detecting near-duplicates
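The near-duplicate test on this slide reduces to a Hamming-distance comparison between two fingerprints. A minimal sketch, assuming fingerprints are held as Python integers (the function names are illustrative):

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(fp_a, fp_b, k=3):
    """True if the fingerprints are at most k bits apart (k=3 as on the slide)."""
    return hamming_distance(fp_a, fp_b) <= k
```

For example, `is_near_duplicate(0b0000, 0b0111)` holds because the values differ in three bits, while `is_near_duplicate(0b0000, 0b1111)` does not.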

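9 Simhash
[Figure: each feature of the document carries a weight (w1, w2, ..., wn) and is hashed to an f-bit value (in the 6-bit example: 100110, 110000, ..., 001001). For every bit position, the feature's weight is added where the hash bit is 1 and subtracted where it is 0. The signs of the column sums give the fingerprint: sums 13, 108, -22, -5, -32, 55 yield fingerprint 110001.]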
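The add-and-take-the-sign procedure in the figure can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the per-feature hash (SHA-256 truncated to f bits) and the function name `simhash` are choices made here, and the bit order within the integer is arbitrary.

```python
import hashlib

def simhash(weighted_features, f=64):
    """Compute an f-bit simhash fingerprint from {feature: weight}.

    Assumes f <= 256 because the per-feature hash is a truncated SHA-256,
    which is a stand-in for whatever hash a real system would use.
    """
    sums = [0.0] * f
    mask = (1 << f) - 1
    for feature, weight in weighted_features.items():
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(f):
            # Add the weight where the hash bit is 1, subtract where it is 0.
            if (h >> i) & 1:
                sums[i] += weight
            else:
                sums[i] -= weight
    # The sign of each column sum decides the corresponding fingerprint bit.
    fingerprint = 0
    for i, total in enumerate(sums):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because heavily weighted features dominate the column sums, two documents that share most of their high-weight features tend to agree in most fingerprint bits.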

10 Method One
- Pre-sorted fingerprints in S
- Exact probes only
- For a 64-bit query Q, generate all Q’ with hd(Q, Q’) ≤ k = 3 and probe S for each
- That is about C(64, 3) (64 choose 3) probes!
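To make the cost of Method One concrete, the sketch below enumerates every variant of a query within Hamming distance k and looks each one up in a sorted list by binary search. The function names are illustrative, and the sorted Python list is only a stand-in for the pre-sorted table S.

```python
from bisect import bisect_left
from itertools import combinations
from math import comb

def variants_within(q, k=3, f=64):
    """Yield every f-bit value whose Hamming distance from q is at most k."""
    for r in range(k + 1):
        for positions in combinations(range(f), r):
            v = q
            for p in positions:
                v ^= 1 << p  # flip bit p
            yield v

def method_one(q, sorted_fps, k=3, f=64):
    """Exact-probe the sorted list once per variant of q."""
    hits = []
    for v in variants_within(q, k, f):
        i = bisect_left(sorted_fps, v)
        if i < len(sorted_fps) and sorted_fps[i] == v:
            hits.append(v)
    return hits

# Variants at distance exactly 3, and at distance 0 through 3 combined.
PROBES_AT_3 = comb(64, 3)
PROBES_UP_TO_3 = sum(comb(64, r) for r in range(4))
```

Here `PROBES_AT_3` evaluates to 41664 and `PROBES_UP_TO_3` to 43745, so each query costs tens of thousands of exact probes.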

11 Method Two
- Start from the fingerprints in S
- Precompute S’: all fingerprints at most k bits away from some fingerprint in S, then sort S’
- Answer a 64-bit query Q with a single exact probe into S’
- But |S’| ≈ |S| · C(64, 3)

12 Final implementation
- Observation 1: Consider 2^d f-bit fingerprints in sorted order
  - Most of the 2^d combinations of the d most significant bits exist
  - Can quickly do an exact probe on the first d’ (≤ d) bits
- Observation 2: if hd(Q, Q’) = 3, some block of bits of Q’ is still an exact match for the same block of Q

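13 Example
[Figure: the 64-bit query Q is split into four 16-bit pieces (A, B, C, D). The fingerprints in S are stored in tables whose leading 16 bits are one of those pieces, and an exact search on 16 bits is run in each table.]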

14 Example: Analysis
- 64 bits split into 4 pieces
- 4 tables with permuted fingerprints
- Exact search on 16 bits in each table
- With 2^34 (≈17 billion) fingerprints, each probe returns about 2^(34-16) = 2^18 fingerprints
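The four-table scheme can be sketched as below. It relies on the pigeonhole argument: if two 64-bit fingerprints differ in at most 3 bits, at least one of their four 16-bit pieces is identical, so an exact match on that piece finds the candidate. This is a simplified illustration: Python dictionaries keyed by one 16-bit piece stand in for the sorted tables of permuted fingerprints, and all names are assumptions.

```python
from collections import defaultdict

F, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F // BLOCKS            # 16 bits per piece
MASK = (1 << BLOCK_BITS) - 1

def block(fp, i):
    """Return the i-th 16-bit piece of a fingerprint."""
    return (fp >> (i * BLOCK_BITS)) & MASK

def build_tables(fingerprints):
    """One table per piece, mapping the piece value to full fingerprints."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in fingerprints:
        for i in range(BLOCKS):
            tables[i][block(fp, i)].append(fp)
    return tables

def query(tables, q, k=K):
    """Exact-match on each piece, then verify the full Hamming distance."""
    found = set()
    for i in range(BLOCKS):
        for cand in tables[i].get(block(q, i), ()):
            if bin(cand ^ q).count("1") <= k:
                found.add(cand)
    return found
```

Each query costs four exact lookups plus a Hamming-distance check per candidate, and the expected number of candidates per lookup is the table size divided by 2^16.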

15 Batch Algorithm
- Tens of billions of pages indexed
- Millions of pages crawled each day
- Goal: quickly find all new pages that have a near-duplicate in the index

16 MapReduce Framework
- MapReduce framework used within Google
- Massively parallel
- Map phase: operate individually on a set of objects
- Reduce phase: aggregate results of the mapped objects

17 Batch Algorithm
- Suppose 8B existing fingerprints (~32GB after compression): file F
- 1M batch query fingerprints (~8MB): file B
- F is stored in a GFS file system, split into chunks of roughly 64MB, each replicated at 3 random nodes
- B is stored with a much higher replication factor

18 Batch Algorithm (continued)
- Map phase: duplicate detection between each chunk Fi and the whole of B
  - Build multiple tables for B (in memory)
  - Scan Fi and probe into B
  - Output near-duplicates in B
- Reduce phase: merge outputs
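The map and reduce phases above can be mimicked in a single process as a rough sketch. Plain Python functions stand in for the MapReduce framework, lists of integers stand in for the chunks of F and for B, and the four 16-bit-piece tables are one possible choice of "multiple tables"; every name here is an assumption.

```python
from collections import defaultdict

BLOCKS, BLOCK_BITS, K = 4, 16, 3
MASK = (1 << BLOCK_BITS) - 1

def build_tables(batch):
    """In-memory tables for B, one per 16-bit piece of the fingerprint."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in batch:
        for i in range(BLOCKS):
            tables[i][(fp >> (i * BLOCK_BITS)) & MASK].append(fp)
    return tables

def map_chunk(chunk, batch_tables):
    """Scan one chunk of F and return the batch fingerprints it nearly duplicates."""
    hits = set()
    for existing in chunk:
        for i in range(BLOCKS):
            key = (existing >> (i * BLOCK_BITS)) & MASK
            for q in batch_tables[i].get(key, ()):
                if bin(q ^ existing).count("1") <= K:
                    hits.add(q)
    return hits

def reduce_outputs(outputs):
    """Merge the per-chunk outputs into one set."""
    merged = set()
    for out in outputs:
        merged |= out
    return merged

def batch_near_duplicates(chunks, batch):
    # Built once here for simplicity; in a distributed run each mapper
    # would build its own in-memory copy of the tables for B.
    tables = build_tables(batch)
    return reduce_outputs(map_chunk(chunk, tables) for chunk in chunks)
```

Indexing the small file B and streaming the large file F past it keeps the in-memory structure small while each chunk of F is read only once.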

19 Pros
- Addressed near-duplicate detection in a web-crawling system
- Proposed algorithms for single and batch cases
- Experiments to validate the suitability of simhash
- Mini-survey of near-duplicate detection techniques in the paper

20 Cons
- Weight selection for the feature set
- Handling of continuously changing IDF
- How to find near-duplicates when data is present in different formats
- Inadequate results

21 References
- G. Manku, A. Jain, A. Das Sarma. Detecting near duplicates for web crawling. WWW 2007.
- M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pages 380-388, 2002.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150, Dec. 2004.
- Articles from Wikipedia etc.

22 Future Research
- Considering document size while detecting near-duplicates
- Pruning the space of existing fingerprints
- Categorizing web pages
- Removal of portions of web pages with ads and timestamps

23 Q & A

