Detecting Near Duplicates for Web Crawling. Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presented by Chintan Udeshi, 6/28/2011, Udeshi-CS572.


1 Detecting Near Duplicates for Web Crawling
Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presented by Chintan Udeshi
6/28/2011, Udeshi-CS572

2 Introduction
• There are many duplicate documents on the web.
• Many pages differ only in a small portion, for example because of the advertisements displayed.
• Such pages are irrelevant from a crawling point of view.
• This paper uses Charikar's fingerprinting technique (simhash) to find near-duplicate documents.
• The technique is useful for both online queries and batch queries.

3 Advantages of duplicate detection
• Saves bandwidth
• Reduces storage cost
• Improves search engine quality
• Reduces load on remote hosts

4 Limitations of duplicate detection
• Scaling
• Speed
• Must use few resources

5 Fingerprinting with simhash
• Extract a set of features from a document, along with a corresponding weight for each feature.
• We use simhash to generate an f-bit fingerprint from the weighted features of a given document.
• With simhash, 64-bit fingerprints are good enough for 8B web pages.
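The fingerprinting step above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `simhash`, the use of BLAKE2b as the per-feature hash, and the representation of features as a dict of string to weight are all assumptions made for the example.

```python
import hashlib

def simhash(weighted_features, f=64):
    """Return an f-bit simhash fingerprint of a {feature: weight} mapping.

    Assumes f is a multiple of 8 (at most 512) so that the per-feature
    hash can be truncated to exactly f bits.
    """
    v = [0.0] * f  # one running total per bit position
    for feature, weight in weighted_features.items():
        # Hash each feature to f bits. BLAKE2b is an arbitrary choice here.
        digest = hashlib.blake2b(feature.encode("utf-8"),
                                 digest_size=f // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(f):
            # A 1-bit adds the feature's weight, a 0-bit subtracts it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # The sign of each total decides the corresponding fingerprint bit.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because every feature only nudges the per-bit totals, documents that share most of their heavily weighted features tend to end up with fingerprints that differ in few bits.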

6 Idea behind using the simhash algorithm
Simhash has two properties:
• A: The fingerprint of a document is a hash of its features.
• B: Similar documents have similar hash values.
Our algorithms are designed assuming that Property A holds, and we experimentally measure the impact of the non-uniformity introduced by Property B on real datasets.

7 Hamming distance problem
• Consider a collection of 8B 64-bit fingerprints, occupying 64 GB.
• Given a query fingerprint F, we have to decide whether any of the existing 8B fingerprints differs from F in at most k = 3 bit positions.
• The algorithm is different for online queries and batch queries.
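As a small illustration of the check at the heart of this problem, the Hamming distance between two fingerprints is the number of 1-bits in their XOR. The helper names below are hypothetical, chosen only for this sketch.

```python
def hamming_distance(a, b):
    """Number of bit positions in which fingerprints a and b differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, k=3):
    """True if the two fingerprints differ in at most k bit positions."""
    return hamming_distance(a, b) <= k

# Storage for 8B fingerprints of 64 bits (8 bytes) each, in bytes.
STORAGE_BYTES = 8 * 10**9 * (64 // 8)  # 64 * 10**9, i.e. 64 GB
```

Comparing a query against all 8B fingerprints this way would be a full linear scan per query, which is why the slides that follow describe table-based and batch algorithms.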

8 Algorithm for online queries
• We build t tables: T1, T2, ..., Tt.
• Each table Ti has an associated permutation and an integer pi; Ti is constructed by applying its permutation to each existing fingerprint.
• A query for fingerprint F has two steps:
  • Identify all permuted fingerprints in Ti whose top pi bit positions match the top pi bit positions of the permuted F.
  • For each fingerprint identified in the first step, check whether it differs from the permuted F in at most k bit positions.
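The two query steps can be sketched with a simplified variant of the table scheme. This is an assumption-laden illustration rather than the algorithm as presented: it splits each fingerprint into k + 1 equal blocks and keeps one hash table per block, which plays the role of "permute that block to the top and match the top pi bits". If two fingerprints differ in at most k bits, at least one of the k + 1 blocks must be identical, so every true match shows up as a candidate in some table. The class name `NearDuplicateIndex` and the use of Python dicts instead of sorted tables are choices made for the example.

```python
from collections import defaultdict

class NearDuplicateIndex:
    """Block-based index for fingerprints within Hamming distance k."""

    def __init__(self, f=64, k=3):
        self.f = f
        self.k = k
        self.num_blocks = k + 1
        self.block_bits = f // self.num_blocks  # assumes f divisible by k + 1
        # One table per block, keyed by the value of that block.
        self.tables = [defaultdict(list) for _ in range(self.num_blocks)]

    def _block(self, fingerprint, i):
        """Extract the i-th block of bits from a fingerprint."""
        shift = i * self.block_bits
        return (fingerprint >> shift) & ((1 << self.block_bits) - 1)

    def add(self, fingerprint):
        for i, table in enumerate(self.tables):
            table[self._block(fingerprint, i)].append(fingerprint)

    def query(self, fingerprint):
        """Return stored fingerprints within Hamming distance k."""
        matches = set()
        for i, table in enumerate(self.tables):
            # Step 1: candidates whose i-th block equals the query's block.
            for candidate in table.get(self._block(fingerprint, i), ()):
                # Step 2: verify the full Hamming distance.
                if bin(candidate ^ fingerprint).count("1") <= self.k:
                    matches.add(candidate)
        return matches
```

With f = 64 and k = 3 this gives four tables, each matching on 16 bits. Using more, narrower blocks and matching on several of them at once increases the number of tables but shrinks the candidate lists.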

9 Design parameters for the algorithm
• There is a trade-off between the number of tables and the value of pi chosen for each table.
• Increasing the number of tables increases pi and hence reduces the query time.
• Decreasing the number of tables reduces storage requirements, but reduces pi and thus increases the query time.
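The trade-off can be made concrete with a rough calculation. The sketch below assumes a single-level block design (split f bits into equal blocks and require all but k of them to match) and uniformly distributed fingerprints. The function name `table_design` and the specific numbers in the usage note are illustrative assumptions, not values taken from the slides.

```python
from math import comb

def table_design(f, k, num_blocks, log2_n):
    """Estimate (number of tables, p, expected candidates per probe).

    f          -- fingerprint width in bits
    k          -- maximum Hamming distance
    num_blocks -- number of equal blocks each fingerprint is split into
    log2_n     -- log2 of the number of stored fingerprints
    """
    matching_blocks = num_blocks - k            # blocks guaranteed identical
    num_tables = comb(num_blocks, matching_blocks)
    p = (f * matching_blocks) // num_blocks     # approximate bits matched
    # With uniform fingerprints, a p-bit prefix match keeps about 2^(d - p).
    expected_candidates = 2 ** max(log2_n - p, 0)
    return num_tables, p, expected_candidates
```

For example, with f = 64, k = 3 and 2^34 fingerprints, `table_design(64, 3, 4, 34)` yields 4 tables with p = 16 and about 2^18 candidates per probe, while `table_design(64, 3, 6, 34)` yields 20 tables with p = 32 and about 4 candidates per probe: more tables, more storage, far less work per query.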

10 Algorithm for batch queries
• Files are first broken into 64 MB chunks.
• Each chunk is replicated at three randomly chosen machines in a cluster.
• Each chunk is stored as a file in the local file system.
• First, we solve the Hamming distance problem for each 64 MB chunk.
• Then we combine the output from all the chunks to produce the final output.
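The two phases above (solve per chunk, then combine) can be sketched on a single machine as follows. This is only a toy model of the idea: the function names are hypothetical, chunks are plain Python list slices rather than 64 MB files, and the per-chunk scan is a brute-force comparison.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def scan_chunk(chunk, queries, k=3):
    """Phase 1: solve the Hamming distance problem for one chunk."""
    return [(q, fp) for fp in chunk for q in queries
            if bin(fp ^ q).count("1") <= k]

def batch_near_duplicates(fingerprints, queries, chunk_size, k=3):
    """Phase 2: merge per-chunk outputs, drop duplicates, and sort."""
    partial_outputs = [scan_chunk(chunk, queries, k)
                       for chunk in chunked(fingerprints, chunk_size)]
    merged = set()
    for output in partial_outputs:
        merged.update(output)
    return sorted(merged)
```

Each call to `scan_chunk` is independent of the others, which is what makes the first phase easy to spread across many machines.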

11 Broder's shingle-based fingerprints
• Broder's shingle-based fingerprints use Rabin fingerprints.
• Given an n-bit message $m_0, \ldots, m_{n-1}$, view it as the polynomial $f(x) = m_0 + m_1 x + \cdots + m_{n-1} x^{n-1}$ over GF(2). For a chosen irreducible polynomial $p(x)$, the fingerprint of $m$ is the remainder $r(x)$ after division of $f(x)$ by $p(x)$.
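The remainder computation can be sketched as bitwise polynomial division over GF(2), where subtraction is XOR. The function name and the tiny modulus used in the usage note are assumptions for illustration; a practical Rabin fingerprint would use a randomly chosen irreducible polynomial of much higher degree.

```python
def rabin_fingerprint(message_bits, p, degree):
    """Remainder of f(x) modulo p(x) over GF(2).

    message_bits -- list of 0/1 values, message_bits[i] is the coefficient
                    m_i of x^i in f(x)
    p            -- integer whose bits are the coefficients of p(x),
                    including the leading term
    degree       -- degree of p(x)
    """
    r = 0
    # Process coefficients from the highest degree down to m_0.
    for bit in reversed(message_bits):
        r = (r << 1) | bit       # multiply by x and add the next coefficient
        if (r >> degree) & 1:    # degree of r reached that of p(x)
            r ^= p               # subtract p(x), which is XOR over GF(2)
    return r
```

With the toy modulus p(x) = x^3 + x + 1 (`p = 0b1011`, `degree = 3`), the message f(x) = x^3, given as `[0, 0, 0, 1]`, has remainder x + 1, so `rabin_fingerprint([0, 0, 0, 1], 0b1011, 3)` returns `0b011`.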

12 Comparison with Broder's shingle-based fingerprints
• For the comparison, 6 Rabin fingerprints are calculated per document.
• Two documents are then checked to see whether 2 or more of these fingerprints match.
• The resulting shingle-based fingerprint takes approximately 24 bytes per document.
• In contrast, simhash needs only a 64-bit (8-byte) fingerprint per page for a corpus of 8B web pages.

13 Experimental results
There is a trade-off between f and k when detecting near-duplicate web pages with simhash. Topics include:
• Choice of parameters
• Distribution of fingerprints
• Scalability

14 Choice of parameters
• Vary k from 1 to 10.
• Divide pages into categories:
  • False positive
  • True positive
  • Unknown
• There is a trade-off between precision and recall as k varies.
• k = 3 gives reasonable results for 64-bit fingerprints.

15 Distribution of fingerprints (1)
• The left side of the distribution does not drop off as rapidly as the right side.
• This is because some pages are similar to each other.
• As a result, their fingerprints differ in only a moderate number of bits.

16 Distribution of fingerprints (2)
• More or less uniform, with spikes in some places.
• Reasons for the spikes:
  • Empty pages
  • "File not found" pages
  • Multiple websites using similar login pages

17 Nature of corpus
Near-duplicate detection systems mainly target four types of document collections:
• Web pages
• Files in a file system
• E-mail
• Domain-specific corpora
This paper is mainly concerned with finding near duplicates among web pages.

18 Scalability
• For batch mode, a compressed version of file Q occupies almost 32 GB.
• The data can be scanned at a rate of approximately 1 GBps.
• So the computation usually finishes within about 100 seconds.

19 Need to detect duplicates
• Web mirrors
• Clustering for related-documents queries
• Data extraction
• Plagiarism
• Spam detection
• Duplicates in domain-specific corpora

20 Feature set per document
• Shingles from page content
• Document vector from page content
• Connectivity information
• Anchor text and anchor window
• Phrases

21 Future research
• Can we classify web pages into categories and search for near duplicates only within the relevant categories?
• Is it feasible to devise algorithms that detect the portions of web pages containing ads or timestamps?
• How sensitive is the simhash algorithm to changes in feature selection and in the assignment of weights to features?
• Algorithms for clustering the documents.
• Can we categorize documents based on language?

22 Thank you. Q & A?

