Detecting Near-Duplicates for Web Crawling Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma Presented by Yen-Yi Hung
Overview: Introduction; Previous Related Work; Algorithm; Evaluation; Future Work; Pros & Cons; Comment; Reference
Introduction – Drawbacks of duplicate pages: they waste network bandwidth; affect refresh times; impact politeness constraints; increase storage cost; affect the quality of search indexes; increase the load on the remote host that serves such web pages; and affect customer satisfaction.
Introduction – Challenge and contribution of this paper. Challenge: dealing with scale, and determining near-duplicates efficiently. Contribution: showing that simhash can be used at a very large query volume; developing a way to solve the Hamming Distance Problem quickly, both for online single queries and for batch multi-queries.
Previous Related Work: Related techniques differ in the corpus they target, their end goals, their feature sets, and their signature schemes. Corpus: web documents; files in a file system; e-mails; domain-specific corpora.
Previous Related Work (II): End goals: web mirrors; clustering for related-documents queries; data extraction; plagiarism; spam detection; duplicates in domain-specific corpora. Feature sets: shingles from page content; document vector from page content; connectivity information; anchor text and anchor window; phrases.
Previous Related Work (III): Signature schemes: mod-p shingles; min-hash for Jaccard similarity of sets; signatures/fingerprints over IR-based document vectors; checksums. This paper focuses on web documents. Its goal is to improve web crawling using the simhash technique.
Algorithm – Simhash fingerprinting. What can simhash do? It maps high-dimensional vectors to small-sized fingerprints. The atypical simhash property: similar documents have similar hash values. How is it applied? Each web page is converted to a set of weighted features (computed using standard IR techniques), and that feature set is reduced to a fingerprint.
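A minimal Python sketch of the fingerprinting idea may help here. It assumes 64-bit fingerprints, features given as a {feature: weight} mapping, and BLAKE2b as the per-feature hash; the names feature_hash, simhash and hamming_distance and the choice of hash function are illustrative assumptions, not details taken from the slides.

```python
import hashlib

F_BITS = 64  # fingerprint width (assumed for this sketch)


def feature_hash(feature):
    """Hash one feature string to a 64-bit integer (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(weighted_features):
    """Reduce a {feature: weight} mapping to a 64-bit fingerprint."""
    counts = [0.0] * F_BITS
    for feature, weight in weighted_features.items():
        h = feature_hash(feature)
        for i in range(F_BITS):
            # A 1-bit adds the feature's weight to that position, a 0-bit subtracts it.
            if (h >> i) & 1:
                counts[i] += weight
            else:
                counts[i] -= weight
    fingerprint = 0
    for i in range(F_BITS):
        # The sign of each accumulated component decides the fingerprint bit.
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    page_a = {"near": 1.0, "duplicate": 2.0, "web": 1.0, "crawling": 1.5}
    page_b = {"near": 1.0, "duplicate": 2.0, "web": 1.0, "crawler": 1.5}
    print(hamming_distance(simhash(page_a), simhash(page_b)))
```

Because every bit is decided by a weighted vote over all features, changing a low-weight feature flips only the bits where the vote was close, which is why similar feature sets tend to yield fingerprints that differ in few bits.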
Algorithm – Hamming Distance Problem
Given a collection of f-bit fingerprints and a query fingerprint F, we need to identify whether an existing fingerprint differs from F in at most k bits. Simply probing the fingerprint collection is impractical, so what should we do?
1. Build t tables T_1, T_2, …, T_t. Each table T_i has an integer p_i and a permutation Π_i.
2. Apply permutation Π_i to each existing fingerprint in table T_i, and sort each T_i.
Algorithm – Hamming Distance Problem
3. Given a query fingerprint F and an integer k (the maximum Hamming distance allowed), probe each table T_i in two steps.
Step 1: Find all permuted fingerprints in T_i whose top p_i bit-positions match the top p_i bit-positions of Π_i(F).
Step 2: For each fingerprint found in Step 1, check whether it differs from Π_i(F) in at most k bit-positions.
Time complexity: Step 1 can be done in O(p_i) steps using binary search. If the fingerprints behave like random bit sequences, interpolation search shrinks this to O(log p_i) expected steps.
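The table construction and the two query steps can be sketched in Python as follows. This is one simple configuration chosen for illustration, not necessarily the one used in the paper: 64-bit fingerprints, t = k + 1 tables, every p_i equal to 64 // (k + 1), and each permutation Π_i a bit rotation that brings a different block of bits to the top. Since k differing bits can touch at most k of the k + 1 disjoint blocks, at least one table finds each near-duplicate. The class name HammingIndex and the helper rotate_left are hypothetical.

```python
from bisect import bisect_left

F_BITS = 64
MASK = (1 << F_BITS) - 1


def rotate_left(x, r):
    """Rotate a 64-bit value left by r positions (the permutation used here)."""
    r %= F_BITS
    return ((x << r) | (x >> (F_BITS - r))) & MASK


class HammingIndex:
    def __init__(self, fingerprints, k=3):
        self.k = k
        self.p = F_BITS // (k + 1)  # top bits that must match exactly
        self.tables = []
        for i in range(k + 1):
            shift = i * self.p
            # Permute every fingerprint, then sort the table.
            table = sorted(rotate_left(fp, shift) for fp in fingerprints)
            self.tables.append((shift, table))

    def query(self, fp):
        """Return all stored fingerprints within Hamming distance k of fp."""
        matches = set()
        low_bits = F_BITS - self.p
        for shift, table in self.tables:
            permuted = rotate_left(fp, shift)
            prefix = permuted >> low_bits
            # Step 1: binary search for the run sharing the top p bits.
            lo = bisect_left(table, prefix << low_bits)
            hi = bisect_left(table, (prefix + 1) << low_bits)
            # Step 2: check the full Hamming distance of each candidate.
            for candidate in table[lo:hi]:
                if bin(candidate ^ permuted).count("1") <= self.k:
                    # Undo the rotation to recover the original fingerprint.
                    matches.add(rotate_left(candidate, F_BITS - shift))
        return matches
```

The sketch uses binary search from the standard bisect module for Step 1. The trade-off between the number of tables and the size of p_i (fewer tables mean more candidates to check per probe) is not explored here.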
Algorithm – Compression of Fingerprints
Step 1: The first fingerprint in the block is stored in its entirety.
Step 2: Compute the XOR of two successive fingerprints, and let h denote the position of its most-significant 1-bit.
Step 3: Append the Huffman code of h to the block.
Step 4: Append the bits to the right of the most-significant 1-bit to the block.
Step 5: Repeat steps 2, 3, and 4 until the block (1024 bytes) is full.
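The steps above can be sketched in Python under a few simplifying assumptions. A fixed 6-bit code for h stands in for the Huffman code, the bits appended in step 4 are taken from the XOR value (one possible reading of that step), the 1024-byte block limit is not enforced, and the block is held as a string of '0'/'1' characters for readability. The function names encode_block and decode_block are hypothetical.

```python
F_BITS = 64
H_BITS = 6  # fixed-width stand-in for the Huffman code of h (0..63)


def encode_block(sorted_fps):
    """Encode a sorted list of distinct 64-bit fingerprints as a bit string."""
    # Step 1: the first fingerprint is stored in full.
    bits = [format(sorted_fps[0], f"0{F_BITS}b")]
    prev = sorted_fps[0]
    for cur in sorted_fps[1:]:
        x = prev ^ cur
        if x == 0:
            raise ValueError("fingerprints must be distinct")
        # Step 2: position of the most-significant 1-bit of the XOR.
        h = x.bit_length() - 1
        # Step 3: code for h (fixed width here, Huffman in the slides).
        bits.append(format(h, f"0{H_BITS}b"))
        # Step 4: the h bits to the right of the most-significant 1-bit.
        if h:
            bits.append(format(x & ((1 << h) - 1), f"0{h}b"))
        prev = cur
    return "".join(bits)


def decode_block(bitstring):
    """Invert encode_block and return the list of fingerprints."""
    prev = int(bitstring[:F_BITS], 2)
    out = [prev]
    pos = F_BITS
    while pos < len(bitstring):
        h = int(bitstring[pos:pos + H_BITS], 2)
        pos += H_BITS
        low = int(bitstring[pos:pos + h], 2) if h else 0
        pos += h
        # Rebuild the XOR value and apply it to the previous fingerprint.
        prev ^= (1 << h) | low
        out.append(prev)
    return out
```

Sorted neighbours share long prefixes, so h is usually small and each entry costs far fewer than 64 bits. A Huffman code would shorten the common values of h further than the fixed 6 bits used in this sketch.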
Algorithm – Batch query implementation
Both file F (existing fingerprints) and file Q (the batch of query fingerprints) are stored in GFS, a shared-nothing distributed file system. A batch query is split into two phases.
Phase 1: Each task solves the Hamming Distance Problem with some chunk of F and the entire file Q as input. The output of each computation is the list of near-duplicate fingerprints it found.
Phase 2: MapReduce removes duplicates from the Phase 1 results and produces a single sorted file.
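The two phases can be imitated on a single machine with plain Python, as a rough sketch only. It does not use GFS or a real MapReduce framework, it checks each chunk by brute force rather than with the permuted tables, and it assumes the output of interest is the set of query fingerprints that have a near-duplicate in F. The names phase1_map, phase2_reduce and run_batch are hypothetical.

```python
def phase1_map(chunk_of_f, queries, k):
    """Phase 1 task: compare one chunk of F against the whole of Q."""
    found = []
    for existing in chunk_of_f:
        for q in queries:
            if bin(existing ^ q).count("1") <= k:
                found.append(q)
    return found


def phase2_reduce(task_outputs):
    """Phase 2: merge all task outputs, drop duplicates, and sort."""
    merged = set()
    for output in task_outputs:
        merged.update(output)
    return sorted(merged)


def run_batch(file_f, file_q, chunk_size, k=3):
    """Split F into chunks, run one Phase 1 task per chunk, then reduce."""
    chunks = [file_f[i:i + chunk_size] for i in range(0, len(file_f), chunk_size)]
    outputs = [phase1_map(chunk, file_q, k) for chunk in chunks]
    return phase2_reduce(outputs)


if __name__ == "__main__":
    existing = [0x00, 0xF0, 0xFF00]
    batch = [0x01, 0xF1, 0xF1, 0x0F0F]
    print(run_batch(existing, batch, chunk_size=2, k=1))
```

Each Phase 1 task needs only its own chunk of F plus all of Q, so tasks are independent and can run on separate machines, which is the property the shared-nothing setup relies on.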
Evaluation: Is simhash a reasonable technique for de-duplication? With k = 3, precision and recall are both ≈ 0.75. * For comparison, "Finding near-duplicate web pages: a large-scale evaluation of algorithms" by M. R. Henzinger (2006) reports precision and recall of around 0.8.
Evaluation: Do the characteristics of simhash affect the results, and if so, is the impact significant? Fig. 2(a): the right half shows the specific distribution but the left half does not, because some similar content differs only moderately in simhash values. Fig. 2(b): the distribution has some spikes, caused by empty pages, file-not-found pages, and the similar login pages of some bulletin board software.
Evaluation: For 32 GB of batch-query fingerprints and 200 mappers, the combined scan rate can exceed 1 GBps. Given a fixed number of mappers, the time taken is roughly proportional to the size of file Q. [Compression plays an important role.]
Future Work: Based on this paper: document size; category information de-duplication; near-duplication vs. clustering. Other research topic: a more cost-effective approach that uses just the URL information for de-duplication.
Pros: Efficient and practical. Uses compression and a specific storage design (GFS) to solve fingerprint-based de-duplication problems. Gives a compact but thorough description of de-duplication related work.
Cons: Accuracy is limited, because detection relies not on explicit content matching of the documents but on the likelihood of similarity. The paper does not provide evaluation results compared with other algorithms. Although compression techniques are provided, the space cost remains questionable. Content-based de-duplication can only be performed after the web pages have been downloaded, so it does not help reduce wasted bandwidth in crawling.
Comment: This technique is good. It provides an efficient way of using simhash to solve the de-duplication problem for a large amount of data. Although it is not the first paper to focus on a large number of web pages, it does report actual query sizes from the real world.
Reference
Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey Scott Vitter, On searching compressed string collections cache-obliviously, Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 09-12, 2008, Vancouver, Canada
Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov, IRLbot: Scaling to 6 billion pages and beyond, ACM Transactions on the Web (TWEB), v.3 n.3, p.1-34, June 2009
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang, iRobot: an intelligent crawler for web forums, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
Anirban Dasgupta, Ravi Kumar, Amit Sasturkar, De-duping URLs via rewrite rules, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
Lian'en Huang, Lei Wang, Xiaoming Li, Achieving both high precision and high recall in near-duplicate detection, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
Edith Cohen, Haim Kaplan, Leveraging discarded samples for tighter estimation of multiple-set aggregates, Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, June 15-19, 2009, Seattle, WA, USA
Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar, URL normalization for de-duplication of web pages, Proceeding of the 18th ACM conference on Information and knowledge management, November 02-06, 2009, Hong Kong, China
Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar, Learning URL patterns for webpage de-duplication, Proceedings of the third ACM international conference on Web search and data mining, February 04-06, 2010, New York, New York, USA
M. R. Henzinger, Finding near-duplicate web pages: a large-scale evaluation of algorithms, In SIGIR 2006, pages 284-291, 2006