Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. By Monika Henzinger http://labs.google.com/people/monika/ Presented by Harish Rayapudi and Shiva Prasad Malladi

Overview Introduction Broder’s Algorithm Charikar’s Algorithm Comparing the algorithms Combined algorithm Conclusion

Duplicate web pages Why identify duplicate pages? They take extra space in the index and slow down performance. How to identify them? Comparing every pair of pages requires O(n^2) comparisons, and the indexed web contains roughly 13.92 billion pages.

Experimental Data 1.6B pages from a real Google crawl. 25%-30% of the pages (exact duplicates) had been removed before the authors received the data; exactly how many pages were identical is unknown. Of the remainder, Broder's algorithm (Alg. B) found 1.7% near-duplicates and Charikar's algorithm (Alg. C) found 2.2% near-duplicates.

Algorithms Broder's and Charikar's algorithms, both used by successful web search engines, had not previously been evaluated against each other. They are compared on: 1. precision on a random subset; 2. the distribution of the number of term differences per near-duplicate pair; 3. the distribution of the number of near-duplicates per page. The algorithms were evaluated on the set of 1.6B unique pages.

Sample HTML Page

    <html>
    <body bgcolor="cream">
    <H3>Harish Rayapudi Website</H3>
    <H4><a href="http://www.google.com" target="_blank">Google</a></H4>
    <H4>I am a Computer Science graduate student</H4>
    </body>
    </html>

Remove HTML & Formatting Info Harish Rayapudi Website http://www.google.com Google I am a Computer Science graduate student

Remove "." and "/" from URLs Harish Rayapudi Website http www google com Google I am a Computer Science graduate student

Tokens in the Page Harish(1) Rayapudi(2) Website(3) http(4) www(5) google(6) com(7) Google(9) I(10) am(11) a(12) Computer(13) Science(14) graduate(15) student(16). We'll only look at the first 7 tokens. The token sequence for this page P1 is {1,2,3,4,5,6,7}. This token sequence is used by both algorithms.

Tokens in a Similar Page Harish(1) Rayapudi(2) Website(3) http(4) www(5) yahoo(8) com(7) Yahoo(17) I(10) am(11) a(12) Computer(13) Science(14) graduate(15) student(16). We'll only look at the first 7 tokens. The token sequence for this page P2 is {1,2,3,4,5,8,7}. This token sequence is used by both algorithms.

Preprocessing step contd. Let n be the length of the token sequence; for pages P1 and P2, n=7. Every subsequence of k consecutive tokens is fingerprinted, resulting in n-k+1 shingles. For k=2, the shingles for page P1 {1,2,3,4,5,6,7} and page P2 {1,2,3,4,5,8,7} are P1 {12,23,34,45,56,67} and P2 {12,23,34,45,58,87}.
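A small sketch of the shingling step on the token IDs above (the shingles are kept as raw tuples here; in practice each shingle is fingerprinted, e.g. with a Rabin fingerprint):

    def shingles(tokens, k):
        # Every subsequence of k consecutive tokens becomes one shingle,
        # giving n - k + 1 shingles for a sequence of length n.
        return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

    p1 = [1, 2, 3, 4, 5, 6, 7]
    p2 = [1, 2, 3, 4, 5, 8, 7]
    print(shingles(p1, 2))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
    print(shingles(p2, 2))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]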

Broder's Algorithm Shingles are fingerprinted with m different fingerprinting functions. For m = 4 we have fingerprinting functions F1, F2, F3 and F4. For P1, the result of applying the m functions:

    Shingle   F1   F2   F3   F4
    12         4    7    5    9
    23         7    4    8    5
    34         1    2    3    6
    45         8    5    9    7
    56         1    8    7    8
    67         6    3    4    5

The smallest value of each function is taken, and an m-dimensional vector of min-values is stored for each page. The 4-dimensional vector for page P1 is {1,2,3,5}.

For P2, the result of applying the m functions:

    Shingle   F1   F2   F3   F4
    12         4    7    5    9
    23         7    4    8    5
    34         1    2    3    6
    45         8    5    9    7
    58         9    7    3    1
    87         5    3    6    4

The 4-dimensional vector for page P2 is {1,2,3,1}.

The m-dimensional vector is reduced to an m'-dimensional vector of supershingles, where m' is chosen so that m is divisible by m'. Since m=4, we take m'=2. The min-values are split into non-overlapping groups: for P1 {1,2,3,5} the groups are {12,35}, and for P2 {1,2,3,1} they are {12,31}. Each group is fingerprinted to produce the supershingle vector: for P1, SS({12,35}) = {x, y}, and for P2, SS({12,31}) = {x, z}. The B-similarity of two pages is the number of identical entries in their supershingle vectors, so the B-similarity of pages P1 and P2 is 1 (the common entry x).
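Putting the steps together, a minimal sketch of Broder's algorithm with m = 4 and m' = 2 as in the example above; the salted hashes below stand in for the Rabin fingerprints used in practice:

    import hashlib

    def h(value, salt):
        # Stand-in fingerprinting function (a salted hash); real
        # implementations use Rabin fingerprints.
        return int(hashlib.md5(f"{salt}:{value}".encode()).hexdigest(), 16)

    def min_values(shingles, m=4):
        # For each of the m fingerprinting functions, keep the smallest
        # value over all shingles of the page.
        return [min(h(s, i) for s in shingles) for i in range(m)]

    def supershingles(minvals, m_prime=2):
        # Split the m min-values into m' non-overlapping groups and
        # fingerprint each group, giving the supershingle vector.
        group = len(minvals) // m_prime
        return [h(tuple(minvals[i:i + group]), "ss")
                for i in range(0, len(minvals), group)]

    def b_similarity(ss1, ss2):
        # Number of entries on which the two supershingle vectors agree.
        return sum(a == b for a, b in zip(ss1, ss2))

    p1_shingles = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
    p2_shingles = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]
    print(b_similarity(supershingles(min_values(p1_shingles)),
                       supershingles(min_values(p2_shingles))))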

Experimental results of Broder's Algorithm The algorithm generated 6 supershingles per page, a total of 10.1B supershingles. (For pages P1 and P2 we had 2 supershingles each: {x, y} and {x, z}.) For each pair of pages with an identical supershingle, the B-similarity is determined; for P1 and P2 the B-similarity is 1.

B-similarity graph Every page is a node in the graph. There is an edge between two nodes if and only if the pair is B-similar, and the label of the edge is the B-similarity of the pair (in the running example, P1 and P2 are joined by an edge labeled 1). A node is considered a near-duplicate page if and only if it is incident to at least one edge. The average degree of the B-similarity graph is about 135.
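A sketch of how such a graph can be assembled: an inverted index from supershingle to pages restricts comparisons to pairs sharing at least one supershingle, the B-similarity labels the edge, and any page incident to an edge is flagged as a near-duplicate. The threshold is a parameter: the running example treats one shared supershingle as an edge, while the paper itself requires agreement in at least two supershingles; the data layout is an illustrative assumption.

    from collections import defaultdict
    from itertools import combinations

    def b_similarity_graph(supershingle_vectors, threshold=1):
        # supershingle_vectors: {page_id: list of supershingles}
        # Invert: map each supershingle to the pages containing it, so only
        # pages sharing a supershingle are ever compared.
        postings = defaultdict(set)
        for page, vec in supershingle_vectors.items():
            for ss in vec:
                postings[ss].add(page)

        edges = {}
        for pages in postings.values():
            for a, b in combinations(sorted(pages), 2):
                if (a, b) not in edges:
                    sim = sum(x == y for x, y in zip(supershingle_vectors[a],
                                                     supershingle_vectors[b]))
                    if sim >= threshold:
                        edges[(a, b)] = sim   # edge label = B-similarity

        near_duplicates = {p for pair in edges for p in pair}
        return edges, near_duplicates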

A random sample of 96,556 B-similar pairs was taken and sub-sampled; 1,910 pairs were chosen for manual evaluation. The overall precision is 0.38; the precision for pairs on the same site is 0.34, while for pairs on different sites it is 0.84. (Table taken from the paper.)

Correctness of a near-duplicate pair A pair is judged correct if the text differs only by URL, session id, a timestamp, or a visitor count; if the difference is invisible to visitors; if the difference is a combination of the above items; or if the pages are entry pages to the same site.

URL-only differences account for 41% of the correct pairs. (Table taken from the paper.)

92% of these pairs are on the same site; almost half the cases are pairs that could not be evaluated. (Table taken from the paper.)

Term difference is calculated by executing the Linux diff command; for the example pages:

    > diff google yahoo
    1,2c1,2
    < Harish Rayapudi Google Website
    < http www google com
    ---
    > Harish Rayapudi Yahoo Website
    > http www yahoo com

The average term difference is 24 and the median is 11. The figure (taken from the paper) shows the distribution of term differences up to 200.
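An equivalent sketch in Python using difflib instead of the diff command; counting inserted, deleted, and replaced tokens is one reasonable way to define the term difference, and the paper's exact counting may differ:

    import difflib

    def term_difference(tokens1, tokens2):
        # Count tokens that are inserted, deleted, or replaced between the pages.
        sm = difflib.SequenceMatcher(None, tokens1, tokens2)
        diff = 0
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != "equal":
                diff += max(i2 - i1, j2 - j1)
        return diff

    google = "Harish Rayapudi Google Website http www google com".split()
    yahoo = "Harish Rayapudi Yahoo Website http www yahoo com".split()
    print(term_difference(google, yahoo))  # 2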

Charikar's Algorithm
1. Each token is projected into b-dimensional space by randomly choosing b entries from {−1, 1}.
2. This projection is the same for all pages.
3. For each page, a b-dimensional vector is created by adding the projections of all the tokens in its token sequence.
4. The final vector for the page is created by setting every positive entry to 1 and every non-positive entry to 0, resulting in a random projection (bit string) for each page.
5. The C-similarity of two pages is the number of bits their projections agree on.
6. b = 384 is chosen so that both algorithms store a bit string of 48 bytes per page.
7. Two pages are C-similar iff the number of agreeing bits in their projections lies above a fixed threshold.
8. The threshold is set to t = 372 here.

Worked example (real-valued projections are used here for illustration): P1 and P2 are documents, shingle size k = 3, b = 3.
P1 = 1 2 3 4 5 6 7
P2 = 2 3 4 1 2 7 9

P1 shingles, each with 3 random values:

    123   -0.7    0.2    0.5
    234    0.3    0.8   -0.4
    345   -0.1    0.1    0.9
    456    0.5   -0.2   -0.4
    567   -0.9   -0.7   -0.5

Adding the columns gives the P1 vector (-0.9, 0.2, 0.1).

P2 shingles:

    234    0.3    0.8   -0.4
    341   -0.1    0.1   -0.3
    412   -0.3   -0.2    0.9
    127   -0.7    0.2   -0.2
    279    0.6    0.4   -0.8

Adding the columns gives the P2 vector (-0.2, 1.3, -0.8).

P1 final vector = (0, 1, 1)
P2 final vector = (0, 1, 0)
C-similarity(P1, P2) = 2
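A minimal sketch of Charikar's projection, following the worked example's real-valued random values rather than strict {−1, 1} entries; b is set to 384 as in the paper, and the per-token seeding scheme is an illustrative assumption:

    import random

    def c_projection(tokens, b=384, seed=0):
        # The random projection for a token depends only on the token (and a
        # global seed), so it is the same across all pages.
        page_vector = [0.0] * b
        for tok in tokens:
            rng = random.Random(f"{seed}:{tok}")
            for i in range(b):
                # Real-valued entries, as in the worked example above.
                page_vector[i] += rng.uniform(-1, 1)
        # Final bit string: 1 for positive entries, 0 for non-positive ones.
        return [1 if x > 0 else 0 for x in page_vector]

    def c_similarity(bits1, bits2):
        # Number of bit positions on which the two projections agree.
        return sum(x == y for x, y in zip(bits1, bits2))

    p1 = [1, 2, 3, 4, 5, 6, 7]
    p2 = [1, 2, 3, 4, 5, 8, 7]
    agree = c_similarity(c_projection(p1), c_projection(p2))
    print(agree, "of 384 bits agree")  # C-similar iff agree >= t (t = 372 here)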

Experimental results of Charikar's algorithm The algorithm returns all pairs with C-similarity at least t as near-duplicate pairs. With t = 372, Alg. C found 1,630M near-duplicate pairs, of which only about 50% (815M) are correct; this is better than Alg. B. The C-similarity graph is built in the same way as the B-similarity graph.

Experimental results of Charikar's algorithm

Experimental results of Charikar's algorithm URL-only differences account for 72% of the correct pairs. Interesting website: http://www.businessline.co.uk/ (Table taken from the paper.)

Experimental results of Charikar's algorithm 95% of the undecided pairs are on the same site. (Table taken from the paper.)

Comparisons of both algorithms Manual evaluation: Alg. C outperforms Alg. B with a precision of 0.50 versus 0.38. Term difference: the distributions are quite similar, except for the larger number of pairs with term differences above 200 (19 vs. 90). Correlation: of the 96,556 B-similar pairs, only 45 were also C-similar (C-similarity at least t = 372); of the 169,757 C-similar pairs, 4% were B-similar and 95% had B-similarity 0.

Comparisons of both algorithms (Table taken from the paper.)

Comparisons of both algorithms (Table taken from the paper.)

Combined Algorithm Need: both algorithms wrongly identify pairs as near-duplicates either (a) because a small difference in tokens causes a large semantic difference, or (b) because of unlucky random choices. The combined algorithm: 1) first compute all B-similar pairs; 2) then filter out those pairs whose C-similarity falls below a certain threshold.
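A sketch of the combined filtering step, reusing c_projection and c_similarity from the sketch above and assuming the B-similar pairs have already been computed (the default threshold of 350 anticipates the next slide; names are illustrative):

    def combined_near_duplicates(pages, b_similar_pairs, c_threshold=350):
        # pages: {page_id: token sequence}
        # b_similar_pairs: pairs already found B-similar by Broder's algorithm.
        # Keep only the pairs whose C-similarity also clears the threshold.
        bits = {p: c_projection(pages[p])
                for pair in b_similar_pairs for p in pair}
        return [(a, b) for (a, b) in b_similar_pairs
                if c_similarity(bits[a], bits[b]) >= c_threshold]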

Combined Algorithm The threshold is selected for the best precision; here threshold = 350 is chosen. (Figure taken from the paper.)

Combined Algorithm R is the number of correct near-duplicate pairs returned divided by the number of correct near-duplicate pairs returned by Alg. B. The figure (taken from the paper) plots, for S1, precision versus R for all C-similarity thresholds between 0 and 384.

Combined Algorithm

Combined Algorithm On the testing set S2, the resulting algorithm returns 363 of the 962 pairs as near-duplicates, with a precision of 0.79 and an R-value of 0.79. The table (taken from the paper) shows that 82% of the returned pairs are on the same site and that the precision improvement is mostly achieved for these pairs: at 0.74, the same-site precision is much better than that of either individual algorithm.

Conclusion The authors performed an evaluation of two near-duplicate algorithms on 1.6B web pages. Neither performed well on pages from the same site, but a combined algorithm did, without sacrificing much recall.

Discussion How can Alg. B be improved? Can both algorithms be improved to perform better on pairs from the same website?