
1 Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Author: Monika Henzinger. Presenter: Chao Yan.

2 Overview
Two near-duplicate detection algorithms (Broder's and Charikar's) are compared at very large scale: 1.6 billion distinct web pages. The goals are to understand the pros and cons of each algorithm in different situations and to find a new approach that detects near-duplicates more accurately.

3 Relation to course material
The talk discusses the two algorithms introduced in lecture in more detail and draws conclusions by comparing their experimental results. Broder's algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the min-value vector. Both algorithms follow the general paradigm for finding near-duplicates: generate a signature for each document and compare the signatures.

4 Broder's Algorithm
Begin by preprocessing each document: strip HTML tags and URLs to obtain a token sequence (the same preprocessing is used for Charikar's algorithm). Form the shingle sequence (overlapping windows of tokens), fingerprint it with m different hash functions, and take the minimum value under each function, giving m min-values per page.
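A minimal sketch of this min-value step, assuming 8-token shingles and a seeded SHA-1 hash in place of the Rabin fingerprints used in the paper; the helper names and the shingle length are illustrative choices, not the paper's exact parameters:

```python
import hashlib

def shingles(tokens, k=8):
    """Yield overlapping k-token shingles of a page (assumes len(tokens) >= k)."""
    for i in range(len(tokens) - k + 1):
        yield " ".join(tokens[i:i + k])

def fingerprint(text, seed):
    """Seeded hash of a string; stands in for a seeded Rabin fingerprint."""
    digest = hashlib.sha1(f"{seed}:{text}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def min_values(tokens, m=84):
    """For each of the m hash functions, keep the smallest fingerprint
    over all shingles of the page, yielding the m min-values."""
    shingle_list = list(shingles(tokens))
    return [min(fingerprint(s, seed) for s in shingle_list)
            for seed in range(m)]
```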

5 Broder's Algorithm
Divide the m min-values into m' groups of l elements each (e.g. m = 84, m' = 6, l = 14). Concatenate the min-values within each group, reducing the vector from m entries to m' entries, and fingerprint each concatenation. The resulting m'-dimensional vector is the page's supershingle vector.
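A sketch of the supershingle step, reusing the hypothetical fingerprint helper from the previous sketch:

```python
def supershingles(minvals, m_prime=6, l=14):
    """Concatenate each group of l min-values and fingerprint the result,
    reducing the m-entry vector to an m'-entry supershingle vector."""
    assert len(minvals) == m_prime * l
    vector = []
    for g in range(m_prime):
        group = minvals[g * l:(g + 1) * l]
        concatenated = ",".join(str(v) for v in group)
        vector.append(fingerprint(concatenated, seed=g))
    return vector
```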

6 B-Similarity
Definition: the B-similarity of two pages is the number of identical entries in their supershingle vectors. Two pages are near-duplicates iff their B-similarity is at least 2; e.g. with m' = 6, any pair whose supershingle vectors agree on at least 2 entries is a near-duplicate pair.
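Expressed directly on the supershingle vectors from the previous sketch:

```python
def b_similarity(ss_a, ss_b):
    """Number of positions on which two supershingle vectors agree."""
    return sum(1 for a, b in zip(ss_a, ss_b) if a == b)

def b_near_duplicates(ss_a, ss_b, threshold=2):
    """Near-duplicates iff B-similarity is at least the threshold (2 here)."""
    return b_similarity(ss_a, ss_b) >= threshold
```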

7 Charikar's Algorithm
Extract a set of features (meaningful tokens) from each web page and tag each feature with a weight. Each feature is projected to a b-dimensional vector whose entries take values in {-1, 1}.
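One way to realize the per-feature projection; the hash-based construction below is an illustrative assumption, since the algorithm only requires a fixed {-1, 1} vector per feature:

```python
import hashlib

def projection(feature, b=384):
    """Deterministically map a feature to a b-entry vector over {-1, +1}."""
    bits = []
    counter = 0
    while len(bits) < b:
        digest = hashlib.sha1(f"{feature}:{counter}".encode()).digest()
        for byte in digest:
            for i in range(8):
                bits.append(1 if (byte >> i) & 1 else -1)
        counter += 1
    return bits[:b]
```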

8 Charikar's Algorithm
Sum the projections of all features, each multiplied by its weight, to form a b-dimensional vector. Generate the final b-bit vector by setting each positive entry to 1 and each non-positive entry to 0.
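The weighted sum and the sign step, reusing the projection sketch above; representing the input as (feature, weight) pairs is an assumption:

```python
def simhash(weighted_features, b=384):
    """weighted_features: iterable of (feature, weight) pairs.
    Sum the weighted +/-1 projections of all features, then keep only the
    sign of each coordinate: positive -> 1, non-positive -> 0."""
    totals = [0.0] * b
    for feature, weight in weighted_features:
        proj = projection(feature, b)
        for i in range(b):
            totals[i] += weight * proj[i]
    return [1 if t > 0 else 0 for t in totals]
```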

9 C-Similarity
Definition: the C-similarity of two pages is the number of bits on which their final projections agree. Two pages are near-duplicates iff the number of agreeing bits is at least a fixed threshold, e.g. b = 384 with threshold 372.
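As a function of the two final bit vectors:

```python
def c_similarity(bits_a, bits_b):
    """Number of bit positions on which two final projections agree."""
    return sum(1 for a, b in zip(bits_a, bits_b) if a == b)

def c_near_duplicates(bits_a, bits_b, threshold=372):
    """Near-duplicates iff at least `threshold` of the b bits agree."""
    return c_similarity(bits_a, bits_b) >= threshold
```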

10 Comparison of the two algorithms
Broder's algorithm: considers the order of the token sequence, ignores the frequency of shingles, runs in O(Tm + Dm') = O(Tm) time.
Charikar's algorithm: ignores the order of the token sequence, considers the frequency of terms, runs in O(Tb) time.
Note: T is the total number of tokens in all web pages; D is the number of web pages.

11 Comparison of experimental results
Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair. A node is considered a near-duplicate page iff it is incident to at least one edge.
B-similarity graph: 27.4M near-duplicate pages out of 1.6B, average degree 135.
C-similarity graph: 35.5M near-duplicate pages out of 1.6B, average degree 92.

12 Comparison of experimental results
[Figure: degree distribution of the B-similarity and C-similarity graphs, plotted on a log-log scale.]

13 Comparison of experimental results
Precision measurement:
Overall precision: Broder's 0.38, Charikar's 0.50.
Precision on pairs from the same site: Broder's 0.34, Charikar's 0.36.
Precision on pairs from different sites: Broder's 0.86, Charikar's 0.90.
Precision on same-site pairs is low because pages on the same site very often use the same boilerplate text and differ only in the main item in the center of the page.

14 Comparison of experimental results
Term differences within the reported near-duplicate pairs:
Broder's algorithm: average 24, median 11; 21% of pairs differ in exactly 2 terms; 90% differ in fewer than 42 terms.
Charikar's algorithm: average 94, median 7; 24% of pairs differ in exactly 2 terms; 90% differ in fewer than 44 terms.

15 Comparison of experimental results
[Figure: distribution of term differences for Broder's algorithm and Charikar's algorithm.]

16 Comparison of experimental results
Error cases:
Broder's case: an NIH database and a Herefordshire database on the web. The pages differ in about 20 consecutive tokens out of 1000-2000 tokens and are dominated by a large amount of boilerplate text. Charikar's algorithm handles these pages correctly because it ignores token order: the number of differing tokens is large enough to be detected.
Charikar's case: http://www.businessline.co.uk/, a UK business directory. The pages differ in 1-5 non-consecutive tokens out of about 1000 tokens and share a large number of common tokens despite their different order. Broder's algorithm handles these pages correctly because the dispersal of the differing tokens generates a considerable number of distinct shingles.

17 A combined algorithm
First use Broder's algorithm to compute all B-similar pairs. Then use Charikar's algorithm to filter out the pairs whose C-similarity falls below a chosen threshold. The rationale: the false positives of Broder's algorithm (consecutive term differences buried in large amounts of boilerplate text) are filtered out by Charikar's algorithm. Overall precision improves to 0.79. A sketch of the pipeline follows.
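A minimal end-to-end sketch combining the hypothetical helpers from the earlier sketches. For clarity it compares all pairs directly and precomputes both signatures; the paper instead finds B-similar pairs by sorting supershingles and applies the C-similarity filter on the fly. Uniform token weights are an illustrative assumption:

```python
def combined_near_duplicates(pages, c_threshold=372):
    """pages: dict mapping a page id to its preprocessed token list.
    Returns the pairs that are B-similar AND pass the C-similarity filter."""
    sketches = {pid: supershingles(min_values(toks)) for pid, toks in pages.items()}
    hashes = {pid: simhash([(t, 1.0) for t in toks]) for pid, toks in pages.items()}
    pairs = []
    ids = list(pages)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            a, b = ids[i], ids[j]
            # Step 1: Broder's algorithm proposes the candidate pair.
            if not b_near_duplicates(sketches[a], sketches[b]):
                continue
            # Step 2: Charikar's algorithm filters out low C-similarity pairs.
            if c_similarity(hashes[a], hashes[b]) >= c_threshold:
                pairs.append((a, b))
    return pairs
```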

18 Pros
The experiment is persuasive and reliable enough to establish the pros and cons of the two algorithms: it uses a large data sample, human evaluation, and error-case analysis. The combined approach keeps the advantages of both algorithms and avoids a large number of false positives. In the combined approach, Charikar's algorithm is computed on the fly, which saves a large amount of space.

19 Cons
The experiment focuses on the precision of the two algorithms but does not gather statistics on recall. The combined algorithm adds time overhead, because deciding that a pair is a near-duplicate requires running both algorithms.

20 Improvement
Take token order into account in Charikar's algorithm by using shingles as features; take token frequency into account in Broder's algorithm by weighting shingles by their frequency. A sketch of the first idea follows.
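An illustrative sketch of the first proposal only, reusing the hypothetical shingles and simhash helpers from the earlier sketches: shingles (which preserve local token order) are fed into the simhash, weighted by their frequency on the page.

```python
from collections import Counter

def simhash_with_shingles(tokens, b=384, k=8):
    """Order-aware variant of Charikar's signature: use k-token shingles as
    features and weight each shingle by how often it occurs on the page."""
    shingle_counts = Counter(shingles(tokens, k))
    return simhash([(s, float(c)) for s, c in shingle_counts.items()], b)
```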

