Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.



Overview
- Two near-duplicate detection algorithms (Broder’s and Charikar’s) are compared on a very large scale: 1.6 billion distinct web pages.
- Goal: understand the pros and cons of each algorithm in different situations.
- Goal: find a new approach that detects near-duplicates more accurately.
Finding Near-Duplicates in a Large Scale, 3/28/2013

Relation to course material
- The paper discusses in more detail two algorithms that were introduced in lecture, and draws conclusions by comparing the experimental results.
- Broder’s algorithm is essentially the minhashing algorithm discussed in lecture. The paper goes further and computes supershingles from the vector of min-values.
- Both algorithms follow the general paradigm for finding near-duplicates: generate a signature for each file and compare the signatures.

Broder’s Algorithm
- Begin by preprocessing the HTML tags and URLs of each document (Charikar’s algorithm uses the same preprocessing).
- Apply m fingerprinting functions to the shingle sequence and keep the minimum value under each function, giving m min-values per document.
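The shingling and min-value steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shingle length `k = 8` and the use of salted BLAKE2b in place of the paper's fingerprinting functions are assumptions, and the names `fingerprint`, `shingles`, and `min_values` are made up for this example.

```python
import hashlib

def fingerprint(data: bytes, seed: int = 0) -> int:
    """64-bit fingerprint. Salted BLAKE2b is a stand-in (assumption) for
    the fingerprinting functions used in the paper."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def shingles(tokens, k=8):
    """Fingerprint every run of k consecutive tokens (n - k + 1 shingles)."""
    return [fingerprint(" ".join(tokens[i:i + k]).encode("utf-8"))
            for i in range(len(tokens) - k + 1)]

def min_values(shingle_fps, m=84):
    """For each of m fingerprinting functions, keep the smallest value
    obtained over all shingles. Requires at least one shingle."""
    return [min(fingerprint(fp.to_bytes(8, "big"), seed=i + 1)
                for fp in shingle_fps)
            for i in range(m)]

tokens = "the quick brown fox jumps over the lazy dog today".split()
vector = min_values(shingles(tokens, k=8), m=84)
```

Two pages that share most of their shingles will tend to share most of their min-values, which is what makes the vector usable as a signature.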

Broder’s Algorithm
- Divide the m min-values into m’ groups of l elements each, e.g. m = 84, m’ = 6, l = 14.
- Concatenate the min-values in each group, reducing the vector from m entries to m’ entries.
- Fingerprint each of the m’ entries to obtain an m’-dimensional vector of supershingles.
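A rough sketch of the grouping step, assuming the min-values are 64-bit integers and again using BLAKE2b as a stand-in fingerprint (an assumption, as is the function name `supershingles`):

```python
import hashlib

def supershingles(minvalues, groups=6):
    """Split the min-value vector into `groups` equal chunks, concatenate
    each chunk, and fingerprint it to get one supershingle per chunk."""
    if len(minvalues) % groups != 0:
        raise ValueError("number of min-values must be divisible by groups")
    size = len(minvalues) // groups  # l in the slide, e.g. 84 / 6 = 14
    result = []
    for g in range(groups):
        chunk = minvalues[g * size:(g + 1) * size]
        concatenated = b"".join(v.to_bytes(8, "big") for v in chunk)
        digest = hashlib.blake2b(concatenated, digest_size=8).digest()
        result.append(int.from_bytes(digest, "big"))
    return result

vector = supershingles(list(range(84)), groups=6)
```

Because each supershingle covers l min-values, two pages agree on a supershingle only if all l min-values in that group agree.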

B-Similarity
- Definition: the B-similarity of two pages is the number of identical entries in their supershingle vectors.
- Two pages are near-duplicates iff their B-similarity is at least 2.
- E.g. with m’ = 6, pairs whose supershingle vectors agree in at least 2 of the 6 entries are near-duplicates.
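The definition translates almost directly into code. The sketch below takes two supershingle vectors as plain lists; the function names are illustrative only.

```python
def b_similarity(ss1, ss2):
    """Number of positions where two supershingle vectors are identical."""
    return sum(1 for a, b in zip(ss1, ss2) if a == b)

def is_b_near_duplicate(ss1, ss2, threshold=2):
    """Pages are B-similar iff at least `threshold` supershingles agree."""
    return b_similarity(ss1, ss2) >= threshold

page_a = [11, 22, 33, 44, 55, 66]
page_b = [11, 99, 33, 98, 97, 96]   # agrees with page_a in positions 0 and 2
page_c = [11, 99, 98, 97, 96, 95]   # agrees with page_a in position 0 only
```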

Charikar’s algorithm
- Extract a set of features (meaningful tokens) from a web page and tag each feature with a weight.
- Project each feature (token) to a b-dimensional vector in which every entry is either -1 or 1.

Charikar’s algorithm
- Sum the projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector.
- Generate the final b-bit vector by setting each positive entry to 1 and each non-positive entry to 0.
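The projection, weighted sum, and thresholding steps might look like this. It is a sketch under stated assumptions: the weight of a token is taken to be its frequency in the page, and the {-1, 1} projection is derived from SHA-256 so that the same token always gets the same vector. Both choices, and the names `token_projection` and `simhash`, are illustrative rather than taken from the slides.

```python
import hashlib
from collections import Counter

def token_projection(token, b=384):
    """Deterministic pseudo-random vector in {-1, +1}^b for a token."""
    entries = []
    counter = 0
    while len(entries) < b:
        digest = hashlib.sha256(f"{counter}:{token}".encode("utf-8")).digest()
        for byte in digest:
            for shift in range(8):
                entries.append(1 if (byte >> shift) & 1 else -1)
        counter += 1
    return entries[:b]

def simhash(tokens, b=384):
    """Weighted sum of token projections, then sign-thresholded to b bits."""
    weights = Counter(tokens)  # assumption: weight = term frequency
    total = [0] * b
    for token, weight in weights.items():
        for i, entry in enumerate(token_projection(token, b)):
            total[i] += weight * entry
    # Positive entries become 1, non-positive entries become 0.
    return [1 if x > 0 else 0 for x in total]

bits = simhash("near duplicate web pages near duplicate".split())
```

Since the sum runs over a bag of tokens, reordering the tokens of a page does not change its bit vector.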

C-Similarity
- Definition: the C-similarity of two pages is the number of bits on which their final projections agree.
- Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold, e.g. b = 384, threshold = 372.
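A small sketch of the comparison, with bit vectors as lists of 0/1. Whether a pair that agrees on exactly `threshold` bits counts as a near-duplicate is not clear from the slide; the code below assumes it does.

```python
def c_similarity(bits1, bits2):
    """Number of positions where two b-bit vectors agree."""
    if len(bits1) != len(bits2):
        raise ValueError("bit vectors must have the same length")
    return sum(1 for x, y in zip(bits1, bits2) if x == y)

def is_c_near_duplicate(bits1, bits2, threshold=372):
    """Assumption: agreeing on at least `threshold` bits counts as C-similar."""
    return c_similarity(bits1, bits2) >= threshold

base = [0] * 384
close = [0] * 372 + [1] * 12   # differs from base in 12 bits
far = [0] * 371 + [1] * 13     # differs from base in 13 bits
```

With b = 384 and a threshold of 372, two pages may differ in at most 12 bits and still be reported as near-duplicates.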

Comparison of two algorithms
Broder’s algorithm | Charikar’s algorithm
Considers order of token sequence | Ignores order of token sequence
Ignores the frequency of shingles | Considers the frequency of terms
O(Tm + Dm’) = O(Tm) | O(Tb)
Note: T is the total number of tokens in all web pages and D is the number of web pages. The Dm’ term is dominated because D ≤ T and m’ < m.

Comparison of experiment results
- Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair.
- A node is considered a near-duplicate page iff it is incident to at least one edge.
 | B-similarity graph | C-similarity graph
Near-duplicate pages / all pages | 27.4M/1.6B | 35.5M/1.6B
Average degree | 135 | 92
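The two graph statistics can be computed from a list of near-duplicate pairs as in the sketch below. It assumes the average degree is taken over nodes with at least one edge, which the slide does not state; the function name is illustrative.

```python
from collections import defaultdict

def near_duplicate_stats(pairs):
    """Return (number of near-duplicate pages, average degree), where a
    near-duplicate page is any node incident to at least one edge."""
    degree = defaultdict(int)
    for u, v in pairs:
        degree[u] += 1
        degree[v] += 1
    nodes = len(degree)
    # Assumption: average over non-isolated nodes only.
    average = sum(degree.values()) / nodes if nodes else 0.0
    return nodes, average

pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
nodes, average = near_duplicate_stats(pairs)
```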

Comparison of experiment results
[Figure: distribution of degree in log-log scale; panels: B-similarity, C-similarity]

Comparison of experiment results
Precision measurement
[Table: total precision, precision on same sites, and precision on different sites, for Broder’s and Charikar’s algorithms]
- Precision on pairs from the same site is low because pages on the same site very often use the same boilerplate text and differ only in the main item in the center of the page.

Comparison of experiment results
Term differences in two algorithms
 | Broder’s algorithm | Charikar’s algorithm
Average term difference | 24 | 94
Median term difference | 11 | 7
Pairs with term difference = 2 | 21% | 24%
90% of pairs have term difference less than | 42 | 44

Comparison of experiment results
[Figure: distribution of term differences in two algorithms; panels: Broder’s algorithm, Charikar’s algorithm]

Comparison of experiment results
Error cases:
 | Broder’s case | Charikar’s case
Example sites | NIH database, Herefordshire database on the web | A UK business directory
Difference between pages | 20 consecutive tokens among [...] tokens | 1-5 non-consecutive tokens among 1000 tokens
Cause of the error | Large amount of boilerplate text | Large number of common tokens, despite their different order
Why the other algorithm works | Charikar’s algorithm ignores token order, and the number of differing tokens is large enough to be detected | Broder’s algorithm works because the dispersed differing tokens generate a considerable number of distinct shingles

A combined algorithm
- Use Broder’s algorithm to compute all B-similar pairs first. Then use Charikar’s algorithm to filter out the pairs whose C-similarity falls below a certain threshold.
- The reason: false positives of Broder’s algorithm (consecutive term differences within large boilerplate text) can be filtered out by Charikar’s algorithm.
- Overall precision improves to [...].
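The two-stage filter can be sketched as below. The function takes the B-similar pairs as input and a callable `c_bits` that computes a page's bit vector on demand; both names, and the toy bit vectors, are hypothetical.

```python
def combined_filter(b_similar_pairs, c_bits, threshold):
    """Keep only the B-similar pairs whose C-similarity reaches `threshold`.
    `c_bits(page)` is computed on the fly, so no bit vectors are stored."""
    kept = []
    for p, q in b_similar_pairs:
        agreeing = sum(1 for x, y in zip(c_bits(p), c_bits(q)) if x == y)
        if agreeing >= threshold:
            kept.append((p, q))
    return kept

# Toy 8-bit vectors standing in for real 384-bit projections.
toy_bits = {
    "a": [1, 0, 1, 1, 0, 0, 1, 0],
    "b": [1, 0, 1, 1, 0, 0, 1, 1],  # 7 of 8 bits agree with "a"
    "c": [0, 1, 0, 1, 0, 1, 1, 0],  # 4 of 8 bits agree with "a"
}
result = combined_filter([("a", "b"), ("a", "c")], toy_bits.__getitem__, threshold=7)
```

Only pairs already reported by Broder's algorithm are ever compared, so the second stage can remove false positives but cannot add new pairs.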

Pros
- The experiment is persuasive and reliable enough to support conclusions about the pros and cons of the two algorithms, e.g. large data samples, human evaluation, error case analysis.
- The combined approach includes the advantages of both algorithms and can avoid large numbers of false positives.
- In the combined approach, Charikar’s algorithm is computed on the fly, which saves much space.

Cons
- The experiment focuses on the precision of the two algorithms but does not report statistics on recall.
- The combined algorithm adds time overhead, because finding a near-duplicate pair requires running both algorithms.

Improvement
- Consider token order in Charikar’s algorithm by using shingles as features.
- Consider token frequency in Broder’s algorithm by weighting shingles by their frequency.
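One way to read both suggestions is to count token shingles: the counts could serve as order-aware weighted features for Charikar's algorithm, or as frequency weights for Broder's shingles. The sketch below only builds the counts; the shingle length `k` and the function name are assumptions, and how the counts would be plugged into either algorithm is left open.

```python
from collections import Counter

def weighted_shingles(tokens, k=3):
    """Count each run of k consecutive tokens. The shingles preserve local
    token order; the counts preserve frequency."""
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

counts = weighted_shingles("a b a b a".split(), k=2)
```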