FINDING NEAR-DUPLICATE WEB PAGES: A LARGE-SCALE EVALUATION OF ALGORITHMS - Monika Henzinger
Speaker: Ketan Akade

INTRODUCTION
What are near-duplicate web pages? What problems do they cause?
- They increase the space needed to store the index
- They either slow down or increase the cost of serving results
Why detect near-duplicate web pages?
- Smarter crawling
- Improved PageRank, less storage, a smaller index

APPROACHES TO FINDING NEAR DUPLICATES
- Compare all pairs of documents (the naive solution): very expensive on large databases
- Manber and Heintze proposed algorithms based on sequences of adjacent characters
- Shivakumar and Garcia-Molina focused on scaling this approach up to multi-gigabyte databases
- Broder et al. used word sequences to efficiently find near-duplicate web pages
- Charikar developed an algorithm based on random projections of the words in a document

ALGORITHM FOR DETECTING NEAR DUPLICATES
Every HTML page is converted into a token sequence:
- All HTML markup is replaced by white space
- All formatting instructions are ignored
- Every alphanumeric sequence is considered a term
- Each term is hashed using Rabin's fingerprinting scheme to generate a token
- Every URL is broken at slashes and dots and treated as a sequence of terms
- To distinguish between images, the URL in an image tag is treated as a term
A bit string is generated from the token sequence of a page and used to determine near-duplicates for the pages, as sketched below.
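As a rough illustration of this preprocessing, here is a minimal Python sketch. It is an assumption-laden simplification: a regular expression stands in for a real HTML parser, a generic 64-bit hash stands in for Rabin's fingerprinting scheme, and the URL handling described above is omitted.

```python
import hashlib
import re

def fingerprint(term: str) -> int:
    # Stand-in for Rabin's fingerprinting scheme: any stable
    # 64-bit hash suffices for this sketch.
    return int.from_bytes(hashlib.blake2b(term.encode(), digest_size=8).digest(), "big")

def tokenize(html: str) -> list[int]:
    # Replace all HTML markup with white space; formatting is thereby ignored.
    text = re.sub(r"<[^>]*>", " ", html)
    # Every alphanumeric sequence is a term.
    terms = re.findall(r"[A-Za-z0-9]+", text)
    # Hash each term to obtain the page's token sequence.
    return [fingerprint(t) for t in terms]
```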

Broder et al.'s Algorithm (Algorithm B)
Every sequence of k consecutive tokens of a page is a shingle. Each shingle is fingerprinted with m different fingerprinting functions; for each function, the minimum fingerprint over all shingles (the minvalue) is kept, giving an m-dimensional minvalue vector per page (the paper uses k = 8 and m = 84).

Broder et al.'s Algorithm, contd.
- Let l = m/m'
- For 0 ≤ j < m', the concatenation of minvalues j*l, …, (j+1)*l - 1 is fingerprinted with yet another function to form an m'-dimensional supershingle vector
- The number of identical entries in the supershingle vectors of two pages is their B-similarity
- Two pages are near-duplicates under Algorithm B iff their B-similarity is at least 2
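The sketch below, under the same assumptions as the tokenization snippet, puts both stages together: minvalues from shingle fingerprints, then supershingles from runs of l consecutive minvalues. The parameter values follow the paper's setup (k = 8, m = 84, m' = 6); the seeded hash is a simplified stand-in for the family of fingerprinting functions.

```python
import hashlib

K, M, M_PRIME = 8, 84, 6  # shingle size, number of minvalue functions, number of supershingles
L = M // M_PRIME          # minvalues concatenated per supershingle

def h(data: bytes, seed: int) -> int:
    # Simplified stand-in for the seed-th fingerprinting function.
    return int.from_bytes(hashlib.blake2b(seed.to_bytes(4, "big") + data, digest_size=8).digest(), "big")

def supershingles(tokens: list[int]) -> list[int]:
    # Stage 1: fingerprint every k-token shingle with m functions, keep the minvalue of each.
    # (Assumes the page has at least K tokens.)
    shingles = [str(tokens[i:i + K]).encode() for i in range(len(tokens) - K + 1)]
    minvalues = [min(h(s, i) for s in shingles) for i in range(M)]
    # Stage 2: fingerprint each run of l consecutive minvalues into one supershingle.
    return [h(str(minvalues[j * L:(j + 1) * L]).encode(), M + j) for j in range(M_PRIME)]

def b_similarity(u: list[int], v: list[int]) -> int:
    # Number of identical supershingle entries; pairs with b_similarity >= 2 are near-duplicates.
    return sum(x == y for x, y in zip(u, v))
```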

CHARIKAR'S ALGORITHM (ALGORITHM C)
- Let b be a constant
- Each token is projected into b-dimensional space by randomly choosing b entries from {-1, 1}
- This projection is the same for all pages
- For each page, a b-dimensional vector is created by adding the projections of all the tokens in its token sequence
- The final vector for the page is created by setting every positive entry to 1 and every non-positive entry to 0, giving a b-bit random projection of each page
- The C-similarity of two pages is the number of bits on which their projections agree
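A compact sketch of this scheme (widely known as simhash) follows. Deriving the ±1 projection deterministically from the token value keeps it identical across pages, as required; b = 384 matches the setting reported in the paper, but the hash construction is again an illustrative simplification.

```python
import hashlib

B = 384  # projection dimension

def projection(token: int) -> list[int]:
    # Deterministic +/-1 vector for a token, so every page projects it identically.
    bits = int.from_bytes(hashlib.blake2b(token.to_bytes(8, "big"), digest_size=B // 8).digest(), "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(B)]

def c_fingerprint(tokens: list[int]) -> list[int]:
    # Sum the per-token projections, then map positive entries to 1 and the rest to 0.
    acc = [0] * B
    for t in tokens:
        for i, v in enumerate(projection(t)):
            acc[i] += v
    return [1 if x > 0 else 0 for x in acc]

def c_similarity(p: list[int], q: list[int]) -> int:
    # Number of bit positions on which the two projections agree.
    return sum(x == y for x, y in zip(p, q))
```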

RESULTS OF BOTH ALGORITHMS
Both algorithms were run on a set of 1.6B unique pages collected by the Google crawler. They are compared based on:
- precision
- the distribution of the number of term differences in the near-duplicate pairs
Each near-duplicate pair is labeled correct, incorrect, or undecided. Two web pages are correct near-duplicates if:
- their text differs only by a session id, a timestamp, a visitor count, or part of their URL, or
- the difference is invisible to visitors of the page

RESULTS, contd.
A near-duplicate pair is incorrect if the main item of the page is different. The remaining near-duplicate pairs are called undecided.

COMPARISON OF ALGORITHMS B AND C
- Manual evaluation: Algorithm C achieves higher precision than Algorithm B
- Pairs on different sites: statistically, Algorithm C is superior to Algorithm B
- Pairs on the same site: neither algorithm achieves high precision; Algorithm B has higher recall while Algorithm C has higher precision
- Therefore, a combination of the two algorithms can achieve higher precision and recall for pairs on the same site

THE COMBINED ALGORITHM
- First compute all B-similar pairs
- Then filter out those pairs whose C-similarity falls below a certain threshold
The precision and recall of the combined algorithm improve significantly when a high C-similarity threshold is used; a sketch of the filtering step follows below.
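A minimal sketch of the filtering step, reusing the b_similarity and c_similarity helpers from the earlier snippets. The all-pairs loop and the threshold value are both placeholders: the paper computes B-similar pairs far more efficiently, and tunes the C-similarity cutoff empirically.

```python
def combined_near_duplicates(pages: list[tuple[list[int], list[int]]],
                             c_threshold: int = 370) -> list[tuple[int, int]]:
    # pages: one (supershingle_vector, c_fingerprint) pair per page.
    # c_threshold (out of B = 384 bits) is illustrative, not the paper's tuned value.
    result = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            # Step 1: candidates are B-similar pairs (at least 2 matching supershingles).
            if b_similarity(pages[i][0], pages[j][0]) >= 2:
                # Step 2: keep the pair only if its C-similarity clears the threshold.
                if c_similarity(pages[i][1], pages[j][1]) >= c_threshold:
                    result.append((i, j))
    return result
```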

CONCLUSION
- Evaluation of Algorithms B and C shows that neither performs well on pairs of pages from the same site
- The performance of Algorithm B can be improved by weighting shingles by frequency or by using a different number k of tokens in a shingle
- The combined algorithm is superior in both precision and recall