Finding replicated web collections

Finding replicated web collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina, Stanford University
Presented by: William Quach, CSCI 572, University of Southern California

Outline
Replication on the web
Importance of de-duplication in today's Internet
Similarity
Identifying similar collections
Growing similar collections
How is this useful?
Contributions of the paper
Pros/Cons of the paper
Related work

Replication on the web
Some reasons for duplication: reliability, performance (caching, load balancing), and archival.
Anarchy on the web makes duplicating easy but finding duplicates hard.
The same page can appear under different URLs: protocol, host, domain, etc. [2].
Many aspects of mirrored sites prevent us from identifying replication by finding exact matches: freshness, coverage, formats, partial crawls.
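As an aside, a minimal sketch (not part of the paper's method) of the kind of URL normalization that catches the trivial "same page, different URL" cases; the function name and the specific rules are assumptions chosen for illustration:

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Collapse trivially different URLs that usually point to the same page."""
    parts = urlsplit(url.lower())
    host = parts.netloc
    # Treat "www.example.com" and "example.com" as the same host (assumption).
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    # Treat "/", "/index.html", and "/index.htm" as the same entry page (assumption).
    if path.rstrip("/") == "" or path.endswith(("/index.html", "/index.htm")):
        path = "/"
    return host + path

# Both normalize to "myhomepage.com/", so exact matching now catches them.
print(normalize_url("http://www.myhomepage.com"))
print(normalize_url("https://myhomepage.com/index.html"))
```

Rules like these only handle the easy cases; the paper's contribution is detecting replication when the pages and link structures differ slightly.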

Importance of de-duplication in today's Internet
The Internet grows at an extremely fast pace [1], so brute-force crawling becomes more and more difficult.
Intelligent algorithms can achieve similar results in less time, using less memory.
We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].

Similarity
Similarity of pages
Similarity of link structure
Similarity of collections

Similarity of pages
There are various metrics for determining page similarity, drawn from information retrieval and data mining.
Intuition: textual overlap, i.e. counting the chunks of text that two pages share.
Requires a threshold based on empirical data.

Similarity of pages
The paper uses the textual overlap metric:
Convert each page into text.
Divide the text into obvious chunks (e.g. sentences).
Hash each chunk to compute a "fingerprint" of the chunks.
Two pages are similar if they share more than some threshold of identical chunks.
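A minimal sketch of this chunk-and-hash idea; the sentence splitter, the hash choice, and the 50% overlap threshold are assumptions for illustration, not values taken from the paper:

```python
import hashlib
import re

def fingerprints(text):
    """Split text into sentence-like chunks and hash each chunk."""
    chunks = [c.strip() for c in re.split(r"[.!?]\s+", text) if c.strip()]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def pages_similar(text_a, text_b, threshold=0.5):
    """Two pages are similar if enough of their chunk fingerprints coincide."""
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb)
    return overlap / min(len(fa), len(fb)) >= threshold
```

Comparing small fingerprints instead of full page text is what makes the comparison cheap enough to run over an entire crawl.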

Similarity of link structure
Each page must have at least one matching incoming link, unless it has no incoming links at all.
For each page p in C1, let P1(p) be the set of pages in C1 that link to p.
For the corresponding similar page p' in C2, let P2(p') be the set of pages in C2 that link to p'.
Then there must be similar pages p1 ∈ P1(p) and p2 ∈ P2(p'), unless P1(p) and P2(p') are both empty.
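A sketch of this link-structure check, assuming each collection is given as an adjacency list and that mapping pairs every page of C1 with its similar page in C2; all names are assumptions, not the paper's code:

```python
def link_structure_similar(links1, links2, mapping):
    """links1/links2: dict page -> set of pages it links to (within its collection).
    mapping: dict pairing each page of C1 with its similar page in C2."""
    def incoming(links):
        # Build the incoming-link sets P(p) for every page in the collection.
        inc = {p: set() for p in links}
        for src, dsts in links.items():
            for dst in dsts:
                inc.setdefault(dst, set()).add(src)
        return inc

    in1, in2 = incoming(links1), incoming(links2)
    for p, p_prime in mapping.items():
        p1_set, p2_set = in1.get(p, set()), in2.get(p_prime, set())
        if not p1_set and not p2_set:
            continue  # no incoming links on either side: condition holds vacuously
        # Some page linking to p must map to a page that links to p'.
        if not any(mapping.get(q) in p2_set for q in p1_set):
            return False
    return True
```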

Similarity of collections
Collections are similar if they have similar pages and similar link structure.
To control complexity, the method in the paper only considers equi-sized collections and one-to-one mappings of similar pages.
Terminology:
Collection: a group of linked pages (e.g. a website).
Cluster: a group of collections.
Similar cluster: a group of similar collections.
It is too expensive to compute the optimal set of similar clusters, so the method starts with trivial clusters and "grows" them.

Growing clusters
Trivial clusters: similar clusters whose collections are single pages, i.e. basically a cluster of similar pages.
Two clusters are merged if the merged result is still a similar cluster, now with larger collections.
Continue until no merger can produce a similar cluster.
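A highly simplified sketch of this greedy growing loop; can_merge and merge stand in for the paper's full similar-cluster test and merge step (equi-sized collections, one-to-one page mapping, link structure) and are passed in rather than implemented here, and all names are assumptions:

```python
def grow_clusters(clusters, can_merge, merge):
    """Greedy growing loop: repeatedly merge any two clusters whose merged
    result is still a similar cluster, until no merge is possible.
    can_merge(c1, c2) -> bool   : similar-cluster test on the merged result
    merge(c1, c2)     -> cluster: pairs up collections and concatenates them"""
    clusters = list(clusters)
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if can_merge(clusters[i], clusters[j]):
                    clusters[i] = merge(clusters[i], clusters[j])
                    del clusters[j]
                    changed = True
                    break
            if changed:
                break
    return clusters
```

Starting from trivial (single-page) clusters, each successful merge grows every collection in the cluster by one page, so the surviving clusters correspond to replicated multi-page collections.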

Growing clusters (figure only)

How is this useful?
Improving crawling: the obvious benefit. If the crawler knows which collections are similar, it can avoid crawling the same information again. Experimental results in the paper show a 48% drop in the number of similar pages crawled.
Improving querying: filter search results to "roll up" similar pages, so that more distinct pages are visible to the user on the first results page.
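A minimal sketch of the query-side "roll-up", assuming a precomputed cluster_of lookup from page to similar cluster; the names and the keep-first policy are assumptions for illustration:

```python
def roll_up(results, cluster_of):
    """Keep only the first (highest-ranked) result from each similar cluster,
    so the first results page shows more distinct content.
    results: ranked list of page URLs; cluster_of(url) -> cluster id or None."""
    shown, seen_clusters = [], set()
    for url in results:
        cluster = cluster_of(url)
        if cluster is not None and cluster in seen_clusters:
            continue  # a similar page from this cluster is already listed
        if cluster is not None:
            seen_clusters.add(cluster)
        shown.append(url)
    return shown
```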

Contributions
Clearly defined the problem and provided a basic solution, helping people understand the problem.
Proposed a new algorithm to identify similar collections.
Provided experimental results on the benefits of identifying similar collections for improving crawling and querying, showing that it is a worthwhile problem to solve.
Clearly stated the trade-offs and assumptions of the algorithm, setting the stage for future work.

Pros
Thoroughly defined the problem.
Presented a concise and effective algorithm to address it.
Clearly stated the trade-offs made, so that the algorithm can be improved in future work.
The simplifications made are mainly to control complexity and make the solution more comprehensible; removing them is left to future work.

Cons
Similar collections must be equi-sized.
Similar collections must have one-to-one mappings of all pages.
High probability of break points: collections can become highly fragmented.
The thresholding required to determine page similarity may be very tedious to tune.

Related work
"Detecting Near-Duplicates for Web Crawling" (2007) [5]: takes a lower-level, in-depth approach to determining page similarity using hashing algorithms; a good supplement to this paper.
"Do Not Crawl in the DUST: Different URLs with Similar Text" (2009) [6]: takes a different approach, identifying URLs that point to the same or similar content (e.g. www.myhomepage.com and www.myhomepage.com/index.html) without looking at page content; focuses on the "low-hanging fruit".

Questions?

References
[1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010.
[2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010.
[3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998.
[4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 8-12, 2007. ACM, New York, NY, 141-150.
[6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web 3, 1 (Jan. 2009), 1-31.