
Presentation on theme: "Finding replicated web collections"— Presentation transcript:

1 Finding replicated web collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina (Stanford University)
Presented by: William Quach, CSCI 572, University of Southern California

2 Outline
- Replication on the web
- Importance of de-duplication in today's Internet
- Similarity
- Identifying similar collections
- Growing similar collections
- How is this useful?
- Contributions of the paper
- Pros/cons of the paper
- Related work
Finding replicated web collections June 22, 2010

3 Replication on the web
Some reasons for duplication:
- Reliability
- Performance: caching, load balancing
- Archival
Anarchy on the web makes duplicating easy but finding duplicates hard.
- The same page can appear under different URLs: protocol, host, domain, etc. [2]
Many aspects of mirrored sites prevent us from identifying replication by finding exact matches:
- Freshness, coverage, formats, partial crawls

4 Importance of de-duplication in today's Internet
- The Internet grows at an extremely fast pace [1].
- Crawling becomes more and more difficult if done by brute force.
- Intelligent algorithms can achieve similar results in less time using less memory.
- We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].

5 Similarity
- Similarity of pages
- Similarity of link structure
- Similarity of collections

6 Similarity of pages
Various metrics for determining page similarity, based on:
- Information retrieval
- Data mining
Intuition: textual overlap
- Count chunks of text that overlap.
- Requires a threshold based on empirical data.

7 Similarity of pages
The paper uses the textual overlap metric:
1. Convert the page into text.
2. Divide the text into obvious chunks (e.g., sentences).
3. Hash each chunk to determine the "fingerprint" of the chunks.
4. Two pages are similar if they share more than some threshold of identical chunks.

8 Similarity of link structure
At least one matching incoming link is required, unless no incoming links exist:
- For each page p in C1, let P1(p) be the set of pages in C1 that have a link to page p.
- For the corresponding similar page p' in C2, let P2(p') be the set of pages in C2 that have a link to page p'.
- Then we must have similar pages p1 ∈ P1(p) and p2 ∈ P2(p'), unless P1(p) and P2(p') are both empty.

9 Similarity of collections
Collections are similar if they have similar pages and similar link structure.
To control complexity, the method in the paper only considers:
- Equi-sized collections
- One-to-one mappings of similar pages
Terminology:
- Collection: a group of linked pages (e.g., a website)
- Cluster: a group of collections
- Similar cluster: a group of similar collections
It is too expensive to compute the optimal set of similar clusters, so the method starts with trivial clusters and "grows" them.

10 Growing clusters
- Trivial clusters: similar clusters with single-page collections, basically a cluster of similar pages.
- Two trivial clusters are merged if they become a similar cluster with larger collections.
- Continue until no merger can produce a similar cluster.
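The greedy growing loop can be sketched as follows. This is a simplified illustration under two assumptions not stated on the slide: `is_similar_cluster` is a hypothetical caller-supplied predicate that applies the page- and link-similarity tests from the earlier slides, and collections from the two merged clusters are paired up in list order.

```python
def grow_clusters(trivial_clusters, is_similar_cluster):
    """Greedily merge clusters while any merger yields a similar cluster.
    Each cluster is a list of collections; each collection is a set of pages.
    is_similar_cluster(cluster) is an assumed external similarity test."""
    clusters = list(trivial_clusters)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Tentatively merge: union corresponding collections pairwise.
                candidate = [c1 | c2 for c1, c2 in zip(clusters[i], clusters[j])]
                if is_similar_cluster(candidate):
                    clusters[i] = candidate
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The loop terminates because every accepted merge reduces the number of clusters by one, and it stops as soon as a full pass finds no merger that produces a similar cluster.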

11 Growing clusters (diagram)

12 How is this useful?
Improving crawling
- If the crawler knows which collections are similar, it can avoid crawling the same information twice.
- Experimental results in the paper show a 48% drop in the number of similar pages crawled.
Improving querying
- Filter search results to "roll up" similar pages so that more distinct pages are visible to the user on the first page.

13 Contributions
- Clearly defined the problem and provided a basic solution, helping readers understand the problem.
- Proposed a new algorithm to identify similar collections.
- Provided experimental results on the benefits of identifying similar collections for both crawling and querying, showing that it is a worthwhile problem to solve.
- Clearly stated the trade-offs and assumptions of the algorithm, setting the stage for future work.

14 Pros
- Thoroughly defined the problem.
- Presented a concise and effective algorithm to address the problem.
- Clearly stated the trade-offs made, so the algorithm can be improved in future work.
- Simplifications were made mainly to control complexity and make the solution more comprehensible.
- Left the de-simplification of the algorithm to future work.

15 Cons
- Similar collections must be equi-sized.
- Similar collections must have one-to-one mappings of all pages.
- High probability of break points: collections can become highly chunked.
- The thresholding required to determine page similarity may be a very tedious task.

16 Related work
"Detecting Near-Duplicates for Web Crawling" (2007) [5]
- Takes a lower-level, in-depth approach to determining page similarity (hashing algorithms).
- A good supplement to this paper.
"Do Not Crawl in the DUST: Different URLs with Similar Text" (2009) [6]
- Takes a different approach: identifies URLs that point to the same or similar content.
- Does not look into page content.
- Focuses on the low-hanging fruit.

17 Questions?

18 References
[1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010.
[2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010.
[3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998.
[4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 2007. ACM, New York, NY.
[6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web 3(1):1-31, Jan. 2009.
