
Presentation on theme: "Finding replicated web collections"— Presentation transcript:

1 Finding replicated web collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina (Stanford University)
Presented by: William Quach, CSCI 572, University of Southern California

2 Outline
- Replication on the web
- Importance of de-duplication in today's Internet
- Similarity
- Identifying similar collections
- Growing similar collections
- How is this useful?
- Contributions of the paper
- Pros/cons of the paper
- Related work
Finding replicated web collections June 22, 2010

3 Replication on the web
Some reasons for duplication:
- Reliability
- Performance: caching, load balancing
- Archival
Anarchy on the web makes duplicating easy but finding duplicates hard.
- The same page can appear under different URLs: protocol, host, domain, etc. [2]
Many aspects of mirrored sites prevent us from identifying replication by finding exact matches:
- Freshness, coverage, formats, partial crawls

4 Importance of de-duplication in today's Internet
- The Internet grows at an extremely fast pace [1].
- Crawling becomes more and more difficult if done by brute force.
- Intelligent algorithms can achieve similar results in less time using less memory.
- We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].

5 Similarity
- Similarity of pages
- Similarity of link structure
- Similarity of collections

6 Similarity of pages
Various metrics for determining page similarity, based on:
- Information retrieval
- Data mining
Intuition: textual overlap
- Count chunks of text that overlap.
- Requires a threshold based on empirical data.

7 Similarity of pages
The paper uses the textual overlap metric:
1. Convert the page into text.
2. Divide the text into obvious chunks (e.g., sentences).
3. Hash each chunk to determine the "fingerprint" of the chunks.
4. Two pages are similar if they share more than some threshold of identical chunks.

8 Similarity of link structure
At least one matching incoming link is required, unless no incoming links exist:
- For each page p in C1, let P1(p) be the set of pages in C1 that have a link to page p.
- For the corresponding similar page p' in C2, let P2(p') be the set of pages in C2 that have a link to page p'.
- Then we must have similar pages p1 ∈ P1(p) and p2 ∈ P2(p'), unless P1(p) and P2(p') are both empty.

9 Similarity of collections
Collections are similar if they have similar pages and similar link structure.
To control complexity, the method in the paper only considers:
- Equi-sized collections
- One-to-one mappings of similar pages
Terminology:
- Collection: a group of linked pages (e.g., a website)
- Cluster: a group of collections
- Similar cluster: a group of similar collections
It is too expensive to compute the optimal set of similar clusters, so the method starts with trivial clusters and "grows" them.

10 Growing clusters
- Trivial clusters: similar clusters with single-page collections, basically a cluster of similar pages.
- Two trivial clusters are merged if they become a similar cluster with larger collections.
- Continue until no merger can produce a similar cluster.
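The greedy growing loop can be sketched as follows. This is a simplified illustration under two assumptions not stated on the slide: `is_similar_cluster` is a hypothetical caller-supplied predicate that applies the page- and link-similarity tests from the earlier slides, and collections from the two merged clusters are paired up in list order.

```python
def grow_clusters(trivial_clusters, is_similar_cluster):
    """Greedily merge clusters while any merger yields a similar cluster.
    Each cluster is a list of collections; each collection is a set of pages.
    is_similar_cluster(cluster) is an assumed external similarity test."""
    clusters = list(trivial_clusters)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Tentatively merge: union corresponding collections pairwise.
                candidate = [c1 | c2 for c1, c2 in zip(clusters[i], clusters[j])]
                if is_similar_cluster(candidate):
                    clusters[i] = candidate
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

The loop terminates because every accepted merge reduces the number of clusters by one, and it stops as soon as a full pass finds no merger that produces a similar cluster.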

11 Growing clusters (diagram)

12 How is this useful?
Improving crawling
- If the crawler knows which collections are similar, it can avoid crawling the same information twice.
- Experimental results in the paper show a 48% drop in the number of similar pages crawled.
Improving querying
- Filter search results to "roll up" similar pages so that more distinct pages are visible to the user on the first page.

13 Contributions
- Clearly defined the problem and provided a basic solution, helping readers understand the problem.
- Proposed a new algorithm to identify similar collections.
- Provided experimental results on the benefits of identifying similar collections for both crawling and querying, showing that it is a worthwhile problem to solve.
- Clearly stated the trade-offs and assumptions of the algorithm, setting the stage for future work.

14 Pros
- Thoroughly defined the problem.
- Presented a concise and effective algorithm to address the problem.
- Clearly stated the trade-offs made, so the algorithm can be improved in future work.
- Simplifications were made mainly to control complexity and make the solution more comprehensible.
- Left the de-simplification of the algorithm to future work.

15 Cons
- Similar collections must be equi-sized.
- Similar collections must have one-to-one mappings of all pages.
- High probability of break points: collections can become highly chunked.
- The thresholding required to determine page similarity may be a very tedious task.

16 Related work
"Detecting Near-Duplicates for Web Crawling" (2007) [5]
- Takes a lower-level, in-depth approach to determining page similarity (hashing algorithms).
- A good supplement to this paper.
"Do Not Crawl in the DUST: Different URLs with Similar Text" (2009) [6]
- Takes a different approach: identifies URLs that point to the same or similar content.
- Does not look into page content.
- Focuses on the low-hanging fruit.

17 Questions?

18 References
[1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010.
[2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010.
[3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998.
[4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 2007. ACM, New York, NY.
[6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web 3(1):1-31, Jan. 2009.
