Finding Replicated Web Collections

1 Finding Replicated Web Collections
Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina. Paper presentation by Radhika Malladi and Vijay Reddy Mara.

2 Introduction. Similarity Measures: Similarity in Web Pages, Similarity in Link Structure. Similar Clusters: Computing, Implementing, Quality of Clusters. Exploiting Clusters: Improving Crawling, Improving Search Engine Results.

3 Introduction: Replication across the web?
Replicated collections can consist of hundreds or thousands of pages. Mirrored web pages?

4 Mirrored pages.

5

6 What will we learn from this paper?
Similarity measures: how to compute similarity measures for collections of web pages. Improved crawling: how to use replication information to improve crawling. Reduced clutter in search engine results: how to use replication information to remove redundant results.

7 By automatically identifying mirrored collections
we can improve the following: Crawling: fetches web pages. Ranking: pages are ranked by replication factor. Archiving: stores a subset of the web. Caching: holds web pages that are frequently accessed.

8 Difficulties in detecting replicated collections:
Update frequency: mirror copies may not be updated regularly. Partial mirror coverage: a mirror may differ from the primary. Different formats. Partial crawls: snapshots may differ.

9 Similarity Measures
Definition (Web Graph): A graph $G = (V, E)$ with a node $v_i$ for each page $p_i$ and a directed edge from $v_i$ to $v_j$ if there is a hyperlink from $p_i$ to $p_j$.
Definition (Collection):
Collection size: the number of pages in the collection.
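The web graph definition can be illustrated with a small adjacency-set structure. This is only a sketch: the function names `build_web_graph` and `collection_size` and the page names are hypothetical and do not come from the paper or the slides.

```python
from collections import defaultdict


def build_web_graph(links):
    """Build adjacency sets from (source_page, target_page) hyperlink pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)  # directed edge from the source node to the target node
        graph[dst]           # touching the key makes the target a node even without out-links
    return dict(graph)


def collection_size(collection):
    """Collection size: the number of distinct pages in the collection."""
    return len(set(collection))


# Hypothetical three-page site.
links = [("a.html", "b.html"), ("a.html", "c.html"), ("b.html", "c.html")]
graph = build_web_graph(links)
```

Adjacency sets keep the edge test `dst in graph[src]` cheap, which is convenient for the link-structure checks in the later definitions.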

10 Definition (Identical Collections): Equisized collections $C_1$ and $C_2$ are identical if there is a one-to-one mapping $M$ that maps $C_1$ pages to $C_2$ pages such that:
Identical pages: for each page $p \in C_1$, $p \equiv M(p)$.
Identical link structure: for each link in $C_1$ from page $a$ to page $b$, there is a link from $M(a)$ to $M(b)$ in $C_2$.
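A minimal sketch of checking this definition for a mapping that is already given. Finding the mapping is a separate problem and is not attempted here. The function name `are_identical` and the data layout (page contents in a dict, links as a set of pairs) are assumptions made for illustration.

```python
def are_identical(pages1, links1, pages2, links2, mapping):
    """Check the identical-collections definition for a given mapping.

    pages1, pages2: dict of page id -> page content
    links1, links2: set of (source id, target id) links inside each collection
    mapping:        dict of C1 page id -> C2 page id (the candidate mapping M)
    """
    # Equisized collections, and the mapping must pair every C1 page with a distinct C2 page.
    if len(pages1) != len(pages2):
        return False
    if set(mapping) != set(pages1) or set(mapping.values()) != set(pages2):
        return False
    # Identical pages: each page has the same content as its image under the mapping.
    if any(pages1[p] != pages2[mapping[p]] for p in pages1):
        return False
    # Identical link structure: every link a -> b in C1 has a link M(a) -> M(b) in C2.
    return all((mapping[a], mapping[b]) in links2 for a, b in links1)
```

For example, two two-page collections with the same contents and one corresponding link each pass the check; dropping the link from the second collection makes it fail.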

11

12 Similarity of link structure:
Collection sizes: One-to-One:

13 Break points: Link Similarity:

14 Definition (Similar Collections): Equisized collections $C_1$ and $C_2$ are similar if there is a one-to-one mapping $M$ that maps all $C_1$ pages to all $C_2$ pages such that:
Similar pages: for each page $p \in C_1$, $p \approx M(p)$.
Similar links: two corresponding pages should each have at least one parent in their respective collections, and those parents should also be similar pages.
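The similar-collections definition can be sketched the same way, again for a mapping that is already given. Several points here are assumptions rather than the paper's method: the page-similarity test is passed in as a hypothetical predicate `similar`, any similar pair of parents is accepted, and a pair of corresponding pages that both have no parent inside their collections is treated as satisfying the link condition.

```python
def are_similar_collections(pages1, links1, pages2, links2, mapping, similar):
    """Check the similar-collections definition for a given mapping.

    pages1, pages2: dict of page id -> page content
    links1, links2: set of (source id, target id) links inside each collection
    mapping:        dict of C1 page id -> C2 page id
    similar:        hypothetical predicate, similar(content_a, content_b) -> bool
    """
    if len(pages1) != len(pages2):
        return False
    if set(mapping) != set(pages1) or set(mapping.values()) != set(pages2):
        return False

    def parents(page, links):
        # Pages in the same collection that link to `page`.
        return {src for src, dst in links if dst == page}

    for p in pages1:
        q = mapping[p]
        # Similar pages: p must be similar to its image under the mapping.
        if not similar(pages1[p], pages2[q]):
            return False
        par1, par2 = parents(p, links1), parents(q, links2)
        if not par1 and not par2:
            continue  # assumption: pages without parents on both sides pass
        # Similar links: some parent of p must be similar to some parent of q.
        if not any(similar(pages1[a], pages2[b]) for a in par1 for b in par2):
            return False
    return True
```

Passing the predicate as a parameter keeps the sketch independent of how page similarity is measured, for example by fingerprint overlap.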

15 Similar collections

16 Similar Clusters: Computing. Example 1 of similar clusters
Cluster size (cardinality): 2. Collection size: 5.

17 Computing. Example 2 of similar clusters
Cluster size (cardinality): 3. Collection size: 3.

18 Cluster Algorithm Example:
Step 1: Find all trivial clusters

19 Step 2: Merge trivial clusters that lead to similar clusters

20 Step 3: Outcome
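The three steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: pages are assumed to be pre-grouped into trivial clusters of mutually similar pages (the hypothetical `page_group` input), and two trivial clusters are merged only when the links between them pair their pages off one-to-one. The function name `grow_clusters` is also hypothetical.

```python
from collections import defaultdict


def grow_clusters(page_group, links):
    """page_group: dict of page -> id of its trivial cluster (mutually similar pages).
    links: iterable of (source page, target page) hyperlinks.
    Returns the resulting collections as a list of sets of pages."""
    # Step 1: trivial clusters, i.e. sets of similar pages, each a collection of size 1.
    trivial = defaultdict(set)
    for page, gid in page_group.items():
        trivial[gid].add(page)

    # Group links by (source trivial cluster, target trivial cluster).
    between = defaultdict(set)
    for src, dst in links:
        gi, gj = page_group[src], page_group[dst]
        if gi != gj:
            between[(gi, gj)].add((src, dst))

    # Union-find over pages; pages joined together end up in one collection.
    parent = {p: p for p in page_group}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Step 2: merge two trivial clusters when their links form a one-to-one pairing.
    for (gi, gj), edges in between.items():
        sources = {s for s, _ in edges}
        targets = {t for _, t in edges}
        n = len(trivial[gi])
        # n > 1 skips pages that have no replica at all (assumption of this sketch).
        if n > 1 and n == len(trivial[gj]) == len(edges) == len(sources) == len(targets):
            for s, t in edges:
                parent[find(s)] = find(t)

    # Step 3: outcome, one collection per connected group of pages.
    collections = defaultdict(set)
    for p in page_group:
        collections[find(p)].add(p)
    return sorted(collections.values(), key=sorted)


# Hypothetical input: pages a1/a2 are replicas, b1/b2 are replicas, c1 is unique.
page_group = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
links = [("a1", "b1"), ("a2", "b2"), ("a1", "c1")]
result = grow_clusters(page_group, links)
```

In this input the links from cluster A to cluster B pair the pages one-to-one, so the outcome is a cluster of two collections of size 2, while `c1` stays on its own.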

21 Another example of cluster algorithm with two possible clusters.

22 Cont…

23 Quality of Clusters

24 Concept of fingerprints:
Entire-document fingerprint, four-line fingerprint, two-line fingerprint.
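A rough sketch of the three fingerprint granularities: a page is split into chunks (the whole document, every four lines, or every two lines) and each chunk is hashed. The use of SHA-1, the whitespace normalization, and the helper names `fingerprints` and `shared_fraction` are assumptions for illustration, not details taken from the slides.

```python
import hashlib


def fingerprints(text, lines_per_chunk=None):
    """Return the set of chunk fingerprints for a page.

    lines_per_chunk=None fingerprints the entire document as one chunk;
    4 or 2 gives four-line or two-line fingerprints.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines_per_chunk is None:
        chunks = ["\n".join(lines)]
    else:
        chunks = ["\n".join(lines[i:i + lines_per_chunk])
                  for i in range(0, len(lines), lines_per_chunk)]
    return {hashlib.sha1(c.encode("utf-8")).hexdigest() for c in chunks}


def shared_fraction(text_a, text_b, lines_per_chunk=None):
    """Fraction of chunk fingerprints the two pages have in common."""
    fa = fingerprints(text_a, lines_per_chunk)
    fb = fingerprints(text_b, lines_per_chunk)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / max(len(fa), len(fb))


doc1 = "line one\nline two\nline three\nline four"
doc2 = "line one\nline two\nline three\nline FOUR (edited)"
```

With these two hypothetical pages, the entire-document and four-line fingerprints share nothing because one line changed, while the two-line fingerprints still share the first chunk. Smaller chunks tolerate small edits but produce more fingerprints to store and compare.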

25

26

27

28 Exploiting Clusters Improving Crawling:
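One way replication information could be used during a crawl is sketched below: URLs under a known mirror prefix are rewritten to the primary copy, so each replicated page is fetched only once. Everything here is hypothetical: the `mirror_of` table is assumed to come from an earlier cluster computation, `fetch_links` stands in for an actual HTTP fetch and link extraction, and the example URLs are made up.

```python
from collections import deque


def crawl(start, fetch_links, mirror_of):
    """Breadth-first crawl that fetches each replicated page only once.

    fetch_links: callable, url -> list of outgoing URLs (stand-in for a real fetch)
    mirror_of:   dict of mirror URL prefix -> primary URL prefix
    """
    def canonical(url):
        # Rewrite a mirror URL to the corresponding URL on the primary site.
        for mirror, primary in mirror_of.items():
            if url.startswith(mirror):
                return primary + url[len(mirror):]
        return url

    seen, queue, fetched = set(), deque([canonical(start)]), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        fetched.append(url)
        for out in fetch_links(url):
            queue.append(canonical(out))
    return fetched


# Hypothetical in-memory "web" used instead of real network access.
web = {
    "http://a.example/docs/": ["http://a.example/docs/p1", "http://b.example/mirror/p1"],
    "http://a.example/docs/p1": [],
    "http://b.example/mirror/p1": [],
}
mirror_of = {"http://b.example/mirror/": "http://a.example/docs/"}
fetched = crawl("http://a.example/docs/", lambda u: web.get(u, []), mirror_of)
```

Here the mirror copy of `p1` is never fetched, because its URL is rewritten to the primary copy that has already been visited.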

29

30

31 Improving Search engine results:
A prototype reduces clutter in the result list. The prototype provides links to ‘Collections’ and ‘Replica’.
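The clutter reduction can be sketched as collapsing ranked results that belong to the same cluster into one entry, with the other copies kept behind a replica list. The function name `collapse_replicas`, the `cluster_of` lookup, and the example URLs are hypothetical; the slides do not describe how the prototype is implemented.

```python
def collapse_replicas(results, cluster_of):
    """Collapse ranked results so each cluster of replicas appears once.

    results:    ranked list of URLs
    cluster_of: dict of URL -> cluster id for pages known to be replicated
    """
    first_entry = {}  # cluster id -> entry already shown for that cluster
    output = []
    for url in results:
        cid = cluster_of.get(url)
        if cid is not None and cid in first_entry:
            # A copy of something already shown: file it under 'Replica'.
            first_entry[cid]["replicas"].append(url)
            continue
        entry = {"url": url, "replicas": []}
        if cid is not None:
            first_entry[cid] = entry
        output.append(entry)
    return output


# Hypothetical ranked result list with two copies of the same page.
results = ["http://a.example/faq", "http://b.example/mirror/faq", "http://c.example/other"]
cluster_of = {"http://a.example/faq": 1, "http://b.example/mirror/faq": 1}
collapsed = collapse_replicas(results, cluster_of)
```

The highest-ranked copy stays in the list and the remaining copies are attached to it, so the result page shows two entries instead of three.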

32 Conclusion

33 Discussion Question: Which should be larger, the cluster size or the collection size?

34 Discussion Question: Suppose p is similar to p′ and p″ is similar to p′, but p and p″ are not similar. Do you think all three pages should be considered similar?
