
1 Efficient Crawling Through URL Ordering
Junghoo Cho, Hector Garcia-Molina, Lawrence Page
Stanford InfoLab

2 What is a crawler?
- A program that automatically retrieves pages from the Web.
- Widely used for search engines.
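For illustration only (not part of the slides), a toy crawler along these lines can be sketched in a few lines of Python; the seed URL, page limit, and helper names here are placeholders.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Toy breadth-first crawler: fetch a page, extract its links, repeat."""
    frontier, visited = [seed], set()
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return visited
```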

3 Challenges
- There are many pages on the Web (major search engines have indexed more than 100M pages).
- The size of the Web is growing enormously.
- Most pages are not very interesting.
→ In most cases, it is too costly or not worthwhile to visit the entire Web space.

4 Good crawling strategy
- Make the crawler visit "important pages" first.
  - Save network bandwidth
  - Save storage space and management cost
  - Serve quality pages to the client application

5 Outline
- Importance metrics: what are important pages?
- Crawling models: how is a crawler evaluated?
- Experiments
- Conclusion & future work

6 Importance metric
The metric for determining whether a page is HOT:
- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

7 Similarity to a driving query
- Importance is measured by the closeness of the page to the topic (e.g., the number of occurrences of the topic word in the page).
- Personalized crawler: gathers the pages related to a specific topic.
  Example: "Sports", "Bill Clinton"
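A crude version of this metric, sketched here for illustration (the exact similarity measure is not specified on the slide), simply counts occurrences of the topic word in the page text:

```python
import re

def similarity(page_text, topic_word):
    """Crude textual importance: how many times the topic word appears."""
    return len(re.findall(re.escape(topic_word), page_text, re.IGNORECASE))

# Example: a page mentioning "sports" three times scores 3 for the query "sports".
print(similarity("Sports news: college sports and pro sports scores", "sports"))
```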

8 Importance metric
The metric for determining whether a page is HOT:
- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

9 Backlink-based metrics
- Backlink count
  - The number of pages pointing to the page
  - A citation metric
- PageRank
  - A weighted backlink count
  - Weights are defined iteratively

10 Example (link graph over pages A–F)
BackLinkCount(F) = 2
PageRank(F) = PageRank(E)/2 + PageRank(C)
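To make the iterative definition concrete, here is a minimal sketch of the simplified PageRank computation implied by the slide (no damping factor). The adjacency list below is a hypothetical graph chosen only so that E has two out-links and C points solely to F, matching the formula above; the full graph on the slide is not recoverable from the transcript.

```python
def pagerank(out_links, iterations=50):
    """Simplified PageRank: each page splits its score evenly
    among the pages it links to, iterated toward a fixed point."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            if not targets:
                continue
            share = rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical graph consistent with the slide's formula:
# E has two out-links, C links only to F, and F has two backlinks,
# so PageRank(F) = PageRank(E)/2 + PageRank(C).
graph = {
    "A": ["B"],
    "B": ["C"],
    "C": ["F"],
    "D": ["E"],
    "E": ["F", "A"],
    "F": ["D"],
}
print(pagerank(graph)["F"])
```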

11 Ordering metric
- The metric a crawler uses to "estimate" the importance of a page before visiting it
- The ordering metric can be different from the importance metric

12 Crawling models
- Crawl and stop
  - Keep crawling until the local disk space is full.
- Limited buffer crawl
  - Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

13 Crawl and stop model

14 Crawling models
- Crawl and stop
  - Keep crawling until the local disk space is full.
- Limited buffer crawl
  - Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

15 Limited buffer model
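As an illustration of the limited buffer model (a sketch under the assumption that the "seemingly unimportant" pages thrown out are those with the lowest estimated importance), a bounded buffer can be maintained with a min-heap:

```python
import heapq

def keep_top_pages(pages, buffer_size):
    """Limited-buffer sketch: crawl everything, but retain only the
    buffer_size pages with the highest estimated importance, evicting
    the current minimum whenever the buffer is full.
    `pages` yields (estimated_importance, url) pairs."""
    buffer = []  # min-heap of (importance, url)
    for importance, url in pages:
        if len(buffer) < buffer_size:
            heapq.heappush(buffer, (importance, url))
        elif importance > buffer[0][0]:
            heapq.heapreplace(buffer, (importance, url))
    return [url for _, url in sorted(buffer, reverse=True)]
```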

16 Architecture
(Diagram of the experimental setup: the WebBase crawler crawls the Stanford WWW into a repository; the virtual crawler, consisting of an HTML parser and a URL selector, reads crawled pages from the repository, extracts URLs into a URL pool and page info into a page-info store, and hands the selected URL back for the next visit.)

17 Experiments
- Backlink-based importance metrics
  - Backlink count
  - PageRank
- Similarity-based importance metric
  - Similarity to a query word

18 Ordering metrics in the experiments
- Breadth-first order
- Backlink count
- PageRank
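For concreteness, here is a hedged sketch of how backlink count can serve as an ordering metric during a crawl. The function and variable names are illustrative, and `fetch_links` stands in for downloading a page and extracting its out-links:

```python
def crawl_by_backlink_count(seed_urls, fetch_links, max_pages):
    """Backlink-count URL ordering: among the URLs seen so far,
    always visit the one with the most known in-links from crawled pages."""
    backlinks = {u: 0 for u in seed_urls}  # unvisited URL -> in-links seen so far
    visited = set()
    while backlinks and len(visited) < max_pages:
        # Pick the unvisited URL with the highest current backlink count.
        url = max(backlinks, key=backlinks.get)
        del backlinks[url]
        visited.add(url)
        for link in fetch_links(url):
            if link in visited:
                continue
            backlinks[link] = backlinks.get(link, 0) + 1
    return visited
```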


20 Similarity-based crawling
- The content of a page is not available before it is visited.
- Essentially, the crawler has to "guess" the content of the page.
- More difficult than backlink-based crawling.

21 Promising page
(Diagram: a page may be promising for the query "sports" if the anchor text of a link to it contains the query word, if its URL does, e.g. …/sports.html, or if its parent page is HOT.)

22 Virtual crawler for similarity-based crawling
- A page is promising if:
  - The query word appears in its anchor text
  - The query word appears in its URL
  - The page pointing to it is an "important" page
- Visit "promising" pages first.
- Visit "non-promising" pages in ordering-metric order.
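A minimal sketch of these two rules, with illustrative names (the slide does not specify the exact interfaces):

```python
def is_promising(url, anchor_text, parent_is_hot, query_word):
    """'Promising page' test from the slide: the query word appears in the
    anchor text of the link to the page, in its URL, or the linking page is hot."""
    q = query_word.lower()
    return q in anchor_text.lower() or q in url.lower() or parent_is_hot

def next_url(promising_queue, other_queue):
    """Visit promising URLs first; otherwise fall back to the queue
    kept in ordering-metric order (e.g. by estimated PageRank)."""
    if promising_queue:
        return promising_queue.pop(0)
    if other_queue:
        return other_queue.pop(0)
    return None
```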


24 Conclusion
- PageRank is generally good as an ordering metric.
- By applying a good ordering metric, it is possible to gather important pages quickly.

25 Future work
- Limited buffer crawling model
- Replicated page detection
- Consistency maintenance

26 Problem
- In what order should a crawler visit Web pages to get the pages we want?
- How can we get important pages first?

27 WebBase
- A system for creating and maintaining a large local repository of Web pages
- High indexing speed (50 pages/sec) and a large repository (150 GB)
- Load-balancing scheme to prevent servers from crashing

28 Virtual Web crawler
- The crawler used for the experiments
- Runs on top of the WebBase repository
- No load balancing
- The dataset was restricted to the Stanford domain

29 Available information (before visiting a page)
- Anchor text
- URL of the page
- The content of the page pointing to it

