

1 Web Search – Summer Term 2006 IV. Web Search - Crawling (part 2) (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Crawling - Recap from last time
General procedure: Continuously process a list of URLs and collect the respective web pages and the links they contain.
Two problems: size and frequent changes.
Page selection: based on metrics, i.e.
- Importance metric (goal)
- Ordering metric (selection)
- Quality metric (evaluation)
Experimental verification with a representative test collection.


6 Crawling - Recap from last time (cont.)
Page refresh: For estimating the rate of change, see the last lecture (note: other studies exist, e.g. [5]).
Observations:
- Frequent changes
- Significant differences, e.g. among domains
Hence: an update rule is necessary.

7 3. Page Refresh (Update Rules)
Problem: The web changes continuously.
Goal: Download and update pages in a way that keeps the index as fresh and as young as possible, given the limited resources.
Distinguish between:
- Periodic crawlers: Download K pages and stop; repeat this after some time t and replace the old collection with the new one.
- Incremental crawlers: Continuously crawl the web and incrementally update the collection.

8 3.2 Incremental Crawlers
Main goal: Keep the local collection up-to-date. Two measures: freshness and age (definitions following [6]).
Freshness of a page p_i at time t:
  F(p_i; t) = 1 if the local copy of p_i is up-to-date at time t, 0 otherwise
Freshness of a local collection P = {p_1, ..., p_N} at time t:
  F(P; t) = (1/N) * sum_{i=1..N} F(p_i; t)

9 3.2 Incremental Crawlers
Age of a page p_i at time t:
  A(p_i; t) = 0 if the local copy of p_i is up-to-date at time t, otherwise t - (time of the first modification of p_i since its last synchronization)
Age of a local collection P at time t:
  A(P; t) = (1/N) * sum_{i=1..N} A(p_i; t)

10 3.2 Incremental Crawlers
Time average of the freshness of a page p_i:
  F_avg(p_i) = lim_{t -> inf} (1/t) * integral_0^t F(p_i; tau) d tau
Time average of the freshness of a local collection P:
  F_avg(P) = lim_{t -> inf} (1/t) * integral_0^t F(P; tau) d tau
(Time average of age: analogous.)
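The two point-in-time measures can be sketched in a few lines of Python (an illustrative model, not from the lecture: each page stores the time of its last synchronization and, if it has changed remotely since then, the time of the first unseen change):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Page:
    last_sync: float                      # when the local copy was last refreshed
    first_unseen_change: Optional[float]  # earliest change since last_sync (None = up to date)

def freshness(p: Page, t: float) -> float:
    # F(p; t) = 1 if the local copy is up to date at time t, else 0
    return 1.0 if p.first_unseen_change is None else 0.0

def age(p: Page, t: float) -> float:
    # A(p; t) = 0 if up to date, else time elapsed since the first unseen change
    return 0.0 if p.first_unseen_change is None else t - p.first_unseen_change

def collection_freshness(pages: Sequence[Page], t: float) -> float:
    return sum(freshness(p, t) for p in pages) / len(pages)

def collection_age(pages: Sequence[Page], t: float) -> float:
    return sum(age(p, t) for p in pages) / len(pages)

# Three pages observed at t = 10: one fresh, two stale
pages = [Page(last_sync=8.0, first_unseen_change=None),
         Page(last_sync=5.0, first_unseen_change=6.0),
         Page(last_sync=9.0, first_unseen_change=9.5)]
print(collection_freshness(pages, 10.0))  # 1/3 of the collection is fresh
print(collection_age(pages, 10.0))        # (0 + 4.0 + 0.5) / 3 = 1.5
```

The time averages from the slide above are then just the limits of these quantities averaged over the observation interval.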

11 Example for Freshness and Age
[Figure from [6]: timeline of a single element; freshness drops to 0 when the element is changed and returns to 1 when it is synchronized, while age grows linearly from the change until the synchronization.]

12 Design alternative 1: Batch mode vs. steady crawler
Batch-mode crawler: Periodic update of all pages of the collection.
Steady crawler: Continuous updating.
[Figure: freshness over time (in months) for a batch-mode crawler vs. a steady crawler.]
Note: Assuming page changes follow a Poisson process, we can prove that the average freshness over time is identical in both cases (given the same average crawling speed!).

13 Design alternative 2: In-place vs. shadowing
Replace the old version of a page with the new one either in place or via shadowing, i.e. only after all pages of one crawl have been downloaded.
Shadowing keeps two collections: the crawler's collection and the current collection.
[Figure: in-place vs. shadowing updates for a batch-mode and a steady crawler.]

14 Design alternative 3: Fixed vs. variable frequency
Fixed frequency / uniform refresh policy: Same access rate for all pages (independent of their actual rate of change).
Variable frequency: Access pages depending on their rate of change. Example: proportional refresh policy.

15 Variable frequency update
Obvious assumption for a good strategy: Visit a page that changes frequently more often. Wrong!!!
The optimal update strategy (if we assume a Poisson process of changes) looks like this:
[Figure: optimum update time plotted against the rate of change of a page.]

16 Variable frequency update (cont.)
Why is this a better strategy? Illustration with a simple example:
[Figure: two pages p_1 and p_2 with different rates of change.]
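The effect can also be checked numerically with the closed-form freshness of the Poisson model. This sketch uses made-up numbers (change rates of 9 and 1 per day and a budget of 2 synchronizations per day), not the lecture's own example:

```python
import math

def page_freshness(lam: float, f: float) -> float:
    # Time-averaged freshness of a page with Poisson change rate `lam`,
    # synchronized `f` times per day: (f/lam) * (1 - e^(-lam/f)).
    if f <= 0.0:
        return 0.0
    return (f / lam) * (1.0 - math.exp(-lam / f))

def collection_freshness(rates, freqs):
    return sum(page_freshness(l, f) for l, f in zip(rates, freqs)) / len(rates)

rates = [9.0, 1.0]  # hypothetical: p_1 changes 9x per day, p_2 once per day
# A total budget of 2 synchronizations per day, split three ways:
proportional = collection_freshness(rates, [1.8, 0.2])  # proportional to change rate
uniform      = collection_freshness(rates, [1.0, 1.0])  # same frequency for both
skewed       = collection_freshness(rates, [0.1, 1.9])  # favour the slow page

print(round(proportional, 3), round(uniform, 3), round(skewed, 3))
# proportional < uniform < skewed: budget spent on the rapidly changing
# page is largely wasted, since that page is almost always stale anyway
```

This is exactly the counterintuitive point of the slide: the proportional policy is even worse than the uniform one, and shifting effort toward the slowly changing page improves the collection's freshness.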

17 Summary of the different design alternatives
- Steady vs. batch-mode crawler
- In-place update vs. shadowing
- Variable vs. fixed frequency

18 3.3 Example of an Incremental Crawler
Two main goals:
- Keep the local collection fresh: Regular, best-possible updates of the pages in the index.
- Continuously improve the quality of the collection: Replace existing low-quality pages with new pages of higher quality.

19 3.3 Example of an Incremental Crawler

WHILE (TRUE)
  URL = SELECT_TO_CRAWL(ALL_URLS);
  PAGE = CRAWL(URL);
  IF (URL IN COLL_URLS) THEN
    UPDATE(URL, PAGE);
  ELSE
    TMP_URL = SELECT_TO_DISCARD(COLL_URLS);
    DISCARD(TMP_URL);
    SAVE(URL, PAGE);
    COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL};
  END IF;
  NEW_URLS = EXTRACT_URLS(PAGE);
  ALL_URLS = ALL_URLS U NEW_URLS;
END WHILE
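The pseudocode translates almost line by line into Python. The following sketch stubs out crawling, URL extraction, and the two selection functions with random choices (a real crawler would rank by an ordering metric), and it deviates from the pseudocode in one small way: it only discards a page when the collection is full, rather than on every miss:

```python
import random

def crawl(url):
    # Stub: a real crawler would fetch the page over HTTP.
    return f"<html>content of {url}</html>"

def extract_urls(page, rng):
    # Stub: a real crawler would parse outlinks from the page.
    return {f"page{rng.randrange(50)}" for _ in range(3)}

def incremental_crawl(seed_urls, max_coll_size, steps, rng_seed=0):
    rng = random.Random(rng_seed)
    all_urls = set(seed_urls)  # every URL seen so far (ALL_URLS)
    collection = {}            # COLL_URLS with the stored pages
    for _ in range(steps):
        url = rng.choice(sorted(all_urls))  # stand-in for SELECT_TO_CRAWL
        page = crawl(url)
        if url in collection:
            collection[url] = page          # UPDATE the local copy in place
        else:
            if len(collection) >= max_coll_size:
                victim = rng.choice(sorted(collection))  # stand-in for SELECT_TO_DISCARD
                del collection[victim]      # DISCARD to make room
            collection[url] = page          # SAVE the new page
        all_urls |= extract_urls(page, rng)  # add the extracted URLs
    return collection, all_urls

coll, seen = incremental_crawl(["page0"], max_coll_size=10, steps=100)
print(len(coll), len(seen))  # the collection stays capped at 10 pages
```

The key structural point survives the simplification: the set of known URLs only grows, while the stored collection has a fixed maximum size and is continuously refreshed and replaced.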

20 3.3 Example of an Incremental Crawler
[Architecture diagram: a Ranking Module scans ALL_URLS and adds URLs to or removes URLs from COLL_URLS, discarding low-ranked pages; an Update Module pops URLs from COLL_URLS, pushes them back after refreshing, and uses checksums to detect changes; the Crawl Module crawls the selected URLs, updates/saves pages in the collection, and feeds newly extracted URLs back into ALL_URLS.]

21 References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. - Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
[5] D. Fetterly, M. Manasse, M. Najork, J. Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
[6] J. Cho, H. Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000

