
1 Web Search – Summer Term 2006 IV. Web Search - Crawling (c) Wolfgang Hürst, Albert-Ludwigs-University

2 General Web Search Engine Architecture
[Architecture diagram, cf. [1] Fig. 1: a client issues queries to the query engine and receives ranked results; usage feedback informs the ranking module. Crawler(s) fetch pages from the WWW under crawl control and store them in the page repository; the collection analysis module and the indexer module build the indexes (structure, utility, text), which serve the query engine.]

3 Crawler (Robots, Spiders) - 1. Intro
Goal: Get web pages for indexing
Basic procedure (simplified; see the sketch after this list):
1. Given: an initial set of URLs U (in some order)
2. Get the next URL u from U
3. Download the web page p(u)
4. Extract all URLs from p(u) and add them to U
5. Send p(u) to the indexer
6. Continue with step 2 until U is empty (or some stop criterion is fulfilled)
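A minimal, runnable sketch of this loop in Python. All names here (crawl, extract_urls, send_to_indexer) and the use of the requests library are illustrative assumptions, not code from the course; robots.txt handling, politeness delays, and URL normalization are deliberately left out:

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests  # assumed available; any HTTP client would do

    def send_to_indexer(url, page):
        # Stub standing in for the indexer interface (step 5).
        print(f"indexed {url} ({len(page)} chars)")

    def extract_urls(base_url, html):
        # Naive regex-based link extraction; a real crawler would use an HTML parser.
        return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

    def crawl(start_urls, max_pages=100):
        queue = deque(start_urls)              # U: the URL frontier
        seen = set(start_urls)
        while queue and max_pages > 0:         # stop criterion: empty U or page budget
            url = queue.popleft()              # step 2: next URL u from U
            try:
                page = requests.get(url, timeout=5).text   # step 3: download p(u)
            except requests.RequestException:
                continue                       # skip unreachable pages
            for u in extract_urls(url, page):  # step 4: extract URLs, add new ones to U
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
            send_to_indexer(url, page)         # step 5: hand the page to the indexer
            max_pages -= 1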

4 1. Introduction (Cont.)
Problem: The web is too big and changes too fast
Page selection based on:
- Coverage (absolute vs. relative)
- Quality (e.g. index "good" pages)
- Efficiency (e.g. no duplicates)
- Etiquette (e.g. minimize server load)
- Freshness (update how often? what?)
Pragmatic issues:
- Parallelization of the crawling process
- Parsing web pages
- Defending against spam

5 2. Page Selection Rules
Which pages should we download?
Goal: Download only "important" pages
Questions:
- How can we describe importance?
- How can we estimate importance?
- How can we judge the quality of different crawlers?
To answer these questions we need:
1. A mathematical model / measure of importance
2. A selection criterion that maximizes importance based on this measure
3. A measure to compare the performance of different crawlers

6 2.1 Importance Metrics
Interest-driven metric IS(P): Index pages of a certain interest to your users
- Use the traditional vector model
- Problem: Requires queries Q and estimated IDFs
- Alternatively: Use a hierarchy of topics (estimating the topic based on link structure)
Popularity-driven metric IB(P): Index popular pages
- Popularity based on (e.g.) backlinks or PageRank
Location-driven metric IL(P): Index based on local information (the URL)
- Examples: suffix (.com, .edu, ...), number of slashes, ... (see the sketch below)
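A small Python sketch of the two metrics that can be computed from crawl-local data; the feature choices and weights in location_score are made-up illustrations, not values from the literature:

    from collections import Counter
    from urllib.parse import urlparse

    # IB(P): popularity as in-link (backlink) count over the link set known so far.
    def backlink_counts(links):
        # links: iterable of (source_url, target_url) pairs already discovered
        return Counter(target for _, target in links)

    # IL(P): location-driven score computed from the URL alone.
    def location_score(url):
        parsed = urlparse(url)
        score = 0.0
        if parsed.hostname and parsed.hostname.endswith((".edu", ".gov")):
            score += 1.0                        # assumed preference for .edu/.gov
        score -= 0.1 * parsed.path.count("/")   # deeper pages score lower (assumed weight)
        return score

Note the asymmetry the next slide builds on: backlink_counts can only approximate IB(P) from the links seen so far, while location_score needs nothing but the URL itself.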

7 2.2 Ordering Metrics
Goal: Sort the URLs in such a way that we end up with the most important subset of pages
Problem: Requires an estimate of the importance of the respective web page
For popularity-driven metrics IB(P):
- E.g. use the number of backlinks seen so far
For location-driven metrics IL(P):
- All required information is available!
For similarity-/interest-driven metrics IS(P):
- Needs queries, estimated IDFs, and a guess about the page's content (e.g. via anchor text or the text surrounding the link)

8 Example of a Crawling Algorithm (see Figure 1 in [3]; a Python sketch of reorder_queue follows below)
enqueue(url_queue, starting_url);
while (not empty(url_queue)) {
  url = dequeue(url_queue);
  page = crawl_page(url);
  enqueue(crawled_pages, (url, page));
  url_list = extract_urls(page);
  for each u in url_list {
    enqueue(links, (url, u));
    if (u not in url_queue) and ((u, -) not in crawled_pages)
      then enqueue(url_queue, u);
  }
  reorder_queue(url_queue);
}
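What reorder_queue might look like under the backlink-count ordering metric from slide 7, sketched in Python (the data-structure choices are assumptions for illustration, not the code from [3]):

    from collections import Counter

    # Reorder the frontier so URLs with the most backlinks seen so far come first.
    def reorder_queue(url_queue, links):
        # url_queue: list of pending URLs; links: (source_url, target_url) pairs seen so far
        backlinks = Counter(target for _, target in links)
        # Sort descending by backlink count; URLs with no known backlinks default to 0.
        url_queue.sort(key=lambda u: backlinks[u], reverse=True)

Called after each crawled page, this keeps the URL with the highest IB estimate at the front of the frontier. In practice a priority queue (e.g. heapq) avoids re-sorting the whole frontier after every page.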

9 2.3 Quality Metrics
A quality metric to describe the performance of a crawler: distinguish two cases
1. Crawl & Stop: The crawler gets K pages
- Perfect crawler: delivers R_1, R_2, ..., R_K with I(R_i) ≥ I(R_j) for i < j (the K most important pages)
- Real crawler: only delivers M ≤ K of these R_i pages
Definition: Performance P of crawler C
P(C) = (M * 100) / K
Random crawler: P(C) = (K * 100) / T, with T = number of pages in the web (see the sketch below)
2. Crawl & Stop with Threshold: Define an importance target G and get the pages with I(P) > G (see the literature, e.g. [1], Section 2.1.2)
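Both formulas as a tiny Python sketch that simply evaluates the definitions above (the variable names are assumed):

    # Crawl & Stop: share of the K crawled pages that belong to the
    # K most important ("hot") pages, scaled to 0..100.
    def crawl_and_stop_performance(crawled, importance, k):
        # importance: dict mapping page -> importance value I(P)
        hot = set(sorted(importance, key=importance.get, reverse=True)[:k])
        m = sum(1 for page in crawled[:k] if page in hot)   # M <= K
        return 100.0 * m / k                                # P(C) = (M * 100) / K

    # Expected performance of a random crawler over a web of T pages:
    def random_crawler_performance(k, t):
        return 100.0 * k / t                                # P(C) = (K * 100) / T

For example, with K = 1,000 and T = 1,000,000 the random crawler scores only 0.1, so even a modest ordering metric is a large improvement.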

10 Example: Stanford WebBase Crawler
Database: 225,000 Stanford University web pages
Crawler: Stanford WebBase crawler (with different ordering metrics)
Importance metric: IB(P)
Quality metric: Crawl & Stop with Threshold
(See [1])

11 3. Page Refresh (Update Rules)
Problem: The web is continuously changing
Goal: Index and update pages in a way that keeps the index as fresh and young as possible (given the limited resources)
Distinguish between:
- Periodic crawlers: Download K pages and stop; repeat this after some time t and replace the old collection with the new one
- Incremental crawlers: Continuously crawl the web and incrementally update the collection

12 3.1 Change Frequency of the Web
Experiment (Stanford) to answer the following questions:
- How long is the lifespan of a web page?
- How often do web pages change?
- How long does it take until (e.g.) 50% of all web pages have changed?
- Are there mathematical models to describe these changes?
Experiment with a database of:
- 720,000 pages from 270 sites
- ca. 3,000 pages per site ("window of pages")
- Sites selected based on popularity (PageRank) and only with the owner's permission
- Archived over 4 months (once daily)
Source of the following diagrams: Cho & Garcia-Molina [4]

13 3.1 Change Frequency of the Web
How often do web pages change?
[Charts: overall and domain-dependent change frequencies]
Observations:
- Pages change rather frequently
- Significant differences between domains (.com, .org, .edu, .gov)

14 3.1 Change Frequency of the Web
How long is the lifespan of a web page?
[Charts: overall and domain-dependent lifespans]
Note: Only the "visible" lifespan is observed here
Two methods were used to estimate the lifespan over the 4 months

15 3.1 Change Frequency of the Web
How long does it take until (e.g.) 50% of all web pages have changed?
[Charts: overall and domain-dependent results]
Conclusion: Especially the clear differences in the domain-dependent case suggest taking change frequency into account during crawling

16 3.1 Change Frequency of the Web
Are there mathematical models to describe the changes?
Assumption: Page changes follow a Poisson process
With this: Estimate the probability that a page has changed by a particular time t (see the sketch below)
[Charts: change intervals of pages, for pages that change every 10 days on average vs. every 20 days on average]
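Under a Poisson process with change rate λ (changes per day), the probability that a page has changed at least once within t days is 1 - e^(-λt). A small sketch in Python; the rates below are illustrative values, not measurements from [4]:

    import math

    def prob_changed_by(lam, t):
        # P(page changed at least once within t days) under a Poisson model with rate lam
        return 1.0 - math.exp(-lam * t)

    # lam can be estimated from the archive as (observed changes / observation days).
    print(prob_changed_by(1 / 10, 10))  # page changing every 10 days on average: ~0.63
    print(prob_changed_by(1 / 20, 10))  # page changing every 20 days on average: ~0.39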

17 References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001 - Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)

