
1 Discussion Class 6: Crawling the Web

2 Discussion Classes
Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment.
When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear.
Suggestions: Do not be shy at presenting partial answers. Differing viewpoints are welcome.

3 Question 1: Background
(a) When was this paper written, by whom, and why?
(b) What, if anything, has changed since this paper was written?
(c) How has Yahoo changed?

4 Question 2: Search Engine Architecture

5 Question 2: Search Engine Architecture
What is the function of the following?
(a) Crawl control
(b) Indexer module
(c) Structure index
(d) Ranking module
(e) Page repository

6 Question 3: What pages should the crawler download?
(a) What is the problem? Why do crawlers not download every page?
(b) What can a crawler know about a page without downloading it?
(c) The paper describes several importance measures: interest-driven, popularity-driven, location-driven. How do they apply?
(d) How do these importance measures interact with the ordering metrics?

7 Question 4: How should the crawler refresh pages?
(a) What is the problem?
(b) The paper discusses a "freshness" metric. What is this? Do you consider it a good metric?
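
As a starting point for part (b), one possible reading of a freshness metric is the fraction of locally stored pages whose copy still matches the live page. The sketch below assumes that reading; the function name and the dictionary inputs are hypothetical and are not taken from the paper.

```python
def collection_freshness(local_copies, live_pages):
    """Return the fraction of stored pages that still match the live Web.

    Assumption: a stored page counts as fresh (1) when its content equals
    the live content and stale (0) otherwise; the collection score is the
    average over all stored pages. Both arguments map URL -> page content.
    """
    if not local_copies:
        return 0.0
    fresh = sum(
        1
        for url, content in local_copies.items()
        if live_pages.get(url) == content
    )
    return fresh / len(local_copies)


# Hypothetical data: two stored pages, one of which changed on the live site.
stored = {"http://example.com/a": "v1", "http://example.com/b": "v1"}
live = {"http://example.com/a": "v1", "http://example.com/b": "v2"}
score = collection_freshness(stored, live)
```

A metric of this binary kind does not distinguish a page that changed a minute ago from one that has been out of date for a year, which may be a useful angle when judging whether it is a good metric.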

8 Question 5: How should the load on the visited Web sites be minimized?
(a) Why is this a problem?
(b) What can a crawler do to minimize the problem?
(c) What can a web site do to minimize the problem?
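
To make parts (b) and (c) concrete: a web site can publish a robots.txt file, and a crawler can check it before fetching. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleBot`, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site might publish (part c).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def make_policy(robots_text):
    """Parse robots.txt text into a policy object the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

# What a well-behaved crawler would check before each request (part b).
allowed_public = policy.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = policy.can_fetch("ExampleBot", "http://example.com/private/x")
delay_seconds = policy.crawl_delay("ExampleBot")
```

A crawler that honors these answers skips disallowed paths and waits at least `delay_seconds` between requests to the same host. Note that `Crawl-delay` is a widely used extension rather than part of the original robots exclusion convention, so not every site or crawler supports it.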

9 Question 6: How should the crawling process be parallelized?
(a) Why should the crawling process be parallelized?
(b) What are the principal options?
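
One option that could come up under part (b) is to partition the URL space by host, so that each crawler process owns a fixed set of sites. The sketch below is an assumption about how such a partition might be written, not the paper's method; `assign_crawler` and its parameters are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in the range [0, num_crawlers).

    Hashing only the host name sends every page of a site to the same
    crawler, which keeps per-site rate limiting local to one process.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Hypothetical URLs: both pages on the same host go to the same crawler.
first = assign_crawler("http://example.com/a.html", 4)
second = assign_crawler("http://example.com/b/c.html", 4)
```

The trade-off worth discussing is that links pointing to another host must be handed to a different crawler, which adds communication between processes.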

