
1

2 Outline
What is the Web
Information retrieval from the Web
Search Engine
Web Crawler
How does a web crawler work
Web crawler policies
Synchronization algorithms
Conclusion

3 What is the Web The Internet is a global system of interconnected computer networks that use the standardized Internet Protocol Suite. It is a network of networks made up of millions of private, public, academic, business, and government networks. The World Wide Web is the collection of interlinked hypertext documents and other resources carried over this infrastructure, alongside services such as electronic mail.

4 Information retrieval from the Web Viewing a Web page on the World Wide Web normally begins either by typing the URL of the page into a Web browser or by following a hyperlink to that page or resource.
1. First, the server-name portion of the URL is resolved into an IP address using the global, distributed Internet database known as the Domain Name System (DNS).
2. The browser then requests the resource by sending an HTTP request to the Web server at that address.
3. Finally, the browser renders the page on the screen as specified by its HTML, CSS, and other Web languages.
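These steps can be sketched in a few lines of Python (a rough illustration rather than a real browser: the URL and host below are placeholder values, and rendering is not shown):

# Sketch of steps 1-2 above; the URL is an arbitrary example.
import socket
import urllib.request

url = "http://example.com/"
host = "example.com"

# Step 1: resolve the server name to an IP address via DNS.
ip_address = socket.gethostbyname(host)
print("Resolved", host, "to", ip_address)

# Step 2: send an HTTP request to the Web server and read the response.
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")
print("Received", len(html), "characters of HTML")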

5 Search Engine A search engine is an information retrieval system designed to help find information stored on a computer system; here it searches for information on the World Wide Web. Search engines use automated software programs known as spiders or bots to survey the Web and build their databases: these programs retrieve Web documents, which are then analyzed and indexed.

6 Search engine operations A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching
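As a toy sketch of the indexing and searching steps (crawling is covered on the following slides), the snippet below builds an inverted index over a couple of made-up documents and answers a query against it:

# Toy indexing and searching; the documents and query are invented examples.
from collections import defaultdict

documents = {
    "page1": "web crawlers browse the web",
    "page2": "search engines index web pages",
}

# Indexing: map each word to the set of pages that contain it.
inverted_index = defaultdict(set)
for page_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(page_id)

# Searching: return the pages containing every word of the query.
def search(query):
    hits = [inverted_index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("web pages"))   # {'page2'}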

7 Web Crawler A web crawler is a program or automated script which browses the World Wide Web in a methodical, automated manner. Crawlers are small programs that 'browse' the Web on the search engine's behalf to collect information. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.

8 A crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, called the crawl frontier.

9 Crawler control module This module determines which links to visit next and feeds them back to the crawlers. The crawlers are given a starting set of URLs, whose pages they retrieve from the Web; they extract the URLs appearing in the retrieved pages and pass this information to the crawl control module, which directs the overall crawling operation.
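Put together, the seed/frontier mechanism of the last two slides might look roughly like the minimal sketch below; it assumes the third-party requests and beautifulsoup4 packages, and the seed URL and page limit are arbitrary example values:

# Minimal frontier-based crawler sketch (not production code).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)      # the crawl frontier: URLs still to visit
    visited = set()              # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()             # crawl control: pick the next URL
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        visited.add(url)
        # Extract hyperlinks and feed them back into the frontier.
        for anchor in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            frontier.append(urljoin(url, anchor["href"]))
    return visited

print(crawl(["http://example.com/"]))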

10 Crawling Policies Some characteristics of the Web make crawling it very difficult: its large volume, its fast rate of change, and dynamic page generation. Hence we need certain policies to make the work easier. They are:
a selection policy that states which pages to download,
a re-visit policy that states when to check for changes to the pages,
a politeness policy that states how to avoid overloading Web sites, and
a parallelization policy that states how to coordinate distributed Web crawlers.

11 Selection policy As a crawler always downloads just a fraction of the Web, it is highly desirable that the downloaded fraction contains the most relevant pages and not just a random sample; hence a selection policy is required. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL, so we need a metric of importance for prioritizing Web pages. Designing a good selection policy has an added difficulty: it must work with partial information, since the complete set of Web pages is not known during crawling.
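One possible sketch of such a prioritized frontier is shown below; the importance score used here (the number of in-links seen so far) is only one illustrative choice of metric, not necessarily the one a real crawler would use:

# Priority-based frontier sketch; the in-link count is a placeholder metric.
import heapq

class PriorityFrontier:
    def __init__(self):
        self._heap = []      # (negated score, URL); heapq is a min-heap
        self._inlinks = {}   # URL -> number of in-links seen so far

    def add_link(self, url):
        self._inlinks[url] = self._inlinks.get(url, 0) + 1
        # Stale duplicate entries are simply left in the heap for brevity.
        heapq.heappush(self._heap, (-self._inlinks[url], url))

    def next_url(self):
        # Return the URL with the highest importance score seen so far.
        return heapq.heappop(self._heap)[1]

frontier = PriorityFrontier()
for link in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    frontier.add_link(link)
print(frontier.next_url())   # http://a.example/ (two known in-links)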

12 Revisit policy The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time; by the time a Web crawler has finished its crawl, many events could have happened. Hence a re-visit policy is required.
Freshness: Let S = {e_1, ..., e_N} be the local database with N elements. We define the freshness of the collection as follows.
The freshness of a local element e_i at time t is
F(e_i, t) = 1 if the local copy of e_i is up to date (identical to the live page) at time t, and 0 otherwise.
The freshness of the local database S at time t is then
F(S, t) = (1/N) * Σ F(e_i, t), summed over i = 1 to N.

13 Age: To capture how old the collection is, we define the metric age as follows.
The age of a local element e_i at time t is
A(e_i, t) = 0 if the local copy of e_i is up to date at time t, and t − t_m(e_i) otherwise, where t_m(e_i) is the time at which the live page was modified.
The age of the local database S at time t is then
A(S, t) = (1/N) * Σ A(e_i, t), summed over i = 1 to N.
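The two metrics can be transcribed almost directly into Python; the timestamps below are made-up example values, and for simplicity the most recent modification time of the live page stands in for t_m:

# Direct transcription of the freshness and age formulas above.
def freshness(elements, t):
    # F(S, t): fraction of elements whose local copy is still up to date at time t.
    return sum(1 for e in elements if e["last_modified"] <= e["last_refreshed"]) / len(elements)

def age(elements, t):
    # A(S, t): average of (t - modification time) over the out-of-date elements.
    total = 0.0
    for e in elements:
        if e["last_modified"] > e["last_refreshed"]:   # local copy is stale
            total += t - e["last_modified"]
    return total / len(elements)

collection = [
    {"last_refreshed": 5.0, "last_modified": 3.0},   # still up to date
    {"last_refreshed": 5.0, "last_modified": 6.5},   # changed after the last refresh
]
print(freshness(collection, t=7.0))   # 0.5
print(age(collection, t=7.0))         # (7.0 - 6.5) / 2 = 0.25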

14 Suppose that the crawler maintains a collection of two pages: e1 and e2. Page e1 changes 9 times per day and e2 changes once a day. Our goal is to maximize the freshness of the database averaged over time.

15 Because our crawler is a tiny one, assume that we can refresh one page per day. Then which page should it refresh: e1 or e2?
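One way to explore the question is a rough Monte Carlo simulation. The sketch below assumes page changes arrive as Poisson processes (9 per day for e1, 1 per day for e2) and that the chosen page is refreshed at the start of every day; these modelling assumptions go beyond what the slide states:

# Rough Monte Carlo estimate of time-averaged freshness under a daily refresh.
import random

def average_freshness(change_rates, refreshed_page, days=1000, steps_per_day=100):
    fresh = [True] * len(change_rates)   # is each local copy currently up to date?
    total = 0.0
    dt = 1.0 / steps_per_day
    for step in range(days * steps_per_day):
        if step % steps_per_day == 0:        # start of a day: refresh the chosen page
            fresh[refreshed_page] = True
        for i, rate in enumerate(change_rates):
            if random.random() < rate * dt:  # the live page changed during this step
                fresh[i] = False
        total += sum(fresh) / len(fresh)
    return total / (days * steps_per_day)

rates = [9.0, 1.0]   # changes per day for e1 and e2
print("refresh e1 daily:", average_freshness(rates, refreshed_page=0))
print("refresh e2 daily:", average_freshness(rates, refreshed_page=1))

Under these assumptions the second figure comes out clearly higher: the crawler does better by refreshing e2, because e1 changes so often that a daily refresh keeps it fresh only briefly, while a refreshed e2 stays fresh for a large part of the day.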

16 Politeness Policy Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server will have a hard time keeping up with requests from multiple crawlers. Hence a politeness policy is required.
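A common way to implement politeness is to enforce a minimum delay between successive requests to the same host, as in the sketch below (the one-second delay is an arbitrary example value; real crawlers also honour robots.txt and per-site crawl-delay settings):

# Per-host rate limiting sketch.
import time
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_request = {}          # host -> time of the last request to it

    def wait_for(self, url):
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # back off before hitting the host again
        self.last_request[host] = time.monotonic()

fetcher = PoliteFetcher(min_delay=1.0)
for url in ["http://example.com/a", "http://example.com/b"]:
    fetcher.wait_for(url)
    print("fetching", url)   # the second fetch waits about one second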

17 Parallelization policy A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. Hence a parallelization policy is required.
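A minimal sketch of this idea: several worker threads download pages in parallel while a shared, locked set of already-seen URLs prevents the same page from being downloaded twice (the fetch function is a stub so the example stays self-contained):

# Parallel crawling sketch with duplicate avoidance.
import threading
from concurrent.futures import ThreadPoolExecutor

seen = set()
seen_lock = threading.Lock()

def fetch(url):
    # Placeholder for a real HTTP download.
    return "contents of " + url

def worker(url):
    with seen_lock:
        if url in seen:          # another worker already claimed this URL
            return None
        seen.add(url)
    return fetch(url)

urls = ["http://example.com/%d" % i for i in range(5)] * 2   # deliberate duplicates
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = [p for p in pool.map(worker, urls) if p is not None]
print(len(pages), "unique pages downloaded")   # 5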

18 Different synchronization algorithms
a) Fixed-order policy: Under the fixed-order policy, we synchronize the local elements in the same order repeatedly.
Algorithm: Fixed-order synchronization
Input: ElemList = {e1, e2, ..., eN}
Procedure:
While (TRUE)
    SyncQueue := ElemList
    While (not Empty(SyncQueue))
        e := Dequeue(SyncQueue)
        Synchronize(e)

19 b) Random-order policy: Under the random-order policy, the synchronization order of elements may differ from one crawl to the next; we randomize the order of elements before every iteration.
Algorithm: Random-order synchronization
Input: ElemList = {e1, e2, ..., eN}
Procedure:
While (TRUE)
    SyncQueue := RandomPermutation(ElemList)
    While (not Empty(SyncQueue))
        e := Dequeue(SyncQueue)
        Synchronize(e)

20 c) Purely-random policy: Under the purely-random policy, whenever we synchronize an element we pick an element at random.
Algorithm: Purely-random synchronization
Input: ElemList = {e1, e2, ..., eN}
Procedure:
While (TRUE)
    e := PickRandom(ElemList)
    Synchronize(e)
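For illustration, the three policies can also be written as ordinary Python functions; Synchronize is replaced by a stub, and the endless loops are bounded to a fixed number of rounds so the sketch terminates:

# The three synchronization policies as plain Python functions.
import random

def synchronize(element):
    print("refreshing", element)   # stub standing in for a real page refresh

def fixed_order(elements, rounds=2):
    for _ in range(rounds):
        for e in elements:                                 # same order every round
            synchronize(e)

def random_order(elements, rounds=2):
    for _ in range(rounds):
        for e in random.sample(elements, len(elements)):   # new permutation each round
            synchronize(e)

def purely_random(elements, rounds=2):
    for _ in range(rounds * len(elements)):
        synchronize(random.choice(elements))               # any element, possibly repeated

fixed_order(["e1", "e2", "e3"])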

21 Conclusion Thus we have seen how a web crawler, guided by sensible selection, re-visit, politeness, and parallelization policies, quietly underpins the working of an ordinary search engine.


23

24 ANY QUERIES?

