
1 Search Engine and Optimization

2 Introduction to Web Search Engines

3 Agenda
– Web Search Engines
– What are Web Crawlers
– Crawling the Web
– Architecture of Web Crawlers
– Main policies in crawling
– Nutch
– Nutch architecture

4 A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented as a list and are often called hits. The results may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Search engines operate algorithmically or with a mixture of algorithmic and human input.

5 [Chart: the three most widely used web search engines and their approximate market share as of late 2010.]
A search engine is a program that indexes documents, then attempts to match documents relevant to a user's search requests.

6 How web search engines work
A search engine operates in the following order:
1. Web Crawling
2. Indexing
3. Searching
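To make the indexing and searching steps concrete, here is a toy sketch of an inverted index over a two-page corpus. It is only an illustration: the pages and terms are invented, and real engines use far more elaborate data structures and ranking.

```python
from collections import defaultdict

# Tiny corpus standing in for crawled pages (invented for illustration).
pages = {
    "page1": "web search engines crawl the web",
    "page2": "a crawler downloads web pages",
}

# Indexing: map each term to the set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

# Searching: return the pages that contain every query term.
def search(query):
    results = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*results) if results else set()

print(search("web crawler"))  # -> {'page2'}
```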

7 Market share and wars

Search engine   April 2011   December 2010
Google          83.82%       84.65%
Yahoo            5.88%        6.69%
Baidu            4.38%        3.39%
Bing             3.92%        3.29%
Ask              0.51%        0.56%
AOL              0.38%        0.42%

8 Web Crawler
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner (Wikipedia).
It crawls, or visits, web pages and downloads them.
Starting from one page, it determines which page(s) to go to next, which mainly depends on the crawling policies used.

9 Utilities:
– Gather pages from the Web.
– Support a search engine, perform data mining, and so on.
Features of a crawler
Must provide:
1. Robustness: resilience to spider traps, e.g.
– Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
– Pages filled with a large number of characters.

10 2. Politeness: respect rules about which pages can be crawled and which cannot
– Robots exclusion protocol: robots.txt (e.g. http://blog.sohu.com/robots.txt), as sketched below
Should provide:
– Distributed
– Scalable
– Performance and efficiency
– Quality
– Freshness
– Extensible
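As a sketch of how a polite crawler can honour the robots exclusion protocol, Python's standard urllib.robotparser reads a site's robots.txt and answers whether a given URL may be fetched. The robots.txt URL is the one from the slide; the page URL being checked is a made-up example.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")  # robots.txt address from the slide's example
rp.read()                                      # download and parse the file

# Ask whether our crawler (generic user agent "*") may fetch a given URL.
if rp.can_fetch("*", "http://blog.sohu.com/somepage.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```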

11 [Diagram: the basic crawl loop. Starting from initial URLs, the crawler repeatedly gets the next URL from the set of URLs to visit, fetches the page from the web, extracts URLs from it, and records visited URLs and downloaded web pages.]
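That loop can be written out as a short Python sketch. The seed URL is a placeholder, and the href regex is a crude stand-in for a real HTML parser; it is meant only to mirror the boxes in the diagram.

```python
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin
import re

def extract_urls(base_url, html):
    # Crude href extraction for illustration; a real crawler would use an HTML parser.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=10):
    to_visit = deque(seed_urls)   # "urls to visit"
    visited = set()               # "visited urls"
    pages = {}                    # downloaded "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()                  # get next url
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")  # get page
        except OSError:
            continue
        pages[url] = html
        to_visit.extend(extract_urls(url, html))  # extract urls
    return pages

# crawl(["http://example.com/"])  # hypothetical seed URL
```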

12 Applications
Internet Search Engines – Google, Yahoo, MSN, Ask
Comparison Shopping Services – Shopping
Data mining – Stanford WebBase, IBM WebFountain

13 Crawling the Web
Web pages
– A few thousand characters long
– Served through the internet using the Hypertext Transfer Protocol (HTTP)
– Viewed at the client end using browsers
Crawler
– Fetches the pages to the computer
– At the computer, automatic programs can analyze hypertext documents

14 HTML (HyperText Markup Language)
Lets the author
– specify layout and typeface
– embed diagrams
– create hyperlinks, expressed as an anchor tag with an HREF attribute
HREF names another page using a Uniform Resource Locator (URL):
– URL = protocol field ("http") + a server hostname ("www.cse.iitb.ac.in") + a file path ("/", the root of the published file system)
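That decomposition can be seen directly with Python's standard urllib.parse, which splits a URL into its protocol, host, and path components. The hostname is the one from the slide; the /index.html path is a hypothetical example added for illustration.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.cse.iitb.ac.in/index.html")
print(parts.scheme)  # 'http'                - protocol field
print(parts.netloc)  # 'www.cse.iitb.ac.in'  - server hostname
print(parts.path)    # '/index.html'         - file path on the published file system
```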

15 HTTP (Hypertext Transfer Protocol)
Built on top of the Transmission Control Protocol (TCP)
Steps (from the client end):
– Resolve the server host name to an Internet address (IP)
  Uses the Domain Name System (DNS)
  DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
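The resolution step can be reproduced with Python's standard socket module, which asks the operating system's resolver for the address. The hostname is reused from the earlier slide; the printed IP depends on the DNS answer at the time of the query.

```python
import socket

# Resolve a hostname to an IPv4 address via DNS.
ip = socket.gethostbyname("www.cse.iitb.ac.in")
print(ip)  # actual value depends on the DNS answer
```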

16 – Contact the server using TCP
  Connect to the default HTTP port (80) on the server
  Send the HTTP request header (e.g. GET)
  Fetch the response header
    MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
  Fetch the HTML page
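These steps can be traced with a raw TCP socket. This is a minimal sketch, not how a production crawler fetches pages: it uses example.com as a placeholder host and HTTP/1.0 so the server closes the connection when the response is complete.

```python
import socket

host = "example.com"                              # placeholder host
sock = socket.create_connection((host, 80))       # TCP connection to the default HTTP port
request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))             # HTTP request header (GET)

response = b""
while True:                                       # read until the server closes the connection
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("ascii", errors="replace"))  # response header, incl. the MIME Content-Type
```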

17 Crawling overheads
Delays involved in
– Resolving the host name in the URL to an IP address using DNS
– Connecting a socket to the server and sending the request
– Receiving the requested page in response
Solution: overlap the above delays by fetching many pages at the same time (see the sketch below)
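A minimal sketch of that overlap using a thread pool: each worker performs the DNS lookup, connection setup, and transfer for one page, so the per-page delays run in parallel. The URLs are placeholders and error handling is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["http://example.com/", "http://example.org/", "http://example.net/"]  # placeholders

def fetch(url):
    # DNS lookup, connection setup, and transfer all happen inside urlopen;
    # running many fetches in parallel overlaps those delays.
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, page in pool.map(fetch, urls):
        print(url, len(page), "bytes")
```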

18 Architecture of a crawler
[Diagram: the URL Frontier feeds URLs to DNS resolution and Fetch; pages fetched from the www are Parsed; parsed content passes the Content Seen? test (backed by a Doc Fingerprint store), and extracted links pass the URL Filter (backed by robots templates) and Dup URL Elim (backed by the URL set) before re-entering the URL Frontier.]

19 URL Frontier: contains URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
DNS: domain name service resolution; look up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the URL.
Parse: the page is parsed; text (images, videos, etc.) and links are extracted.
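The parse step can be sketched with Python's standard html.parser. This minimal extractor only collects anchor hrefs and resolves them against the page URL; a real parser would also handle text, images, and malformed markup. The URLs in the usage example are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("http://example.com/index.html")
parser.feed('<a href="/about.html">About</a> <a href="news/today.html">News</a>')
print(parser.links)
# ['http://example.com/about.html', 'http://example.com/news/today.html']
```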

20 Content Seen?: test whether a web page with the same content has already been seen at another URL.
URL Filter: decide whether the extracted URL should be excluded from the frontier (e.g. by robots.txt). The URL should also be normalized: a relative link such as "Disclaimers" on en.wikipedia.org/wiki/Main_Page must be resolved against the page it appears on.
Dup URL Elim: the URL is checked for duplicate elimination; only URLs not already in the URL set enter the frontier.
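A hedged sketch of these last stages, using a SHA-1 hash as the document fingerprint and urljoin/urldefrag for URL normalization. The Wikipedia link mirrors the slide's relative-link example; a production crawler would use more careful normalization and fingerprinting.

```python
import hashlib
from urllib.parse import urljoin, urldefrag

seen_fingerprints = set()   # document fingerprints for the "Content Seen?" test
seen_urls = set()           # normalized URLs for duplicate URL elimination

def content_seen(page_bytes):
    fp = hashlib.sha1(page_bytes).hexdigest()   # fingerprint of the page content
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def normalize(base_url, href):
    # Resolve relative links against the page they appear on and drop fragments.
    url, _fragment = urldefrag(urljoin(base_url, href))
    return url

url = normalize("https://en.wikipedia.org/wiki/Main_Page", "Disclaimers")
print(url)  # https://en.wikipedia.org/wiki/Disclaimers

if url not in seen_urls:
    seen_urls.add(url)      # only URLs not already in the URL set enter the frontier
```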

21

22 Crawl policies
Selection policy
Re-visit policy
Politeness policy
Parallelization policy

Selection policy
– Page ranks
– Path ascending
– Focused crawling

23 Re-visit policy
– Freshness
– Age
Politeness policy
– So that crawlers don't overload web servers
– Set a delay between GET requests (see the sketch below)
Parallelization policy
– Distributed web crawling
– To maximize the download rate
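One simple realization of the politeness delay is to remember when each host was last contacted and sleep before the next request to it. The 2-second minimum below is an arbitrary choice for illustration.

```python
import time
from urllib.parse import urlparse

last_request = {}   # host -> timestamp of the last request to that host
MIN_DELAY = 2.0     # seconds between requests to one host (arbitrary value)

def polite_wait(url):
    host = urlparse(url).netloc
    now = time.monotonic()
    wait = last_request.get(host, 0) + MIN_DELAY - now
    if wait > 0:
        time.sleep(wait)   # back off so the server is not hit too frequently
    last_request[host] = time.monotonic()

# polite_wait("http://example.com/page1")  # call before each fetch from that host
```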

24 Nutch
Nutch is an open source web crawler.
Nutch Web Search Application
– Maintains a DB of pages and links
– Pages have scores, assigned by analysis
– Fetches high-scoring, out-of-date pages
– Distributed search front end
– Based on Lucene

25

