
1 Crawling The Web

2 Motivation
- By crawling the Web, data is retrieved from the Web and stored in local repositories
- Most common example: search engines, Web archives
- The idea: use the links between pages to traverse the Web
- Since the Web is dynamic, updates should be done continuously (or frequently)

3 Crawling Basic Algorithm
[Diagram of the crawling loop: Init (with the initial seeds), Get next URL (from the to-visit URLs), Get page (from the WWW), Extract Data (into the database), Extract Links (back into the to-visit URLs); the visited URLs are tracked along the way]
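To make the diagram concrete, here is a minimal sketch of the loop in Python. The function and variable names, the page limit and the naive regex-based link extraction are illustrative choices, not part of the original slides:

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    to_visit = deque(seeds)      # to-visit URLs, initialised with the seeds
    visited = set()              # visited URLs
    database = {}                # local repository of downloaded pages
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()                      # get next URL
        if url in visited:
            continue
        try:
            page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                                  # skip pages that fail to download
        visited.add(url)
        database[url] = page                          # extract data (here: store the raw page)
        for href in re.findall(r'href="([^"]+)"', page):   # extract links (deliberately naive)
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)                 # add new URLs to the to-visit list
    return database

# Example use: crawl(["http://example.com/"])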

4 The Web as a Graph
- The Web is modeled as a directed graph
  - The nodes are the Web pages
  - The edges are pairs (P1, P2) such that there is a link from P1 to P2
- Crawling the Web is a graph traversal (search algorithm)
- Can we traverse all of the Web this way?
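As a small illustration of the graph model (the page URLs below are invented), the link structure can be kept as an adjacency list mapping each page to the set of pages it links to, which is also what a crawler implicitly discovers while extracting links:

# Directed link graph: each node (page URL) maps to the set of pages it links to.
web_graph = {
    "http://example.com/index.html": {"http://example.com/a.html", "http://example.com/b.html"},
    "http://example.com/a.html":     {"http://example.com/index.html"},
    "http://example.com/b.html":     set(),   # a page with no outgoing links
}

# An edge (P1, P2) exists exactly when P2 is in web_graph[P1]:
p1, p2 = "http://example.com/index.html", "http://example.com/a.html"
print(p2 in web_graph[p1])   # True: index.html links to a.html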

5 The Hidden Web
The hidden Web consists of
- Pages that no other page links to
  - How can we get to these pages?
- Dynamic pages that are created as a result of filling a form, e.g. http://www.google.com/search?q=crawlers

6 Traversal Orders
Different traversal orders can be used, as sketched below:
- Breadth-First Crawlers: to-visit pages are stored in a queue
- Depth-First Crawlers: to-visit pages are stored in a stack
- Best-First Crawlers: to-visit pages are stored in a priority queue, according to some metric
- How should the traversal order be chosen?
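The three orders differ only in the data structure that holds the to-visit pages; a short sketch (the URLs and the relevance scores are made up):

from collections import deque
import heapq

# Breadth-first: the frontier is a FIFO queue
frontier = deque()
frontier.append("http://example.com/a")
next_url = frontier.popleft()

# Depth-first: the frontier is a LIFO stack (a plain list works)
frontier = ["http://example.com/a"]
next_url = frontier.pop()

# Best-first: the frontier is a priority queue ordered by some metric;
# here the score is a made-up relevance estimate, lower = fetched first.
frontier = []
heapq.heappush(frontier, (0.2, "http://example.com/a"))
heapq.heappush(frontier, (0.7, "http://example.com/b"))
score, next_url = heapq.heappop(frontier)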

7 Avoiding Cycles
- To avoid visiting the same page more than once, a crawler has to keep a list of the URLs it has visited
- The target of every encountered link is checked before inserting it into the to-visit list
- Which data structure for visited links should be used?
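One common choice is a hash set of normalized URLs, so membership tests take O(1) time on average. A minimal sketch, assuming a very simple normalization (lowercase the host, drop the fragment); real crawlers normalize far more aggressively:

from urllib.parse import urlsplit, urlunsplit

visited = set()   # hash set of already-seen (normalized) URLs

def normalize(url):
    # Illustrative normalization only: lowercase scheme and host, drop the fragment.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, ""))

def should_enqueue(url):
    # Check every encountered link before inserting it into the to-visit list.
    u = normalize(url)
    if u in visited:
        return False
    visited.add(u)   # mark it now so the same URL is never enqueued twice
    return True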

8 Directing Crawlers
- Sometimes people want to direct automatic crawling over their resources
- Direction examples:
  - “Do not visit my files!”
  - “Do not index my files!”
  - “Only my crawler may visit my files!”
  - “Please, follow my useful links…”
- Solution: publish instructions in some known format
- Crawlers are expected to follow these instructions

9 Robots Exclusion Protocol
- A method that allows Web servers to indicate which of their resources should not be visited by crawlers
- Put the file robots.txt at the root directory of the server, for example:
  - http://www.cnn.com/robots.txt
  - http://www.w3.org/robots.txt
  - http://www.ynet.co.il/robots.txt
  - http://www.whitehouse.gov/robots.txt
  - http://www.google.com/robots.txt

10 robots.txt Format
- A robots.txt file consists of several records
- Each record consists of a set of crawler ids and a set of URLs these crawlers are not allowed to visit
  - User-agent lines: which crawlers are directed?
  - Disallow lines: which URLs are not to be visited by these crawlers (agents)?

11 robots.txt Format
The following example is taken from http://www.w3.org/robots.txt:

User-agent: W3Crobot/1
Disallow: /Out-Of-Date

User-agent: *
Disallow: /Team
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /History
Disallow: /Out-Of-Date

W3Crobot/1 is not allowed to visit files under the directory Out-Of-Date; crawlers that are not W3Crobot/1 must obey the second record instead.
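Python's standard library ships a parser for exactly this format, so a polite crawler does not have to interpret the records by hand. A small sketch; the user-agent string MyCrawler/1.0 is made up, and whether the printed answers match the slide's example depends on the robots.txt the server actually returns today:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.w3.org/robots.txt")
rp.read()   # download and parse the live robots.txt

# May a given crawler fetch a given URL?
print(rp.can_fetch("W3Crobot/1", "http://www.w3.org/Out-Of-Date/old.html"))  # expected: False
print(rp.can_fetch("MyCrawler/1.0", "http://www.w3.org/2001/sw/"))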

12 Robots Meta Tag
- A Web-page author can also publish directions for crawlers
- These are expressed by the meta tag with name robots, inside the HTML file
- Format: <meta name="robots" content="index,follow">
- Options:
  - index (noindex): index (do not index) this file
  - follow (nofollow): follow (do not follow) the links of this file

13 Robots Meta Tag
An example: …
How should a crawler act when it visits this page?
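Since the example page itself is not reproduced above, here is a hypothetical page with a robots meta tag and a sketch of how a crawler might read its directives using Python's html.parser; the page content, the directive values and the helper class are illustrative:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the directives found in <meta name="robots" content="..."> tags.
    def __init__(self):
        super().__init__()
        self.directives = set()
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            for d in (attrs.get("content") or "").split(","):
                self.directives.add(d.strip().lower())

page = '<html><head><meta name="robots" content="noindex,follow"></head><body>...</body></html>'
parser = RobotsMetaParser()
parser.feed(page)
may_index  = "noindex"  not in parser.directives   # may this page be added to the index?
may_follow = "nofollow" not in parser.directives   # may its links be followed?
print(may_index, may_follow)   # for this hypothetical page: False True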

14 Revisit Meta Tag
- Web page authors may want Web applications to have an up-to-date copy of their page
- Using the revisit meta tag, page authors can give crawlers some idea of how often the page is being updated
- For example: <meta name="revisit-after" content="7 days">

15 Stronger Restrictions
- It is possible for a (non-polite) crawler to ignore the restrictions imposed by robots.txt and the robots meta directives
- Therefore, if one wants to ensure that automatic robots do not visit her resources, she has to use other mechanisms
  - For example, password protection

16 Resources
- Read this nice tutorial about Web crawling: http://informatics.indiana.edu/fil/Papers/crawling.pdf
- To find out more about crawler direction, visit www.robotstxt.org
- A dictionary of HTML meta tags can be found at http://vancouver-webpages.com/META/

17 Basic HTTP Security

18 Authentication
- Web servers expose their pages to Web users
- However, some of the pages should sometimes be exposed only to certain users
  - Examples?
- Authentication allows the server to expose a specific page only after a correct name and password have been specified
- HTTP includes a specification for a basic access authentication scheme
  - Some servers avoid it and use other mechanisms

19 Basic Authentication Scheme
[Diagram: the server holds /a/A.html and /a/B.jsp in Realm A, /b/C.css and /b/D.xml in Realm B, plus E.xsl and F.xml outside any realm. The client sends GET E.xsl with no credentials and the server replies OK + Content]

20 Basic Authentication Scheme
[Diagram, same server layout: the client sends GET /a/B.jsp with no credentials and the server replies 401 + Basic realm="A"]

21 Basic Authentication Scheme
[Diagram, same server layout: the client repeats GET /a/B.jsp, this time with user:pass, and the server replies OK + Content]

22 Basic Authentication Scheme
[Diagram, same server layout: the client sends GET /a/A.html with the same user:pass and the server replies OK + Content, since /a/A.html belongs to the same Realm A]

23 Basic Authentication Scheme
- To restrict a set of pages to certain users, the server designates a realm name for these pages and defines the authorized users (usernames and passwords)
- When a page is requested without correct authentication information, the server returns a 401 (Unauthorized) response with a "WWW-Authenticate" header like the following:
  WWW-Authenticate: Basic realm="realm-name"
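A sketch of how a crawler, or any HTTP client, would observe this challenge using Python's http.client; the host example.com and the path /a/B.jsp only echo the figures above and do not point at a real protected server:

import http.client

conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/a/B.jsp")            # no credentials supplied
resp = conn.getresponse()
print(resp.status)                          # 401 if the resource is in a protected realm
print(resp.getheader("WWW-Authenticate"))   # e.g. Basic realm="A"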

24 The browser then prompts the user for a username and a password, and sends them in the "Authorization" header:
  Authorization: Basic username:password
The string username:password is only trivially encoded, with Base64, so everyone can decode it...
Does the user fill in her name and password for other requests from the same server?
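The encoding in question is Base64, which anyone can reverse; a small sketch with made-up credentials:

import base64

# Base64-encode the credentials, as the Basic scheme requires.
credentials = "user:pass"                 # placeholder values, not from the slides
token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
print("Authorization: Basic " + token)    # what the browser actually sends

# Anyone who sees the header can recover the password just as easily:
print(base64.b64decode(token).decode("utf-8"))   # -> user:pass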

25 Browser Cooperation
- Throughout the session, the browser stores the username and password and automatically resends the Authorization header in either one of the following cases:
  - The requested resource is under the directory of the originally authenticated resource
  - The browser received 401 from the Web server and the WWW-Authenticate header has the same realm as the originally authenticated resource
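A sketch of the first rule (resending credentials for resources under the directory of the originally authenticated one); the URLs echo the earlier figures, and the helper function is illustrative rather than how any particular browser is implemented:

from urllib.parse import urlparse
import posixpath

def same_directory_or_below(authenticated_url, requested_url):
    # True if requested_url lies under the directory of the originally
    # authenticated resource on the same server.
    a, r = urlparse(authenticated_url), urlparse(requested_url)
    if (a.scheme, a.netloc) != (r.scheme, r.netloc):
        return False
    auth_dir = posixpath.dirname(a.path)        # e.g. "/a" for "/a/B.jsp"
    return r.path.startswith(auth_dir + "/")

# After authenticating for /a/B.jsp, credentials are resent for /a/A.html but not for /b/C.css:
print(same_directory_or_below("http://example.com/a/B.jsp", "http://example.com/a/A.html"))  # True
print(same_directory_or_below("http://example.com/a/B.jsp", "http://example.com/b/C.css"))   # False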

