DBI Representation and Management of Data on the Internet.


1 DBI Representation and Management of Data on the Internet

2 HTTP HyperText Transfer Protocol

3 In the Beginning… The Internet FTP –File Transfer Protocol SMTP –Simple Mail Transfer Protocol NNTP –Network-News Transfer Protocol HTTP –HyperText Transfer Protocol Let there be a Web Tim Berners-Lee

4 The Creation of the Web Tim Berners-Lee implemented HTTP in 1990-91 at CERN, the European particle physics laboratory near Geneva, Switzerland. The World-Wide Web is based upon –Information representation in HTML (HyperText Markup Language) documents –Resource transmission in HTTP (HyperText Transfer Protocol)

5 Previous HTTP Versions HTTP/0.9 used by WWW since 1990 HTTP/1.0 [RFC 1945] –Supports MIME (Multipurpose Internet Mail Extension) messages [RFC 1341] MIME transmits non-textual files by encoding them –Content negotiation HTTP/1.1 [RFC 2068] –Persistent connections –Caching

6 General Features Lightness and speed (response time of 100 ms in a hypertext jump) Client-Server protocol Stateless object-oriented protocol Open-ended set of methods and headers Typing and negotiation of data representation

7 Terminology User agent: client which initiates a request (browser, editor, web robot, …) Origin server: the server on which a given resource resides Proxy: acts as both a server and a client Gateway: server which acts as intermediary for other servers Tunnel: acts as a blind relay between two connections

8 Client-Server Protocol The browser is the client The client sends requests to an HTTP Server

9 Client-Server Sessions The HTTP protocol supports a short conversation between browser and server The request line, status line and headers are ASCII text; the message body may carry arbitrary 8-bit data The standard (and default) port for HTTP servers to listen on is 80, though they can use any port

10 HTTP Session A basic HTTP session has four phases: 1. Client opens the connection (a TCP connection) 2. Client makes the request 3. Server sends a response 4. Server closes the connection

11 Nested Objects Suppose a client accesses a page containing 10 inline images; to display the page completely would require 11 HTTP sessions Some browsers/servers support a feature called keep-alive which can keep the connection open until it is explicitly closed

12 [Figure: Index.html, made of a left frame (jumping fish) and a right frame (fairy icon, HUJI icon)]

13 Stateless Protocol HTTP is a stateless protocol, which means that once a server has delivered the requested data to a client, the connection is broken, and the server retains no memory of what has just taken place

14 Resources A resource is a chunk of information that can be identified by a URL (Uniform Resource Locator) –The most common kind of resource is a file, but a resource may also be a dynamically-generated query result, the output of a CGI script, or an active server page

15 URL Uniform Resource Identifiers [RFC 2396] are used to specify the object of a method –as an address (URL) –as a name (URN) URL = “http://” host [ “:” port ] [ path ] IP addresses in URLs should be avoided [RFC 1900]

16 Different URLs There are different types of URLs –http://host:port/path?query –mailto: followed by an e-mail address –news: followed by a newsgroup name

17 In a URL Spaces are represented by “+” Characters such as &, +, % are encoded in the form “%xx”, where xx is the ASCII value in hexadecimal; for example, “&” = “%26” The inputs to the parameters are in a list of the following form: var1=value1&var2=value2&var3=value3

18 War&peace Tolstoy

19 http://www.google.com/search?lr=&safe=off&q=war%26peace+Tolstoy
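The encoding rules above can be tried out with Python's standard urllib.parse module. This is only a minimal sketch: the parameter names lr, safe and q are taken from the URL on this slide, and the variable names are illustrative.

```python
from urllib.parse import urlencode, parse_qs

# Parameters of the search URL above, before encoding.
params = {"lr": "", "safe": "off", "q": "war&peace Tolstoy"}

# urlencode joins name=value pairs with "&", writes spaces as "+",
# and escapes reserved characters such as "&" as "%26".
query = urlencode(params)
url = "http://www.google.com/search?" + query

# Decoding reverses the process; keep_blank_values keeps the empty "lr".
decoded = parse_qs(query, keep_blank_values=True)
```

The ampersand inside the search phrase has to be escaped because an unescaped "&" would be read as the separator between two parameters.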

20 Format of Request and Response An initial line Zero or more header lines A blank line (i.e., a CRLF by itself), and An optional message body (e.g., a file, query data, or query output) Note: CRLF = “\r\n” (usually ASCII 13 followed by ASCII 10)

21 Request A request consists of: –Initial line –Headers –Blank line –Message body

22 Initial Line of a Request The initial line consists of –Method –Path –HTTP Version

23 Request Format

24 Request Example
    GET /courses/dbi/index.html HTTP/1.0
    From: yarok@cs.huji.ac.il
    User-Agent: HTTPTool/1.0
    [blank line here]
The initial line gives the method (GET), the path and the HTTP version; the lines that follow it are headers.

25 Do Not Forget CRLF
    GET /courses/dbi/index.html HTTP/1.0 [CRLF]
    From: yarok@cs.huji.ac.il [CRLF]
    User-Agent: HTTPTool/1.0 [CRLF]
    [CRLF]
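As a rough illustration of where the CRLFs go, the request on this slide could be assembled as follows. The function name build_request and its parameters are made up for this sketch; they are not part of any library.

```python
CRLF = "\r\n"

def build_request(method, path, headers, version="HTTP/1.0", body=b""):
    """Serialize an HTTP request: initial line, headers, blank line, body."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers]
    # Every line ends with CRLF; the extra CRLF is the blank line
    # that separates the headers from the (optional) body.
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("ascii") + body

request = build_request(
    "GET",
    "/courses/dbi/index.html",
    [("From", "yarok@cs.huji.ac.il"), ("User-Agent", "HTTPTool/1.0")],
)
```

The result contains four CRLF sequences: one after each of the three lines and one for the blank line.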

26 Request Methods GET returns the contents of the indicated document –The most frequently used command HEAD returns the header information for the indicated document –Useful for finding out info about a resource without retrieving it POST treats the document as a script and sends some data to it

27 More Methods PUT replaces the contents of the document with some data DELETE deletes the indicated document TRACE invokes a remote loop-back of the request. The final recipient SHOULD reflect the message back to the client Usually these methods are not allowed

28 GET Method GET is the most common HTTP method It says “give me this resource”

29 GET Requests With a Proxy [Figure: a client talking directly to the web server www.cs.huji.ac.il requests the path /~dbi/index.html; a client talking through a proxy server requests the full URL http://www.cs.huji.ac.il/~dbi/index.html]

30 HEAD Request A HEAD request asks the server to return the response headers only, and not the actual resource (i.e., no message body) Same as GET but without the message body This is useful for checking characteristics of a resource without actually downloading it, thus saving bandwidth Used for testing hypertext links for validity, accessibility and recent modification

31 Post POST request can send data to the server POST is mostly used in form-filling –The contents of a form are translated by the browser into some special format and sent to a script on the server using the POST command

32 Post (cont.) There is a block of data sent with the request, in the message body There are usually extra headers to describe this message body, like Content-Type: and Content-Length: The request URI is a program to handle the sent data, not a resource to retrieve The HTTP response is normally the output of a program, not a static file

33 Post Example Here's a typical form submission, using POST:
    POST /path/script.cgi HTTP/1.0
    From: frog@cs.huji.ac.il
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 35
    [blank line here]
    home=Ross+109&favorite+flavor=flies
The message body is 35 characters long, which is the value given in Content-Length.
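The Content-Length value in this example can be checked by encoding the form fields and counting the bytes. A small sketch, assuming the form had a field "home" with value "Ross 109" and a field "favorite flavor" with value "flies" (inferred from the encoded body); the From and User-Agent headers are left out for brevity.

```python
from urllib.parse import urlencode

# Form fields as the user would have typed them (inferred from the slide).
form = {"home": "Ross 109", "favorite flavor": "flies"}
body = urlencode(form).encode("ascii")

# Content-Length must equal the number of bytes in the body.
request = (
    b"POST /path/script.cgi HTTP/1.0\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)
```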

34 Headers HTTP 1.0 defines 16 headers –none are required HTTP 1.1 defines 46 headers –one header (Host:) is required in requests

35 Headers From: –gives the email address of whoever is making the request or running the program doing so User-Agent: –identifies the program that's making the request, in the form "Program-name/x.xx", x.xx is the (mostly) alphanumeric version of the program. For example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold"

36 Headers (cont.) Server: –analogous to the User-Agent: header –it identifies the server software in the form "Program-name/x.xx" –For example, one beta version of Apache's server returns "Server: Apache/1.2b3-dev"

37 Headers (cont.) If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular, Content-Type: –gives the MIME-type of the data in the body, such as text/html or image/gif Content-Length: –gives the number of bytes in the body

38 Headers (cont.) Last-Modified: –Gives the modification date of the resource that's being returned –It's used in caching and other bandwidth-saving activities –Greenwich Mean Time should be used and the format is Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT

39 Initial Line of a Response The initial line of a response is also called the status line. The initial line consists of –HTTP version –response status code –reason phrase that describes the status code

40 Response Format

41 Response Example
    HTTP/1.0 200 OK
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    Content-Type: text/html
    Content-Length: 1354
    [blank line here]
    Hello World (more file contents)...
The initial line gives the version, the status code and the reason phrase; the headers follow, then the message body.
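A response with this layout can be taken apart with a few string operations. The sketch below is deliberately simplified: parse_response is a made-up helper, it assumes the status line always carries a reason phrase, and it ignores repeated or folded headers. The sample message is shortened, so its Content-Length is 11 rather than 1354.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")      # the blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")        # split at the first colon only
        headers[name.strip().lower()] = value.strip()   # header names are case-insensitive
    return version, int(status), reason, headers, body

raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Date: Fri, 31 Dec 1999 23:59:59 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"Hello World"
)
version, status, reason, headers, body = parse_response(raw)
```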

42 Status Code The status code is a three-digit integer, and the first digit identifies the general category of response: –1xx indicates an informational message only –2xx indicates success of some kind –3xx redirects the client to another URL –4xx indicates an error on the client's part Yes, the system blames it on the client if a resource is not found (i.e., 404) –5xx indicates an error on the server's part

43 Status Code 1xx The 100 (Continue) Status –Allows a client to determine if the server is willing to accept the request (based on the request headers) before the client sends the request body –The client’s request must have the header Expect: 100-continue 101 Status -- Switching Protocols

44 Status Code 2xx Status codes 2xx -- Success The action was successfully received, understood, and accepted –200 OK –201 Created (e.g., a POST command created a resource) –202 Request accepted (but not yet processed) –203 Non-authoritative information –204 No content

45 Status Code 3xx Status codes 3xx -- Redirection Further action must be taken in order to complete the request –300 Resource found at multiple locations –301 Resource moved permanently –302 Resource moved temporarily –304 Resource has not been modified (since date)

46 Status Code 4xx Status codes 4xx -- Client error The request contains bad syntax or cannot be fulfilled –400 Bad request from client –401 Unauthorized request –402 Payment required for request –403 Resource access forbidden –404 Resource not found –405 Method not allowed for resource –406 Resource type not acceptable

47 Status Code 5xx Status codes 5xx -- Server error The server failed to fulfill an apparently valid request –500 Internal server error –501 Method not implemented –502 Bad gateway –503 Service unavailable (e.g., server overload) –504 Gateway timeout
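Because the first digit carries the category, a client can classify any status code without knowing every individual code. A minimal sketch; the function name and the category labels are illustrative.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the general category of a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]   # integer division keeps the first digit
```

This is also how a client can cope with a code it has never seen: an unknown 4xx code can still be treated as a client error.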

48 Response Information Description of information –Server: type of server –Date: date and time –Content-Length: number of bytes –Content-Type: MIME type –Content-Language: English, for example –Content-Encoding: data compression –Last-Modified: date when last modified –Expires: date when file becomes invalid

49 Manually Experimenting with HTTP
    >host www
    www.cs.huji.ac.il is a nickname for vafla.cs.huji.ac.il
    vafla.cs.huji.ac.il has address 132.65.80.39
    vafla.cs.huji.ac.il mail is handled (pri=10) by cs.huji.ac.il
    >telnet www.cs.huji.ac.il 80
    Trying 132.65.80.39…
    Connected to vafla.cs.huji.ac.il.
    Escape character is '^]'.

50 Sending a Request >GET /~dbi/index.html HTTP/1.0 [blank line]

51 The Response
    HTTP/1.1 200 OK
    Date: Sun, 11 Mar 2001 21:42:15 GMT
    Server: Apache/1.3.9 (Unix)
    Last-Modified: Sun, 25 Feb 2001 21:42:15 GMT
    Content-Length: 479
    Content-Type: text/html
    [blank line]
    (html code …)
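The same experiment can be scripted instead of typed into telnet. The sketch below uses only the standard socket module; http_get and fetch are made-up helper names, and it relies on the HTTP/1.0 behaviour that the server closes the connection when the response is complete.

```python
import socket

def http_get(sock, path):
    """Send an HTTP/1.0 GET on a connected socket and return the raw response."""
    request = f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")
    sock.sendall(request)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:          # the server closed the connection: response complete
            break
        chunks.append(data)
    return b"".join(chunks)

def fetch(host, path, port=80):
    """Open a TCP connection (like `telnet host port`) and issue the GET."""
    with socket.create_connection((host, port), timeout=10) as sock:
        return http_get(sock, path)
```

A call such as fetch("www.cs.huji.ac.il", "/~dbi/index.html") would mirror the session on these slides, although whether that host and page still answer today is not something this sketch can promise.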

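52 Request: GET /~dbi/index.html HTTP/1.0 Response: HTTP/1.1 200 OK, followed by the HTML code

53 Request: GET /~dbi/no-such-page.html HTTP/1.0 Response: HTTP/1.1 404 Not Found, followed by HTML code

54 Request: GET /index.html HTTP/1.1 Response: HTTP/1.1 400 Bad Request, followed by HTML code Why is it a Bad Request? The request claims HTTP/1.1 but carries no Host header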

55 HTTP 1.1 HTTP/1.1 is replacing/has replaced HTTP/1.0 as the new Web protocol

56 Improvements Faster response –allowing multiple transactions to take place over a single persistent connection –adding cache support Faster response for dynamically-generated pages –supporting chunked encoding, which allows a response to be sent before its total length is known Efficient use of IP addresses –allowing multiple domains to be served from a single IP address

57 Improvements over HTTP 1.0 HTTP/1.1 has a number of features/improvements over HTTP/1.0, including –Persistent TCP connections –Partial document transfers –Conditional fetch –Support for nonstandard HTTP/1.0 extensions –Better support for alternative character sets –More flexible authentication –Faster response and great bandwidth savings –Efficient use of IP addresses (virtual hosting)

58 Non-Persistent Connections 1. Browser opens TCP connection to port 80 of server (handshake) 2. Browser sends HTTP request message 3. Server receives request, locates object, sends response 4. Server closes TCP connection 5. Client receives response, parses object 6. Repeat 1-4 for each embedded object

59 Persistent Connection 1. Browser opens TCP connection to port 80 of server (handshake) 2. Browser sends HTTP request message 3. Server receives request, locates object, sends response 4. Client receives response, parses object 5. Repeat 2-4 for each embedded object 6. TCP connection closes on demand or timeout
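The persistent-connection steps can be sketched with Python's standard http.client module, which speaks HTTP/1.1 and keeps the TCP connection open between requests when the server allows it. The function name fetch_all is illustrative, and the sketch does not use pipelining: each request waits for the previous response.

```python
import http.client

def fetch_all(host, port, paths):
    """Fetch several paths over one TCP connection (HTTP/1.1 keep-alive)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)   # step 1
    results = []
    try:
        for path in paths:
            conn.request("GET", path)        # step 2 (Host header is added automatically)
            resp = conn.getresponse()        # step 3
            body = resp.read()               # step 4: drain the body before reusing the connection
            results.append((path, resp.status, len(body)))
    finally:
        conn.close()                         # step 6
    return results
```

With a non-persistent connection the same loop would have to open and close a new connection for every path, paying for the TCP handshake each time.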

60 Advantages of Persistent Connection CPU time saved in routers and hosts HTTP requests and responses can be pipelined on a connection network congestion is reduced latency on subsequent requests is reduced

61 Pipelines Two types of persistent connections –Without pipelining: the client issues a new request only after the previous response has arrived –With pipelining: the client sends each request as soon as it encounters a reference Multiple requests/responses may then travel –in the same IP packet, or –in back-to-back packets

62 Virtual Hosts With HTTP 1.1, one server at one IP address can be multi-homed: –“www.cs.huji.ac.il” and “www.math.huji.ac.il” can live on the same server –These are called virtual hosts –Without this mechanism, we have to use 2 different IP addresses It is like several people sharing one phone An HTTP request must specify the host name (and possibly port) for which the request is intended

63 Example The request specifies the host: GET /path/file.html HTTP/1.1 Host: www.host1.com:80

64 Virtual Hosting (cont.) Virtual hosting –reduces hardware expenditures –extends the ability to support additional servers –makes load balancing and capacity planning much easier Without it –each host name requires a unique IP address, and we are quickly running out of IP addresses with the explosion of new domains
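On the server side, virtual hosting comes down to a lookup keyed by the Host header. The sketch below assumes a hypothetical mapping from host names to document roots; the directory names and the resolve helper are invented for illustration.

```python
# Hypothetical mapping from host name to document root on one server.
VIRTUAL_HOSTS = {
    "www.cs.huji.ac.il": "/var/www/cs",
    "www.math.huji.ac.il": "/var/www/math",
}

def resolve(headers, path):
    """Map a request to a file path using its Host header.

    Returns None when the Host header is missing or names an unknown host;
    a real server would typically answer 400 Bad Request in the first case.
    """
    host = headers.get("Host")
    if host is None:
        return None
    name = host.split(":", 1)[0].lower()     # drop an optional ":port"
    root = VIRTUAL_HOSTS.get(name)
    return None if root is None else root + path
```

The same path, /index.html, thus leads to different files depending on which host name the client asked for, even though both names resolve to one IP address.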

65 The Date Header In HTTP 1.1, servers must include the generation time of the response in the Date: header Time values use Greenwich Mean Time (GMT) and have the format Date: Fri, 31 Dec 1999 23:59:59 GMT Date is omitted only in a few cases, e.g., status code 100 (continue) and some server errors Servers must synchronize their clocks with a reliable external standard

66 Caching Caching improves performance Eliminates the need to send requests in many cases (reduces network round-trips), using an expiration mechanism Eliminates the need to send full responses in other cases (reduces network bandwidth), using a validation mechanism

67

68 Client Caching 1. Client sends GET /fruit/apple.gif 2. Server responds with the object and a Last-Modified: … header 3. Client caches the object and its last-modified date 4. Client later sends GET /fruit/apple.gif … with If-Modified-Since: … 5. Server returns either 304 Not Modified or the object

69 Network Caches [Figure: a client sends GET /fruit/apple.gif to a proxy server that sits between the client and the origin server]

70 Benefit of Caching [Figure: clients on a 10 Mbps LAN reach the Internet through routers over a 1.5 Mbps link; the clients generate 15 requests/sec of 100 Kbits each; a proxy server on the LAN has a 40% hit rate]
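The numbers on this slide can be turned into a short calculation. The reading used here is an interpretation of the figure, not something the slide spells out: all requests cross the 1.5 Mbps link unless the proxy on the LAN answers them from its cache.

```python
REQUESTS_PER_SEC = 15
BITS_PER_REQUEST = 100_000      # 100 Kbits per request
ACCESS_LINK_BPS = 1_500_000     # 1.5 Mbps link to the Internet
LAN_BPS = 10_000_000            # 10 Mbps LAN
HIT_RATE = 0.40                 # fraction of requests served by the proxy

demand_bps = REQUESTS_PER_SEC * BITS_PER_REQUEST   # traffic the clients ask for

def access_link_utilization(hit_rate):
    """Fraction of the access link used when hit_rate of requests hit the cache."""
    return demand_bps * (1 - hit_rate) / ACCESS_LINK_BPS

without_cache = access_link_utilization(0.0)        # every request crosses the link
with_cache = access_link_utilization(HIT_RATE)      # only cache misses cross the link
lan_utilization = demand_bps / LAN_BPS
```

Under these assumptions the link is fully loaded without a cache (1.5 Mbps of demand on a 1.5 Mbps link) and drops to about 60% load with a 40% hit rate, while the LAN stays at about 15%.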

71 Expiration Model Servers may provide an expiration time using the Expires header –By checking the expiration time, the cache can return a fresh response without contacting the server If the expiration time is not specified, the cache can heuristically estimate the expiration times (e.g., using header values, such as the Last-Modified time)
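A cache's freshness check under the expiration model might look like the sketch below. The 10% factor for the heuristic is an assumption (it is a commonly suggested value, not something this talk specifies), and the function names are made up.

```python
from datetime import datetime, timedelta, timezone

HEURISTIC_FRACTION = 0.10   # assumption: fresh for 10% of the age at fetch time

def expiration_time(fetched_at, expires=None, last_modified=None):
    """Return when a cached copy stops being fresh, or None if unknown."""
    if expires is not None:
        return expires                       # an explicit Expires header wins
    if last_modified is not None and last_modified <= fetched_at:
        age = fetched_at - last_modified     # how old the resource was when fetched
        return fetched_at + age * HEURISTIC_FRACTION
    return None

def is_fresh(now, fetched_at, expires=None, last_modified=None):
    """True if the cached copy may be returned without contacting the server."""
    limit = expiration_time(fetched_at, expires, last_modified)
    return limit is not None and now < limit

# Example: a page last modified 10 days before it was fetched is
# treated as fresh for roughly one more day.
fetched = datetime(2001, 3, 11, 21, 42, 15, tzinfo=timezone.utc)
modified = fetched - timedelta(days=10)
fresh_after_12h = is_fresh(fetched + timedelta(hours=12), fetched, last_modified=modified)
fresh_after_2d = is_fresh(fetched + timedelta(days=2), fetched, last_modified=modified)
```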

72 The Risk in Caching Response might not be “semantically transparent” –the response is different from what would have been returned by the origin server The cache should verify that the copy is fresh (i.e., expiration time has not passed) The copy is stale if it is not fresh

73 Validators A validator is any mechanism that may help in determining whether a copy is fresh or stale –A strong validator is, for example, a counter that is incremented whenever the resource is changed –A weak validator is, for example, a counter that is incremented only when a significant change is made

74 Using the Cache To check whether a copy is fresh, the cache must either –Use the expiration model, or –Compare the Last-Modified time or some validator with the origin server In the second case, the origin server either –Responds with the message 304(Not Modified), or –Sends a full response with the entity body

75 Cache-Control Header Cache-control headers specify directives to the cache –Can be included in either requests or responses The server can specify “must-revalidate” –Cache must revalidate with the origin server that the copy is still fresh The client can specify –the max-age of an unvalidated response –The max-stale time of a stale copy

76 Do not Use a Cache The Pragma: no-cache request header indicates that the request should not be satisfied from a cache Same as the no-cache cache directive Should include both if the server might not be HTTP/1.1 compliant Directive applies to any recipient along the request/response chain
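A cache has to read these directives before deciding whether it may answer from its store. The parser below is a simplified sketch: it splits on commas, so it would mishandle quoted arguments that themselves contain commas, and the helper names are invented.

```python
def parse_cache_control(value):
    """Parse a Cache-Control (or Pragma) value into {directive: argument-or-None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives

def may_use_cache(request_headers):
    """False if the request carries no-cache in Cache-Control or in Pragma."""
    cc = parse_cache_control(request_headers.get("Cache-Control", ""))
    pragma = parse_cache_control(request_headers.get("Pragma", ""))
    return "no-cache" not in cc and "no-cache" not in pragma
```

Checking both headers mirrors the advice on this slide: an older client or intermediary may only know Pragma.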

77 If-Modified-Since Header The If-Modified-Since: header is used with a GET request If the requested resource has been modified since the given date, the server returns the resource as it normally would (i.e., the header is ignored) Otherwise, the server returns a 304 Not Modified response, including the Date: header, but with no message body
    HTTP/1.1 304 Not Modified
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    [blank line here]
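The server's decision can be sketched as a date comparison. The example uses email.utils from the standard library to parse the HTTP date; the function name is made up, and treating an unparsable date as if the header were absent is a design choice of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_get_status(last_modified, if_modified_since):
    """Return 200 (send the resource) or 304 (Not Modified) for a GET request."""
    if if_modified_since is None:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                            # unparsable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop the microseconds.
    modified = last_modified.replace(microsecond=0)
    return 200 if modified > since else 304

resource_time = datetime(1999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
status = conditional_get_status(resource_time, "Fri, 31 Dec 1999 23:59:59 GMT")
```

Here the client's copy is as recent as the resource, so the server can answer 304 and skip the body.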

78 If-Unmodified-Since Header The If-Unmodified-Since: header can be used with any method If the requested resource has not been modified since the given date, the server returns the resource as it normally would Otherwise, the server returns a 412 Precondition Failed response HTTP/1.1 412 Precondition Failed [blank line here]

79 Cooperative Caching

80 Cooperative Caching (cont.) Higher level cache (e.g., national cache) –larger user population –higher hit rates Multiple Web caches which cooperate => Improve overall performance Cooperative caches are usually built from clusters –divide the traffic overhead –improve storage capacity

81 Cooperative Caching (cont.) Which caches should be asked for a particular doc? Hash routing (of URLs) -- an object will not be present in more than one cache
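Hash routing can be illustrated with a simple modulo scheme. This is one possible realization, not necessarily the one the talk has in mind; the cache names are hypothetical, and SHA-256 is used only because it is available in the standard library.

```python
import hashlib

def pick_cache(url, caches):
    """Choose one cache for a URL by hashing the URL (simple modulo scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(caches)
    return caches[index]

caches = ["cache-a.example", "cache-b.example", "cache-c.example"]  # hypothetical
chosen = pick_cache("http://www.cs.huji.ac.il/~dbi/index.html", caches)
```

Every client that hashes the same URL reaches the same cache, which is why an object ends up in only one of them. A plain modulo scheme does reshuffle most URLs when a cache is added or removed.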

82 Hop by Hop HTTP/1.1 introduces the concept of hop-by-hop headers: –Message headers that apply only to a given connection, and not to the entire path –It enables much more power with the usage of proxies (caches)

83 Hop-by-Hop Headers Connection –options that are desired for that particular connection (e.g., connection:close) Public –lists the set of methods supported by the server Proxy-Authenticate –enables authentication methods between two hops Transfer-Encoding –compression method between two hops Upgrade –additional communication protocols

84 Chunked Encoding Chunked encoding –Transmission of streaming multimedia One frame varies in size and composition from the next –Streaming video Entire image transmitted in first chunk and differences from the previous image are transmitted in the next chunk Wake up, we speak about movies in the Internet
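On the wire, each chunk is sent as its size in hexadecimal, a CRLF, the data and another CRLF, and a zero-size chunk ends the body. The sketch below shows that framing; it ignores trailers, and the function names are illustrative.

```python
def encode_chunked(chunks):
    """Encode an iterable of byte strings using chunked transfer coding."""
    out = []
    for chunk in chunks:
        if not chunk:
            continue                      # a zero-length chunk would end the body early
        out.append(f"{len(chunk):X}".encode("ascii") + b"\r\n" + chunk + b"\r\n")
    out.append(b"0\r\n\r\n")              # last-chunk marker, no trailers
    return b"".join(out)

def decode_chunked(data):
    """Reassemble the body from chunked data (trailers are not handled)."""
    body, pos = b"", 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)   # ignore chunk extensions
        if size == 0:
            return body
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2            # skip the CRLF after the chunk data

wire = encode_chunked([b"Hello, ", b"World"])
```

Because each chunk announces only its own size, the sender never needs to know the total length in advance, which is what makes the encoding suitable for streamed or dynamically generated content.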

85 Compression Most image formats (GIF, JPEG, MPEG) are precompressed Many other data types used in the Web are not precompressed Compression could save almost 40% of the bytes sent via HTTP There is a need for negotiating the type of encoding of the compressed resource

86 Compression (cont.) Client sends the header Accept-Encoding –The header indicates the content-encodings that the client can handle and the ones that the client prefers Server sends –Content-Encoding header – for end-to-end encoding indication –Transfer-Encoding header – for hop-by-hop encoding indication (supported only in HTTP/1.1)
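The negotiation can be sketched on the server side with the standard gzip module. The helper below is a simplification: it only knows gzip, treats any non-zero q value as acceptable, and its name and return shape are invented for this example.

```python
import gzip

def encode_body(body, accept_encoding):
    """Return (body, content_encoding), honouring the client's Accept-Encoding."""
    acceptable = set()
    for item in accept_encoding.split(","):
        coding, _, params = item.strip().partition(";")
        q = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if q > 0:                          # q=0 means "not acceptable"
            acceptable.add(coding.strip().lower())
    if "gzip" in acceptable:
        return gzip.compress(body), "gzip"   # send with Content-Encoding: gzip
    return body, None                        # send unencoded

page = b"<html>" + b"hello " * 200 + b"</html>"
encoded, content_encoding = encode_body(page, "gzip, deflate")
```

Repetitive text such as HTML usually shrinks considerably, while already compressed image formats gain little.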

87 Content Negotiation Content Negotiation: –the process of selecting the best representation for a given response when there are multiple representations available HTTP supports two kinds of content negotiation: –Server-driven negotiation –Agent-driven negotiation

88 Server-Driven Negotiation The selection is made by the server, based on: –header fields in the request (client preferences): Accept-Language / Accept-Encoding –available representations of the response –other information (e.g., the address of the client) Disadvantages: –Impossible for the server to determine what is best for the user –Inefficiency (clients should describe their capabilities in every request) –Complicates implementation of servers
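A server-driven choice of language could be sketched as follows. The parser is simplified (no wildcard and no prefix matching such as "en" for "en-gb"), and the function names and the set of available languages are assumptions made for the example.

```python
def parse_accept(value):
    """Parse an Accept-* header into a list of (token, q) sorted by preference."""
    prefs = []
    for item in value.split(","):
        token, _, params = item.strip().partition(";")
        q = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if token:
            prefs.append((token.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)

def choose_language(accept_language, available, default):
    """Pick the client's most preferred language that the server can provide."""
    for lang, q in parse_accept(accept_language):
        if q > 0 and lang in available:
            return lang
    return default

choice = choose_language("da, en-gb;q=0.8, en;q=0.7", {"en", "he"}, "he")
```

The example also hints at the first disadvantage listed on this slide: the server can only work from what the header says, which may not match what the user actually wants.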

89 Agent-Driven Negotiation Selection is made by the client after receiving an initial response from the server –Based on available representations specified in the initial response –Automatic or manual Disadvantages: –needs a second request to obtain the best alternative representation

90 Protocol Switching Protocol switching –Client can specify another protocol more suited to the data being transferred (e.g., real-time synchronous protocol) I hate HTTP/1.0 I want another protocol

91 Authentication Many sites require users to provide a username and password in order to access the documents housed on the server This requirement provides a mechanism for keeping track of users (more than just a security mechanism)

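92 Authentication [Figure: the client requests /~dbi/index.html from the web server www.cs.huji.ac.il; the server answers "Who are you?"; the client requests /~dbi/index.html again, adding "I am Donald, my password is Duck"; the server sends the response]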

93 Authentication How does it work? –Client sends an ordinary request message –Server responds with a 401 Authorization Required status code and a WWW-Authenticate header, which specifies how to perform authentication –Client resends the request message, but this time including the Authorization header (e.g., user-name & password) –The client continues to add this header for each following request to that server
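The talk does not name an authentication scheme; assuming the common Basic scheme, the Authorization header is the base64 encoding of "user-name:password". The helper names below are made up, and the credentials are the ones from the earlier figure.

```python
import base64

def basic_authorization(username, password):
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def parse_basic_authorization(value):
    """Server side: recover (username, password) from the header value."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password

header_value = basic_authorization("Donald", "Duck")
```

Base64 is an encoding, not encryption: anyone who can read the request can recover the password, so the Basic scheme only offers protection over an otherwise secured connection.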

94 Cookies Alternative way to identify browsers Server response includes the Set-cookie header that has the attributes –name = VALUE –expires = DATE STRING –domain = DOMAIN NAME –path = PATH –secure Client returns cookie with matching URLs

95 Cookies Example: –Client contacts a web site for the first time –Server response includes the header: Set-cookie : 1678453 –Client stores the cookie value and the server name in a special “cookie file” –For each further request for that server, the client will add the header Cookie : 1678453
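The exchange can be sketched with the standard http.cookies module. Real cookies are name=value pairs, so the sketch gives the value from this slide the hypothetical name "id"; the path attribute is likewise just an example.

```python
from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header. The cookie name "id" is hypothetical.
jar = SimpleCookie()
jar["id"] = "1678453"
jar["id"]["path"] = "/"
set_cookie_header = jar.output()

# Client side: echo name=value back in a Cookie header on later requests.
cookie_header = "Cookie: " + "; ".join(f"{m.key}={m.value}" for m in jar.values())

# Server side again: parse the value of the returned Cookie header.
received = SimpleCookie()
received.load("id=1678453")
visitor_id = received["id"].value
```

Only the name and value travel back to the server; attributes such as the path stay with the client and decide which requests the cookie is attached to.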

96 Cookies (cont.) Usage: –Server requires authentication, but doesn’t want to hassle a user with a user-name and password –Remembering user’s preferences for advertising –Cookies enable creating a virtual shopping cart Problems –users who access the same site from different machines

97 Are you HTTP experts now? Not yet. There are more headers, for example, that this talk did not cover. To know more, go to the specifications.

98 Additional Information For specifications and additional information: –http://www.w3.org/Protocols/ –http://www.w3.org/Protocols/Specs.html –http://www.jmarshall.com/easy/http/ –http://wdvl.com/Internet/Protocols/HTTP/article.html

