DBI Representation and Management of Data on the Internet.

HTTP HyperText Transfer Protocol

In the Beginning… The Internet
FTP – File Transfer Protocol
SMTP – Simple Mail Transfer Protocol
NNTP – Network-News Transfer Protocol
HTTP – HyperText Transfer Protocol
Let there be a Web: Tim Berners-Lee

The Creation of the Web Tim Berners-Lee implemented the HTTP protocol in 1990-91 at CERN, the European particle physics laboratory near Geneva, Switzerland. The World-Wide Web is based upon –Information representation in HTML (HyperText Markup Language) documents –Resource transmission in HTTP (HyperText Transfer Protocol)

Previous HTTP Versions HTTP/0.9 used by WWW since 1990 HTTP/1.0 [RFC 1945] –Supports MIME (Multipurpose Internet Mail Extensions) messages [RFC 1341]; MIME transmits non-textual files by encoding them –Content negotiation HTTP/1.1 [RFC 2068, later revised as RFC 2616] –Persistent connections –Caching

General Features Lightness and speed (response time of 100 ms in a hypertext jump) Client-Server protocol Stateless object-oriented protocol Open-ended set of methods and headers Typing and negotiation of data representation

Terminology User agent: client which initiates a request (browser, editor, web robot, …) Origin server: the server on which a given resource resides Proxy: acts as both a server and a client Gateway: server which acts as intermediary for other servers Tunnel: acts as a blind relay between two connections

Client-Server Protocol The browser is the client The client sends requests to an HTTP Server

Client-Server Sessions The HTTP protocol supports a short conversation between browser and server The request line, status line, and headers are plain ASCII text; the message body may carry arbitrary 8-bit data The standard (and default) port for HTTP servers to listen on is 80, though they can use any port

HTTP Session A basic HTTP session has four phases: 1.Client opens the connection (a TCP connection) 2.Client makes the request 3.Server sends a response 4.Server closes the connection

Nested Objects Suppose a client accesses a page containing 10 inline images; to display the page completely would require 11 HTTP sessions Some browsers/servers support a feature called keep-alive which can keep the connection open until it is explicitly closed

[Figure: nested objects in Index.html: left frame, jumping fish, right frame, fairy icon, HUJI icon]

Stateless Protocol HTTP is a stateless protocol: each request is handled independently, and once a server has delivered the requested data to a client, the server retains no memory of what has just taken place (in a basic session the connection is closed as well)

Resources A resource is a chunk of information that can be identified by a URL (Uniform Resource Locator) –The most common kind of resource is a file, but a resource may also be a dynamically-generated query result, the output of a CGI script, or an active server page

URL Uniform Resource Identifiers (URIs) [RFC 2396] are used to specify the object of a method –as an address (URL) –as a name (URN) URL = “http://” host [ “:” port ] [ path ] IP addresses in URLs should be avoided [RFC 1900]

Different URLs There are different types of URLs –http://host:port/path?query –mailto:address –news:newsgroup

In a URL In a query string, spaces are represented by “+” Characters such as &, +, % are encoded in the form “%xx”, where xx is the ASCII value in hexadecimal; for example, “&” = “%26” The inputs to the parameters are in a list of the following form: var1=value1&var2=value2&var3=value3

Example: a search for “War&peace Tolstoy” is sent as the URL
http://www.google.com/search?lr=&safe=off&q=war%26peace+Tolstoy
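The encoding rules can be tried out with Python's standard urllib.parse module. This is only a sketch: the variable names are illustrative, and the parameter list simply mirrors the Google URL shown above.

```python
from urllib.parse import quote_plus, urlencode

# Encode a single value: a space becomes "+", and "&" becomes "%26".
encoded_query = quote_plus("war&peace Tolstoy")

# Encode a whole parameter list in the var1=value1&var2=value2 form.
query_string = urlencode([("lr", ""), ("safe", "off"), ("q", "war&peace Tolstoy")])
search_url = "http://www.google.com/search?" + query_string
```

Encoding each value separately matters: the "&" inside the search phrase must not be confused with the "&" that separates parameters.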

Format of Request and Response Both requests and responses consist of: an initial line, zero or more header lines, a blank line (i.e., a CRLF by itself), and an optional message body (e.g., a file, query data, or query output) Note: CRLF = “\r\n” (ASCII 13 followed by ASCII 10)

Request A request consists of: –Initial line –Headers –Blank line –Message body

Initial Line of a Request The initial line consists of –Method –Path –HTTP Version

Request Format

Request Example
GET /courses/dbi/index.html HTTP/1.0    (initial line: method, path, version)
From: yarok@cs.huji.ac.il    (header)
User-Agent: HTTPTool/1.0    (header)
[blank line here]

Do Not Forget CRLF
GET /courses/dbi/index.html HTTP/1.0 [CRLF]
From: yarok@cs.huji.ac.il [CRLF]
User-Agent: HTTPTool/1.0 [CRLF]
[CRLF]
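A minimal sketch of assembling such a request in Python, with every line terminated by CRLF and a blank line at the end. The helper name build_request is hypothetical; it only builds the bytes and does not send them anywhere.

```python
CRLF = "\r\n"

def build_request(method, path, headers, version="HTTP/1.0"):
    """Return request bytes: initial line, header lines, then a blank line."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers]
    # The first CRLF ends the last line; the second one is the blank line.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

request = build_request(
    "GET",
    "/courses/dbi/index.html",
    [("From", "yarok@cs.huji.ac.il"), ("User-Agent", "HTTPTool/1.0")],
)
```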

Request Methods GET returns the contents of the indicated document –The most frequently used command HEAD returns the header information for the indicated document –Useful for finding out info about a resource without retrieving it POST treats the document as a script and sends some data to it

More Methods PUT replaces the contents of the document with some data DELETE deletes the indicated document TRACE invokes a remote loop-back of the request. The final recipient SHOULD reflect the message back to the client Usually these methods are not allowed

GET Method GET is the most common HTTP method It says “give me this resource”

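GET Requests With a Proxy
Without a proxy: the client connects to the web server www.cs.huji.ac.il and requests the path /~dbi/index.html
With a proxy: the client sends the proxy server the full URL http://www.cs.huji.ac.il/~dbi/index.html, and the proxy contacts the web server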
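The difference between the two cases is only in what follows the method on the initial line. A small sketch, assuming a hypothetical helper request_line and the URL from the slide:

```python
from urllib.parse import urlsplit

def request_line(url, via_proxy, version="HTTP/1.0"):
    """Build the initial line of a GET request for the given URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # A proxy needs the full URL to know which server to contact;
    # an origin server only needs the path.
    target = url if via_proxy else path
    return f"GET {target} {version}"

url = "http://www.cs.huji.ac.il/~dbi/index.html"
direct = request_line(url, via_proxy=False)
proxied = request_line(url, via_proxy=True)
```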

HEAD Request A HEAD request asks the server to return the response headers only, and not the actual resource (i.e., no message body) Same as GET but without the message body This is useful for checking characteristics of a resource without actually downloading it, thus saving bandwidth Used for testing hypertext links for validity, accessibility and recent modification

Post POST request can send data to the server POST is mostly used in form-filling –The contents of a form are translated by the browser into some special format and sent to a script on the server using the POST command

Post (cont.) There is a block of data sent with the request, in the message body There are usually extra headers to describe this message body, like Content-Type: and Content-Length: The request URI is a program to handle the sent data, not a resource to retrieve The HTTP response is normally the output of a program, not a static file

Post Example Here's a typical form submission, using POST:
POST /path/script.cgi HTTP/1.0
From: frog@cs.huji.ac.il
User-Agent: HTTPTool/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 35
[blank line here]
home=Ross+109&favorite+flavor=flies
(the message body is 35 characters long)
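The Content-Length value is easiest to get right by computing it from the encoded body. The sketch below rebuilds the form body from the example with urllib.parse.urlencode; the field names come from the slide, while the variable names are illustrative.

```python
from urllib.parse import urlencode

# Form fields as the user typed them; urlencode applies the "+" and "%xx" rules.
form = [("home", "Ross 109"), ("favorite flavor", "flies")]
body = urlencode(form).encode("ascii")

request = (
    "POST /path/script.cgi HTTP/1.0\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("ascii") + body
```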

Headers HTTP 1.0 defines 16 headers –none are required HTTP 1.1 defines 46 headers –one header (Host:) is required in requests

Headers From: –gives the email address of whoever is making the request or running the program doing so User-Agent: –identifies the program that's making the request, in the form "Program-name/x.xx", x.xx is the (mostly) alphanumeric version of the program. For example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold"

Headers (cont.) Server: –analogous to the User-Agent: header –it identifies the server software in the form "Program-name/x.xx" –For example, one beta version of Apache's server returns "Server: Apache/1.2b3-dev"

Headers (cont.) If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular, Content-Type: –gives the MIME-type of the data in the body, such as text/html or image/gif Content-Length: –gives the number of bytes in the body

Headers (cont.) Last-Modified: –Gives the modification date of the resource that's being returned –It's used in caching and other bandwidth-saving activities –Greenwich Mean Time should be used and the format is Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT

Initial Line of a Response The initial line of a response is also called the status line. The initial line consists of –HTTP version –response status code –reason phrase that describes the status code

Response Format

Response Example
HTTP/1.0 200 OK    (initial line: version, status code, reason phrase)
Date: Fri, 31 Dec 1999 23:59:59 GMT    (header)
Content-Type: text/html    (header)
Content-Length: 1354    (header)
[blank line here]
Hello World (more file contents)...    (message body)
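A response can be taken apart along the same lines: status line, headers, blank line, body. The following is a simplified sketch, not a full parser: parse_response is a hypothetical helper, it assumes the whole response is already in memory, and it ignores repeated or folded headers.

```python
def parse_response(raw):
    """Split raw response bytes into status-line parts, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: version, three-digit code, reason phrase.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body

raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Date: Fri, 31 Dec 1999 23:59:59 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>Hello World</html>"
)
version, code, reason, headers, body = parse_response(raw)
```

Header names are lower-cased here because HTTP header names are case-insensitive.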

Status Code The status code is a three-digit integer, and the first digit identifies the general category of response: –1xx indicates an informational message only –2xx indicates success of some kind –3xx redirects the client to another URL –4xx indicates an error on the client's part Yes, the system blames it on the client if a resource is not found (i.e., 404) –5xx indicates an error on the server's part
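Because the category depends only on the first digit, a client can react sensibly even to a status code it does not recognize. A minimal sketch (the function name and category labels are illustrative):

```python
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_category(code):
    """Map a three-digit status code to its general category."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return CATEGORIES[code // 100]
```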

Status Code 1xx The 100 (Continue) Status –Allows a client to determine if the server is willing to accept the request (based on the request headers) before the client sends the request body –The client’s request must have the header Expect: 100-continue The 101 (Switching Protocols) Status

Status Code 2xx Status codes 2xx -- Success The action was successfully received, understood, and accepted –200 OK –201 Created (e.g., POST command successful) –202 Accepted (request accepted for processing) –203 Non-Authoritative Information –204 No Content

Status Code 3xx Status codes 3xx -- Redirection Further action must be taken in order to complete the request –300 Resource found at multiple locations –301 Resource moved permanently –302 Resource moved temporarily –304 Resource has not been modified (since date)

Status Code 4xx Status codes 4xx -- Client error The request contains bad syntax or cannot be fulfilled –400 Bad request from client –401 Unauthorized request –402 Payment required for request –403 Resource access forbidden –404 Resource not found –405 Method not allowed for resource –406 Resource type not acceptable

Status Code 5xx Status codes 5xx -- Server error The server failed to fulfill an apparently valid request –500 Internal server error –501 Method not implemented –502 Bad gateway –503 Service unavailable (e.g., server overload) –504 Gateway timeout

Response Information (header: description of information)
Server: type of server
Date: date and time
Content-Length: number of bytes
Content-Type: MIME type
Content-Language: English, for example
Content-Encoding: data compression
Last-Modified: date when last modified
Expires: date when file becomes invalid

Manually Experimenting with HTTP
>host www
www.cs.huji.ac.il is a nickname for vafla.cs.huji.ac.il
vafla.cs.huji.ac.il has address 132.65.80.39
vafla.cs.huji.ac.il mail is handled (pri=10) by cs.huji.ac.il
>telnet www.cs.huji.ac.il 80
Trying 132.65.80.39…
Connected to vafla.cs.huji.ac.il.
Escape character is ‘^]’.

Sending a Request >GET /~dbi/index.html HTTP/1.0 [blank line]

The Response
HTTP/1.1 200 OK
Date: Sun, 11 Mar 2001 21:42:15 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Sun, 25 Feb 2001 21:42:15 GMT
Content-Length: 479
Content-Type: text/html
[blank line]
(html code …)

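Request: GET /~dbi/index.html HTTP/1.0 Response: HTTP/1.1 200 OK, followed by the HTML code

Request: GET /~dbi/no-such-page.html HTTP/1.0 Response: HTTP/1.1 404 Not Found, followed by HTML code

Request: GET /index.html HTTP/1.1 Response: HTTP/1.1 400 Bad Request, followed by HTML code Why is it a Bad Request? It is an HTTP/1.1 request without a Host: header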

HTTP 1.1 HTTP/1.1 is replacing/has replaced HTTP/1.0 as the new Web protocol

Improvements Faster response –allowing multiple transactions to take place over a single persistent connection –adding cache support Faster response for dynamically-generated pages –supporting chunked encoding, which allows a response to be sent before its total length is known Efficient use of IP addresses –allowing multiple domains to be served from a single IP address

Improvements over HTTP 1.0 HTTP/1.1 has a number of features/improvements over HTTP/1.0, including –Persistent TCP connections –Partial document transfers –Conditional fetch –Support for nonstandard HTTP/1.0 extensions –Better support for alternative character sets –More flexible authentication –Faster response and great bandwidth savings –Efficient use of IP addresses (virtual hosting)

Non-Persistent Connections
1. Browser opens TCP connection to port 80 of server (handshake)
2. Browser sends HTTP request message
3. Server receives request, locates object, sends response
4. Server closes TCP connection
5. Client receives response, parses object
6. Repeat steps 1-5 for each embedded object

Persistent Connection
1. Browser opens TCP connection to port 80 of server (handshake)
2. Browser sends HTTP request message
3. Server receives request, locates object, sends response
4. Client receives response, parses object
5. Repeat steps 2-4 for each embedded object
6. TCP connection closes on demand or timeout

Advantages of Persistent Connection
CPU time is saved in routers and hosts
HTTP requests and responses can be pipelined on a connection
Network congestion is reduced
Latency on subsequent requests is reduced

Pipelines There are 2 types of persistent connections –Without pipelining: the client issues a new request only after the previous response has arrived –With pipelining: the client sends a request as soon as it encounters a reference, so multiple requests/responses can travel in the same IP packet or in back-to-back packets
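To get a feeling for the latency difference, here is a rough back-of-the-envelope model. It is only a sketch under simplifying assumptions that are not from the slides: the TCP handshake costs one round trip, each request/response costs one round trip, there are no parallel connections, and transmission time is ignored. The function name and mode labels are hypothetical.

```python
def round_trips(embedded_objects, mode):
    """Estimate the round trips needed for a page plus its embedded objects."""
    n = embedded_objects
    if mode == "non-persistent":
        # A handshake plus a request/response for the page and for every object.
        return 2 * (n + 1)
    if mode == "persistent":
        # One handshake, then one request/response at a time.
        return 1 + (n + 1)
    if mode == "pipelined":
        # One handshake, the page itself, then all objects requested together.
        return 1 + 1 + (1 if n else 0)
    raise ValueError(f"unknown mode: {mode}")

# The page with 10 inline images mentioned earlier:
estimates = {mode: round_trips(10, mode)
             for mode in ("non-persistent", "persistent", "pipelined")}
```

Under these assumptions the page with 10 inline images needs 22, 12, and 3 round trips respectively; real browsers and networks will behave differently.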

Virtual Hosts With HTTP 1.1, one server at one IP address can be multi-homed: –“www.cs.huji.ac.il” and “www.math.huji.ac.il” can live on the same server –These are called virtual hosts –Without this mechanism, we have to use 2 different IP addresses It is like several people sharing one phone An HTTP request must specify the host name (and possibly port) for which the request is intended

Example The request specifies the host:
GET /path/file.html HTTP/1.1
Host: www.host1.com:80
[blank line here]
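On the server side, the Host: header is what selects among the virtual hosts. The sketch below is one possible way to do the lookup; the document-root paths, the fallback to a default host, and the function name are all assumptions made for illustration.

```python
# Hypothetical mapping from host name to document root.
VIRTUAL_HOSTS = {
    "www.cs.huji.ac.il": "/var/www/cs",
    "www.math.huji.ac.il": "/var/www/math",
}
DEFAULT_ROOT = "/var/www/default"  # assumed fallback for unknown host names

def select_document_root(headers):
    """Return (status_code, document_root) for an HTTP/1.1 request."""
    host = headers.get("Host")
    if host is None:
        # An HTTP/1.1 request without Host: is a Bad Request.
        return 400, None
    hostname = host.split(":", 1)[0].lower()  # drop an optional ":port"
    return 200, VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)
```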

Virtual Hosting (cont.) Virtual hosting –reduces hardware expenditures –extends the ability to support additional servers –makes load balancing and capacity planning much easier Without it –each host name requires a unique IP address, and we are quickly running out of IP addresses with the explosion of new domains

The Date Header In HTTP 1.1, servers must include the generation time of the response in the Date: header Time values use Greenwich Mean Time (GMT) and have the format Date: Fri, 31 Dec 1999 23:59:59 GMT Date is omitted only in a few cases, e.g., status code 100 (continue) and some server errors Servers must synchronize their clocks with a reliable external standard

Caching Caching improves performance Eliminates the need to send requests in many cases (reduces network round-trips), using an expiration mechanism Eliminates the need to send full responses in other cases (reduces network bandwidth), using a validation mechanism

Client Caching
1. Client sends GET /fruit/apple.gif
2. Server responds with the object and a Last-Modified: … header
3. Client caches the object and its last-modified date
4. Later, client sends GET /fruit/apple.gif … with If-Modified-Since: …
5. Server returns either 304 Not Modified or the object

Network Caches [Figure: the client sends GET /fruit/apple.gif to a proxy server, which sits between the client and the server]

Benefit of Caching [Figure: clients on a 10 Mbps LAN, a 1.5 Mbps link between two routers (R) to the Internet and the server, 15 req/sec at 100 Kbits/req, and a proxy server with a 40% hit rate]
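The numbers on this slide (15 req/sec, 100 Kbits/req, a 1.5 Mbps link, a 40% hit rate) suggest a simple load calculation. The sketch below assumes that the 1.5 Mbps link is the bottleneck between the LAN and the server and that cache hits generate no traffic on it; both are interpretations of the figure, not statements from the slides.

```python
def access_link_load(req_per_sec, bits_per_req, link_bps, hit_rate=0.0):
    """Fraction of the link capacity used by requests that miss the cache."""
    miss_traffic_bps = req_per_sec * bits_per_req * (1.0 - hit_rate)
    return miss_traffic_bps / link_bps

# Without a cache: 15 * 100 Kbits = 1.5 Mbps, i.e. the link is saturated.
no_cache = access_link_load(15, 100_000, 1_500_000)

# With a 40% hit rate only 60% of the requests cross the link.
with_cache = access_link_load(15, 100_000, 1_500_000, hit_rate=0.4)
```

Under these assumptions the load on the link drops from about 100% to about 60%.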

Expiration Model Servers may provide an expiration time using the Expires header –By checking the expiration time, the cache can return a fresh response without contacting the server If the expiration time is not specified, the cache can heuristically estimate the expiration times (e.g., using header values, such as the Last-Modified time)
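A cache's freshness check might look roughly like the sketch below. The heuristic used when Expires is missing (treat the copy as fresh for 10% of the time between Last-Modified and Date) is one common choice and is an assumption here, as are the function and parameter names.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers, now, heuristic_fraction=0.1):
    """Decide whether a cached response may be served without revalidation."""
    if "Expires" in headers:
        return now < parsedate_to_datetime(headers["Expires"])
    if "Date" in headers and "Last-Modified" in headers:
        date = parsedate_to_datetime(headers["Date"])
        age_at_fetch = date - parsedate_to_datetime(headers["Last-Modified"])
        # Heuristic: a resource that has not changed for a long time
        # is assumed to stay unchanged a little longer.
        return now < date + heuristic_fraction * age_at_fetch
    return False  # nothing to go on: treat the copy as stale

cached = {
    "Date": "Sun, 11 Mar 2001 21:42:15 GMT",
    "Last-Modified": "Sun, 25 Feb 2001 21:42:15 GMT",
}
fetched = datetime(2001, 3, 11, 21, 42, 15, tzinfo=timezone.utc)
fresh_next_day = is_fresh(cached, fetched + timedelta(days=1))
fresh_two_days_later = is_fresh(cached, fetched + timedelta(days=2))
```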

The Risk in Caching Response might not be “semantically transparent” –the response is different from what would have been returned by the origin server The cache should verify that the copy is fresh (i.e., expiration time has not passed) The copy is stale if it is not fresh

Validators A validator is any mechanism that may help in determining whether a copy is fresh or stale –A strong validator is, for example, a counter that is incremented whenever the resource is changed –A weak validator is, for example, a counter that is incremented only when a significant change is made

Using the Cache To check whether a copy is fresh, the cache must either –Use the expiration model, or –Compare the Last-Modified time or some validator with the origin server In the second case, the origin server either –Responds with the message 304(Not Modified), or –Sends a full response with the entity body

Cache-Control Header Cache-control headers specify directives to the cache –Can be included in either requests or responses The server can specify “must-revalidate” –Cache must revalidate with the origin server that the copy is still fresh The client can specify –the max-age of an unvalidated response –The max-stale time of a stale copy

Do not Use a Cache The Pragma: no-cache request header indicates that the request should not be satisfied from a cache Same as the no-cache cache-directive (Cache-Control: no-cache) Should include both if it is not known whether the server is HTTP/1.1 compliant The directive applies to any recipient along the request/response chain

If-Modified-Since Header The If-Modified-Since: header is used with a GET request If the requested resource has been modified since the given date, the server returns the resource as it normally would (i.e., the header is ignored) Otherwise, the server returns a 304 Not Modified response, including the Date: header, but with no message body
HTTP/1.1 304 Not Modified
Date: Fri, 31 Dec 1999 23:59:59 GMT
[blank line here]
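The server-side decision can be sketched in a few lines. The function name and its return convention (status code plus a flag saying whether to send the body) are hypothetical; dates are compared after parsing the HTTP date strings.

```python
from email.utils import parsedate_to_datetime

def conditional_get(last_modified, if_modified_since=None):
    """Return (status_code, send_body) for a GET request on a resource."""
    if if_modified_since is not None:
        modified = parsedate_to_datetime(last_modified)
        since = parsedate_to_datetime(if_modified_since)
        if modified <= since:
            return 304, False  # Not Modified: headers only, no message body
    return 200, True  # modified, or no condition given: send the resource

resource_date = "Sun, 25 Feb 2001 21:42:15 GMT"
unchanged = conditional_get(resource_date, "Sun, 11 Mar 2001 21:42:15 GMT")
changed = conditional_get(resource_date, "Sat, 24 Feb 2001 00:00:00 GMT")
```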

If-Unmodified-Since Header The If-Unmodified-Since: header can be used with any method If the requested resource has not been modified since the given date, the server returns the resource as it normally would Otherwise, the server returns a 412 Precondition Failed response
HTTP/1.1 412 Precondition Failed
[blank line here]

Cooperative Caching

Cooperative Caching (cont.) Higher-level cache (e.g., national cache) –larger user population –higher hit rates Multiple Web caches which cooperate => improve overall performance Cooperative caches are usually built from clusters –divide the traffic overhead –improve storage capacity

Cooperative Caching (cont.) Which caches should be asked for a particular document? Hash routing (of URLs) -- an object will not be present in more than one cache
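Hash routing can be sketched as follows: every URL is hashed, and the hash picks exactly one cache in the cluster, so the same document always lands on the same cache. The cache names, the choice of SHA-256, and the simple modulo scheme are assumptions made for illustration; real deployments may use other schemes.

```python
import hashlib

def pick_cache(url, caches):
    """Map a URL to exactly one cache in the cluster."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(caches)
    return caches[index]

# Hypothetical cluster of three cooperating caches.
cluster = ["cache-a", "cache-b", "cache-c"]
chosen = pick_cache("http://www.cs.huji.ac.il/~dbi/index.html", cluster)
```

One limitation of the modulo scheme is that adding or removing a cache changes the assignment of most URLs.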

Hop by Hop HTTP/1.1 introduces the concept of hop-by-hop headers: –Message headers that apply only to a given connection, and not to the entire path –They make proxies (caches) much more powerful

Hop-by-Hop Headers Connection –options that are desired for that particular connection (e.g., Connection: close) Public –lists the set of methods supported by the server Proxy-Authenticate –enables authentication methods between two hops Transfer-Encoding –compression method between two hops Upgrade –additional communication protocols

Chunked Encoding Chunked encoding –Transmission of streaming multimedia One frame varies in size and composition from the next –Streaming video Entire image transmitted in first chunk and differences from the previous image are transmitted in the next chunk Wake up, we speak about movies in the Internet
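On the wire, HTTP/1.1 chunked encoding sends each chunk as its size in hexadecimal, a CRLF, the chunk data, and another CRLF; a chunk of size zero ends the body. The sketch below shows that framing. The function names are hypothetical, and chunk extensions and trailer headers are ignored.

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings as a chunked message body."""
    out = b""
    for chunk in chunks:
        if chunk:  # an empty chunk would be mistaken for the terminator
            out += f"{len(chunk):x}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

def decode_chunked(data):
    """Reassemble the body from chunked framing (trailers are ignored)."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return body
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data

encoded = encode_chunked([b"Hello, ", b"World"])
decoded = decode_chunked(encoded)
```

Since each chunk carries its own length, the sender never needs to know the total length in advance.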

Compression Most image formats (GIF, JPEG, MPEG) are precompressed Many other data types used in the Web are not precompressed Compression could save almost 40% of the bytes sent via HTTP There is a need for negotiating the type of encoding of the compressed resource

Compression (cont.) Client sends the header Accept-Encoding –The header indicates the content-encodings that the client can handle and the ones that the client prefers Server sends –Content-Encoding header – for end-to-end encoding indication –Transfer-Encoding header – for hop-by-hop encoding indication (supported only in HTTP/1.1)
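A server's side of this exchange can be sketched with Python's gzip module. This is a simplification: the helper name is hypothetical, only gzip is offered, and the q-values a client may attach to Accept-Encoding are ignored.

```python
import gzip

def encode_body(body, accept_encoding):
    """Compress the body if the client's Accept-Encoding header lists gzip."""
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    if "gzip" in offered:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}  # send the body unencoded, with no Content-Encoding header

page = b"<html>" + b"Hello World " * 50 + b"</html>"
compressed, extra_headers = encode_body(page, "gzip, deflate")
```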

Content Negotiation Content Negotiation: –the process of selecting the best representation for a given response when there are multiple representations available HTTP supports two kinds of content negotiation: –Server-driven negotiation –Agent-driven negotiation

Server-Driven Negotiation The selection is made by the server, based on: –header fields in the request (client preferences): Accept-Language / Accept-Encoding –available representations of the response –other information (e.g., address of the client) Disadvantages: –Impossible for the server to determine what is best for the user –Inefficiency (clients should describe their capabilities in every request) –Complicates implementation of servers

Agent-Driven Negotiation Selection is made by the client after receiving an initial response from the server –Based on available representations specified in the initial response –Automatic or manual Disadvantages: –needs a second request to obtain the best alternative representation

Protocol Switching –Client can specify another protocol more suited to the data being transferred (e.g., a real-time synchronous protocol) Client: “I hate HTTP/1.0, I want another protocol”

Authentication Many sites require users to provide a username and password in order to access the documents housed on the server This requirement provides a mechanism for keeping track of users (more than just a security mechanism)

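Authentication
Client to web server (www.cs.huji.ac.il): request for /~dbi/index.html
Web server to client: Who are you?
Client to web server: request for /~dbi/index.html, “I am Donald, my password is Duck”
Web server to client: response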

Authentication How does it work? –Client sends an ordinary request message –Server responds with the 401 Authorization Required status code and a WWW-Authenticate header, which specifies how to perform authentication –Client resends the request message, this time including the Authorization header (e.g., user name & password) –The client continues to add this header to each following request to that server
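The slides do not say which authentication scheme is used. As an illustration, the sketch below assumes the Basic scheme, in which the Authorization header carries "user:password" encoded in base64 (an encoding, not encryption). The function names are hypothetical.

```python
import base64

def basic_authorization(user, password):
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def parse_basic_authorization(value):
    """Recover (user, password) on the server side."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError(f"unsupported scheme: {scheme}")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header_value = basic_authorization("Donald", "Duck")
credentials = parse_basic_authorization(header_value)
```

Because base64 is trivially reversible, this scheme protects the password only if the connection itself is protected.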

Cookies An alternative way to identify browsers The server response includes the Set-Cookie header, which has the attributes –name=VALUE –expires=DATE STRING –domain=DOMAIN NAME –path=PATH –secure The client returns the cookie with requests for matching URLs

Cookies Example: –Client contacts a web site for the first time –Server response includes the header Set-Cookie: 1678453 –Client stores the cookie value and the server name in a special “cookie file” –For each further request to that server, the client will add the header Cookie: 1678453
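In practice a cookie is a name=value pair followed by optional attributes. The sketch below uses Python's http.cookies module to parse the value of a Set-Cookie header and to build the Cookie header the client sends back. The cookie name "session" and the path are hypothetical; only the value 1678453 comes from the example.

```python
from http.cookies import SimpleCookie

# Value of a Set-Cookie header as received from the server.
jar = SimpleCookie()
jar.load("session=1678453; Path=/courses")

morsel = jar["session"]
# The client sends back only the name=value pair, not the attributes.
cookie_header = f"Cookie: {morsel.key}={morsel.value}"
```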

Cookies (cont.) Usage: –Server requires authentication, but doesn’t want to hassle a user with a user-name and password –Remembering user’s preferences for advertising –Cookies enable creating a virtual shopping cart Problems –users who access the same site from different machines

Are you HTTP experts now? Not yet There are more headers, for example, that this talk did not cover To know more, go to the specifications

Additional Information For specifications and additional information:
–http://www.w3.org/Protocols/
–http://www.w3.org/Protocols/Specs.html
–http://www.jmarshall.com/easy/http/
–http://wdvl.com/Internet/Protocols/HTTP/article.html
