HTTP for DB Dummies Steve Gribble


1 HTTP for DB Dummies Steve Gribble gribble@cs.berkeley.edu

2 The Web
HTTP 1.0 model (slowly fading out, replaced by HTTP 1.1):
[Diagram labels: Client, cache, Server, TCP, GET /document.html]

3 The Web
[Diagram labels: Client, cache, Server]

4 Basics of HTTP

5 Structure of a Request
GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n
Connection: Keep-Alive\r\n
User-Agent: Mozilla/4.07 [en] (X11; I; Linux 2.0.36 i686)\r\n
Host: ninja.cs.berkeley.edu:5556\r\n
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*\r\n
Accept-Encoding: gzip\r\n
Accept-Language: en\r\n
Accept-Charset: iso-8859-1,*,utf-8\r\n
\r\n
: \r\n
… \r\n
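
A minimal sketch of how the request on this slide breaks apart: a request line, then "Name: value" headers, then an empty line. The function name `parse_request` is made up for illustration, and the example drops a few of the slide's headers for brevity.

```python
from urllib.parse import parse_qs, urlsplit


def parse_request(raw: bytes):
    """Split a raw HTTP/1.0 request into request line parts and headers."""
    # The header section ends at the first empty line (CRLF CRLF).
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are stored lower-cased so lookups ignore case.
        headers[name.strip().lower()] = value.strip()
    url = urlsplit(target)
    # parse_qs decodes "+" as a space, as in HTML form submissions.
    return method, url.path, parse_qs(url.query), version, headers


RAW = (
    b"GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n"
    b"Connection: Keep-Alive\r\n"
    b"Host: ninja.cs.berkeley.edu:5556\r\n"
    b"Accept-Encoding: gzip\r\n"
    b"\r\n"
)
method, path, query, version, headers = parse_request(RAW)
print(method, path, query, version, headers["host"])
```

Note that the Host value itself contains a colon (the port), which is why the header is split on the first ":" only.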

6 Structure of a Response
\r\n
: \r\n
… \r\n
HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Thu, 04 Feb 1999 00:28:19 GMT
Accept-ranges: bytes
Last-modified: Wed, 01 Jul 1998 17:07:38 GMT
Content-length: 1848
Content-type: text/html
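
The response mirrors the request: a status line, headers, an empty line, then the payload. A rough sketch using the slide's headers follows; `parse_response_head` is an illustrative name, and the body is a placeholder of the advertised length because the slide does not show the real payload.

```python
def parse_response_head(raw: bytes):
    """Split a raw HTTP/1.0 response into status line, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


RESPONSE = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: Netscape-Enterprise/2.01\r\n"
    b"Date: Thu, 04 Feb 1999 00:28:19 GMT\r\n"
    b"Accept-ranges: bytes\r\n"
    b"Last-modified: Wed, 01 Jul 1998 17:07:38 GMT\r\n"
    b"Content-length: 1848\r\n"
    b"Content-type: text/html\r\n"
    b"\r\n"
) + b"x" * 1848  # placeholder body, not the real document

version, code, reason, headers, body = parse_response_head(RESPONSE)
# The advertised length should match the bytes that follow the empty line.
print(code, reason, int(headers["content-length"]) == len(body))
```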

7 TCP level analysis
[Figure panels: HTTP 1.0; FTP (>= 2nd file)]

8 Interesting TCP gotchas
Mandatory roundtrips
–TCP three-way handshake
–get request, data return
–new connections for each inlined image (parallelize)
–lots of extra SYN or SYN/ACK packets
Slow-start penalties
–can be shown to affect only fast networks, not modems
Lots of TCP connections to server
–spatial/processing overhead in server (TCP stack)
–many protocol control block (PCB) TIME_WAIT entries
–unfairness because of loss of congestion control info
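
A back-of-the-envelope model of the mandatory round trips, under stated assumptions that are not from the slides: every object needs its own connection, each fetch costs one round trip for the handshake plus one for the request and first data, and slow start, transfer time and server latency are ignored. `http10_round_trips` is an illustrative name.

```python
import math


def http10_round_trips(inline_images: int, parallel_connections: int = 1) -> int:
    """Estimate round trips to load a page and its inline images over HTTP/1.0."""
    per_fetch = 2  # 1 RTT for SYN / SYN-ACK, 1 RTT for request and first data
    # Images can only be requested after the HTML has arrived; they are
    # then fetched in batches limited by the number of parallel connections.
    batches = math.ceil(inline_images / parallel_connections)
    return per_fetch * (1 + batches)


print(http10_round_trips(8))                          # serialized fetches
print(http10_round_trips(8, parallel_connections=4))  # parallelized fetches
```

Parallel connections shorten the wait but not the packet count, which is where the extra SYN and SYN/ACK traffic and the server-side connection load come from.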

9 Fix?
Persistent HTTP
–in HTTP/1.0, add “Connection: Keep-Alive\r\n” header
–in HTTP/1.1, P-HTTP built in
Does it help?
–mostly for server-side reasons, not network efficiency
–allows pipelining of multiple requests on one connection
Does it hurt?
–how does a client know when document is returned?
–when does the connection get dropped?
    idle timeouts on server side
    client drops connections
    server needs to reclaim resources
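
One answer to "how does a client know when document is returned?" is the Content-Length header: on a persistent connection the client reads exactly that many bytes and then expects the next status line. The sketch below simulates the connection with an in-memory stream; `read_response` and the two tiny responses are made up for illustration.

```python
import io


def read_response(stream):
    """Read one response from a persistent connection, framed by Content-Length."""
    status_line = stream.readline().rstrip(b"\r\n")
    if not status_line:
        return None  # nothing more on the connection
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:
            break  # empty line: end of headers
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # Without Content-Length the only end marker would be connection close.
    length = int(headers[b"content-length"])
    return status_line, headers, stream.read(length)


# Two responses back to back on the same (simulated) connection.
wire = io.BytesIO(
    b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nhello"
    b"HTTP/1.0 200 OK\r\nContent-Length: 6\r\n\r\nworld!"
)
first = read_response(wire)
second = read_response(wire)
print(first[2], second[2], read_response(wire))
```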

10 HTTP/1.0 Client Methods
GET
–fetch and return a document
–URL can be overloaded to submit form data
    GET /foo/bar.html?x=bar&bam=baz
POST
–submit a form, and receive response
HEAD
–like GET, but only return HTTP headers and not the data itself. Useful for caching
PUT, DELETE, LINK, UNLINK
–not really used - big security issues if not careful

11 HTTP/1.0 Status Codes
Family of codes, with 5 “types”
–1xx: informational
–2xx: successful, e.g. 200 OK
–3xx: redirection (gotcha: redirection loops?)
    301 Moved Permanently
    304 Not Modified
–4xx: Client Error
    400 Bad Request
    401 Unauthorized
    403 Forbidden
    404 Not Found
–5xx: Server Error
    501 Not Implemented
    503 Service Unavailable
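
A small sketch of the two ideas on this slide: the first digit of the code selects the family, and a client following 3xx responses needs a hop limit to survive redirection loops. The names `status_family` and `follow_redirects`, the limit of 5 hops, and the dictionary standing in for servers' Location answers are all assumptions for illustration.

```python
FAMILIES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_family(code: int) -> str:
    """Map a status code to its family using the leading digit."""
    return FAMILIES.get(code // 100, "unknown")


def follow_redirects(start: str, locations: dict, max_hops: int = 5) -> str:
    """Follow Location targets until a non-redirecting URL is reached.

    `locations` maps a URL to the Location a server would return for it.
    """
    url, hops = start, 0
    while url in locations:
        if hops == max_hops:
            raise RuntimeError("too many redirects, possible loop")
        url = locations[url]
        hops += 1
    return url


print(status_family(304), status_family(404))
print(follow_redirects("/a", {"/a": "/b", "/b": "/c"}))
```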

12 HTTP/1.0 Headers (case insensitive?)
Allow - returned by server
–Allow: GET, HEAD
–never used in practice - clients know what they can do
Authorization - sent by client
–Authorization:
–“Basic Auth” is commonly used
– = Base64( username:password )
–ok if inside an SSL connection (encrypted)
Content-Encoding - sent by either
–Content-Encoding: x-gzip
–selects an encoding for the transport, not the content
–sadly, no common support for encodings (Windows)
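
A minimal sketch of the Basic Auth encoding described above, Base64( username:password ). The helper names and the sample credentials are illustrative only. The decoder is included to show why this is only acceptable inside an encrypted connection: Base64 is an encoding, not encryption.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build an Authorization header line for Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")


def decode_basic_auth(header: str):
    """Recover (username, password) from the header; anyone on the wire can."""
    value = header.split(":", 1)[1].strip()
    _scheme, token = value.split(" ", 1)
    username, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```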

13 HTTP/1.0 Headers continued
Content-Length - sent by either
–Content-Length: 56
–how much payload is being sent?
–necessary for persistent HTTP, or for POSTs
Content-Type - sent by server
–Content-Type: text/html
–what MIME type the payload is
–nasty one: multipart/mixed
Date
–Date: Tue, 15 Nov 1994 08:12:31 GMT
–3 accepted date formats (RFC 822, RFC 850, asctime())
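
The three accepted date formats can be handled by trying each in turn. This is a sketch, assuming the default C/English locale for day and month names; `parse_http_date` is an illustrative name, and the RFC 850 and asctime() strings below are the slide's date rewritten by hand into those formats.

```python
from datetime import datetime, timezone

FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822 style, 4-digit year
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850 style, 2-digit year
    "%a %b %d %H:%M:%S %Y",       # C asctime() style
)


def parse_http_date(text: str) -> datetime:
    """Parse an HTTP date in any of the three accepted formats (always GMT)."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # not this format, try the next one
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognized HTTP date: {text!r}")


for sample in (
    "Tue, 15 Nov 1994 08:12:31 GMT",
    "Tuesday, 15-Nov-94 08:12:31 GMT",
    "Tue Nov 15 08:12:31 1994",
):
    print(parse_http_date(sample).isoformat())
```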

14 HTTP/1.0 headers, continued
Expires - sent by server
–Expires: Thu, 01 Dec 1994 16:00:00 GMT
–primitive caching expiration date
–cannot force clients to update view, only on refresh
From - sent by client
–From: gribble@cs.berkeley.edu
–not really used
If-Modified-Since - sent by client
–If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
–server returns data if modified, else “304 Not Modified”
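
The If-Modified-Since exchange can be sketched from the server's side as a single comparison. The function name `conditional_get` and the sample body are assumptions for illustration, and only the RFC 822 style date format is handled here.

```python
from datetime import datetime, timezone

HTTP_DATE = "%a, %d %b %Y %H:%M:%S GMT"


def conditional_get(if_modified_since, last_modified: datetime, body: bytes):
    """Return (status, payload) for a GET with an optional If-Modified-Since."""
    if if_modified_since is not None:
        since = datetime.strptime(if_modified_since, HTTP_DATE)
        since = since.replace(tzinfo=timezone.utc)
        if last_modified <= since:
            return 304, b""  # client's copy is still current: no payload
    return 200, body


LAST_MODIFIED = datetime(1994, 10, 29, 19, 43, 31, tzinfo=timezone.utc)
print(conditional_get("Sat, 29 Oct 1994 19:43:31 GMT", LAST_MODIFIED, b"<html>"))
print(conditional_get("Fri, 28 Oct 1994 19:43:31 GMT", LAST_MODIFIED, b"<html>"))
```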

15 HTTP/1.0 headers, continued
Last-Modified - returned by server
–Last-Modified: Sat, 29 Oct 1994 19:43:31 GMT
–semantically imprecise - file modification? Record timestamp? Date in case file dynamically generated?
–used with If-Modified-Since and HEAD method
Location - returned by server
–Location: http://www.cs.ubc.ca
–used in case of 3xx redirections
Pragma - sent by client or server
–Pragma: no-cache
–extensibility mechanism. No-cache is the only popularly used pragma, AFAIK

16 HTTP/1.0 headers, continued
Referer - sent by client
–Referer: http://www.xxx-smut.com
–specifies address from which request was generated
–all sorts of privacy issues - must be careful with this
Server - returned by server
–Server: Netscape-Enterprise/2.01
–identifies server software. why? (measurement…)
User-Agent - sent by client
–User-Agent: Mozilla/4.07 [en] (X11; I; Linux 2.0.36 i686)
–identifies client software
–why? Optimize layout, send based on capability of client.
–Hint: just pretend to be Netscape. MSIE does.

17 HTTP/1.0 Server headers
WWW-Authenticate - sent by server
–WWW-Authenticate:
–tells client to resend request with Authorization: header
Incrementally added hacks:
–Accept: image/gif, image/jpeg, text/*, */*
–Accept-Encoding: gzip
–Accept-Language: en
–Retry-After: (date) or (seconds)
–[Set-]Cookie: Part_Number="Rocket_Launcher_0001"; Version="1"; Path="/acme"
–Title: (title)

18 HTTP/1.1 Additions
Lots of problems associated with HTTP/1.0
–the network problems we talked about before
–very poor cache consistency models
–difficulty implementing multi-homed servers
    want 1 IP address with multiple DNS names - how?
–hard to precalculate content-lengths
–connection dropped = lost data
    no chunking
HTTP/1.1 is a bloated spec to fix these problems
–introduces many complexities
–no longer an easy protocol to implement

19 HTTP/1.1 - a Taste of the New
Host: www.ninja.com
–clients MUST send this - fixes multi-homed problem
–already in most 1.0 and 1.1 clients
Range: bytes=300-304,601-993
–useful for broken connection recovery (like FTP recovery)
Age:
–expiration from caches
Etag: fa898a3e3
–unique tag to identify document (strong or weak forms)
Cache-control:
–marking documents as private (don’t keep in caches)
“chunked” transfer encoding
–segmenting of documents - don’t have to calculate entire document length. Useful for dynamic query responses.
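
A sketch of the "chunked" transfer encoding: each piece is sent as a hexadecimal length line, the bytes, and a CRLF, and a zero-length chunk marks the end, so the sender never needs the total document length up front. The function names are illustrative, and the decoder ignores trailers and assumes well-formed input.

```python
def encode_chunked(pieces) -> bytes:
    """Encode an iterable of byte strings as a chunked message body."""
    out = bytearray()
    for piece in pieces:
        if piece:  # a zero-length chunk would signal end of body
            out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk, then the empty line ending the body
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the payload from a chunked body (no trailer support)."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # Chunk extensions after ";" are skipped; the size is hexadecimal.
        size = int(data[pos:eol].split(b";", 1)[0], 16)
        pos = eol + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


wire = encode_chunked([b"Hello, ", b"world"])
print(wire)
print(decode_chunked(wire))
```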

20 Architectural Complexities

21 Caches
Original web:
[Diagram labels: Client, cache, Server, TCP]
Problem: no locality
–non-local access pattern (trans-atlantic access)
–servers serving the same bytes millions of times to localized communities of users

22 Solution: Cache Hierarchy
NLANR cache hierarchy most widely developed
–informally uses Squid cache
–root servers squirt out 30GB per day
–anybody can join...
[Diagram labels: Client, cache, Cache, Server]

23 Gotchas
Staleness
–HTTP/1.1 cache consistency mechanisms mostly solve this
Security
–what happens if I infiltrate a cache?
–servers/clients don’t even know this is happening
–e.g.: AOL used to have a very stale cache, but has since moved to Inktomi
Ad clickthrough counts
–how does Yahoo know how many times you accessed their pages, or more importantly, their ads?

24 CGI-BIN gateways
CGI = “Common Gateway Interface”
–interface that allows independent authors to develop code that interacts with web servers
–dynamic content generation, especially from scripts
–CGI programs execute in separate process, typically
[Diagram labels: Client, cache, httpd, CGI code, File System, URL, data]

25 CGI-BIN to DB gateways
JDBC/ODBC gateways
–single-node DB, often running on remote host
–long, blocking operations, usually
–nasty transactional issues - how does client know that action succeeded or failed?
    Datek/E*Trade troubles
[Diagram labels: Client, cache, httpd, CGI code, File System, DB, ODBC / JDBC / etc., URL, data]

26 cgi-bin security
Lots of gotchas with CGI-BIN programs
–buffer overflows (maximum length checks?)
–shell metacharacter expansion
    what happens if you put `cat /etc/passwd` in a form field?
–sending mail, reading files
–redirection - allows bypassing IP address-based security
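
A sketch of the shell metacharacter gotcha and one common way around it: never build a shell command line from form data; pass the field as a separate argument so no shell interprets it, or quote it if a shell is unavoidable. The helper `run_safely` is hypothetical, and its child process is just a Python one-liner that echoes its argument.

```python
import shlex
import subprocess
import sys

FIELD = "`cat /etc/passwd`"  # hostile form input from the slide


def run_safely(user_input: str) -> str:
    """Hand form data to a child process as one argv entry, with no shell."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# With a list of arguments the backticks arrive as literal characters.
print(run_safely(FIELD))
# If a shell command string must be built, quote each untrusted piece.
print(shlex.quote(FIELD))
```

By contrast, interpolating the field into a string run through a shell would execute the embedded command with the web server's privileges.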

27 Multiple server support
We’ve seen how single IP address can serve multiple web sites with “Host:” HTTP/1.1 field
–what about having multiple physical hosts serving a single web site?
–useful for scalability reasons
[Diagram labels: Client, cache, TCP, Server, Server, www.hotbot.com]

28 Solutions
DNS round-robin
–assign multiple IP addresses to single domain name
–client selects amongst them in order
–shortcomings:
    exposes individual nodes to clients
    can’t take into account machine capabilities (multiprocessors) and currently experienced load
Front-end redirection
–single front-end node serves HTTP redirect to selected backend node
–introduces extra round-trip, FE is single point of failure
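
A toy model of DNS round-robin, to make the shortcoming concrete: addresses are handed out strictly in rotation, with no knowledge of machine capability or load. The class name, the domain and the 192.0.2.x addresses (a documentation-only range) are all made up for illustration.

```python
from itertools import cycle


class RoundRobinResolver:
    """Hand out the addresses registered for a name in strict rotation."""

    def __init__(self, records: dict):
        self._rotations = {name: cycle(addrs) for name, addrs in records.items()}

    def resolve(self, name: str) -> str:
        # Each lookup simply advances the rotation; load is never consulted.
        return next(self._rotations[name])


resolver = RoundRobinResolver(
    {"www.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)
print([resolver.resolve("www.example.com") for _ in range(4)])
```

A slow or overloaded node still receives every third client, and each client learns the address of an individual backend.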

29 More solutions
IP-level multiplexing through smart router
–munge IP packets and send them to selected host
–Cisco, SUN, etc. make hardware to do this
    Cisco LocalDirector
–tricky state management issues, failure semantics
“Smart Clients”
–Netscape “Proxy Autoconfig” (PAC) mechanism
    only useful if connecting via proxy
    Javascript selects from amongst proxies
–No HTTP protocol support for smart client access to web servers

30 The “Real” Picture of the Web
[Diagram labels: Client, cache, Redirector, HTTP Server (x4), cache / firewall, CGI code, DB, URL, data, IIII, $$$$, www.nytimes.com]

31 Web Characteristics

32 UCB HIP trace
Web traffic circa 1997 is primarily:
–GIF data
    27% of bytes transferred, 51% of files transferred
    average size: 4.1 KB
–JPEG data
    31% of bytes transferred, 16% of files transferred
    average size: 12.8 KB
–HTML data
    18% of bytes transferred, 22% of files transferred
    average size: 5.6 KB
File sizes, server latency, access patterns
–all heavy-tailed: most small, but some very large
–self-similarity everywhere - lots and lots of bursts

33 Server-Side Architecture

34 Goals of server
High capacity web servers must do the following:
–rapidly update corpus of content served
–be efficient
    latency: serve content as quickly as possible
    throughput: parallel requests from large numbers of clients
–be extensible
    data-types
    cgi-bin programs
    server plug-ins
–not crash
–remain secure

35 High-level Architecture
[Diagram blocks: Plugin Interface, Network handler, Concurrency subsystem, Filesystem cache, CGI interface, Protocol parser, Logging subsystem, Reverse DNS cache]

36 Concurrency
How many simultaneously open connections must a server handle?
–1,000,000 hits per day
    12 hits per second average
    upwards of 50 hits per second peak (bursts, diurnal cycle)
–latency:
    10 milliseconds (out of memory) ==> 1 connection
    50 milliseconds (off of disk) ==> 3 connections
    200 milliseconds (CGI + disk) ==> 10 connections
    5 seconds (CGI to DB gateway) ==> 250 connections
Depending on expected usage, need very different concurrency models
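
The numbers on this slide appear to follow from multiplying the peak arrival rate by the per-request latency and rounding up. A sketch of that arithmetic follows; the function names are illustrative, and the 50 hits per second peak is taken from the slide.

```python
SECONDS_PER_DAY = 86_400


def average_hits_per_second(hits_per_day: int) -> float:
    """Average request rate implied by a daily hit count."""
    return hits_per_day / SECONDS_PER_DAY


def open_connections(peak_hits_per_second: int, latency_ms: int) -> int:
    """Connections in flight at peak: rate x latency, rounded up."""
    # Integer arithmetic avoids floating-point surprises in the rounding.
    return (peak_hits_per_second * latency_ms + 999) // 1000


print(round(average_hits_per_second(1_000_000)))
for label, latency_ms in [
    ("out of memory", 10),
    ("off of disk", 50),
    ("CGI + disk", 200),
    ("CGI to DB gateway", 5000),
]:
    print(label, open_connections(50, latency_ms))
```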

37 Strategies
Single process, single thread, serialized
–simplest implementation, worst performance
–perfectly fine for low traffic sites
Multiple processes, single serialized thread / process
–Apache web server model
–expensive (context switching, process state, …)
Multithreaded [and multiprocess]
–complex synchronization primitives needed
–thread creation/destruction vs. thread pool management
Event driven, asynchronous I/O
–eliminates context switch overhead, better memory mgmt
–very complex and delicate program flow
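
A rough sketch of the event-driven strategy: one thread asks the OS which connections are readable and services only those, so idle connections cost no thread or process. It uses Python's standard `selectors` module with in-process socket pairs standing in for real client connections; `serve_ready` and the canned reply are assumptions for illustration, not a real server.

```python
import selectors
import socket

REPLY = b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok"


def serve_ready(sel: selectors.BaseSelector, timeout: float = 1.0) -> int:
    """One pass of the event loop: answer every connection that is readable."""
    handled = 0
    for key, _events in sel.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(REPLY)  # a real server would parse `data` first
            handled += 1
        else:
            sel.unregister(conn)  # peer closed the connection
            conn.close()
    return handled


sel = selectors.DefaultSelector()
clients = []
for _ in range(3):
    client, server_side = socket.socketpair()
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ)
    clients.append(client)

# Only two of the three clients send a request; the third stays idle.
clients[0].sendall(b"GET / HTTP/1.0\r\n\r\n")
clients[2].sendall(b"GET / HTTP/1.0\r\n\r\n")
handled = serve_ready(sel)
replies = [clients[0].recv(4096), clients[2].recv(4096)]
print(handled, replies)
```

Even this toy version hints at the "delicate program flow": partial reads, partial writes and per-connection state all have to be tracked by hand once requests span more than one event.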

38 Disk I/O
File system overhead
–file system buffer management not optimal
–don’t need many of the file system facilities
    modifying files, moving files, locking files, seeks…
Alternatives:
–directly interact with disk
    very fast, very complex
–in-memory caching on top of file system
    works well given high locality of server access
    be careful to not suffer from double-buffering
Interaction: thread subsystem and disk
–balanced system - enough threads to saturate disk I/O

39 Network I/O
Typical server behaviour rough on network stack
–multiple outstanding connections
–very rapid TCP creation and teardown
–often, very slow last-hop network segment
Redundant operations performed
–checksum calculations, byte swapping, …
Inefficiencies at packet level
–header, body, FIN usually three separate round-trips
Poor network stack implementations
–TIME_WAIT and IDLE PCB entries on single linked list
–Nagle’s algorithm invoked when it shouldn’t be

40 Inline scripting
Technology: server-side includes (SSIs)
–script embedded inside content, interpreted before sent back to client
–dynamically computed content inside templates
    authorization (cert lookup or authentication)
    DB lookup (inventory lists, product prices, …)
Challenges
–similar to CGI:
    security
    efficiency (latency and throughput)

41 Cheetah (Exokernel)
Direct access to hardware primitives
–disk, network - eliminate costly OS generalizations
–scatter/gather IO primitives
–allow for common disk/network buffers (eliminate copy)
Compiler-assisted ILP
–eliminate redundancies, staging inefficiencies
HTTP-specialized network stack and file system
–precomputed HTTP headers, minimal copies
–minimize network packets (e.g. piggyback FINs with data)
–precomputed TCP/IP checksums

42 Some Parting Thoughts

43 Other things to keep in mind
There are non-humans on the web
–spiders, crawlers, worms, etc., may behave badly
    infinite FTP directory traps, request bursts, ...
Netscape, MSIE, and Apache set de facto standards
–their semantics may subtly differ from standards
–error-tolerance of popular clients/servers means that everybody must achieve same levels of tolerance
    otherwise, you appear to be broken to users
    e.g.: Netscape not parsing comments properly
SSL/X.509
–transport-level security: fixes up basic auth problems
–eliminates caching or proxy mechanisms

