HTTP for DB Dummies
Steve Gribble



The Web
HTTP 1.0 model (slowly fading out, replaced by HTTP 1.1):
[Figure: Client, cache, Server; the client sends GET /document.html over TCP]

The Web
[Figure: Client, cache, Server]

Basics of HTTP

Structure of a Request
General form: a request line, then “name: value” header lines, each ending in \r\n, then an empty line (\r\n), then the optional body.
Example:
  GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n
  Connection: Keep-Alive\r\n
  User-Agent: Mozilla/4.07 [en] (X11; I; Linux i686)\r\n
  Host: ninja.cs.berkeley.edu:5556\r\n
  Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*\r\n
  Accept-Encoding: gzip\r\n
  Accept-Language: en\r\n
  Accept-Charset: iso ,*,utf-8\r\n
  \r\n
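The request layout above can be taken apart mechanically. The following is a minimal sketch, assuming a well-formed request held entirely in memory; the function name parse_request is hypothetical and only a subset of the slide's headers is reproduced.

```python
from urllib.parse import urlsplit, parse_qs


def parse_request(raw: bytes):
    """Split a raw HTTP request into (method, path, query, version, headers)."""
    # The empty line (\r\n\r\n) separates the head from the optional body.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, request target, protocol version.
    method, target, version = lines[0].split(" ", 2)
    parts = urlsplit(target)

    # Header names are stored lowercased so lookups ignore case.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return method, parts.path, parse_qs(parts.query), version, headers


raw = (
    b"GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n"
    b"Connection: Keep-Alive\r\n"
    b"Host: ninja.cs.berkeley.edu:5556\r\n"
    b"Accept-Encoding: gzip\r\n"
    b"\r\n"
)
method, path, query, version, headers = parse_request(raw)
print(method, path, query, version, headers["host"])
```

Note that the "+" in the query string decodes to a space, so the value of foo is "bar baz".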

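Structure of a Response
General form: a status line, then “name: value” header lines, each ending in \r\n, then an empty line (\r\n), then the body.
Example:
  HTTP/ OK
  Server: Netscape-Enterprise/2.01
  Date: Thu, 04 Feb :28:19 GMT
  Accept-ranges: bytes
  Last-modified: Wed, 01 Jul :07:38 GMT
  Content-length: 1848
  Content-type: text/html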

TCP level analysis
[Figure: packet-level comparison of HTTP 1.0 and FTP (>= 2nd file)]

Interesting TCP gotchas
Mandatory roundtrips
  – TCP three-way handshake
  – get request, data return
  – new connections for each inlined image (parallelize)
  – lots of extra syn or syn/ack packets
Slow-start penalties
  – can be shown to affect only fast networks, not modems
Lots of TCP connections to server
  – spatial/processing overhead in server (TCP stack)
  – many protocol control block (PCB) TIME_WAIT entries
  – unfairness because of loss of congestion control info

Fix?
Persistent HTTP
  – in HTTP/1.0, add “Connection: Keep-Alive\r\n” header
  – in HTTP/1.1, P-HTTP built in
Does it help?
  – mostly for server-side reasons, not network efficiency
  – allows pipelining of multiple requests on one connection
Does it hurt?
  – how does a client know when document is returned?
  – when does the connection get dropped?
      idle timeouts on server side
      client drops connections
      server needs to reclaim resources
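One answer to "how does a client know when document is returned?" on a persistent connection is to trust the Content-Length header and read exactly that many bytes. The sketch below assumes that approach and simulates the connection with an in-memory byte stream; read_response and the two sample responses are illustrative, not taken from the slides.

```python
import io


def read_response(stream):
    """Read one response from a byte stream; return (status_line, headers, body)."""
    status_line = stream.readline().rstrip(b"\r\n").decode("iso-8859-1")

    # Header lines run until the first empty line.
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:
            break
        name, _, value = line.decode("iso-8859-1").partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-Length says where this body ends, so the connection can stay
    # open and the next response starts right after it.
    length = int(headers.get("content-length", "0"))
    body = stream.read(length)
    return status_line, headers, body


# Two back-to-back responses on one simulated connection.
wire = io.BytesIO(
    b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nhello"
    b"HTTP/1.0 200 OK\r\nContent-Length: 3\r\n\r\nbye"
)
first = read_response(wire)
second = read_response(wire)
print(first[2], second[2])
```

Without a length (or some other framing), the only end-of-document signal is the server closing the connection, which is exactly what persistent HTTP tries to avoid.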

HTTP/1.0 Client Methods
GET
  – fetch and return a document
  – URL can be overloaded to submit form data
      GET /foo/bar.html?x=bar&bam=baz
POST
  – submit a form, and receive response
HEAD
  – like GET, but only return HTTP headers and not the data itself. Useful for caching
PUT, DELETE, LINK, UNLINK
  – not really used - big security issues if not careful
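The difference between overloading a GET URL and POSTing a form can be sketched as follows. This is a minimal illustration assuming the usual application/x-www-form-urlencoded encoding; the helper names are hypothetical.

```python
from urllib.parse import urlencode


def get_request_line(path, form):
    """Form data travels in the URL after '?'."""
    return f"GET {path}?{urlencode(form)} HTTP/1.0\r\n"


def post_request(path, form):
    """Form data travels in the body; Content-Length tells the server its size."""
    body = urlencode(form).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


form = {"x": "bar", "bam": "baz"}
print(get_request_line("/foo/bar.html", form))
print(post_request("/foo/bar.html", form).decode("ascii"))
```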

HTTP/1.0 Status Codes
Family of codes, with 5 “types”
  – 1xx: informational
  – 2xx: successful, e.g. 200 OK
  – 3xx: redirection (gotcha: redirection loops?)
      301 Moved Permanently
      304 Not Modified
  – 4xx: Client Error
      400 Bad Request
      401 Unauthorized
      403 Forbidden
      404 Not Found
  – 5xx: Server Error
      501 Not Implemented
      503 Service Unavailable
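Because the first digit carries the "type", a client can classify codes it has never seen. The sketch below shows that, plus one possible guard against the redirection-loop gotcha: a hop limit. The redirect table is a plain dictionary standing in for real server responses, and max_hops=5 is an arbitrary assumption.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Classify a status code by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def follow_redirects(start, redirects, max_hops=5):
    """Follow a URL -> URL redirect map, giving up after max_hops hops."""
    url = start
    hops = 0
    while url in redirects:
        if hops == max_hops:
            raise RuntimeError("too many redirects, possible loop")
        url = redirects[url]
        hops += 1
    return url


print(status_class(304), status_class(404), status_class(503))
print(follow_redirects("/old", {"/old": "/new"}))
```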

HTTP/1.0 Headers (case insensitive?)
Allow - returned by server
  – Allow: GET, HEAD
  – never used in practice - clients know what they can do
Authorization - sent by client
  – Authorization:
  – “Basic Auth” is commonly used
  – = Base64( username:password )
  – ok if inside an SSL connection (encrypted)
Content-Encoding - sent by either
  – Content-Encoding: x-gzip
  – selects an encoding for the transport, not the content
  – sadly, no common support for encodings (Windows)
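The reason Basic Auth is only "ok if inside an SSL connection" is that Base64 is an encoding, not encryption: anyone who sees the header can reverse it. A minimal sketch, with hypothetical helper names and the sample credentials Aladdin / open sesame chosen purely for illustration:

```python
import base64


def basic_auth_header(username, password):
    """Build an Authorization header using the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")


def decode_basic_auth(header):
    """Recover (username, password) from the header; no secret is needed."""
    value = header.split(":", 1)[1].strip()
    scheme, token = value.split(" ", 1)
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(decode_basic_auth(header))
```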

HTTP/1.0 Headers continued
Content-Length - sent by either
  – Content-Length: 56
  – how much payload is being sent?
  – necessary for persistent HTTP, or for POSTs
Content-Type - sent by server
  – Content-Type: text/html
  – what MIME type the payload is
  – nasty one: multipart/mixed
Date
  – Date: Tue, 15 Nov :12:31 GMT
  – 3 accepted date formats (RFC 822, RFC 850, asctime())
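Accepting three date formats means a receiver has to try each in turn. The sketch below assumes an English (C) locale for day and month names and treats every value as GMT; the format strings and the sample dates are illustrative choices, not taken from the slides.

```python
from datetime import datetime, timezone

HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822 style (as updated by RFC 1123)
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850 style, two-digit year
    "%a %b %d %H:%M:%S %Y",       # C asctime() style, no zone given
)


def parse_http_date(text):
    """Try each accepted format; return a timezone-aware UTC datetime."""
    for fmt in HTTP_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized HTTP date: {text!r}")


samples = [
    "Sun, 06 Nov 1994 08:49:37 GMT",
    "Sunday, 06-Nov-94 08:49:37 GMT",
    "Sun Nov  6 08:49:37 1994",
]
for sample in samples:
    print(parse_http_date(sample).isoformat())
```

All three sample strings denote the same instant, which is what makes the two-digit year in the RFC 850 form a hazard: the parser has to guess the century.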

HTTP/1.0 headers, continued
Expires - sent by server
  – Expires: Thu, 01 Dec :00:00 GMT
  – primitive caching expiration date
  – cannot force clients to update view, only on refresh
From - sent by client
  – From:
  – not really used
If-Modified-Since - sent by client
  – If-Modified-Since: Sat, 29 Oct :43:31 GMT
  – server returns data if modified, else “304 Not Modified”
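The If-Modified-Since rule reduces to one comparison on the server. This is a sketch under stated assumptions: the document's modification time is already known as a timezone-aware datetime, the header uses the RFC 822 style date, and conditional_get is a hypothetical helper. The dates are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status_code, send_body) for a GET with an optional condition."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304, False  # Not Modified: headers only, no payload
    return 200, True


modified = datetime(1994, 10, 29, 19, 43, 31, tzinfo=timezone.utc)
cached_at = "Sat, 29 Oct 1994 19:43:31 GMT"
print(conditional_get(modified, cached_at))
print(conditional_get(modified.replace(day=30), cached_at))
```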

HTTP/1.0 headers, con’t
Last-Modified - returned by server
  – Last-Modified: Sat, 29 Oct :43:31 GMT
  – semantically imprecise - file modification? Record timestamp? Date in case file dynamically generated?
  – used with If-Modified-Since and HEAD method
Location - returned by server
  – Location:
  – used in case of 3xx redirections
Pragma - sent by client or server
  – Pragma: no-cache
  – extensibility mechanism. No-cache is the only popularly used pragma, AFAIK

HTTP/1.0 headers, con’t
Referer - sent by client
  – Referer:
  – specifies address from which request was generated
  – all sorts of privacy issues - must be careful with this
Server - returned by server
  – Server: Netscape-Enterprise/2.01
  – identifies server software. why? (measurement…)
User-Agent - sent by client
  – User-Agent: Mozilla/4.07 [en] (X11; I; Linux i686)
  – identifies client software
  – why? Optimize layout, send based on capability of client.
  – Hint: just pretend to be Netscape. MSIE does..

HTTP/1.0 Server headers
WWW-Authenticate - sent by server
  – WWW-Authenticate:
  – tells client to resend request with Authorization: header
Incrementally added hacks:
  – Accept: image/gif, image/jpeg, text/*, */*
  – Accept-Encoding: gzip
  – Accept-Language: en
  – Retry-After: (date) or (seconds)
  – [Set-]Cookie: Part_Number="Rocket_Launcher_0001"; Version="1"; Path="/acme"
  – Title: (title)

HTTP/1.1 Additions
Lots of problems associated with HTTP/1.0
  – the network problems we talked about before
  – very poor cache consistency models
  – difficulty implementing multi-homed servers
      want 1 IP address with multiple DNS names - how?
  – hard to precalculate content-lengths
  – connection dropped = lost data
      no chunking
HTTP/1.1 is bloated spec to fix these problems
  – introduces many complexities
  – no longer an easy protocol to implement

HTTP/1.1 - a Taste of the New
Host:
  – clients MUST send this - fixes multi-homed problem
  – already in most 1.0 and 1.1 clients
Range: bytes= ,
  – useful for broken connection recovery (like FTP recovery)
Age:
  – expiration from caches
Etag: fa898a3e3
  – unique tag to identify document (strong or weak forms)
Cache-control:
  – marking documents as private (don’t keep in caches)
“chunked” transfer encoding
  – segmenting of documents - don’t have to calculate entire document length. Useful for dynamic query responses.
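The "chunked" idea can be sketched in a few lines: each piece is sent with its own hexadecimal length, and a zero-length chunk marks the end, so the sender never needs the total length up front. This sketch ignores trailers and treats chunk extensions only by discarding them; the function names are hypothetical.

```python
def encode_chunked(pieces):
    """Frame each piece as '<hex length>\\r\\n<data>\\r\\n', then a final 0 chunk."""
    out = b""
    for piece in pieces:
        if piece:  # an empty piece would look like the terminating chunk
            out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    return out + b"0\r\n\r\n"


def decode_chunked(data):
    """Reassemble the body from chunked framing (no trailer support)."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # Anything after ';' on the size line is a chunk extension; skip it.
        size = int(data[pos:eol].split(b";")[0].decode("ascii"), 16)
        if size == 0:
            return body
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF


wire = encode_chunked([b"hello", b" world"])
print(wire)
print(decode_chunked(wire))
```

This is why chunking suits dynamic query responses: rows can be streamed as they arrive instead of being buffered to compute Content-Length.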

Architectural Complexities

Caches
Original web:
[Figure: Client, cache, Server, connected over TCP]
Problem: no locality
  – non-local access pattern (trans-atlantic access)
  – servers serving the same bytes millions of times to localized communities of users

Solution: Cache Hierarchy
NLANR cache hierarchy most widely developed
  – informally uses Squid cache
  – root servers squirt out 30GB per day
  – anybody can join...
[Figure: Client, cache, Cache, Server]

Gotchas
Staleness
  – HTTP/1.1 cache consistency mechanisms mostly solve this
Security
  – what happens if I infiltrate a cache?
  – servers/clients don’t even know this is happening
  – e.g.: AOL used to have a very stale cache, but has since moved to Inktomi
Ad clickthrough counts
  – how does Yahoo know how many times you accessed their pages, or more importantly, their ads?

CGI-BIN gateways
CGI = “Common Gateway Interface”
  – interface that allows independent authors to develop code that interacts with web servers
  – dynamic content generation, especially from scripts
  – CGI programs execute in separate process, typically
[Figure: Client, cache, httpd, CGI code, File System, URL data]

CGI-BIN to DB gateways
JDBC/ODBC gateways
  – single-node DB, often running on remote host
  – long, blocking operations, usually
  – nasty transactional issues - how does client know that action succeeded or failed?
      Datek/E*Trade troubles
[Figure: Client, cache, httpd, CGI code, File System, URL data, DB reached via ODBC / JDBC / etc.]

cgi-bin security
Lots of gotchas with CGI-BIN programs
  – buffer overflows (maximum length checks?)
  – shell metacharacter expansion
      what happens if you put `cat /etc/passwd` in a form field?
  – sending mail, reading files
  – redirection - allows bypassing IP address-based security
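The backtick example works because a naive script pastes the form field straight into a shell command line. The sketch below contrasts that with two common mitigations: quoting the value, or avoiding the shell entirely by passing an argument list. The echo command, the helper names, and the 256-character limit are all assumptions made for illustration; nothing is executed here.

```python
import shlex

MAX_FIELD_LEN = 256  # hypothetical limit, standing in for "maximum length checks"


def check_length(field):
    """Reject oversized input before it reaches any fixed-size buffer."""
    if len(field) > MAX_FIELD_LEN:
        raise ValueError("form field too long")
    return field


def naive_shell_command(field):
    # DANGEROUS: a shell would expand backticks, $(...), ';' and friends.
    return "echo " + field


def quoted_shell_command(field):
    # shlex.quote wraps the value so a POSIX shell treats it as literal text.
    return "echo " + shlex.quote(check_length(field))


def argv_without_shell(field):
    # Passing a list to subprocess.run (without shell=True) skips the shell.
    return ["echo", check_length(field)]


attack = "`cat /etc/passwd`"
print(naive_shell_command(attack))
print(quoted_shell_command(attack))
print(argv_without_shell(attack))
```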

Multiple server support
We’ve seen how single IP address can serve multiple web sites with “Host:” HTTP/1.1 field
  – what about having multiple physical hosts serving a single web site?
  – useful for scalability reasons
[Figure: Client, cache, two Servers, connected over TCP]

Solutions
DNS round-robin
  – assign multiple IP addresses to single domain name
  – client selects amongst them in order
  – shortcomings:
      exposes individual nodes to clients
      can’t take into account machine capabilities (multiprocessors) and currently experienced load
Front-end redirection
  – single front-end node serves HTTP redirect to selected backend node
  – introduces extra round-trip, FE is single point of failure
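The load-blindness of round-robin is easy to see in a toy model. The sketch below assumes a fixed address list (using documentation-range addresses) and made-up load figures; a front-end redirector could instead pick the least-loaded node, at the cost of the extra round-trip noted above.

```python
from itertools import cycle


class RoundRobin:
    """Hand out addresses in fixed order, as DNS round-robin effectively does."""

    def __init__(self, addresses):
        self._next = cycle(addresses)

    def pick(self):
        return next(self._next)


def pick_least_loaded(load_by_address):
    """What a load-aware front end could do instead (loads are hypothetical)."""
    return min(load_by_address, key=load_by_address.get)


addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
rr = RoundRobin(addresses)
picks = [rr.pick() for _ in range(4)]
print(picks)

load = {"192.0.2.1": 0.95, "192.0.2.2": 0.20, "192.0.2.3": 0.60}
print(pick_least_loaded(load))
```

Round-robin sends the fourth request back to the first node even though it is the busiest in this example.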

More solutions
IP-level multiplexing through smart router
  – munge IP packets and send them to selected host
  – Cisco, SUN, etc. make hardware to do this
      Cisco LocalDirector
  – tricky state management issues, failure semantics
“Smart Clients”
  – Netscape “Proxy Autoconfig” (PAC) mechanism
      only useful if connecting via proxy
      Javascript selects from amongst proxies
  – No HTTP protocol support for smart client access to web servers

The “Real” Picture of the Web
[Figure: Client, cache, cache / firewall, Redirector, several HTTP Servers, CGI code, DB, URL data, $$$$]

Web Characteristics

UCB HIP trace
Web traffic circa 1997 is primarily:
  – GIF data
      27% of bytes transferred, 51% of files transferred
      average size: 4.1 KB
  – JPEG data
      31% of bytes transferred, 16% of files transferred
      average size: 12.8 KB
  – HTML data
      18% of bytes transferred, 22% of files transferred
      average size: 5.6 KB
File sizes, server latency, access patterns
  – all heavy-tailed: most small, but some very large
  – self-similarity everywhere - lots and lots of bursts

Server-Side Architecture

Goals of server
High capacity web servers must do the following:
  – rapidly update corpus of content served
  – be efficient
      latency: serve content as quickly as possible
      throughput: parallel requests from large numbers of clients
  – be extensible
      data-types
      cgi-bin programs
      server plug-ins
  – not crash
  – remain secure

High-level Architecture
[Figure: server components: Plugin Interface, Network handler, Concurrency subsystem, Filesystem cache, CGI interface, Protocol parser, Logging subsystem, Reverse DNS cache]

Concurrency
How many simultaneously open connections must a server handle?
  – 1,000,000 hits per day
      12 hits per second average
      upwards of 50 hits per second peak (bursts, diurnal cycle)
  – latency:
      10 milliseconds (out of memory) ==> 1 connection
      50 milliseconds (off of disk) ==> 3 connections
      200 milliseconds (CGI + disk) ==> 10 connections
      5 seconds (CGI to DB gateway) ==> 250 connections
Depending on expected usage, need very different concurrency models
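The connection counts above follow from multiplying the arrival rate by the time each request stays open (Little's law), rounded up. A short sketch reproducing the slide's numbers, assuming the 50 hits per second peak rate is the one used throughout:

```python
import math

SECONDS_PER_DAY = 86_400


def average_rate(hits_per_day):
    """Average arrival rate in hits per second."""
    return hits_per_day / SECONDS_PER_DAY


def connections_needed(peak_hits_per_second, latency_ms):
    """Concurrent connections = arrival rate x time in system, rounded up."""
    return math.ceil(peak_hits_per_second * latency_ms / 1000)


print(round(average_rate(1_000_000)))  # about 12 hits per second
for label, latency_ms in [
    ("out of memory", 10),
    ("off of disk", 50),
    ("CGI + disk", 200),
    ("CGI to DB gateway", 5000),
]:
    print(label, connections_needed(50, latency_ms))
```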

Strategies
Single process, single thread, serialized
  – simplest implementation, worst performance
  – perfectly fine for low traffic sites
Multiple processes, single serialized thread / process
  – Apache web server model
  – expensive (context switching, process state, …)
Multithreaded [and multiprocess]
  – complex synchronization primitives needed
  – thread creation/destruction vs. thread pool management
Event driven, asynchronous I/O
  – eliminates context switch overhead, better memory mgmt
  – very complex and delicate program flow
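Two of these strategies can be contrasted in miniature. The sketch below is only an illustration: handle is a hypothetical stand-in for parsing a request and producing a response, the serialized version processes requests one at a time, and the pooled version reuses a fixed set of worker threads rather than creating one thread per request. The pool size of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(request_id):
    """Stand-in for serving one request."""
    return f"response {request_id}"


def serve_serialized(requests):
    """Single process, single thread: one request after another."""
    return [handle(r) for r in requests]


def serve_with_pool(requests, workers=4):
    """Multithreaded with a fixed pool; threads are reused across requests."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in request order even if work overlaps.
        return list(pool.map(handle, requests))


requests = list(range(8))
print(serve_serialized(requests) == serve_with_pool(requests))
```

With a handler this trivial the pool gains nothing; the benefit appears when handle blocks on disk or a database, which is the latency case described under Concurrency.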

Disk I/O
File system overhead
  – file system buffer management not optimal
  – don’t need many of the file system facilities
      modifying files, moving files, locking files, seeks…
Alternatives:
  – directly interact with disk
      very fast, very complex
  – in-memory caching on top of file system
      works well given high locality of server access
      be careful to not suffer from double-buffering
Interaction: thread subsystem and disk
  – balanced system - enough threads to saturate disk I/O

Network I/O
Typical server behaviour rough on network stack
  – multiple outstanding connections
  – very rapid TCP creation and teardown
  – often, very slow last-hop network segment
Redundant operations performed
  – checksum calculations, byte swapping, …
Inefficiencies at packet level
  – header, body, FIN usually three separate round-trips
Poor network stack implementations
  – TIME_WAIT and IDLE PCB entries on single linked list
  – Nagle’s algorithm invoked when it shouldn’t be

Inline scripting
Technology: server-side includes (SSIs)
  – script embedded inside content, interpreted before sent back to client
  – dynamically computed content inside templates
      authorization (cert lookup or authentication)
      DB lookup (inventory lists, product prices, …)
Challenges
  – similar to CGI:
      security
      efficiency (latency and throughput)

Cheetah (Exokernel)
Direct access to hardware primitives
  – disk, network - eliminate costly OS generalizations
  – scatter/gather IO primitives
  – allow for common disk/network buffers (eliminate copy)
Compiler-assisted ILP
  – eliminate redundancies, staging inefficiencies
HTTP-specialized network stack and file system
  – precomputed HTTP headers, minimal copies
  – minimize network packets (e.g. piggyback FINs with data)
  – precomputed TCP/IP checksums

Some Parting Thoughts

Other things to keep in mind
There are non-humans on the web
  – spiders, crawlers, worms, etc, may behave badly
      infinite FTP directory traps, request bursts,...
Netscape, MSIE, and Apache set de facto standards
  – their semantics may subtly differ from standards
  – error-tolerance of popular clients/servers means that everybody must achieve same levels of tolerance
      otherwise, you appear to be broken to users
      e.g.: Netscape not parsing comments properly
SSL/X.509
  – transport-level security: fixes up basic auth problems
  – eliminates caching or proxy mechanisms