Presentation on theme: "The Web and Content Distribution Networks"— Presentation transcript:
1 The Web and Content Distribution Networks. Nick Feamster, CS 6250, Fall 2011. Blast through HTTP review slides on the projector. Swap back to chalkboard for new stuff. (Some notes from David Andersen and Christian Kauffman.)
3 HTTP Basics. HTTP is layered over a bidirectional byte stream, almost always TCP. Interaction: the client sends a request to the server, followed by a response from server to client. Requests/responses are encoded in text. Stateless: the server maintains no information about past client requests.
4 How to Mark End of Message? Size of message (Content-Length): must know size of transfer in advance. Delimiter (MIME-style Content-Type): server must “escape” delimiter in content. Close connection: only the server can do this.
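The first and third options above can be sketched in a few lines. This is a minimal illustration rather than a full HTTP parser: the function name `split_message` is hypothetical, and it assumes the whole message is already in memory.

```python
def split_message(raw: bytes):
    """Split one HTTP message into (headers, body).

    If Content-Length is present, the body is exactly that many bytes.
    Otherwise the body is everything received before the connection closed.
    """
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the request line or status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if "content-length" in headers:
        length = int(headers["content-length"])
        return headers, rest[:length]
    # No length given: rely on connection close to mark the end.
    return headers, rest
```

With a length header the receiver can stop reading early and keep the connection open for the next message, which is what persistent connections rely on.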
5 HTTP Request. Request line: method, URL (relative), HTTP version. Methods: GET (return URI), HEAD (return headers only of GET response), POST (send data to the server: forms, etc.). URL (relative): e.g., /index.html. HTTP version.
6 HTTP Request (cont.). Request headers: Authorization (authentication info); Accept (acceptable document types/encodings); From (user); If-Modified-Since; Referer (what caused this page to be requested; HTTP spells the header name this way); User-Agent (client software). Blank line. Body.
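The request layout described on these two slides (request line, headers, blank line, body) can be assembled as follows. This is a sketch only: `build_request` is a hypothetical helper, and the `Host` value in the usage note is a placeholder.

```python
def build_request(method, path, headers, body=b"", version="HTTP/1.1"):
    """Serialize a request: request line, header lines, blank line, body."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"  # the blank line ends the headers
    return head.encode("ascii") + body
```

For example, `build_request("GET", "/index.html", {"Host": "example.com"})` yields the request line, one header, and the terminating blank line, with an empty body.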
9 HTTP Response. Status line: HTTP version, 3-digit response code, reason phrase. Response codes: 1XX informational; 2XX success (200 OK); 3XX redirection (301 Moved Permanently, 302 Moved Temporarily, 304 Not Modified); 4XX client error (404 Not Found); 5XX server error (505 HTTP Version Not Supported).
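The status line structure can be illustrated with a small parser that also maps the first digit of the code to its class. The names `STATUS_CLASSES` and `parse_status_line` are made up for this sketch.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def parse_status_line(line):
    """Return (version, code, reason phrase, class) for a status line."""
    version, code, reason = line.split(" ", 2)
    code = int(code)
    return version, code, reason, STATUS_CLASSES[code // 100]
```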
10 HTTP Response (cont.). Response headers: Location (for redirection); Server (server software); WWW-Authenticate (request for authentication); Allow (list of methods supported: GET, HEAD, etc.); Content-Encoding (e.g., x-gzip); Content-Length; Content-Type; Expires; Last-Modified. Blank line. Body.
11 HTTP Response Example
% curl -D - -o /dev/null
HTTP/ OK
Date: Tue, 25 Oct :50:28 GMT
Server: Apache/ (Debian)
X-Powered-By: PHP/ squeeze1
Set-Cookie: a1c6441c5610e cceee6bfd0ef=q3a5oq1e47ncpkn97mqbo47qm0; path=/
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Set-Cookie: ja_purity_tpl=ja_purity; expires=Sun, 14-Oct :50:28 GMT; path=/
Expires: Mon, 1 Jan :00:00 GMT
Last-Modified: Tue, 25 Oct :50:28 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
12 Outline. HTTP intro and details. Persistent HTTP. HTTP caching. Content distribution networks.
13 HTTP 0.9/1.0. One request/response per TCP connection. Advantage: simple to implement. Disadvantages: multiple connection setups (three-way handshake each time); several extra round trips added to transfer; multiple slow starts. Now that you’ve seen TCP in the previous lectures, the overhead of one TCP connection per object is clearer.
14 Single Transfer Example (client/server timeline). 0 RTT: client opens TCP connection (SYN, SYN). 1 RTT: client sends HTTP request for HTML (ACK, DAT); server reads from disk (ACK, DAT, FIN). 2 RTT: client parses HTML (ACK), then opens a new TCP connection (FIN, ACK, SYN, SYN). 3 RTT: client sends HTTP request for image (ACK, DAT); server reads from disk (ACK). 4 RTT: image begins to arrive (DAT).
15 More Problems. Short transfers are hard on TCP: stuck in slow start; loss recovery is poor when windows are small. Lots of extra connections: increases server state/processing. Server is also forced to keep TIME_WAIT connection state. Things to think about: Why must the server keep these? The number of TIME_WAIT connections tends to be an order of magnitude greater than the number of active connections. Why?
16 Persistent Connection Solution. Multiplex multiple transfers onto one TCP connection. How to identify requests/responses? Delimiter: server must examine response for delimiter string. Content-length and delimiter: must know size of transfer in advance. Block-based transmission: send in multiple length-delimited blocks. Store-and-forward: wait for entire response and then use content-length. Solution: use existing methods and close connection otherwise.
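Block-based transmission is what HTTP/1.1 chunked transfer encoding does: each block is preceded by its length in hexadecimal, and a zero-length block ends the message. The sketch below shows the framing only; it ignores trailers and chunk extensions beyond skipping them, and the function names are illustrative.

```python
def encode_chunked(blocks):
    """Frame each non-empty block as <hex length>CRLF<data>CRLF, then end with a 0 block."""
    out = b""
    for block in blocks:
        if block:
            out += f"{len(block):x}\r\n".encode("ascii") + block + b"\r\n"
    return out + b"0\r\n\r\n"

def decode_chunked(data):
    """Reassemble the body from length-delimited blocks."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)  # drop any chunk extension
        if size == 0:
            return body
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the block and its trailing CRLF
```

The sender never needs to know the total size in advance, so a dynamically generated response can still share a persistent connection.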
17 Persistent Connection Example (client/server timeline). 0 RTT: client sends HTTP request for HTML (DAT); server reads from disk (ACK, DAT). 1 RTT: client parses HTML (ACK) and sends HTTP request for image (DAT); server reads from disk (ACK, DAT). 2 RTT: image begins to arrive.
18 Persistent HTTP. Nonpersistent HTTP issues: requires 2 RTTs per object; OS must work and allocate host resources for each TCP connection; but browsers often open parallel TCP connections to fetch referenced objects. Persistent HTTP: server leaves connection open after sending response; subsequent HTTP messages between same client/server are sent over the connection. Persistent without pipelining: client issues new request only when previous response has been received; one RTT for each referenced object. Persistent with pipelining: persistent connections are the default in HTTP/1.1, and pipelining is permitted on them; client sends requests as soon as it encounters a referenced object; as little as one RTT for all the referenced objects.
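The round-trip counts on this slide can be compared with a toy model. It is a simplification under stated assumptions: transmission time, slow start, and parallel connections are ignored, the base HTML always costs one handshake plus one request, and `page_rtts` is a hypothetical name.

```python
def page_rtts(referenced_objects, mode):
    """Round trips to fetch one HTML page plus its referenced objects."""
    n = referenced_objects
    html = 2  # TCP handshake + request/response for the base HTML
    if mode == "nonpersistent":
        return html + 2 * n  # new connection (2 RTTs) per object
    if mode == "persistent":
        return html + n  # one RTT per object, one at a time
    if mode == "pipelined":
        return html + (1 if n > 0 else 0)  # all requests sent back to back
    raise ValueError(f"unknown mode: {mode}")
```

For a page with 10 referenced objects this gives 22, 12, and 3 round trips respectively.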
20 HTTP Caching. Clients often cache documents. Challenge: update of documents. If-Modified-Since requests to check: HTTP 0.9/1.0 used just the date; HTTP 1.1 has an opaque “entity tag” (could be a file signature, etc.) as well. When/how often should the original be checked for changes? Check every time? Check each session? Day? Etc.? Use the Expires header; if no Expires, often use Last-Modified as an estimate. Ask: What’s the problem with just using date for IMS queries? Web servers may generate content per client. Examples: CGI scripts, things like Google customized home; content can vary with cookies, client IP, referrer, accept-encoding, etc.
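The server side of a cache check can be sketched as below: prefer the entity tag when the client sends one, and fall back to the date comparison otherwise. This is an illustrative simplification (it does not handle weak validators), and `conditional_status` is a hypothetical name.

```python
from email.utils import parsedate_to_datetime

def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    etag is the current entity tag (quoted string); last_modified is an
    HTTP-date string such as 'Tue, 25 Oct 2011 14:50:28 GMT'.
    """
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        tags = [t.strip() for t in inm.split(",")]
        return 304 if (etag in tags or "*" in tags) else 200
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims):
            return 304
    return 200
```

The entity tag check catches cases the date alone misses, such as per-client content that changes without the file's modification time changing.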
21 Example Cache Check Request
GET / HTTP/1.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
If-Modified-Since: Mon, Oct :54:18 GMT
If-None-Match: "7a11f-10ed-3a75ae4a"
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv: )
Host:
Connection: Keep-Alive
23 Ways to cache. Client-directed caching: Web proxies. Server-directed caching: Content Delivery Networks (CDNs).
24 Web Proxy Caches. User configures browser: Web accesses via cache. Browser sends all HTTP requests to cache. Object in cache: cache returns object. Else cache requests object from origin server, then returns object to client. [Figure: clients exchange HTTP requests/responses with the proxy server, which exchanges HTTP requests/responses with origin servers.]
25 Caching Example (1). Assumptions: average object size = 100,000 bits; average request rate from institution’s browsers to origin servers = 15/sec; delay from institutional router to any origin server and back to router = 2 sec. Consequences: utilization on LAN = 15%; utilization on access link = 100%; total delay = Internet delay + access delay + LAN delay = 2 sec + minutes + milliseconds. [Figure: origin servers on the public Internet, a 1.5 Mbps access link, and an institutional network with a 10 Mbps LAN.]
26 Caching Example (2). Install cache. Suppose hit rate is 0.4. Consequence: 40% of requests will be satisfied almost immediately (say 10 msec); 60% of requests satisfied by origin server; utilization of access link reduced to 60%, resulting in negligible delays. Weighted average of delays = 0.6 * 2 sec + 0.4 * 10 msec < 1.3 sec. [Figure: origin servers on the public Internet, a 1.5 Mbps access link, and an institutional network with a 10 Mbps LAN and an institutional cache.]
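The arithmetic in the two caching examples can be reproduced directly from the stated assumptions. The helper names are illustrative; the numbers are the ones on the slides.

```python
def utilization(request_rate, object_bits, link_bps):
    """Fraction of link capacity used by the offered load."""
    return request_rate * object_bits / link_bps

def average_delay(hit_rate, hit_delay_s, miss_delay_s):
    """Hit-rate-weighted average response delay in seconds."""
    return hit_rate * hit_delay_s + (1 - hit_rate) * miss_delay_s

RATE = 15          # requests per second
OBJECT = 100_000   # bits per object
HIT_RATE = 0.4

lan_util = utilization(RATE, OBJECT, 10_000_000)       # 10 Mbps LAN
access_util = utilization(RATE, OBJECT, 1_500_000)     # 1.5 Mbps access link
# With the cache, only misses cross the access link.
access_util_cached = utilization(RATE * (1 - HIT_RATE), OBJECT, 1_500_000)
avg_delay = average_delay(HIT_RATE, 0.010, 2.0)

print(f"LAN {lan_util:.0%}, access {access_util:.0%}, "
      f"access with cache {access_util_cached:.0%}, avg delay {avg_delay:.3f} s")
```

The weighted delay comes to about 1.2 seconds, consistent with the slide's bound of under 1.3 seconds.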
27 Problems. Over 50% of all HTTP objects are uncacheable. Why? Not easily solvable: dynamic data (stock prices, scores, web cams); CGI scripts (results based on passed parameters). Obvious fixes: SSL (encrypted data is not cacheable; most web clients don’t handle mixed pages well, so many generic objects are transferred with SSL); cookies (results may be based on past data); hit metering (owner wants to measure number of hits for revenue, etc.). What will be the end result?
29 Content Distribution Networks (CDNs). The content providers are the CDN customers. Content replication: CDN company installs hundreds of CDN servers throughout the Internet, close to users. CDN replicates its customers’ content in CDN servers. When provider updates content, CDN updates servers. [Figure: origin server in North America, CDN distribution node, and CDN servers in S. America, Asia, and Europe.]
30 Content Distribution Networks & Server Selection. Replicate content on many servers. Challenges: how to replicate content; where to replicate content; how to find replicated content; how to choose among known replicas; how to direct clients towards a replica.
31 Server Selection. Which server? Lowest load: to balance load on servers. Best performance: to improve client performance (based on geography? RTT? throughput? load?). Any alive node: to provide fault tolerance. How to direct clients to a particular server? As part of routing: anycast, cluster load balancing (not covered). As part of application: HTTP redirect. As part of naming: DNS. Ask: What things might you want to pick the server using? How do you direct clients to a server? Anyone know how Akamai works?
32 Application-Based Content Routing. HTTP supports a simple way to indicate that a Web page has moved (30X responses). Server receives GET request from client, decides which server is best suited for the particular client and object, and returns an HTTP redirect to that server. Can make informed application-specific decisions. May introduce additional overhead: multiple connection setups, name lookups, etc. OK solution in general, but HTTP redirect has some flaws, especially with current browsers: it incurs many delays, which operators may really care about. Draw diagram: Client ---> Server, <---redir----, ---> Server 2, <----answer----.
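A redirect-based router can be sketched as a lookup followed by a 30X response. Everything here is assumed for illustration: the region-to-replica table, the host names, and the choice of 302 as the status code.

```python
# Hypothetical mapping from client region to replica host name.
REPLICAS = {"us": "us.cdn.example.net", "eu": "eu.cdn.example.net"}
DEFAULT_REPLICA = "origin.example.net"

def pick_server(client_region):
    """Choose the replica for this client, falling back to the origin."""
    return REPLICAS.get(client_region, DEFAULT_REPLICA)

def redirect_response(host, path):
    """Build a 302 response whose Location header points at the chosen server."""
    return (
        "HTTP/1.1 302 Found\r\n"
        f"Location: http://{host}{path}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode("ascii")
```

The client must then open a second connection to the host named in `Location`, which is the extra setup cost the slide warns about.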
33 Naming-Based Content Routing. Client does DNS name lookup for service. Name server chooses appropriate server address: the A record returned is the “best” one for the client. What information can the name server base its decision on? Server load/location (must be collected). Information in the name lookup request: the name service client is typically the local name server for the client.
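A name server's choice might look like the sketch below: map the querying resolver's address to a region, then return the least-loaded replica there. The prefixes, regions, addresses (all from documentation ranges), and load figures are invented for illustration, and real CDNs use far richer inputs.

```python
import ipaddress

# Hypothetical replica table: region -> list of (A record, current load 0..1).
REPLICAS = {
    "us-east": [("192.0.2.10", 0.70), ("192.0.2.11", 0.20)],
    "europe": [("198.51.100.5", 0.40)],
}
# Hypothetical mapping from resolver prefixes to regions.
RESOLVER_REGIONS = [
    (ipaddress.ip_network("203.0.113.0/25"), "us-east"),
    (ipaddress.ip_network("203.0.113.128/25"), "europe"),
]
DEFAULT_REGION = "us-east"

def choose_a_record(resolver_ip):
    """Pick the A record to return, based only on the resolver's address."""
    addr = ipaddress.ip_address(resolver_ip)
    region = next(
        (name for net, name in RESOLVER_REGIONS if addr in net),
        DEFAULT_REGION,
    )
    server, _load = min(REPLICAS[region], key=lambda entry: entry[1])
    return server
```

Note that the decision sees the local name server's address, not the end client's, which is the limitation the slide points out.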
34 Other Findings. Each CDN performed best for at least one (NIMI) client. Why? Because of proximity? The best origin sites were better than the worst CDNs. CDNs with more servers don’t necessarily perform better (note that they don’t know load on servers). HTTP 1.1 improvements (parallel download, pipelined download) help a lot, even more so for origin (non-CDN) cases. Note that not all origin sites implement pipelining.
35 What is a Content Distribution Network? The RFCs and Internet Drafts define a Content Distribution Network, “CDN”, as: Content Delivery Network or Content Distribution Network. A type of CONTENT NETWORK in which the CONTENT NETWORK ELEMENTS are arranged for more effective delivery of CONTENT to CLIENTS.
36 What is a Content Distribution Network? A CDN is an overlay network, designed to deliver content from the optimal location. In many cases, optimal does not mean geographically closest. CDNs are made of distinct, geographically disparate groups of servers, with each group able to serve all content on the CDN. Servers may be separated by type: e.g., one group may serve Windows Streaming Media, another group may serve HTTP. Servers are not typically shared between media types.
37 What is a Content Distribution Network? Some CDNs are network-owned (Level 3, Limelight, AT&T), some are not (Akamai, Mirror Image, CacheFly, Panther Express). Network-owned CDNs have all or most of their servers in their own ASN. Non-network CDNs can place servers directly in other ASNs. This means things like NetFlow will not be useful for determining traffic to/from non-network CDNs.
38 The Akamai System. The Akamai EdgePlatform: 85,000+ servers, 1700+ POPs, 72+ countries, 950+ networks, 660+ cities. Resulting in traffic of: 5.4 petabytes / day, 790+ billion hits / day, 436+ million unique client IPs / day (circa April 2011).
39 How CDNs Work. When content is requested from a CDN, the user is directed to the optimal server. This is usually done through the DNS, especially for non-network CDNs. It can be done through anycasting for network-owned CDNs. Users who query DNS-based CDNs will be returned different A records for the same hostname. This is called “mapping”. The better the mapping, the better the CDN.
40 How CDNs Work. Example of CDN mapping. Notice the different A records for different locations:
[NYC]% host
CNAME a568.d.akamai.net
a568.d.akamai.net A
a568.d.akamai.net A
[Boston]% host
a568.d.akamai.net A
a568.d.akamai.net A
41 CDN Mapping: Another Example
From Atlanta:
% ping youtube.com
PING youtube.com ( ) 56(84) bytes of data.
64 bytes from yx-in-f93.1e100.net ( ): icmp_req=1 ttl=54 time=1.81 ms
From Boston:
% ping youtube.com
PING youtube.com ( ) 56(84) bytes of data.
64 bytes from ord08s07-in-f9.1e100.net ( ): icmp_seq=1 ttl=52 time=22.8 ms
42 How CDNs Work. CDNs use multiple criteria to choose the optimal server. These include standard network metrics: latency, throughput, packet loss. These also include things like CPU load on the server, HD space, network utilization, etc. Geography still counts: that whole speed-of-light thing. Should be able to solve that with the next version of Ethernet…
43 Why CDNs Peer with ISPs. The first and foremost reason to peer is improved performance. Since a CDN tries to serve content as “close” to the end user as possible, peering directly with networks (over non-congested links) obviously helps. Peering gives better throughput. Removing intermediate AS hops seems to give higher peak traffic for the same demand profile. Might be due to lower latency opening TCP windows faster. Might be due to lower packet loss.
44 Why CDNs Peer with ISPs. Redundancy: having more possible vectors to deliver content increases reliability. Burstability: during large events, having direct connectivity to multiple networks allows for higher burstability than a single connection to a transit provider. Burstability is important to CDNs: one of the reasons customers use CDNs is for burstability.
45 Why CDNs Peer with ISPs. Peering reduces costs: reduces transit bill (duh). Network intelligence: receiving BGP directly from multiple ASes helps CDNs map the Internet. Backup for on-net servers: if there are servers on-net, the IX can act as a backup during downtime and overflow; allows serving different content types.
46 Why ISPs Peer with CDNs. Performance: CDNs and ISPs are in the same business, just on different sides; we both want to serve end users as quickly and reliably as possible. You know more about your network than any CDN ever will, so working with the CDN directly can help them deliver the content more quickly and reliably. Cost reduction: transit savings; possible backbone savings.
47 How Non-Network CDNs Use IXes. Non-network CDNs do not have a backbone, so each IX instance is independent. The CDN uses transit to pull content into the servers. Content is then served to peers over the IX. [Figure labels: Origin Server, Transit, CDN Servers, Content, IX, Peer Network.]
48 How CDNs Use IXes. Non-network CDNs usually do not announce large blocks of address space because no one location has a large number of servers. It is not uncommon to see a single /24 from a CDN at an IX. This does not mean you will not see a lot of traffic. How many web servers does it take to fill a gigabit these days?
49 How well do CDNs work? Recall that the bottleneck links are at the edges. Even if CSs are pushed towards the edge, they are still behind the bottleneck link! [Figure labels: Hosting Center, OS, Backbone ISP, CS, IX, ISP, Site, S, C.]
50 Reduced latency can improve TCP performance. DNS round trip. TCP handshake (2 round trips). Slow-start: ~8 round trips to fill DSL pipe, total 128K bytes. Compare to 56 Kbytes for the cnn.com home page: download finished before slow-start completes. Total 11 round trips. Coast-to-coast propagation delay is about 15 ms. Measured RTT last night was 50 ms. No difference between west coast and Cornell! 30 ms improvement in RTT means 330 ms total improvement (11 round trips × 30 ms). Certainly noticeable.
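The slow-start figures can be checked with an idealized model in which the window starts at one segment and doubles every round trip. The 512-byte segment size is an assumption, chosen because eight doublings then carry roughly 128 KB (255 segments); losses and delayed ACKs are ignored, and the function names are illustrative.

```python
def slow_start_rounds(total_bytes, mss=512, initial_window=1):
    """Round trips needed to send total_bytes with a window that doubles each RTT."""
    segments = -(-total_bytes // mss)  # ceiling division
    sent = 0
    window = initial_window
    rounds = 0
    while sent < segments:
        sent += window
        window *= 2
        rounds += 1
    return rounds

def total_improvement_ms(round_trips, rtt_saving_ms):
    """Total time saved when every round trip gets shorter by rtt_saving_ms."""
    return round_trips * rtt_saving_ms
```

Under these assumptions a 56 KB page needs 7 round trips of slow start, so it does finish before the window has grown to fill the pipe, and the slide's 11 round trips at 30 ms each give the 330 ms saving.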
51 Let’s look at a study. Zhang, Krishnamurthy and Wills, AT&T Labs. Traces taken in Sept and Jan. 2001. Compared CDNs with each other. Compared CDNs against non-CDN.
52 Methodology. Selected a bunch of CDNs: Akamai, Speedera, Digital Island. Note, most of these are gone now! Selected a number of non-CDN sites for which good performance could be expected: U.S. and international origin. U.S.: Amazon, Bloomberg, CNN, ESPN, MTV, NASA, Playboy, Sony, Yahoo. Selected a set of images of comparable size for each CDN and non-CDN site: compare apples to apples. Downloaded images from 24 NIMI machines.
53 Response Time Results (II): Including DNS Lookup Time. [Figure: cumulative probability of response time; annotation: about one second.] Author conclusion: CDNs generally provide much shorter download time.
54 HTTP (Summary). Simple text-based file exchange protocol: support for status/error responses, authentication, client-side state maintenance, cache maintenance. Workloads: typical documents (structure, popularity); server workload. Interactions with TCP: connection setup, reliability, state maintenance; persistent connections. How to improve performance: caching; replication.