The Web and Content Distribution Networks

Slides:



Advertisements
Similar presentations
Nick Feamster CS 3251: Computer Networking I Spring 2013
Advertisements

Protocol layers and Wireshark Rahul Hiran TDTS11:Computer Networks and Internet Protocols 1 Note: T he slides are adapted and modified based on slides.
INTERNET PROTOCOLS Class 9 CSCI 6433 David C. Roberts Entire contents copyright 2011, David C. Roberts, all rights reserved.
HTTP Reading: Section and COS 461: Computer Networks Spring
Computer Networking Lecture 25 – The Web. Lecture 19: Outline HTTP review and details (more in notes) Persistent HTTP review HTTP caching.
Hypertext Transfer PROTOCOL ----HTTP Sen Wang CSE5232 Network Programming.
Chapter 11 Creating Framed Layouts Principles of Web Design, 4 th Edition.
1 Server Selection & Content Distribution Networks (slides by Srini Seshan, CS CMU)
HyperText Transfer Protocol (HTTP)
EEC-484/584 Computer Networks Lecture 6 Wenbing Zhao
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
Chapter 9 Application Layer, HTTP Professor Rick Han University of Colorado at Boulder
Application Layer  We will learn about protocols by examining popular application-level protocols  HTTP  FTP  SMTP / POP3 / IMAP  Focus on client-server.
Chapter 2: Application Layer
EEC-484/584 Computer Networks Discussion Session for HTTP and DNS Wenbing Zhao
HyperText Transfer Protocol (HTTP) Computer Networks Computer Networks Spring 2012 Spring 2012.
CDNs & Replication Prof. Vern Paxson EE122 Fall 2007 TAs: Lisa Fowler, Daniel Killebrew, Jorge Ortiz.
HTTP and Web Content Delivery COS 461: Computer Networks Spring 2011 Mike Freedman
Chapter 2 Application Layer Computer Networking: A Top Down Approach Featuring the Internet, 3 rd edition. Jim Kurose, Keith Ross Addison-Wesley, July.
Web, HTTP and Web Caching
Application Layer  We will learn about protocols by examining popular application-level protocols  HTTP  FTP  SMTP / POP3 / IMAP  Focus on client-server.
5/12/05CS118/Spring051 A Day in the Life of an HTTP Query 1.HTTP Brower application Socket interface 3.TCP 4.IP 5.Ethernet 2.DNS query 6.IP router 7.Running.
Application Layer  We will learn about protocols by examining popular application-level protocols  HTTP  FTP  SMTP / POP3 / IMAP  Focus on client-server.
2/9/2004 Web and HTTP February 9, /9/2004 Assignments Due – Reading and Warmup Work on Message of the Day.
Caching and Content Distribution Networks. Web Caching r As an example, we use the web to illustrate caching and other related issues browser Web Proxy.
1 Content Distribution Networks. 2 Replication Issues Request distribution: how to transparently distribute requests for content among replication servers.
{ Content Distribution Networks ECE544 Dhananjay Makwana Principal Software Engineer, Semandex Networks 5/2/14ECE544.
CS640: Introduction to Computer Networks Aditya Akella Lecture 18 - The Web, Caching and CDNs.
Chapter 4. After completion of this chapter, you should be able to: Explain “what is the Internet? And how we connect to the Internet using an ISP. Explain.
CP476 Internet Computing Lecture 5 : HTTP, WWW and URL 1 Lecture 5. WWW, HTTP and URL Objective: to review the concepts of WWW to understand how HTTP works.
Rensselaer Polytechnic Institute Shivkumar Kalvanaraman, Biplab Sikdar 1 The Web: the http protocol http: hypertext transfer protocol Web’s application.
20-1 Last time □ NAT □ Application layer ♦ Intro ♦ Web / HTTP.
Week 11: Application Layer1 Web and HTTP First some jargon r Web page consists of objects r Object can be HTML file, JPEG image, Java applet, audio file,…
Introduction 1 Lecture 6 Application Layer (HTTP) slides are modified from J. Kurose & K. Ross University of Nevada – Reno Computer Science & Engineering.
2: Application Layer1 Web and HTTP First some jargon Web page consists of base HTML-file which includes several referenced objects Object can be HTML file,
1 Computer Communication & Networks Lecture 28 Application Layer: HTTP & WWW p Waleed Ejaz
Protocol(TCP/IP, HTTP) 송준화 조경민 2001/03/13. Network Computing Lab.2 Layering of TCP/IP-based protocols.
2: Application Layer1 Chapter 2 outline r 2.1 Principles of app layer protocols r 2.2 Web and HTTP r 2.3 FTP r 2.4 Electronic Mail r 2.5 DNS r 2.6 Socket.
Sockets process sends/receives messages to/from its socket
Hui Zhang, Fall Computer Networking Web, HTTP, Caching.
Data Communications and Computer Networks Chapter 2 CS 3830 Lecture 8 Omar Meqdadi Department of Computer Science and Software Engineering University of.
CIS679: Lecture 13 r Review of Last Lecture r More on HTTP.
2: Application Layer 1 Chapter 2: Application layer r 2.1 Principles of network applications  app architectures  app requirements r 2.2 Web and HTTP.
1 COMP 431 Internet Services & Protocols HTTP Persistence & Web Caching Jasleen Kaur February 11, 2016.
Data Communications and Computer Networks Chapter 2 CS 3830 Lecture 7 Omar Meqdadi Department of Computer Science and Software Engineering University of.
EEC-484/584 Computer Networks Lecture 4 Wenbing Zhao (Part of the slides are based on Drs. Kurose & Ross ’ s slides for their Computer.
09/13/04 CDA 6506 Network Architecture and Client/Server Computing Peer-to-Peer Computing and Content Distribution Networks by Zornitza Genova Prodanoff.
HyperText Transfer Protocol (HTTP) Deepti Kulkarni CISC 856: TCP/IP and Upper Layer Protocols Fall 2008 Acknowledgements Professor Amer Richi Gupta.
Content Distribution Networks (CDNs)
Week 11: Application Layer 1 Web and HTTP r Web page consists of objects r Object can be HTML file, JPEG image, Java applet, audio file,… r Web page consists.
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
© Janice Regan, CMPT 128, Jan 2007 CMPT 371 Data Communications and Networking HTTP 0.
Lecture 5 Internet Core: Protocol layers. Application Layer  We will learn about protocols by examining popular application-level protocols  HTTP 
2: Application Layer 1 Chapter 2 Application Layer These ppt slides are originally from the Kurose and Ross’s book. But some slides are deleted and added.
Content Distribution Networks
Block 5: An application layer protocol: HTTP
How HTTP Works Made by Manish Kushwaha.
HTTP request message: general format
Caching Temporary storage of frequently accessed data (duplicating original data stored somewhere else) Reduces access time/latency for clients Reduces.
Internet transport protocols services
Web Caching? Web Caching:.
Utilization of Azure CDN for the large file distribution
Internet Applications
ECE 671 – Lecture 16 Content Distribution Networks
Computer Communication & Networks
Lecture 6 – Web Optimizations
CSE 461 HTTP and the Web.
EE 122: HyperText Transfer Protocol (HTTP)
Hypertext Transfer Protocol (HTTP)
CSCI-351 Data communication and Networks
Presentation transcript:

The Web and Content Distribution Networks Nick Feamster CS 6250 Fall 2011 Blast through HTTP review slides on the projector. Swap back to chalkboard for new stuff. (some notes from David Andersen and Christian Kauffman)

Outline HTTP review Persistent HTTP review HTTP caching Content distribution networks

HTTP Basics HTTP layered over bidirectional byte stream Interaction Almost always TCP Interaction Client sends request to server, followed by response from server to client Requests/responses are encoded in text Stateless Server maintains no information about past client requests

How to Mark End of Message? Size of message  Content-Length Must know size of transfer in advance Delimiter  MIME-style Content-Type Server must “escape” delimiter in content Close connection Only server can do this

HTTP Request Request line Method URL (relative) HTTP version GET – return URI HEAD – return headers only of GET response POST – send data to the server (forms, etc.) URL (relative) E.g., /index.html HTTP version

HTTP Request (cont.) Request headers Blank-line Body Authorization – authentication info Acceptable document types/encodings From – user email If-Modified-Since Referrer – what caused this page to be requested User-Agent – client software Blank-line Body

HTTP Request (review)

HTTP Request Example GET / HTTP/1.1 Accept: */* Accept-Language: en-us Accept-Encoding: gzip, deflate User-Agent: Mozilla/5.0 Host: www.gtnoise.net Connection: Keep-Alive

HTTP Response Status-line HTTP version 3 digit response code 1XX – informational 2XX – success 200 OK 3XX – redirection 301 Moved Permanently 303 Moved Temporarily 304 Not Modified 4XX – client error 404 Not Found 5XX – server error 505 HTTP Version Not Supported Reason phrase

HTTP Response Headers Blank-line Body Location – for redirection Server – server software WWW-Authenticate – request for authentication Allow – list of methods supported (get, head, etc) Content-Encoding – E.g x-gzip Content-Length Content-Type Expires Last-Modified Blank-line Body

HTTP Response Example % curl -D - -o /dev/null http://gtnoise.net HTTP/1.1 200 OK Date: Tue, 25 Oct 2011 13:50:28 GMT Server: Apache/2.2.16 (Debian) X-Powered-By: PHP/5.3.3-7+squeeze1 Set-Cookie: a1c6441c5610e3424571cceee6bfd0ef=q3a5oq1e47ncpkn97mqbo47qm0; path=/ P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" Set-Cookie: ja_purity_tpl=ja_purity; expires=Sun, 14-Oct-2012 13:50:28 GMT; path=/ Expires: Mon, 1 Jan 2001 00:00:00 GMT Last-Modified: Tue, 25 Oct 2011 13:50:28 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache Vary: Accept-Encoding Transfer-Encoding: chunked Content-Type: text/html; charset=utf-8

Outline HTTP intro and details Persistent HTTP HTTP caching Content distribution networks

HTTP 0.9/1.0 One request/response per TCP connection Disadvantages Simple to implement Disadvantages Multiple connection setups  three-way handshake each time Several extra round trips added to transfer Multiple slow starts Now that you’ve seen TCP in the previous lectures, the overhead of one TCP connection per object is more clear.

Single Transfer Example Client Server 0 RTT SYN Client opens TCP connection SYN 1 RTT ACK DAT Client sends HTTP request for HTML Server reads from disk ACK DAT FIN 2 RTT ACK Client parses HTML Client opens TCP connection FIN ACK SYN SYN 3 RTT ACK Client sends HTTP request for image DAT Server reads from disk ACK 4 RTT DAT Image begins to arrive Lecture 19: 2006-11-02

-- Things to think about -- More Problems Short transfers are hard on TCP Stuck in slow start Loss recovery is poor when windows are small Lots of extra connections Increases server state/processing Server also forced to keep TIME_WAIT connection state -- Things to think about -- Why must server keep these? Tends to be an order of magnitude greater than # of active connections, why?

Persistent Connection Solution Multiplex multiple transfers onto one TCP connection How to identify requests/responses Delimiter  Server must examine response for delimiter string Content-length and delimiter  Must know size of transfer in advance Block-based transmission  send in multiple length delimited blocks Store-and-forward  wait for entire response and then use content-length Solution  use existing methods and close connection otherwise

Persistent Connection Example Client Server 0 RTT DAT Client sends HTTP request for HTML Server reads from disk ACK DAT 1 RTT ACK Client parses HTML Client sends HTTP request for image DAT Server reads from disk ACK DAT 2 RTT Image begins to arrive Lecture 19: 2006-11-02

Persistent HTTP Nonpersistent HTTP issues: Requires 2 RTTs per object OS must work and allocate host resources for each TCP connection But browsers often open parallel TCP connections to fetch referenced objects Persistent HTTP Server leaves connection open after sending response Subsequent HTTP messages between same client/server are sent over connection Persistent without pipelining: Client issues new request only when previous response has been received One RTT for each referenced object Persistent with pipelining: Default in HTTP/1.1 Client sends requests as soon as it encounters a referenced object As little as one RTT for all the referenced objects

Outline HTTP intro and details Persistent HTTP HTTP caching Content distribution networks

HTTP Caching Clients often cache documents Challenge: update of documents If-Modified-Since requests to check HTTP 0.9/1.0 used just date HTTP 1.1 has an opaque “entity tag” (could be a file signature, etc.) as well When/how often should the original be checked for changes? Check every time? Check each session? Day? Etc? Use Expires header If no Expires, often use Last-Modified as estimate Ask: What’s the problem with just using date for IMS queries? - Web servers may generate content per-client Examples: CGI scripts, things like google customized home, vary with cookies, client IP, referrer, accept-encoding, etc.

Example Cache Check Request GET / HTTP/1.1 Accept: */* Accept-Language: en-us Accept-Encoding: gzip, deflate If-Modified-Since: Mon, Oct 24 2011 17:54:18 GMT If-None-Match: "7a11f-10ed-3a75ae4a" User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Host: www.gtnoise.net Connection: Keep-Alive

Example Cache Check Response HTTP/1.1 304 Not Modified Date: Mon, 24 Oct 2011 03:50:51 GMT Server: Server: Apache/2.2.16 (Debian) Connection: Keep-Alive Keep-Alive: timeout=15, max=100 ETag: "7a11f-10ed-3a75ae4a”

Ways to cache Client-directed caching: Web Proxies Server-directed caching: Content Delivery Networks (CDNs)

Web Proxy Caches User configures browser: Web accesses via cache Browser sends all HTTP requests to cache Object in cache: cache returns object Else cache requests object from origin server, then returns object to client origin server Proxy server HTTP request HTTP request client HTTP response HTTP response HTTP request HTTP response client origin server

Caching Example (1) Assumptions Average object size = 100,000 bits Avg. request rate from institution’s browser to origin servers = 15/sec Delay from institutional router to any origin server and back to router = 2 sec Consequences Utilization on LAN = 15% Utilization on access link = 100% Total delay = Internet delay + access delay + LAN delay = 2 sec + minutes + milliseconds origin servers public Internet 1.5 Mbps access link institutional network 10 Mbps LAN

Caching Example (2) Install cache origin Consequence servers Suppose hit rate is .4 Consequence 40% requests will be satisfied almost immediately (say 10 msec) 60% requests satisfied by origin server Utilization of access link reduced to 60%, resulting in negligible delays Weighted average of delays = .6*2 sec + .4*10msecs < 1.3 secs origin servers public Internet 1.5 Mbps access link institutional network 10 Mbps LAN institutional cache

Problems Over 50% of all HTTP objects are uncacheable – why? Not easily solvable Dynamic data  stock prices, scores, web cams CGI scripts  results based on passed parameters Obvious fixes SSL  encrypted data is not cacheable Most web clients don’t handle mixed pages well many generic objects transferred with SSL Cookies  results may be based on past data Hit metering  owner wants to measure # of hits for revenue, etc. What will be the end result?

Outline HTTP intro and details Persistent HTTP HTTP caching Content distribution networks

Content Distribution Networks (CDNs) origin server in North America The content providers are the CDN customers. Content replication CDN company installs hundreds of CDN servers throughout Internet Close to users CDN replicates its customers’ content in CDN servers. When provider updates content, CDN updates servers CDN distribution node CDN server in S. America CDN server in Asia CDN server in Europe

Content Distribution Networks & Server Selection Replicate content on many servers Challenges How to replicate content Where to replicate content How to find replicated content How to choose among know replicas How to direct clients towards replica

Server Selection Which server? Lowest load  to balance load on servers Best performance  to improve client performance Based on Geography? RTT? Throughput? Load? Any alive node  to provide fault tolerance How to direct clients to a particular server? As part of routing  anycast, cluster load balancing Not covered As part of application  HTTP redirect As part of naming  DNS Ask: What things might you want to pick the server using? How do you direct clients to a server? Anyone know how Akamai works?

Application-Based Content Routing HTTP supports simple way to indicate that Web page has moved (30X responses) Server receives GET request from client Decides which server is best suited for particular client and object Returns HTTP redirect to that server Can make informed application specific decision May introduce additional overhead  multiple connection setup, name lookups, etc. OK solution in general, but… HTTP Redirect has some flaws – especially with current browsers Incurs many delays, which operators may really care about Draw diagram: Client ---> Server <---redir---- -------> Server 2 <----answer----

Naming-Based Content Routing Client does DNS name lookup for service Name server chooses appropriate server address A-record returned is “best” one for the client What information can name server base decision on? Server load/location  must be collected Information in the name lookup request Name service client  typically the local name server for client

Other Findings Each CDN performed best for at least one (NIMI) client Why? Because of proximity? The best origin sites were better than the worst CDNs CDNs with more servers don’t necessarily perform better Note that they don’t know load on servers… HTTP 1.1 improvements (parallel download, pipelined download) help a lot Even more so for origin (non-CDN) cases Note not all origin sites implement pipelining

What is a Content Distribution Network? The RFCs and Internet Drafts define a Content Distribution Network, “CDN”, as: Content Delivery Network or Content Distribution Network. A type of CONTENT NETWORK in which the CONTENT NETWORK ELEMENTS are arranged for more effective delivery of CONTENT to CLIENTS.

What is a Content Distribution Network? A CDN is an overlay network, designed to deliver content from the optimal location In many cases, optimal does not mean geographically closest CDNs are made of distinct, geographically disparate groups of servers, with each group able to serve all content on the CDN Servers may be separated by type E.g. One group may serve Windows Streaming Media, another group may serve HTTP Servers are not typically shared between media types

What is a Content Distribution Network Some CDNs are network-owned (Level 3, Limelight, AT&T), some are not (Akamai, Mirror Image, CacheFly, Panther Express) Network-owned CDNs have all / most of their servers in their own ASN Non-network CDNs can place servers directly in other ASNs This means things like NetFlow will not be useful for determining traffic to/from non-network CDNs

The Akamai System 85,000+ Servers 1700+ POPs 72+ Countries The Akamai EdgePlatform: 85,000+ Servers 1700+ POPs 72+ Countries 950+ Networks 660+ Cities Resulting in traffic of: 5.4 petabytes / day 790+ billion hits / day 436+ million unique clients IPs / day (circa April 2011)

How CDNs Work When content is requested from a CDN, the user is directed to the optimal server This is usually done through the DNS, especially for non-network CDNs It can be done though anycasting for network owned CDNs Users who query DNS-based CDNs be returned different A records for the same hostname This is called “mapping” The better the mapping, the better the CDN

How CDNs Work Example of CDN mapping Notice the different A records for different locations: [NYC]% host www.symantec.com www.symantec.com CNAME a568.d.akamai.net a568.d.akamai.net A 207.40.194.46 a568.d.akamai.net A 207.40.194.49 [Boston]% host www.symantec.com a568.d.akamai.net A 81.23.243.152 a568.d.akamai.net A 81.23.243.145

CDN Mapping: Another Example From Atlanta % ping youtube.com PING youtube.com (74.125.45.93) 56(84) bytes of data. 64 bytes from yx-in-f93.1e100.net (74.125.45.93): icmp_req=1 ttl=54 time=1.81 ms From Boston % ping youtube.com PING youtube.com (74.125.225.73) 56(84) bytes of data. 64 bytes from ord08s07-in-f9.1e100.net (74.125.225.73): icmp_seq=1 ttl=52 time=22.8 ms

How CDNs Work CDNs use multiple criteria to choose the optimal server These include standard network metrics: Latency Throughput Packet loss These also include things like CPU load on the server, HD space, network utilization, etc. Geography still counts That whole speed-of-light thing Should be able to solve that with the next version of ethernet…

Why CDNs Peer with ISPs The first and foremost reason to peer is improved performance Since a CDN tries to serve content as “close” to the end user as possible, peering directly with networks (over non-congested links) obviously helps Peering gives better throughput Removing intermediate AS hops seems to give higher peak traffic for same demand profile Might be due to lower latency opening TCP windows faster Might be due to lower packet loss

Why CDNs Peer with ISPs Redundancy Burstability Having more possible vectors to deliver content increases reliability Burstability During large events, having direct connectivity to multiple networks allows for higher burstability than a single connection to a transit provider Burstability is important to CDNs One of the reasons customers use CDNs is for burstability

Why CDNs Peer with ISPs Peering reduces costs Network Intelligence Reduces transit bill (duh) Network Intelligence Receiving BGP directly from multiple ASes helps CDNs map the Internet Backup for on-net servers If there are servers on-net, the IX can act as a backup during downtime and overflow Allows serving different content types

Why ISPs peer with CDNs Performance Cost Reduction CDNs and ISPs are in the same business, just on different sides - we both want to serve end users as quickly and reliably as possible You know more about your network than any CDN ever will, so working with the CDN directly can help them deliver the content more quickly and reliably Cost Reduction Transit savings Possible backbone savings

How Non-Network CDNs use IXes Peer Network Non-network CDNs do not have a backbone, so each IX instance is independent The CDN uses transit to pull content into the servers Content is then served to peers over the IX IX Content CDN Servers Transit Origin Server

How CDNs use IXes Non-network CDNs usually do not announce large blocks of address space because no one location has a large number of servers It is not uncommon to see a single /24 from a CDN at an IX This does not mean you will not see a lot of traffic How many web servers does it take to fill a gigabit these days?

How well do CDNs work? Hosting Center Hosting Center OS Backbone ISP CS CS CS IX IX Recall that the bottleneck links are at the edges. Even if CSs are pushed towards the edge, they are still behind the bottleneck link! Site ISP CS ISP ISP CS S S S Sites S S S S S C C C

Reduced latency can improve TCP performance DNS round trip TCP handshake (2 round trips) Slow-start ~8 round trips to fill DSL pipe total 128K bytes Compare to 56 Kbytes for cnn.com home page Download finished before slow-start completes Total 11 round trips Coast-to-coast propagation delay is about 15 ms Measured RTT last night was 50ms No difference between west coast and Cornell! 30 ms improvement in RTT means 330 ms total improvement Certainly noticeable

Lets look at a study Zhang, Krishnamurthy and Wills AT&T Labs Traces taken in Sept. 2000 and Jan. 2001 Compared CDNs with each other Compared CDNs against non-CDN

Methodology Selected a bunch of CDNs Akamai, Speedera, Digital Island Note, most of these gone now! Selected a number of non-CDN sites for which good performance could be expected U.S. and international origin U.S.: Amazon, Bloomberg, CNN, ESPN, MTV, NASA, Playboy, Sony, Yahoo Selected a set of images of comparable size for each CDN and non-CDN site Compare apples to apples Downloaded images from 24 NIMI machines

Response Time Results (II) Including DNS Lookup Time About one second Cumulative Probability Author conclusion: CDNs generally provide much shorter download time.

HTTP (Summary) Simple text-based file exchange protocol Workloads Support for status/error responses, authentication, client-side state maintenance, cache maintenance Workloads Typical documents structure, popularity Server workload Interactions with TCP Connection setup, reliability, state maintenance Persistent connections How to improve performance Caching Replication