Computer Networking, Lent Term M/W/F 11-midday, LT1 in Gates Building. Slide Set 7 (Topic 7). Andrew W. Moore, andrew.moore@cl.cam.ac.uk, 2015-2016.

Topic 7: Datacenters

What we will cover Characteristics of a datacenter environment goals, constraints, workloads, etc. How and why DC networks are different (vs. WAN) e.g., latency, geo, autonomy, … How traditional solutions fare in this environment e.g., IP, Ethernet, TCP, ARP, DHCP Not details of how datacenter networks operate

Disclaimer Material is emerging (not established) wisdom Material is incomplete many details on how and why datacenter networks operate aren’t public

Why Datacenters? Your <public-life, private-life, banks, government> live in my datacenter. Security, Privacy, Control, Cost, Energy, (breaking) received wisdom; all this and more come together into sharp focus in datacenters. Do I need to labor the point? Please identify your research field; why is it a strategic and growth area for research in the UK and beyond?

What goes into a datacenter (network)? Servers organized in racks

What goes into a datacenter (network)? Servers organized in racks Each rack has a `Top of Rack’ (ToR) switch

What goes into a datacenter (network)? Servers organized in racks Each rack has a `Top of Rack’ (ToR) switch An `aggregation fabric’ interconnects ToR switches

What goes into a datacenter (network)? Servers organized in racks Each rack has a `Top of Rack’ (ToR) switch An `aggregation fabric’ interconnects ToR switches Connected to the outside via `core’ switches note: blurry line between aggregation and core With network redundancy of ~2x for robustness

Example 1: Brocade reference design

Example 2: Cisco reference design [figure: Internet at the top, core routers (CR), aggregation routers (AR), switches (S), and racks of servers (A), with ~40-80 servers/rack]

Observations on DC architecture Regular, well-defined arrangement Hierarchical structure with rack/aggr/core layers Mostly homogeneous within a layer Supports communication between servers and between servers and the external world Contrast: ad-hoc structure, heterogeneity of WANs

What’s new?

SCALE!

How big exactly? 1M servers [Microsoft] (less than Google, more than Amazon); > $1B to build one site [Facebook]; > $20M/month/site operational costs [Microsoft '09]. But only O(10-100) sites.

What's new? Scale. Service model: user-facing, revenue-generating services; multi-tenancy; jargon: SaaS, PaaS, DaaS, IaaS, …

Implications. Scale: need scalable solutions (duh); improving efficiency and lowering cost is critical; `scale out' solutions w/ commodity technologies. Service model: performance means $$; virtualization for isolation and portability.

Multi-Tier Applications Applications decomposed into tasks Many separate components Running in parallel on different machines

Componentization leads to different types of network traffic “North-South traffic” Traffic between external clients and the datacenter Handled by front-end (web) servers, mid-tier application servers, and back-end databases Traffic patterns fairly stable, though diurnal variations

North-South Traffic [figure: user requests from the Internet ("You Live Here") pass through a router and front-end proxy to web servers, data caches, and databases]

Componentization leads to different types of network traffic “North-South traffic” Traffic between external clients and the datacenter Handled by front-end (web) servers, mid-tier application servers, and back-end databases Traffic patterns fairly stable, though diurnal variations “East-West traffic” Traffic between machines in the datacenter Comm within “big data” computations (e.g. Map Reduce) Traffic may shift on small timescales (e.g., minutes)

East-West Traffic [figure: distributed storage feeding map tasks, which feed reduce tasks]

East-West Traffic [figure: east-west flows crossing the CR/AR/S hierarchy between racks]

East-West Traffic [figure: distributed storage feeding map tasks, map tasks feeding reduce tasks, reduce tasks writing back to distributed storage] Reads from distributed storage into map tasks often don't cross the network; some fraction (typically 2/3) of the map-to-reduce traffic crosses the network; writes from reduce tasks back to distributed storage always go over the network.

What’s different about DC networks? Characteristics Huge scale: ~20,000 switches/routers contrast: AT&T ~500 routers

What’s different about DC networks? Characteristics Huge scale: Limited geographic scope: High bandwidth: 10/40/100G Contrast: Cable/aDSL/WiFi Very low RTT: 10s of microseconds Contrast: 100s of milliseconds in the WAN

What’s different about DC networks? Characteristics Huge scale Limited geographic scope Single administrative domain Can deviate from standards, invent your own, etc. “Green field” deployment is still feasible

What’s different about DC networks? Characteristics Huge scale Limited geographic scope Single administrative domain Control over one/both endpoints can change (say) addressing, congestion control, etc. can add mechanisms for security/policy/etc. at the endpoints (typically in the hypervisor)

What’s different about DC networks? Characteristics Huge scale Limited geographic scope Single administrative domain Control over one/both endpoints Control over the placement of traffic source/sink e.g., map-reduce scheduler chooses where tasks run alters traffic pattern (what traffic crosses which links)

What’s different about DC networks? Characteristics Huge scale Limited geographic scope Single administrative domain Control over one/both endpoints Control over the placement of traffic source/sink Regular/planned topologies (e.g., trees/fat-trees) Contrast: ad-hoc WAN topologies (dictated by real-world geography and facilities)

What’s different about DC networks? Characteristics Huge scale Limited geographic scope Single administrative domain Control over one/both endpoints Control over the placement of traffic source/sink Regular/planned topologies (e.g., trees/fat-trees) Limited heterogeneity link speeds, technologies, latencies, …

What’s different about DC networks? Goals Extreme bisection bandwidth requirements recall: all that east-west traffic target: any server can communicate at its full link speed problem: server’s access link is 10Gbps!

Full Bisection Bandwidth [figure: the tree topology again, with 10Gbps server links, O(40x10)Gbps needed at the aggregation layer and O(40x10x100)Gbps towards the core, ~40-80 servers/rack] Traditional tree topologies "scale up": full bisection bandwidth is expensive, so in practice tree topologies are "oversubscribed".
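
The figure's numbers give a feel for the oversubscription problem. Below is a back-of-envelope sketch in Python; the per-rack figures come from the slide (~40 servers with 10 Gbps access links), while the ToR uplink capacity is an assumed value purely for illustration.

    # Back-of-envelope oversubscription calculation (illustrative numbers only).
    servers_per_rack = 40          # from the slide: ~40-80 servers/rack
    access_gbps = 10               # each server's access link speed
    tor_uplink_gbps = 40           # assumed aggregate uplink capacity of the ToR switch

    demand_gbps = servers_per_rack * access_gbps        # 400 Gbps if every server sends flat out
    oversubscription = demand_gbps / tor_uplink_gbps    # 10:1 with these assumed numbers

    print(f"worst-case demand: {demand_gbps} Gbps, "
          f"oversubscription at the ToR: {oversubscription:.0f}:1")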

A "Scale Out" Design: build multi-stage `Fat Trees' out of k-port switches, with k/2 ports up and k/2 down. Supports k³/4 hosts (48 ports: 27,648 hosts). All links are the same speed (e.g. 10 Gbps).
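
A quick sketch of the fat-tree arithmetic from the slide (k/2 ports up, k/2 down, k³/4 hosts). The switch counts follow the standard three-stage fat-tree construction (k pods of k switches each, plus (k/2)² core switches) and are included as an aside.

    def fat_tree_size(k):
        """Sizes of a three-stage fat tree built from k-port switches (k even)."""
        hosts = k ** 3 // 4            # k pods x (k/2 edge switches) x (k/2 hosts each)
        edge = agg = k * (k // 2)      # edge and aggregation switches across all k pods
        core = (k // 2) ** 2
        return hosts, edge + agg + core

    hosts, switches = fat_tree_size(48)
    print(hosts, switches)             # 27648 hosts (as on the slide), 2880 switches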

Full Bisection Bandwidth Not Sufficient: to realize full bisection throughput, routing must spread traffic across paths. Enter load-balanced routing. How? (1) Let the network split traffic/flows at random (e.g., ECMP -- RFC 2991/2992). How? (2) Centralized flow scheduling? Many more research proposals.
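
Option (1) is how ECMP-style splitting is usually pictured: hash each flow's five-tuple so that all of a flow's packets take the same path (avoiding reordering within a connection), while different flows spread across the equal-cost paths. A toy sketch of the idea, not a real switch implementation; hardware uses its own hash functions.

    import hashlib

    def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
        """Pick one of several equal-cost next hops by hashing the flow's 5-tuple.
        Hashing (rather than a random choice per packet) keeps a flow on one path."""
        key = f"{src_ip}:{dst_ip}:{src_port}:{dst_port}:{proto}".encode()
        digest = hashlib.md5(key).digest()
        index = int.from_bytes(digest[:4], "big") % len(next_hops)
        return next_hops[index]

    paths = ["agg-switch-1", "agg-switch-2", "agg-switch-3", "agg-switch-4"]
    print(ecmp_next_hop("10.0.1.5", "10.0.7.9", 51344, 80, "tcp", paths))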

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements real money on the line current target: 1μs RTTs how? cut-through switches making a comeback reduces switching time

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements real money on the line current target: 1μs RTTs how? cut-through switches making a comeback how? avoid congestion reduces queuing delay

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements real money on the line current target: 1μs RTTs how? cut-through switches making a comeback (lec. 2!) how? avoid congestion how? fix TCP timers (e.g., default timeout is 500ms!) how? fix/replace TCP to more rapidly fill the pipe
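
To see why the timer fixes matter, a quick calculation under assumed numbers (10 µs RTT, 10 Gbps link, a 256 KB response): a single conservative retransmission timeout stalls the transfer for thousands of RTTs.

    # Why "fix TCP timers" matters: one conservative timeout dwarfs everything else.
    rtt_s = 10e-6                 # assumed datacenter RTT: 10 microseconds
    rto_s = 300e-3                # conservative minimum RTO (the slides mention 300-500 ms)
    link_bps = 10e9               # 10 Gbps access link
    response_bytes = 256 * 1024   # assumed size of one server's response

    transfer_s = response_bytes * 8 / link_bps
    print(f"transfer time : {transfer_s*1e6:.0f} us")        # ~210 us
    print(f"one RTO stall : {rto_s/rtt_s:.0f} RTTs, "
          f"{rto_s/transfer_s:.0f}x the transfer itself")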

An example problem at scale - INCAST: synchronized mice collide, caused by the Partition/Aggregate pattern. [figure: an aggregator fans a request out to Workers 1-4; their simultaneous responses collide at the aggregator's switch port; a dropped response leads to a TCP timeout, with RTOmin = 300 ms]

The Incast Workload [figure: a client issues a synchronized read through a switch to four storage servers; each returns one Server Request Unit (SRU); only once the whole data block (all four responses) has arrived does the client send the next batch of requests]

Incast Workload Overfills Buffers [figure: the four synchronized responses overflow the switch's buffer; responses 1-3 complete but response 4 is dropped, and the link then sits idle until response 4 is resent after a timeout]
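
A very rough model of why the synchronized responses overflow the buffer; both the SRU size and the buffer size are assumed values, and a real switch drains the buffer while the burst arrives, so this overstates the loss. It still shows the shape of the problem: the burst grows with the number of servers while the buffer does not.

    # Rough model of incast: N servers answer one request at the same instant.
    # If their combined burst exceeds the port's buffer, the tail is dropped and
    # the affected server sits out a full RTO before retransmitting.
    servers = 8
    sru_bytes = 64 * 1024          # assumed Server Request Unit size
    port_buffer_bytes = 128 * 1024 # assumed shallow output buffer at the switch port

    burst = servers * sru_bytes
    dropped = max(0, burst - port_buffer_bytes)
    print(f"burst {burst//1024} KB into a {port_buffer_bytes//1024} KB buffer "
          f"-> roughly {dropped//1024} KB dropped; the client idles until the RTO fires")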

Queue Buildup [figure: two senders share a bottleneck link to one receiver] Big flows build up queues, increasing latency for short flows. Measurements in a Bing cluster: for 90% of packets, RTT < 1 ms; for 10% of packets, 1 ms < RTT < 15 ms.
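
A one-line calculation, with an assumed backlog size, of the extra delay a short flow sees when it lands behind a big flow's standing queue on a 10 Gbps port; a few megabytes of backlog is enough to explain the multi-millisecond RTT tail above.

    # Extra delay a short flow sees when queued behind a big flow's backlog.
    queue_bytes = 4 * 1024 * 1024   # assumed standing queue built by a long flow
    link_bps = 10e9                 # 10 Gbps egress link
    extra_delay_ms = queue_bytes * 8 / link_bps * 1e3
    print(f"{extra_delay_ms:.1f} ms of queueing delay")   # ~3.4 ms, vs a sub-ms base RTT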

Link-Layer Flow Control. Another idea to reduce incast is to employ link-layer flow control: it is common between switches, but here it is flow control to the end host too. Recall: the data-link layer can use specially coded symbols in the line coding to say "Stop" and "Start".

Link-Layer Flow Control, the dark side: Head-of-Line (HOL) blocking. Such HOL blocking does not even differentiate between processes, so it can occur between competing processes on a pair of machines (no datacenter required): traffic waiting for no good reason.

Link-Layer Flow Control: but it's worse than you imagine. Double down on trouble: did I mention this is the link layer? That means no (IP) control traffic and no routing messages get through either; a whole system can end up waiting for one machine. Incast is very unpleasant. The impact of HOL blocking under link-layer flow control can be reduced through priority queues and overtaking, as sketched below.
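
A minimal sketch of the "priority queues and overtaking" idea: a two-class output port where high-priority frames are served ahead of a paused or backlogged low-priority class, so control and latency-sensitive traffic is not stuck behind bulk data. The class names and per-class pause flag are illustrative, not any particular switch's API.

    from collections import deque

    class PriorityPort:
        """Toy two-class output port: class 0 (high) always overtakes class 1 (low),
        so latency-sensitive frames are not stuck behind a paused bulk class."""
        def __init__(self):
            self.queues = [deque(), deque()]
            self.paused = [False, False]      # per-class PAUSE state from the neighbour

        def enqueue(self, frame, prio):
            self.queues[prio].append(frame)

        def dequeue(self):
            for prio, q in enumerate(self.queues):
                if q and not self.paused[prio]:
                    return q.popleft()        # high priority served first ("overtaking")
            return None                       # everything is empty or paused

    port = PriorityPort()
    port.enqueue("bulk-data", 1)
    port.enqueue("ack", 0)
    port.paused[1] = True                     # the low-priority class has been paused
    print(port.dequeue())                     # -> "ack": not blocked behind the bulk frame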

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements Predictable, deterministic performance “your packet will reach in Xms, or not at all” “your VM will always see at least YGbps throughput” Resurrecting `best effort’ vs. `Quality of Service’ debates How is still an open question

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements Predictable, deterministic performance Differentiating between tenants is key e.g., “No traffic between VMs of tenant A and tenant B” “Tenant X cannot consume more than XGbps” “Tenant Y’s traffic is low priority”
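
Caps like "Tenant X cannot consume more than X Gbps" are commonly enforced with something like a token bucket in the hypervisor's virtual switch. A minimal sketch, with an assumed rate and burst size:

    import time

    class TokenBucket:
        """Minimal token-bucket rate limiter: a tenant's packet is allowed only if
        enough byte allowance has accumulated at the configured rate."""
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0            # bytes of allowance per second
            self.burst = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def allow(self, packet_bytes):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_bytes:
                self.tokens -= packet_bytes
                return True
            return False                          # over the cap: drop or queue the packet

    limiter = TokenBucket(rate_bps=1e9, burst_bytes=150_000)   # cap a tenant at ~1 Gbps
    print(limiter.allow(1500))                                 # one MTU-sized packet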

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements Predictable, deterministic performance Differentiating between tenants is key Scalability (of course) Q: How’s that Ethernet spanning tree looking?

What’s different about DC networks? Goals Extreme bisection bandwidth requirements Extreme latency requirements Predictable, deterministic performance Differentiating between tenants is key Scalability (of course) Cost/efficiency focus on commodity solutions, ease of management some debate over the importance in the network case

Summary new characteristics and goals some liberating, some constraining scalability is the baseline requirement more emphasis on performance less emphasis on heterogeneity less emphasis on interoperability -- incast? -- one big switch abstraction…

Computer Networking UROP: Assessed Practicals for Computer Networking, so supervisors can set/use the work and so we can have a Computer Networking tick running over summer 2016. Talk to me. Part 2 projects for 16-17: fancy doing something at scale or speed?