Lecture 9, Computer Networks (198:552)

Data Centers Part III
Lecture 9, Computer Networks (198:552), Fall 2019
Slide material heavily adapted courtesy of Albert Greenberg, Changhoon Kim, and Mohammad Alizadeh.

Data center costs

Amortized cost*   Component              Sub-components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

*3-year amortization for servers, 15-year for infrastructure; 5% cost of money.
Source: http://web.cse.ohio-state.edu/icdcs2009/Keynote_files/greenberg-keynote.pdf (The Cost of a Cloud: Research Problems in Data Center Networks. SIGCOMM CCR 2009. Greenberg, Hamilton, Maltz, Patel.)

Goal: Agility -- any service, any server
- Turn the servers into a single large fungible pool
- Dynamically expand and contract service footprint as needed
- Place workloads where server resources are available
- Easier to maintain availability: if one rack goes down, machines from another are still available
- Want to view the DCN as a pool of compute connected by one big high-speed fabric

How the network can support agility
- Need the flexibility and low configuration burden of an L2 network, without the scaling concerns and single-path routing restriction
- Need massive bisection bandwidth: limit oversubscription at higher layers of the tree topology

Building a high-speed switching fabric

A single (n X m)-port switching fabric
- Different designs of the switching fabric are possible
- Assume n ingress ports and m egress ports, half-duplex links

A single (n X m)-port switching fabric
- Each crossover may be electrical, mechanical, or electronic
- We are OK with any design such that any port can connect directly to any other if all other ports are free
- Nonblocking: if input port x and output port y are both free, they should be able to connect, regardless of which other ports are connected
- If this is not satisfied, the switch is blocking

High port density + nonblocking == hard!
- Low-cost nonblocking crossbars are feasible for a small number of ports
- However, it is costly to be nonblocking with a large number of ports
- If each crossover is as fast as each input port, the number of crossover points is n * m, so cost grows quadratically with the number of ports
- Otherwise, the crossover must switch faster than the port rate, so that the number of crossovers can be kept small
- Q: How is this relevant to the data center network fabric?
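For concreteness, assume the crossover rate matches the port rate: a 64 X 64 crossbar needs 64 * 64 = 4,096 crosspoints, a 1,024-port crossbar needs about 1.05 million, and a 4,096-port crossbar needs roughly 16.8 million, so quadrupling the port count multiplies the crosspoint count by sixteen.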

Nonblocking switches with many ports Key principle: Every fast nonblocking switch with a large number of ports is built out of many fast nonblocking switches with a small number of ports. How to build large nonblocking switches? The subject of interconnection networks from the telephony era

3-stage Clos network (r*n X r*n ports)
(Figure: an ingress stage of r switches, each n X m; a middle stage of m switches, each r X r; and an egress stage of r switches, each m X n. Every ingress switch has one link to every middle switch, and every middle switch has one link to every egress switch.)
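As a rough back-of-the-envelope illustration (not from the lecture), the Python sketch below compares the crosspoint count of a single N X N crossbar with that of a strict-sense nonblocking 3-stage Clos built from small crossbars, choosing n = r = sqrt(N) and m = 2n - 1:

import math

def crossbar_crosspoints(ports):
    # A ports x ports crossbar has one crosspoint per (input, output) pair.
    return ports * ports

def clos_crosspoints(ports):
    # 3-stage Clos: r ingress and r egress switches of size n x m, plus
    # m middle switches of size r x r. Here n = r = sqrt(ports) and
    # m = 2n - 1 (strict-sense nonblocking).
    n = r = math.isqrt(ports)
    m = 2 * n - 1
    return 2 * (r * n * m) + m * r * r   # ingress + egress + middle

for ports in (64, 256, 1024, 4096):
    print(ports, crossbar_crosspoints(ports), clos_crosspoints(ports))

The crossbar grows as N^2 while this Clos grows roughly as 6 * N^1.5, which is why large switching fabrics are composed from many small crossbars.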

How Clos networks become nonblocking
If m > 2n - 2, then the Clos network is strict-sense nonblocking. That is, any new demand between any pair of free (input, output) ports can be satisfied without re-routing any of the existing demands. How?

Need at most (n-1) + (n-1) middle-stage switches to route around
(Figure: the same 3-stage Clos. At the new demand's ingress switch, at most n-1 of the other inputs are already busy, so at most n-1 middle switches are occupied from its side; likewise, at most n-1 middle switches are occupied from the egress switch's side. Since m > 2n - 2 = (n-1) + (n-1), at least one middle switch is free toward both the ingress and the egress switch, so the new demand can be routed without moving any existing one.)

Surprising result about Clos networks
If m >= n, then the Clos network is rearrangeably nonblocking. That is, any new demand between any pair of free (input, output) ports can be satisfied by suitably re-routing existing demands. It is easy to see that m >= n is necessary; the surprising part is that m >= n is sufficient.
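A minimal sketch (a hypothetical helper, not from the lecture) that encodes the two conditions above for a 3-stage Clos with r ingress/egress switches of n inputs each and m middle switches:

def classify_clos(n, m):
    # n: inputs per ingress switch, m: number of middle-stage switches.
    if m >= 2 * n - 1:        # equivalently m > 2n - 2
        return "strict-sense nonblocking"
    if m >= n:
        return "rearrangeably nonblocking"
    return "blocking"

print(classify_clos(n=4, m=7))   # strict-sense nonblocking
print(classify_clos(n=4, m=4))   # rearrangeably nonblocking
print(classify_clos(n=4, m=3))   # blocking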

Rearrangeably nonblocking Clos built with identical switches
(Figure: the same 3-stage structure with m = n and r = n, so every stage consists of n switches of size n X n.)

Modern data center network topologies are just folded Clos topologies.

How does one design a Clos DCN?
- Switches are usually n X n with full-duplex links
- Fold the 3-stage Clos into two stages
- Share physical resources between the ingress and egress stages
- Share ports and links across the two "sides" of the middle stage

(Figure: the folded Clos. A ToR/aggregation layer of n X n full-duplex switches faces the hosts, each using n/2 ports down to hosts and n/2 ports up; a spine layer of n/2 full-duplex n X n switches interconnects them. The ingress and egress stages of the unfolded Clos now share the same physical switches, ports, and links.)

Consequences of using folded Clos
- A 2-stage, high-throughput data center topology
- All layers can use the same switches (same port density and link rates)!
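A quick sizing sketch under the assumptions above (identical n-port full-duplex switches, half of each leaf's ports facing hosts and half facing the spine; the function name is illustrative):

def folded_clos_size(n):
    # n-port switches; each leaf uses n/2 ports down (hosts) and n/2 up (spine).
    spines = n // 2              # one uplink from every leaf to every spine
    leaves = n                   # each n-port spine switch can reach n leaves
    hosts = leaves * (n // 2)    # n/2 hosts per leaf
    return leaves, spines, hosts

for n in (16, 32, 64):
    leaves, spines, hosts = folded_clos_size(n)
    print(f"{n}-port switches: {leaves} leaves, {spines} spines, {hosts} hosts")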

What about routing?
We said that the Clos topology above is rearrangeably nonblocking. So, how do we rearrange existing demands when a new packet arrives, so that it can get across as quickly as possible? How do we do it without "interference" with (i.e., rerouting of) other packets?
VL2: we don't need to rearrange anything.

Valiant Load Balancing (VLB)
- Designed to move data quickly for shuffling in parallel computing
- Setting: connectivity is sparse ("hypercube" topology), with log n links per node in a network of n nodes
- Key idea: the source first redirects each message to a randomly chosen node, then the message follows the shortest path from that node to the destination
- Guarantee: with high probability, the message reaches its destination very quickly (log n steps)
- Practically, this means there is very little queueing in the network
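A minimal sketch of the two-phase idea (the shortest_path helper and node list are placeholders, not part of VLB itself):

import random

def vlb_route(src, dst, nodes, shortest_path):
    # Phase 1: send the message to a uniformly random intermediate node.
    mid = random.choice(nodes)
    # Phase 2: forward from the intermediate node to the real destination.
    # On a hypercube, each leg is at most log2(n) hops.
    first_leg = shortest_path(src, mid)
    second_leg = shortest_path(mid, dst)
    return first_leg[:-1] + second_leg   # drop the duplicated 'mid' hop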

VLB in data center networks
- VLB is more general than data center networks or Clos: it is a form of oblivious routing
- No need to measure DCN traffic patterns before deciding on routing
- Extremely simple to implement: no global state
- VLB is very handy in folded Clos topologies because there are many options for the first hop from the ToR switch: balance load across the many paths
- Very beneficial in practice: performance isolation (other flows don't matter) and high capacity ("bisection bandwidth") between any two ToR ports

Is ToR-port to ToR-port high capacity enough? VLB + Clos provides high capacity if no ToR port demands more than its bandwidth. But is that enough?

Example: web search
(Figure: a partition/aggregate tree with a top-level aggregator (TLA), mid-level aggregators (MLAs), and many worker nodes; deadlines tighten down the tree: 250 ms, 50 ms, 10 ms.)
A bunch of results arrive at the MLAs and TLA in a short period, so demand may exceed the ToR port bandwidth! This follows the partition/aggregate application structure; it is described here in the context of search, but it is actually quite general -- "the foundation of many large-scale web applications."

Hose traffic model
- The guarantees of VLB + Clos hold only under the hose model: the demand on any one ToR port (send or receive) must not exceed its bandwidth
- This is very hard to enforce, especially on the receiver side, without sender-side rate limits
- VL2 uses TCP convergence as a way of ensuring that aggregate ToR port demand stays within its bandwidth
- Q: do you see any problems?
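A small sketch of what "hose-model compliant" means for a ToR-port traffic matrix (purely illustrative; VL2 itself relies on TCP rather than an explicit check like this):

def satisfies_hose_model(traffic_matrix, port_bandwidth):
    # traffic_matrix[i][j] is the demand from ToR port i to ToR port j.
    ports = len(traffic_matrix)
    for i in range(ports):
        sent = sum(traffic_matrix[i])                                # total port i sends
        received = sum(traffic_matrix[j][i] for j in range(ports))   # total port i receives
        if sent > port_bandwidth or received > port_bandwidth:
            return False
    return True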

Discussion of VL2

What you said

Cost "In order to implement VL2, extra servers are needed, which brings more cost on devices. It seems like a tradeoff for a network with less over-subscription. It is not suitable to utilize VL2 in a large network because as the required servers increase, it becomes much much harder to implement than traditional data center network system and since the link and switches work all the time, it will cost more power. " Junlin Lu

Performance "The requirement for the directory system is demanding. Is this system deployed in one single machine, or in a distributed system? If it comes to the latter, what is the network topology inside that system? In other words, what kind of distributed system can form a high-performing directory system to manage a high-performing distributed DCN." Huixiong Qin

Performance The paper proposes the addition of a shim layer which would fetch directory listing at the start of the flow. This process would be performed for each new flow and thus might affect performance. If the cluster is supporting highly interactive workloads such that they produce a large number of short-lived connections, the performance impact might be significant. Sukumar Gaonkar

Applicability and Context The authors state at the beginning of the paper that they did not consider solutions that would require significant changes to the APIs of switches or similar network components. Had the authors considered this, are there any parts of their design that they would have changed significantly? Would considering such solutions only improve existing solutions slightly, or does it open a whole new array of ideas and solutions that were previously completely eliminated? Jakob Degen

Weaknesses One technical weakness I can think of is the lack of visibility that arises from using VLB. While the average case seems to be sufficient to ensure that hot spots do not arise in the network, there may be certain cases in which you want to see how flows are being routed or even control their route in particular as we discussed during the lecture on SDNs. In this case you have absolutely no control over how things are routed because it is chosen randomly. Amber Rawson

Weaknesses I am not sure whether the scalability of VL2 is good enough. Nowadays, the data centers are usually with hundreds of thousands of servers. Since the VL2 already shows its tail latency in a relative small datacenter environment, will a larger datacenter make the latency increase significantly? The maintaining of data consistency will also become harder if the network administrators want the DS to have small response delay. Qiongwen Xu

Building on VL2 Though the traffic pattern showed in the paper is totally volatile, I think these data can be applied to machine learning method, to see whether there is some pattern that can be configured with AI technology. Wenhao Lu

VL2: Virtual Layer 2 Backup slides

VL2 goals: the illusion of a huge L2 switch
1. L2 semantics
2. Uniform high capacity
3. Performance isolation
(Figure: all servers, labeled A, attached to a single giant virtual layer-2 switch.)

VL2 design principles
- Randomizing to cope with volatility: tremendous variability in traffic matrices
- Separating names from locations: any server, any service
- Embracing end systems: leverage the programmability and resources of servers; avoid changes to switches
- Building on proven networking technology: build with parts shipping today; leverage low-cost, powerful merchant-silicon ASICs, though do not rely on any one vendor

Single-chip “merchant silicon” switches
(Figure: a switch ASIC, and Facebook's Wedge and 6-pack switch platforms built around it. Image courtesy of Facebook.)

Specific objectives and solutions

Objective                                  Approach                                            Solution
1. Layer-2 semantics                       Employ flat addressing                              Name-location separation & resolution service
2. Uniform high capacity between servers   Guarantee bandwidth for hose-model traffic          Flow-based random traffic indirection (Valiant LB)
3. Performance isolation                   Enforce hose model using existing mechanisms only   TCP

Addressing and routing: name-location separation
Cope with host churn with very little overhead:
- Switches run link-state routing and maintain only the switch-level topology
- Servers use flat names; a directory service maps each name to its current ToR
- Allows the use of low-cost switches
- Protects the network and hosts from host-state churn
- Obviates host and switch reconfiguration
(Figure: the directory service holds mappings such as x -> ToR2, y -> ToR3, z -> ToR3; a sender looks up the destination's ToR and tunnels the packet there, e.g., a packet for y is delivered via ToR3. When z moves between ToR3 and ToR4, only its directory entry changes.)
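A toy sketch of the name-location split (a dictionary standing in for the replicated directory service described in the paper; the function names are hypothetical):

# Application address (flat server name) -> location address (its current ToR).
directory = {"x": "ToR2", "y": "ToR3", "z": "ToR3"}

def lookup_and_encapsulate(dst_name, payload):
    # The host agent resolves the flat name to a ToR and tunnels the packet there.
    tor = directory[dst_name]
    return {"outer_dst": tor, "inner_dst": dst_name, "payload": payload}

def migrate(name, new_tor):
    # Host churn touches only the directory; switches and hosts are unchanged.
    directory[name] = new_tor

print(lookup_and_encapsulate("z", b"hello"))   # tunneled to ToR3
migrate("z", "ToR4")                           # e.g., VM z moves to a new rack
print(lookup_and_encapsulate("z", b"hello"))   # now tunneled to ToR4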

Example topology: Clos network
Offers huge aggregate capacity and multiple paths at modest cost.
(Figure: the VL2 topology with an intermediate-switch layer, an aggregation layer of K switches with D ports each, and ToR switches, each connecting 20 servers; the network supports 20 * (D*K/4) servers in total.)

D (# of 10G ports)   Max DC size (# of servers)
48                   11,520
96                   46,080
144                  103,680
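The table follows directly from the formula on the slide; a short sketch to reproduce it (assuming, as the table does, K = D aggregation switches):

def max_servers(d_ports, k_aggr=None):
    # 20 servers per ToR, and (D * K / 4) ToRs in the VL2 Clos.
    k_aggr = d_ports if k_aggr is None else k_aggr
    return 20 * (d_ports * k_aggr) // 4

for d in (48, 96, 144):
    print(d, max_servers(d))   # 11520, 46080, 103680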

Traffic forwarding: random indirection
Cope with arbitrary traffic matrices with very little overhead [ECMP + IP anycast]:
- Harness the huge bisection bandwidth
- Obviate esoteric traffic engineering or optimization
- Ensure robustness to failures
- Work with switch mechanisms available today
(Figure: sender x addresses the anycast address IANY of the intermediate layer; the packet is bounced off a randomly chosen intermediate switch on the up path, then follows the down path to the destination's ToR (e.g., T3 or T5 for y and z), where it is decapsulated and delivered.)
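A sketch of how per-flow random indirection can be realized with a flow hash, ECMP-style (field names and switch names are illustrative):

import hashlib

def pick_intermediate(flow, intermediates):
    # Hash the 5-tuple so every packet of a flow takes the same up path
    # (avoiding reordering), while different flows spread across the spine.
    key = f"{flow['src']}:{flow['sport']}:{flow['dst']}:{flow['dport']}:{flow['proto']}"
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(intermediates)
    return intermediates[index]

flow = {"src": "10.0.1.2", "sport": 51000, "dst": "10.0.7.9", "dport": 80, "proto": 6}
print(pick_intermediate(flow, ["Int1", "Int2", "Int3", "Int4"]))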

Some other DC network designs: BCube [SIGCOMM'09], Jellyfish (random) [NSDI'12]