15-744: Computer Networking L-14 Data Center Networking II

Overview
– Data Center Topologies
– Data Center Scheduling

Current solutions for increasing data center network bandwidth: FatTree
1. Hard to construct
2. Hard to expand

Background: Fat-Tree
Inter-connect racks (of servers) using a fat-tree topology. A fat tree is a special type of Clos network (after C. Clos).
K-ary fat tree: three-layer topology (edge, aggregation, core)
– Each pod consists of (k/2)^2 servers and 2 layers of k/2 k-port switches
– Each edge switch connects to k/2 servers and k/2 aggregation switches
– Each aggregation switch connects to k/2 edge and k/2 core switches
– (k/2)^2 core switches: each connects to k pods
(Figure: fat-tree with K = 4)

Why Fat-Tree?
– A fat tree has identical bandwidth at any bisection; each layer has the same aggregate bandwidth
– Can be built using cheap devices with uniform capacity: each port supports the same speed as an end host, and all devices can transmit at line speed if packets are distributed uniformly along available paths
– Great scalability: a k-port switch supports k^3/4 servers
(Figure: fat-tree network with K = 3 supporting 54 hosts)
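To make the scaling concrete, here is a small sketch (not from the slides; the function and names are illustrative) that computes the element counts of a k-ary fat tree directly from the formulas above:

```python
# Illustrative sketch (not from the slides): compute element counts of a
# k-ary fat tree directly from the formulas above.

def fat_tree_sizes(k: int) -> dict:
    """Return element counts for a k-ary fat tree (k must be even)."""
    assert k % 2 == 0, "a k-ary fat tree needs an even number of switch ports"
    half = k // 2
    return {
        "pods": k,
        "servers_per_pod": half ** 2,
        "edge_switches_per_pod": half,
        "aggregation_switches_per_pod": half,
        "core_switches": half ** 2,
        "total_servers": k ** 3 // 4,          # k^3/4 servers overall
        "total_switches": k * k + half ** 2,   # k switches per pod plus the core layer
    }

if __name__ == "__main__":
    print(fat_tree_sizes(4))    # the k = 4 example: 16 servers, 4 core switches
    print(fat_tree_sizes(48))   # commodity 48-port switches: 27,648 servers
```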

An alternative: hybrid packet/circuit switched data center network
Goals of this work:
– Feasibility: software design that enables efficient use of optical circuits
– Applicability: application performance over a hybrid network

Optical circuit switching vs. electrical packet switching
– Switching technology: electrical uses store and forward; optical uses circuit switching
– Switching capacity: electrical is 16x40 Gbps at the high end (e.g., Cisco CRS-1); optical is 320x100 Gbps on the market (e.g., Calient FiberConnect)
– Switching time: electrical operates at packet granularity; optical takes less than 10 ms (e.g., MEMS optical switch)

Optical Circuit Switch (Nathan Farrington, SIGCOMM)
– Components: lenses, a fixed mirror, mirrors on motors, and a glass fiber bundle; an input is steered to a different output by rotating a mirror
– Does not decode packets
– Takes time to reconfigure

Optical circuit switching vs. electrical packet switching
– Switching technology: electrical uses store and forward; optical uses circuit switching
– Switching capacity: electrical is 16x40 Gbps at the high end (e.g., Cisco CRS-1); optical is 320x100 Gbps on the market (e.g., Calient FiberConnect)
– Switching time: electrical operates at packet granularity; optical takes less than 10 ms
– Switching traffic: electrical suits bursty, uniform traffic; optical suits stable, pair-wise traffic

Optical circuit switching is promising despite slow switching time
– [IMC09][HotNets09]: “Only a few ToRs are hot and most their traffic goes to a few other ToRs. …”
– [WREN09]: “…we find that traffic at the five edge switches exhibit an ON/OFF pattern…”
– Full bisection bandwidth at packet granularity may not be necessary

Hybrid packet/circuit switched network architecture
– Optical circuit-switched network for high-capacity transfer
– Electrical packet-switched network for low-latency delivery
– Optical paths are provisioned rack-to-rack: a simple and cost-effective choice, and traffic is aggregated on a per-rack basis to better utilize the optical circuits

Design requirements
Control plane:
– Traffic demand estimation
– Optical circuit configuration
Data plane:
– Dynamic traffic de-multiplexing
– Optimizing circuit utilization (optional)

c-Through (a specific design)
– No modification to applications and switches
– Leverages end hosts for traffic management
– Centralized control for circuit configuration

c-Through: traffic demand estimation and traffic batching
Socket buffers at the end hosts accomplish two requirements:
– Traffic demand estimation: each host builds a per-rack traffic demand vector
– Pre-batching data to improve optical circuit utilization
Notes: 1. Transparent to applications. 2. Packets are buffered per-flow to avoid head-of-line blocking.
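A minimal sketch of what the per-host demand-estimation step might look like, assuming the management daemon can read per-socket send-buffer occupancy; rack_of() and socket_backlog() are hypothetical helpers, not c-Through's actual API:

```python
# Hypothetical sketch of per-host demand estimation: sum the send-buffer
# backlog of each open socket, grouped by destination rack. rack_of() and
# socket_backlog() are assumed helpers, not c-Through's actual API.
from collections import defaultdict

def estimate_rack_demand(open_sockets, rack_of, socket_backlog):
    """Return {destination_rack: queued_bytes} for this host's outgoing traffic."""
    demand = defaultdict(int)
    for sock in open_sockets:
        dst_rack = rack_of(sock)                  # map the socket's peer address to a rack id
        demand[dst_rack] += socket_backlog(sock)  # bytes waiting in the socket's send buffer
    return dict(demand)
```

Each host would ship this vector to the controller, which aggregates the vectors per source rack to obtain the rack-to-rack demand matrix used for circuit configuration.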

c-Through: optical circuit configuration
– The controller collects traffic demands and uses Edmonds’ algorithm (a maximum-weight matching) to compute the optimal circuit configuration, which it then pushes to the optical switch and to the hosts
– There are many ways to reduce the control-traffic overhead
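As an illustration of the matching step, here is a sketch that assumes the rack-to-rack demands have already been aggregated; networkx's max_weight_matching is used as an off-the-shelf stand-in for Edmonds' algorithm:

```python
# Sketch of the circuit-configuration step, assuming rack-to-rack demands have
# already been aggregated by the controller. networkx's max_weight_matching is
# an off-the-shelf implementation of Edmonds' matching algorithm.
import networkx as nx

def configure_circuits(demand):
    """demand: {(rack_a, rack_b): queued_bytes}. Returns the rack pairs to connect optically."""
    g = nx.Graph()
    for (a, b), queued_bytes in demand.items():
        g.add_edge(a, b, weight=queued_bytes)
    return nx.max_weight_matching(g)   # set of matched rack pairs, one circuit per pair

# Example: four racks with heavy traffic 0<->2 and 1<->3
print(configure_circuits({(0, 1): 5, (0, 2): 90, (1, 3): 80, (2, 3): 10}))
```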

c-Through: traffic de-multiplexing
VLAN-based network isolation (one VLAN for the packet-switched network, one for the circuit-switched network):
– No need to modify switches
– Avoids the instability caused by circuit reconfiguration
Traffic control on hosts:
– The controller informs hosts about the circuit configuration
– End hosts tag packets accordingly (see the sketch below)
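A toy version of the end-host tagging decision. The assumption here is that one VLAN carries traffic over the packet-switched network and another over the optical circuit; the constants and helper are hypothetical, not c-Through code:

```python
# Toy end-host de-multiplexer. Assumption: one VLAN carries traffic over the
# packet-switched network and another over the optical circuit; these names
# are hypothetical, not c-Through code.
from typing import Optional

PACKET_VLAN = 1    # always-available electrical network
CIRCUIT_VLAN = 2   # optical circuit, valid only for the currently connected rack

def pick_vlan(dst_rack: int, circuit_peer_rack: Optional[int]) -> int:
    """Tag traffic to the optically connected rack onto the circuit VLAN."""
    if circuit_peer_rack is not None and dst_rack == circuit_peer_rack:
        return CIRCUIT_VLAN
    return PACKET_VLAN   # everything else stays on the packet-switched VLAN
```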

Overview
– Data Center Topologies
– Data Center Scheduling

Datacenters and OLDIs
– OLDI = OnLine Data-Intensive applications, e.g., web search, retail, advertisements
– An important class of datacenter applications, vital to many Internet companies
OLDIs are critical datacenter applications.

OLDIs
– Partition-aggregate: tree-like structure; the root node sends a query and leaf nodes respond with data
– The deadline budget is split among the nodes and the network, e.g., total = 300 ms, parent-leaf RPC = 50 ms
– Missed deadlines → incomplete responses → affect user experience and revenue
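As a toy illustration of the budget split (the 300 ms total and the 50 ms parent-leaf RPC are the slide's example numbers; treating the remainder as the share of the root, aggregators, and network is an assumption made purely for illustration):

```python
# Toy illustration of the budget split. The 300 ms total and the 50 ms
# parent-leaf RPC budget are the slide's example numbers; treating the
# remainder as the share of the root, aggregators, and network is an
# assumption made purely for illustration.
TOTAL_BUDGET_MS = 300
PARENT_LEAF_RPC_MS = 50

upper_level_budget_ms = TOTAL_BUDGET_MS - PARENT_LEAF_RPC_MS
print(f"parent-leaf RPC budget: {PARENT_LEAF_RPC_MS} ms")
print(f"left for root, aggregators, and network: {upper_level_budget_ms} ms")
```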

Challenges Posed by OLDIs
Two important properties:
1) Deadline bound (e.g., 300 ms): missed deadlines affect revenue
2) Fan-in bursts: large data, 1000s of servers, and a tree-like structure (high fan-in) → fan-in bursts → long “tail latency”
The network is shared with many apps (OLDI and non-OLDI).
The network must meet deadlines and handle fan-in bursts.

Current Approaches
TCP: deadline agnostic, long tail latency
– Congestion → timeouts (slow), ECN (coarse)
Datacenter TCP (DCTCP) [SIGCOMM '10]: the first to comprehensively address tail latency
– Finely varies the sending rate based on the extent of congestion
– Shortens tail latency, but is not deadline-aware: ~25% missed deadlines at high fan-in and tight deadlines
DCTCP handles fan-in bursts, but is not deadline-aware.
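For contrast with D2TCP below: DCTCP's window rule (stated here from the DCTCP paper, not from these slides) cuts the window in proportion to the estimated congestion level α once per window of data: W := W * (1 − α/2), with 0 ≤ α ≤ 1 (α = 0 → no back-off; α = 1 → halve the window, as TCP does).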

D2TCP
Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion
– Built on top of DCTCP
– Distributed: uses per-flow state at end hosts
– Reactive: senders react to congestion; no knowledge of other flows

D2TCP’s Contributions
1) Deadline-aware and handles fan-in bursts
– Elegant gamma correction for congestion avoidance: far-deadline flows back off more, near-deadline flows back off less
– Reactive, decentralized, state kept only at end hosts
2) Does not hinder long-lived (non-deadline) flows
3) Coexists with TCP → incrementally deployable
4) No change to switch hardware → deployable today
D2TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.

Coflow Definition

Data Center Summary
Topology:
– Easy deployment/costs
– High bisection bandwidth makes placement less critical
– Augment on demand to deal with hot spots
Scheduling:
– Delays are critical in the data center
– Can try to handle this in congestion control
– Can try to prioritize traffic in switches
– May need to consider dependencies across flows to improve scheduling

Review
– Networking background: OSPF, RIP, TCP, etc.
– Design principles and architecture: E2E and Clark
– Routing/topology: BGP, power laws, HOT topology

Review
– Resource allocation: congestion control and TCP performance; FQ/CSFQ/XCP
– Network evolution: overlays and architectures; OpenFlow and Click; SDN concepts; NFV and middleboxes
– Data centers: routing, topology, TCP, scheduling

Testbed setup
– 16 servers with 1 Gbps NICs; a hybrid network is emulated on a 48-port Ethernet switch (Ethernet packet switch plus an emulated optical circuit switch; figure labels: 4 Gbps links, 100 Mbps links)
– Optical circuit emulation: optical paths are available only when hosts are notified; during reconfiguration, no host can use optical paths; 10 ms reconfiguration delay

Evaluation
Basic system performance:
– Can TCP exploit dynamic bandwidth quickly?
– Does traffic control on servers bring significant overhead?
– Does buffering unfairly increase the delay of small flows?
– Answers (in order): Yes, No, Yes
Application performance:
– Bulk transfer (VM migration)?
– Loosely synchronized all-to-all communication (MapReduce)?
– Tightly synchronized all-to-all communication (MPI-FFT)?

TCP can exploit dynamic bandwidth quickly: throughput reaches its peak within 10 ms.

Traffic control on servers adds little overhead: although the optical management system adds an output scheduler in the server kernel, it does not significantly affect TCP or UDP throughput.

Application performance: three different benchmark applications.

VM migration application (1)

VM migration application (2)

MapReduce (1)

MapReduce (2)

Yahoo GridMix benchmark
– 3 runs of 100 mixed jobs such as web query, web scan, and sorting
– 200 GB of uncompressed data, 50 GB of compressed data

MPI-FFT (1)

MPI-FFT (2)

D2TCP: Congestion Avoidance
A D2TCP sender varies its sending window W based on both the extent of congestion and the deadline:
W := W * (1 − p/2)
where p is the gamma-correction function. Larger p ⇒ smaller window: p = 1 ⇒ W/2 (full back-off), p = 0 ⇒ W (no back-off).

D2TCP: Gamma Correction Function
The gamma correction p is a function of both congestion and deadlines:
p = α^d
– α: extent of congestion, same as DCTCP’s α (0 ≤ α ≤ 1)
– d: deadline imminence factor = “completion time with window W” ÷ “deadline remaining”; d > 1 for near-deadline flows, d < 1 for far-deadline flows

Gamma Correction Function (cont.)
Key insight: near-deadline flows back off less while far-deadline flows back off more.
– d < 1 for far-deadline flows → p large → shrink window
– d > 1 for near-deadline flows → p small → retain window
– Long-lived flows → d = 1 → p = α → DCTCP behavior
Gamma correction elegantly combines congestion and deadlines: p = α^d, W := W * (1 − p/2).
(Figure: p vs. α for d = 1, d < 1 (far deadline), and d > 1 (near deadline).)
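A minimal sketch of this update, transcribing the slide's formulas p = α^d and W := W * (1 − p/2); the 2.0 cap on d anticipates the stability slide below, and the example numbers are purely illustrative:

```python
# Minimal sketch of the gamma-correction update, transcribing the slide
# formulas p = alpha^d and W := W * (1 - p/2). The cap d <= 2.0 anticipates
# the "Stability and Convergence" slide; example numbers are illustrative.
def d2tcp_window_update(window: float, alpha: float, d: float) -> float:
    """One congestion-avoidance step for a D2TCP sender that saw ECN marks."""
    d = min(d, 2.0)                    # deadline imminence factor, capped (per the slides)
    p = alpha ** d                     # near-deadline (d > 1) -> smaller p; far-deadline -> larger p
    return window * (1.0 - p / 2.0)    # p = 1 halves the window; p = 0 leaves it unchanged

# A far-deadline flow (d = 0.5) backs off more than a near-deadline flow (d = 1.5):
print(d2tcp_window_update(100.0, alpha=0.4, d=0.5))   # ~68.4
print(d2tcp_window_update(100.0, alpha=0.4, d=1.5))   # ~87.4
```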

Gamma Correction Function (cont.)
– α is calculated by aggregating ECN feedback (as in DCTCP)
– Switches mark packets if queue_length > threshold; ECN-enabled switches are common
– The sender computes the fraction of marked packets, averaged over time
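A sketch of how a sender might maintain α, following DCTCP's approach of averaging the marked-ACK fraction with an EWMA; g = 1/16 is DCTCP's suggested gain and is treated here as an assumed parameter:

```python
# Sketch of the alpha estimate: once per window of data the sender measures F,
# the fraction of its ACKed packets that carried ECN marks, and folds it into
# an exponentially weighted moving average, as DCTCP does. g = 1/16 is DCTCP's
# suggested gain; it is treated here as an assumed parameter.
def update_alpha(alpha: float, marked_acks: int, total_acks: int, g: float = 1.0 / 16) -> float:
    """EWMA of the marked-packet fraction, updated roughly once per RTT."""
    if total_acks == 0:
        return alpha                      # nothing acknowledged this round
    fraction_marked = marked_acks / total_acks
    return (1.0 - g) * alpha + g * fraction_marked
```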

Gamma Correction Function (cont.)
The deadline imminence factor d is “completion time with window W” ÷ “deadline remaining”: d = T_c / D.
With B = data remaining and W = current window size, the average window size ≈ 3/4 * W, so T_c ≈ B / (3/4 * W).
A more precise analysis is in the paper.
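Putting the two formulas together, a hedged sketch (the units are assumptions: B in bytes, W in bytes sent per RTT, and D in RTTs remaining, so T_c and D are both measured in RTTs; the cap at 2.0 comes from the next slide):

```python
# Hedged sketch of the deadline imminence factor: Tc ~= B / (3/4 * W) and
# d = Tc / D, as above. Assumed units: B in bytes, W in bytes sent per RTT,
# D in RTTs remaining until the deadline (so Tc and D are both in RTTs).
# The cap at 2.0 comes from the next slide.
def deadline_imminence(bytes_remaining: float, window_bytes: float, rtts_to_deadline: float) -> float:
    """d > 1: the flow is behind schedule (near-deadline); d < 1: comfortably ahead."""
    if rtts_to_deadline <= 0 or window_bytes <= 0:
        return 2.0                                    # already late: be as aggressive as the cap allows
    tc = bytes_remaining / (0.75 * window_bytes)      # expected RTTs to finish at average window 3/4 * W
    return min(tc / rtts_to_deadline, 2.0)
```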

D2TCP: Stability and Convergence
– D2TCP’s control loop is stable: a poor estimate of d is corrected in subsequent RTTs
– When flows have tight deadlines (d >> 1):
1. d is capped at 2.0 → flows are not overly aggressive
2. As α (and hence p) approaches 1, D2TCP defaults to TCP → D2TCP avoids congestive collapse
(p = α^d, W := W * (1 − p/2))

D2TCP: Practicality
– Does not hinder background, long-lived flows
– Coexists with TCP → incrementally deployable
– Needs no hardware changes: ECN support is commonly available
D2TCP is deadline-aware, handles fan-in bursts, and is deployable today.