Centralized Arbitration for Data Centers


Centralized Arbitration for Data Centers Lecture 18, Computer Networks (198:552)

Datacenter transport
Goals: complete flows and jobs quickly
- Short flows (e.g., query, coordination): low latency
- Large flows (e.g., data update, backup): high throughput
- Big-data jobs (e.g., map-reduce, ML training): complete fast

Low latency approaches
- Endpoint algorithms keep queues small (e.g., DCTCP)
- Prioritize mice over elephants in the fabric (e.g., pFabric)

Continuing problems
- Flows need not be the most semantically meaningful entities!
  - Co-flows: big-data jobs, partition/aggregate queries
  - The result isn't meaningful until a set of related flows completes
- Multiple competing objectives: provider & apps
  - Fair sharing vs. efficient usage vs. app objectives
- Change all switches? Endpoints?
[Figure: front-end server → aggregators → workers]

Opportunity
Many DC apps/platforms know flow sizes or deadlines in advance:
- Key/value stores
- Data processing
- Web search
[Figure: front-end server → aggregators → workers]

[Figure: hosts H1–H9 on the TX side connected to hosts H1–H9 on the RX side through one giant switch]

Flow scheduling on a giant switch
DC transport = flow scheduling on a giant switch, subject to ingress & egress capacity constraints
Objective?
- App: minimize average FCT, co-flow completion time, delays, retransmits
- Net: share the network fairly
[Figure: hosts H1–H9, TX and RX sides of the giant switch]

Building such a network is hard!
- The "one big switch" abstraction no longer holds:
  - Failures
  - Heterogeneous links
- Must respect internal link rate constraints

Flow scheduling on a giant switch (revisited)
DC transport = flow scheduling on a giant switch, now with ingress, egress, and core capacity constraints
Objective?
- App: minimize average FCT, co-flow completion time, delays, retransmits
- Net: share the network fairly
[Figure: hosts H1–H9, TX and RX sides of the giant switch]

Building such a network is hard!
- The "one big switch" abstraction no longer holds: failures, heterogeneous links
- Must respect internal link rate constraints
- Switch modifications needed to prioritize at high speeds
Can we do better?

DC transport = a controller scheduling packets at fine granularity
Objective?
- App: minimize average FCT, co-flow completion time, delays, retransmits
- Net: share the network fairly
[Figure: hosts H1–H9, TX and RX sides of the giant switch]

Fastpass
Jonathan Perry et al., SIGCOMM'14

Fastpass goals
Is it possible to design a network that provides:
- Zero network queues
- High utilization
- Multiple app and provider objectives

Logically centralized arbiter
(1) A has 1 packet for B: request sent to the arbiter (~10 μs)
(2) Arbiter allocates a time slot & path for the packet (1–20 μs)
(3) At t=95, send A→B through R1 (~10 μs)
(4) Packet travels A→B with zero queuing

Logically centralized arbiter
Schedule and assign paths to all packets.
Concerns with centralization:
- Will the arbiter scale to schedule packets over a large network?
- Will the latency to start flows be a problem?
- How will the system deal with faults?
Concerns with endpoint behavior:
- Can endpoints send at the right time? Clocks must be synchronized!

Arbiter problem (1): time slot allocation
- Allocate packet transmissions per time slot (1.2 μs: one 1500-byte packet at 10 Gbit/s)
- Choose which hosts get to send: as many packets as possible without contending on host input or output ports
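The timeslot length above follows directly from the MTU and line rate; a quick sketch of the arithmetic (link speed taken from the slide):

```python
# Sketch: length of one timeslot, i.e., the time to serialize one
# MTU-sized packet at line rate.
MTU_BITS = 1500 * 8      # 1500-byte packet
LINK_RATE_BPS = 10e9     # 10 Gbit/s

timeslot_us = MTU_BITS / LINK_RATE_BPS * 1e6
print(timeslot_us)       # ~1.2 microseconds
```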

Arbiter problem (2): path selection
- Select paths for the allocated host transmissions
- Ensure no two packets conflict on any network port, at all levels of the topology

Can we solve (1) and (2) separately, or must we solve them together?
Solve them separately if the network has full bisection bandwidth, i.e., it behaves like one big switch: any port can reach any other without contention.

Time slot allocation == matching problem!
- Want: an optimal matching, maximizing the number of packets transmitted in a timeslot. Expensive to implement!
- Can do: online maximal matching. Greedy approach: allocate a demand to the current time slot if its source and destination are free; otherwise push it to the next time slot.
- Fastpass: allocates in 10 ns per demand
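The greedy maximal-matching step can be sketched in a few lines. This is an illustrative model, not the Fastpass implementation: each demand is a (src, dst) pair, and a demand is admitted into the current timeslot only if neither of its endpoint ports is already taken; leftovers are deferred to the next timeslot.

```python
# Sketch of greedy (maximal-matching) timeslot allocation.
def allocate_timeslot(demands):
    """Return (allocated, deferred) demand lists for one timeslot."""
    busy_src, busy_dst = set(), set()
    allocated, deferred = [], []
    for src, dst in demands:
        if src not in busy_src and dst not in busy_dst:
            # Neither port contended: admit into this timeslot.
            busy_src.add(src)
            busy_dst.add(dst)
            allocated.append((src, dst))
        else:
            # Port conflict: retry in the next timeslot.
            deferred.append((src, dst))
    return allocated, deferred

# Example: two demands share source 1, and two share destination 2,
# so only a maximal conflict-free subset fits in this timeslot.
alloc, defer = allocate_timeslot([(1, 2), (1, 3), (4, 2), (5, 6)])
print(alloc)   # [(1, 2), (5, 6)]
print(defer)   # [(1, 3), (4, 2)]
```

Note that the result is maximal (no deferred demand can be added) but not necessarily maximum; that trade-off is what makes the greedy version fast enough to run online.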

Pipelined time slot allocation
[Figure: a demand (src 2 → dst 4) with 6 remaining packets shrinks to 5 as successive pipeline stages each allocate one timeslot]
Allocate traffic for 2 Tbit/s on 8 cores!

Time slot allocation: meeting app goals
- Ordering the demands changes the achieved allocation goal:
  - Max-min fairness: least recently allocated first
  - FCT minimization: fewest remaining MTUs first
- Allocation proceeds in batches of 8 timeslots, optimized using the find-first-set hardware instruction
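The policy lives entirely in the ordering handed to the greedy allocator. A minimal sketch of the two orderings named above (the `Demand` fields here are assumptions for illustration, not Fastpass's actual data structures):

```python
# Sketch: the demand ordering encodes the allocation policy.
from dataclasses import dataclass

@dataclass
class Demand:
    src: int
    dst: int
    remaining_mtus: int   # MTUs left in the flow
    last_allocated: int   # timeslot of the flow's last allocation

def order_fair(demands):
    # Max-min fairness: least-recently-allocated flow goes first.
    return sorted(demands, key=lambda d: d.last_allocated)

def order_min_fct(demands):
    # FCT minimization: fewest remaining MTUs first (SRPT-like).
    return sorted(demands, key=lambda d: d.remaining_mtus)

ds = [Demand(1, 2, 50, 7), Demand(3, 4, 2, 9), Demand(5, 6, 10, 3)]
print([d.src for d in order_fair(ds)])     # [5, 1, 3]
print([d.src for d in order_min_fct(ds)])  # [3, 5, 1]
```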

Path selection
- Assume a 2-tier topology: ToR → aggregation switch → ToR
- Given a pair of ToRs, find an aggregation switch whose links to both ToRs are unused this timeslot
- This is an edge coloring problem!
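For intuition, a greedy pass over this edge-coloring instance can be sketched as follows. Each admitted (src ToR, dst ToR) transmission is assigned the first aggregation switch whose uplink from the source ToR and downlink to the destination ToR are both still free in this timeslot. This is an illustrative sketch, not the coloring algorithm Fastpass actually uses, and greedy assignment may fail where an optimal coloring would succeed.

```python
# Sketch: greedy path selection in a 2-tier leaf-spine fabric.
def select_paths(transmissions, num_agg):
    up_used = set()    # (src_tor, agg) uplinks already claimed
    down_used = set()  # (agg, dst_tor) downlinks already claimed
    paths = {}
    for src, dst in transmissions:
        for agg in range(num_agg):
            if (src, agg) not in up_used and (agg, dst) not in down_used:
                up_used.add((src, agg))
                down_used.add((agg, dst))
                paths[(src, dst)] = agg
                break
    return paths

# Two flows leaving ToR 0 must use different aggregation switches.
print(select_paths([(0, 1), (0, 2), (3, 1)], num_agg=2))
# {(0, 1): 0, (0, 2): 1, (3, 1): 1}
```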

Will the delay to start flows be an issue?
- The arbiter must respond before the ideal zero-queue transmission time arrives
- Under light load, independent (unscheduled) transmissions may win

Can endpoints transmit at the right time? What happens if not?
(1) Need to ensure that clocks aren't too far out of sync: use PTP to synchronize within a few μs
(2) Endpoints must transmit when told: a Fastpass qdisc shapes traffic as instructed, using hrtimers for microsecond resolution
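The endpoint's job, in caricature: hold each packet until its assigned timeslot begins, assuming clocks are already synchronized. The real mechanism is a kernel qdisc driven by hrtimers; the sketch below is only a user-space analogy, busy-waiting because ordinary sleeps are far too coarse at microsecond granularity.

```python
# Sketch: release a packet only once its assigned timeslot starts.
import time

TIMESLOT_US = 1.2  # one 1500-byte MTU at 10 Gbit/s

def wait_for_timeslot(slot, epoch):
    """Busy-wait until timeslot `slot` (counted from `epoch`) begins."""
    target = epoch + slot * TIMESLOT_US / 1e6
    while time.monotonic() < target:
        pass  # spin: time.sleep() cannot hit microsecond deadlines

epoch = time.monotonic()
wait_for_timeslot(100, epoch)  # returns once slot 100 has begun
elapsed_us = (time.monotonic() - epoch) * 1e6
print(elapsed_us >= 120)       # True: at least 100 slots have elapsed
```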

Experimental results on Facebook rack

Convergence to network share

Discussion
- Ordering demands: an indirect way to set allocation goals?
- Online time slot allocation without preemption: what about priorities?
- Comparison with delay-based congestion control?
- Fine-grained network scheduling approaches (e.g., pFabric)?
- How to scale to much larger networks?
- Can you completely eliminate retransmissions and reordering?
- What is the impact on the host networking stack?
- Do you need to arbitrate the arbitration packets?

Acknowledgment Slides heavily adapted from material by Mohammad Alizadeh and Jonathan Perry