1 Per-packet Load-balanced, Low-Latency Routing for Clos-based Data Center Networks
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, Dave Maltz. December, Santa Barbara, California

2 Outline
Background
DRB for load-balance and low latency
DRB for 100% bandwidth utilization
DRB latency modeling
Routing design and failure handling
Evaluations
Related work
Conclusion

3 Clos-based DCN: background
Topology: Clos-based fat-tree. Routing: equal-cost multi-path (ECMP). Given a spine switch, there is only one path from a source to a destination in a fat-tree.
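To make the routing baseline concrete, here is a minimal sketch of flow-based ECMP path selection (not from the paper): the switch hashes the flow's 5-tuple and uses the result to pick one of the equal-cost uplinks, so all packets of one flow follow the same path. The field names and hash choice are illustrative assumptions.

```python
import hashlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Flow-based ECMP: hash the 5-tuple and pick one equal-cost uplink.

    All packets of the same flow hash to the same uplink, which is why two
    large flows can collide on one link (see the next slide). The hash here
    (MD5 over a string key) is only for illustration.
    """
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]

# Example: two flows may land on the same uplink despite four choices.
uplinks = ["spine0", "spine1", "spine2", "spine3"]
print(ecmp_uplink("10.0.0.1", "10.0.1.1", 5001, 80, 6, uplinks))
print(ecmp_uplink("10.0.0.2", "10.0.1.2", 5002, 80, 6, uplinks))
```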

4 Clos-based DCN: issues
Low network utilization, due to flow-based hash collisions in ECMP. High network latency: the latency tail results in high user-perceived latency, since many DC applications use thousands or more TCP connections and some of them are likely to hit the tail.

5 Network latency measurement
[Figure: measured network latency distribution; annotated values: 400us, 1.5ms, 2ms]
Network latency has a long tail. Busy servers do not contribute to the long latency tail. The server network stack increases latency by several hundred microseconds.

6 Where the latency tail comes from
A (temporarily) congested switch port can use several MB of packet buffering. A 1MB buffer introduces about 1ms of latency on a 10G link. For a three-layer DCN, intra-DC communications take up to 5 hops.
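The 1 ms figure is just the time to drain a full 1 MB buffer over a 10 Gbit/s link; a minimal sketch of the arithmetic (pure illustration), including the 5-hop worst case from the slide:

```python
def buffering_delay_ms(buffer_bytes, link_gbps):
    """Time to drain a full buffer through the link, in milliseconds."""
    return buffer_bytes * 8 / (link_gbps * 1e9) * 1e3

per_hop = buffering_delay_ms(1_000_000, 10)   # ~0.8 ms for 1 MB at 10G
print(f"per hop: {per_hop:.2f} ms")
print(f"worst case over 5 hops: {5 * per_hop:.2f} ms")
```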

7 The challenge
The challenge: given a full-bisection-bandwidth Clos network, achieve 100% bandwidth utilization and 0 in-network latency. There are many ways to improve, but none addresses the challenge fully, e.g., traffic engineering for better bandwidth utilization, or ECN for latency mitigation. Our answer: DRB.

8 Digit-reversal bouncing (DRB)
It is the right time for per-packet routing: the Clos topology is regular, the server software stack is under our control, and switches are becoming open and programmable. DRB achieves 100% bandwidth utilization through per-packet routing, achieves small queuing delay through its "digit-reversal" algorithm, and can be readily implemented.

9 Achieve 100% bandwidth utilization
Sufficient condition for 100% utilization: in a fat-tree network, given an arbitrary feasible traffic matrix, if a routing algorithm can evenly spread the traffic a_{i,j} from server i to server j among all the possible uplinks at every layer, then no link, including the downlinks, is overloaded. The condition implies: oblivious load balancing (no need for the traffic matrix), packet bouncing (only the uplinks need to be load-balanced), and spreading per source-destination pair instead of per flow.
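One simple way to meet the even-spreading condition is to keep a counter per source-destination pair and send successive packets of a pair over successive uplinks (the round-robin bouncing scheme named on the next slide). A minimal sketch, with hypothetical names:

```python
from collections import defaultdict

class PerPairSpreader:
    """Spread packets of each (src, dst) pair evenly over the uplinks.

    This is oblivious load balancing: no traffic matrix is needed, and the
    spreading is per source-destination pair rather than per flow. Cycling
    a counter (round-robin) is one way to meet the even-spreading condition;
    DRB on the next slide permutes the same counter differently.
    """
    def __init__(self, num_uplinks):
        self.num_uplinks = num_uplinks
        self.seq = defaultdict(int)          # (src, dst) -> packets sent

    def next_uplink(self, src, dst):
        i = self.seq[(src, dst)]
        self.seq[(src, dst)] = i + 1
        return i % self.num_uplinks          # even spread over the uplinks

spreader = PerPairSpreader(num_uplinks=4)
print([spreader.next_uplink("s1", "d7") for _ in range(8)])  # 0,1,2,3,0,1,2,3
```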

10 DRB for fat-tree
DRB uses digit reversal for bouncing switch selection (Seq → digit-reversal → spine switch):
Seq  Digit-reversal  Spine switch
00   00              3.0
01   10              3.2
10   01              3.1
11   11              3.3
There are many ways to meet the sufficient condition, e.g. RB (random bouncing) and RRB (round-robin bouncing).
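A minimal sketch of the digit-reversal selection, assuming spine indices are written with a fixed number of base-b digits as in the 4-spine table above (the exact radix per layer depends on the topology):

```python
def digit_reverse(seq, base, num_digits):
    """Reverse the base-`base` digits of `seq`, using `num_digits` digits.

    With base=2, num_digits=2 this reproduces the table above:
    seq 00->00, 01->10, 10->01, 11->11, i.e. spines 3.0, 3.2, 3.1, 3.3.
    """
    digits = []
    for _ in range(num_digits):
        digits.append(seq % base)            # least-significant digit first
        seq //= base
    # Reading the digits least-significant first as if they were
    # most-significant first is exactly the digit-reversed value.
    rev = 0
    for d in digits:
        rev = rev * base + d
    return rev

def drb_spine(pair_seq, base, num_digits):
    """Pick the bouncing (spine) switch for the pair's `pair_seq`-th packet."""
    num_spines = base ** num_digits
    return digit_reverse(pair_seq % num_spines, base, num_digits)

print([drb_spine(i, base=2, num_digits=2) for i in range(4)])  # [0, 2, 1, 3]
```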

11 Queuing latency modeling
[Figures: first-hop queue length vs. traffic load (24-port switches), and first-hop queue length vs. switch port number at load 0.95]
DRB and RRB achieve bounded queue lengths as the load approaches 100%. The queue length of RRB grows in proportion to n^2 (n = switch port number), while the queue length of DRB stays very small (2-3 packets).

12 DRB for VL2 Given a spine switch, there are multiple paths between a source and a destination in VL2. DRB therefore splits each spine switch into multiple “virtual spine switches”.
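The slide gives no implementation detail; one illustrative way to picture the splitting (an assumption, with hypothetical names) is to enumerate each (spine, path) combination as its own virtual spine and run the same DRB selection over those indices:

```python
def virtual_spines(spines, paths_per_spine):
    """Enumerate (spine, path) pairs as virtual spine switches.

    In VL2 there are multiple paths between a source and a destination even
    after fixing the spine switch, so each physical spine is treated as
    `paths_per_spine` virtual spines and DRB load-balances over all of them.
    """
    return [(s, p) for s in spines for p in range(paths_per_spine)]

vspines = virtual_spines(["spine0", "spine1"], paths_per_spine=2)
# DRB then indexes into `vspines` the same way it indexes physical spines
# in the fat-tree case (see the digit-reversal sketch above).
print(vspines)   # [('spine0', 0), ('spine0', 1), ('spine1', 0), ('spine1', 1)]
```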

13 DRB routing and failure handling
Servers choose a bouncing switch for each packet; switches use static routing. Switches are programmed to maintain an up-to-date network topology, and the topology is leveraged to minimize broadcast messages.
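A minimal, hypothetical sketch of the server side under these assumptions: the server keeps the topology view announced by the switches, selects a bouncing switch per packet in digit-reversed order, and skips switches currently marked failed; how topology updates actually reach servers is not shown.

```python
def drb_order(base, num_digits):
    """Digit-reversed visiting order of spine indices (see earlier sketch)."""
    def rev(seq):
        r = 0
        for _ in range(num_digits):
            r = r * base + seq % base
            seq //= base
        return r
    return [rev(i) for i in range(base ** num_digits)]

class DrbSender:
    """Server-side per-packet bouncing-switch choice with failure handling.

    The spine list and the failed set come from the up-to-date topology the
    programmable switches maintain; distributing that state is out of scope.
    """
    def __init__(self, spines, base, num_digits):
        assert len(spines) == base ** num_digits
        self.spines = spines
        self.order = drb_order(base, num_digits)
        self.failed = set()                 # spines currently reported down
        self.seq = {}                       # (src, dst) -> per-pair counter

    def choose_bouncing_switch(self, src, dst):
        i = self.seq.get((src, dst), 0)
        self.seq[(src, dst)] = i + 1
        n = len(self.spines)
        for step in range(n):               # skip spines marked as failed
            spine = self.spines[self.order[(i + step) % n]]
            if spine not in self.failed:
                return spine
        raise RuntimeError("no live spine switch available")

sender = DrbSender(["10.3.0.1", "10.3.0.2", "10.3.0.3", "10.3.0.4"],
                   base=2, num_digits=2)
sender.failed.add("10.3.0.3")
print(sender.choose_bouncing_switch("10.0.0.11", "10.1.0.42"))
```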

14 Simulation: network utilization
Simulation setup: packet-level simulation with NS3; three-layer fat-tree and VL2 topologies with servers; permutation traffic pattern; TCP as the transport protocol with a 256KB buffer; a resequencing buffer for out-of-order packet arrivals.
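For reference, a permutation traffic pattern pairs each server with exactly one distinct destination; a minimal generator sketch (the actual simulations were done in NS3, not with this code):

```python
import random

def permutation_pattern(servers, seed=0):
    """Map each server to exactly one distinct destination (and vice versa).

    Every server sends to one other server and receives from one other
    server, so the traffic matrix is feasible and stresses the fabric's
    full bisection bandwidth.
    """
    rng = random.Random(seed)
    dsts = servers[:]
    while True:
        rng.shuffle(dsts)
        if all(s != d for s, d in zip(servers, dsts)):   # no self-traffic
            return dict(zip(servers, dsts))

print(permutation_pattern([f"server{i}" for i in range(6)]))
```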

15 Simulation: queuing delay
RRB results in large queuing delay at the first and fourth hops. DRB achieves the smallest queuing delay even though its throughput is the highest.

16 Simulations: out-of-order arrivals
Resequencing delay is defined as the time a packet stays in the resequencing buffer. RB's resequencing delay is the worst, and resequencing delay is not directly related to queuing delay. DRB produces very few out-of-order packet arrivals.
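A minimal sketch of a receive-side resequencing buffer that measures resequencing delay as defined above; this is an illustration, not the paper's implementation:

```python
import time

class ResequencingBuffer:
    """Hold out-of-order packets and release them in sequence order.

    Resequencing delay of a packet = the time it sits here before release.
    """
    def __init__(self):
        self.next_seq = 0
        self.pending = {}                 # seq -> (packet, arrival_time)
        self.delays = []                  # resequencing delay samples (s)

    def receive(self, seq, packet):
        self.pending[seq] = (packet, time.monotonic())
        released = []
        while self.next_seq in self.pending:   # deliver the in-order prefix
            pkt, arrived = self.pending.pop(self.next_seq)
            self.delays.append(time.monotonic() - arrived)
            released.append(pkt)
            self.next_seq += 1
        return released

buf = ResequencingBuffer()
buf.receive(1, "pkt1")         # out of order: held in the buffer
print(buf.receive(0, "pkt0"))  # releases pkt0 and pkt1 in order
```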

17 Implementation and testbed
Servers: perform IP-in-IP packet encapsulation for each source-destination pair on the sending side, and packet resequencing on the receiving side. Switches: IP-in-IP packet decapsulation and topology maintenance. Testbed: a three-layer fat-tree with 54 servers; each switch has 6 ports.
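To make the forwarding path concrete, here is a minimal sketch of sender-side IP-in-IP encapsulation using Scapy (an assumption; the real implementation lives in the server's network stack and the switch datapath). The outer destination is the chosen bouncing switch, which decapsulates and forwards on the inner destination; all addresses and the UDP port are hypothetical.

```python
# Requires scapy (pip install scapy); sending needs raw-socket privileges.
from scapy.all import IP, UDP, Raw, send

def encap_and_send(payload, src_ip, dst_ip, bouncing_switch_ip):
    """IP-in-IP encapsulation: outer dst = bouncing (spine) switch,
    inner dst = real destination server. The spine switch removes the
    outer header and routes the inner packet down toward dst_ip."""
    inner = IP(src=src_ip, dst=dst_ip) / UDP(dport=4791) / Raw(payload)
    outer = IP(src=src_ip, dst=bouncing_switch_ip, proto=4)  # 4 = IP-in-IP
    send(outer / inner, verbose=False)

# One packet bounced off spine 10.3.0.2; the bouncing switch per packet
# would come from the DRB selection sketched earlier.
encap_and_send(b"hello", "10.0.0.11", "10.1.0.42", "10.3.0.2")
```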

18 Experiments: queuing delay
RB results in large queue lengths (250KB per port). DRB and RRB perform similarly since each switch has only 3 uplinks. DRB's queue length is only 2-3 packets, consistent with the queue modeling and simulation results.

19 Related work
Random-based per-packet routing: Random Packet Spraying (RPS), per-packet VLB. Flowlet-based approaches. LocalFlow. DeTail (lossless link layer + per-packet adaptive routing). Flow-level deadline-based approaches: D3, D2TCP, PDQ.

20 Conclusion
DRB achieves 100% bandwidth utilization, almost 0 queuing delay, and few out-of-order packet arrivals. DRB can be readily implemented: servers perform packet encapsulation and switches perform packet decapsulation.

21 Q & A This is the end of my presentation. Thank you. Any questions?

