Presentation is loading. Please wait.

Presentation is loading. Please wait.

Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Mohammad Al-Fares.

Similar presentations


Presentation on theme: "Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Mohammad Al-Fares."— Presentation transcript:

1 Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Presented by 馬嘉伶

2 Easy to understand problem MapReduce style DC applications need bandwidth DC networks have many ECMP paths between servers Flow-hash-based load balancing insufficient

3 ECMP Paths Many equal cost paths going up to the core switches Only one path down from each core switch Randomly allocate paths to flows using hash of the flow Agnostic to available resources Long lasting collisions between long (elephant) flows D D S S

4 Collisions of elephant flows Collisions possible in two different ways Upward path Downward path D1D1 D1D1 S1S1 S1S1 D2D2 D2D2 S2S2 S2S2 D3D3 D3D3 S3S3 S3S3 D4D4 D4D4 S4S4 S4S4

5 Collisions of elephant flows Average of 61% of bisection bandwidth wasted on a network of 27K servers D1D1 D1D1 S1S1 S1S1 D2D2 D2D2 S2S2 S2S2 D3D3 D3D3 S3S3 S3S3 D4D4 D4D4 S4S4 S4S4

6 Hedera Scheduler Hedera: Dynamic Flow Scheduling - Optimize achievable bisection bandwidth by assigning flow non-conflicting paths - Uses flow demand estimation + placement heuristics to find good flow-to-core mappings Estimate Flow Demands Place Flows Detect Large Flows

7 Hedera Scheduler Detect Large Flows Flows that need bandwidth but are network-limited Estimate Flow Demands Use max-min fairness to allocate flows between src-dst pairs Place Flows Use estimated demands to heuristically find better placement of large flows on the ECMP paths Estimate Flow Demands Place Flows Detect Large Flows

8 Elephant Detection Scheduler continually polls edge switches for flow byte-counts Flows exceeding B/s threshold are “large” > %10 of hosts’ link capacity (i.e. > 100Mbps) What if only “small” flows? Default ECMP load-balancing efficient. GigE

9 Elephant Detection Hydera complements ECMP! - Default forwarding uses ECMP - Hedera schedules large flows that cause bisection bandwidth problems.

10 Demand Estimation Flows can be constrained in two ways Host-limited (at source, or at destination) Network-limited Measured flow rate is misleading Need to find a flow’s “natural” bandwidth requirement when not limited by the network Forget network, just allocate capacity between flows using max-min fairness

11 Demand Estimation Given traffic matrix of large flows, modify each flow’s size at it source and destination iteratively… Sender equally distributes bandwidth among outgoing flows that are not receiver-limited Network-limited receivers decrease exceeded capacity equally between incoming flows Repeat until all flows converge Guaranteed to converge in O(|F|) time

12 Demand Estimation AA BB CC XX YY Flow Estimat e Conv. ? AXAX AYAY BYBY CYCY Sende r Available Unconv. BW FlowsShare A121/2 B111 C111 Senders

13 Demand Estimation RecvRL? Non-SL Flows Share XNo-- YYes31/3 Receivers Flow Estimat e Conv. ? AXAX1/2 AYAY BYBY1 CYCY1 AA BB CC XX YY

14 Demand Estimation Flow Estimat e Conv. ? AXAX1/2 AYAY1/3Yes BYBY1/3Yes CYCY1/3Yes Sende r Available Unconv. BW FlowsShare A2/31 B000 C000 Senders AA BB CC XX YY

15 Demand Estimation Flow Estimat e Conv. ? AXAX2/3Yes AYAY1/3Yes BYBY1/3Yes CYCY1/3Yes RecvRL? Non-SL Flows Share XNo-- Y -- Receivers AA BB CC XX YY

16 Flow placement Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized That is maximum utilization of theoretically available b/w Two approaches Global First Fit: Greedily choose path that has sufficient unreserved b/w Simulated Annealing: Iteratively find a globally better mapping of paths to flows

17 Flow Placement Global First Fit Greedy Speed is O([ports/switch] 2 ) Far from optimal Simulated Annealing Iterative Speed is O(# flows) Closer to optimal

18 Global First-Fit New flow detected, linearly search all possible paths from S  D Place flow on first path whose component links can fit that flow ? Flow A Flow B Flow C ?? 00112233 Scheduler

19 Global First-Fit Flows placed upon detection, are not moved Once flow ends, entries + reservations time out Flow A Flow B Flow C 00112233 Scheduler

20 Simulated Annealing Probabilistic search for good flow-to-core mappings - Goal: Maximize achievable bisection bandwidth Current flow-to-core mapping generates neighbor state - Calculate total exceeded bandwidth capacity - Accept move to neighbor state if bisection BW gain Few thousand iterations for each scheduling round - Avoid local-minima; non-zero prob. to worse state

21 Simulated Annealing

22 Example run: 3 flows, 3 iterations 00112233 Flow A Flow B Flow C Core 2 1 0 ? ?? 2 0 2 ? 2 0 3 Scheduler

23 Simulated Annealing Final state is published to the switches and used as the initial state for next round 001133 Flow A Flow B Flow C Core ? ??? 2 0 3 22 Scheduler

24 Fault-Tolerance Link / Switch failure Use PortLand’s fault notification protocol Hedera routes around failed components 001133 Flow A Flow B Flow C 22 Scheduler

25 Fault-Tolerance Scheduler failure Soft-state, not required for correctness (connectivity) Switches fall back to ECMP 001133 Flow A Flow B Flow C 22 Scheduler

26 Evaluation 16-host testbed - k=4 fat-tree data-plane - 20 machines; 4-port NetFGPAs / OpenFlow - Parallel 48-port non-blocking Quanta switch(debug & control) 1 Scheduler machine - Dynamic traffic monitoring - OpenFlow routing control

27 Evaluation

28 16-hosts: 120 GB all-to-all in-memory shuffle Hedera achieves 39% better bisection BW over ECMP, 88% of ideal non-blocking switch ECMPGFFSAControl Total Shuffle Time (s) 438.4335.5336.0306.4 Avg. Completion Time (s) 358.1258.7262.0226.6 Avg. Bisection BW (Gbps) 2.813.893.844.44 Avg. host goodput (MB/s) 20.929.028.633.1 Data Shuffle

29 Reactiveness Demand Estimation: 27K hosts, 250K flows, converges < 200ms Simulated Annealing: Asymptotically dependent on # of flows + # iterations 50K flows and 10K iter: 11ms Most of final bisection BW: first few hundred iterations Scheduler control loop: Polling + Estimation + SA = 145ms for 27K hosts

30 Limitations Dynamic workloads, large flow turnover faster than control loop Scheduler will be continually chasing the traffic matrix Need to include penalty term for unnecessary SA flow re-assignments Flow Size Matrix Stability Stable Unstabl e ECMPHedera

31 Conclusions Simulated Annealing delivers significant bisection BW gains over standard ECMP Hedera complements ECMP If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; tiny investment!

32 Thoughts [1] Inconsistencies Motivation is MapReduce, but Demand Estimation assumes sparse traffic matrix Something fuzzy Demand Estimation step assumes all bandwidth available at host Forgets the poor mice :’(

33 Thoughts [2] Inconsistencies Evaluation Results (already discussed) Simulation using their own flow-level simulator Makes me question things Some evaluation details fuzzy No specification of flow sizes (other than shuffle), arrival rate, etc

34 Thoughts [3] Periodicity of 5 seconds Sufficient? Limit? Scalable to higher b/w?

35 Problem Statement Single path forwarding (no flow splitting) Expressed as Binary Integer Programming (BIP) Combinatorial, NP-complete Exact solvers CPLEX/GLPK impractical for realistic networks 1. Capacity Constraint2. Flow Conservation 3. Demand Satisfaction Multi-Commodity Flow problem:

36 Problem Statement Polynomial-time algorithms known for 3-stage Clos Networks (based on bipartite edge- coloring) None for 5-stage Clos (3-tier fat-trees) Need to target arbitrary/general DC topologies! 1. Capacity Constraint2. Flow Conservation 3. Demand Satisfaction


Download ppt "Hedera: Dynamic Flow Scheduling for Data Center Networks Mohammad Al-Fares Sivasankar Radhakrishnan Barath Raghavan Nelson Huang Amin Vahdat Mohammad Al-Fares."

Similar presentations


Ads by Google