1 Approaches to Improve Data Center Performance through Networking - Gurubaran

2 Outline: Data Center Architectures; Common Issues; Improving Performance via Multipath TCP; Improving Performance through Multicast Routing; Conclusion

3 Data Center Architectures The main challenge is how to build a scalable DCN that delivers significant aggregate bandwidth. Examples: BCube, DCell, PortLand, VL2, Helios, and c-Through.

4 Server-Centric Data Centers Servers act not only as end hosts but also as relay nodes for multihop communication. Example: BCube. A BCube0 is simply n servers connected to an n-port switch; a BCube1 is constructed from n BCube0s and n n-port switches.

5 Server-Centric Data Centers In BCube, two servers are neighbors (1) if they connect to the same switch, and (2) if and only if their address arrays differ in exactly one digit. BCube builds its routing path by “correcting” one digit per hop from the source to the destination.
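As an illustration of this digit-correcting rule, here is a minimal sketch (it assumes a server address is represented as an array of digits, one per BCube level; this is not BCube's actual code):

```python
def bcube_route(src, dst):
    """Sketch of BCube-style single-path routing: starting from the source
    address, correct one differing digit per hop until the destination is
    reached. Addresses are equal-length digit arrays (lists of ints)."""
    path = [tuple(src)]
    current = list(src)
    # Correct digits from the highest level down. Any fixed order works for
    # this illustration; BCube permutes the order to obtain parallel paths.
    for i in reversed(range(len(src))):
        if current[i] != dst[i]:
            current[i] = dst[i]          # one digit corrected = one hop
            path.append(tuple(current))  # neighbor reached via a level-i switch
    return path

# Example: in a BCube1 with 4-port switches (n=4, k=1),
# route from server (0, 0) to server (3, 1).
print(bcube_route([0, 0], [3, 1]))  # [(0, 0), (0, 1), (3, 1)]
```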

6 Server-Centric Data Centers: BCube

7 Switch-Centric Data Centers In switch-centric DCNs, switches are the only relay nodes; PortLand and VL2 belong to this category. Generally, they use a special instance of a Clos topology called a fat tree to interconnect commodity Ethernet switches.

8 Switch-Centric Data Centers PortLand includes core, aggregation, and edge switches. A Pseudo MAC (PMAC) address encodes the location of the host as 48 bits: pod.position.port.vmid. Pod (16 bits): pod number of the edge switch; Position (8 bits): position in the pod; Port (8 bits): the switch port the host connects to; Vmid (16 bits): VM id of the host.
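A small sketch of packing and unpacking this 48-bit layout (it illustrates only the field widths given above; the helper names are hypothetical and this is not PortLand's implementation):

```python
def pack_pmac(pod, position, port, vmid):
    """Pack the four PMAC fields into a 48-bit integer:
    pod (16 bits) . position (8 bits) . port (8 bits) . vmid (16 bits)."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def unpack_pmac(pmac):
    """Recover (pod, position, port, vmid) from a 48-bit PMAC."""
    return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
            (pmac >> 16) & 0xFF, pmac & 0xFFFF)

pmac = pack_pmac(pod=5, position=2, port=1, vmid=42)
print(hex(pmac), unpack_pmac(pmac))  # 0x50201002a (5, 2, 1, 42)
```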

9 Switch-Centric Data Centers PortLand switches forward a packet based on its destination PMAC address.

10 Common Issues The top-level switches are the bandwidth bottleneck, so high-end, high-speed switches have to be used. Moreover, a high-level switch is a single point of failure for its subtree branch. Using redundant switches does not fundamentally solve the problem and incurs even higher cost.

12 Improving performance via Multipath TCP

13 Improving Performance via Multipath TCP Datacenter applications are distributed across thousands of machines, and we want any machine to be able to play any role. To achieve this: use dense, parallel datacenter topologies and map each flow to a path. Problem: naive random allocation gives poor performance, and improving it adds complexity.

14 The Two Key Questions MPTCP can greatly improve performance in today’s data centers. Under which circumstances does it do so, how big are the benefits, and on what do they depend? If MPTCP were deployed, how could data centers be designed differently in the future to take advantage of its capabilities?

15 Main Components of a data center networking architecture: the physical topology; routing over the topology; selection between the multiple paths supplied by routing; and congestion control of traffic on the selected paths.

16 Topology The denseness of interconnection these topologies provide poses its own problem: we must determine how traffic should be routed across the many available paths.

17 Fat Tree Topology [Fares et al., 2008; Clos, 1953] (figure: K=4, 1 Gbps links; aggregation switches, K pods with K switches each, racks of servers)
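The element counts of a k-ary fat tree follow directly from this construction; a quick back-of-the-envelope sketch (assuming the standard k-pod layout, with the K=4 case matching the figure):

```python
def fat_tree_sizes(k):
    """Element counts for a k-ary fat tree (k pods, k switches per pod)."""
    edge = agg = k * (k // 2)        # k pods, k/2 edge + k/2 aggregation each
    core = (k // 2) ** 2             # core switches
    hosts = k * (k // 2) * (k // 2)  # each edge switch serves k/2 hosts
    return {"pods": k, "edge": edge, "aggregation": agg,
            "core": core, "hosts": hosts}

print(fat_tree_sizes(4))
# {'pods': 4, 'edge': 8, 'aggregation': 8, 'core': 4, 'hosts': 16}
```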

19 Collisions

20 Single-path TCP collisions reduce throughput

21 Routing Dense interconnection topologies provide many possible parallel paths between each pair of hosts, and the routing system must spread traffic across these paths. The simplest solution is randomized load balancing: if each switch uses a link-state routing protocol to provide ECMP forwarding, then, based on a hash of the five-tuple in each packet, flows are split roughly equally across equal-length paths, as sketched below.
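A simplified sketch of this hash-based splitting (illustrative only; real switches hash in hardware, and this is not any particular vendor's scheme):

```python
import hashlib

def ecmp_next_hop(five_tuple, equal_cost_paths):
    """Pick one of several equal-cost next hops by hashing the flow's
    five-tuple, so all packets of a flow follow the same path."""
    key = "|".join(str(f) for f in five_tuple).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return equal_cost_paths[digest % len(equal_cost_paths)]

flow = ("10.0.1.2", "10.0.3.4", 6, 45012, 80)  # src, dst, proto, sport, dport
paths = ["agg-0", "agg-1", "agg-2", "agg-3"]
print(ecmp_next_hop(flow, paths))  # the same flow always maps to the same uplink
```

Because the mapping is per-flow rather than per-packet, two large flows can still hash onto the same path and collide, which is exactly the problem discussed next.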

22 Path Selection ECMP or multiple VLANs provide the basis for randomized load balancing as the default path selection mechanism. Randomized load balancing cannot achieve the full bisection bandwidth and is not fair; it allows hot spots to develop. To address these issues, the use of a centralized flow scheduler has been proposed, but it relies on assumptions about the traffic (a small number of long-lived large flows); a scheduler running every 500 ms performs similarly to randomized load balancing when these assumptions do not hold.

23 Collision

26 Not fair

30 No matter how you do it, mapping each flow to a path is the wrong goal

31 Instead, pool capacity from different links

35 Multipath Transport

36 Multipath Transport can pool datacenter networks – Instead of using one path for each flow, use many random paths – Don’t worry about collisions. – Just don’t send (much) traffic on colliding paths

37 Multipath TCP Primer [IETF MPTCP WG] MPTCP is a drop-in replacement for TCP; it spreads application data over multiple subflows.

38 Congestion Control MPTCP can establish multiple subflows on different paths between the same pair of endpoints for a single TCP connection. “By linking the congestion control dynamics on these multiple subflows, MPTCP can explicitly move traffic off more congested paths and place it on less congested ones”

39 Congestion Control Given sufficiently many randomly chosen paths, MPTCP will find at least one good unloaded path and move most of its traffic that way. This relieves congestion on links that got more than their fair share of ECMP-balanced flows and allows the competing flows to achieve their full potential, maximizing the bisection bandwidth of the network and also improving fairness.

40 Congestion Control Each MPTCP subflow has its own sequence space, and each subflow maintains its own congestion window. For each ACK on subflow r, increase the window w_r by min(a/w_total, 1/w_r). For each loss on subflow r, decrease the window w_r by w_r/2, as sketched below.
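A toy sketch of this coupled increase/decrease rule (windows in packets, `a` as the coupling parameter named on the slide; slow start, timeouts, and the full protocol details are omitted):

```python
class MptcpCoupledCwnd:
    """Toy model of MPTCP's linked-increase congestion control: each subflow r
    keeps its own window w_r, but increases are coupled through the total
    window, so traffic gradually shifts onto less congested paths."""

    def __init__(self, num_subflows, a=1.0, initial_window=10.0):
        self.a = a
        self.w = [float(initial_window)] * num_subflows

    def on_ack(self, r):
        # Per ACK on subflow r: increase w_r by min(a / w_total, 1 / w_r).
        w_total = sum(self.w)
        self.w[r] += min(self.a / w_total, 1.0 / self.w[r])

    def on_loss(self, r):
        # Per loss on subflow r: halve w_r, as standard TCP would.
        self.w[r] = max(self.w[r] / 2.0, 1.0)

cc = MptcpCoupledCwnd(num_subflows=2)
for _ in range(100):
    cc.on_ack(0)       # subflow 0 sees ACKs and grows
cc.on_loss(1)          # subflow 1 sees a loss and backs off
print([round(w, 2) for w in cc.w])
```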

41 Multipath TCP: Congestion Control [NSDI, 2011]

42 MPTCP better utilizes the FatTree network

43 MPTCP on EC2 Amazon EC2: infrastructure as a service – We can borrow virtual machines by the hour – These run in Amazon data centers worldwide – We can boot our own kernel A few availability zones have multipath topologies – 2-8 paths available between hosts not on the same machine or in the same rack – Available via ECMP

44 Amazon EC2 Experiment 40 medium CPU instances running MPTCP For 12 hours, they sequentially ran all-to-all iperf cycling through: – TCP – MPTCP (2 and 4 subflows)

45 MPTCP improves performance on EC2 (figure: flow throughputs, including a "Same Rack" series)

46 Analysis Examine how MPTCP performs in a range of topologies and with a varying number of subflows

47 What do the benefits depend on? How many subflows are needed? How does the topology affect results? How does the traffic matrix affect results?

48 At most 8 subflows are needed (figure: total throughput, with TCP for comparison)

49 MPTCP improves fairness in VL2 topologies. Fairness is important: jobs finish when the slowest worker finishes.

50 MPTCP improves throughput and fairness in BCube

51 Oversubscribed Topologies To saturate the full bisection bandwidth: there must be no traffic locality; all hosts must send at the same time; and host links must not be bottlenecks. It therefore makes sense to under-provision the network core, and this is what happens in practice. Does MPTCP still provide benefits?

52 Performance improvements depend on the traffic matrix (figure: as load increases, the network moves from underloaded to overloaded, with a sweet spot in between)

53 What is an optimal datacenter topology for multipath transport?

54 In single-homed topologies: host links are often bottlenecks, and ToR switch failures wipe out tens of hosts for days. Multi-homing servers is the obvious way forward.

55 Fat Tree Topology

56 (figure: servers, ToR switch, upper pod switch)

57 Dual-Homed Fat Tree Topology (figure: servers, ToR switches, upper pod switch)

58 Is DHFT any better than a fat tree? Not for traffic matrices that fully utilize the core. Let’s examine random traffic patterns – other traffic matrices are covered in the paper.

59 DHFT provides significant improvements when the core is not overloaded (figure: core-overloaded vs. core-underloaded regimes)

60 Improving performance through Multicast Routing

61 Introduction Multicast benefits group communication by saving network traffic and improving application throughput. The technical trend of future data center design poses new challenges for efficient and scalable Multicast routing.

62 Challenges Densely connected networks make traditional receiver-driven Multicast routing protocols inefficient in Multicast tree formation. It is also difficult for the low-end switches used in data centers to hold the routing entries of massive numbers of Multicast groups.

63 Approach Use a source-to-receiver expansion approach to build efficient Multicast trees, excluding many unnecessary intermediate switches used in receiver-driven Multicast. For scalable Multicast routing, combine in-packet Bloom Filters and in-switch entries to trade off the number of Multicast groups supported against the additional bandwidth overhead.

64 General Multicast

65 Multicast Trees

66 Multicast Tree Formation Building a lowest-cost Multicast tree covering a given set of nodes on a general graph is the well-known Steiner Tree problem. The problem is NP-hard, and many approximation algorithms exist.

67 Multicast Tree Formation in Data Centers For data center Multicast, BCube proposes an algorithm to build server-based Multicast trees, with switches used only as dummy crossbars. However, network-level Multicast involving the switches can save much more bandwidth than the server-based approach. In VL2, traditional IP Multicast protocols are used for tree building.

68 Scalable Multicast Routing For data center networks that use low-end switches with limited routing space, scalable Multicast routing is very challenging. One possible solution is to aggregate a number of Multicast routing entries into a single one, as is done in Unicast. Bloom Filters can also be used to compress in-switch Multicast routing entries.

69 Scalable Multicast Routing Alternatively, encode the tree information into an in-packet Bloom Filter, so there is no need to install any Multicast routing entries in network equipment; however, the in-packet Bloom Filter field brings a network bandwidth cost. In this paper, they achieve scalable Multicast routing by trading off the number of Multicast groups supported against the additional bandwidth overhead.

70 Bloom Filters A Bloom filter is a probabilistic data structure designed to tell, rapidly and memory-efficiently, whether an element is present in a set. It reports either that the element is definitely not in the set or that it may be in the set.
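A minimal Bloom filter sketch showing the "definitely not in the set" vs. "may be in the set" behavior (illustrative only; the sizes and hash choices here are arbitrary, and the paper's in-packet filters encode tree link IDs rather than strings):

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions. Membership tests can return false
    positives but never false negatives."""

    def __init__(self, num_bits=128, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
for link in ["v0->w0", "w0->v1", "v1->w5"]:   # e.g., Multicast tree links
    bf.add(link)
print(bf.might_contain("w0->v1"))   # True (in the set)
print(bf.might_contain("v4->w1"))   # False, or True with small probability
```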

71 Efficient Multicast Tree Building: The Problem Densely connected data center networks imply a large number of candidate trees for a group. Given multiple equal-cost paths between servers/switches, it is undesirable to run traditional receiver-driven Multicast routing protocols such as PIM for tree building. Why? Because independent path selection by receivers can result in many unnecessary intermediate links.

72 BCube Example

73 Assume the receiver set is {v5, v6, v9, v10} and the sender is v0. Using receiver-driven Multicast routing, the resultant Multicast tree can have 14 links as follows (representing the tree as the paths from the sender to each receiver):
v0 -> w0 -> v1 -> w5 -> v5
v0 -> w4 -> v4 -> w1 -> v6
v0 -> w4 -> v8 -> w2 -> v9
v0 -> w0 -> v2 -> w6 -> v10

74 BCube Example However, an efficient Multicast tree for this case includes only 9 links if we construct it in the following way:
v0 -> w0 -> v1 -> w5 -> v5
v0 -> w0 -> v2 -> w6 -> v6
v0 -> w0 -> v1 -> w5 -> v9
v0 -> w0 -> v2 -> w6 -> v10
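The two link counts can be verified by counting distinct directed edges over the listed sender-to-receiver paths (a small check using exactly the paths above):

```python
def count_links(paths):
    """Count distinct directed edges over a set of sender-to-receiver paths."""
    edges = set()
    for path in paths:
        hops = path.split(" -> ")
        edges.update(zip(hops, hops[1:]))  # consecutive hop pairs are edges
    return len(edges)

receiver_driven = ["v0 -> w0 -> v1 -> w5 -> v5", "v0 -> w4 -> v4 -> w1 -> v6",
                   "v0 -> w4 -> v8 -> w2 -> v9", "v0 -> w0 -> v2 -> w6 -> v10"]
source_driven   = ["v0 -> w0 -> v1 -> w5 -> v5", "v0 -> w0 -> v2 -> w6 -> v6",
                   "v0 -> w0 -> v1 -> w5 -> v9", "v0 -> w0 -> v2 -> w6 -> v10"]
print(count_links(receiver_driven), count_links(source_driven))  # 14 9
```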

75 The Approach: Continued Receivers send join/leave requests to a Multicast Manager. The Multicast Manager then calculates the Multicast tree based on the data center topology and group membership distribution. Data center topologies are regular graphs, and the Multicast Manager can easily maintain the topology information (with failure management). The problem then becomes how to calculate an efficient Multicast tree on the Multicast Manager.

76 Source-Driven Tree Building Recently proposed data center architectures (BCube, PortLand, VL2) use several levels of switches for server interconnection, and switches within the same level are not directly connected; hence, they are multistage graphs. Group spanning graph: the possible paths from the Multicast source to all receivers can be expanded as a directed multistage graph with d+1 stages. For example, the sender is v0 and the receiver set is {v1, v5, v9, v10, v11, v12, v14}.

77 Group Spanning Graph: BCube

78 Source-Driven Tree Building A covers B: for any two node sets A and B in a group spanning graph, A covers B if and only if, for each node j ∈ B, there exists a directed path from some node i ∈ A to j in the group spanning graph. A strictly covers B: if A covers B and no proper subset of A covers B, then A strictly covers B.
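Since "A covers B" is just reachability in the directed group spanning graph, it can be sketched as follows (the adjacency-list representation and helper names are assumptions for illustration, not the paper's code):

```python
from collections import deque

def covers(graph, A, B):
    """Return True if node set A covers node set B: every node in B is
    reachable from some node in A along directed edges of `graph`."""
    reachable = set(A)
    queue = deque(A)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    return set(B) <= reachable

def strictly_covers(graph, A, B):
    """A strictly covers B if A covers B but removing any single node from A
    breaks coverage (equivalent to 'no proper subset covers B', since
    coverage is monotone in A)."""
    A = list(A)
    if not covers(graph, A, B):
        return False
    return all(not covers(graph, A[:i] + A[i+1:], B) for i in range(len(A)))

# Tiny example: w0 reaches v1 and v2; w4 reaches v4.
g = {"w0": ["v1", "v2"], "w4": ["v4"]}
print(covers(g, {"w0", "w4"}, {"v1", "v4"}))          # True
print(strictly_covers(g, {"w0", "w4"}, {"v1", "v4"})) # True
print(strictly_covers(g, {"w0", "w4"}, {"v1", "v2"})) # False: w4 is redundant
```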

79 Source-Driven Tree Building They propose to build the Multicast tree in a source-to-receiver expansion manner upon the group spanning graph, with the tree node set at each stage strictly covering the downstream receivers. The merits are twofold: 1) many unnecessary intermediate switches used in receiver-driven Multicast routing are eliminated; 2) the source-to-receiver latency is bounded by the number of stages of the group spanning graph, i.e., the diameter of the data center topology, which favors delay-sensitive applications such as redirecting search queries to indexing servers.

80 Source-Driven Tree Building: BCube The tree node selection in a BCube network can be conducted iteratively on the group spanning graph. For a BCube(n,k) with sender s, first select the set of servers from stage 2 of the group spanning graph that are covered by both s and a single switch in stage 1. Assume the server set in stage 2 is E and the switch selected in stage 1 is W. The tree node set for BCube(n,k) is the union of the tree node sets for |E|+1 BCube(n,k-1)s. |E| of the BCube(n,k-1)s each have a server in E as the source p, and the receivers in stage 2*(k+1) covered by p as the receiver set.

81 Source-Driven Tree Building: BCube The other BCube(n,k-1) has s as the source and, as the receiver set, the receivers in stage 2*k that are covered by s but not covered by W. In the same way, the tree node set in each BCube(n,k-1) is obtained by dividing it into several BCube(n,k-2)s. The process iterates until all the BCube(n,0)s are obtained. Hence, the computation complexity is O(N), where N is the total number of servers in BCube.

82 Dynamic Receiver Join Dynamic receiver join/leave does not change the source-to-end paths of other receivers in the group. When a new receiver rj joins an existing group in a BCube(n,k), first recompute the group spanning graph to include rj. Then, in the group spanning graph, check whether there is a BCube(n,0) used when calculating the previous Multicast tree that can hold rj. If so, add rj to that BCube(n,0). Otherwise, try to find a BCube(n,1) used when calculating the previous Multicast tree that can hold rj, and add to it a BCube(n,0) containing rj. If no such BCube(n,1) can be found, try to find a BCube(n,2) and add a corresponding BCube(n,1), and so forth, until rj is successfully added to the Multicast tree. In this way, the final tree obeys the proposed tree-building algorithm, and there is no need to change the source-to-end paths of existing receivers in the Multicast group.

83 Dynamic Receiver Leave When a receiver rl leaves the group in a BCube(n,k), regenerate the group spanning graph by eliminating rl. Then, if the deletion of rl leaves zero BCube(n,m-1)s in a BCube(n,m) of the group spanning graph used when calculating the previous Multicast tree, eliminate the nodes of that BCube(n,m). This process does not change the source-to-end paths of other receivers, either.

84 Results: Number of Links

85 Computation Time

86 Bandwidth Overhead Ratio The ratio of the additional traffic caused by the in-packet Bloom Filter to the actual payload traffic to be carried. Assume the packet length (including the Bloom Filter field) is p, the length of the in-packet Bloom Filter field is f, the number of links in the Multicast tree is t, and the number of links actually covered by Bloom Filter based forwarding is c; the ratio then follows from these quantities, as sketched below.
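The slide's formula image is not preserved in this transcript; one plausible reconstruction from the definitions above, treating t(p − f) as the payload traffic the tree must carry and c·p as the traffic actually sent, is:

```latex
% Plausible reconstruction from the stated definitions (the original slide's
% formula is not preserved): payload traffic = t (p - f), traffic sent = c p.
\[
  \text{overhead ratio} \;=\; \frac{c\,p \;-\; t\,(p - f)}{t\,(p - f)}
\]
```

Under this reading, the overhead grows both with false-positive forwarding (c exceeding t) and with the Bloom Filter field length f, which is consistent with the inferences on the next slide.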

87 Bandwidth Overhead Ratio To reduce the bandwidth overhead of the in-packet Bloom Filter, either control the false-positive ratio during packet forwarding or limit the size of the Bloom Filter field. Inferences: when the Bloom Filter is shorter than the optimal length, false-positive forwarding is the major contributor to the bandwidth overhead ratio; when it grows larger than the optimal length, the Bloom Filter field itself dominates the bandwidth overhead.

88 Bandwidth Overhead Ratio

89 Summary “One flow, one path” thinking has constrained datacenter design – collisions, unfairness, limited utilization. Multipath transport enables resource pooling in datacenter networks: it improves throughput, fairness, and robustness. “One flow, many paths” frees designers to consider topologies that offer improved performance for similar cost. Receiver-driven Multicast routing does not perform well in densely connected data center networks. For scalable Multicast routing on low-end data center switches, combine in-packet Bloom Filters and in-switch entries; this can save 40%–50% of network traffic.

