1 OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility
Kai Chen, Ankit Singla, Atul Singh, Kishore Ramachandran, Lei Xu, Yueping Zhang, Xitao Wen, Yan Chen
Northwestern University, UIUC, NEC Labs America
USENIX NSDI’12, San Jose, USA

2 Big Data for Modern Applications
Scientific: 200 GB of astronomy data a night
Business: 1 million customer transactions, 2.5 PB of data per hour
Social network: 60 billion photos in its user base, 25 TB of log data per day
Web search: 20 PB of search data per day

We live in a world of big data. Applications and services in scientific research, business, social networking, web search, and other areas generate and process large amounts of data every day. For example, in the Sloan Digital Sky Survey project, a telescope has collected 200 GB of data per night since 2000. Walmart, the world’s largest retailer, handles more than 1 million customer transactions every hour and processes over 2.5 petabytes of data into its databases. Facebook stores 60 billion photos from its user base and generates 25 terabytes of log data per day. Google caches the whole World Wide Web and processes 20 petabytes of data every day to serve users’ search queries. Google is also reported to be building a 1-exabyte storage system. Many other data-intensive applications and services exist at Microsoft, Amazon, Yahoo!, YouTube, and so on.

3 Data Center as Infrastructure
Example: Google’s 36 data centers worldwide

To support such data-intensive applications, data centers are being built around the world for data processing and storage. Shown here are Google’s 36 data centers worldwide.

4 Conventional DCN is Problematic
A DCN structure adapted from Cisco
Core switch: 1:240 (serious communication bottleneck!)
Aggregation switch: 1:5 ~ 1:20
Top-of-Rack (ToR) switch: 1:1
Considerations: bandwidth, wiring complexity, power consumption, network cost
Efficient DCN architecture is desirable, but challenging

This is the conventional DCN structure, adapted from Cisco. It has three tiers of switches. Servers under the same ToR switch have non-blocking bandwidth. Paths that go through the aggregation layer have an oversubscription ratio of 5 to 20, which means that 5 to 20 servers share a 1 Gb/s link. How to design a high-performance DCN has therefore become an important research topic and has motivated a lot of research effort in the community.
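As a rough illustration of what these oversubscription ratios imply, the sketch below computes the worst-case bandwidth left to each server at every tier. It assumes 1 Gb/s server links (the rate mentioned in the notes) and that all servers sharing a tier send at once; the function name and tier labels are illustrative and not taken from the talk.

```python
def worst_case_bandwidth_gbps(nic_gbps: float, oversubscription: float) -> float:
    """Bandwidth per server when every server sharing the tier sends at once."""
    return nic_gbps / oversubscription


# Ratios as shown on the slide; 1 Gb/s per server is an assumption.
tiers = {
    "ToR (1:1)": 1,
    "aggregation (1:5)": 5,
    "aggregation (1:20)": 20,
    "core (1:240)": 240,
}

for name, ratio in tiers.items():
    mbps = worst_case_bandwidth_gbps(1.0, ratio) * 1000
    print(f"{name}: {mbps:.1f} Mb/s per server")
```

Under these assumptions, traffic that has to cross the core is left with only a few Mb/s per server, which is the bottleneck the slide points at.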

5 Recent Efforts and Their Problems
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
Static over-provisioning
High bandwidth, but high wiring complexity, high power, high cost

In the first wave, purely electrical structures such as BCube, DCell, PortLand, and VL2 were proposed at SIGCOMM’08 and ’09. Some of them deliver full bisection bandwidth between servers, but with significant wiring complexity, power consumption, and network building cost, especially when supporting 10 GigE, which is an industry trend and is already deployed by large companies such as Google.

6 Recent Efforts and Their Problems
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
High bandwidth, but high wiring complexity, high power, high cost
Hybrid electrical/optical (semi-flexible): c-Through, Helios [SIGCOMM’10]
Conventional electrical network plus optical links; limited flexibility
Reduced complexity, power and cost, but insufficient bandwidth

To avoid this complexity, hybrid structures such as c-Through and Helios were proposed at SIGCOMM’10. They supplement a traditional electrical network with optics, setting up optical connections between hot ToR pairs. However, their optical interconnect has very limited flexibility: through the optical links, one ToR can connect to only one other ToR at a time, and the capacity of each optical link is fixed once it is set up. We find that this limited flexibility can only provide ~10% of bandwidth for real traffic patterns.

7 Our Effort: OSA
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
High bandwidth, but high wiring complexity, high power, high cost
Hybrid electrical/optical (semi-flexible): c-Through, Helios [SIGCOMM’10]
Reduced complexity, power and cost, but insufficient bandwidth
All-optical (highly flexible): OSA
High bandwidth, and low wiring complexity, low power, low cost
Insight behind OSA: data center traffic exhibits regionality and some stability [IMC’09] [WREN’09] [HotNets’09] [IMC’10] [SIGCOMM’11] [ICDCS’12]
So, we flexibly arrange bandwidth to where it is needed, instead of static over-provisioning!

We therefore propose OSA, a novel all-optical data center architecture that can potentially deliver full bisection bandwidth in a simple and highly flexible way. The insight behind OSA is that ….. This means that while a server can potentially talk to any other server in the network in the long run, during a given time period its communication stays within a subset of servers. Instead of statically provisioning bandwidth everywhere as Fattree does, our idea is to flexibly arrange the bandwidth to where it is needed.

8 OSA’s Flexibility: An Example
OSA can dynamically change its ToR topology and link capacity to adapt to the real demand, thus delivering high bandwidth without static over-provisioning!
[Figure: 8 ToRs, A to H, shown in three configurations with the traffic demand between ToR pairs; labels: change topology (direct link for real demand), demand change, change link capacity (high capacity link for increased demand)]

Before introducing the OSA architecture, I first use an example to show what kind of flexibility OSA has. Suppose this is a hypercube connecting 8 ToRs using 10G links, with the traffic demand shown on the right. For this demand, no matter what routing paths are used on the hypercube, at least one link will be congested. One way to tackle this congestion is to reconnect the ToRs in a different topology. In the new topology, all the communicating ToR pairs are directly connected, so their demand can be perfectly satisfied. Now suppose the traffic demand changes, with a new demand of 20 between F and G. If no adjustment is made, at least one link will be congested; with shortest-path routing, FG will be that link. In this scenario, one way to avoid congestion is to increase the capacity of link FG to 20G at the expense of decreasing the capacity of links FD and GC to 0. Critically, in all three topologies the degree and the capacity of each node remain the same: 3 and 30G, respectively.

As shown above, OSA’s flexibility lies in its flexible topology and link capacity. Without such flexibility, the example would require additional links and capacity to handle both traffic patterns, and certain traffic patterns may even necessitate a non-oversubscribed network. OSA, with its high flexibility, can avoid such non-blocking construction while still providing equivalent performance for various traffic patterns.
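The capacity bookkeeping in this example can be checked with a few lines of code. The sketch below models only the links that touch F and G and confirms that moving capacity onto F-G leaves both ToRs at 30G. Two things are assumptions rather than facts from the talk: the link G-A is inferred from A and G being a communicating pair, and "F-?" is a placeholder for F's third link, which the transcript does not name.

```python
# Link capacities in Gb/s, keyed by unordered ToR pair.
# Only links touching F and G are modelled. "?" is a hypothetical
# placeholder for F's unnamed third neighbour; G-A is an assumption.
before = {
    frozenset("FG"): 10,
    frozenset("FD"): 10,
    frozenset(("F", "?")): 10,
    frozenset("GC"): 10,
    frozenset("GA"): 10,
}


def node_capacity(links, node):
    """Sum of the capacities of all links attached to `node`."""
    return sum(cap for pair, cap in links.items() if node in pair)


# Move 10G from F-D and 10G from G-C onto F-G, as described in the notes.
after = dict(before)
after[frozenset("FG")] += 10
after[frozenset("FD")] -= 10
after[frozenset("GC")] -= 10

for node in "FG":
    print(node, node_capacity(before, node), node_capacity(after, node))
```

The sketch says nothing about C and D, whose capacity must be made up elsewhere in the topology for the per-node total to stay at 30G.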

9 Outline of Presentation
Background and high-level idea
How does OSA achieve such flexibility?
OSA architecture and optimization
Implementation and evaluation
Summary

10 How Do We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N: flexible topology, fixed degree
[Figure: MEMS internals, labeled fiber, imaging lens, MEMS mirror, and reflector]
[Figure: a MEMS switch connecting ToRs A, B, C, D in different arrangements]

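11 How Do We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N: flexible topology, fixed degree
[Figure: MEMS internals, labeled fiber, imaging lens, MEMS mirror, and reflector]
[Figure: a MEMS switch connecting ToRs A, B, C, D in different arrangements]
Wavelength Selective Switch (WSS), 1 × k
[Figure: a WSS taking the wavelengths on its input and sending them to outputs 1 to k]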

12 How Do We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N: flexible topology, fixed degree
Wavelength Selective Switch (WSS), 1 × k: flexible link capacity, fixed node capacity, wavelength uniqueness
[Figure: MEMS and WSS examples over ToRs A, B, C, D]
Other optical devices: optical fiber, WDM (DE)MUX, circulator, coupler
[Figure labels: 100 Terabits; send, receive, bidirectional; MUX, DEMUX, 32 port; 4 port]
Common features:
Support high bit-rate, high capacity
Power-efficient
Small and compact (except MEMS)

13 OSA Architecture Overview
[Figure: Top-of-Rack switch with its send part and receive part, connected to the MEMS (320 ports)]

With all these optical devices in hand, we can now look at the whole architecture. The ToRs divide the network into two parts: below the ToRs are the servers, and above the ToRs is an all-optical interconnect. The optical component above each ToR has a …

14 OSA Architecture Overview
At its core: MEMS (320 ports)
[Figure: ToRs A to H, each with a WSS of k ports, connected through the MEMS]
OSA can arrange any k-regular topology with flexible link capacity among the ToRs!
Each ToR can connect to any k other ToRs
Each link can have flexible capacity

15 OSA Architecture Overview
At its core: MEMS (320 ports)
[Figure: ToRs, each with a WSS of k ports, connected through the MEMS]
Two notes about OSA:
1. Multi-hop routing for indirect ToRs
2. OSA is a container-sized DCN for now

16 Control Plane: Logically Centralized
OSA Manager: optimize the network to better serve the traffic
Topology, link capacity, routing
[Figure label: MEMS (320 ports)]

Given that the OSA architecture can assume any k-regular graph with dynamic link capacities, an important question is how to optimize the network topology and the link capacities. To handle this, we employ a logically centralized OSA manager. Its main objective is to maximize network throughput for the current traffic pattern. Specifically, it talks to the OSM (the MEMS optical switch) to configure the topology, to the WSS units to configure the link capacities, and to the ToRs to configure the routing.

17 Optimization Procedure in OSA Manager
OSA Manager
1. Estimate traffic demand between ToRs (Hedera [NSDI’10])
2. Assign direct links to heavily communicating ToR pairs (maximum k-matching)

The optimization contains 4 steps. First, …

18 Maximum K-matching for Direct Links Setup
ToR traffic demand (matrix over ToRs A to H)
ToR demand graph
Maximum weighted 3-matching: Edmonds’ algorithm [1]
ToR connection graph
[Figure: demand matrix, demand graph, and resulting connection graph over ToRs A to H, with link weights from 1 to 5]

To optimize the network topology, our intention is to localize the traffic. One way to do this is to find the heavily communicating ToR pairs in the demand matrix and set up direct circuit links between them. For this purpose, we build a graph from the traffic demand matrix in which each node is a ToR and each link weight is the demand between the two ToRs. We then want to derive a k-regular graph from this demand graph such that the sum of its link weights is maximized. This problem maps naturally to the maximum weighted k-matching problem, and in our implementation we simply apply Edmonds’ matching algorithm to solve it.

[1] J. Edmonds, “Paths, trees and flowers”, Canad. J. of Math., 1965
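To make the k-matching step concrete, here is a minimal sketch that picks direct links for a small demand graph. It is a simple greedy heuristic, not the Edmonds-based exact algorithm the talk refers to, so it can return a lighter matching than the optimum and may leave some ToRs with fewer than k links. The demand values and the function name are hypothetical.

```python
def greedy_k_matching(demand, k):
    """Pick heavy ToR pairs first while keeping every ToR's degree <= k.

    `demand` maps an unordered ToR pair (u, v) to its traffic demand.
    This is a greedy approximation, not an exact maximum weighted k-matching.
    """
    degree = {}
    chosen = []
    # Heaviest pairs first; ties are broken by pair name for determinism.
    for (u, v), weight in sorted(demand.items(), key=lambda kv: (-kv[1], kv[0])):
        if degree.get(u, 0) < k and degree.get(v, 0) < k:
            chosen.append((u, v, weight))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return chosen


# Hypothetical demands between four ToRs (not the values from the slide).
demand = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 3,
    ("B", "C"): 2, ("B", "D"): 1, ("C", "D"): 1,
}
links = greedy_k_matching(demand, k=2)
print(links)
```

In this run the pair A-D is skipped because A already has two links, which shows the degree constraint at work; an exact algorithm would additionally guarantee that no heavier feasible set of links exists.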

19 Optimization Procedure in OSA Manager
OSA Manager
1. Estimate traffic demand between ToRs (Hedera [NSDI’10])
2. Assign direct links to heavily communicating ToR pairs (maximum k-matching)
3. Compute the routing paths (shortest path routing)
4. Compute the traffic demand on each link
5. Assign wavelengths to provision the link bandwidth (edge-coloring theory)
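Steps 3 and 4, and the wavelength count that feeds step 5, can be sketched as follows: route each ToR-to-ToR demand along a hop-count shortest path, add its volume to every link on that path, and round each link's load up to a whole number of wavelengths. The topology, the demands, and the 10 Gb/s-per-wavelength figure are assumptions chosen for illustration; the sketch also assumes the topology is connected.

```python
import math
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search; returns one hop-count-shortest path as a node list.

    Assumes dst is reachable from src.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(adj[node]):  # sorted for a deterministic tie-break
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def link_loads(adj, demand):
    """Accumulate each demand onto every link of its shortest path."""
    loads = {}
    for (src, dst), gbps in demand.items():
        path = shortest_path(adj, src, dst)
        for u, v in zip(path, path[1:]):
            key = tuple(sorted((u, v)))
            loads[key] = loads.get(key, 0) + gbps
    return loads


# Hypothetical 4-ToR ring and demands in Gb/s.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
demand = {("A", "B"): 10, ("A", "C"): 10, ("C", "D"): 5}

loads = link_loads(adj, demand)
GBPS_PER_WAVELENGTH = 10  # assumption: each wavelength carries 10 Gb/s
wavelengths = {link: math.ceil(g / GBPS_PER_WAVELENGTH) for link, g in loads.items()}
print(loads)
print(wavelengths)
```

Here the A-to-C demand is routed through B, so link A-B carries two demands and needs two wavelengths while the other loaded links need one each.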

20 Edge-coloring for Wavelength Assignment
Wavelength assignment: a wavelength cannot be associated with a ToR twice
Edge-coloring: a color cannot be associated with a node twice
Vizing’s theorem [2]
[Figure: expected wavelength graph over ToRs A to H, and the multigraph based on the number of wavelengths, e.g., from F’s perspective]

Given the required capacity on each link, we need to provision a corresponding number of wavelengths. However, wavelength assignment is not arbitrary: because of contention, a wavelength can be assigned to a ToR at most once. Furthermore, we want to minimize the total number of wavelengths used in the whole network. Under these constraints, we reduce the problem to edge coloring on a multigraph. First, we represent the ToR-level graph as a multigraph in which the number of parallel edges between two nodes equals the number of wavelengths between them. Second, we give each wavelength a unique color. A feasible wavelength assignment is then equivalent to an assignment of colors to the edges of the multigraph such that no two adjacent edges have the same color, which is exactly the edge-coloring problem. Edge coloring is a well-explored problem in graph theory; fast algorithms such as Vizing’s algorithm are known, and libraries implementing them are publicly available. As systems researchers, our contribution here is not to develop new algorithms or theories, but to identify and apply these simple yet deep and elegant theories to solve real-world problems and design efficient networked systems.

[2] J. Misra, et al., “A constructive proof of Vizing’s Theorem,” Inf. Process. Lett., 1992.
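The reduction to edge coloring can be sketched in a few lines. The code below uses a plain greedy coloring: each parallel edge gets the lowest wavelength index that neither endpoint already uses. This is a stand-in for the constructive Vizing-style algorithm cited on the slide, so it always produces a valid assignment but may use more wavelengths than necessary. The per-link wavelength counts are hypothetical.

```python
def greedy_edge_coloring(wavelengths_needed):
    """Give every parallel edge the lowest wavelength free at both endpoints.

    `wavelengths_needed` maps a ToR pair (u, v) to the number of wavelengths
    (parallel edges in the multigraph) required on that link.
    """
    used = {}        # ToR -> set of wavelength indices already in use there
    assignment = {}  # link -> list of wavelength indices assigned to it
    for (u, v), count in sorted(wavelengths_needed.items()):
        used.setdefault(u, set())
        used.setdefault(v, set())
        assignment[(u, v)] = []
        for _ in range(count):
            color = 0
            while color in used[u] or color in used[v]:
                color += 1
            used[u].add(color)
            used[v].add(color)
            assignment[(u, v)].append(color)
    return assignment


# Hypothetical wavelength counts per link (not the values from the slide).
needed = {("F", "G"): 2, ("F", "D"): 1, ("G", "C"): 1, ("C", "D"): 2}
assignment = greedy_edge_coloring(needed)
print(assignment)
```

Because every wavelength index is recorded at both endpoints as soon as it is assigned, no ToR ever sees the same wavelength twice, which is the contention constraint stated above.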

21 Optimization Procedure in OSA Manager
OSA Manager
1. Estimate traffic demand between ToRs
2. Assign direct links to heavily communicating ToR pairs (topology, MEMS)
3. Compute the routing paths (routing, ToR)
4. Compute the traffic demand on each link
5. Assign wavelengths to provision the link bandwidth (link capacity, WSS)

22 Prototype Implementation
1 MEMS (32 ports: 16×16)
8 WSS units (1×4 ports)
8 ToRs* and 32 servers
*Server-emulated ToR
[Figure: prototype with the MEMS and WSS labeled; plot of the theoretical curve against the experiment curve]
Experiment results strictly follow the expectation: demonstrates the feasibility of the OSA design!

The prototype is built with all real optical devices and server-emulated ToRs. It looks messy because of the lab setup; in practice, all the devices are very compact and can be packed onto the ToRs. Moreover, the wiring is a one-time setup: when upgrading to 10G, 40G, 100G…, there is no need to touch the network, because the devices are bit-rate …

23 Simulation Results (2560 servers*)
OSA can be close to non-blocking: 85%, 90%, ~100%, 80%
OSA is significantly better than hybrid: 3.86X, 3.1X, 3.54X, 3X
Demonstrates the high performance of the OSA design!
*80 ToRs (each with 32 servers) form a 4-regular graph for OSA.

24 Cost, Power & Wiring (2560 Servers)
OSA is slightly better than hybrid
OSA is significantly better than Fattree
Demonstrates that OSA can potentially deliver high bandwidth in a simple, power-efficient and cost-effective way!

25 Summary and Caveats
From static, “fat” to flexible, “thin”: Fattree, Hybrid, OSA
[Table: Fattree, Hybrid, and OSA compared on performance, complexity, power, and cost]
OSA is inspired by traffic regionality and stability
Sweet spot for performance, cost, power, and wiring complexity
Caveats: not intended for all-to-all, non-stable traffic

26 Thanks!

27 Data Center Traffic Characteristics
[IMC’09][HotNets’09]: only a few ToRs are hot, and most of their traffic goes to a few other ToRs
[IMC’10]: traffic at ToRs exhibits an ON/OFF pattern
[SIGCOMM’09]: over 90% of bytes flow in elephant flows
[WREN’10]: 60% of ToRs see less than 20% change in traffic volume over periods of seconds
[ICDCS’12]: production DCN traffic shows stability even on an hourly time scale
Static full bisection bandwidth between all servers at all times is a waste of resources!

28 Circuit Switch vs Packet Switch
Optical Circuit Switch: circuit switching; $500/port; rate free; 0.24mW/port; ~10ms circuit switching latency
Electrical Packet Switch (10G): store and forward; $500/port; 10Gb/s fixed rate; 12.5W/port; per-packet switching

29 Cost and Power

30 Data Sending
[Figure label: MEMS (320 ports)]
Let’s see how the data is transmitted in the network.

31 Data Receiving
[Figure label: MEMS (320 ports)]

32 Multi-hop Routing
[Figure labels: MEMS (320 ports); O-E-O; sub-nanosecond]

33 The effect of dynamic topology and link capacity

34 The effect of reconfiguration

