1 OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility
Kai Chen, Ankit Singla, Atul Singh, Kishore Ramachandran, Lei Xu, Yueping Zhang, Xitao Wen, Yan Chen
Northwestern University, UIUC, NEC Labs America
USENIX NSDI’12, San Jose, USA
2 Big Data for Modern Applications
- Scientific: 200GB of astronomy data a night
- Business: 1 million customer transactions, 2.5PB of data per hour
- Social network: 60 billion photos in its user base, 25TB of log data per day
- Web search: 20PB of search data per day

We now live in a world of big data. Applications and services in scientific research, business, social networking, web search and other areas generate and process large amounts of data every day. For example, in the Sloan Digital Sky Survey project, a telescope has collected 200GB of data per night since 2000. Walmart, the world's largest retailer, handles more than 1 million customer transactions every hour and loads over 2.5 petabytes of data into its databases. Facebook stores 60 billion photos from its user base and generates 25 terabytes of log data per day. Google caches the entire World Wide Web and has to process 20 petabytes of data every day to handle Internet searches from users. Google is also reported to be building a 1-exabyte storage system. Many other data-intensive applications and services exist at Microsoft, Amazon, Yahoo!, YouTube and so on.
3 Data Center as Infrastructure
To support such data-intensive applications, data centers are being built around the world for data processing and storage.
[Figure: Example of Google’s 36 worldwide data centers]
4 Conventional DCN is Problematic
[Figure: A DCN structure adapted from Cisco. Oversubscription by layer: core switch 1:240, aggregation switch 1:5 to 1:20, Top-of-Rack (ToR) switch 1:1.]
Serious communication bottleneck!
Considerations:
- Bandwidth
- Wiring complexity
- Power consumption
- Network cost
- …
Efficient DCN architecture is desirable, but challenging

This is the conventional DCN structure, adapted from Cisco. It has three tiers of switches. Servers under the same ToR have non-blocking bandwidth. Paths that go through the aggregation layer have an oversubscription ratio of 5 to 20, which means that 5 to 20 servers share a 1Gb/s link. How to design a high-performance DCN has therefore become an important research topic and has motivated a lot of research effort in the community.
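To make the oversubscription ratios concrete, here is a minimal sketch of the worst-case bandwidth each server gets when all servers send across a given layer at once. The function name and the 1 Gb/s server NIC speed are assumptions for illustration; only the ratios come from the slide.

```python
def per_server_bandwidth_gbps(nic_gbps: float, oversubscription: float) -> float:
    """Worst-case share per server when every server sends across the layer at once."""
    if oversubscription < 1:
        raise ValueError("oversubscription ratio must be at least 1")
    return nic_gbps / oversubscription


# Ratios quoted on the slide; the 1 Gb/s NIC speed is an assumed value.
LAYERS = [
    ("ToR (1:1)", 1),
    ("aggregation, best case (1:5)", 5),
    ("aggregation, worst case (1:20)", 20),
    ("core (1:240)", 240),
]

for name, ratio in LAYERS:
    mbps = per_server_bandwidth_gbps(1.0, ratio) * 1000
    print(f"{name}: {mbps:.1f} Mb/s per server")
```

Under these assumptions, traffic that has to cross the core is limited to a few Mb/s per server, which is the bottleneck the slide points at.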
5 Recent Efforts and Their Problems
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
- Static over-provisioning
- High bandwidth, but high wiring complexity, high power, high cost
[Figure labels: Fattree, BCube, CLUE]

In the first wave, pure electrical structures such as BCube, DCell, PortLand and VL2 were proposed at SIGCOMM’08 and ’09. Some of them deliver full bisection bandwidth between servers, though with significant wiring complexity, power consumption and network building cost, especially when supporting 10 GigE, which is an industry trend and is already deployed by large companies such as Google.
6 Recent Efforts and Their Problems
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
- High bandwidth, but high wiring complexity, high power, high cost
Hybrid electrical/optical (semi-flexible): c-Through, Helios [SIGCOMM’10]
- Reduced complexity, power and cost, but insufficient bandwidth
- Limited flexibility
[Figure: c-Through, a conventional electrical network supplemented with optical links]

Then, to avoid this complexity, hybrid structures such as c-Through and Helios were proposed at SIGCOMM’10. They supplement the traditional electrical network with optics: optical connections are set up between hot ToR pairs. However, their optical interconnect has very limited flexibility. As you can see, through the optical links one ToR can connect to only one other ToR at a time. Furthermore, the capacity of these optical links is fixed once they are constructed. We find that this limited flexibility can provide only ~10% of bandwidth for real traffic patterns.
7 Our Effort: OSA
All-electrical (static): Fattree, BCube, VL2, PortLand [SIGCOMM’08 ’09]
- High bandwidth, but high wiring complexity, high power, high cost
Hybrid electrical/optical (semi-flexible): c-Through, Helios [SIGCOMM’10]
- Reduced complexity, power and cost, but insufficient bandwidth
All-optical (highly flexible): OSA
- High bandwidth, and low wiring complexity, low power, low cost
Insight behind OSA:
- Data center traffic exhibits regionality and some stability [IMC’09] [WREN’09] [HotNets’09] [IMC’10] [SIGCOMM’11] [ICDCS’12]
- So, we flexibly arrange bandwidth to where it is needed, instead of static over-provisioning!

We therefore propose OSA, a novel all-optical data center architecture that can potentially deliver full bisection bandwidth in a simple and highly flexible way. The insight behind OSA is that … This means that while a server can potentially talk to any other server in the network in the long run, during a given time period its communication is within a subset of servers. Instead of statically provisioning bandwidth everywhere as Fattree does, our idea is to flexibly arrange the bandwidth to where it is needed.
8 OSA’s Flexibility: An Example
[Figure: Eight ToRs (A to H) in three configurations. Starting from a hypercube, a topology change provides direct links for the real demand; after a demand change, a link capacity change provides a high-capacity link for the increased demand. Traffic demand is shown for ToR pairs A-G, B-H, C-E and D-F.]
OSA can dynamically change its ToR topology and link capacity to adapt to the real demand, thus delivering high bandwidth without static over-provisioning!

Before introducing the architecture of OSA, I first use an example to show what kind of flexibility OSA has. Suppose this is a hypercube connecting 8 ToRs using 10G links, with the traffic demand shown on the right. For this demand, no matter what routing paths are used on this hypercube, at least one link will be congested. One way to tackle this congestion is to reconnect the ToRs using a different topology. In the new topology, all the communicating ToR pairs are directly connected and their demand can be perfectly satisfied.

Now suppose the traffic demand changes, with a new demand of 20 between F and G. If no adjustment is made, at least one link will face congestion; with shortest path routing, FG will be that link. In this scenario, one way to avoid congestion is to increase the capacity of link FG to 20G at the expense of decreasing the capacity of links FD and GC to 0. Critically, note that in all three topologies the degree and the capacity of each node remain the same, i.e., 3 and 30G respectively.

As the example shows, OSA’s flexibility lies in its flexible topology and link capacity. In the absence of such flexibility, the example would require additional links and capacity to handle both traffic patterns. Certain traffic patterns may even necessitate a non-oversubscribed network. OSA, with its high flexibility, can avoid such a non-blocking construction while still providing equivalent performance for various traffic patterns.
9 Outline of Presentation
- Background and high-level idea
- How does OSA achieve such flexibility?
- OSA architecture and optimization
- Implementation and evaluation
- Summary
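10 How We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N
[Figure: MEMS internals, showing fibers, imaging lens, MEMS mirror and reflector]
- Flexible topology
- Fixed degree
[Figure: a MEMS connecting nodes A, B, C and D in different pairings]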
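11 How We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N
[Figure: MEMS internals, showing fibers, imaging lens, MEMS mirror and reflector]
- Flexible topology
- Fixed degree
[Figure: a MEMS connecting nodes A, B, C and D in different pairings]
Wavelength Selective Switch (WSS), 1 × k
[Figure: a WSS steering the wavelengths on its input to Output 1, Output 2, …, Output k]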
12 How We Achieve Such Flexibility?
Micro-Electro-Mechanical Switch (MEMS), N × N
[Figure: MEMS internals, showing fibers, imaging lens, MEMS mirror and reflector]
- Flexible topology
- Fixed degree
Wavelength Selective Switch (WSS), 1 × k
- Flexible link capacity
- Fixed node capacity
- Wavelength uniqueness
Other optical devices:
- Optical fiber (100 Terabits)
- Circulator (bidirectional send/receive)
- WDM MUX/DEMUX (32 port)
- Coupler (4 port)
Common features:
- Support high bit-rate, high capacity
- Power-efficient
- Small and compact (except MEMS)
[Figure: nodes A, B, C and D connected through a MEMS and a WSS in different configurations]
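The two switching elements can be pictured with a small model. The sketch below is an illustration only, not the devices' real control interface: the class names `Mems` and `Wss`, their methods, and the 10 Gb/s per-wavelength rate are all assumptions. It captures the properties listed on the slide: a MEMS port carries at most one circuit (fixed degree), and a WSS sends each wavelength to one output at a time, so moving a wavelength shifts capacity between links while the total stays fixed.

```python
class Mems:
    """N x N optical crossbar: each input port is circuit-switched to at most one output."""

    def __init__(self, n):
        self.n = n
        self.out_of = {}  # input port -> output port
        self.in_of = {}   # output port -> input port

    def connect(self, inp, out):
        if not (0 <= inp < self.n and 0 <= out < self.n):
            raise ValueError("port out of range")
        if inp in self.out_of or out in self.in_of:
            raise ValueError("port already carries a circuit")
        self.out_of[inp] = out
        self.in_of[out] = inp


class Wss:
    """1 x k wavelength selective switch: each input wavelength goes to at most one output."""

    def __init__(self, k, wavelengths):
        self.k = k
        self.port_of = {w: None for w in wavelengths}  # wavelength -> output port

    def steer(self, wavelength, port):
        if not 0 <= port < self.k:
            raise ValueError("output port out of range")
        if wavelength not in self.port_of:
            raise KeyError(f"unknown wavelength {wavelength}")
        # Re-steering a wavelength moves its capacity from one link to another.
        self.port_of[wavelength] = port

    def capacity_gbps(self, port, per_wavelength_gbps=10):
        """Link capacity on one output; the 10 Gb/s per wavelength is an assumed rate."""
        count = sum(1 for p in self.port_of.values() if p == port)
        return count * per_wavelength_gbps


wss = Wss(k=4, wavelengths=range(4))
wss.steer(0, 1)
wss.steer(1, 1)               # two wavelengths on output 1
print(wss.capacity_gbps(1))
wss.steer(1, 2)               # move one wavelength to output 2
print(wss.capacity_gbps(1), wss.capacity_gbps(2))
```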
13 OSA Architecture Overview
[Figure: OSA architecture. Top-of-Rack switches, each with a send part and a receive part, connect to a MEMS (320 ports).]

With all these optical devices in hand, we now look at the whole architecture. The ToRs divide the network into two parts: below the ToRs are the servers, and above the ToRs is an all-optical interconnect. The optical component above each ToR has a …
14 OSA Architecture Overview
[Figure: OSA architecture. ToRs A to H each connect through a WSS with k ports to the MEMS (320 ports) at the core.]
OSA can arrange any k-regular topology with flexible link capacity among the ToRs!
- Each ToR can connect to any k other ToRs
- Each link can have flexible capacity
15 OSA Architecture Overview
[Figure: OSA architecture. Each ToR connects through a WSS with k ports to the MEMS (320 ports) at the core.]
Two notes about OSA:
1. Multi-hop routing for indirect ToRs
2. OSA is a container-sized DCN for now
16 Control Plane: Logically Centralized
OSA Manager: optimize the network to better serve the traffic
- Topology
- Link capacity
- Routing
[Figure: the OSA Manager controlling the network built around the MEMS (320 ports)]

Given that the OSA architecture can assume any k-regular graph with dynamic link capacities, an important question is how to optimize the network topology and the link capacities. To handle this, we employ a centralized OSA manager. The main objective of the manager is to maximize network throughput based on the current traffic pattern. Specifically, it talks to the OSM (the MEMS optical switching matrix) to configure the topology, to the WSS units to configure the link capacities, and to the ToRs to configure the routing.
17 Optimization Procedure in OSA Manager
1. Estimate traffic demand between ToRs (Hedera [NSDI’10])
2. Assign direct links to heavily communicating ToR pairs (maximum k-matching)

The optimization contains 5 steps. First, …
18 Maximum K-matching for Direct Links Setup
[Figure: the ToR traffic demand matrix over ToRs A to H is turned into a ToR demand graph with weighted edges; a maximum weighted 3-matching (Edmonds’ algorithm) on that graph gives the ToR connection graph.]

To optimize the network topology, our intention is to localize the traffic for better communication. One way to do this is to find the heavily communicating ToR pairs in the demand matrix and set up direct circuit links between them. For this purpose, based on the traffic demand matrix, we construct a graph in which each node is a ToR and each link weight is the demand between the two ToRs. Then we want to derive a k-regular graph from this traffic demand graph such that the sum of the link weights of the derived k-regular graph is maximized. We found that this problem maps naturally to the maximum weighted k-matching problem. In our implementation, we simply apply Edmonds’ matching algorithm to solve it.

J. Edmonds, “Paths, trees and flowers”, Canad. J. of Math., 1965
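The shape of the direct-link problem can be sketched in a few lines. Note that the code below is a simple greedy heuristic, not Edmonds’ algorithm, so it does not guarantee the maximum total weight and may leave some ToRs with fewer than k links. The function name and the demand values are hypothetical.

```python
def greedy_k_matching(demand, k):
    """Pick ToR pairs in decreasing order of demand while every ToR keeps degree <= k.

    demand: {(tor_a, tor_b): traffic volume}, one entry per unordered pair.
    Returns the chosen links as (tor_a, tor_b, volume) tuples.
    This is a greedy approximation, not an exact maximum weighted k-matching.
    """
    degree = {}
    chosen = []
    for (a, b), volume in sorted(demand.items(), key=lambda kv: kv[1], reverse=True):
        if a == b:
            continue  # a ToR needs no optical link to itself
        if degree.get(a, 0) < k and degree.get(b, 0) < k:
            chosen.append((a, b, volume))
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
    return chosen


# Hypothetical demand between a few ToR pairs (arbitrary units).
demand = {
    ("A", "G"): 10, ("B", "H"): 10, ("C", "E"): 10, ("D", "F"): 10,
    ("F", "G"): 20, ("A", "B"): 3, ("C", "D"): 2, ("E", "H"): 1,
}
links = greedy_k_matching(demand, k=3)
print(links)
print("total demand on direct links:", sum(volume for _, _, volume in links))
```

An exact solver would replace the greedy loop; the input (a weighted demand graph) and the output (a degree-bounded set of links) keep the same form.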
19 Optimization Procedure in OSA Manager
1. Estimate traffic demand between ToRs (Hedera [NSDI’10])
2. Assign direct links to heavily communicating ToR pairs (maximum k-matching)
3. Compute the routing paths (shortest path routing)
4. Compute the traffic demand on each link
5. Assign wavelengths to provision the link bandwidth (edge-coloring theory)

The optimization contains 5 steps. First, …
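Steps 3 and 4 can be illustrated with a minimal sketch: route each ToR pair on a shortest path in the ToR graph, then add that pair's demand to every link on the path. Hop-count shortest paths via breadth-first search are an assumption here, and the function names and the small example topology are hypothetical.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search; returns a fewest-hop path as a list of nodes, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adj.get(node, ()):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


def link_demand(adj, demand):
    """Sum the demand of every ToR pair over the links of its shortest path."""
    load = {}
    for (src, dst), volume in demand.items():
        path = shortest_path(adj, src, dst)
        if path is None:
            continue  # unreachable pair; a real manager would have to handle this
        for a, b in zip(path, path[1:]):
            link = tuple(sorted((a, b)))  # treat links as undirected
            load[link] = load.get(link, 0) + volume
    return load


# Hypothetical 4-ToR chain: A - B - C - D
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
demand = {("A", "C"): 10, ("B", "D"): 5}
print(link_demand(adj, demand))
```

The per-link totals are what step 5 then converts into a number of wavelengths per link.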
20 Edge-coloring for Wavelength Assignment
[Figure: the expected wavelength graph over ToRs A to H, with the number of wavelengths on each link, is converted into a multigraph based on the number of wavelengths, shown for example from F’s perspective.]
- Wavelength assignment: a wavelength cannot be associated with a ToR twice
- Edge-coloring: a color cannot be associated with a node twice
- Vizing’s theorem

Given the required capacity on each link, we need to provision a corresponding number of wavelengths. However, wavelength assignment is not arbitrary: due to the contention problem, a wavelength can be assigned to a ToR at most once. Furthermore, we require that the total number of wavelengths used in the whole network is minimized. Given these constraints, we reduce the problem to an edge-coloring problem on a multigraph.

Here is how we reduce it. First, we represent our ToR-level graph as a multigraph, in which the number of edges between two nodes corresponds to the number of wavelengths between them. Second, we assume each wavelength has a unique color. Then a feasible wavelength assignment is equivalent to an assignment of colors to the edges of the multigraph such that no two adjacent edges have the same color, which is exactly the edge-coloring problem. Edge-coloring is a well-explored problem in graph theory, and fast algorithms such as Vizing’s algorithm are known. Libraries implementing it are publicly available.

As systems researchers, our contribution here is not to develop new algorithms or theories. Instead, we identify and apply these simple yet deep and elegant theories to solve real-world problems and design efficient networked systems.

J. Misra, et al., “A constructive proof of Vizing’s Theorem,” Inf. Process. Lett., 1992.
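The reduction can be made concrete with a short sketch. It uses first-fit greedy coloring rather than the constructive Vizing-style algorithm cited on the slide, so it always produces a valid assignment but may use more wavelengths than the minimum. The function name and the example link requirements are hypothetical.

```python
def assign_wavelengths(link_wavelengths):
    """First-fit edge coloring of the ToR multigraph.

    link_wavelengths: {(tor_a, tor_b): number of wavelengths the link needs}.
    Returns {(tor_a, tor_b): [wavelength ids]} such that no wavelength id
    appears twice at the same ToR.
    """
    used = {}        # ToR -> set of wavelength ids already in use at that ToR
    assignment = {}
    # Handle the most demanding links first; ties keep their input order.
    ordered = sorted(link_wavelengths.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), count in ordered:
        used_a = used.setdefault(a, set())
        used_b = used.setdefault(b, set())
        picked = []
        wavelength = 0
        while len(picked) < count:
            if wavelength not in used_a and wavelength not in used_b:
                picked.append(wavelength)
                used_a.add(wavelength)
                used_b.add(wavelength)
            wavelength += 1
        assignment[(a, b)] = picked
    return assignment


# Hypothetical requirements: F-G needs two wavelengths, the other links one each.
links = {("F", "G"): 2, ("F", "D"): 1, ("G", "C"): 1}
for link, wavelengths in assign_wavelengths(links).items():
    print(link, wavelengths)
```

Links that share no ToR, such as F-D and G-C here, are free to reuse the same wavelength id, which is what keeps the total number of wavelengths low.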
21 Optimization Procedure in OSA Manager
1. Estimate traffic demand between ToRs
2. Assign direct links to heavily communicating ToR pairs (topology, configured on the MEMS)
3. Compute the routing paths (routing, configured on the ToRs)
4. Compute the traffic demand on each link
5. Assign wavelengths to provision the link bandwidth (link capacity, configured on the WSS)

The optimization contains 5 steps. First, …
22 Prototype Implementation
- 1 MEMS (32 ports: 16×16)
- 8 WSS units (1×4 ports)
- 8 ToRs* and 32 servers
*Server-emulated ToR
[Figure: photo of the prototype with the MEMS and WSS labeled; plot comparing the theoretical curve and the experiment curve]
Experimental results closely follow the expectation: this demonstrates the feasibility of the OSA design!

The prototype is built with all real optical devices and server-emulated ToRs. It looks messy because of the lab setup; in practice, all the devices are very compact and can be packed onto the ToRs. The setup is also a one-time effort: when upgrading to 10G, 40G, 100G…, there is no need to touch the network, because the optical devices are bit-rate independent.
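23 Simulation Results (2560 servers*)
[Figure: simulation results. Annotations for OSA relative to non-blocking: 85%, 90%, ~100%, 80%. Annotations for OSA relative to hybrid: 3.86X, 3.1X, 3.54X, 3X.]
- OSA can be close to non-blocking
- OSA is significantly better than hybrid
Demonstrates the high performance of the OSA design!
*80 ToRs (each with 32 servers) form a 4-regular graph for OSA.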
24 Cost, Power & Wiring (2560 Servers)
- OSA is slightly better than hybrid
- OSA is significantly better than Fattree
Demonstrates that OSA can potentially deliver high bandwidth in a simple, power-efficient and cost-effective way!
25 Summary and Caveats
[Figure: spectrum from static, “fat” (Fattree) through Hybrid to flexible, “thin” (OSA)]
[Table: Fattree, Hybrid and OSA rated (√ / X) on performance, complexity, power and cost]
- OSA is inspired by traffic regionality and stability
- Sweet spot for performance, cost, power, and wiring complexity
- Caveats: not intended for all-to-all, non-stable traffic
27 Data Center Traffic Characteristics
- [IMC’09][HotNets’09]: only a few ToRs are hot and most of their traffic goes to a few other ToRs
- [IMC’10]: traffic at ToRs exhibits an ON/OFF pattern
- [SIGCOMM’09]: over 90% of bytes flow in elephant flows
- [WREN’10]: 60% of ToRs see less than 20% change in traffic volume over intervals of seconds
- [ICDCS’12]: traffic in a production DCN shows stability even on an hourly time scale
Static full bisection bandwidth between all servers at all times is a waste of resources!
28 Circuit Switch vs Packet Switch
Optical Circuit Switch:
- circuit switching
- $500/port
- rate free
- 0.24mW/port
- ~10ms circuit switching latency
Electrical Packet Switch (10G):
- store and forward
- $500/port
- 10Gb/s fixed rate
- 12.5W/port
- per-packet switching