VL2: A Scalable and Flexible Data Center Network


1 VL2: A Scalable and Flexible Data Center Network
CS538 10/23/2014 Presentation by: Soteris Demetriou Scribe: Cansu Erdogan

2 Credits Some of the slides were used in their original form or adapted from slides by Assistant Professor Hakim Weatherspoon (Cornell). Those slides are annotated with a * in the top right corner of the slide.

3 Paper Details
Title: VL2: A Scalable and Flexible Data Center Network
Authors: Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta (Microsoft Research)
Venue: Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Citations: 918

4 Overview Problem: Conventional data center networks do not provide agility, i.e., assigning any service to any server efficiently is challenging. Approach: Merge layer 2 and layer 3 into a virtual layer 2 (VL2). How? Flat addressing to provide layer-2 semantics, Valiant Load Balancing for uniform high capacity between servers, and TCP to ensure performance isolation. Findings: VL2 can provide uniform high capacity between any two servers, performance isolation between services, and agility through layer-2 semantics.

5 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

6 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

7 Clos Topology Multistage circuit-switching network
Usage: when the required switching capacity exceeds that of the largest feasible crossbar switch (M×N)

8 Clos Topology (diagram: ingress stage, middle stage, egress stage)

9 Clos Topology (diagram: parameters m, n, r: r ingress switches with n inputs each, m middle-stage switches, r egress switches)
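To make the three-stage structure concrete, here is a minimal Python sketch (my own illustration, not from the slides) that enumerates the inter-stage links of a Clos(m, n, r) network, where every switch in one stage connects to every switch in the next stage:

```python
def clos_inter_stage_links(m: int, n: int, r: int):
    """Links of a three-stage Clos(m, n, r) network: r ingress switches
    (each n x m), m middle switches (each r x r), r egress switches (each m x n).
    Adjacent stages are fully connected: every ingress switch has one link to
    every middle switch, and every middle switch to every egress switch."""
    up = [(f"ingress-{i}", f"middle-{j}") for i in range(r) for j in range(m)]
    down = [(f"middle-{j}", f"egress-{k}") for j in range(m) for k in range(r)]
    return up + down

# Example: Clos(m=4, n=2, r=3) serves n*r = 6 inputs/outputs
# with 2 * m * r = 24 inter-stage links.
print(len(clos_inter_stage_links(4, 2, 3)))   # -> 24
```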

10 Traffic Matrix Traffic rate between a node and every other node
E.g., a network with N nodes: each node connects to every other node, giving N×N flows and therefore an N×N representative matrix. Valid: a valid traffic matrix is one that ensures no node is oversubscribed.
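A minimal sketch (my own, assuming every node has the same ingress/egress capacity r) of what "valid" means: no row sum (traffic a node sends) or column sum (traffic a node receives) of the N×N matrix may exceed r.

```python
def is_valid_traffic_matrix(tm, r):
    """Check that no node is oversubscribed.

    tm[i][j] is the traffic rate from node i to node j; r is each node's
    ingress/egress capacity. The matrix is valid if every node sends at
    most r in total and receives at most r in total."""
    n = len(tm)
    for i in range(n):
        outgoing = sum(tm[i][j] for j in range(n))
        incoming = sum(tm[j][i] for j in range(n))
        if outgoing > r or incoming > r:
            return False
    return True

# Example: 3 nodes with capacity r = 10; each sends 4 to the other two.
tm = [[0, 4, 4],
      [4, 0, 4],
      [4, 4, 0]]
print(is_valid_traffic_matrix(tm, 10))  # True: every row/column sums to 8 <= 10
```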

11 Valiant Load Balancing
Keslassy et al. proved that uniform Valiant load balancing is the unique architecture that requires the minimum node capacity when interconnecting a set of identical nodes. Zhang-Shen et al. used it to design a predictable network backbone. Intuition: it is much easier to estimate the aggregate traffic entering and leaving a node than to estimate a complete traffic matrix (the traffic rate from every node to every other node). A valid-traffic-matrix approach that routes each flow on a direct ingress-to-egress path requires link capacity equal to the node capacity (= r). VLB instead load-balances traffic across all two-hop paths, so the link capacity needed between any two nodes is r/N + r/N = 2r/N. Zhang-Shen, Rui, and Nick McKeown. "Designing a predictable Internet backbone network." HotNets, 2004.

12 Valiant Load Balancing
Backbone setting: consider a backbone network consisting of multiple PoPs (Points of Presence) interconnected by long-haul links. The whole network is arranged as a hierarchy, and each PoP connects an access network to the backbone. Although traffic matrices are hard to obtain, it is straightforward to measure, or estimate, the total amount of traffic entering (leaving) a PoP from (to) its access network. When a new customer joins the network, we add its aggregate traffic rate to the node; when new locations are planned, the aggregate traffic demand for a new node can be estimated from the population it serves. This is much easier than trying to estimate the traffic rate from one node to every other node in the backbone.
Now imagine a full mesh between the nodes in the backbone network. Traffic entering the backbone is spread equally across all nodes: a flow is load-balanced across every two-hop path from its ingress to its egress node, so each packet traverses the network twice.
Link capacity analysis. Stage 1: each node uniformly distributes its traffic to every other node, so each node receives 1/N of every node's traffic; since the incoming rate at a node is at most r and is spread evenly among N nodes, each link needs r/N capacity. Stage 2: all packets are delivered to their final destination; each node can receive traffic at a maximum rate of r and it receives 1/N of that from every other node, so the traffic on each link is again at most r/N. Total: 2r/N per link.
Zhang-Shen, Rui, and Nick McKeown. "Designing a predictable Internet backbone network." HotNets, 2004.
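A small sketch of the capacity arithmetic above (assumptions mine): with N nodes of access rate r, serving any valid traffic matrix over direct single-hop paths requires full-mesh links of capacity r each, whereas two-stage VLB needs only r/N per stage, i.e., 2r/N per link.

```python
def mesh_link_capacity_direct(r: float) -> float:
    """Single-hop routing over a full mesh: a node may send its entire
    access rate r to one other node, so each direct link must be sized r."""
    return r

def mesh_link_capacity_vlb(r: float, n: int) -> float:
    """Two-stage Valiant load balancing: stage 1 spreads at most r of a
    node's traffic evenly over n nodes (r/n per link); stage 2 delivers at
    most r of a destination's traffic, again spread over n nodes (r/n).
    Each link therefore needs r/n + r/n = 2r/n."""
    return 2 * r / n

# Example (hypothetical sizes): 50 PoPs with 100 Gbps access rate each.
r, n = 100.0, 50
print(mesh_link_capacity_direct(r))   # 100.0 Gbps per link without VLB
print(mesh_link_capacity_vlb(r, n))   # 4.0 Gbps per link with VLB
```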

13 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

14 Conventional Data Center Network Architecture
* Conventional Data Center Network Architecture. As shown in Figure 1 of the paper, the network is a hierarchy reaching from a layer of servers in racks at the bottom to a layer of core routers at the top. There are typically 20 to 40 servers per rack, each singly connected to a Top of Rack (ToR) switch with a 1 Gbps link. ToRs connect to two aggregation switches for redundancy, and these switches aggregate further, connecting to access routers. At the top of the hierarchy, core routers carry traffic between access routers and manage traffic into and out of the data center. All links use Ethernet as a physical-layer protocol, with a mix of copper and fiber cabling. All switches below each pair of access routers form a single layer-2 domain, typically connecting several thousand servers. To limit overheads (e.g., packet flooding and ARP broadcasts) and to isolate different services or logical server groups (e.g., e-mail, search, web front ends, web back ends), servers are partitioned into virtual LANs (VLANs). Unfortunately, this conventional design suffers from some fundamental limitations.

15 * DCN Problems (diagram: the conventional hierarchy of core routers, access routers, aggregation switches, ToRs, and servers, with oversubscription ratios of roughly 1:5, 1:80, and 1:240 at successive levels; one service pleads "I want more" while another says "I have spare ones, but…")
Static network assignment. Fragmentation of resources. Poor server-to-server connectivity. Services' traffic affects each other. Poor reliability and utilization.

16 * End Result: The Illusion of a Huge L2 Switch
(diagram: the entire data center abstracted as one big layer-2 switch connecting all servers) 1. L2 semantics. 2. Uniform high capacity. 3. Performance isolation.

17 * Objectives
Uniform high capacity: the maximum rate of a server-to-server traffic flow should be limited only by the capacity of the servers' network interface cards; assigning servers to a service should be independent of network topology.
Performance isolation: the traffic of one service should not be affected by the traffic of other services.
Layer-2 semantics: easily assign any server to any service; configure a server with whatever IP address the service expects; a VM keeps the same IP address even after migration.

18 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

19 Methodology
Setting design objectives: interview stakeholders to derive a set of objectives.
Deriving typical workloads: measurement study on traffic patterns (data-center traffic analysis, flow distribution analysis, traffic matrix analysis, failure characteristics).

20 Measurement Study Two main questions:
Who sends how much data, to whom, and when? How often does the state of the network change due to changes in demand or to switch/link failures and recoveries? The study covered production data centers of a large cloud service provider.

21 Data-Center Traffic Analysis
Setting: instrumentation of a highly utilized cluster in a data center. The cluster has 1,500 nodes and supports data mining on petabytes of data; servers are distributed roughly evenly among 75 ToR (Top of Rack) switches, which are connected hierarchically.

22 Data-Center Traffic Analysis
The ratio of traffic volume between servers inside the data center to traffic entering/leaving the data center is 4:1. Bandwidth demand between servers inside a data center grows faster than bandwidth demand to external hosts. The network is the bottleneck of computation.

23 Flow Distribution Analysis
The majority of flows are small (a few KB), on par with Internet flows. Why? They are mostly hellos and metadata requests to the distributed file system. Almost all bytes (>90%) are transported in flows of 100 MB to 1 GB, with a mode around 100 MB, because the distributed file system breaks long files into 100 MB chunks. Flows over a few GB are rare.
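To see why most bytes can live in a small fraction of flows, here is a short sketch (illustrative only, not the paper's analysis code, with a hypothetical flow mix) contrasting the flow-count distribution with the byte-weighted distribution for a given list of flow sizes:

```python
def flow_count_share(flow_sizes, low, high):
    """Fraction of flows whose size (bytes) falls in [low, high)."""
    return sum(1 for s in flow_sizes if low <= s < high) / len(flow_sizes)

def byte_weighted_share(flow_sizes, low, high):
    """Fraction of total bytes carried by flows whose size falls in [low, high)."""
    total = sum(flow_sizes)
    in_range = sum(s for s in flow_sizes if low <= s < high)
    return in_range / total if total else 0.0

# Hypothetical mix: many small hello/metadata flows plus a few 100 MB chunk transfers.
KB, MB = 1_000, 1_000_000
flows = [2 * KB] * 10_000 + [100 * MB] * 50
print(flow_count_share(flows, 0, 1 * MB))                 # ~0.995: almost all flows are small
print(byte_weighted_share(flows, 100 * MB, 1_000 * MB))   # ~0.996: almost all bytes are in big flows
```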

24 Flow Distribution Analysis
The distribution of internal (data-center) flows is simpler and more uniform than that of Internet flows.

25 Flow Distribution Analysis
Two modes: more than 50% of the time, an average machine has about 10 concurrent flows, but at least 5% of the time it has more than 80 concurrent flows. This implies that randomizing path selection at flow granularity will not cause perpetual congestion even if flow placement is unlucky.

26 Traffic Matrix Analysis
Poor summarizability of traffic patterns: even when approximating with clusters, the fitting error remains high (60%), so engineering for just a few traffic matrices is unlikely to work well for real data-center traffic. Instability of traffic patterns: the traffic pattern shows no periodicity that can be exploited for prediction.
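A rough sketch (my own, with scikit-learn's k-means assumed available; it is a stand-in for the paper's analysis, not its code) of the kind of computation behind the "poor summarizability" claim: flatten each measured traffic matrix into a vector, cluster the snapshots into k representatives, and measure how much of each matrix is left unexplained by its nearest representative.

```python
import numpy as np
from sklearn.cluster import KMeans

def tm_fitting_error(traffic_matrices, k):
    """Approximate a set of traffic matrices by k representative matrices
    (cluster centers) and return the mean relative fitting error.

    traffic_matrices: array of shape (num_samples, N, N)."""
    num_samples = traffic_matrices.shape[0]
    X = traffic_matrices.reshape(num_samples, -1)        # flatten each NxN matrix
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_[km.labels_]            # nearest representative per sample
    errors = np.linalg.norm(X - centers, axis=1) / np.linalg.norm(X, axis=1)
    return errors.mean()

# Usage (hypothetical file and shapes): 1,000 per-second TM snapshots for 75 ToRs.
# tms = np.load("tm_snapshots.npy")     # shape (1000, 75, 75)
# print(tm_fitting_error(tms, k=12))    # a high value indicates poor summarizability
```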

27 Failure Characteristics 1/2
Failure definition: the event that occurs when a system or component is unable to perform its required function for more than 30 s. Most failures are small in size: 50% of network failures involve fewer than 4 devices and 95% involve fewer than 20 devices. Downtimes can be significant: 95% of failures are resolved within 10 min, 98% within 1 hour, and 99.6% within 1 day, but 0.09% last more than 10 days.

28 Failure Characteristics 2/2
In 0.3% of failures, all redundant components in a network device group became unavailable. The main causes of downtime are network misconfigurations, firmware bugs, and faulty components. There is no obvious way to eliminate all failures from the top of the hierarchy.

29 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

30 * Objectives Methodology: interviews with architects, developers, and operators.
Uniform high capacity: the maximum rate of a server-to-server traffic flow should be limited only by the capacity of the servers' network interface cards; assigning servers to a service should be independent of network topology.
Performance isolation: the traffic of one service should not be affected by the traffic of other services.
Layer-2 semantics: easily assign any server to any service; configure a server with whatever IP address the service expects; a VM keeps the same IP address even after migration.

31 Design Overview (Objective / Approach / Solution)
Layer-2 semantics / Name-location separation / Flat addressing and a Directory System (resolution service).
Uniform high capacity / Guarantee bandwidth for hose-model traffic / Valiant Load Balancing over a scale-out Clos topology.
Performance isolation / Enforce the hose model using existing mechanisms / TCP.

32 Design Overview Randomizing to cope with unpredictability and volatility. Valiant Load Balancing: destination-independent traffic spreading across multiple intermediate nodes. A Clos topology is used to support the randomization, together with a proposed flow-spreading mechanism.

33 Design Overview Building on proven technologies
VL2 is based on IP routing and forwarding technologies that are available in commodity switches. Link-state routing maintains the switch-level topology and does not disseminate end hosts' information. Equal-Cost Multi-Path (ECMP) forwarding with anycast addresses enables VLB with minimal control-plane messaging.

34 Design Overview Separating names from locators enables agility, e.g., rapid VM migration.
Use of Application Addresses (AAs) and Location Addresses (LAs), with a Directory System for name resolution.

35 VL2 Components Scale-out Clos topology, Addressing and Routing (VLB),
Directory System

36 * Scale-out Topology (diagram: VL2's Clos network, with intermediate switches at the top, aggregation switches below forming a complete bipartite graph with them, and ToRs with 20 servers each at the bottom)
The links between the intermediate switches and the aggregation switches form a complete bipartite graph. As in the conventional topology, each ToR connects to two aggregation switches. However, the large number of paths between aggregation and intermediate switches means that if there are n intermediate switches, the failure of one of them reduces the bisection bandwidth by only 1/n, so bandwidth degrades gracefully. Furthermore, it is easy and relatively inexpensive to build a Clos network with no oversubscription. Essentially, the design scales out with many small commodity devices instead of scaling up a few large ones.
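A tiny sketch (sizes and capacity model are my assumptions) of the graceful-degradation point: if every aggregation switch has one link to every intermediate switch, the total capacity of that complete bipartite core, a simple proxy for bisection bandwidth, shrinks by exactly 1/n when one of n intermediate switches fails.

```python
def core_capacity_gbps(n_intermediate: int, n_aggregation: int,
                       link_gbps: float, failed_intermediate: int = 0) -> float:
    """Total Aggregation<->Intermediate capacity of a complete bipartite core
    (a proxy for bisection bandwidth), assuming one link of link_gbps between
    every aggregation switch and every surviving intermediate switch."""
    surviving = n_intermediate - failed_intermediate
    return surviving * n_aggregation * link_gbps

# Example (hypothetical sizes): 6 intermediate switches, 12 aggregation switches, 10 Gbps links.
full = core_capacity_gbps(6, 12, 10.0)
after_one_failure = core_capacity_gbps(6, 12, 10.0, failed_intermediate=1)
print(full, after_one_failure, after_one_failure / full)  # 720.0 600.0 0.833...
# Losing 1 of the 6 intermediate switches removes exactly 1/6 of the capacity.
```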

37 Scale-out topology Clos very suitable for VLB
By indirectly forwarding traffic through an intermediate switch at the top, the network can provide bandwidth guarantees for any traffic matrix subject to the hose model. Routing is simple and resilient: take a random path up to an intermediate switch and a random path down to the destination ToR.

38 VL2 Addressing and Routing: name-location separation
* VL2 Addressing and Routing: name-location separation. This allows usage of low-cost switches and protects the network and hosts from host-state churn (churn: how often state changes due to switch/link failures, recoveries, etc.). Switches run link-state routing and maintain only the switch-level topology, while servers use flat names.
(diagram: a directory service maps AAs to ToR locations, e.g., x to ToR2, y to ToR3, and z to ToR3 or ToR4 as z moves between racks; senders look up the mapping and encapsulate packets to the destination ToR)
The network infrastructure operates with location-specific addresses (LAs). All switches and interfaces are assigned LAs, and switches run a layer-3 link-state protocol that disseminates only these LAs, eventually capturing the whole topology; they can then forward packets encapsulated with LAs. Applications use application-specific addresses (AAs), which remain unaltered even when servers' locations change (due to VM migration or re-provisioning).
Packet forwarding: the VL2 agent traps packets from the host and encapsulates each packet with the LA of the destination's ToR. Once the packet arrives at the destination ToR, the ToR decapsulates it and delivers it to the AA in the inner header.
Address resolution: servers believe that all other servers in the same service are part of the same IP subnet, so when sending to a destination for the first time they broadcast an ARP request. The VL2 agent on the server host intercepts the ARP and instead sends a unicast request to the directory system, which responds with the LA of the destination's ToR.
When a ToR fails, the service is re-assigned; once a service is assigned to a server, the Directory System stores the mapping between the AA and the LA.
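A minimal sketch (my own Python, not the actual VL2 agent; class and field names are assumptions) of the forwarding path described above: the agent resolves the destination AA to the LA of its ToR via the directory system instead of broadcasting ARP, caches the mapping, and encapsulates; the destination ToR decapsulates and delivers on the AA.

```python
class DirectorySystemStub:
    """Hypothetical stand-in for the VL2 directory system: maps an
    application address (AA) to the locator address (LA) of the ToR
    currently hosting that AA."""
    def __init__(self, mappings):
        self.mappings = mappings

    def lookup(self, aa):
        return self.mappings[aa]

class VL2AgentSketch:
    """Illustrative sender-side agent: intercept a packet addressed to an AA,
    resolve the LA of the destination's ToR, cache it, and encapsulate."""
    def __init__(self, directory):
        self.directory = directory
        self.cache = {}                      # AA -> LA cache

    def send(self, packet):
        dst_aa = packet["dst_aa"]
        if dst_aa not in self.cache:         # first packet: unicast lookup, no ARP broadcast
            self.cache[dst_aa] = self.directory.lookup(dst_aa)
        # Outer header carries the ToR's LA; inner header keeps the AA the application used.
        return {"outer_dst_la": self.cache[dst_aa], "inner": packet}

def tor_decapsulate(encapsulated):
    """Destination ToR: strip the outer header and deliver on the inner AA."""
    return encapsulated["inner"]

# Example
directory = DirectorySystemStub({"10.0.0.7": "LA-ToR-3"})
agent = VL2AgentSketch(directory)
frame = agent.send({"src_aa": "10.0.0.5", "dst_aa": "10.0.0.7", "payload": b"hello"})
assert tor_decapsulate(frame)["payload"] == b"hello"
```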

39 VL2 Addressing and Routing: VLB indirection
* VL2 Addressing and Routing: VLB indirection [ECMP + IP Anycast]. Goals: harness the huge bisection bandwidth, avoid esoteric traffic engineering or optimization, ensure robustness to failures, and work with switch mechanisms available today.
(diagram: traffic from source x toward destinations y and z bounces off intermediate switches sharing an anycast address IANY; some links are used for up paths and others for down paths, across ToRs T1 to T6)
Overview: VLB causes the traffic between any pair of servers to bounce off a randomly selected intermediate switch. It then uses layer-3 router features (ECMP) to spread the traffic along multiple subpaths for the two path segments: the uplink path to the intermediate switch and the downlink path to the destination. VLB distributes traffic across a set of intermediate nodes and uses flows as the basic unit of traffic spreading, avoiding out-of-order delivery; ECMP distributes the flows across equal-cost paths. Two requirements: the scheme must spread traffic, and it must be independent of the destination.
To implement VLB, the VL2 agent encapsulates packets to a specific but randomly chosen intermediate switch; the intermediate switch decapsulates the packet and sends it to the destination ToR, which decapsulates again and delivers it to the destination server. However, this would require a large number of updates whenever an intermediate switch fails. To address that, the same anycast LA is assigned to all intermediate switches, and the directory system returns this anycast address to VL2 agents on lookup, so an intermediate switch failure does not leave stale per-switch state in the affected agents. Also, since all intermediate switches are exactly three hops away from the source, ECMP takes care of delivering packets encapsulated with the anycast address to any of the active intermediate switches, handling failures.
Question to the audience: what could be a problem here? Elephant flows: random flow placement could then lead to persistent congestion on some links while others are underutilized. However, elephant flows are rare in data centers.
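A sketch (assumptions mine, not the switches' actual hash function) of per-flow spreading: hash the flow's 5-tuple once and use the result to pick among the equal-cost next hops toward the intermediate switches' anycast address, so all packets of a flow take the same path (no reordering) while many flows are spread roughly uniformly.

```python
import hashlib

def ecmp_next_hop(flow_5tuple, equal_cost_paths):
    """Pick one of the equal-cost paths by hashing the flow identifier.

    flow_5tuple: (src_ip, dst_ip, src_port, dst_port, proto).
    All packets of a flow hash to the same path (avoids out-of-order delivery);
    across many flows the choice is spread roughly uniformly."""
    key = "|".join(map(str, flow_5tuple)).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return equal_cost_paths[digest % len(equal_cost_paths)]

# Example: 6 equal-cost uplinks toward the intermediate switches' anycast LA.
paths = [f"uplink-to-intermediate-{i}" for i in range(6)]
flow = ("10.0.1.5", "10.0.9.7", 51234, 443, "tcp")
print(ecmp_next_hop(flow, paths))   # the same flow always maps to the same uplink
```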

40 VL2 Directory System Three key functions: lookups, updates, and reactive cache updates for AA-to-LA mappings.
Lookups and updates resolve and modify AA-to-LA mappings. Reactive cache updates handle latency-sensitive changes, e.g., a VM during migration: mappings are cached at Directory Servers and in VL2 agents' caches, so an update can leave stale entries behind. Goals: scalability, reliability for updates, high lookup performance, and eventual consistency (like ARP).

41 VL2 Directory System
(diagram: VL2 agents send lookups to directory servers and receive replies; an update goes from an agent to a directory server, is replicated through the RSM servers, acknowledged back, and optionally disseminated to other directory servers)
Replicated Directory Servers (DS): a moderate number of directory servers serves on the order of 100K servers. They cache AA-to-LA mappings, lazily sync their mappings with the RSM every 30 seconds (strong consistency is not needed here), and handle queries from VL2 agents. For lookups they ensure high throughput, high availability, and low latency: an agent sends a lookup to k randomly chosen directory servers and uses the fastest reply.
A small number of Replicated State Machine (RSM) servers provides a strongly consistent, reliable store of mappings, ensuring strong consistency and durability for a modest number of updates. Update path: an update is sent to a randomly chosen DS, which forwards it to an RSM server; the RSM reliably replicates the update to every RSM server and then replies with an ACK to the DS, which forwards the ACK to the client. To improve consistency, the DS can disseminate the update to a few other DSs.
Reactive cache update: since mappings are cached at directory servers and in VL2 agents' caches, an update can lead to inconsistency. Observation: a stale host mapping needs to be corrected only when that mapping is used to send traffic. When such packets arrive at a stale LA (a ToR that no longer hosts the destination server), the ToR can forward a sample of the non-deliverable packets to a directory server, triggering it to gratuitously correct the stale mapping in the source's cache via unicast.
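A sketch (my own, using asyncio; server names and the query RPC are hypothetical) of the lookup strategy described above: query k randomly chosen directory servers in parallel and take whichever answers first, trading a little extra load for low latency and availability.

```python
import asyncio
import random

async def query(server, aa):
    """Stand-in for the real lookup RPC: simulate variable per-server latency."""
    await asyncio.sleep(random.uniform(0.001, 0.010))
    return f"LA-of-ToR-for-{aa}@{server}"

async def lookup_aa(aa, directory_servers, k=3):
    """Send the lookup to k randomly chosen directory servers and return the
    first reply, cancelling the rest."""
    chosen = random.sample(directory_servers, k)
    tasks = [asyncio.create_task(query(server, aa)) for server in chosen]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return next(iter(done)).result()

if __name__ == "__main__":
    servers = [f"ds-{i}" for i in range(10)]
    print(asyncio.run(lookup_aa("10.1.2.3", servers)))
```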

42 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

43 Evaluation Uniform high capacity:
All-to-all data shuffle traffic matrix: 75 servers, each delivering 500 MB to all others (a 2.7 TB shuffle from memory to memory). VL2 completes the shuffle in 395 s with an aggregate goodput of 58.8 Gbps, roughly 10x better than their current data center network. The maximal achievable goodput over all flows is 62.3 Gbps, so VL2's network efficiency is 58.8/62.3 = 94%.

44 Evaluation VLB Fairness: 75 node testbed
Goal: evaluate whether VLB with ECMP splits traffic evenly across the network. 75-node testbed, with traffic characteristics as per the measurement study. All flows pass through the aggregation switches, so it is sufficient to check the split ratio there among the links to the intermediate switches. The plot of Jain's fairness index for traffic toward the intermediate switches (one line per aggregation switch, over time) stays above 0.98 on average for all aggregation switches.
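For reference, Jain's fairness index over split ratios x_1..x_n is (sum of x_i)^2 / (n times the sum of x_i^2); it equals 1.0 for a perfectly even split and 1/n when all traffic uses one link. A minimal sketch (example numbers are mine, not the paper's data):

```python
def jains_fairness_index(xs):
    """Jain's fairness index: (sum(x))^2 / (n * sum(x^2)).
    1.0 means a perfectly even split; 1/n means all traffic on one link."""
    n = len(xs)
    total = sum(xs)
    return (total * total) / (n * sum(x * x for x in xs)) if total else 0.0

# Example: traffic one aggregation switch sends toward 6 intermediate switches (Gbps).
even   = [10.0] * 6
skewed = [12.0, 11.0, 10.0, 10.0, 9.0, 8.0]
print(jains_fairness_index(even))    # 1.0
print(jains_fairness_index(skewed))  # ~0.98
```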

45 Evaluation Performance isolation:
Two services were added to the network. Service 1: 18 servers each perform a single TCP transfer to another server, starting at time 0 and lasting throughout the experiment. Service 2: one server starts at 60 s and a new server is added every 2 s, for a total of 19 servers; each starts an 8 GB TCP transfer as soon as it comes up. To achieve isolation, VL2 relies on TCP to ensure that each flow offered to the network is rate-limited to its fair share of the bottleneck (i.e., it obeys the hose model). Question: does TCP react quickly enough to control the offered rate of flows within services? (Enforcing the hose model for each service's traffic is what lets VL2 provide performance isolation between services.) TCP works with packets and adjusts the sending rate at the time-scale of RTTs, whereas strict conformance to the hose model would require instantaneous feedback to avoid oversubscribing the traffic ingress/egress bounds. Result: no perceptible change in Service 1 as servers start up in Service 2.

46 Evaluation Performance isolation (cont'd):
To evaluate how mice flows (large numbers of short TCP connections), which are common in data centers, affect other services, Service 2's servers create successively more bursts of short TCP connections (1 to 20 KB). There is still no perceptible change in Service 1: TCP's natural enforcement of the hose model is sufficient to provide performance isolation when combined with VLB and no oversubscription.

47 Evaluation Convergence after link failures:
All-to-all data shuffle among 75 servers; links between intermediate and aggregation switches are disconnected and later reconnected. The figure shows a time series of the aggregate goodput achieved by the flows in the data shuffle, with vertical lines marking the times at which a disconnection or reconnection occurred. The maximum capacity of the network degrades gracefully. Restoration, however, is delayed: VL2 fully uses a restored link only roughly 50 s after it comes back. Restoration does not interfere with ongoing traffic, and the aggregate throughput eventually returns to its initial level.

48 Outline Background Motivation Measurement Study VL2 Evaluation
Summary and Discussion

49 That’s a lot to take in! Take Aways please!
Problem: Over-subscription in data-centers and lack of agility

50 Overall Approach
Measurement study: stakeholder interviews and data-center workload measurements.
Design objectives derived from the study.
Architecture: application of known techniques where possible.
Evaluation: a testbed that includes all design components, evaluated with respect to the objectives.

51 Design Overview (Objective / Approach / Solution)
Layer-2 semantics / Name-location separation / Flat addressing and a Directory System (resolution service).
Uniform high capacity / Guarantee bandwidth for hose-model traffic / Valiant Load Balancing over a scale-out Clos topology.
Performance isolation / Enforce the hose model using existing mechanisms / TCP.

52 * End Result: The Illusion of a Huge L2 Switch
(diagram: the entire data center abstracted as one big layer-2 switch connecting all servers) 1. L2 semantics. 2. Uniform high capacity. 3. Performance isolation.

53 Discussion
Security: the Directory System can perform access control. What are the challenges with that? Other issues?
Flat vs. hierarchical addressing.
VL2 has issues with large flows. How can we address that challenge?
What is the impact of VL2 on data-center power consumption? All links and switches are active at all times, which is not power efficient. Considering the enormous power consumption of data centers and the significant efforts toward reducing it, VL2 goes in the opposite direction in this respect. Solutions? Better load balancing vs. selectively shutting down equipment.

