Towards Predictable Datacenter Networks

Presentation transcript:

Towards Predictable Datacenter Networks Hitesh Ballani, Paolo Costa, Thomas Karagiannis and Ant Rowstron Microsoft Research, Cambridge

This talk is about guaranteeing network performance for tenants in multi-tenant datacenters.
Multi-tenant datacenters: datacenters with multiple (possibly competing) tenants.
- Private datacenters: run by organizations like Facebook, Intel, etc. Tenants are product groups and applications.
- Cloud datacenters: Amazon EC2, Microsoft Azure, Rackspace, etc. Tenants are users renting virtual machines.

Cloud datacenters 101
Simple interface: tenants ask for a set of VMs.
- Charging is per-VM, per-hour (Amazon EC2 small instances: $0.085/hour).
- No (intra-cloud) network cost.
Network performance is not guaranteed: the bandwidth between a tenant's VMs depends on their placement, network load, protocols used, etc.

Performance variability in the wild: studies report up to 5x variability in network performance.
Studies (study, provider, duration):
- A [Giurgiu'10]: Amazon EC2, duration n/a
- B [Schad'10]: 31 days
- C/D/E [Li'10]: Azure, EC2, Rackspace, 1 day
- F/G [Yu'10]
- H [Mangot'09]

Network performance can vary ... so what?
- Data analytics on an isolated enterprise cluster: a MapReduce job completes in 4 hours.
- The same job in a multi-tenant datacenter: completion time is 10-16 hours. Variable network performance can inflate the job completion time.
- Variable tenant costs: expected cost (based on the 4-hour completion time) = $100; with per-VM-hour charging, the 2.5-4x longer run means an actual cost of $250-400.
Unpredictability of application performance and tenant costs is a key hindrance to cloud adoption. Key contributor: network performance variation.

Predictable datacenter networks
Key idea: extend the tenant-provider interface to account for the network. Tenants request a number of VMs together with their network demands and are offered a virtual network with bandwidth guarantees. This decouples tenant performance from the provider's infrastructure.
Contributions:
- Virtual network abstractions, to capture tenant network demands.
- Oktopus, a proof-of-concept system that implements virtual networks in multi-tenant datacenters and can be incrementally deployed today.

Key takeaway: exposing tenant network demands to providers enables a symbiotic tenant-provider relationship. Tenants get predictable performance (and lower costs); provider revenue increases.

Talk Outline: Introduction, Virtual network abstractions, Oktopus (allocating virtual networks, enforcing virtual networks), Evaluation.

Virtual Network Abstractions: Design Goals
- Easier transition for tenants: tenants should be able to predict the performance of applications running atop the virtual network.
- Provider flexibility: providers should be able to multiplex many virtual networks on the physical network.
These are competing design goals; our abstractions strive to strike a balance between them.

Abstraction 1: Virtual Cluster (VC)
Motivation: in enterprises, tenants run applications on dedicated Ethernet clusters.
Request <N, B>: N VMs, each connected to a virtual switch by a link of capacity B Mbps, so each VM can send and receive at B Mbps. Total bandwidth at the virtual switch = N * B.
- Tenants get a network with no oversubscription.
- Suitable for data-intensive applications (MapReduce, BLAST).
- Moderate provider flexibility.

Abstraction 2: Virtual Oversubscribed Cluster (VOC)
Motivation: many applications moving to the cloud have localized communication patterns; they are composed of groups with more traffic within groups than across groups.
Request <N, B, S, O>: N VMs in groups of size S, with oversubscription factor O.
- Each VM connects to its group's virtual switch at B Mbps, so VMs can send traffic to group members at B Mbps. There is no oversubscription for intra-group communication, which is the common case.
- Each group switch connects to a root virtual switch by a link of capacity B * S / O Mbps; inter-group communication is oversubscribed by factor O, capturing the sparseness of inter-group traffic.
- Total bandwidth at the root = N * B / O; total bandwidth at the VMs = N * B.
VOC capitalizes on tenant communication patterns: suitable for typical applications (though not all), with improved provider flexibility.
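To make the two abstractions concrete, here is a minimal Python sketch (illustrative only, not from the paper) that models VC and VOC requests and computes the virtual-switch capacities stated above; the class and field names are my own.

```python
from dataclasses import dataclass

@dataclass
class VirtualCluster:
    n: int        # number of VMs
    b: float      # per-VM bandwidth (Mbps)

    def switch_capacity(self) -> float:
        # Every VM can send/receive at B, so the virtual switch sees N * B.
        return self.n * self.b

@dataclass
class VirtualOversubscribedCluster:
    n: int        # total number of VMs
    b: float      # per-VM bandwidth (Mbps)
    s: int        # group size
    o: float      # oversubscription factor

    def group_uplink(self) -> float:
        # Each group switch connects to the root at B * S / O.
        return self.b * self.s / self.o

    def root_capacity(self) -> float:
        # N/S groups, each with a B*S/O uplink, gives N * B / O at the root.
        return (self.n / self.s) * self.group_uplink()

if __name__ == "__main__":
    vc = VirtualCluster(n=100, b=100)
    voc = VirtualOversubscribedCluster(n=100, b=100, s=10, o=10)
    print(vc.switch_capacity())   # 10000 Mbps = N * B
    print(voc.group_uplink())     # 100 Mbps  = B * S / O
    print(voc.root_capacity())    # 1000 Mbps = N * B / O
```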

Oktopus: offers virtual networks to tenants in datacenters. Two main components:
- Management plane (allocation of tenant requests): allocates tenant requests to the physical infrastructure, accounting for tenant network bandwidth requirements.
- Data plane (enforcement of virtual networks): enforces tenant bandwidth requirements, achieved through rate limiting at end hosts.

Talk Outline: Introduction, Virtual network abstractions, Oktopus (allocating virtual networks, enforcing virtual networks), Evaluation.

Allocating Virtual Clusters
Example: a datacenter with 4 physical machines and 2 VM slots per machine, one slot already holding a VM of an existing tenant. A new tenant requests 3 VMs arranged in a virtual cluster with 100 Mbps each, i.e. <3 VMs, 100 Mbps>.
An allocation maps the tenant's VMs to physical machines; the tenant's traffic traverses the links between them, and each such link divides the virtual tree into two parts. What bandwidth must be reserved for the tenant on a link? Consider all traffic crossing it: with 2 VMs on one side and 1 VM on the other, the maximum sending rate is 2 * 100 = 200 Mbps while the maximum receive rate is 1 * 100 = 100 Mbps, so the bandwidth needed on the link is min(200, 100) = 100 Mbps.
In general, for a virtual cluster <N, B>, the bandwidth needed on a link that connects m of the tenant's VMs to the remaining (N - m) VMs is min(m, N - m) * B. For a valid allocation, the bandwidth needed must not exceed the link's residual bandwidth. How do we find a valid allocation?
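A minimal sketch of this per-link validity check (illustrative, not Oktopus code); the function names and the residual-bandwidth map are assumptions introduced here.

```python
def bw_needed_on_link(m: int, n: int, b: float) -> float:
    """Bandwidth a <N, B> virtual cluster needs on a physical link
    that separates m of its VMs from the remaining n - m VMs."""
    return min(m, n - m) * b

def allocation_is_valid(vm_split: dict, n: int, b: float, residual_bw: dict) -> bool:
    """vm_split: {link_id: m}, the number of tenant VMs on one side of each
    link the tenant's traffic traverses. residual_bw: {link_id: spare Mbps}."""
    return all(bw_needed_on_link(m, n, b) <= residual_bw[link]
               for link, m in vm_split.items())

# Example from the slide: a link with 2 VMs on one side and 1 on the other,
# B = 100 Mbps, so min(2, 1) * 100 = 100 Mbps must be reserved.
assert bw_needed_on_link(2, 3, 100) == 100
```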

Greedy allocation algorithm
Continuing the example request <3 VMs, 100 Mbps>: how many VMs (m) of this tenant can be allocated to a machine with one empty slot and 200 Mbps of residual bandwidth on its outbound link? Constraints:
- VMs can only be allocated to empty slots: m <= 1
- 3 VMs are requested: m <= 3
- Enough bandwidth on the outbound link: min(m, 3 - m) * 100 <= 200
Solution: at most 1 VM of this tenant can be allocated here.
Key intuition: the same validity conditions can be used to determine the number of VMs that can be allocated at any level of the datacenter hierarchy (machines, racks, and so on). The greedy algorithm traverses up the hierarchy and finds the lowest level at which all 3 VMs can be allocated.
Allocation is fast and efficient. Packing VMs together is motivated by the fact that datacenter networks are typically oversubscribed, and the allocation can be extended for goals like failure resiliency.
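A simplified sketch of this greedy search, assuming a strict tree topology and a single request; the Node class, its fields, and the example topology are illustrative stand-ins, not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One level of the datacenter tree: machine, rack, pod, ..."""
    uplink_residual: float            # spare Mbps on the link towards the parent
    empty_slots: int = 0              # empty VM slots (machines only)
    children: List["Node"] = field(default_factory=list)

def vms_fit(node: Node, n: int, b: float) -> int:
    """Max VMs of a <n, b> virtual-cluster request this subtree can host."""
    if not node.children:                         # a machine
        m = min(node.empty_slots, n)
    else:                                         # a rack / pod
        m = min(sum(vms_fit(c, n, b) for c in node.children), n)
    # The m VMs inside exchange traffic with the n - m outside over the uplink.
    while m > 0 and min(m, n - m) * b > node.uplink_residual:
        m -= 1
    return m

def allocate(root: Node, n: int, b: float) -> Optional[Node]:
    """Depth-first greedy: prefer the lowest subtree that can host all n VMs."""
    for child in root.children:
        found = allocate(child, n, b)
        if found is not None:
            return found
    return root if vms_fit(root, n, b) >= n else None

# Example: two racks of two machines, 2 empty slots each, 200 Mbps residual uplinks.
machines = [Node(uplink_residual=200, empty_slots=2) for _ in range(4)]
racks = [Node(uplink_residual=200, children=machines[:2]),
         Node(uplink_residual=200, children=machines[2:])]
root = Node(uplink_residual=float("inf"), children=racks)
print(allocate(root, n=3, b=100) is racks[0])   # the request fits in one rack -> True
```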

Talk Outline: Introduction, Virtual network abstractions, Oktopus (allocating virtual networks, enforcing virtual networks), Evaluation.

Enforcing Virtual Networks
The allocation algorithm assumes that no VM exceeds its bandwidth guarantee. Enforcement satisfies this assumption: tenant VMs are limited to the bandwidth specified by their virtual network, irrespective of the type of tenant traffic (UDP/TCP/...) and irrespective of the number of flows between the VMs.

Enforcement in Oktopus: key highlights
- Oktopus enforces virtual networks at end hosts, using egress rate limiters implemented in the hypervisor/VMM.
- Oktopus can be deployed today: no changes to tenant applications, no network support needed.
- Tenants without virtual networks can still be supported, which is good for incremental roll-out.
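Oktopus itself uses the end-host Traffic Control API for rate limiting (see the deployment slide). Purely as an illustration of the idea, a per-VM-pair egress limiter could be sketched as a token bucket; the code below is a hypothetical example, not the actual implementation.

```python
import time

class TokenBucket:
    """Illustrative egress rate limiter for one VM-pair (not Oktopus code)."""
    def __init__(self, rate_mbps: float, burst_bytes: int = 64 * 1024):
        self.rate = rate_mbps * 1e6 / 8          # refill rate in bytes per second
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False                              # queue or drop until tokens refill

# One limiter per (source VM, destination VM), set to the rate the virtual
# network allows between them (e.g. B Mbps within a virtual cluster).
limiter = TokenBucket(rate_mbps=100)
print(limiter.allow(1500))   # a 1500-byte packet passes while tokens remain
```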

Talk Outline: Introduction, Virtual network abstractions, Oktopus (allocating virtual networks, enforcing virtual networks), Evaluation.

Evaluation
- Oktopus deployment on a 25-node testbed: benchmark the Oktopus implementation and cross-validate simulation results.
- Large-scale simulation: quantifies the benefits of virtual networks at scale.
Takeaway: the use of virtual networks benefits both tenants and providers.

Datacenter Simulator
- Flow-based simulator: 16,000 servers with 4 VMs/server = 64,000 VMs; three-tier network topology (10:1 oversubscription).
- Tenants submit requests for VMs and execute jobs; a job's VMs process and shuffle data between each other.
- Baseline (representative of today's setup): tenants simply ask for VMs, which are allocated in a locality-aware fashion.
- Virtual network requests: tenants ask for a Virtual Cluster (VC) or a Virtual Oversubscribed Cluster (VOC).

Private datacenters: virtual networks improve completion time
Setup: execute a batch of 10,000 tenant jobs; jobs vary in network intensiveness (the bandwidth at which a job can generate data). VC is Virtual Cluster; VOC-10 is Virtual Oversubscribed Cluster with oversubscription factor 10.
Result (as jobs become more network intensive): virtual networks improve completion time. VC: 50% of Baseline; VOC-10: 31% of Baseline.

Private datacenters (continued)
- With virtual networks, tenants get guaranteed network bandwidth, so job completion time is bounded.
- With Baseline, tenant network bandwidth can vary significantly, so job completion time varies significantly: for 25% of jobs, completion time increases by >280%. Lagging jobs hurt datacenter throughput.
Virtual networks benefit both tenants and provider. Tenants: job completion is faster and predictable. Provider: higher datacenter throughput.

Cloud Datacenters
Setup: tenant job requests arrive over time; jobs are rejected if they cannot be accommodated on arrival (representative of cloud datacenters). Amazon EC2's reported target utilization is marked for reference.
Result (as job requests arrive faster), fraction of rejected requests: Baseline 31%, VC 15%, VOC-10 5%.

Tenant Costs
Question: what should tenants pay to ensure provider revenue neutrality, i.e. so that provider revenue remains the same across all approaches? Pricing is based on today's EC2 prices, i.e. $0.085/hour for each VM.
Result: provider revenue increases while tenants pay less. At 70% target utilization, provider revenue increases by 20% and the median tenant cost is reduced by 42%.
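As a rough illustration of the revenue-neutrality question posed above, the sketch below derives a per-VM-hour price that keeps provider revenue constant when tenants move to virtual networks. All VM-hour totals and per-job durations are hypothetical placeholders, not numbers from the evaluation; only the $0.085/hour base price comes from the slide.

```python
# Hedged sketch of revenue-neutral pricing (hypothetical numbers throughout).
base_price = 0.085            # $/VM-hour (EC2 small instance, from the slide)

base_vm_hours = 1_000_000     # VM-hours sold with Baseline (hypothetical)
vn_vm_hours = 800_000         # VM-hours sold with virtual networks: jobs finish
                              # sooner, but fewer requests are rejected (hypothetical)

# Price for virtual-network requests that keeps provider revenue unchanged:
vn_price = base_price * base_vm_hours / vn_vm_hours
print(f"revenue-neutral price: ${vn_price:.3f}/VM-hour")

# A tenant whose job needs fewer VM-hours under guaranteed bandwidth can still
# pay less overall, even at the higher per-hour price (hypothetical job sizes).
job_hours_base, job_hours_vn = 16, 4
print(job_hours_base * base_price, job_hours_vn * vn_price)
```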

Oktopus Deployment
Deployment on a testbed with 25 end hosts arranged in five racks. The implementation scales well and imposes low overhead:
- Allocation of virtual networks is fast: in a datacenter with 10^5 machines, the median allocation time is 0.35 ms.
- Enforcement of virtual networks is cheap: the Traffic Control API is used to enforce rate limits at end hosts.

Oktopus Deployment: cross-validation of simulation results. Completion time for jobs in the simulator matches that on the testbed.

Summary
Proposal: offer virtual networks to tenants.
- Virtual network abstractions resemble physical networks in enterprises, making the transition easier for tenants.
- Proof of concept, Oktopus: tenants get guaranteed network performance, with sufficient multiplexing for providers. Win-win: tenants pay less, providers earn more!
How to determine tenant network demands? Ongoing work: map high-level goals (like desired completion time) to Oktopus abstractions.

Thank you

Backup slides
©2011 Microsoft Corporation. All rights reserved. This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.

Other Abstractions
"These are my abstractions and if you don't like them, I have others" ... paraphrasing Groucho Marx.
- Amazon EC2 Cluster Compute: guaranteed 10 Gbps bandwidth (at a high cost, though); tenants effectively get a <N, 10 Gbps> Virtual Cluster.
- Virtual datacenter networks: e.g., SecondNet offers tenants pairwise bandwidth guarantees, so tenants get a clique virtual network. Suitable for all tenants, but limited provider flexibility.
- Virtual networks from the HPC world: many direct-connect topologies, like hypercube, butterfly networks, etc.

Tenant Guarantees vs. Provider Flexibility

Allocation algorithms
Goals for allocation:
- Performance: bandwidth between VMs
- Failure resiliency: VMs in different failure domains
- Energy efficiency: packing VMs to minimize power
- ...
Oktopus allocation protocols can be extended to account for goals beyond bandwidth requirements.

Oktopus: Nits and Warts 1
Oktopus focuses on guaranteed internal network bandwidth for tenants and is a first step towards predictable datacenters. Other contributors to performance variability include bandwidth to the storage tier and external network bandwidth. Virtual networks provide a concise means to capture tenant demands for such resources as well.

Oktopus: Nits and Warts 2
Oktopus semantics: tenants get the bandwidth specified by their virtual network (nothing less, nothing more!); spare network capacity is used by tenants without virtual networks.
A work-conserving alternative: tenants get guarantees for minimum bandwidth, and spare network capacity is shared amongst the tenants who can use it. This can be achieved through work-conserving enforcement mechanisms.

Hose Model
The hose model allows flexible expression of tenant demands in VPN settings and improves on the pipe model [SIGCOMM 1999]; the virtual cluster abstraction has the same shape. The allocation problem, however, is different: with virtual clusters, VMs can be allocated anywhere, whereas in the hose model tenant locations are fixed and the task is to determine the mapping of virtual to physical links.