Presentation on theme: "Managing Open vSwitch Across a Large Heterogeneous Fleet"— Presentation transcript:
1 Managing Open vSwitch Across a Large Heterogeneous Fleet
Chad Norgan, Systems Engineer
BeardyMcBeards in #openvswitch
2 About Rackspace
- Annualized revenue: over $1B
- We serve over 60% of the Fortune® 100
- Global footprint: 300,000+ customers across multiple countries
- ≅70 PB stored
- 5,000+ Rackers
- 9 worldwide data centers
- Portfolio of hosted solutions: Dedicated, Cloud, Hybrid
3 Rackspace’s Public Cloud
Large fleet, heterogeneous:
- Several different hardware manufacturers
- Several XenServer major versions (sometimes on varying kernels)
- Five networking configurations
- Six production public clouds
- Six internal private clouds
- Various non-production environments
- Tens of thousands of hypervisors
- Hundreds of thousands of virtual machines
- Interfaces
- Worth mentioning the # of kernel versions?
4 Networks Available to Customers
- Public Net: IPv4 & IPv6 publicly accessible network; bandwidth metered
- Service Net: DC-routable IPv4 address; access to other Rackspace products; unmetered bandwidth
- Cloud Networks: NSX L2 overlay network; extendable to dedicated hardware via NSX Gateways
5 Our History With OVS
- Rackspace has used Open vSwitch since version 0.9
- Behind most of First Generation Cloud Servers (Slicehost)
- Powers 100% of Next Generation Cloud Servers
- Upgraded OVS nine times since the launch of Next Gen Public Cloud in August 2012
6 Why We Use OVS
- Service provider features: overlay networks, QoS, VLAN tagging, port security, LACP
- Software = Flexible: upgrades are easier than hardware
7 Our Favorite Improvements
- OVS 1.7: Save & restore datapath flows during kmod reload
- OVS 1.9: Logging removed from main loop, faster flow setups
- OVS 1.10: Collapsed datapath & flow-eviction-threshold raised to 2500
- OVS 1.11: Megaflows & wildcarding
- OVS 2.0: Multi-threading!
- OVS 2.1: flow-limit replaces flow-eviction-threshold & TCP flags
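Because the flow-tuning knob changed name at OVS 2.1, a fleet running mixed versions has to pick the right setting per host. The following is a minimal sketch, not taken from the talk: the helper names, the default bridge name `br0`, and the assumption that every pre-2.1 host accepts the bridge-level `flow-eviction-threshold` key are all illustrative. It only builds the `ovs-vsctl` argument list and does not run it.

```python
def parse_version(text):
    """Turn a version string such as '1.10' or '2.1.0' into (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def flow_tuning_command(ovs_version, value, bridge="br0"):
    """Build the ovs-vsctl arguments for the flow-count knob of this version.

    Assumption: from 2.1 on, the limit is the global other_config:flow-limit;
    before that, it is the per-bridge other_config:flow-eviction-threshold.
    """
    if parse_version(ovs_version) >= (2, 1):
        return ["ovs-vsctl", "set", "Open_vSwitch", ".",
                f"other_config:flow-limit={value}"]
    return ["ovs-vsctl", "set", "Bridge", bridge,
            f"other_config:flow-eviction-threshold={value}"]
```

Comparing versions as integer tuples matters here: a plain string comparison would sort "1.10" before "1.9".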
9 Mission Accomplished!
We moved the bottleneck! New bottlenecks:
- Guest OS kernel configuration
- Xen Netback/Netfront driver
10 Challenges of Upgrading OVS
- Matching the OVS kernel module to both the running and staged kernel
  - Hypervisor updates often come with a newer kernel
  - We often don’t immediately reboot
  - Running kernel != kernel at next reboot
  - Detect both kernels and install both sets of OVS kernel modules
- Heterogeneous scale
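The "detect both kernels" step can be sketched roughly as follows. This is an illustration under stated assumptions, not the tooling from the talk: the function names are hypothetical, the module is assumed to live somewhere under `/lib/modules/<kernel>/` with a filename starting with `openvswitch.ko`, and finding the staged kernel is left to the caller because it depends on the bootloader.

```python
from pathlib import Path


def kernels_to_cover(running, staged=None):
    """Return the distinct kernel versions that need an OVS kernel module."""
    kernels = [running]
    if staged and staged != running:
        kernels.append(staged)
    return kernels


def has_ovs_module(kernel, modules_root="/lib/modules"):
    """True if an openvswitch kernel object exists in this kernel's module tree.

    Assumption: the file is named openvswitch.ko, optionally with a
    compression suffix, somewhere below <modules_root>/<kernel>/.
    """
    tree = Path(modules_root) / kernel
    if not tree.is_dir():
        return False
    return any(True for _ in tree.rglob("openvswitch.ko*"))


def kernels_missing_ovs(running, staged=None, modules_root="/lib/modules"):
    """List the kernels (running and staged) that still lack an OVS module."""
    return [k for k in kernels_to_cover(running, staged)
            if not has_ovs_module(k, modules_root)]
```

On a hypervisor, `running` could come from `platform.release()`; an empty result from `kernels_missing_ovs` would then mean both the current boot and the next one are covered.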
11 OVS Upgrade Solution
Playbook-style upgrades:
- Asynchronous plays with parallel limits
- Extensible
- Easy to build validations and pre-checks to prevent unwanted impact
We would not be able to achieve the velocity of improvements at our scale without it. It allows us to make very complex changes with confidence.
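The idea of a parallel limit combined with pre-checks can be illustrated outside any particular playbook tool. The sketch below is a plain-Python stand-in, not the speaker's playbooks: `rollout`, `precheck`, and `upgrade` are hypothetical names, and a thread pool is swapped in for the playbook engine's own concurrency control.

```python
from concurrent.futures import ThreadPoolExecutor


def rollout(hosts, precheck, upgrade, parallel_limit=5):
    """Upgrade hosts with at most `parallel_limit` in flight at once.

    Hosts that fail `precheck` are skipped so the upgrade never touches
    them; an exception from `upgrade` is recorded instead of aborting
    the whole run.
    """
    def run_one(host):
        if not precheck(host):
            return host, "skipped"
        try:
            upgrade(host)
        except Exception as exc:
            return host, f"failed: {exc}"
        return host, "upgraded"

    with ThreadPoolExecutor(max_workers=parallel_limit) as pool:
        return dict(pool.map(run_one, hosts))
```

Keeping the pre-check inside the per-host task means a validation runs immediately before that host's change rather than once for the whole batch.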
19 Ansible + OVS = Flexible Network Rewiring
[Diagram labels: Public Net Bridge, Public Net Bridge_old, Interface Bridge, Patch Port, VIF, PIF]
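The diagram connects bridges with patch ports. As a rough sketch of what creating one such pair involves, the helper below builds the two `ovs-vsctl add-port` invocations for a patch-port pair. The function name and the bridge and port names in the usage note are hypothetical, and the slide does not show the actual tasks used, so this is not the speaker's procedure.

```python
def patch_pair_commands(bridge_a, port_a, bridge_b, port_b):
    """Build ovs-vsctl argument lists that link two bridges with patch ports.

    Each port is created with type=patch and pointed at the other port
    through options:peer. The commands are returned, not executed.
    """
    def one_side(bridge, port, peer):
        return ["ovs-vsctl", "add-port", bridge, port,
                "--", "set", "Interface", port,
                "type=patch", f"options:peer={peer}"]

    return [one_side(bridge_a, port_a, port_b),
            one_side(bridge_b, port_b, port_a)]
```

For example, `patch_pair_commands("br-public", "patch-to-if", "br-if", "patch-to-public")` yields one command per bridge, which a task runner could then execute in order.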
20 Measuring OVS – PavlOVS.py
Publishes metrics to StatsD/Graphite:
- Per-bridge byte, packet, and OpenFlow flow counts
- Datapath hit, missed, lost, and flow counts
- Open vSwitch CPU utilization
- Instance count
- Tunnels configured and in fault state
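To make the datapath metrics concrete, here is a minimal sketch of collecting the hit, missed, lost, and flow counts and formatting them for StatsD. It is not the PavlOVS.py source: the metric prefix and function names are made up for illustration, and the parser assumes the `lookups:` and `flows:` lines as printed by `ovs-dpctl show`.

```python
import re
import socket

LOOKUPS = re.compile(r"lookups:\s+hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)")
FLOWS = re.compile(r"^\s*flows:\s+(\d+)", re.MULTILINE)


def parse_dpctl_show(text):
    """Extract hit/missed/lost/flows counters from `ovs-dpctl show` output."""
    stats = {}
    lookups = LOOKUPS.search(text)
    if lookups:
        stats.update(zip(("hit", "missed", "lost"), map(int, lookups.groups())))
    flows = FLOWS.search(text)
    if flows:
        stats["flows"] = int(flows.group(1))
    return stats


def statsd_lines(stats, prefix="ovs.datapath"):
    """Format counters as StatsD gauge lines: <name>:<value>|g."""
    return [f"{prefix}.{name}:{value}|g" for name, value in sorted(stats.items())]


def send(lines, host="127.0.0.1", port=8125):
    """Send each metric line to a StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

The hit, missed, and lost values are cumulative since the datapath was created, so they are sent as gauges here and a rate would be derived on the Graphite side.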