Keeping My (Telco) Cloud Afloat
SFQM and Doctor
Emma Foley, Intel
Maryam Tahhan, Intel
Carlos Gonçalves, NEC
Ryota Mibu, NEC
Outline
- Introduction
- The project formerly known as SFQM
- Doctor
- Demo
- Summary
Components
I'll start by introducing you to a few open source components that we'll be discussing today.
- DPDK: an open source project providing a set of libraries and drivers that enable fast packet processing in user space.
- collectd: a Unix daemon that collects, transfers and stores performance data from computers and network equipment. The acquired data helps system administrators maintain an overview of available resources and detect existing or looming bottlenecks.
- OpenStack: a set of tools for building and managing cloud computing platforms. It provides a set of core components to manage network, compute and storage, as well as many optional components for enhanced functionality.
- Open Platform for NFV (OPNFV): a carrier-grade, integrated, open source platform to accelerate the introduction of new NFV products and services. It is a Collaborative Project hosted by The Linux Foundation.
By combining these components, we can provide a solution for data center fault detection and management. Let's step back for a minute and ask: why is this automatic detection and maintenance so important?
"Data centres are powering our everyday lives. Organizations lose an average of $138,000 for one hour of downtime." [1]
As the world becomes increasingly dependent on the internet, data centers have come to power our everyday lives. Taking two things into account, firstly the transition to SDN/NFV and secondly the cost of data center downtime, the telco and enterprise industries are asking how they can get and provide Service Assurance (SA), Quality of Service (QoS) and SLAs on the platform and services when deploying NFV. It is vital to monitor systems for malfunctions or misbehaviours that could lead to service disruption, and to promptly react to these faults and events to minimize service disruption and downtime.
If fixed-function appliances are going to be replaced by virtualized appliances, the service levels, manageability and service assurance need to remain consistent with, or improve on, what is available today. Lack of service assurance in NFV is a barrier in the path to adoption and deployment. A key part of breaking down this barrier is expanding the amount of data available from the platform (e.g. DPDK statistics) and improving alarming functionality in OpenStack Aodh.
Barometer/telemetry4NFV Overview
The ability to monitor the Network Function Virtualization Infrastructure (NFVI) where VNFs are in operation will be a key part of Service Assurance within an NFV environment, in order to:
1. Enforce SLAs
2. Detect violations and faults
3. Detect degradation in the performance of NFVI resources, so that events and relevant metrics are reported to higher-level fault management systems
The output of the project will provide interfaces to support monitoring of the NFVI. The project will not enforce or take any actions based on the gathered information.
Collecting Statistics and Events with collectd
[Diagram: collectd input plugins (Legacy/IPMI, BIOS, RAS, RDT, OVS, DPDK) gather stats and events from the platform; output plugins (Ceilometer, SNMP) relay them to OpenStack/fault management applications and MANO.]
So what are the SFQM features we can use to collect stats and events from the platform and relay that information back to Ceilometer or a fault management application? The features fall into two categories: platform/application features (open APIs) and collectd plugins.
On the platform/application side, the extended statistics API augments the generic stats API by exposing detailed statistics:
- Legacy = IPMI (platform thermals, voltage info, fan speed)
- BIOS = version, manufacturer, vendor and other info
- RAS = Reliability, Availability and Serviceability features (reporting Machine Check Errors, i.e. HW errors that are corrected/reported by the HW to SW)
- RDT = Resource Director Technology (last-level cache occupancy and memory bandwidth utilization per core)
- OVS = stats and events
- DPDK = xstats, keep alive + link status
On the collectd side, we implemented a number of plugins to retrieve stats and events from the open APIs and relay them to higher-level fault management applications through output plugins such as the Ceilometer and SNMP plugins (note: the SA core team is working on the SNMP plugin; it is not available in collectd yet). A minimal sketch of this plugin model follows below.
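To make the plugin model concrete, here is a minimal sketch of a collectd read plugin written against collectd's Python plugin API. The SFQM plugins themselves are native collectd plugins written in C, and the plugin name and metric below are hypothetical; the point is only that anything dispatched this way fans out to whichever output plugins (Ceilometer, SNMP, ...) are loaded.

```python
# Minimal collectd read plugin sketch (collectd Python plugin API).
# The "collectd" module is injected by collectd when this file is
# loaded via <Plugin python>; the script is not meant to run standalone.
import collectd

INTERFACE = 'eth0'  # hypothetical interface name

def read_rx_packets():
    """Dispatch the received-packet count for INTERFACE."""
    with open('/proc/net/dev') as f:
        for line in f:
            if line.strip().startswith(INTERFACE + ':'):
                fields = line.split(':', 1)[1].split()
                vl = collectd.Values(type='derive')
                vl.plugin = 'example_sfqm'       # hypothetical plugin name
                vl.type_instance = 'rx_packets'
                # dispatch() hands the value to every loaded output
                # plugin (Ceilometer, SNMP, write_*, ...).
                vl.dispatch(values=[int(fields[1])])  # 2nd field = rx packets
                return

collectd.register_read(read_rx_packets)
```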
Collectd + DPDK Statistics
dpdkstat plugin: merged!
- Runs as a DPDK secondary process
- Monitors the DPDK primary application
- Reads extended NIC statistics (xstats)
- Publishes statistics to collectd
dpdkevents plugin: will be upstreamed soon
- Runs as a DPDK secondary process
- Monitors DPDK primary application liveness and link status
- Publishes liveness and link status notifications, either when a failure occurs or continuously
Collectd + OVS Statistics and Events
OVS stats plugin features (see the illustrative sketch below):
- Subscribes to DB table events relevant to stats, per interface
- DPDK agnostic
OVS events plugin features:
- Connect/disconnect handling
- Subscribes to DB table events
- Custom requests
- DB echoes for liveness
Upstreaming at 71
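As a rough illustration of the data the OVS stats plugin exposes, the sketch below polls the statistics column of an OVS Interface row using the standard ovs-vsctl CLI. The real plugin instead subscribes to OVSDB table updates, as listed above; the interface name here is hypothetical.

```python
# Rough approximation of the per-interface stats the OVS stats plugin
# gathers, using the ovs-vsctl CLI. The actual plugin subscribes to
# OVSDB table updates rather than polling; this is only illustrative.
import subprocess

def ovs_interface_stats(iface: str) -> dict:
    """Return the statistics column of an OVS Interface row as a dict."""
    out = subprocess.check_output(
        ['ovs-vsctl', 'get', 'Interface', iface, 'statistics'],
        text=True,
    ).strip()
    # Output looks like: {collisions=0, rx_bytes=1234, rx_packets=10, ...}
    stats = {}
    for pair in out.strip('{}').split(', '):
        key, _, value = pair.partition('=')
        if key:
            stats[key] = int(value)
    return stats

if __name__ == '__main__':
    print(ovs_interface_stats('dpdk0'))  # hypothetical interface name
```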
Collectd with RAS Events and Stats
Reliability, Availability and Serviceability (RAS) features:
- Reporting Machine Check Exceptions (MCEs) from mcelog
- Where possible, reporting metrics relevant to MCEs
Machine Check Exceptions are hardware errors that are corrected by the HW and reported to SW. RAS is a set of reliability, availability and serviceability features of the platform. The data that can be extracted from the platform is mainly events, but there are some metrics as well. RAS features detect and attempt to repair faults, and report faults when they are detected, whether or not they were corrected (the sketch below shows where this data comes from).
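As an assumption-laden illustration only (not the plugin's actual implementation), this sketch queries a running mcelog daemon via its client mode and prints the raw report; a real plugin would parse corrected/uncorrected error counts out of it and dispatch them as collectd values or notifications.

```python
# Illustrative sketch: query a running mcelog daemon for accumulated
# machine check errors. Assumes mcelog is running in daemon mode and
# that "--client" asks it to dump its current error state.
import subprocess

def query_mcelog() -> str:
    """Return the raw error report from the mcelog daemon."""
    try:
        return subprocess.check_output(
            ['mcelog', '--client'], text=True, timeout=5)
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        return f'mcelog unavailable: {exc}'

if __name__ == '__main__':
    print(query_mcelog())
```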
Collectd with RDT Statistics
Resource Director Technology, per core group:
- Last Level Cache (LLC) occupancy
- Local memory bandwidth
- Remote memory bandwidth
Merged to collectd master.
Collectd + Legacy and BIOS plugins
Legacy: leverages the IPMI interface to retrieve platform thermals, voltage info, fan speeds and so on.
BIOS: retrieves version, manufacturer, vendor and other info from the SMBIOS table. The sketch below shows equivalent queries from standard CLIs.
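For reference, both data sources are reachable from standard command-line tools; the sketch below shows equivalent queries with ipmitool and dmidecode (both typically require root), while the plugins themselves consume these interfaces natively inside collectd.

```python
# Illustrative sketch of what the legacy/BIOS plugins gather, using
# the standard ipmitool and dmidecode CLIs.
import subprocess

def ipmi_sensors() -> str:
    """Dump platform sensor readings (thermals, voltages, fan speeds)."""
    return subprocess.check_output(['ipmitool', 'sensor'], text=True)

def smbios_bios_info() -> str:
    """Dump BIOS vendor/version/release info from the SMBIOS table."""
    return subprocess.check_output(['dmidecode', '-t', 'bios'], text=True)

if __name__ == '__main__':
    print(smbios_bios_info())
    print(ipmi_sensors())
```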
Collecting OVS Interface Events + Stats with collectd
[Diagram: collectd on the platform reads stats from a DPDK application (dpdkstat) and from the OVS DB (OVS stats + events plugins); values are dispatched inside collectd, passed to the collectd Ceilometer plugin, and posted northbound to Ceilometer/MANO-VNFM. Flow: 1. Read → 2. Get stats → 3. Dispatch values → 4. Pass values → 5. Post values.]
Note: dpdkstat retrieves stats directly from DPDK and can be used with ANY DPDK application; OVS stats retrieves stats from the OVS DB and can be used with OVS only.
Collecting OVS Interface Events + Stats with collectd (with DPDK)
[Diagram: the same collection flow as the previous slide, with OVS running the DPDK datapath.]
Fault Management Application
Status Update
[Diagram: the plugin architecture from the overview slide, with each plugin annotated by its status: upstreamed, being upstreamed, or under implementation.]
Doctor
Project in OPNFV working on building an open-source NFVI fault management and maintenance framework to ensure telco VNF availability in fault and maintenance events:
- Identify requirements
- Gap analysis
- Implementation work in upstream
- Integration and testing
Key requirements:
- Consistent Resource State Awareness
- Immediate Notification
- Extensible Monitoring
- Fault Correlation
Status Update II
- Taking advantage of the notification plugin architecture in collectd to post an event (such as a link status failure or application thread failure) directly to the notification bus for immediate alarming in Aodh (see the sketch below)
- Performance, scalability and aggregation analysis
- Gnocchi integration
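A minimal sketch of such a notification handler, again using collectd's Python plugin API. The actual publisher that bridges onto the OpenStack notification bus is deployment specific, so this sketch only logs the event.

```python
# Minimal collectd notification handler sketch (Python plugin API).
import collectd

def handle_notification(notif):
    # notif carries severity, message and the originating plugin,
    # e.g. a link-down event published by dpdkevents.
    if notif.severity == collectd.NOTIF_FAILURE:
        collectd.error('FAILURE from %s: %s' % (notif.plugin, notif.message))
        # A real handler would publish this onto the notification bus
        # (e.g. via oslo.messaging) so Aodh can evaluate event alarms.

collectd.register_notification(handle_notification)
```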
Doctor: fault management use case
Doctor: mapping to the OpenStack ecosystem
Doctor: focus of initial contributions
- Consistent Resource State Awareness
- Immediate Notification
Doctor: extending contribution focus
- Consistent Resource State Awareness
- Immediate Notification
- Extensible Monitoring
- Fault Correlation
Doctor Inspector
The module has the ability to:
- receive various failure notifications regarding physical resource(s) from Monitor module(s)
- find the affected virtual resource(s) by querying the resource map in the Controller module
- update the state of the virtual resource (and physical resource)
It has drivers for different types of events and resources, and uses a failure policy database. A sketch of this reaction path follows below.
Why a failure policy database?
"Failure" can be subjective. It depends on:
- Applications (VNFs)
- Back-end technologies used in the deployment
- Redundancy of the equipment/components
- Operator policy
- Regulation
- Topologies of network / power supply
So "failure" has to be dynamically configurable, case by case (see the sketch below).
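A minimal sketch of such a configurable policy lookup; the table contents are invented for illustration, and a real Doctor deployment would load operator-defined policy from a database rather than a hard-coded dict.

```python
# Minimal failure-policy lookup sketch (illustrative values only).
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyKey:
    event_type: str     # e.g. "compute.host.down"
    resource_type: str  # e.g. "nova-compute"

# (event, resource) -> action to take; contents are invented.
FAILURE_POLICY = {
    PolicyKey('compute.host.down', 'nova-compute'): 'force_down_and_notify',
    PolicyKey('link.status.down', 'ovs-port'): 'notify_only',
}

def decide(event_type: str, resource_type: str) -> str:
    """Return the configured action, defaulting to ignoring the event."""
    return FAILURE_POLICY.get(PolicyKey(event_type, resource_type), 'ignore')

if __name__ == '__main__':
    print(decide('compute.host.down', 'nova-compute'))  # force_down_and_notify
```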
Doctor Inspector: OpenStack Congress
Governance as a Service:
- Define and enforce policy for cloud services
- Dynamic data collection from OpenStack services
- Flexible policy definition for correlation (Datalog)
- Well integrated with other OpenStack projects
Policy example:

```
host_down(host) :-
    doctor:events(hostname=host, type="compute.host.down", status="down")

execute[nova:services.force_down(host, "nova-compute", "True")] :-
    host_down(host)
```

The first rule derives host_down from a pushed Doctor event; the second reactively executes the Nova force-down call whenever host_down holds.
Congress PushType DataSource Driver
Congress Doctor Driver
Doctor blueprints in OpenStack
Project  | Blueprint/Spec                               | Drafter                   | Developer            | Status
Aodh     | Event Alarm Evaluator                        | Ryota Mibu (NEC)          |                      | Completed (Liberty)
Nova     | New nova API call to mark nova-compute down  | Tomi Juvonen (Nokia)      | Roman Dobosz (Intel) |
Nova     | Support forcing service down                 | Carlos Goncalves (NEC)    |                      |
Nova     | Get valid server state                       |                           |                      | Completed (Mitaka)
Nova     | Add notification for service status change   | Balazs Gibizer (Ericsson) |                      |
Nova     | Maintenance Reason to Server                 |                           |                      | WIP (Ocata)
Congress | Push Type Datasource Driver                  | Masahito Muroi (NTT)      |                      |
Congress | Adds Doctor Driver                           |                           |                      |
Neutron  | Port data plane status                       |                           |                      |
collectd Ceilometer Plugin
SFQM + Doctor
[Diagram: 1. The collectd dpdkstat plugin reads stats, 2. getting them from OVS with DPDK (RX/TX toward a VM); 3. values are dispatched inside collectd, 4. passed to the collectd Ceilometer plugin, and 5. posted to Ceilometer.]
Demo
Summary
"Trying to manage a complex cloud solution without a proper telemetry infrastructure in place is like trying to walk across a busy highway with blind eyes and deaf ears. You have little to no idea of where the issues can come from, and no chance to take any smart move without getting in trouble." [2]
Doctor: painting the pedestrian crossing.
References
[1]
[2]
Legal notices and disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2016 Intel Corporation.