Unlocking Systems and Data: The Key to Network Management Innovation Charles Kalmanek Internet & Network Systems Research V.P. AT&T Labs-Research 2006.

1 Unlocking Systems and Data: The Key to Network Management Innovation
Charles Kalmanek, V.P., Internet & Network Systems Research, AT&T Labs-Research
2006 IEEE/IFIP Network Operations and Management Symposium

2 CRK 6/2/2014 Vision for IP Network Management
Goal: A robust, global, multi-service IP/MPLS network
Approach
– Manage the entire network, not network elements
– Instrument the network, rely on direct correlation of real data
– Model interactions to predict the effects of actions in advance
– Automate as much as possible, audit results
Diagram labels: Network; measure; control; Network-wide model; auditing, what-if, etc.; Topology, Configuration, Workflow; Offered Traffic, Routing, Fault; Provisioning, Changes to the Network; Design goals, policies

3 Why It's Hard
Scale & Diversity Challenges
– Large, distributed networks (100,000s of NEs)
– Complex, diverse building blocks
– Ongoing maintenance, spanning multiple time zones
– Fragile IP network control planes
– Complex software systems on top
Constant change
– Architectural change, new features & services, new protocols…
– Customers join, leave, change/upgrade service
– Network events: failures, migrations, upgrades, etc.
Measurement and data challenges
– Inadequate implementation of the basics
– Data often locked up in NM systems smokestacks
– Diverse data sources, with highly variable data quality
– Limited direct measurements of causality
– Inadequate ability to trace events across the network

4 Tier-1 Service Provider Network
Legend
– PoP: Point-of-Presence
– P: Backbone (core) Router
– PE: Provider Edge Router
– CE: Customer Edge Router
Diagram labels: Customer Network; CPE; Access Network; LEC PoP; Metro; Intercity; PoP; Customer facing PE interfaces; OC-48 or OC-192 DWDM; DWDM systems
Rough stats
– 100s of offices
– 100s of Ps, 1000s of PEs, 10000s of CEs
– 100,000s of transport facilities
(Enterprise customer networks rival ISPs in size & complexity!)

5 Unlocking Network Data
Measurement data is essential to running the network
– Marketing and customer acquisition
– Network and customer care
– Network engineering and capacity management
– Research to improve / evolve the network
If you don't have the data, you can't design, manage, secure, or improve the network
If you can't evolve systems, you can't evolve the network
Example 1: Fault/performance management
Example 2: Router provisioning

6 Network Troubleshooting
Goals
– Automate the entire life cycle of event detection and repair for every performance-impacting event
    – Detect, Localize, Diagnose, Fix, Verify
– Drive short- and long-term network, operations & systems improvements
    – Use forensics to reveal chronic events
Systems and Tools
– Active and passive performance monitoring
    – Each data source has its unique value and limitations
– Maintenance and troubleshooting require correlation across multiple data sets
    – Associations of customers to access circuits, router interfaces, network policies, network elements, monitoring systems, …

7 Example: Cross-Layer Troubleshooting
– IP composite link: multiple SONET links combined together
    – Example: 5 OC192s
– IP routing does not take bandwidth into account
    – On component failure: how to decide between mechanisms to take traffic off the link, as a function of remaining capacity?
Diagram labels: LA; NY; Logical IP link; 3 units of traffic; 1 unit of capacity; congestion
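
The decision posed on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the `CompositeLink` type, the `should_cost_out` helper, and the 80% utilization threshold are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class CompositeLink:
    """Hypothetical model of an IP composite link built from equal components."""
    members_up: int          # components currently in service
    member_capacity: float   # capacity units per component
    offered_traffic: float   # traffic units currently routed over the link


def remaining_capacity(link: CompositeLink) -> float:
    """Capacity left after component failures."""
    return link.members_up * link.member_capacity


def should_cost_out(link: CompositeLink, max_utilization: float = 0.8) -> bool:
    """Return True when the surviving components cannot carry the offered load.

    max_utilization is an assumed engineering threshold, not a value from the slides.
    """
    capacity = remaining_capacity(link)
    if capacity == 0:
        return True
    return link.offered_traffic / capacity > max_utilization
```

With the figures from the slide's diagram (3 units of traffic, 1 unit of capacity left), `should_cost_out` returns `True`; with all 5 components up and the same load it returns `False`.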

8 Example: Cross-Layer Troubleshooting (cont.)
Detect: Packet loss from active measurements for a set of PE pairs
Localize/Diagnose:
– Temporal correlation: PE-PE measurement alerts occurring at the same time as flapping on several composite link members
– Spatial correlation: paths where packet loss occurs contain flapping composite link components (PE-PE measurements mapped to paths via route monitoring)
Diagnose: Congestion due to composite link component flapping
Fix:
– Short term: cost out the link
– Permanent: repair failing components
Verify: Packet loss alerts disappear
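
The temporal and spatial correlation steps above could be sketched roughly as follows. The record layouts (`pe_pair`, `time`, `link`), the 60-second window, and the `paths` mapping standing in for route-monitoring output are all assumptions made for illustration.

```python
def temporally_correlated(alerts, flaps, window=60.0):
    """Pair each loss alert with every link flap within `window` seconds of it."""
    return [
        (alert, flap)
        for alert in alerts
        for flap in flaps
        if abs(alert["time"] - flap["time"]) <= window
    ]


def spatially_correlated(pairs, paths):
    """Keep only pairs where the flapping link lies on the alerting PE pair's path.

    `paths` maps a PE pair to the links on its current route (assumed to come
    from a route monitor).
    """
    return [
        (alert, flap)
        for alert, flap in pairs
        if flap["link"] in paths.get(alert["pe_pair"], ())
    ]


# Example data, entirely made up for the sketch.
alerts = [{"pe_pair": ("PE1", "PE2"), "time": 100.0}]
flaps = [
    {"link": "LA-NY:3", "time": 130.0},   # close in time, on the path
    {"link": "CHI-DC:1", "time": 120.0},  # close in time, not on the path
    {"link": "LA-NY:3", "time": 900.0},   # on the path, far apart in time
]
paths = {("PE1", "PE2"): ["LA-NY:3", "NY-BOS:1"]}

suspects = spatially_correlated(temporally_correlated(alerts, flaps), paths)
```

Only the first flap survives both filters, which is the kind of narrowing the Localize/Diagnose step describes.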

9 Example: Chronic Control Plane Outage
Detect
– Active performance monitoring shows high loss at a PE
Localize/Diagnose
– Correlation of performance alerts, fault data, routing updates, configuration, and workflow logs reveals recurring pattern
    – OSPF sessions flap during customer provisioning on some PE platforms
– Diagnosis: BGP starves OSPF processing on this class of PEs
Fix
– Short-term: process changes to control provisioning on this class of PE
– Long-term: better OSPF and BGP process scheduler for PE
Verify
– High loss disappears at the PE

10 Data Distribution Problem
– Many, diverse data feeds required
– Labor-intensive and error-prone to create and maintain each feed
– Ad-hoc development to convert, copy, encrypt, & ingest the data
– Several groups with business-critical functions need network data
– Stringent delivery requirements (security, timeliness, reliability)
Network data
– Network inventory
– Route monitors, BGP tables
– SNMP link utilization & faults
– Syslog info (status, health, events)
– Active path monitoring
– Netflow
– Other: workflow, VoIP, transport
Customer data
– Access: location, circuit ID, IP addresses, CE platform, LEC interface, layer 2 info (Frame Relay, Ethernet, DSL, Private Line, …), router info (hardware, software version)
– Trouble tickets
– Performance and SLA reports
– Service orders

11 Data Correlation Framework
Flexible data/systems architecture
– Pluggable data-source specific collectors
– Data distribution bus
– Common real-time and archival data store
– Variety of network management applications on top
Evolving domain knowledge
– It's an iterative process: exploratory data mining (EDM)
    – Apply statistical tools, visualization, hunches, …
    – Export results to case manager for analysis
Diagnosis engines
– Near real-time drill down, forensics
– Temporal and spatial event clustering
– Scalable statistical mechanisms to uncover correlations
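
As a rough illustration of the temporal event clustering mentioned on this slide, the sketch below groups event timestamps into bursts separated by quiet gaps. The gap-based rule and the `cluster_by_time` name are assumptions chosen for simplicity; the talk does not say which clustering method the diagnosis engines use.

```python
def cluster_by_time(timestamps, max_gap):
    """Group timestamps into clusters; a new cluster starts after a gap > max_gap."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # still within the current burst
        else:
            clusters.append([t])     # quiet period ended, start a new burst
    return clusters
```

For example, `cluster_by_time([30, 1, 2, 10, 11], max_gap=3)` yields three clusters: `[1, 2]`, `[10, 11]`, and `[30]`.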

12 Data/Systems Architecture
Diagram labels: Network; SNMP Collector; Netflow Collector; L3 Control Plane Collector; Active Probe Collector; Syslog Collector; CDR Collector; OA&M Topology I/F; Data Distribution Bus (DDB); Data Store Component (DSC); Real-time Network Mgt Applications; End-to-end Reporting Application; Planning Application; Surveillance Application; GUI; Internal Portal; Customer Portal
Data Distribution Bus
– Publish/subscribe system handling all incoming data feeds
– Supports multiple transport options, normalizes data to standard formats
– Reliably delivers data to consumers
Data Store Component
– Efficient long-term storage of operational data
– Automatic generation of schema, loading scripts, access scripts, and data aging, allowing non-DBAs to manage the warehouse
– Network data is available to multiple applications, allowing auditing, correlation, reporting, EDM, …
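
The publish/subscribe idea behind the Data Distribution Bus can be sketched in-process as below. This is a toy, assuming a single Python process and synchronous callbacks; the class name, the `feed` field added during normalization, and the method names are hypothetical and say nothing about how the actual DDB is built (reliability, transports, and security are omitted).

```python
from collections import defaultdict


class DataDistributionBus:
    """Minimal in-memory publish/subscribe bus keyed by feed name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, feed, callback):
        """Register a consumer callback for one feed (e.g. 'syslog')."""
        self._subscribers[feed].append(callback)

    def publish(self, feed, record):
        """Normalize a record and deliver it to every subscriber of the feed.

        Returns the number of consumers the record was delivered to.
        """
        normalized = {"feed": feed, **record}
        consumers = self._subscribers[feed]
        for callback in consumers:
            callback(normalized)
        return len(consumers)


# Usage: one collector publishes, two applications consume the same feed.
bus = DataDistributionBus()
surveillance, reporting = [], []
bus.subscribe("syslog", surveillance.append)
bus.subscribe("syslog", reporting.append)
delivered = bus.publish("syslog", {"router": "pe1", "msg": "link down"})
```

The point of the design is that a collector publishes once and each application subscribes independently, instead of every consumer building its own ad-hoc feed.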

13 Router Provisioning
Goal: translate service intent to network reality
– Get hardware & circuits to the right place at the right time
– Access & update network inventory databases
– Configure routers to establish and verify the service
Challenges
– Huge diversity at network element layer (dependencies on hardware & software versions, physical configuration, vendor, etc.)
– Low-level configuration languages, no abstraction layer, multiple ways of achieving the same thing
– Config generator must consider hardware limitations, service definition, customer order info, additional customer info, etc.
– Commercial tools offer limited customizability, only solve pieces of the problem
– Initial provisioning is only part of the life cycle problem (network-wide changes, firmware mgt, auditing, CE-PE coordination, change requests, …)

14 Configuration File Analysis
Detect/Fix Discords
– Non-compliance to architectural intent
    – e.g., errors in route-maps for VPNs crossing routing domains
– Config time-bombs
    – e.g., gaps in the ACL perimeter defense
Additional Benefits
– Assessment, Bootstrapping automation, Decision Support
Technology
– Parsers, Algorithms, Rules and Queries encoding domain expertise, e.g., ACL analysis
Diagram labels: Router configuration; Low level standard form (tables); Customer/network database; polled; queries; Auditing; Discords; Provisioning; fix
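
The parse-to-tables-then-query approach on this slide might look like the following sketch. It assumes IOS-style configuration text (`interface …`, `ip access-group NAME in`) and a made-up audit rule that every interface needs an inbound ACL; the function names and the rule are illustrative, not the tool described in the talk.

```python
def parse_interfaces(config_text):
    """Turn router configuration text into a table: one row (dict) per interface."""
    rows = []
    current = None
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("interface "):
            current = {"interface": stripped.split(None, 1)[1], "acl_in": None}
            rows.append(current)
        elif stripped == "!":
            current = None  # end of the interface stanza
        elif current is not None and stripped.startswith("ip access-group "):
            parts = stripped.split()
            if len(parts) == 4 and parts[3] == "in":
                current["acl_in"] = parts[2]
    return rows


def find_acl_gaps(rows):
    """Query the table for interfaces with no inbound ACL (assumed audit rule)."""
    return [row["interface"] for row in rows if row["acl_in"] is None]


SAMPLE_CONFIG = """\
interface Serial0/0
 ip address 192.0.2.1 255.255.255.252
 ip access-group PERIMETER in
!
interface Serial0/1
 ip address 192.0.2.5 255.255.255.252
!
"""

gaps = find_acl_gaps(parse_interfaces(SAMPLE_CONFIG))
```

Here `gaps` lists `Serial0/1`, the interface that would be a hole in the ACL perimeter under the assumed rule.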

15 Automated CPE Router Provisioning
– Technical Questionnaire (Service Level)
    – E.g., Web form
– Device/service-specific templates, with embedded variables and callouts to computations and databases
    – E.g., callouts for ports, IP addresses, ACL clauses, …
– Logic: allocations of ports, IP addresses, VRFs, …
– Detailed device configuration commands, bundled as a configlet (Network Element Level)

16 Template-driven Config Generation
Executing templates in a given context (stored in a database) produces configs, similar to code generation
– Evolves easily to integrate new features, router models, access types, resiliency options
– Eliminates errors, reduces holds
– Ensures conformance to engineering guidelines
Example: BGP configuration (annotated on the slide with "Context Substitution" and "Functional Substitution"; the template variables did not survive in this transcript)
router bgp
no synchronization
bgp log-neighbor-changes
network,255.255.255.252)> mask
network mask
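
Because the slide's template variables are missing from the transcript, the sketch below only illustrates the two substitution styles it names, using Python's standard `string.Template` and `ipaddress` modules. The variable names (`local_as`, `lan_cidr`, and so on) and the helper functions are assumptions, not the original template syntax.

```python
import ipaddress
from string import Template

# Hypothetical template; placeholders are filled from the provisioning context.
BGP_TEMPLATE = Template(
    "router bgp $local_as\n"
    " no synchronization\n"
    " bgp log-neighbor-changes\n"
    " network $lan_network mask $lan_mask\n"
)


def network_and_mask(cidr):
    """Functional substitution: compute network address and dotted mask from a CIDR."""
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net.network_address), str(net.netmask)


def render_bgp(context):
    """Context substitution: copy values from the context into the template."""
    network, mask = network_and_mask(context["lan_cidr"])
    return BGP_TEMPLATE.substitute(
        local_as=context["local_as"],
        lan_network=network,
        lan_mask=mask,
    )


config = render_bgp({"local_as": 65001, "lan_cidr": "192.0.2.5/30"})
```

`local_as` is copied straight from the context, while the network and mask are computed by a callout, so adding a new router model or access type would mean editing templates and context data rather than generator code.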

17 Conclusions
Unlocking data and fault/performance management systems enables innovation
– Exploratory data mining and data correlation are essential to forensics and network maintenance automation
– Approach: Flexible data distribution and data storage architecture
Unlocking provisioning systems enables innovation
– Bottom-up analysis is a useful tool for discord detection, etc.
– Template-driven approach allows network engineering to add new network features without new systems development
Challenges are legion…
– How to overcome proprietary data models and systems that thwart forensics?
– How to efficiently find needles in (massive) data haystacks?
– How to raise the level of provisioning abstraction?
– How to reduce the systems drag on network feature and architecture change?
