Operations and Management of IP Networks: What Researchers Should Know

Aman Shaikh Albert Greenberg AT&T Labs (Research) SIGCOMM 2005 Tutorial Aman Shaikh, Albert Greenberg, August 2005

Perceptions…
- IP networks are simple …
  - Best effort service only
  - Simple and stupid core; complex and intelligent edge
- IP networks manage themselves and work just fine…
  - Capable of routing around failures
  - Excess capacity in the core
- IP network operations and management is SNMP
  - SNMP MIBs exist for everything you need to know
  - SNMP is widely supported and deployed

Reality: Limitations of IP
- Application needs: reliable, predictable network service
  - But IP provides only best-effort service
  - But IP network elements are not very reliable
- Operators want fine-grained control over the network
  - But routers do not do fine-grained resource allocation
- Operators want accountability of resources
  - But routers do not maintain state about packet transfers
  - But measurement is not part of the infrastructure
  - SNMP is the only exception, but SNMP is not adequate
- IP was not designed with network management in mind!
- Network management is much more than SNMP!

Reality: Scale and Diversity
- IP networks are large, diverse and complex
  - Network elements, protocols, processes, applications, services
  - IP/Optical integration and general cross-layer interactions
  - Multiple vendors and multiple platforms (even within one vendor)
- IP networks are large and distributed
  - Workflow management and maintenance across multiple time zones, in a highly coupled distributed system
- IP network control plane is diverse and complex
  - Protocols offer numerous complex, overlapping features
  - Protocols sometimes interact in strange and complex ways
- IP supports diverse services and applications
  - Application overlays on the IP infrastructure (e.g., VoIP)
  - Application-level servers, gateways, databases…

Reality: Dynamism
- Failures, maintenance and upgrades are common (e.g., IOS upgrades)
- Technological advances (e.g., ULH), protocol evolution (e.g., SIP)
- Architectural changes (e.g., MPLS)
- Network migrations, convergence (e.g., BGP route-free cores)
- New applications and services (e.g., IPTV)
- Threats: worms, viruses, DDoS, malware, … (e.g., Witty worm)
- Customers leave, join, change/upgrade services, … (e.g., Frame Relay to VPN migrations)
- Traffic fluctuates: routing anomalies, failures, misconfiguration, attacks, flash crowds (e.g., customer-side rebalancing)

Speaker notes: Two IOS and IPriori upgrades happen in the AT&T CBB; that is the target. Bug fixes usually result in more. For example, Jay sent me charts for IPriori upgrades for 2004 and 2005 YTD. In 2004 there were three IPriori roll-outs, plus a special one for dtrmi92av (why? Ask Jay). In 2005 YTD there have been two roll-outs. Route processor upgrades are not that frequent.

How to Build an IP Network?
[Figure: distributed routers, each running OSPF, BGP and LDP processes and holding FIB/LFIB state and packet filters, surrounded by management tooling: shell scripts, traffic engineering tools, planning tools, databases, configs, SNMP, netflow, modems, OSPF link metrics, routing policies.]
- Distributed routers
  - Forwarding, filtering, queuing
  - FIBs, LFIBs, labels
- Multiple routing processes on each router
- Each router with a different configuration program
- Huge number of control knobs: metrics, ACLs, policy
- Plethora of uncoordinated, overlapping network management scripts, tools, databases

Complex Associations Below (and Above) IP Layer
[Figure: seven uplink arrangements between remote access routers (RAR) and backbone routers (BR) through ADMs.]
- Dual uplinks on common ring
- Dual uplinks on ION
- Dual uplinks on different rings, common conduits
- One uplink ring; other ION
- Dual uplinks on different rings, diversely routed
- Uplinks with ring and ION
- Uplinks with unprotected, non-diverse segments

Network Management Systems Interactions
[Figure: interaction diagram of dozens of network management systems with numbered interconnections. Most system names are anonymized (abc, xyz, def, ghi, …). Recoverable groupings: ticket management, management systems 1 to 4, portals, task platforms, a transport system, and databases (Database 1 to 3, DB1 to DB4).]

Complex, Massive Streaming Data
[Figure: first 25 lines describing the data types of an individual VoIP Call Detail Record.]
A simple case!

Tutorial Objectives
- Understanding elements of network management
  - Numbers, network elements, services, systems, processes
  - Problems, solutions
  - Research challenges and opportunities
- Expose the tip of the iceberg
  - To excite you to look deeper and help improve the state of the art
- Related areas: IP/MPLS networking, optical networking, statistics, security, visualization, software, algorithms, machine learning, data mining, automation

How does Network Management Fit In?
- Product Management and Sales
  - Strategy and new technologies: VPNs, IPTV, WiMax, VoIP, CDNs, …
- Network Development
  - Architecture, capacity planning, testing and certification, technology incubation
- Software Development
  - Network management systems; billing systems
- Network Management (Operations)
  - Customer Care
  - Network Care

The tutorial focuses on network management, but touches on the other components as well. Boundaries can be fuzzy: IP Operations often write significant (and creative) code or scripts, for example.

Our Network Management Tutorial
- Lay of the Land
- Network Operations and Management
- VoIP Case Study
- Some Directions and Challenges

Lay of the Land

Lay of the Land
- Physical networking
  - What IP networks look like
  - Topologies, network structures, taxonomies
- Logical networking
  - Routing protocols, MPLS switching

IP Networks
- IP is the most prevalent technology for communication
  - Everything over IP
- Enterprise networks
  - Use IP networking for internal communication needs
  - Typically hierarchical topologies: the right structure for a small set of hubs (data centers) and a huge set of spokes (remote offices)
- Service provider networks
  - Use IP to support a wide range of communication services to a wide range of business and residential customers
  - Mesh-like backbone structure: the right structure for convolving tens of thousands of enterprise and other networks
  - Routers concentrated in PoPs (Points of Presence)
- Both enterprise and service provider networks can have enormous geographic span, and involve thousands of complex network elements
[Callout: Tutorial Focus]

AT&T North America IP Network
[Map: AT&T Global IP Network and Internet Data Centers, North America, November 2004. City-level nodes across the US, Canada, Mexico and Puerto Rico, including NTS nodes, with links to Anchorage, Honolulu, Tokyo, Hong Kong, Sydney and Singapore.]

Speaker notes: This is a detailed view of the AT&T Global IP Network and AT&T’s Internet Data Centers as they existed in November 2004, for North America. As you can see, we have a cross-country OC192 link (NYC-Chicago-SF). We were the first ISP to announce this link, which we brought up in December. We have 18 major nodes interconnected by OC48s/OC192s, as well as 16 additional small nodes (R-GARs) connected with OC48s, and ~100 locations (called remote access routers or RARs) connected with OC3s. This version of the map also shows AT&T’s Non-Traditional Space (NTS) nodes, which are located in carrier-neutral facilities, not AT&T PoPs.

Within the continental US, this network uses all AT&T private line facilities, connected through IP over DWDM technology, where the routers are connected directly into the optical amplifiers of AT&T’s core DWDM systems. Each backbone node (node types: backbone nodes and internet gateway nodes) is connected to at least two others, to allow for automatic re-routing in the event of a physical facility failure. Small nodes (RARs and R-GARs) are connected either via SONET-protected facilities or diversely routed facilities to enhance reliability. The core of the US IP backbone (18 nodes) is designed as a series of physical layer “rings” which use layer 3 re-routing. This means that if a physical facility fails, such as when a fiber is cut and an OC48 or OC192 is lost, the router re-routes via the alternate link(s) out of the node. (The OC48s/OC192s in this network are not “protected” facilities and thus do not receive SONET restoration.)

History: this network was originally designed in 1996 and implemented with 11 nodes and ~40 DS3s. Today it serves 100 cities and is one of the largest IP backbones in the US, serving over 36,000 dedicated customer connections (as of 6/1/03). We have enhanced the design of this network over time as we’ve perceived changes in customer requirements, such as deploying R-GARs in cities where there has been demand for high-speed Internet connections. In addition, most of the link connectivity has changed over time as we’ve redesigned the network for more efficiency and cost-effectiveness.

AT&T EMEA IP Network
[Map: AT&T EMEA IP network, November 2004. City-level nodes from Oslo, Helsinki and Moscow in the north to Lisbon, Athens, Nicosia and Haifa in the south, with links to New York City, Washington DC, Pakistan and South Africa.]

Speaker notes: This is the first of several views which show the Europe, Middle East and Africa (EMEA) region of the AT&T Global IP Network and AT&T’s Internet Data Centers.

CalREN Backbone

Abilene Backbone

Taxonomy of Routers by Roles
- Customer Edge Routers (CE)
  - On the customer premises
- Provider Edge Routers (PE)
  - Terminate access for a large number of customers
    - Complex, customer-specific access control, packet handling, routing policies
    - IP and IP VPN service
    - End-to-end SLAs for on-net services (VPN, VoIP, IPTV, …)
  - Terminate peering for a moderate number of private and public peering points
    - Complex, peer-specific routing policies
    - Bilateral/proprietary peering agreements
- Provider Core Routers (P)
  - WAN transport between and within PoPs
  - High-speed links, high-speed switching, low functionality, high reliability

Tier-1 Service Provider Network
[Figure: PoPs, each containing core (P) and provider edge (PE) routers, interconnected by intercity OC-48 or OC-192 links over DWDM systems. Customer-facing PE interfaces connect through metro and access networks (including LEC access) to CE routers and CPE.]
- Rough stats
  - 100s of offices
  - 100s of Ps, 1000s of PEs, 10000s of CEs
  - 100000s of transport facilities
- Legend
  - P: Backbone (core) Router
  - PE: Provider Edge Router
  - CE: Customer Edge Router

Taxonomy of Links by Roles
- Core links
  - High-speed links: OC48, OC192, n x OCX composite links
  - Core link protection: IP layer
  - Intra-PoP and inter-PoP, carried directly over DWDMs
  - Optical restoration currently has little utility for IP backbones
  - ULH (Ultra-long Haul) technologies may change that
- Edge links
  - Access, peering and network management
  - High-speed links: OCX, Ethernet/OCX
  - Low-speed links: TDM backhauled over transport access network to PE, potentially over multiple carriers
  - Plethora of transport technologies (Ethernet, Cable, DSL, Frame Relay, Wireless), and vendors
  - Edge link protection: IP layer and transport layer
    - Higher speed: SONET rings, Intelligent Optical Networks
    - Lower speed: TDM mesh networks (intelligent networks or centralized control)

Routing
- Routing protocols allow routers to build their FIB (Forwarding Information Base)
  - The FIB contains (next-hop router, outgoing interface) for each prefix and is consulted when the router forwards packets
- Every router performs the following steps:
  - Learn topology information
  - Identify and keep up with changes
  - Calculate the FIB
- Variety of different routing protocols
  - OSPF (Open Shortest Path First) [rfc2328]
  - RIP (Routing Information Protocol) [rfc2453]
  - IS-IS (Intermediate System-Intermediate System) [rfc1195]
  - EIGRP (Cisco proprietary protocol)
  - BGP (Border Gateway Protocol) [rfc1771]
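To make the role of the FIB concrete, here is a minimal Python sketch of a lookup. It assumes the usual longest-prefix-match rule, which the slide does not spell out, and every prefix, next hop and interface name in it is a made-up illustration rather than data from the tutorial.

```python
import ipaddress

# Hypothetical FIB: prefix -> (next-hop router, outgoing interface).
# All prefixes, addresses and interface names are illustrative only.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"): ("192.0.2.1", "POS1/0"),
    ipaddress.ip_network("10.0.0.0/8"): ("192.0.2.5", "POS2/0"),
    ipaddress.ip_network("10.1.0.0/16"): ("192.0.2.9", "POS3/0"),
}


def lookup(fib, destination):
    """Return the entry for the longest prefix that contains destination."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in fib if addr in prefix]
    if not matches:
        return None  # no route: the packet would be dropped
    best = max(matches, key=lambda prefix: prefix.prefixlen)
    return fib[best]


print(lookup(FIB, "10.1.2.3"))      # matched by the /16
print(lookup(FIB, "10.9.9.9"))      # matched by the /8
print(lookup(FIB, "198.51.100.7"))  # falls through to the default route
```

A linear scan keeps the sketch short; real routers use tries or hardware tables for the same decision.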

Taxonomy of Routing Protocols Administrative Hierarchy
[Figure: three interconnected Autonomous Systems: AS 1, AS 2, AS 3.]
- The Internet is a collection of Autonomous Systems (ASes)
  - An AS is roughly a network administered by a single authority
    - ISPs, enterprises, educational institutes, government organizations
  - An AS is identified by an AS Number (ASN)
- Two classes of routing protocols
  - Intra-AS or Interior Gateway Protocol (IGP)
    - Requirements: simplicity, stability, fast convergence
    - OSPF, IS-IS, RIP, EIGRP
  - Inter-AS or Exterior Gateway Protocol (EGP)
    - Requirements: stability, “fast” convergence, scalability, and policy control
    - Policy example: AT&T would not want to provide transit to peers
    - BGP is the only EGP used today!

Taxonomy of Routing Protocols: Topology Information Learned by Each Router
- Link-state routing protocol: each router learns the entire topology
  - Examples: OSPF, IS-IS
- Distance vector protocol: each router learns how far every destination in the network is from each of its neighbors
  - Example: RIP
- Path vector protocol: each router learns each neighbor’s path to every destination
  - Example: BGP
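As a rough illustration of the link-state case, the sketch below shows what a router can do once it has learned the entire topology: run a shortest-path computation (Dijkstra's algorithm, as used by OSPF and IS-IS) and derive the first hop toward every destination. The four-router topology and its link weights are invented for the example.

```python
import heapq

# Hypothetical topology known to every link-state router:
# node -> {neighbor: link weight}. Names and weights are illustrative.
TOPOLOGY = {
    "A": {"B": 10, "C": 50},
    "B": {"A": 10, "C": 10, "D": 100},
    "C": {"A": 50, "B": 10, "D": 10},
    "D": {"B": 100, "C": 10},
}


def shortest_path_next_hops(topology, source):
    """Dijkstra over the full topology.

    Returns {destination: (cost, first_hop)} as seen from source.
    """
    best = {source: (0, None)}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, weight in topology[node].items():
            new_cost = cost + weight
            # Leaving the source, the first hop is the neighbor itself.
            hop = neighbor if node == source else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, neighbor, hop))
    return best


for destination, (cost, first_hop) in sorted(shortest_path_next_hops(TOPOLOGY, "A").items()):
    print(destination, cost, first_hop)
```

A distance vector router, by contrast, would never hold `TOPOLOGY`; it would only see each neighbor's advertised distance per destination.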

IGP in Service Provider Networks
- Most tier-1 service providers use OSPF and IS-IS as IGPs
- For scalability, OSPF and IS-IS allow hierarchical routing
  - Use of areas (OSPF) or levels (IS-IS) to form a hub-and-spoke topology
  - Typically each PoP forms a spoke and inter-PoP links form the hub
  - Link-state routing is used within an area, whereas a distance vector approach is used across areas
- Advantage: reduction of state and processing within each area, problem localization
  - Example: impact of problems in one area can be minimized/hidden from other areas
- Disadvantage: sub-optimal routing, management complexity
[Figure: three PoPs forming Area 1, Area 2 and Area 3, attached through intercity links that form Area 0.]

BGP in Service Provider Networks
- BGP is used to learn routes from neighbor ASes (peers and customers)
  - PE routers form eBGP (external BGP) sessions with CE routers (customers) or PE routers (peers)
- A PE router sends each externally learned route to all routers in the service provider AS
  - The PE router forms iBGP (internal BGP) sessions with all routers (PE and P) in the AS
- iBGP scalability
  - Routers (PE and P) have to form a full mesh, which does not scale beyond a few tens of routers
  - Form clusters of routers (the cluster leader is called a Route Reflector)
    - Typically routers in a PoP form a cluster
  - Disadvantage: information hiding, complicated routing and management
[Figure: eBGP sessions at the edge of the AS; iBGP sessions between routers and Route Reflectors inside it.]
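The scaling argument is simple arithmetic: a full mesh of n routers needs n(n-1)/2 iBGP sessions. The Python sketch below compares that with one possible route-reflector layout. The layout (a full mesh among all reflectors, every other router peering only with the reflectors of its own PoP) and the sizes used are assumptions chosen for illustration, not figures from the tutorial.

```python
def full_mesh_sessions(routers):
    """Number of iBGP sessions when every router peers with every other."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(pops, routers_per_pop, reflectors_per_pop):
    """Sessions under an assumed layout: all reflectors are fully meshed,
    and each non-reflector peers with every reflector in its own PoP."""
    reflectors = pops * reflectors_per_pop
    clients_per_pop = routers_per_pop - reflectors_per_pop
    client_sessions = pops * clients_per_pop * reflectors_per_pop
    return full_mesh_sessions(reflectors) + client_sessions


# Illustrative sizes only: 20 PoPs, 10 routers each, 2 reflectors per PoP.
print(full_mesh_sessions(20 * 10))
print(route_reflector_sessions(pops=20, routers_per_pop=10, reflectors_per_pop=2))
```

With these made-up sizes the full mesh needs 19,900 sessions and the clustered layout 1,100, which is the trade the slide describes: far fewer sessions at the cost of information hiding.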

MPLS (Multi-Protocol Label Switching)
- Outgrowth of IP switching technologies
  - E.g., Ipsilon’s IP switching, Cisco’s tag switching
- Key concept: separate routing (i.e., selection of paths) from forwarding/switching
- Traditional forwarding: each router along the path looks at the destination address
  - Even though the routing protocol has already determined the path!
- MPLS-based forwarding: assign a label to a path and switch packets based on the label at each router
  - Gives rise to a Label Switched Path (LSP)
- Packet layout: Layer 2 Header | PID | MPLS Label 1 | MPLS Label 2 | … | MPLS Label n | Layer 3 Packet
- Each label entry: Label (20 bits) | CoS (3 bits) | Stack (1 bit) | TTL (8 bits)
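The four fields of a label entry add up to 32 bits, so each entry fits in one 4-byte word. The following Python sketch packs and unpacks such an entry using the field widths listed above; the function names and the sample values are illustrative.

```python
def pack_label_entry(label, cos, bottom_of_stack, ttl):
    """Pack one MPLS label stack entry into 4 bytes (network byte order).

    Layout, most significant bits first:
    label (20 bits) | CoS (3 bits) | bottom-of-stack (1 bit) | TTL (8 bits)
    """
    if not (0 <= label < 2**20 and 0 <= cos < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (cos << 9) | (int(bool(bottom_of_stack)) << 8) | ttl
    return word.to_bytes(4, "big")


def unpack_label_entry(data):
    """Inverse of pack_label_entry."""
    word = int.from_bytes(data, "big")
    return {
        "label": word >> 12,
        "cos": (word >> 9) & 0x7,
        "bottom_of_stack": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


entry = pack_label_entry(label=100, cos=5, bottom_of_stack=True, ttl=64)
print(entry.hex())
print(unpack_label_entry(entry))
```

In a stack of n labels, only the last entry before the layer 3 packet has the stack bit set.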

MPLS in Service Provider Networks
- Form an FEC (Forwarding Equivalence Class) of all packets with the same forwarding requirements
  - E.g., BGP destination prefix, packets with the same CoS (Class of Service) bits
- Associate an LSP with each FEC
  - All packets within an FEC are forwarded the same way
  - Typically LSPs are established from ingress to egress
- Switch packets from ingress to egress
  - Ingress router: push a label onto the packet
  - Backbone routers: label-switch the packet
  - Egress router: pop the label from the packet
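The push, switch and pop steps can be sketched in a few lines of Python. The label values, router names and tables below are hypothetical, and the sketch has the egress router pop the label as the slide describes (it does not model penultimate-hop popping).

```python
# Hypothetical ingress mapping: FEC -> (label to push, next router).
INGRESS_FEC_MAP = {"203.0.113.0/24": (17, "P1")}

# Hypothetical label tables on backbone routers:
# incoming label -> (outgoing label, next router).
LABEL_TABLES = {
    "P1": {17: (22, "P2")},
    "P2": {22: (35, "PE-egress")},
}


def forward_along_lsp(fec, payload):
    """Carry payload along the LSP for fec; return (routers visited, packet)."""
    label, next_router = INGRESS_FEC_MAP[fec]
    packet = {"labels": [label], "payload": payload}  # ingress: push
    path = ["PE-ingress"]
    while next_router in LABEL_TABLES:  # backbone routers: swap
        path.append(next_router)
        out_label, following = LABEL_TABLES[next_router][packet["labels"][-1]]
        packet["labels"][-1] = out_label
        next_router = following
    path.append(next_router)
    packet["labels"].pop()  # egress: pop
    return path, packet


path, packet = forward_along_lsp("203.0.113.0/24", "ip-packet")
print(path)
print(packet)
```

Note that no backbone router looks at the destination address: the only lookup key is the top label.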

MPLS Applications
- IP VPNs
  - Provider-based, simple, scalable VPNs
  - Overlapping, private addressing
- Converged cores
  - IP VPNs + Internet access
- Common BGP-free core
  - Switch packets from ingress router to egress router
  - Backbone routers do not need BGP routes
- Potential for more reliable cores
  - Traffic engineering
    - Establish “customized” paths
    - IGPs do not provide fine-grained control over traffic
  - Fast re-route
    - Pre-compute and establish alternate MPLS paths to quickly route around failures
[Callout: Booming Demand]

Network Management
- Customer Care
- Network Care

Operations and Management Division of Roles Customer vs. Network Care
- Customer Care is Edge (CE-PE) centric
  - Focus: where customers meet the network
    - CE-PE access circuit; CE, PE configuration
    - A great deal of coordination is needed across the lifecycle of assessment, onboarding, and steady-state management
    - Often about determining on which side of the customer/provider interface a problem and its associated action lie
  - Call centers: relatively large teams, with the technical and soft skills needed to deal with customer problems
  - Customer care is a differentiating service feature sold to customers
- Network Care is Core (PE) centric
  - Focus: network internals that customers have no interest in, e.g., BGP route reflectors
  - Network Operations Centers: relatively small teams, with the deep technical skills needed to deal with network problems

Customer Care
- Fundamentals
- Provisioning
- SLAs
- Data Management

What is a Customer?
- Multiple views for multiple purposes
  - Sales, billing, provisioning, troubleshooting, interactions with third parties
  - A network provider publishes different data to different contacts
    - Who gets views of the bill vs. who gets views of the trouble tickets
- Defining a customer is not easy!
  - Much more complex than knowing the customer’s Dun & Bradstreet D-U-N-S number (though this helps)
  - Customer data management is a difficult, dynamic problem
  - A layer above networking, yet critical to network management
- For our purposes, associate a customer with a project
  - Contracted network services: Internet access, VPN, reporting, SLAs, …
  - CE (Customer Edge) routers, on the customer premises
    - Managed by the customer or outsourced to the provider
    - Outsourced management may extend beyond the CE’s WAN interface
  - PE (Provider Edge) routers, on the provider’s network
    - Managed by the provider
  - Access arrangements
    - Site info: location, circuit ID, associated IP addresses, …
    - Off-net (third-party) access (LECs, PTTs) or on-net access via Packet over SONET (POS), Frame Relay, Ethernet, DSL, …

Speaker notes (glossary):
- PTT: postal, telegraph, and telephone (organization). In countries having nationalized telephone and telegraph services, the organization, usually a governmental department, which acts as the nation's common carrier.
- RBOC (ILEC): regional Bell operating company. Seven regional telephone companies formed by the breakup of AT&T. RBOCs differ from RBHCs in that RBOCs do not cross state boundaries.
- CLEC: competitive local exchange carrier. A company that builds and operates communication networks in metropolitan areas and provides its customers with an alternative to the local telephone company.

Elements of Customer Care
- Bootstrapping: customer + requirements → network + services
  - Vanilla customer: automated flow-through from technical questionnaire to service activation
  - Complex customer: multiple, iterative steps
- Assessment
  - Understanding existing customer networks/services
  - Understanding requirements for new networks/services
  - Base-lining of existing services
    - Understanding traffic, topology, configuration …
  - Design of new services
  - Phased implementation of new services
    - A complex enterprise may migrate to a new provider over a multiyear time frame
- Provisioning
  - Logistics of getting routers, circuits, configurations to the right place at the right time
    - Managed by workflow management systems
  - Phased, scripted component and end-to-end test and turn-up procedures
  - In synchrony with updates to databases supporting network management systems and billing
  - Gold mine of complex data about network operations

Elements of Customer Care
- Troubleshooting/Tech Support: customer + problem → solution
  - Reactive: call centers, 24x7 tiered support
    - Low tiers handle high-volume, relatively simple or localized problem types
    - High tiers handle lower-volume, relatively complex and higher-severity problems
  - Proactive: alerts from monitoring systems
    - Triggered by reachability, performance and fault monitoring
    - Internal notifications to network care, access providers
    - External notifications (IVR, contact lists) to customers
- Essence of Customer Care: superfast problem localization and dispatch
  - Detection and classification, including level of severity
  - Localization to the appropriate control domain: customer, network provider, access provider
  - Solution dispatch
    - Again, to the appropriate control domain: customer, provider (network care), or access provider
    - And track the problem
- At the heart, this is automated systems workflow
  - Rules-driven, automated and audited escalations of problems through technical and business channels, from detection to localization to post-mortem reporting

Speaker notes: Network care does the complicated stuff; sophisticated tools are covered in the network care part of the tutorial.

Provisioning Transforming Service Intent to Network Reality
- Customers want service increasingly “on demand”
- Providers want revenue, which flows the moment service is provisioned
  - Provisioning speed is a huge priority
- Today’s bottleneck: physical provisioning of circuits
  - Technology mechanisms, such as intelligent optical networking (with bandwidth on demand), have sprung up to address networking issues
  - Market mechanisms, such as exchange points (e.g., PAIX), have sprung up to address some of the physical wiring issues
    - The customer brings a fiber to the exchange point, and chooses a provider among those already there

Provisioning Workflow
- Technical Questionnaire (service level)
  - E.g., Web form
- Logic: allocations of ports, IP addresses, VRFs, …
- Device/service-specific templates, with embedded variables and callouts to computations and databases
  - E.g., callouts for ports, IP addresses, ACL clauses, …
- Detailed device configuration commands, bundled as a “configlet” (network element level)

Speaker notes: This is the workflow: (a) the user enters some basic information from site-specific documentation using a web interface; (b) the information is put into the provisioning database; (c) the “active templates,” which are equipment specific, are extracted from the repository, and the provisioning generator (compiler) executes the active code in the template, which pulls data from (and can push data into) the provisioning database. The resulting router configuration file (e.g., a Cisco IOS file) is pushed to the equipment by some external, but possibly automated, service.
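A toy version of this flow can be written with Python's standard `string.Template`. This is only a sketch of the idea under stated assumptions: the template text, the shape of the provisioning database, the `allocate_port` callout and the questionnaire fields are all hypothetical stand-ins, not the tooling described in the tutorial.

```python
from string import Template

# Hypothetical device-specific template with embedded variables.
CONFIGLET_TEMPLATE = Template(
    "interface $port\n"
    " description $customer\n"
    " ip address $address $mask\n"
)


def allocate_port(db, customer):
    """Hypothetical callout: take the next free port and record the assignment."""
    port = db["free_ports"].pop(0)
    db["assigned"][customer] = port
    return port


def generate_configlet(db, questionnaire):
    """Turn questionnaire answers into a configlet, updating the database."""
    port = allocate_port(db, questionnaire["customer"])
    return CONFIGLET_TEMPLATE.substitute(
        port=port,
        customer=questionnaire["customer"],
        address=questionnaire["address"],
        mask=questionnaire["mask"],
    )


# Illustrative provisioning database and questionnaire answers.
provisioning_db = {"free_ports": ["POS7/1", "POS7/2"], "assigned": {}}
answers = {"customer": "ExampleCorp", "address": "192.0.2.1", "mask": "255.255.255.252"}

print(generate_configlet(provisioning_db, answers))
print(provisioning_db)
```

The point of the design is that the generator both reads from and writes to the database, so the record of what was allocated stays in step with what was configured.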

Provisioning Example Access Interfaces
- Basic interface configuration
  - Media and location in router (POS7/3, ATM5/0.1)
  - IP address and network address (mask)
  - Capacity (bandwidth)
- Rich configurable parameters at layer 3
  - Packet marking and scheduling (differentiated services)
  - Buffer management (memory size, RED parameters)
  - Access control (inbound and outbound packet filters)
- Diverse communication media at layer 2
  - Serial link, ATM, Frame Relay, Packet over SONET, etc.
  - Various low-level, media-specific parameters

Provisioning Example: BGP Customer Configuration
- Determine the customer’s AS number
  - Some customers have their own AS number
    - Example: customers multi-homed to multiple providers
  - Some customers cannot get their own AS number
    - Example: single-homed customers
    - Assign a private ASN (64512 to 65535) or use the provider’s ASN
- Establish communication with the customer
  - Determine the interface(s) connected to the customer
  - Configure a BGP session with the customer
  - Associate the BGP session with the interfaces
- Enforce the provider’s routing policies while taking the customer’s routing intent into account
  - BGP import and export policies
- Configure other BGP session parameters
  - Password, timer settings, description, etc.
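The first step, choosing the AS number, is easy to express as code. The Python sketch below uses the private range exactly as stated in the text above; the function names and the sample ASNs are hypothetical.

```python
# Private ASN range as stated in the text above (64512 to 65535 inclusive).
PRIVATE_ASN_RANGE = range(64512, 65536)


def is_private_asn(asn):
    """True if asn falls in the private range quoted above."""
    return asn in PRIVATE_ASN_RANGE


def select_customer_asn(customer_asn, next_free_private_asn):
    """Use the customer's own ASN if it has one, else assign a private ASN.

    customer_asn is None for customers that cannot get their own ASN,
    for example single-homed customers.
    """
    if customer_asn is not None:
        return customer_asn
    if not is_private_asn(next_free_private_asn):
        raise ValueError("candidate ASN is outside the private range")
    return next_free_private_asn


print(select_customer_asn(customer_asn=64999, next_free_private_asn=64512))
print(select_customer_asn(customer_asn=None, next_free_private_asn=64512))
```

The alternative mentioned on the slide, reusing the provider's own ASN, would replace the second branch.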

Provisioning Example BGP Routing Policies
- What are BGP routing policies?
  - Applied to BGP update messages at the PE (or AR) router
  - Based on the prefix (and/or other attributes) listed in the update
  - Determine route selection and distribution within the AS, as well as distribution to other customers and peers
- Two kinds of routing policies
  - Import: applied to routes received from the customer
    - Filter routes for unwanted prefixes
    - Influence the selection of the best route
    - Tag routes for future export to other customers and/or peers
  - Export: applied to routes sent to the customer
    - Select routes and attributes to send to the customer
    - E.g., send a default route to the customer (if needed)
- What makes them complicated?
  - Often have to decompose them across routers to achieve the intent

Example: Controlling Route Distribution
[Figure: the customer attaches at router C; the peer attaches at router A.]
- Customer intent: “Don’t advertise my routes to peers”
- Need policies at both the customer and the peer

At router C, assign routes the “Don’t import to peers” tag:
    neighbor route-map IMPORT-C in
    route-map IMPORT-C permit 10
     set community 0:1000

At router A, don’t send routes with the “Don’t import to peers” tag to the peer:
    ip community-list 1 permit 0:1000
    neighbor route-map EXPORT-A out
    route-map EXPORT-A deny 10
     match community 1
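The division of labor between the two routers can be mimicked in Python. In this sketch a route is a plain dictionary, the import step at router C replaces the route's communities with the tag (which is what `set community` without `additive` does), and the export step at router A drops tagged routes. It assumes untagged routes are sent to the peer, which would need a further permit clause that the slide does not show; the prefixes are illustrative.

```python
NO_PEER_TAG = "0:1000"  # community value used on the slide


def import_from_customer(route):
    """Router C, IMPORT-C: tag every route learned from the customer."""
    tagged = dict(route)
    tagged["communities"] = {NO_PEER_TAG}  # replaces any existing communities
    return tagged


def export_to_peer(routes):
    """Router A, EXPORT-A: withhold routes carrying the tag.

    Assumption: routes without the tag are advertised to the peer.
    """
    return [r for r in routes if NO_PEER_TAG not in r["communities"]]


customer_route = import_from_customer({"prefix": "203.0.113.0/24", "communities": set()})
other_route = {"prefix": "198.51.100.0/24", "communities": set()}

for route in export_to_peer([customer_route, other_route]):
    print(route["prefix"])
```

Neither router alone achieves the customer's intent: C only marks, A only filters, which is the decomposition problem the previous slide points out.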

Auditing What’s Provisioned (Checks and Balances)
- Again, provisioning is about translating service intent to network reality
  - Automation helps enormously
  - Simpler, better configuration languages (e.g., XML-based) and configuration protocols (e.g., IETF’s netconf) may help
- Yet!!!
  - Engineered artifacts (large-scale, operationally complex networks and databases) are imperfect, are moving targets, and are hard to reason about
  - Flaws creep into design, realization, management
  - Some level of noise or error is inevitable
- Key parts of the solution
  - Auditing service intent and network reality, flagging and fixing “discords”
  - Data integrity and data cleaning

Auditing is Bottom Up!
[Figure labels: auditing, discords, low-level standard form (tables), customer/network database, polled queries, router configuration, provisioning, fix.]
- Parsing network-level data
  - Box-level dumps (show running config; show diag; show trace …) translated into a form (RDBMS, XML) for network-level query and analysis
- Cross-validation
  - Box-level compliance to templates
  - Network-wide integrity (routing…)
  - Access control and security
  - Alignment of network views with database views
  - IP and optical associations (interfaces to circuit IDs)
- Fixing config discords
  - Report warnings & errors
  - Cruft, serious problems, time bombs waiting to explode when the triggering network event occurs

Example: Joining Parts of OSPF Config Together (references/constraints scattered thru config file)
hostname MyRouter
!
interface POS7/0
 ip address
 ip ospf cost 50
router ospf 2
 network area 9
 passive-interface Serial2/1/0/3.1
Callouts: Remote end is in /30; Interface participates in OSPF
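A rough sketch of the join an auditing tool has to perform is given below. The slide omits the actual addresses, so every address and prefix here is a made-up placeholder, and the data structures are assumptions rather than the format of any real tool.

```python
import ipaddress

# Hypothetical values: the slide does not show the real addresses.
interfaces = {
    "POS7/0": ipaddress.ip_interface("10.0.0.1/30"),
    "Serial2/1/0/3.1": ipaddress.ip_interface("10.0.1.1/30"),
}
ospf_networks = [(ipaddress.ip_network("10.0.0.0/16"), 9)]  # (network, area)
passive_interfaces = {"Serial2/1/0/3.1"}


def remote_end(addr):
    """On a /30 the remote end is the other usable host address."""
    return next(h for h in addr.network.hosts() if h != addr.ip)


def ospf_status(name):
    """Join the interface, network and passive-interface statements."""
    addr = interfaces[name]
    for network, area in ospf_networks:
        if addr.ip in network:
            return {
                "area": area,
                "active": name not in passive_interfaces,
                "remote_end": remote_end(addr),
            }
    return None  # interface does not participate in OSPF


print(ospf_status("POS7/0"))
```

The point of the example is that no single config stanza answers "does this interface run OSPF, in which area, and towards whom"; the answer only appears once the scattered references are joined.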

Example: Remote End in Different OSPF Area (auditing tool joins/analyzes info in database)
Extracted tables: interface, OSPF network, OSPF passive interface, OSPF interface, link
Intermediate tables: active OSPF interface
Simple SQL queries
Presentation (query result): OSPF link with area mismatch
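The area-mismatch audit can be sketched with an in-memory SQLite database. The table layout, column names and sample rows below are assumptions made for illustration; the slide only names the tables.

```python
import sqlite3

# Hypothetical schema and rows, loosely following the table names on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ospf_interface (router TEXT, ifname TEXT, area INTEGER, passive INTEGER);
CREATE TABLE link (router_a TEXT, if_a TEXT, router_b TEXT, if_b TEXT);
INSERT INTO ospf_interface VALUES
  ('r1', 'POS7/0', 9, 0), ('r2', 'POS1/0', 0, 0),
  ('r1', 'POS7/1', 9, 0), ('r3', 'POS2/0', 9, 0);
INSERT INTO link VALUES
  ('r1', 'POS7/0', 'r2', 'POS1/0'),
  ('r1', 'POS7/1', 'r3', 'POS2/0');
""")

# Join each link to the OSPF view of both ends and keep the disagreements.
MISMATCH_SQL = """
SELECT l.router_a, l.if_a, a.area, l.router_b, l.if_b, b.area
FROM link l
JOIN ospf_interface a ON a.router = l.router_a AND a.ifname = l.if_a
JOIN ospf_interface b ON b.router = l.router_b AND b.ifname = l.if_b
WHERE a.passive = 0 AND b.passive = 0 AND a.area <> b.area
"""
mismatches = conn.execute(MISMATCH_SQL).fetchall()
print(mismatches)
```

Once the box-level dumps are in relational form, each audit rule becomes a short query of this kind, which is what makes the bottom-up approach practical at scale.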

Service Level Agreements (SLA’s)
A sort of warranty: financials + the “fine print” Fine print: technical reliability and performance Measurement intervals/methods, statistics (VoIP R-values, delay, loss, jitter, availability), force majeure indemnifications regarding hurricanes …, outages caused by the customer itself, variations based on interfaces and bandwidth characteristics, etc. IP networking is maturing and the marketplace is extremely competitive SLAs have real meaning and are getting increasingly stringent Site to site (CE to CE) VPN SLAs cover Class of Service (COS) specific targets for delay, loss, jitter, and availability between pairs of sites within the customer’s VPN Financials SLA compliance data incorporated into the billing data stream When SLAs are not met Customers unhappy: service quality is below expectations Providers unhappy: revenue suffers

Example: Provisioning for site to site SLAs: Network
Figure: CE - PE - Provider Network - PE - CE, with the CE and PE interfaces marked
4 interfaces are essential for a given CE pair
To meet the SLA, the detailed configurations
  must be aligned end to end across the network
  must match customer and service-specific data: bandwidths (e.g., rate limiting parameters for FR/ATM CE-PE link), CoS markings, shaping/queuing/marking/dropping packet handling behaviors, customer-specific routing and packet filter parameters

Example: Provisioning for site to site SLAs: Probes/reporting
Figure: CE - PE - Provider Network - PE - CE, with the CE probes marked
2 CE probes are essential for a given CE pair
  Probes have roles as both senders and responders
  Agent running on the CE router (e.g., Cisco’s SAA), or another box attached to a CE port or on the CE-PE link (e.g., RMON probe)
  Collects detailed data on site to site performance via passive and active measurements
The detailed probe configurations must be aligned with the network configurations, the customer, and the service
  Probe packet type (UDP, ICMP, CoS), probing frequency
  Interface packet filtering must permit the probes
The performance/SLA monitoring system must collect the data with very high fidelity
  Agent must itself be very reliable
  Performance monitoring platform must be designed for statistical soundness and high reliability
  Polling frequency; data collection; data validation; SLA reporting

Data Management Extremely important issue in running networks Often overlooked by the academic community True for both customer and network care Customer level Service contracts, VPNs, CE, routing/access control parameters/policies, site and access/circuit data, ordering/ billing records, provisioning/ updating service, workflow related events and trouble tickets, performance reports, … Network level Layers 1-3 (+ network servers, such as DNS) topology, routing, performance, security, fault, operational workflow, provisioning/updating network and associated systems, network-focused inventory and configuration, … Look up how the TMN model talks about this Aman Shaikh, Albert Greenberg, August 2005

Data Management Challenges!
Scale: tens of thousands of business customers; millions of consumers Cisco platforms/command sets, VoIP telephony adapters, firmware updates Customization for complex customers Example: variations in IOS version, features, architecture Rapid Evolution of IP network services VoIP: multiple telephony adapters, firmware loads, interactions with equipment Number of features that need to be maintained increases Features (almost) never die Data is managed by applications Software rot can lead to data rot Strikes when a program’s assumptions become out of date Churn in database design and development Multiple teams, creating multiple APIs Aman Shaikh, Albert Greenberg, August 2005

Important Data Management Activities
Data integration/correlation Associations (keys) for mapping customers to access circuits, router interfaces, network policies, transport facilities, monitoring systems Provisioning requires precise, normalized current data Troubleshooting requires extensive correlation of current and historical data Data integrity/cleaning Getting high quality, readily available data Large real world datasets always have some level of dirty data Auditing and fixing process and process “fallouts” due to inconsistent or missing data Approaches Top Down: Data modeling/engineering: integration Bottom Up: Google-like (read-only) virtual integration, but read/write Aman Shaikh, Albert Greenberg, August 2005

Virtual Data Integration Methodology
Figure labels: Data Sources, Data Staging, Data Access, Web Crawlers, DB snapshots, Direct Access, VIP Cache, VIP GUI, MetaSearch, Custom Views, Local Interfaces, External Interfaces
Ideal solution: integration of all systems
BUT large integration projects often fail because it is too expensive and time-consuming to re-engineer everything and get all the necessary buy-ins. Will it create just another monster?
Virtual integration: use lightweight web and database technologies to give users the impression/value of systems integration
  Virtually integrated = not physically integrated
  No re-engineering of legacy systems

Virtual Integration Benefits
Troubleshooting requires accessing many different systems Multiple logins; variety of interfaces (terminal, java-based, web) Access to the data is the first step towards assuring data quality Virtual integration system exploits established APIs in all component systems (CGI-programming, terminal emulators, data feeds, …) Cross-index datasets on all possible combinations of joinable keys Allow users to get to data by any means available Google-style approach, include direct links to main databases Build customized web interfaces based on user feedback Fast, no reorganization of the underlying systems Use AI/Data-Mining techniques to flag/correct input errors Exploit big opportunities for automation Auto-populate forms

Network Care Fundamentals Troubleshooting Maintenance Network Security Aman Shaikh, Albert Greenberg, August 2005

Elements of Network Care
Troubleshooting and Maintenance are intertwined with each other One can trigger the other; Example: diagnosis of a failing line card can lead to its replacement Example: things can go wrong during maintenance that lead to troubleshooting Maintenance and upgrades Troubleshooting Plan Detect Notify customers Localize Prepare network Diagnose Fix Perform Verify Verify Aman Shaikh, Albert Greenberg, August 2005

Proactivity and Reactivity
Target Prevent problems, rather than getting better and better at fixing problems How Robust design Automation of network management Forensics and post mortem analysis of problems Limitations Moving target! Can walk on water if it’s frozen Silent failures! No trap, no measurement

Network Troubleshooting
Workflow Detect, Localize, Diagnose, Fix, Verify Target: automate all of this! Reality: we are not all the way there yet Reality: better at automating the earlier parts of the workflow, thanks to continuously operating comprehensive monitoring tools and systems Localizing is a hard step Importance of real time execution: obvious Importance of off line analysis (post mortem): critical driver for network improvement Systems and Tools Passive and Active monitoring Tools that apply in many roles across the workflow Correlation Accelerating, improving and automating the workflow

Example Detect Continuous active monitoring (PE-PE) shows loss of continuity for some PE pairs Localize Active monitoring, in this case, provides immediate localization to impacted PEs Diagnose Syslogs, MIBs, OSPF monitoring reveals CPU spikes coincident with high BGP workloads, OSPF sessions dropped, customer provisioning Diagnosis: Unsustainable BGP workload (running at higher priority than OSPF) on a certain class of PE routers Fix Short-term: Control provisioning and other configuration changes to avoid triggering the problem Permanent: Vendor fixed scheduling priorities of OSPF and BGP processes Verify Enhancements to active monitoring, specific to provisioning Aman Shaikh, Albert Greenberg, August 2005

Network Troubleshooting Toolkit: Box Level
Up/Down Triggered by hard failures (link, card, router, etc.) Near real-time alarms Statistical Traffic, buffers, CPU, … Degrading conditions; e.g., significant loss, no queues → degrading hardware on linecard → plan maintenance Scalable Easy when looking data source by data source Harder when looking at the huge number of data sources from diverse network elements: SNMP, syslogs, SONET alarms, …

Network Troubleshooting Toolkit: Network Level and External
Active measurement Path level performance information Delay and delay variation measurements Indication of customer degradation (except hard failures) Scalability problems (N Squared issues) Control Plane monitoring (BGP, OSPF, LDP) Passively forming the views of routing akin to the routers themselves Correlation Data fusion of network measurements and associating alerts Anomography – network wide anomaly detection from network element External TACACS and workflow logs – who is doing what and where on the network Alerting and tickets from other layers (Optical, VoIP)

Box-level: SNMP What is it? SNMP = Simple Network Management Protocol Allows NMS to query devices for information Information stored as MIB (Management Information Base) Allows devices to notify NMS about events SNMPv1, SNMPv2, SNMPv3 (work in progress) Backwards-compatible? Usage in troubleshooting Detection (Up Down and Statistical) Quite often in diagnosis Other usage Reporting, trending and statistics SNMP link utilization forms key component of traffic matrix estimation Capacity planning, evolution of network architecture Aman Shaikh, Albert Greenberg, August 2005

SNMP MIB A model of how information is stored in a device Collection of objects identified by object IDs (OIDs) Information is organized hierarchically Hierarchies allow grouping of information by topics E.g., interface group stores information about interface state Hierarchies allow controlled extension of the model E.g., router vendors have defined their own MIBs Accessing the MIB NMS → Device read: get, getNext NMS → Device write: set Device → NMS notification: Traps
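The get and getNext access pattern can be illustrated with a toy in-memory agent. This is a sketch of the lexicographic-ordering idea only, not an SNMP implementation; the stored values are hypothetical, while the OIDs are the standard MIB-II ones for sysName, ifNumber and ifDescr.

```python
import bisect

# Toy agent: OIDs are tuples, which Python already orders lexicographically.
mib = {
    (1, 3, 6, 1, 2, 1, 1, 5, 0): "MyRouter",                  # sysName.0
    (1, 3, 6, 1, 2, 1, 2, 1, 0): 2,                           # ifNumber.0
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 2, 1): "POS7/0",              # ifDescr.1
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 2, 2): "Serial2/1/0/3.1",     # ifDescr.2
}
sorted_oids = sorted(mib)


def get(oid):
    """get: exact lookup, None if the object does not exist."""
    return mib.get(tuple(oid))


def get_next(oid):
    """getNext: first object whose OID sorts strictly after the given one."""
    i = bisect.bisect_right(sorted_oids, tuple(oid))
    if i == len(sorted_oids):
        return None
    return sorted_oids[i], mib[sorted_oids[i]]


def walk(prefix):
    """Repeated getNext until the returned OID leaves the subtree."""
    prefix, current, found = tuple(prefix), tuple(prefix), []
    while True:
        item = get_next(current)
        if item is None or item[0][:len(prefix)] != prefix:
            return found
        found.append(item)
        current = item[0]


print(walk((1, 3, 6, 1, 2, 1, 2, 2, 1, 2)))
```

A manager that does not know how many interfaces exist can still enumerate them this way, which is why getNext matters for tables.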

Example MIB
ROOT
  ccitt(0)
  iso(1)
    standard(0)
    reg-authority(1)
    member-body(2)
    ident-org(3)
      dod(6)
        internet(1)
          directory(1)
          mgmt(2)
            mib(1)
              system(1) interfaces(2) at(3) ip(4) icmp(5) tcp(6) udp(7) egp(8) transmission(10) snmp(11) RMON(16) RMON2(17)
          experimental(3)
          private(4)
            enterprises(1)
              cisco(9) (vendor-specific MIBs)
  joint(2)
OID for ICMP: 1.3.6.1.2.1.5
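How a name in the tree turns into a dotted OID can be sketched as a walk from the node up to the root. The dictionary below encodes only a subset of the tree shown on the slide, and the helper name `oid_of` is an illustrative choice.

```python
# Subset of the MIB tree: node name -> (parent name, arc number).
TREE = {
    "iso": (None, 1),
    "org": ("iso", 3),
    "dod": ("org", 6),
    "internet": ("dod", 1),
    "mgmt": ("internet", 2),
    "private": ("internet", 4),
    "mib": ("mgmt", 1),
    "enterprises": ("private", 1),
    "cisco": ("enterprises", 9),
    "system": ("mib", 1),
    "interfaces": ("mib", 2),
    "ip": ("mib", 4),
    "icmp": ("mib", 5),
    "tcp": ("mib", 6),
    "udp": ("mib", 7),
}


def oid_of(name):
    """Collect arc numbers from the node up to the root, then reverse them."""
    arcs = []
    while name is not None:
        parent, arc = TREE[name]
        arcs.append(arc)
        name = parent
    return ".".join(str(a) for a in reversed(arcs))


print(oid_of("icmp"))   # standard subtree
print(oid_of("cisco"))  # vendor subtree under private.enterprises
```

The same walk shows why vendor extensions do not collide with the standard objects: they hang off a different branch of internet(1).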

Limitations of SNMP Inadequate SNMP MIBs are inconsistently implemented (or not at all) SNMP MIBs cover only a small portion of critical information on the health and behavior of the router Statistics hard-coded No local intelligence to: accumulate relevant information, alert NMS to pre-specified conditions, etc. Highly aggregated traffic information Aggregate link statistics Cannot drill down Protocol: simple = dumb Cannot express complex queries over MIB information in SNMPv1 “Get all or nothing” More expressibility in SNMPv3 Aman Shaikh, Albert Greenberg, August 2005

Box-level: Syslog What is it? Moral equivalent of #if (DEBUG) printf(…) in the router code Vendors print a plethora of information via syslog The syslog output can be collected at a remote server Usage in Troubleshooting Detection, localization, diagnosis Valuable source of information on what equipment is doing Limitation Syslog output is not standardized No consistency across vendors or different platforms of same vendor Makes it cumbersome to write portable tools that feed off syslog Syslog is not reliable Loss of messages when router CPU is busy

Box-level: Telnet/CLI
What is it? Telnet/ssh into routers and issue commands for troubleshooting Ping, traceroute, show/debug,… Resetting sometimes fixes problems! “shutdown/no shutdown” can sometimes solve problems on linecard! Often used extensively in troubleshooting Usage in Troubleshooting Localization, diagnosis, fix, verify Limitations Doing things via CLI is playing with fire… Tight access control and authorization, considerable expertise required, “ask yourself” training Also need to control how many people can simultaneously telnet in Places load on router CPU Observation from the NOC: it is striking how often people touch routers for troubleshooting purposes

Network-level: Active Measurements
PoP 1 PoP 2 Probe Probe PE PE CE CE edge to edge probes CE CE Probe may be onboard the router (SAA) or separate server Utility Alarms are driven on estimates of application impact Routing design can be assessed and adjusted for efficiency The effect of equipment/facility failures can be assessed and mitigation put into place Operations Methods are designed to minimize application impact The behavior of new applications (e.g. VoIP) can be estimated The risk for Service Level Agreements can be gauged Customers are given a view of the measurements to provide a view into backbone performance Aman Shaikh, Albert Greenberg, August 2005

Active Monitoring Design
Goal Schedule packet transmissions (Poisson, Periodic, …) so that virtually every performance affecting event longer than a few seconds will be detected Performance impacting events include Card changes on backbone routers that cause re-routes (previously not considered customer-impacting) Small but persistent drops at interfaces Major congestion events Events that cause indirect harm via excessive jitter As a result, the ability of the backbone to support real-time protocols can be tracked fairly accurately
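The two scheduling styles named above can be sketched as follows. The rates, durations and function names are illustrative assumptions; the slide does not give the parameters used in practice.

```python
import random


def poisson_schedule(rate_per_s, duration_s, seed=0):
    """Probe send times with exponentially distributed gaps (a Poisson process)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return times
        times.append(t)


def periodic_schedule(interval_s, duration_s):
    """Probe send times at a fixed interval, starting at time 0."""
    return [i * interval_s for i in range(int(duration_s / interval_s))]


def largest_gap(times, duration_s):
    """Longest stretch with no probe: events shorter than this can go unseen."""
    edges = [0.0] + list(times) + [duration_s]
    return max(b - a for a, b in zip(edges, edges[1:]))


poisson = poisson_schedule(rate_per_s=2.0, duration_s=60.0)
periodic = periodic_schedule(interval_s=0.5, duration_s=60.0)
print(len(poisson), largest_gap(poisson, 60.0))
print(len(periodic), largest_gap(periodic, 60.0))
```

Comparing the largest gap of each schedule against "a few seconds" gives a quick check on whether the chosen probing rate can meet the detection goal.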

Views of the Information
Public/Customer View Current Round Trip (RT) Loss and mean RT delay by city-pair Monthly averages for Loss and Delay Network-wide Global Operations View RT Loss RT Delay (95th percentile, min, mean) Inter-Packet Delay Variation (IPDV) or ‘jitter’ Degraded seconds or minutes in test Operations View For analysis and investigation Numerous metrics and raw data available Aman Shaikh, Albert Greenberg, August 2005
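The metrics listed in these views can be computed from raw probe results along the following lines. This is a sketch under assumptions: lost probes are represented as `None`, the 95th percentile uses the nearest-rank method, and jitter is taken as the mean absolute difference between consecutive received round-trip times, which is one of several possible definitions.

```python
import math
import statistics


def summarize(rtts_ms):
    """Summarize one test interval; rtts_ms holds RTTs in ms, None for a lost probe."""
    received = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(received)
    ordered = sorted(received)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    ipdv = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "loss": lost / len(rtts_ms),
        "min": ordered[0],
        "mean": statistics.mean(received),
        "p95": ordered[rank - 1],
        "mean_ipdv": statistics.mean(ipdv),
    }


report = summarize([10.0, 12.0, None, 11.0, 30.0])
print(report)
```

The single 30 ms sample barely moves the mean but dominates the 95th percentile and the jitter figure, which is why the operations views carry more than averages.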

Network-level: Control Plane Monitoring
What are Route Monitors? Allow collection and analysis of routing messages E.g., OSPF → Link State Advertisements (LSAs), BGP → routing updates Trouble-shooting usage: Detect: Real-time tracking of routing events Diagnosis: Post-mortem analysis of problems Other usage: Network maintenance Track and validate maintenance steps “What-if” Analysis Capacity planning, architectural changes, policy changes, risk analysis Understanding routing dynamics of commercial networks Convergence, stability, robustness Interaction of protocols

Route Monitors in Practice
Research and academic Route-views and RIPE [route-views, ripe-ris] Public archives of BGP updates Have spawned numerous research papers on BGP OSPF Monitor from AT&T Labs [shaikh-nsdi04] IPMon project at Sprint Labs [spint-ipmon,pyrt] Commercial products: RouteExplorer by PacketDesign [packetdesign] OSPF, IS-IS, EIGRP, BGP RouteDynamics by IPSUM [ipsum] OSPF, IS-IS, BGP

Collecting Routing Data
Challenge How to collect data passively BGP Monitor Use of public-domain routing software: Zebra/Quagga Passiveness achieved through configuration on routers Route filters that block any route updates from the monitors OSPF Monitor [shaikh-nsdi04] Various modes of connecting to the network Need one connection per area Passiveness achieved through careful implementation of the collector Aman Shaikh, Albert Greenberg, August 2005

Correlation Across Data Sources
What is it? Correlate multiple data sources Simplest is to align multiple time-series Trouble-shooting usage: Detection: Dramatic reduction in false positives, and in redundant alarms Discovery of new and unexpected failure modes (e.g., IP/Optical interactions) Localization: Correlation of active and passive monitoring helps to simultaneously provide the severity and the locus of the problem Diagnosis: root cause analysis and fault localization Correlation enables automated drill-down Correlation capabilities are extremely powerful for post mortem analysis and for identification of recurring failure modes flying, previously, under the radar Sample research work: BGP and SNMP (link utilization) for anomaly detection [roughan04] Risk modeling for fault localization [kompella05] OSPF and BGP correlation for root cause analysis [teixeira04] Aman Shaikh, Albert Greenberg, August 2005

Two Sources: SNMP and BGP
SNMP Traffic volumes within a time interval Two detection algorithms Holt-Winters Decomposition-based algorithm BGP Fluctuations in number of routes per exit-point Use EWMA (Exponentially Weighted Moving Average)
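An EWMA-based detector on a route-count series might look like the sketch below. The smoothing factor, the relative threshold and the sample series are all hypothetical; the slide names the technique but not its parameters.

```python
def ewma_anomalies(series, alpha=0.3, threshold=0.5):
    """Return indexes whose value deviates from the EWMA forecast
    by more than `threshold` (as a fraction of the forecast)."""
    forecast = series[0]
    flagged = []
    for i, value in enumerate(series[1:], start=1):
        if forecast and abs(value - forecast) / forecast > threshold:
            flagged.append(i)
        # Update the forecast after the comparison, so each point is
        # judged against history only.
        forecast = alpha * value + (1 - alpha) * forecast
    return flagged


# Hypothetical number of routes per exit point, one sample per interval.
routes_per_exit = [1000, 1010, 990, 1005, 400, 980, 1000]
print(ewma_anomalies(routes_per_exit))
```

Small fluctuations are absorbed by the moving average, while the sudden drop at index 4 stands out; correlating such a flag with an SNMP traffic anomaly in the same interval is what cuts down false alarms.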

Example 1 Anomaly that triggers an alarm – major network peer failure Anomaly that does not trigger an alarm: monitor session resets Aman Shaikh, Albert Greenberg, August 2005

Example 2 No alarm: monitor data loss Alarm – again, peering related

Network Maintenance A very large problem Under-explored. Research opportunities Why? Continuous drivers for software update (routers, linecards, processors) Bugs, vulnerabilities, upgrades, enhancements, new features, new knobs to turn, new protocols, new services Continuous drivers for hardware updates (routers, linecards, processors) Failures, upgrades (higher speeds, new technologies), … Workflow Plan, Notify Customers, Prepare Network, Perform, Verify Very large opportunities for automation of workflow execution Systems Decision support Analysis of network and customer impact for each network update Optimizing, scheduling systems and workforce Execution Methodology and tools for minimizing impact during update execution

Example: Router OS Upgrade
Plan On site work force available? Customer notification required? Piggyback Opportunities? Architectural Exceptions? Special customer exceptions? … Resolve conflicts with other activities Risk/impact analysis on network and customers Notify customers if needed Leveraging the customer database Prepare the network Move traffic around by reconfiguring IGP (and BGP) Take out of production the router under maintenance E.g., move traffic off links incident on the router Perform the update Checkpoint state Minimize hit on the network, and time to upgrade Decision Support Aman Shaikh, Albert Greenberg, August 2005

Example: Router OS Upgrade (Continued)
Verify In final steps of execution, perform series of checks Examples: diff with checkpoint, check OS version after router reboot Rollback network to previous state Revert IGP and BGP (e.g., move traffic back on links) Check performance and fault monitoring Router is in production No adverse impact on network No adverse impact on customers Aman Shaikh, Albert Greenberg, August 2005

Decision Support Goal: A robust network configuration Good performance, even during failures and planned changes Limit impact of network update Maintenance: assess impact of planned outages Assessment of impact from maintenance on routers or underlying technologies (fibers, transponders, optical amplifiers, …) What if tools Compute flexible set of potential routing metric changes to minimize impact Key ingredients Data, models, and process – IP and cross-layer (optical, service) Importance and difficulty of data flow and data integrity Wide field of use beyond maintenance Risk, survivability and vulnerability analysis, network and service evolution, capacity planning Aman Shaikh, Albert Greenberg, August 2005

Decision Support Needs
Risk modeling Transport level SRLG data: Shared Risk Link Groups e.g., all IP links whose integrity depend on a common fiber conduit belong to an SRLG associated with that conduit IP Level: routers, interfaces Traffic modeling Traffic matrix: where the traffic is coming from and going to Hard problem in IP networks! Topology and Routing Analysis Via configuration management and route monitoring systems Route simulation Algorithms and analysis Impact analysis, optimization plans to minimize impact, … Aman Shaikh, Albert Greenberg, August 2005
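The SRLG idea reduces to a mapping from a transport-level risk to the IP links that depend on it. The conduit names and link names below are invented for illustration.

```python
# Hypothetical inventory: conduit -> IP links whose integrity depends on it.
srlg = {
    "conduit-1": {"LA-SF", "LA-NY"},
    "conduit-2": {"NY-Washington"},
    "conduit-3": {"SF-NY", "LA-NY"},
}


def links_lost(failed_conduit):
    """IP links that go down together when one conduit is cut."""
    return srlg.get(failed_conduit, set())


def shared_risks(link_a, link_b):
    """Conduits whose failure would take out both links at once."""
    return sorted(c for c, links in srlg.items()
                  if link_a in links and link_b in links)


print(links_lost("conduit-1"))
print(shared_risks("LA-SF", "LA-NY"))
```

Two IP links that look independent at layer 3 may still fail together, so backup paths need to be checked with a query like `shared_risks` rather than against the IP topology alone.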

Risk Modelling Risk management: tradeoff likelihood of failure, impact and economics Links (lasers), Fiber spans (SRLG), fibers (e.g., optical amplifiers), routers Impact analyzed through Risk Assessment Tool Probabilities model; drives requirements Integrity of a simple IP link depends on a complex set of transport facilities LA NY SF Washington IP (logical) layer Physical (fiber) layer LA NY SF Washington Common SRLG Aman Shaikh, Albert Greenberg, August 2005

Traffic Matrices: Big Picture
Router Level Demand Matrices Granularity: router or router interface Killer App: Network Maintenance Innovation: Tomo-gravity Flow Level Demand Matrices Granularity: TCP/IP headers Killer App: Traffic Analysis with Drill-down Innovation: Priority Sampling Path Matrices Killer App: Passive Performance Measurement Innovation: Trajectory Sampling Still working its way through standards and implementations Focus Here Introduce concepts, quantities we want to compute and why, key innovation – remarkably simple algorithms, Aman Shaikh, Albert Greenberg, August 2005

Requirements Use only data that is widely available, is built into the network elements, and is easy to collect on any interface on any router in a timely fashion Simple, statistically sound, scalable algorithms Frameworks that cover range of approaches, and help to explain how and why the approaches work Robust to the harsh realities of the operational environment Graceful degradation given data loss, corruption What this means for traffic matrix estimation Use link loads: SNMP MIB-II Ubiquitously available, robust Cope gracefully with missing, late, corrupted or otherwise flawed data

Network Tomography Have link traffic measurements Want to know demands from source to destination A B C Aman Shaikh, Albert Greenberg, August 2005

Problem: b=Ax
Only measure at links
Figure: three nodes (1, 2, 3) attached to a central router by links 1, 2 and 3, carrying routes 1, 2 and 3
In this simple network there are 3 nodes (1, 2, 3), connected to the router through links 1, 2 and 3 respectively. There are also 3 different routes. Route 1 involves links 2 and 3, but not link 1; route 2 involves …; route 3 involves … The link load b_1 is the sum of the traffic along routes 2 and 3: b_1 = x_2 + x_3. Similarly, … In matrix form, the link loads b_i and the traffic matrix elements x_j are linearly related by a routing matrix A: b = Ax. We know the left-hand side (b) and the routing matrix (A), and want to estimate the traffic matrix (x).
Problem: Estimate traffic matrix (x’s) from the link measurements (b’s)
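The forward relation b = Ax is easy to compute; the difficulty lies in inverting it. The sketch below uses a hypothetical routing matrix: only the first row (link 1 carries routes 2 and 3) and the first column (route 1 uses links 2 and 3) follow the slide, the remaining entries are assumed. A second, smaller matrix is added to show what under-constrained means.

```python
def link_loads(A, x):
    """b = A x: each link load is the sum of the demands routed over it."""
    return [sum(a * d for a, d in zip(row, x)) for row in A]


# Rows are links, columns are routes. Row 1 and column 1 follow the slide;
# the other entries are assumptions.
A = [
    [0, 1, 1],  # link 1 carries routes 2 and 3
    [1, 0, 1],  # link 2 (assumed to carry routes 1 and 3)
    [1, 1, 0],  # link 3 (assumed to carry routes 1 and 2)
]
x = [10, 20, 30]  # hypothetical demands per route
print(link_loads(A, x))

# Under-constrained case: 2 measured links, 3 unknown demands.
A_small = [
    [1, 1, 0],
    [0, 1, 1],
]
print(link_loads(A_small, [10, 20, 30]))
print(link_loads(A_small, [20, 10, 40]))  # different demands, same loads
```

In a backbone the number of source-destination pairs grows roughly with the square of the number of routers while the number of links does not, so the second situation is the normal one.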

Approach: Direct SVD Solution of b=Ax
Simplest approach: singular value decomposition Often used when the system is over-constrained; when the system is under-constrained it performs poorly … It minimizes the error in the least-squares sense The problem is massively under-constrained

A successful approach: Tomo-gravity
Tomo-gravity = tomo-graphy + gravity modeling
Reduce problem size: exploit topological equivalence
Find a solution x that satisfies the constraints and is closest to the generalized gravity model solution (g)
Figure: constraint subspace (b=Ax) from link measurements; generalized gravity solution (g); tomo-gravity solution (x), the point of the subspace closest to g
This is where the name comes from: tomo-gravity takes the best of both tomography and gravity modeling …
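The geometric step, finding the point of the constraint subspace closest to a prior g, can be sketched in pure Python. This is a simplified stand-in, not the authors' algorithm: it uses plain (unweighted) Euclidean distance, does not enforce non-negative demands, and assumes the rows of A are linearly independent. The matrix, loads and prior are hypothetical.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y


def closest_to_prior(A, b, g):
    """x = g + A^T (A A^T)^-1 (b - A g): satisfies A x = b, nearest to g."""
    residual = [bi - pi for bi, pi in zip(b, matvec(A, g))]
    AAt = [matvec(A, row) for row in A]            # A A^T (symmetric)
    y = solve(AAt, residual)
    At = [list(col) for col in zip(*A)]            # A^T
    return [gi + ci for gi, ci in zip(g, matvec(At, y))]


A = [[1, 1, 0], [0, 1, 1]]      # hypothetical routing matrix
b = [30, 50]                    # measured link loads
g = [12, 15, 38]                # prior, e.g. from a gravity model
x = closest_to_prior(A, b, g)
print(x, matvec(A, x))
```

The prior alone does not reproduce the measured loads, and the measurements alone do not pin down x; the projection keeps as much of the prior as the measurements allow.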

Foundation in Information Theory
Minimize Mutual Information I(S,D) Information gained about source (S) from destination (D) Assume no information beyond the link load constraints b=Ax Framework for tomo-gravity Gravity model = independence (between S and D) Generalized gravity model = conditional independence Explains tomo-gravity’s success, since the distance it minimizes is the first-order approximation to the Kullback-Leibler divergence from independence for I(S,D) There will be a test at the end of the tutorial ;-)
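The statement "gravity model = independence" can be checked numerically: a traffic matrix built from the simple gravity model has zero mutual information between source and destination. The totals below are hypothetical, and the sketch covers only the simple gravity model, not the generalized one.

```python
import math


def gravity(out_totals, in_totals):
    """Simple gravity model: demand(i, j) proportional to out_i * in_j."""
    total = sum(out_totals)
    return [[o * i / total for i in in_totals] for o in out_totals]


def mutual_information(tm):
    """I(S,D) in bits for a traffic matrix read as a joint distribution."""
    total = sum(sum(row) for row in tm)
    p_s = [sum(row) / total for row in tm]
    p_d = [sum(col) / total for col in zip(*tm)]
    mi = 0.0
    for i, row in enumerate(tm):
        for j, volume in enumerate(row):
            if volume > 0:
                p = volume / total
                mi += p * math.log2(p / (p_s[i] * p_d[j]))
    return mi


tm_gravity = gravity([30, 70], [40, 60])
tm_pairs = [[50, 0], [0, 50]]  # each source talks to exactly one destination
print(mutual_information(tm_gravity))
print(mutual_information(tm_pairs))
```

In the second matrix knowing the destination reveals the source completely (one bit); the gravity matrix sits at the other extreme, which is why it is a natural starting point when only link loads are known.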

Tomo-gravity Works Best of tomography and gravity modeling (solid foundation in information theory) Simple, and quick: A few seconds for large IP backbone Accurate: average ~11% error Including netflow now significantly improves this! Errors become a few percent. Uses widely available SNMP data Highly robust → Can work within the limitations of SNMP data Only uses first order statistics → Interpolation very effective Limited scope for improvement Can easily incorporate additional constraints Killer App: Network Maintenance To summarize, tomo-gravity really works! It takes the best of both tomography and gravity modeling … (end with the story on how we successfully prevented service disruption during disastrous simultaneous link failures)

Executing the Plan Prepare Network, Perform, Verify Back to the Router Upgrade Example – Via IGP (OSPF, IS-IS) metric changes Cost-out: assign high weight to link(s) so that traffic is drained out before bringing the link down Cost-out a link Bring the link down Perform maintenance/upgrade Bring the link up Cost-in the link Cost-out/cost-in does not mean zero impact on traffic Possibility of loops However, traffic is handled more gracefully Aman Shaikh, Albert Greenberg, August 2005
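The effect of a cost-out on route selection can be sketched with a small shortest-path computation. The topology and weights are hypothetical, and 65535 is used here simply as a very high weight.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over a dict-of-dicts graph; returns the node list src..dst."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def set_weight(graph, a, b, weight):
    """Change the IGP weight of link a-b in both directions."""
    graph[a][b] = weight
    graph[b][a] = weight


graph = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 15},
    "D": {"B": 10, "C": 15},
}
before = shortest_path(graph, "A", "D")
set_weight(graph, "B", "D", 65535)   # cost-out link B-D ahead of maintenance
after = shortest_path(graph, "A", "D")
print(before, after)
```

The link is still up after the cost-out, so traffic moves to the alternate path through a routing change rather than through a failure, which is the graceful handling the slide refers to. The sketch computes a converged view only; transient loops during convergence are not modeled.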

Router Cost-out Options
Option 1: Cost-out all outgoing links of a router Based on IETF RFC 3137 [rfc3137] Configuration changes only at the router in question Cisco ‘max-metric router-LSA’ command allows one to perform entire cost-out in one atomic operation Option 2: Cost-out all incoming links of a router Have to cost-out links at the neighboring routers Which option is better? Option 1 is operationally easier than option 2 Impact on traffic: not clear Aman Shaikh, Albert Greenberg, August 2005

Hitless Upgrades? Make hardware/software upgrades completely non-intrusive No impact on routing and forwarding performance of routers Other than the router being upgraded, no impact on customer performance and traffic Other operational uses: Router internals for continuous operation during upgrade also provide increased reliability and availability during failures How? Component redundancy Component Plug-n-play Protocol extensions Nirvana Active research area Aman Shaikh, Albert Greenberg, August 2005

Component Redundancy Backup route processor Duplicate state at the backup route processor Seamlessly transfer control to backup processor E.g., Avici’s NSR (non-stop routing) [avici-nsr] Bundle multiple physical links into a single IP layer link Issues: Ensure packets from a single flow are delivered in-order Fast failover required Failure of some links can overload link Bandwidth thresholds to bring a link down Avici’s Composite links [avici-composite-link] Aman Shaikh, Albert Greenberg, August 2005
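One common way to keep the packets of a flow in order on a bundle is to hash the flow identifier onto a member link. The sketch below illustrates that idea only; the flow tuple, member names and choice of CRC32 are assumptions, not a description of any vendor's implementation.

```python
import zlib


def pick_member(flow, members):
    """Map a flow (e.g., src, dst, protocol, ports) to one member link."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


members = ["link-1", "link-2", "link-3", "link-4"]
flow = ("192.0.2.1", "198.51.100.7", 6, 40000, 80)

first = pick_member(flow, members)
# Simulate failure of the chosen member: the flow is re-hashed onto survivors.
survivors = [m for m in members if m != first]
rehomed = pick_member(flow, survivors)
print(first, rehomed)
```

Every packet of the flow takes the same member link, so ordering is preserved. After a member failure the survivors absorb its flows, which is how the overload issue listed above arises.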

Protocol Extensions for Hitless Restart
Extend routing protocols so that a router is used for forwarding even if routing process is inactive Issues: Need support from multiple routers What to do upon topology changes to avoid black-holes and loops? Example: Two proposals for OSPF Graceful restart [rfc3623] Support from neighbor routers required Abandon hitless restart upon topology change I’ll Be Back (IBB) [shaikh-infocom02] Support from entire OSPF domain required Abandon IBB only if loops and/or black-holes can actually form and only for affected destinations Cisco’s NSF with SSO [cisco-bgp-nsf] Aman Shaikh, Albert Greenberg, August 2005

Network Security Intelligence is key If you don’t understand it how can you secure it? If you don’t understand it how can you tell what’s different? Network Security and (normal) Network Management two sides of the same coin Information Needs Topology, Traffic, Routing, Configuration, Service – Customer associations Example: same network-wide data sources that feed traffic engineering, feed online threat analysis – e.g., netflow Yet, for security, perhaps more so than normal NM tasks, the details really matter DoS is Denial of Service – not necessarily Distributed Denial of Service Attack (DDoS Attack) A (difficult) task for network care is to determine whether an anomaly arises from “natural causes” or from DDoS attacks Example: SYN floods caused by web server crash (HTTP and user retries) or a router crash (BGP retries) vs. SYN floods caused by an attack Example: Spikes and swings in traffic with root causes in the optical layer – traffic not being monitored by DoS sensors suddenly becomes monitored . . . Aman Shaikh, Albert Greenberg, August 2005

End System Trends (Enterprise and Home)
Explosion of security risks in the end systems PDAs generate and hold a ton of private information Example: Paris Hilton’s sidekick PDA Appealing applications open new doors for exploits (W32/Mytob …), instant messaging, … Urgent! Click. Try this URL? Click. Install this? Click. You sure? Click. Malware installed Solutions Ways to cope: vulnerability testing, user training, desktop configuration management Microsoft Tuesdays: teams of specialists who analyze monthly advisories from major software houses on newly discovered vulnerabilities, and on cost/benefit analyses on deployment Explosion of software and devices running software Adding a lot of new code and new vulnerabilities Bad guys never had it so good More complex end system firewalls and rules may not be the solution VPNs, fancy group management, network definitions, bandwidth controls, … Witty worm: clever one packet worm that successfully exploited a firewall manufacturer’s product line, exploiting ports the firewall meant to block Complexity! Aman Shaikh, Albert Greenberg, August 2005

Enterprise Trends: Outsourcing
Outsource the wide area network MPLS VPNs run by a network service provider Outsource the servers Network firewall, hosted , e-business, VoIP infrastructure, web applications run by a network service provider Why? Complexity: Complicated to secure and expensive to manage Advisories, patches, best practices, churn Routers increasingly complex: distributed intelligence across line cards, route processors VPN technologies Greasing the skids for server outsourcing Lowering the expense for backhauling to data centers to reach outsourced servers Enterprises cope by concentrating and centralizing solutions and expertise Providers have a multiplexing advantage Amortization of knowledge: more data, more confirmation of attacks or problems, more information shared across customers More efficient engineering – less over-engineering with pooled resources Consequence: Security an active area for network service providers and networking research Securing the core, the data centers, the networked applications, and the customers Example: July 2003, specific 3 packet sequence jams Cisco linecards Industry’s processes for vulnerability notification, patch management less structured and evolved for routers than for desktops Aman Shaikh, Albert Greenberg, August 2005

Attack Traffic Trends Decline of worms and viruses that jam the network On the front page in the early 2000s, the carpet bombers that jam networks: Slammer, Sapphire, Code Red, … Relatively small, stable residual traffic persists from these Yet, the potential is still there for another carpet bombing worm Greatest potential for research on worm mitigation is for the enterprise Throttling at or near the source Not every enterprise was hit by Slammer Rise of the targeted, purposeful attacks that jam or compromise more focused targets Hackers against hackers Poor Man’s Internet gaming Attacks against specific applications, services, customers Wide set of popular toolkits/attacks available: smurf, fraggle, TCP SYN flood, connection killing, distributed reflection Identity theft Example: phishing attacks Malware installation Bots bought and sold to spammers, and other bad guys

Network Security Activities: Prevention
Prevention has a bigger bang for the buck Some Enterprises may think of detection and mitigation as too little too late Perimeter defense Routers: ACLs, blackhole routes Servers: firewalls Core cloaking MPLS (one IP hop) core stops attackers from knowing internal topology and routing Access control for network elements Logins/passwords via centralized authentication servers Controls on which users/systems can execute which commands Audit trails Vulnerability Analysis Testing: how routers, switches, servers hold up to a range of emulated attacks in the test lab Simulation: identification of weaknesses and better mitigation policies via network-level simulations Network’s weakest link is the network’s strongest component Rigorous and periodic security audits for all network and service elements Routers, switches, servers, … Stuff you do ahead of time to make

Prevention: Perimeter Control
Forwarding mechanisms Unicast reverse path forwarding controls for spoofed source IP addresses Drop at edge router if source IP is not routable Blackhole routes for specific destination IP addresses Static routes whose BGP next hop is not routable Used to drop packets directed to infrastructure, and to drop attacks on customer routes Somewhat coarse logging to support attack forensics Filtering mechanisms Data: Access Control Lists (ACLs) More precise blocking (src/dest IPs, ports) and rate limiting of packet streams Provider network ACLs: relatively simple, instantiated at edge interfaces Enterprise network ACLs: relatively complex, instantiated across the network Somewhat more precise logging to support attack forensics Downside: intensive use of processing and memory resources often precludes wide use Control: routing import and export policies CE import policies: control plane counterpart of ACLs (route scoping) CE export policies: route scoping and good citizenship (limiting route propagation to specific groups, not propagating any instabilities)
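
The reverse path forwarding check above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the route table, the prefixes (taken from documentation address ranges) and the interface names are all hypothetical. The default mode matches the slide's rule ("drop if source IP is not routable"); the optional strict mode, which additionally requires the route back to the source to use the ingress interface, is an assumption added for comparison.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> outgoing interface.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "ge-0/0/1",
    ipaddress.ip_network("198.51.100.0/24"): "ge-0/0/2",
}


def lookup(addr):
    """Longest-prefix match; returns the outgoing interface or None."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in ROUTES if ip in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def urpf_accept(src, ingress_if, strict=False):
    """Return True if a packet with source `src` passes the check."""
    out_if = lookup(src)
    if out_if is None:
        return False          # source is not routable: drop
    if strict:
        return out_if == ingress_if  # route back must use the ingress interface
    return True


print(urpf_accept("192.0.2.7", "ge-0/0/1"))    # routable source
print(urpf_accept("203.0.113.9", "ge-0/0/1"))  # no route to source
```
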

Prevention: Testing Complex feature interactions in routers have the potential of amplifying small DoS attacks A subtle sequence of QoS configuration commands can cause packets to be process switched (by the CPU, whose cycles are needed for OSPF, BGP, …) rather than line card switched Consequences CPU and traffic correlated Small DoS attack (e.g., on a T1 interface) can bring down a large CE [Figure: CPU load vs. time (diurnal traffic loading pattern); annotations: “CPU correlated with link load!” and “QoS config that removes process switching”]

Network Security Activities: Detection/Forensics
Network Providers strategically positioned to fight DDoS Traffic Analysis Detection: early sensing of possible attack Today, catching the high volume attacks for the most part Forensics: sustained analysis and trace-back Challenges and balancing acts in creating and maintaining relatively raw data Traffic analysis challenge: massive traffic volumes across the network edge Flow-based monitoring: scalable, comprehensive Packet-header monitoring: deep analysis on important interfaces (or interfaces under attack) Dark space monitoring: helpful when source IP spoofing is occurring Arbor, Riverhead (Cisco), Cloudshield (underlay), Snort … Owing to the scale of the network and the traffic, all of the above is research Routing analysis Monitoring diversion of routes and traffic from intended destinations In the middle of the Internet, a BGP speaker lies about routes Detection: ISPs/enterprises set up “customer” connectivity to other ISPs to monitor the advertisement of their private address spaces (AT&T Peermon) Active research domain Again, there is a multiplexing advantage Seeing a large fraction of the Internet helps Challenges Make sure measurement infrastructure remains up during attacks Time to detect, how long to collect forensics for analysis/traceback

Network Security Activities: Mitigation
Blackhole routes and ACLs Stops the bad traffic and any collateral damage to the victim, not necessarily the DoS Buys time for forensics, other mitigation Scrubbing Diversion to scrubbing farm, which attempts to drop/analyze the attack traffic and send remaining on to the destination Makes most sense in a network, as a shared resource Scrubbers involve expensive deep packet inspection of traffic diverted to the scrubbing farm via routing and tunneling Challenges Whether or not to mitigate Size and duration of the attack, damage (including collateral), customer How long to mitigate Fixed time interval (one week?), until the attack disappears? How to automate High cost of mitigating false positives Adaptive defenses, closed loop incorporating false positives and associated costs [Duffield]

VoIP Case Study

Outline for VoIP Fundamentals Commercial VoIP service models VoIP network management Fundamentals VoIP signaling and transmission Generic VoIP infrastructure Commercial VoIP services Customer-oriented VoIP Upstream bandwidth determination Business-oriented VoIP SLAs, determining VoIP quality VoIP network management Challenges Trouble-shooting Security Performance implications for the underlying IP network

Pluses and Minuses for VoIP Pluses VoIP is cheaper than traditional phones Additional services/features Can forward voicemail to an email address Users can carry their phone wherever they want Minuses Low reliability Hard to achieve the “five-9s reliability” (99.999%) of the PSTN Poor call quality at times Dependence on power VoIP stops working during a power failure

Voice is Big (~ as Big as Data)
U.S. long distance (rough round numbers) 4.5 petabytes/day (petabyte = 10^15 bytes) ~1 billion calls/day ~3 minutes/call ~2 x 100 kbps for encoding two 64 kbps streams per call Flash crowds (e.g., American Idol voting) Tens of millions of calls in 10 minutes (first few minutes of voting) to a handful of phone numbers By comparison, a very large, tier 1 ISP carries ~2-3 petabytes/day
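
As a quick sanity check, the round numbers on this slide can be multiplied out. The sketch below only reproduces the slide's arithmetic; the function name is made up for illustration.

```python
def daily_voice_petabytes(calls_per_day, minutes_per_call, streams_per_call, kbps_per_stream):
    """Total voice volume per day in petabytes (1 PB = 10**15 bytes)."""
    seconds = minutes_per_call * 60
    bits = calls_per_day * seconds * streams_per_call * kbps_per_stream * 1000
    return bits / 8 / 1e15


# Slide values: ~1 billion calls/day, ~3 minutes/call, ~2 x 100 kbps per call.
print(daily_voice_petabytes(1e9, 3, 2, 100))
```

With the slide's inputs this comes out at the quoted 4.5 petabytes/day.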

PSTN is Reliable, and Society Banks on that!
That is, voice on the PSTN (Public Switched Telephone Network) is amazingly reliable Five nines (99.999% availability) engineering In the U.S., outages are reported to the FCC FCC = Federal Communications Commission Voice supports critical services 911 and GETS GETS = Government Emergency Telecommunications Service Out-of-band network configuration management for IP networks Dial into the router, rather than telnet in Lest you saw off the limb you are standing on
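
To make "five nines" concrete, the availability figure can be converted into allowed downtime. This is a small illustrative calculation; the function name and the 365-day year are assumptions.

```python
def downtime_minutes_per_year(availability, days_per_year=365):
    """Minutes of outage per year permitted by a given availability."""
    return (1.0 - availability) * days_per_year * 24 * 60


for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(label, round(downtime_minutes_per_year(a), 2), "minutes/year")
```

Five nines leaves a little over five minutes of outage per year, which is the bar VoIP is being measured against.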

VoIP Signaling [Diagram labels: VoIP Phone (Alice), VoIP Phone (Bob), Register Alice, Register Bob, Registrar Service, Call Bob, Signaling using SIP, Resolve Bob’s location, Location Server, Proxy Server (x2), Send call to Bob’s domain, VoIP Infrastructure] Signaling Call setup, session management, negotiation of session parameters, dealing with advanced features Competing protocols and standards SIP [rfc3261], H.323 (ITU-T), MGCP [rfc3435], ...

VoIP Transmission: Voice Samples/UDP
[Diagram labels: Alice’s VoIP Phone, Coder (AD Converter), media packet, IP Cloud, De-jitter buffer, Decoder (DA Converter), Bob’s VoIP Phone; packet fields: Eth, IP, UDP, RTP, Voice sample; IPSec?] IPSec (from Webopedia): Short for IP Security, a set of protocols developed by the IETF to support secure exchange of packets at the IP layer. IPsec has been deployed widely to implement Virtual Private Networks (VPNs). IPsec supports two encryption modes: Transport and Tunnel. Transport mode encrypts only the data portion (payload) of each packet, but leaves the header untouched. The more secure Tunnel mode encrypts both the header and the payload. On the receiving side, an IPSec-compliant device decrypts each packet. For IPsec to work, the sending and receiving devices must share a public key. This is accomplished through a protocol known as Internet Security Association and Key Management Protocol/Oakley (ISAKMP/Oakley), which allows the receiver to obtain a public key and authenticate the sender using digital certificates. Voice sample transmission: IP packets (RTP [rfc3250] over UDP) Coder + Decoder = Codec Performs analog-to-digital and digital-to-analog conversion Codecs vary in sound quality, bandwidth requirement, computational requirement… Each phone, gateway, and service supports several different codecs Example codecs: ITU G.711 (64 kbps), ITU G.729 (8 kbps)
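
The codec rates quoted above are payload rates only; each voice packet also carries RTP, UDP and IP headers. The sketch below estimates the IP-level rate of one voice stream. The 20 ms packetization interval is an assumption (the slide does not give one), as is the use of IPv4; the header sizes are the standard minimum sizes for RTP (12 bytes), UDP (8 bytes) and IPv4 (20 bytes).

```python
RTP_HDR_BYTES = 12   # fixed RTP header, no CSRC list or extensions
UDP_HDR_BYTES = 8
IPV4_HDR_BYTES = 20  # no IP options


def ip_rate_kbps(codec_kbps, packet_ms=20):
    """IP-level bandwidth of one voice stream, headers included.

    packet_ms is the assumed amount of speech carried per packet.
    """
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    packet_bytes = payload_bytes + RTP_HDR_BYTES + UDP_HDR_BYTES + IPV4_HDR_BYTES
    return packet_bytes * 8 * packets_per_second / 1000


print("G.711:", ip_rate_kbps(64), "kbps")
print("G.729:", ip_rate_kbps(8), "kbps")
```

Under these assumptions the headers add a fixed 16 kbps per stream, which is a modest surcharge for G.711 but triples the rate of G.729. Link-layer framing and any IPsec encapsulation would add further overhead that this sketch leaves out.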

Consumer VoIP Service Models
BYOA (Bring Your Own Access) model “Overlay,” “Third Party,” … To date: this model works extremely well Today’s DSL vs. Cable wars help to explain why QoS in the end systems (telephony adapters), with no access to QoS in access or core networks; accordingly no transport SLA Commercial offers: AT&T CallVantage, Vonage, 8x8, Skype BSP (Broadband Service Provider: Comcast, SBC, …) model End-to-end QoS for on-net and on-net-to-PSTN flows, with potential for transport SLAs BSP capabilities to tag and differentiate their own service offers (e.g., voice, video, web) from third party services DOCSIS, PacketCable, … BSP capabilities to potentially integrate modem, telephony adapter, router, firewall and more in one residential gateway box per home U.S. Supreme Court’s “Brand X” decision [internetnews-brandx], June 27, 2005: “Court Backs Cable in Brand X Case”. The Supreme Court ruled today that cable broadband providers do not have to share their lines with independent Internet service providers (ISPs), a victory for the cable industry, which dominates the broadband access market. In a 6-3 decision, the justices said the Federal Communications Commission (FCC) was correct in classifying broadband cable modem as a deregulated information service.

Consumer VoIP Service Models in Pictures
[Diagram] BYOA Model: Home (Phone, TA, PC, Cable/DSL Modem), Cable/DSL Provider, Internet, VoIP Provider, PSTN. BSP Model: Home (Phone, TA, PC, Cable/DSL Modem), Cable/DSL/VoIP Provider, Internet, PSTN. TA = Telephony Adapter

Business VoIP Service Model
[Diagram labels: Enterprise site 1, WAN/VPN Service Provider, Enterprise site 2, PSTN] Service models parallel to PSTN counterparts DIY (Do it Yourself): Enterprise maintains its own PBX PBX = Private Branch eXchange Outsourced PBX: Use of IP-centrex [ip-centrex] Equipment Enterprise side VoIP phones, (potentially) PBX, VoIP infrastructure (potential) capability of using PSTN for fail-over Service Provider side VPNs with QoS capabilities, VoIP infrastructure, (potentially) IP-centrex End-to-end QoS expected and possible! PBX (from Webopedia): Short for private branch exchange, a private telephone network used within an enterprise. Users of the PBX share a certain number of outside lines for making telephone calls external to the PBX. Most medium-sized and larger companies use a PBX because it's much less expensive than connecting an external telephone line to every telephone in the organization. In addition, it's easier to call someone within a PBX because the number you need to dial is typically just 3 or 4 digits. A new variation on the PBX theme is the centrex, which is a PBX with all switching occurring at a local telephone office instead of at the company's premises. CENTREX (from Webopedia): Short for central office exchange service, a type of PBX service in which switching occurs at a local telephone station instead of at the company premises. Typically, the telephone company owns and manages all the communications equipment necessary to implement the PBX and then sells various services to the company. There is also a web site related to IP-centrex [ip-centrex]. Some of the IP-centrex vendors seem to be: Lucent-Juniper (they seem to have a combined product), NetCentrex, Sylantro.

VoIP Network Management Challenges
Data Plane End system management for Consumer (BYOA) VoIP QoS management for Business VoIP Control Plane VoIP server infrastructure monitoring VoIP security issues

VoIP End System Management
Consumer expectations: surf (at roughly the same speed as before adding VoIP) and talk simultaneously Data: set MSS and QoS parameters Voice: use appropriate codec with right set of parameters Biggest issue: upstream (from home to Internet) bandwidth Too small: VoIP is infeasible! Natural, longer term solution: estimate available bandwidth and dynamically change codec in TAs Most current generation TAs cannot do either Get it wrong, and web speed may degrade, a potential dissatisfier

Estimating Upstream Bandwidth of Customer
[Diagram labels: Measurement Source, customer, ICMP ECHO packets, ICMP ECHOREPLY packets, upstream background traffic] Assumption: provider does not have direct access to customer Measurement source sends ICMP ECHO packets to a customer node Estimate the customer’s upstream bandwidth by measuring the arrival rate of ICMP ECHOREPLY packets from customer nodes

Bandwidth Estimation: A Bit More Detail
[Diagram labels: $T_{echo}$, $B_{downstream}$, customer, $B_{upstream}$, $T_{echoreply}$] Let $S_{echo}$ and $S_{echoreply}$ be the packet sizes of the ICMP ECHO and ECHOREPLY packets. Assumptions: The upstream link of the customer is the bottleneck of the round-trip path, i.e., $T_{echo} < T_{echoreply}$; otherwise, $B_{estimated} \approx B_{downstream}$. Most broadband clients satisfy this requirement. The customer node replies to (large-size) ICMP ECHO packets. $B_{estimated} = N \cdot S_{echoreply} / T_{echoreply} \approx B_{upstream}$
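
The estimate on this slide is a one-line computation once the replies have been timed. The sketch below applies the slide's formula directly. The slide does not define N or the timing interval explicitly, so two readings are assumed here: N is the number of ECHOREPLY packets received, and the reply time is the interval over which those replies arrive at the measurement source. The function and parameter names are made up for illustration.

```python
def estimate_upstream_bps(n_replies, reply_size_bytes, t_echoreply_s):
    """Estimated upstream bandwidth in bits/s: N * S_echoreply / T_echoreply."""
    if n_replies <= 0 or t_echoreply_s <= 0:
        raise ValueError("need at least one reply and a positive interval")
    return n_replies * reply_size_bytes * 8 / t_echoreply_s


# Hypothetical measurement: 100 replies of 1000 bytes each, spread over 2 seconds.
print(estimate_upstream_bps(100, 1000, 2.0) / 1000, "kbps")
```

A real probe would also have to send and time the ICMP packets and check that the upstream link really is the bottleneck; none of that is shown here.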

Bandwidth Estimation: Potential Deal breakers
Downstream congestion could make the downstream path the bottleneck ICMP packet generation delay at the customer node can increase $T_{echoreply}$ Strictly speaking, $B_{estimated}$ is a value between real available bandwidth and upstream capacity BSP may block ICMP…

QoS Management for Business VoIP
QoS is possible because of end-to-end control Approaches for classifying and marking VoIP traffic at enterprise sites Approach 1: mark traffic in the VoIP phones themselves Approach 2: use a separate VLAN (Virtual LAN) for VoIP traffic, and mark traffic coming over this VLAN Approach 3: use an agent that looks for RTP traffic and marks the packets Provide QoS to marked traffic inside the service provider network Routers provide service differentiation for different classes of traffic Example: Cisco’s MPLS class of service [cisco-mpls-cos] uses WRED and WFQ for service differentiation WRED (Weighted RED): for controlling packet loss probability WFQ (Weighted Fair Queuing): for controlling delay and bandwidth Setting these parameters is often challenging!
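
To illustrate how WRED controls loss probability per class, here is a minimal sketch of the textbook RED drop curve with a separate profile per traffic class. It is not Cisco's implementation: the class names, thresholds and maximum drop probabilities are hypothetical, and the averaging of the queue length is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WredProfile:
    min_threshold: float   # average queue depth (packets) where drops start
    max_threshold: float   # average queue depth where every packet is dropped
    max_drop_prob: float   # drop probability just below max_threshold


def drop_probability(avg_queue, profile):
    """Linear RED drop curve between the two thresholds."""
    if avg_queue < profile.min_threshold:
        return 0.0
    if avg_queue >= profile.max_threshold:
        return 1.0
    span = profile.max_threshold - profile.min_threshold
    return profile.max_drop_prob * (avg_queue - profile.min_threshold) / span


# Hypothetical per-class profiles: voice is dropped later and more gently.
PROFILES = {
    "voice": WredProfile(min_threshold=30, max_threshold=40, max_drop_prob=0.02),
    "data": WredProfile(min_threshold=20, max_threshold=40, max_drop_prob=0.10),
}

for name, profile in PROFILES.items():
    print(name, drop_probability(30, profile))
```

The "weighted" part is simply that each class gets its own curve, which is also why choosing the parameters is hard: every class adds three more knobs.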

SLA Offerings for Business VoIP
SLA offerings possible because of end-to-end control and QoS Subject to Acceptable Usage Policy (AUP) Example: all bets are off if bandwidth usage is more than X% SLAs are offered in terms of voice quality Determining voice quality Voice stands out as one network application where huge investment has been sunk into quality evaluation Voice is tricky, and voice is important! Psycho-acoustic measures MOS (Mean Opinion Score) 5 (excellent), 4 (good), 3 (fair), 2 (poor), 1 (bad) Via panels of people listening to voice samples Important to relate quality to measurable impairments on the path from mouth to ear ITU’s E-Model [e-model]

ITU’s E-Model A tool for voice transmission planning, developed and used by the world’s experts. R is an index for the quality of a voice connection, also known as the “R-factor”

R                          MOS    %GOB    %POW
100
93 (G.107 default value)   4.4    98.4    0.1
90                         4.3    97.0    0.2
80                         4.0    89.5    1.4
70                         3.6    73.6    5.9
60                         3.1    50.1    17.4
50                         2.6    26.6    37.7
(bottom of scale)          1.0            99.8

User satisfaction bands: R from 90 to 100, Very Satisfied; 80 to 90, Satisfied; 70 to 80, Some Users Dissatisfied; 60 to 70, Many Users Dissatisfied; 50 to 60, Nearly All Users Dissatisfied; below 50, Not Recommended
Courtesy: Al Morton
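
The MOS values listed against each R on this slide can be reproduced with the R-to-MOS conversion from ITU-T G.107. The sketch below implements that cubic as I recall it from the Recommendation; treat the coefficients as something to check against [e-model] before relying on them. The function name is made up.

```python
def r_to_mos(r):
    """Convert an E-model rating R into an estimated MOS (G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


for r in (93, 90, 80, 70, 60, 50):
    print(r, round(r_to_mos(r), 1))
```

Rounded to one decimal, the results agree with the MOS figures quoted on the slide for R = 93, 90, 80, 70, 60 and 50.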

The “R-Factor” R-factor is a single-integer measure of voice quality Range: [0, 100], 0: worst, 100: best PSTN range for R is 80 to 90, nominally 85; toll quality is R ≥ 80 $R = 100 - I_s - I_d - I_f + A$ Simple, additive model $I_s$, $I_d$ and $I_f$ model impairments from the network (delay, loss, jitter) as well as the codec (loss, delay from compression and de-jitter buffer depth) A reflects lowered expectations given added convenience Example: A = 10 for cellular R can be estimated from network measurements Parameterized by codec parameters Offboard in a network probe Onboard the routers (example: Cisco SAA)
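
The additive model on this slide translates directly into code. The sketch below uses the slide's formula and its toll-quality threshold of R ≥ 80; the impairment values in the example are hypothetical and are only there to show how the advantage factor A shifts the result.

```python
def r_factor(i_s, i_d, i_f, advantage=0.0):
    """R = 100 - Is - Id - If + A, clamped to the [0, 100] range."""
    r = 100.0 - i_s - i_d - i_f + advantage
    return max(0.0, min(100.0, r))


def is_toll_quality(r):
    """Toll quality threshold quoted on the slide: R >= 80."""
    return r >= 80


# Hypothetical impairments, without and with the cellular advantage factor A = 10.
wired = r_factor(5, 10, 10)
cellular = r_factor(5, 10, 10, advantage=10)
print(wired, is_toll_quality(wired))
print(cellular, is_toll_quality(cellular))
```

The same impairments that fall short of toll quality on a wired line pass once A = 10 is credited, which is the "lowered expectations" effect the slide describes.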

Impact of Loss and Delay on R-Factor
Knee at ~150 ms for 1-way delay: beyond it, quality degrades drastically Long propagation delays, caused by optical or routing level anomalies, have a large negative impact for global VoIP Need for latency sensitive routing, fast fail-over and convergence Courtesy: Al Morton

VoIP Control Plane Management Challenges
Architectural: VoIP signaling and calls span multiple domains Technological: VoIP infrastructure software relatively immature Example: equipment does not gracefully handle overload conditions Flash crowds can become a big problem Business-oriented VoIP: conference call Customer-oriented VoIP: vote for “American Idol” Standards-related: Protocols are still evolving… Leads to inter-operability issues between vendors Multiple signaling protocols (SIP, H.323, MGCP, …) make matters worse! Overlapping functionality but each provides its own unique functionality Will the community converge to a single protocol? SIP seems to be the protocol of choice going forward… Simple, service agnostic, extensible (trading off QoS and security TBD or by leveraging other protocols, at least initially), improving standards and implementation Leads to deployment of software for protocol interworking Adds more components to the VoIP infrastructure

VoIP Control Plane Monitoring
Multiple, Distributed Servers Access servers (AS), Call Routing Elements, PSTN Gateways, Advanced Feature Servers Class 5 features (dial-tone, DTMF, call-waiting, call forwarding, …), routing on-net or off-net (to PSTN), supporting advanced features such as “locate me”, conferencing Multiple information sources CDRs (Call Detail Records), SNMP MIBs/Traps, Server status and logs, distributed SIP sniffers/analyzers, active VoIP-specific probes Complex, distributed systems debugging CDRs provide “cause codes” for problems SIP sniffers help to localize the problem to specific servers/databases/gateways Device specific diagnosis helps to trouble-shoot problem

VoIP-Specific Security Issues
Today’s middle-boxes, such as firewalls and NATs, do not always work well with VoIP Firewalls: block dynamic ports used by VoIP NAT: hide the identity of the user behind it Traditional IP security measures can have adverse impact on delay, jitter and bandwidth Example: crypto-engines used for encryption may not support QoS There is a trade-off between application-level encryption and IP-level encryption Application-level not good for wiretap requirements like CALEA [calea] IP-level (e.g., IPSec) can significantly increase bandwidth usage since VoIP packets are small VoIP services require trust and closed user group management Otherwise, I can hang up your phone, make your message light go on, steal service, … Gaps in today’s control plane, filled via the management plane

Some Directions and Challenges

Core Network Management
Myth of five nines? What level of reliability is really required? SONET rings provide 50 msec protection: should customers really care? How to design end equipment that’s more tolerant to small outages? Reliability is critical for some services: out of band control (now using the PSTN! when VoIP succeeds…); 911 Is reliability really about FRR or SONET rings? Old news, numerous solutions Yet, how do we get to robustness: understanding and controls to assure that a small push applied to the network will have a small impact Where are the new and impactful opportunities? Edge, enterprise, higher layer interactions Reliable router? How to deal with the whirlwind of new features and interactions How to be proactive? How to explore the huge multi-dimensional space in testing How to uncover the plethora of failure modes in the field Correlation, learning How to design for a software defect rate much higher than acceptable How to simplify enterprise networks so that they are inherently less fragile and much simpler to reason about and control

Security Problems really arise at the end systems Servers, PDAs, software, software, software… Should solutions be focused on the end systems, or the network? To what extent can the network help protect the customer’s software infrastructure? How much and where? DDoS attacks: despite all the research, a huge amount of improvement needed VPNs: membership in many, without exploding complexity and information leaks Stepping stones, bot-armies, marketplace of malware? Network itself faces thorny security challenges Secure router designs for handling both public and private traffic Access control

Automation How do we create Information base: accurate, timely information -- customer-feature associations, performance/fault measurements, … Decision support: effective rules and/or decision support tools – predictable response to potential control actions Control mechanisms: effective protocol and network management mechanisms for direct implementation of desired controls How far can we push automation, coping with Multiple objectives, multiple criteria Software rot: assumptions in the software disconnecting from reality Small errors in information, decision and control having large impact Where should different elements of network management functionality be placed? control vs. management planes Lift intelligence into the management plane Rework the control plane architecture

Services How do you design an Internet that can support a range of new services What do these new services require? TV? R-factor for video? How to do scalable, application-level monitoring and adaptation, coping with Pollution of QoS classes Network/application interactions: design, management, fault localization, provisioning Localization: is the application or the network broken What new services or enhanced existing services can the network offer?

See http://www.research.att.com/~ashaikh/network-management
References

Routing [rfc2328] J. Moy, “OSPF Version 2”, IETF RFC 2328 [rfc1771] Y. Rekhter and T. Li, “A Border Gateway Protocol 4 (BGP-4)”, IETF RFC 1771 [rfc1195] R. Callon, “Use of OSI IS-IS for Routing in TCP/IP and Dual Environments”, IETF RFC 1195 [rfc2453] G. Malkin, “RIP Version 2”, IETF RFC 2453 [cisco-eigrp] “Enhanced Interior Gateway Routing Protocol (EIGRP)”, [route-views] [ripe-ris] [ipsum] [packetdesign]

Routing (cont’d.) [sprint-ipmon] [pyrt] [bgplay] [ripe-libbgpdump] [shaikh-nsdi04] A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, Proc. USENIX NSDI, Mar. 2004 [caesar05-policies] M. Caesar and J. Rexford, “BGP Policies in ISP Networks”, UC Berkeley Technical Report UCB/CSD , Mar. 2005 [nordstrom04] O. Nordstrom and C. Dovrolis, “Beware of BGP attacks”, ACM SIGCOMM CCR, Apr. 2004 [feamster-sigmetrics04] N. Feamster et al., “A Model of BGP Routing for Network Engineering”, Proc. ACM SIGMETRICS, June 2004 [smart-routing05] NANOG 25 panel, June 2002

Routing (cont’d.) [feamster-imc04] N. Feamster et al., “BorderGuard: Detecting Cold Potatoes from Peers”, Proc. IMC, Oct. 2004 [feamster-nsdi05] N. Feamster and H. Balakrishnan, “Detecting BGP Configuration Faults with Static Analysis”, Proc. USENIX NSDI, May 2005

Network Troubleshooting
[cisco-netflowv9] [sflow] [ietf-ipfix] [daytona] [smartsamp] [lumeta] [traceroute] [tcpdump] [nimi] [planetlab] [peterson02] L. Peterson et al., “A Blueprint for Introducing Disruptive Technology into the Internet”, HotNets, Oct. 2002 [bavier04] A. Bavier et al., “Operating System Support for Planetary-Scale Services”, Proc. USENIX NSDI, Mar. 2004

Network Troubleshooting (cont’d.)
[lakhina04] A. Lakhina, M. Crovella and C. Diot, “Diagnosing Network-wide Traffic Anomalies”, Proc. ACM SIGCOMM, Sept. 2004 [roughan04] M. Roughan et al., “Combining Routing and Traffic Data for Detection of IP Forwarding Anomalies”, Proc. ACM SIGCOMM NetTS Workshop, Aug. 2004 [kompella05] R. Kompella et al., “IP Fault Localization via Risk Modeling”, Proc. USENIX NSDI, May 2005 [teixeira04] R. Teixeira et al., “Dynamics of Hot-Potato Routing in IP Networks”, Proc. ACM SIGMETRICS, June 2004 [agarwal04] S. Agarwal et al., “Impact of BGP Dynamics on Router CPU Utilization”, Proc. PAM, April 2004

Maintenance and Upgrade
[rfc3623] J. Moy et al., “Graceful OSPF Restart”, IETF RFC 3623 [shaikh-infocom02] A. Shaikh, R. Dube and A. Varma, “Avoiding Instability during Graceful Shutdown of OSPF”, Proc. IEEE Infocom, June 2002 [rfc3137] A. Retana et al., “OSPF Stub Router Advertisement”, IETF RFC 3137 [avici-nsr] [avici-composite-link] [cisco-bgp-nsf]

VoIP [voip-info-wiki] [goode02] B. Goode, “Voice Over Internet Protocol (VOIP)”, Proc. of the IEEE, Vol. 90, No. 9, Sept. 2002 [mehta01] P. Mehta and S. Udani, “Overview of Voice over IP”, Tech. Report MS-CIS-01-31, University of Pennsylvania, Feb. 2001 [sinden02] R. Sinden, “Comparison of Voice over IP with Circuit Switching Techniques”, Department of Electronics and Computer Science, Southampton University, Jan. 2002 [rfc3261] J. Rosenberg et al., “SIP: Session Initiation Protocol”, IETF RFC 3261 [rfc3263] J. Rosenberg and H. Schulzrinne, “Session Initiation Protocol (SIP): Locating SIP Servers”, IETF RFC 3263 [rfc3435] F. Andreasen and B. Foster, “Media Gateway Control Protocol (MGCP) Version 1.0”, IETF RFC 3435 [rfc3250] H. Schulzrinne et al., “RTP: A Transport Protocol for Real-Time Applications”, IETF RFC 3550

VoIP (cont’d.) [internetnews-brandx] “Court Backs Cable in Brand X Case”, Internetnews.com, June 27, 2005 [ip-centrex] [e-model] “The E-model, a computational model for use in transmission planning”, ITU-T Recommendation G.107, May 2000 [cole01] R. Cole and J. Rosenbluth, “Voice over IP Performance Monitoring”, ACM SIGCOMM CCR, Volume 31, Issue 2, Apr. 2001 [rosenbluth01] J. Rosenbluth, “A framework for Setting Packet Loss objectives for VoIP”, ITU-T Study Group 12 Delayed contribution, Oct. 2001 [nist05] D. Kuhn et al., “Security Considerations for Voice Over IP Systems”, NIST Special Publication , Jan. 2005 [boutremans02] C. Boutremans et al., “Impact of Link Failures on VoIP Performance”, Proc. ACM NOSSDAV, 2002

VoIP (cont’d.) [cisco-saa] “Service Assurance Agent (SAA)”, [cisco-mpls-cos] “MPLS Class of Service”, [calea] “Communications Assistance for Law Enforcement Act”,
