Scalability & Stability of the Internet Infrastructure Farnam Jahanian Department of EECS University of Michigan.


Context Network Infrastructure Network Attacks S/H Failures Operational Faults Windmill Probes Netflow Statistics Protocol Scrubbers Event Aggregation Data Mining Replication schemes Active Response Capabilities Analysis Engines Routers Name Servers Critical Services Anomalous Network Events Coarse and Fine Grained Measurement Tools Countermeasures LIGHTHOUSE: Survivable Network Infrastructure Joint projects between U. Michigan & Merit Network

Motivation Increasing reliance of financial and national utility infrastructures on interconnected IP-based networks Explosive growth in both size and topological complexity of the underlying communication infrastructure Reliance on off-the-shelf infrastructure & shrink-wrapped code Network infrastructure is vulnerable: –inherent instability and transient oscillations –delayed convergence and long failover –coordinated denial of service attacks on network resources –hardware and software failures –operational faults and misconfigurations

Joint effort between University of Michigan and Merit Network Study Scalability and Stability of the Internet Infrastructure

Overview Measurement and Probe Software Data Dissemination Service Visualization and Data Mining Tools Tools for measurement & analysis of network perf. and stability: Current Internet Architecture: Decentralization of the Backbone Regional networks, national ISPs Trend toward private peering Increasing complexity and heterogeneity

Imminent Collapse of the Internet Collapse of the Internet Now ?

Internet Growth Explosive growth in both size and topological complexity Internet end-system growth Traffic volume & characteristics Infrastructure topological evolution

Internet End-System Growth

Growth of the Web Dramatic increase in the number of users, number of web sites, amount of content available, Web traffic. Source: IDC, 1998.

Exponential Traffic Growth Traffic Breakdown Source: Merit Network, Inc

Infrastructure Topological Evolution Between : Decentralization: from a single backbone network to a conglomeration of 100s of backbones and 1000s of ISPs. Loss of hierarchy and abstraction: from a strict hierarchical network to an increasingly full-mesh interconnection. Significant bandwidth increase: from a single T3 (45 Mbps) circuit and T1 (1.5 Mbps) links to multiple OC48 (2.4 Gbps) circuits and OC12 (622 Mbps) lines between nodes.

Internet Evolution: NSFNet NSFNet Backbone Regional Campus Hello/EGP Hierarchical network with a single central backbone

Internet Evolution: Today AS1 AS2 AS3 AS4 C4 C2 C3 C1 Full-mesh interconnection of ISP backbones and customers

Impact of Instability & Failures –Increased end-to-end Loss/Latency –Increased delay in convergence & network reachability –Backbone infrastructure CPU/Memory requirements –Backbone “route flap storms” –Network management complexity

Background: Internet Architecture BGP

Background: Internet Routing Two major categories –Inter-domain (BGP between autonomous systems) –Intra-domain (OSPF, ISIS, IGRP inside an AS) BGP –Incremental: announcements and withdraws –Updates include policy (e.g. MED, ASPath) –Maintain multiple possible routes

Background: BGP Routing Protocol BGP is an incremental protocol that sends update information only upon changes in network topology or routing policy. Two forms of messages: –announcements: new network accessible, or prefer another route to network destination –withdrawals: destination network is no longer accessible Routing policies vs. shortest number of hops
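A minimal sketch of the incremental model on this slide, assuming a per-prefix table keyed by peer. The names `apply_update` and `shortest_path` are illustrative and not from the talk. Picking the shortest AS path is a simplification: as the slide notes, real BGP decisions apply routing policy rather than hop count alone.

```python
def apply_update(rib, peer, kind, prefix, as_path=None):
    """Apply one incremental update from `peer` to the routing table `rib`.

    rib maps prefix -> {peer: as_path}; kind is "announce" or "withdraw".
    """
    routes = rib.setdefault(prefix, {})
    if kind == "announce":
        # A new announcement implicitly replaces this peer's earlier route.
        routes[peer] = tuple(as_path)
    elif kind == "withdraw":
        # Withdrawing an unknown route is a no-op here.
        routes.pop(peer, None)
    else:
        raise ValueError(f"unknown update kind: {kind}")
    if not routes:
        # No peer offers this prefix any more: it is unreachable.
        del rib[prefix]


def shortest_path(rib, prefix):
    """Return the shortest known AS path for `prefix`, or None (policy ignored)."""
    routes = rib.get(prefix)
    if not routes:
        return None
    return min(routes.values(), key=len)
```

Because several candidate routes are kept per prefix, a withdrawal from one peer lets the table fall back to another peer's route without any full-table exchange.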

MCI Sprint Border Gateway Protocol (BGP) Inter-domain protocol between Autonomous Systems Routing peers exchange reachability information incrementally BGP uses TCP as the transport protocol between peer routers

Background: Internet Core Networks aggregated into CIDR (Classless Inter-Domain Routing) prefixes Prefix represents a set of destination IP addresses At Internet “core” all routers maintain paths to “default-free” routes Originally 5 major Internet Exchange Points (IXPs) In 1996, approximately 30,000 default-free routes
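To make "a prefix represents a set of destination IP addresses" concrete, here is a small sketch using Python's standard `ipaddress` module. The helper names `covers` and `aggregate` and the example addresses (documentation ranges) are assumptions chosen for illustration.

```python
import ipaddress


def covers(prefix, address):
    """True if the destination address falls inside the CIDR prefix."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(prefix)


def aggregate(prefixes):
    """Merge adjacent or overlapping prefixes into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]
```

For example, `aggregate(["192.0.2.0/25", "192.0.2.128/25"])` collapses two half-size blocks into the single route `192.0.2.0/24`, which is the kind of aggregation that keeps the default-free table small.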

Roadmap Study of stability of routing in the Internet backbone –Transient oscillations, pathological redundant updates –congestion collapse and correlation to network usage –SIGCOMM’97 and INFOCOM’99 Study of route availability and failover rates –long-term availability of Internet backbone routes –Case study of regional provider –FTCS’99 Study of convergence behavior of routing protocols –Injection of route changes into the Internet backbone –Impact of convergence delay on end-to-end path –18-month study & ongoing

Internet Exchange Points Deployed probe machines at five public exchange points Collected all routing updates at IXPs over a four-year period

Internet Routing Instability Results Number of BGP routing updates exchanged per day in the Internet core is orders of magnitude larger than expected. Most routing information is dominated by pathological, or redundant updates, which do not directly reflect changes in routing policy or topology. Instability and redundant updates exhibit a specific periodicity of 30 and 60 seconds. Instability and redundant updates show a surprising correlation to network usage and exhibit corresponding daily and weekly cyclic trends.

Instability Results (Continued) Instability is not dominated by a small set of autonomous systems or routes. Instability is not disproportionately dominated by prefixes of specific lengths, i.e. independent of aggregation. Discounting policy fluctuation and pathological behavior, there remains a significant level of Internet forwarding instability. Details: SIGCOMM’97 & INFOCOM’99

Taxonomy WADiff (forwarding change) AADiff (policy or forwarding change) AADup (pathology) WWDup (pathology) WADup (failure)
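The taxonomy labels can be read as a classification of consecutive updates for one (peer, prefix) pair. The sketch below is one interpretation, following how the labels are commonly defined in the cited SIGCOMM’97 work: the first letters give the previous and current message type (W = withdrawal, A = announcement), and Dup/Diff says whether the announced attributes repeat the last announced ones. The function name `classify_updates` and the tuple encoding are assumptions.

```python
def classify_updates(updates):
    """Label each update in a per-(peer, prefix) stream.

    updates: iterable of (kind, attrs), kind is "A" or "W"; attrs is any
    comparable value (e.g. an AS-path tuple) and is ignored for "W".
    Returns a list with one label (or None) per update.
    """
    labels = []
    prev_kind = None
    last_attrs = None  # attributes of the most recent announcement
    for kind, attrs in updates:
        label = None
        if kind == "W":
            if prev_kind == "W":
                label = "WWDup"  # repeated withdrawal of an already withdrawn route
        else:
            if prev_kind == "A":
                # Implicit withdrawal: a new announcement replaces the old one.
                label = "AADup" if attrs == last_attrs else "AADiff"
            elif prev_kind == "W" and last_attrs is not None:
                # Explicit withdrawal followed by a re-announcement.
                label = "WADup" if attrs == last_attrs else "WADiff"
            last_attrs = attrs
        labels.append(label)
        prev_kind = kind
    return labels
```

Updates that start a stream, or a first withdrawal after an announcement, get no label here because they are ordinary state changes rather than one of the five categories.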

Growth in Routing State Linear growth in routing table

Growth in Routing State Linear growth in routing table & autonomous systems Significant instability and pathological behavior

Initial Findings (SIGCOMM’97) Up to 60 million BGP updates/day for only 30,000 default-free routes! –On avg. 2-6 Million withdraws per day (mostly duplicates) –e.g., ISP A had 259 routes but withdrew 2.4 million routes All state changes well distributed across prefix lengths, autonomous systems Unexpected frequency components –30 second inter-arrival time between updates –Daily/weekly components

More Initial Observations Most routing updates pathological (millions!) –Some due to misconfiguration Private networks Host routes Multicast routes –Majority duplicate updates Duplicate withdraws (WWDup > 99.99%) Duplicate announcements (AADup)

BGP Updates

30 Second Frequency Components 1997

Origins of Pathological Updates (INFOCOM99) Majority stem from two router software implementation issues: –stateless BGP withdraws –non-transitive attribute filtering Frequency due to non-jittered router timers –lack of precise specification Other sources of pathologies: –BGP/IBGP misconfiguration –Still others due to DSU/CSU oscillation –And still others due to the distance-vector algorithm
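One reading of the "stateless BGP withdraws" item: a router that keeps no record of what it advertised to each peer sends a withdrawal to every peer, including peers that never heard the route, and those extra messages show up as duplicate withdrawals. The sketch below contrasts that with a stateful router. `withdrawals_to_send` and its arguments are hypothetical names, not vendor code.

```python
def withdrawals_to_send(prefix, peers, advertised, stateful):
    """Return the peers that receive a withdrawal for `prefix`.

    advertised maps peer -> set of prefixes previously announced to that peer.
    A stateless router ignores `advertised` and withdraws toward everyone.
    """
    if stateful:
        return [p for p in peers if prefix in advertised.get(p, set())]
    return list(peers)
```

With three peers of which only one was ever sent the route, the stateless variant emits three withdrawals and the stateful variant emits one.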

After Initial Publication of Results One popular vendor validated our conjectures and released updated software in 1997 –Software rapidly deployed by ISPs –Stateful BGP reduced updates by orders of magnitude –Addition of random intervals to timers diminished frequency components
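A small sketch of why adding random intervals to timers removes the 30-second frequency component: with a fixed interval every firing lands on a multiple of 30 seconds, while a jittered timer drifts off that grid. The function name, the 30-second default, and the 25% jitter fraction are assumptions for illustration, not the vendor's actual values.

```python
import random


def fire_times(start, interval=30.0, count=4, jitter=0.0, rng=None):
    """Return `count` successive firing times of a periodic timer.

    Each period is shortened by a random fraction in [0, jitter) of the
    interval; jitter=0.0 reproduces a strictly periodic timer.
    """
    rng = rng or random.Random()
    t = start
    times = []
    for _ in range(count):
        t += interval * (1.0 - jitter * rng.random())
        times.append(t)
    return times
```

Calling `fire_times(0)` yields exactly 30, 60, 90, 120, whereas `fire_times(0, jitter=0.25)` spreads each period between 22.5 and 30 seconds, so routers that started in step gradually fall out of synchronization.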

BGP Announcements and Withdraws [Figure annotations: NANOG presentation; ISP Geeks Release; Mainline Release]

Frequency Components

BGP Failures -- Congestion Collapse (BGP Frequency)

A Short Story Sigcomm '97 findings were puzzling: Bandwidth utilization correlated with instability Hypothesis: Congestion causes underlying TCP to back off BGP-level timers expire, causing termination

MCI Sprint Border Gateway Protocol (BGP) Interdomain protocol between Autonomous Systems Routing peers exchange reachability information incrementally BGP uses TCP as the transport protocol between peer routers

BGP Congestion Collapse Hypothesis Congestion causes underlying TCP to back off BGP-level timers expire, causing termination Interaction between BGP and TCP leads to router congestion collapse High bandwidth utilization → BGP instability Validated using Windmill tool (SIGCOMM98)

What about Failures? Some state changes due to policy changes & network failures Cannot distinguish between policy, intra-domain and inter-domain failures Methodology: –Measure long-term rate of failure for Internet backbone routes –Case study of regional provider

Internet Infrastructure Failures (FTCS99) Internet significantly less reliable and available than the PSTN telephone network. After a network becomes unreachable, in most cases, it takes longer than 5 minutes before it is reachable again. Even for transient oscillations, convergence of backbone routing states may be on the order of minutes! Route failover (re-routing of traffic to a given network) occurs on average once every three days or more. A small fraction of network paths contribute disproportionately to the number of long-term outages

Definitions Route Failure: Prefix destination unavailable for 30 or more minutes Route Repair: A failed route becomes available Route Failover: A route replaced with one associated with a different path
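The three definitions can be applied mechanically to a timeline of route state changes. The sketch below assumes the timeline lists only changes (no consecutive identical states), uses the slide's 30-minute threshold, and treats a repair as the return of a route whose outage was long enough to count as a failure. `route_events` and the tuple layout are illustrative names, not the study's tooling.

```python
FAILURE_THRESHOLD_S = 30 * 60  # "unavailable for 30 or more minutes"


def route_events(timeline, end_time):
    """Derive failure/repair/failover events for one prefix.

    timeline: time-sorted list of (timestamp_s, path), path is an AS-path
    tuple or None when the prefix is unavailable. end_time closes the
    last interval.
    """
    events = []
    failed = False
    prev_path = None
    for i, (t, path) in enumerate(timeline):
        next_t = timeline[i + 1][0] if i + 1 < len(timeline) else end_time
        if path is None:
            # Only outages lasting at least the threshold count as failures.
            if prev_path is not None and next_t - t >= FAILURE_THRESHOLD_S:
                events.append((t, "failure"))
                failed = True
        elif failed:
            events.append((t, "repair"))
            failed = False
        elif prev_path is not None and path != prev_path:
            events.append((t, "failover"))
        prev_path = path
    return events
```

Short outages below the threshold produce no event in this sketch, which matches the definition but means brief flaps have to be counted separately.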

Organizational Diameter of Internet Number of administrative domains between two networks Rapid regional commercialization of the Internet during 1997

Route Failures: How long before a network is unreachable?

Route Repairs: How long before a network is reachable again?

Failover: How long before traffic is re-routed?

Source of Failures Inside a Regional ISP Michnet Backbone Failures 11/ /98

Conventional Wisdom on Convergence Internet is highly redundant –Just reroute around in a few milliseconds Routing protocol convergence takes only a few ???? “Bad news travels fast” –Fast withdraw propagation valid goal –Announcements slower because bundled BGP has great convergence properties –Path vector solved the convergence and counting to infinity (looping) problems All my customers are multi-homed, triple-homed –Convergence -- what, me worry? Not True!

18-Month Study of Convergence Behavior Instrument the Internet –Inject routes into geographically and topologically diverse provider BGP peering sessions (Japan, Michigan, US Exchange Points, Canada, UK) –Periodically fail and change these routes (i.e. send withdraws or new attributes) –Time events using ICMP ping and NTP synchronized BGP “routeviews” monitoring machines –Wait 18 months… (50,000 routing events)

Passive & Active Measurement Infrastructure

Internet ISP4 Stub AS RouteViews Data Collection Probe ISP5 ISP6 ISP3 Upstream ISP1 Stub AS Fault Injection Server Upstream ISP2 BGP Fault BGP ICMP Echos Passive & Active Measurement Infrastructure

Terminology Tdown: A previously available route is withdrawn. This is a route failure. Tup: A previously unavailable route is announced as available. This is a route repair. Tshort: A route is replaced with another route having a shorter path. This is a route failover. Tlong: A route is replaced by another route with a longer path. This is a route failover.
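The four event types reduce to a comparison of the route before and after the change. A minimal sketch, assuming each route is represented by its AS path as a tuple (or `None` when no route exists); `classify_event` is an illustrative name.

```python
def classify_event(old_path, new_path):
    """Map a route change to Tdown, Tup, Tshort or Tlong.

    Returns None when nothing is available before and after, or when the
    replacement path has the same length as the old one.
    """
    if old_path is None and new_path is None:
        return None
    if new_path is None:
        return "Tdown"  # route failure
    if old_path is None:
        return "Tup"  # route repair
    if len(new_path) < len(old_path):
        return "Tshort"  # failover to a shorter path
    if len(new_path) > len(old_path):
        return "Tlong"  # failover to a longer path
    return None
```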

Avg. number of messages generated by each ISP following a routing update event Tdown and Tlong generated more messages than Tup and Tshort Significant variation among ISPs within each category of message

Withdraw Convergence (Tdown) After a BGP route is withdrawn, barring other failures, how long does it take Internet routing tables to reach steady-state?
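One way to turn this question into a number: take the time of the injected withdrawal and the timestamps of every update for that prefix seen at the monitoring points, and report the time until the last update before the table goes quiet. The function name and the 5-minute quiet period used to decide "steady state" are assumptions, not the study's exact criterion.

```python
def convergence_delay(injected_at, update_times, quiet_period=300.0):
    """Seconds from injection until the last update of the resulting burst.

    The burst ends at the first gap longer than `quiet_period`; later
    updates are treated as unrelated events.
    """
    last = injected_at
    for t in sorted(u for u in update_times if u >= injected_at):
        if t - last > quiet_period:
            break
        last = t
    return last - injected_at
```

For a withdrawal injected at t=0 with updates observed at 5, 40, 95 and 170 seconds, the sketch reports 170 seconds, and an unrelated update half an hour later does not extend it.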

Convergence delay after a Tdown Withdraw Convergence

Different providers exhibit different behavior 70% of withdraws from most ISPs take more than a minute For ISP in Canada, 20% of withdraws took more than three minutes to converge Observed latencies of up to 10 mins for certain events No correlation between convergence latency and geography or topology (except for MichNet)

Failovers and Repairs What are the relative convergence latencies for failovers and repairs? Does bad news (withdraws) travel faster?

Failures, Failovers and Repairs Bad News Does Not Travel Fast!

Failures, Failovers and Repairs Bad news does not travel fast… Repairs (Tup) exhibit similar convergence properties as long → short path failover Failures (Tdown) and short → long failovers also similar –Slower than Tup (e.g. a repair) –60% take longer than two minutes –Failover times degrade the greater the degree of multi-homing!

End2End Connectivity Impact of delayed convergence on E2E connectivity? After a failover, how long before my site is reachable? –Modified ICMP pings sent once a second –Source IP address block of pseudo-AS –100 randomly chosen web sites from cache logs

Impact of Convergence Delay on End-to-End Path Avg. packet loss to 100 web sites (1 min bins in the ten mins preceding and following a routing update)
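The per-minute averaging on this slide can be sketched as follows, assuming each probe is recorded as a send time and whether a reply came back. `loss_by_minute` and the record layout are illustrative; bin 0 is the first minute after the routing update and bin -1 the last minute before it.

```python
from collections import defaultdict


def loss_by_minute(probes, event_time, window_min=10):
    """Fraction of probes lost in each 1-minute bin around a routing event.

    probes: iterable of (send_time_s, replied). Only bins within
    `window_min` minutes before or after `event_time` are reported.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for t, replied in probes:
        minute = int((t - event_time) // 60)
        if -window_min <= minute < window_min:
            sent[minute] += 1
            if not replied:
                lost[minute] += 1
    return {m: lost[m] / sent[m] for m in sorted(sent)}
```

Averaging the resulting dictionaries over all probed sites gives one loss curve per event type.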

What is Happening? Non-deterministic ordering of BGP update messages leads to –Transient oscillations –Each change in FIB adds delay (CPU, BGP bundling timer) –At extreme, convergence triggers BGP dampening

BGP Bad News Given best current routing practices, inter-domain BGP convergence times degrade exponentially with increase in the degree of interconnectivity for a given route … and the degree of inter-connectivity (multi-homing, transit, etc) is increasing

Internet vs. Telephone Network Packet-switched vs. circuit-switched No explicit reservation on the Internet Fault-tolerant switches in telephone networks Significantly shorter development, testing and deployment cycle in the Internet world Reliability vs. time-to-market Relative degree of operational experience Small number of telecommunication companies vs. a conglomeration of thousands of ISPs

The Next Challenge Jeopardizing the Explosive Growth of the Web is AVAILABILITY. Growing reliance on the Internet for commerce, healthcare, education,... Challenges Facing Today’s Internet are Bandwidth and Latency

Context Network Infrastructure Network Attacks S/H Failures Operational Faults Windmill Probes Netflow Statistics Protocol Scrubbers Event Aggregation Data Mining Replication schemes Active Response Capabilities Analysis Engines Routers Name Servers Critical Services Anomalous Network Events Coarse and Fine Grained Measurement Tools Countermeasures LIGHTHOUSE: Survivable Network Infrastructure Sponsors: NSF, DARPA and INTEL

Acknowledgements Michigan Students & Merit Staff: Abha Ahuja, Mukesh Agrawal, Paul Howell, Craig Labovitz, Rob Malan, Matt Smart, David Watson Sponsors: National Science Foundation, DARPA, Intel, IBM, HP