1 Experimental Study of Internet Stability and Wide-Area Backbone Failure. Craig Labovitz, Abha Ahuja, Merit Network, Inc., 1998. Presented by Changchun Zou.


2 Outline
• Introduction
• Experiment methodology
• Analysis of inter-domain path stability
• Analysis of intra-domain network stability
• Frequency property analysis
• Conclusion

3 Introduction
Earlier studies (the two papers from the last presentation) revealed that 99% of routing instability consisted of pathological updates that did not reflect actual network topology or policy changes.
• Causes: hardware and software bugs.
• The situation has improved considerably over the last several years.
This paper studies "legitimate" faults that reflect actual link or network failures.

4 Experiment Methodology
Inter-domain BGP data collection (01/98~11/98)
RouteView probe: participates in remote BGP peering sessions.
Collected 9 GB of complete routing tables from 3 major ISPs in the US.
About 55,000 route entries.

5 Intra-domain routing data collection (11/97~11/98)
Case study: a medium-size regional network, the MichNet backbone. It contains 33 backbone routers with several hundred customer routers.
Data from:
• Centralized network management station (CNMS) log data.
    • Pings every router interface every 10 minutes.
    • Used to study the frequency and duration of failures.
• Network Operations Center (NOC) log data.
    • CNMS alerts lasting more than several minutes.
    • Prolonged degradation of QoS to customer sites.
    • Used to study network failure categories.
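How periodic ping results translate into failure frequency and duration can be sketched as below. This is only an illustration under stated assumptions: the function name `outages_from_pings` and the list-of-booleans input are hypothetical, not the CNMS log format, and any duration derived this way is only accurate to the 10-minute polling interval.

```python
def outages_from_pings(samples, interval_minutes=10):
    """Turn periodic ping results into a list of outages.

    samples: time-ordered list of booleans, True if the interface answered
    the ping in that polling slot (hypothetical input format).
    Returns (start_index, estimated_duration_minutes) for each run of misses.
    """
    outages = []
    run_start = None
    for i, answered in enumerate(samples):
        if not answered and run_start is None:
            run_start = i  # first missed ping opens an outage
        elif answered and run_start is not None:
            outages.append((run_start, (i - run_start) * interval_minutes))
            run_start = None
    if run_start is not None:
        # Outage still open at the end of the log
        outages.append((run_start, (len(samples) - run_start) * interval_minutes))
    return outages


if __name__ == "__main__":
    print(outages_from_pings([True, False, False, True, False]))
```

An outage shorter than the polling interval may be missed entirely, which is one reason the deck pairs the CNMS data with NOC logs.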

6 Data Preprocessing
Purpose: filter out pathological routing and policy changes.
• Limit the dataset to prefixes present in the routing table for more than 60% (170 days) of the nine-month study.
    • Filters out the 20% of routes that are short-lived.
    • Provides a more conservative estimate of failure.
• Apply a 15-minute filter window to BGP routes, counting multiple failures within a window as a single failure.
    • Filters out high-frequency pathological BGP updates.
    • ISPs reported that 15 minutes is the time routing takes to converge.
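The two preprocessing filters can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names, the dictionary of per-prefix day counts, and the use of timestamps in seconds are all assumptions, and the sketch assumes a window opens at the first failure and a later failure outside that window opens a new one.

```python
def stable_prefixes(days_present, min_days=170):
    """Keep only prefixes seen in the routing table for more than min_days.

    days_present: hypothetical mapping of prefix -> number of days present.
    """
    return {prefix for prefix, days in days_present.items() if days > min_days}


def collapse_failures(timestamps, window_seconds=15 * 60):
    """Count all failures inside one 15-minute window as a single failure.

    timestamps: failure times in seconds for one route (assumed format).
    Returns the timestamps that open each window.
    """
    collapsed = []
    window_start = None
    for ts in sorted(timestamps):
        if window_start is None or ts - window_start >= window_seconds:
            collapsed.append(ts)
            window_start = ts
    return collapsed


if __name__ == "__main__":
    print(stable_prefixes({"10.1.0.0/16": 250, "10.9.9.0/24": 12}))
    print(collapse_failures([0, 60, 600, 900, 1000, 4000]))
```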

7 Analysis of Inter-domain Path Stability
BGP routing table event classes:
Route Failure: loss of a previously available routing table path to a given network or to a less specific prefix destination.
Question: why "less specific prefix"? Routers aggregate multiple more-specific prefixes into a single supernet advertisement (for example, several /24 prefixes into one /16), so a destination can remain reachable through the covering prefix.
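The "less specific prefix" rule amounts to a containment check, which can be sketched with Python's standard `ipaddress` module. The function name `covered_by_table` and the example addresses are made up for illustration; they are not taken from the study's data.

```python
import ipaddress


def covered_by_table(prefix, routing_table):
    """Return True if the prefix itself, or a less specific prefix that
    covers it, is present in the routing table (a list of CIDR strings)."""
    target = ipaddress.ip_network(prefix)
    for entry in routing_table:
        candidate = ipaddress.ip_network(entry)
        # subnet_of() is also True when the two networks are identical
        if candidate.version == target.version and target.subnet_of(candidate):
            return True
    return False


if __name__ == "__main__":
    # The /24 is withdrawn, but the aggregate /16 still covers it.
    print(covered_by_table("10.1.2.0/24", ["10.1.0.0/16"]))
    # No covering prefix: this would count as a route failure.
    print(covered_by_table("10.2.0.0/24", ["10.1.0.0/16"]))
```

Under this definition, withdrawing a /24 does not count as a failure while its aggregate is still announced.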

8 Route Repair: a previously failed route to a network prefix is announced as reachable.
Route Fail-over: a route is implicitly withdrawn and replaced by an alternative route with a different next-hop or ASpath to a prefix destination.
Policy Fluctuation: a route is implicitly withdrawn and replaced by an alternative route with different attributes (MED, etc.) but the same next-hop and ASpath.
Pathological Routing: repeated withdrawals or duplicate announcements of the exact same route.
The last two event classes have been studied before; here we study the first three in the BGP experiments.
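The five event classes can be read as a decision procedure over the previous and current state of a route. The sketch below is a simplified illustration: the `Route` type, its fields, and the `classify` function are hypothetical, `None` stands for "no route in the table", MED stands in for all other attributes, and the less-specific-prefix rule is ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    next_hop: str
    as_path: Tuple[int, ...]
    med: int = 0  # stands in for "other attributes" in this sketch


def classify(previous: Optional[Route], current: Optional[Route]) -> str:
    """Classify one routing table change for a single prefix."""
    if previous is None and current is None:
        return "pathological routing"  # repeated withdrawal
    if current is None:
        return "route failure"
    if previous is None:
        return "route repair"
    if previous == current:
        return "pathological routing"  # duplicate announcement
    if (previous.next_hop, previous.as_path) != (current.next_hop, current.as_path):
        return "route fail-over"
    return "policy fluctuation"  # same next-hop and ASpath, other attributes differ


if __name__ == "__main__":
    old = Route("192.0.2.1", (64500, 64501))
    print(classify(old, None))
    print(classify(old, Route("192.0.2.9", (64500, 64502))))
    print(classify(old, Route("192.0.2.1", (64500, 64501), med=50)))
```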

9 Inter-domain Route Availability
Route availability: a path to a network prefix, or to a less specific prefix, is present in the provider’s routing table.
Figure 4: Cumulative distribution of route availability for the 3 ISPs.

10 Observations from route availability data
• Fewer than 25%~35% of routes had availability higher than 99.99%.
• 10% of routes exhibited under 95% availability.
• The Internet is far less robust than telephony (the Public Switched Telephone Network (PSTN) averaged an availability rate better than %).
• The step in the ISP1 curve represents the major Internet failure of 11/98, which caused a loss of connectivity lasting several hours.
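To make the availability percentages concrete, they can be converted into expected downtime per year. This is plain arithmetic rather than anything taken from the paper; the function name and the 365-day year are assumptions.

```python
def downtime_hours_per_year(availability_percent, hours_per_year=365 * 24):
    """Hours per year a route is unreachable at the given availability."""
    return (1 - availability_percent / 100.0) * hours_per_year


if __name__ == "__main__":
    for pct in (99.99, 99.0, 95.0):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% available: about {hours:.2f} hours down per year")
```

At 99.99% availability a route is down for a little under an hour per year, while at 95% it is down for roughly 18 days.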

11 Route Failure and Fail-over
Failure: loss of a previously available routing table path to a prefix or to a less specific prefix destination.
Fail-over: a change in the ASpath or next-hop reachability of a route.
Fig. 5: Cumulative distribution of mean-time to failure and mean-time to fail-over for routes from the 3 ISPs.

12 Observations from route failure and fail-over
• The majority of routes (>50%) exhibit a mean-time to failure of 15 days.
• 75% of routes have failed at least once within 30 days.
• The majority of routes fail over within 2 days.
• Only 5%~20% of routes do not fail over within 5 days.
• This is a slightly higher incidence of failure than in earlier measurements (2/3 of routes persisted for days or weeks).

13 Route Repair Time & Failure Duration
Route Repair: a previously failed route is announced as reachable.
MTTF: mean-time to failure.
MTTR: mean-time to repair.
Fig. 6: Cumulative distribution of MTTR and failure duration for routes from the 3 ISPs.
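MTTF and MTTR can be computed from a route's sequence of up and down transitions, as in the sketch below. The event format and the function name `mttf_mttr` are assumptions for illustration, and the paper's exact bookkeeping (for example, how the filter window interacts with repairs) may differ.

```python
def mttf_mttr(events):
    """Compute (MTTF, MTTR) for one route.

    events: time-ordered list of (timestamp, state) with state "up" or "down"
    (hypothetical format). MTTF averages the up periods that end in a failure;
    MTTR averages the down periods that end in a repair. A value is None when
    no such period was observed.
    """
    up_spans, down_spans = [], []
    for (t0, s0), (t1, s1) in zip(events, events[1:]):
        if s0 == "up" and s1 == "down":
            up_spans.append(t1 - t0)
        elif s0 == "down" and s1 == "up":
            down_spans.append(t1 - t0)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(up_spans), mean(down_spans)


if __name__ == "__main__":
    log = [(0, "up"), (100, "down"), (110, "up"), (310, "down"), (340, "up")]
    print(mttf_mttr(log))
```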

14 Observations on MTTR and failure duration
• 40% of failures are repaired within 10 minutes.
• The majority (70%) of failures are resolved within half an hour.
• MTTR has a heavy-tailed distribution: failures not repaired within half an hour are serious outages that require considerable effort to resolve.
• Only 25%~35% of outages are repaired within 1 hour.
• Indication: a small number of routes failed many times and stayed down for more than one hour.
• This agrees with the previous paper's finding that a small fraction of routes is responsible for the majority of network instability.

15 Analysis of Intra-domain Network Stability
Backbone router: connects to other backbone routers via multiple physical paths. Well equipped and maintained.
Customer router: connects to the regional backbone via a single physical connection. Less well maintained.

16 Observations on MTTR and failure duration
• The majority of interfaces exhibit an MTTF of 40 days (while the majority of inter-domain MTTFs fall within 30 days).
• The step discontinuities arise because a single router has many interfaces.
• 80% of all failures are resolved within 2 hours.
• The heavy-tailed distribution of MTTR indicates that outages longer than 2 hours are long-term and require considerable effort to resolve.

17 Intra-domain Network Failure
Table 1: Category and number of recorded Internet outages in MichNet (11/97~11/98). The data is taken from the MichNet NOC logs.
• Most outages were not related to the IP backbone infrastructure.
• The majority of outages came from customer sites rather than backbone nodes.

18 Availability of each backbone router
Table 2: Availability of router interfaces during the one-year MichNet study (11/97~11/98).

19 Observations on backbone availability
• Data is taken from the CNMS monitor logs.
• Overall uptime is 99.0% for the year.
• Failure logs reveal a number of persistent circuit or hardware faults that occurred repeatedly.
• According to operations staff (the NOC log data has no duration statistics):
    • Most backbone outages tend to be on the order of several minutes.
    • Customer outages persist longer, on the order of several hours.
    • Power outages and hardware failures tend to be resolved within 4 hours.
    • Routing problems are resolved within 2 hours.

20 Frequency Property Analysis
Frequency analysis of BGP and OSPF update messages.
Fig. 8: BGP updates measured at the Mae-East exchange point (08/96~09/96); OSPF updates in MichNet using hourly aggregates (10/98~11/98).

21 Observations on update frequency
• BGP shows significant frequencies at 7 days and 24 hours.
    • Little instability on weekends.
    • The Internet is fairly stable in the early morning compared with North American business hours.
• The absence of any intra-domain frequency pattern indicates that much of the BGP instability stems from Internet congestion.
    • BGP is built on TCP, and TCP has a congestion window, so Update or KeepAlive messages can time out.
    • Congestion inside an AS causes IBGP messages to be lost, and the resulting instability spreads.
    • Some new routers provide a mechanism that gives BGP traffic higher priority, so KeepAlive messages persist under congestion.
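The idea behind finding the 24-hour and 7-day cycles can be sketched with a discrete Fourier transform over hourly update counts. This is a naive pure-Python DFT on synthetic data, chosen to keep the example self-contained; it is not the spectral method used in the paper, and the function names and the shape of the synthetic series are made up.

```python
import cmath
import math


def dominant_periods(hourly_counts, top=2):
    """Return the `top` periods, in hours, with the largest DFT magnitude."""
    n = len(hourly_counts)
    mean = sum(hourly_counts) / n
    centered = [x - mean for x in hourly_counts]  # remove the constant term
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        magnitudes.append((abs(coeff), n / k))  # period of bin k is n / k hours
    magnitudes.sort(reverse=True)
    return [period for _, period in magnitudes[:top]]


def synthetic_series(hours=4 * 7 * 24):
    """Made-up update counts with a daily and a weekly cycle."""
    return [100
            + 30 * math.sin(2 * math.pi * t / 24)
            + 20 * math.sin(2 * math.pi * t / 168)
            for t in range(hours)]


if __name__ == "__main__":
    print(dominant_periods(synthetic_series()))
```

Applied to a series with no periodic structure, such as the OSPF updates are reported to be, no single bin would stand out in this way.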

22 Conclusion
• The Internet exhibits significantly less availability and reliability than the telephony network.
• Major Internet backbone paths exhibit a mean-time to failure of 25 days or less and a mean-time to repair of 20 minutes or less. Internet backbones are rerouted (due to either failure or policy changes) on average once every 3 days or less.
• The 24-hour and 7-day cycles in BGP traffic, and the absence of such cycles in OSPF, suggest that BGP instability stems from congestion collapse.
• A small number of Internet ASes contribute to a large number of long-term outages and to backbone unavailability.

23 Conclusion (contd.)
• For robustness, commercial and critical sites use multi-homing and a ubiquitous network medium.
• Further research is needed to confirm whether Internet failures stem from congestion collapse.
• Research on Internet routing behavior will greatly help the future rational growth of the Internet.