LIFEGUARD: Practical Repair of Persistent Route Failures Ethan Katz-Bassett (USC), Colin Scott (UW/UCB), David Choffnes, Italo Cunha (UW), Valas Valancius,

Slides:



Advertisements
Similar presentations
Introduction to IP Routing Geoff Huston. Routing How do packets get from A to B in the Internet? A B Internet.
Advertisements

Internet Availability Nick Feamster Georgia Tech.
1 Isolating Wide-Area Network Faults with Baywatch Colin Scott With Professor Ethan Katz-Bassett, Dave Choffnes, Italo Cunha, Arvind Krishnamurthy, and.
CS 4700 / CS 5700 Network Fundamentals Lecture 10: Inter Domain Routing (It’s all about the Money) Revised 2/4/2014.
CSE390 – Advanced Computer Networks Lecture 6-7: Inter Domain Routing (It’s all about the Money) Based on slides from D. Choffnes Northeastern U. Revised.
Network Layer: Internet-Wide Routing & BGP Dina Katabi & Sam Madden.
Fundamentals of Computer Networks ECE 478/578 Lecture #18: Policy-Based Routing Instructor: Loukas Lazos Dept of Electrical and Computer Engineering University.
Consensus Routing: The Internet as a Distributed System John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, and Thomas Anderson Presented.
Courtesy: Nick McKeown, Stanford
1 Interdomain Routing Protocols. 2 Autonomous Systems An autonomous system (AS) is a region of the Internet that is administered by a single entity and.
© 2007 Cisco Systems, Inc. All rights reserved.ICND2 v1.0—3-1 Medium-Sized Routed Network Construction Reviewing Routing Operations.
Transient BGP Loops Do they matter, and what can be done about them? Nate Kushman MIT/Akamai Srikanth Kandula, Dina Katabi and John Wroclawski.
1 Measurement of Highly Active Prefixes in BGP Ricardo V. Oliveira, Rafit Izhak-Ratzin, Beichuan Zhang, Lixia Zhang GLOBECOM’05.
Interdomain Routing Security COS 461: Computer Networks Michael Schapira.
Mini Introduction to BGP Michalis Faloutsos. What Is BGP?  Border Gateway Protocol BGP-4  The de-facto interdomain routing protocol  BGP enables policy.
CCNA 2 v3.1 Module 6.
Interdomain Routing Establish routes between autonomous systems (ASes). Currently done with the Border Gateway Protocol (BGP). AT&T Qwest Comcast Verizon.
Inherently Safe Backup Routing with BGP Lixin Gao (U. Mass Amherst) Timothy Griffin (AT&T Research) Jennifer Rexford (AT&T Research)
1 Autonomous Systems An autonomous system is a region of the Internet that is administered by a single entity. Examples of autonomous regions are: UVA’s.
Hot Potatoes Heat Up BGP Routing Jennifer Rexford AT&T Labs—Research Joint work with Renata Teixeira, Aman Shaikh, and.
Interdomain Routing and the Border Gateway Protocol (BGP) Reading: Section COS 461: Computer Networks Spring 2011 Mike Freedman
ROUTING PROTOCOLS Rizwan Rehman. Static routing  each router manually configured with a list of destinations and the next hop to reach those destinations.
Fundamentals of Networking Discovery 2, Chapter 6 Routing.
Jennifer Rexford Fall 2010 (TTh 1:30-2:50 in COS 302) COS 561: Advanced Computer Networks Stub.
Computer Networks Layering and Routing Dina Katabi
1 Studying Black Holes on the Internet with Hubble Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas.
Try Before you Buy: SDN Emulation with (Real) Interdomain Routing 1 Brandon Schlinker ⋆, Kyriakos Zarifis*, Italo Cunha ♮, Nick Feamster †, Ethan Katz-Bassett*,
9/15/2015CS622 - MIRO Presentation1 Wen Xu and Jennifer Rexford Department of Computer Science Princeton University Chuck Short CS622 Dr. C. Edward Chow.
© Janice Regan, CMPT 128, CMPT 371 Data Communications and Networking BGP, Flooding, Multicast routing.
CS 3700 Networks and Distributed Systems Inter Domain Routing (It’s all about the Money) Revised 8/20/15.
CCNA 1 Module 10 Routing Fundamentals and Subnets.
TDTS21: Advanced Networking Lecture 7: Internet topology Based on slides from P. Gill and D. Choffnes Revised 2015 by N. Carlsson.
Vytautas Valancius, Nick Feamster, Akihiro Nakao, and Jennifer Rexford.
A Snapshot on MPLS Reliability Features Ping Pan March, 2002.
T. S. Eugene Ngeugeneng at cs.rice.edu Rice University1 COMP/ELEC 429/556 Introduction to Computer Networks Inter-domain routing Some slides used with.
By, Matt Guidry Yashas Shankar.  Analyze BGP beacons which are announced and withdrawn, usually within two hour intervals.  The withdraws have an effect.
CCNA 2 Week 6 Routing Protocols. Copyright © 2005 University of Bolton Topics Static Routing Dynamic Routing Routing Protocols Overview.
6: Routing Working at a Small to Medium Business.
Eliminating Packet Loss Caused by BGP Convergence Nate Kushman Srikanth Kandula, Dina Katabi, and Bruce Maggs.
CS 4396 Computer Networks Lab BGP. Inter-AS routing in the Internet: (BGP)
1 Version 3.1 Module 6 Routed & Routing Protocols.
CSE534- Fundamentals of Computer Networking Lecture 12-13: Internet Connectivity + IXPs (The Underbelly of the Internet) Based on slides by D. Choffnes.
© 2002, Cisco Systems, Inc. All rights reserved..
© 2005 Cisco Systems, Inc. All rights reserved. BGP v3.2—6-1 Scaling Service Provider Networks Scaling IGP and BGP in Service Provider Networks.
Securing BGP Bruce Maggs. BGP Primer AT&T /8 Sprint /16 CMU /16 bmm.pc.cs.cmu.edu Autonomous System Number Prefix.
© 2005 Cisco Systems, Inc. All rights reserved. BGP v3.2—3-1 Route Selection Using Policy Controls Using Multihomed BGP Networks.
A Snapshot on MPLS Reliability Features Ping Pan March, 2002.
Routing Protocols COSC 541 Data Commun. System & Networks Yue Dou.
Border Gateway Protocol. Intra-AS v.s. Inter-AS Intra-AS Inter-AS.
1 Internet Routing 11/11/2009. Admin. r Assignment 3 2.
Reverse Traceroute Ethan Katz-Bassett, Harsha V. Madhyastha, Vijay K. Adhikari, Colin Scott, Justine Sherry, Peter van Wesep, Arvind Krishnamurthy, Thomas.
CS 3700 Networks and Distributed Systems
CS 4700 / CS 5700 Network Fundamentals
Autonomous Systems An autonomous system is a region of the Internet that is administered by a single entity. Examples of autonomous regions are: UVA’s.
CS 3700 Networks and Distributed Systems
CS 4700 / CS 5700 Network Fundamentals
Chapter 2: Static Routing
The Need for End-to-End Evaluation of Cloud Availability
A stability-oriented approach to improving BGP convergence
Department of Computer and IT Engineering University of Kurdistan
COS 561: Advanced Computer Networks
CS 4700 / CS 5700 Network Fundamentals
Dynamic Routing and OSPF
COS 561: Advanced Computer Networks
COS 561: Advanced Computer Networks
COS 561: Advanced Computer Networks
BGP Interactions Jennifer Rexford
COS 461: Computer Networks
Transient BGP Loops Do they matter, and what can be done about them?
BGP Instability Jennifer Rexford
Presentation transcript:

LIFEGUARD: Practical Repair of Persistent Route Failures Ethan Katz-Bassett (USC), Colin Scott (UW/UCB), David Choffnes, Italo Cunha (UW), Valas Valancius, Nick Feamster (GT), Harsha Madhyastha (UCR), Tom Anderson, Arvind Krishnamurthy (UW) This work is generously funded in part by Google, Cisco and the NSF.

How common are these outages? 86% of outages are less than 5 minutes Long outages account for 90% of the downtime Portion of outages Portion of total downtime  Monitor network outages from Amazon’s EC2  2 million outages in two months LIFEGUARD: Automatic Diagnosis and Repair4

5 L IFE G UARD : Practical Repair of Persistent Route Failures Reasons for Long-Lasting Outages  Long-term outages are:  Caused by routers advertising paths that do not work  E.g., corrupted memory on line card causes black hole  E.g., bad cross-layer interactions cause failed MPLS tunnel  Repaired over slow, human timescales  Not well understood  Complicated by lack of visibility into or control over routes in other ISPs

6 6 Establishing Inter-Network Routes  Border Gateway Protocol (BGP)  Internet ’ s inter-network routing protocol  Network chooses path based on its own opaque policy ($$)  Forward your preferred path to neighbors WS ATT  WS Sprint  ATT  WS L3  ATT  WS UW  L3  ATT  WS

7 L IFE G UARD : Practical Repair of Persistent Route Failures Choose a path that avoids the problem. Self-Repair of Forward Paths

What about reverse path failures?  ~90% of paths on the Internet are asymmetric!  *Y. He, M. Faloutsos, S. Krishnamurthy, and B. Huffaker. On routing asymmetry in the Internet. In Autonomic Networks Symposium in Globecom,

9 L IFE G UARD : Practical Repair of Persistent Route Failures Ideal Self-Repair of Reverse Paths

10 L IFE G UARD : Practical Repair of Persistent Route Failures A Mechanism for Failure Avoidance  Forward path: Choose route that avoids ISP or ISP-ISP link  Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X  Want a BGP announcement AVOID(X,P):  Any ISP with a route to P that avoids X uses such a route  Any ISP not using X need only pass on the announcement

11 L IFE G UARD : Practical Repair of Persistent Route Failures AVOID(L3,WS) Ideal Self-Repair of Reverse Paths

BGP Doesn’t Have AVOID!  How can we approximate AVOID?  Hint: how does BGP avoid loops? 12

13 L IFE G UARD : Practical Repair of Persistent Route Failures WS ATT → WS UW → L3 → ATT → WS Sprint → Qwest → WS AISP → Qwest → WS L3 → ATT → WS Qwest → WS Practical Self-Repair of Reverse Paths

14 L IFE G UARD : Practical Repair of Persistent Route Failures WS ATT → WS UW → L3 → ATT → WS Sprint → Qwest → WS AISP → Qwest → WS ? Qwest → WS UW → Sprint → Qwest → WS → L3 → WS Sprint → Qwest → WS → L3 → WS AISP → Qwest → WS → L3 → WS ATT → WS → L3 → WS WS → L3 → WS Qwest → WS → L3 → WS AVOID(L3,WS) Practical Self-Repair of Reverse Paths L3 → ATT → WS BGP loop prevention encourages switch to working path.

That’s outage avoidance  How do we detect outages in the first place?  And how do we know who to AVOID? 15

16 L IFE G UARD : Practical Repair of Persistent Route Failures Locating Internet Failures  How it works today  Customer complains to network operator  Operator sends test traffic to confirm  If confirmed:  Who is causing the problem?  Is it affecting just me?

17 L IFE G UARD : Practical Repair of Persistent Route Failures  Historical atlas enables reasoning about changes  Traceroute yields only path from GMU to target  Reverse traceroute reveals path asymmetry How does L IFE G UARD locate a failure? Before outage: HistoricalCurrent

18 L IFE G UARD : Practical Repair of Persistent Route Failures  Forward path works Problem with ZSTTK? Ping? Fr: VP How does L IFE G UARD locate a failure? Ping! To: VP During outage: HistoricalCurrent

19 L IFE G UARD : Practical Repair of Persistent Route Failures  Forward path works How does L IFE G UARD locate a failure? NTT:Ping? Fr:GMU GMU:Ping! Fr:NTT During outage: HistoricalCurrent

20 L IFE G UARD : Practical Repair of Persistent Route Failures  Forward path works  Rostelcom is not forwarding traffic towards GMU Rostele: Ping? Fr:GMU How does L IFE G UARD locate a failure? During outage: HistoricalCurrent

21 L IFE G UARD : Practical Repair of Persistent Route Failures How L IFE G UARD Locates Failures  L IFE G UARD : 1. Maintains background historical atlas 2. Isolates direction of failure 3. Tests historical paths in failing direction to prune candidate failure locations  Once failure located, use BGP loop prevention to AVOID the problem