Self-healing in Routing: Failure Analysis, and Improvements Qi Li Tsinghua University Aug. 28, 2008.


1 Self-healing in Routing: Failure Analysis, and Improvements Qi Li Tsinghua University Aug. 28, 2008

2 Aug. 28, 2007, AsiaFI, Student Workshop. Outline: Problem Statement; Analysis of Self-Healing Routing; Existing Improvement Solutions; Our Self-Healing Solution; Conclusion and Future Work

3 Problem Statement: Routing (intra- and inter-domain) is a critical element of the Internet infrastructure. How robust is it against large-scale failures and attacks? Cisco routers caused a major outage in Japan in 2007. An earthquake in Taiwan caused undersea cable damage in 2006. We need to improve routing, but how?

4 Internet Routing: The Internet is not a homogeneous network but a network of autonomous systems (ASes). Each AS is under the control of an ISP. AS sizes vary widely, with a typical heavy-tailed distribution. Inter-AS routing: Border Gateway Protocol (BGP), a path-vector algorithm with serious scalability and recovery issues. Intra-AS routing: several algorithms, which usually work fine (central control, smaller network, …).

5 Measurements – Prefix Growth: Table sizes grow 2x faster than real growth. One (conservative) analysis predicts 2M entries in 10 years.

6 Measurements – BGP Updates

7 Distribution of Updates – Main Observation: Most of the network is very stable. Parts of the network are very unstable. Everybody pays for the instability. The problem is getting worse.

8 Routing Failure Causes: Large-area router/link damage (e.g., earthquake); large-scale failure due to a buggy software update; high-bandwidth cable cuts; router configuration errors; aggregation of large un-owned IP blocks (happens when prefixes are aggregated for efficiency); incorrect policy settings resulting in large-scale delivery failures; network-wide congestion (DoS attack); malicious route advertisements via worms.

9 Outline: Problem Statement; Analysis of Self-Healing Routing; Existing Improvement Solutions; Our Self-Healing Solution; Conclusion and Future Work

10 Existing Routing Protocols: The normal process of IP-based self-healing routing has three steps: failure detection, failure notification, and forwarding path re-computation. Convergence in existing routing protocols: RIP takes hundreds of seconds and suffers from count-to-infinity; OSPF takes tens of seconds; BGP takes several minutes or longer and may fail to converge because of policy conflicts.
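
The count-to-infinity behavior mentioned for RIP can be illustrated with a small simulation. This is a minimal sketch and not part of the original slides: it assumes a hypothetical three-router line topology A - B - C with no split horizon, where the B-C link fails and A and B keep exchanging distance vectors for destination C. The function name and the starting metric are illustrative assumptions; the only protocol fact used is that RIP treats a metric of 16 as unreachable.

```python
INFINITY = 16  # RIP treats a metric of 16 as "unreachable"


def count_to_infinity(initial_a=2):
    """Simulate A and B exchanging metrics for destination C after the
    B-C link fails (hypothetical topology A - B - C, no split horizon).

    Returns the number of exchange rounds until both routers consider C
    unreachable, plus the (A, B) metric history.
    """
    a_metric = initial_a   # A reaches C through B
    b_metric = INFINITY    # B has just lost its direct link to C
    rounds = 0
    history = [(a_metric, b_metric)]
    while a_metric < INFINITY or b_metric < INFINITY:
        # B hears A's stale advertisement and believes A still has a path.
        b_metric = min(INFINITY, a_metric + 1)
        # A's only next hop toward C is B, so it adopts B's new metric.
        a_metric = min(INFINITY, b_metric + 1)
        rounds += 1
        history.append((a_metric, b_metric))
    return rounds, history


if __name__ == "__main__":
    rounds, history = count_to_infinity()
    print("rounds until both metrics reach infinity:", rounds)
    print("metric history (A, B):", history)
```

Because each round is paced by periodic updates (30 seconds by default in RIP), several such rounds are consistent with the convergence time of hundreds of seconds quoted on the slide.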

11 The State Transition under Failure: A simple state transition model for analyzing routing convergence.

12 The Problems of Transient Failures: Routing blackhole: traffic is silently dropped without informing the source that the data did not reach its intended recipient. Routing loop: the path to a particular destination forms a loop.
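
The two transient problems can be told apart by walking the per-router next hops toward a destination. The sketch below is illustrative only and is not from the slides: the `next_hop` dictionary is a hypothetical snapshot of each router's forwarding entry for a single destination, and `trace_path` is an assumed helper name.

```python
def trace_path(next_hop, src, dst):
    """Follow next hops from src toward dst.

    next_hop maps router -> next-hop router for dst; a missing entry or
    None means the router has no route and drops the packet silently.
    Returns (outcome, routers visited), where outcome is "delivered",
    "blackhole", or "loop".
    """
    visited = [src]
    node = src
    while node != dst:
        nxt = next_hop.get(node)
        if nxt is None:
            # No route: the packet is dropped without notifying the source.
            return "blackhole", visited
        if nxt in visited:
            # The packet returns to a router it already traversed.
            return "loop", visited + [nxt]
        visited.append(nxt)
        node = nxt
    return "delivered", visited


if __name__ == "__main__":
    print(trace_path({"A": "B", "B": "C"}, "A", "C"))
    print(trace_path({"A": "B"}, "A", "C"))
    print(trace_path({"A": "B", "B": "A"}, "A", "C"))
```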

13 Outline: Problem Statement; Analysis of Self-Healing Routing; Existing Improvement Solutions; Our Self-Healing Solution; Conclusion and Future Work

14 Traditional Fast Reroute Solutions: The major improvement in intra-domain routing is fast reroute (FRR). SONET rings significantly reduce recovery time, but they are expensive. FRR with MPLS-TE is hard to deploy because it introduces much complexity into the core network. IP-FRR, developed by the IETF, still has shortcomings; e.g., a loop-free alternate (LFA) requires a neighbor whose shortest path does not contain the failed node. Layer 3 tunnels provide pre-computed path protection but may not eliminate the routing loops introduced by tunneling.
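
The LFA shortcoming can be made concrete with the basic loop-free condition from the IETF IP-FRR work: a neighbor N of source S is a loop-free alternate for destination D if dist(N, D) < dist(N, S) + dist(S, D). The following is a minimal sketch under stated assumptions, not the authors' code: the graph representation (a dictionary of symmetric link costs), the function names, and both example topologies are hypothetical. It checks only the basic link-protecting condition, not the stricter node-protecting one.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra over graph = {node: {neighbor: cost}}; returns {node: dist}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def loop_free_alternates(graph, s, primary, dst):
    """Neighbors N of s, other than the primary next hop, that satisfy
    dist(N, dst) < dist(N, s) + dist(s, dst)."""
    inf = float("inf")
    d_s = shortest_distances(graph, s)
    alternates = []
    for n in graph[s]:
        if n == primary:
            continue
        d_n = shortest_distances(graph, n)
        if d_n.get(dst, inf) < d_n.get(s, inf) + d_s.get(dst, inf):
            alternates.append(n)
    return alternates


# Hypothetical square topology where N has its own short path to D.
GOOD = {
    "S": {"E": 1, "N": 1},
    "E": {"S": 1, "D": 1},
    "N": {"S": 1, "D": 1},
    "D": {"E": 1, "N": 1},
}

# Same shape, but the N-D link is so costly that N's shortest path to D
# goes back through S, so N is not loop-free and S has no LFA.
BAD = {
    "S": {"E": 1, "N": 1},
    "E": {"S": 1, "D": 1},
    "N": {"S": 1, "D": 5},
    "D": {"E": 1, "N": 5},
}

if __name__ == "__main__":
    print(loop_free_alternates(GOOD, "S", "E", "D"))
    print(loop_free_alternates(BAD, "S", "E", "D"))
```

The second topology shows why LFA coverage depends on link costs: the same physical neighbor stops being usable once its own best path runs back through the protecting router.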

15 State Transition of Improved Solution: State transition with protection and damping, improving availability and stability.

16 BGP Fast Convergence Solutions: Major problem in BGP: theoretical analysis and measurement results indicate that path exploration in path-vector protocols prolongs routing convergence. Several solutions address this problem. RCN (Root Cause Notification) carries root-cause information in BGP updates, which eliminates obsolete routes and ensures that only valid alternative routes are chosen and propagated. Ghost Flushing (GF) improves BGP convergence by expediting the removal of outdated ghost information from the Internet. Drawbacks: network fail-over events in GF, transient routing problems, …

17 Outline: Problem Statement; Analysis of Self-Healing Routing; Existing Improvement Solutions; Our Self-Healing Solution (Requirements of Solution, Routing Protection, Evaluation Metrics); Conclusion and Future Work

18 Self-healing Routing: The goal of self-healing routing: after a link or a node is destroyed, the network can restore or repair routes by itself. Self-healing routing approaches: Routing restoration (fast routing convergence) attempts to find a new path on demand to restore connectivity when a failure occurs. Routing protection relies on fixed, predetermined failure recovery: a working path is set up for traffic forwarding together with an alternate protection path.

19 Requirements of Solution: Simplicity: the solution should be simple and should not add much complexity to core networks; MPLS, by contrast, needs a fundamental infrastructure. Easy deployment and management: MPLS-related solutions are not good candidates because it is hard to pre-compute backup paths for every node. Efficiency: protection should not be deployed to cover 100% of the network, especially when multiple failures happen. Incremental deployment support: this is an important factor when designing a novel routing protocol, because we cannot ensure that it can be deployed all at once.

20 Requirements of Solution (cont.): Business model support: the solution should consider the business model of path protection in production networks. To protect unstable network areas and backbone areas, contracts between different ISPs should be signed to guarantee routing availability in these areas. Low cost: the path protection solution should provide routes without much computation or additional computing power on routers, and should guarantee packet delivery performance with low packet loss. The solution should cover protection under both short-term and long-term network failures.

21 Principle of our solution (cont.): The key idea of routing protection is a tradeoff between the additional cost introduced by tunneling and the packet loss caused by failures. Fast failure detection: for simplicity, fast detection, easy implementation, and no change to existing routing protocols, Bidirectional Forwarding Detection (BFD) is applied directly. Path protection technique: although two types of routing must be considered, intra-domain and inter-domain, there is no need to provide separate path protection techniques for different routing instances. To eliminate the problems introduced by L3 tunnels, we choose L2TP as the protection technique.
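
To give a feel for the packet-loss side of this tradeoff, the sketch below estimates how long BFD takes to declare a failure and how many packets are lost in that window. It is illustrative and not from the slides. The detection-time formula follows BFD asynchronous mode (the remote detect multiplier times the agreed interval, which is the larger of the local required minimum receive interval and the remote desired minimum transmit interval); the function names, timer values, and traffic rate are hypothetical assumptions.

```python
def bfd_detection_time_ms(remote_detect_mult,
                          local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Detection time in BFD asynchronous mode, in milliseconds."""
    agreed_interval = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval


def packets_lost(rate_pps, outage_ms):
    """Packets dropped while traffic still follows the failed path."""
    return rate_pps * outage_ms / 1000


if __name__ == "__main__":
    detect_ms = bfd_detection_time_ms(3, 50, 50)  # hypothetical timers
    print("detection time (ms):", detect_ms)
    # Hypothetical 10,000 packets/s flow: loss with BFD-triggered protection
    # versus waiting 30 s for routing to converge.
    print("lost during BFD detection:", packets_lost(10_000, detect_ms))
    print("lost during 30 s convergence:", packets_lost(10_000, 30_000))
```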

22 Principle of our solution (cont.): Tunnel deactivation: tunnels should be deactivated when a short-term failure recovers or when routes converge again after a long-term failure, e.g., for loop avoidance or performance. A tunnel deactivation mechanism is therefore essential to guarantee normal data forwarding. LAC: L2TP Access Concentrator. LNS: L2TP Network Server.
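
The activation and deactivation rules above can be sketched as a small state machine. This is a hedged illustration only; the slides do not specify the mechanism, so the class name `ProtectionTunnel` and the event names are hypothetical. It activates the tunnel on a detected failure and deactivates it either when the link recovers (short-term failure) or when routing has reconverged (long-term failure).

```python
class ProtectionTunnel:
    """Hypothetical activation state of a pre-established protection tunnel."""

    DEACTIVATING_EVENTS = ("link_recovered", "route_converged")

    def __init__(self):
        self.active = False
        self.log = []  # (event, active-after-event) pairs

    def on_event(self, event):
        if event == "failure_detected" and not self.active:
            # Divert traffic into the tunnel as soon as the failure is seen.
            self.active = True
        elif event in self.DEACTIVATING_EVENTS and self.active:
            # Short-term failure healed, or routing has found a new path:
            # return to normal forwarding to avoid loops and tunnel overhead.
            self.active = False
        self.log.append((event, self.active))
        return self.active


if __name__ == "__main__":
    tunnel = ProtectionTunnel()
    for ev in ["failure_detected", "link_recovered",
               "failure_detected", "route_converged"]:
        print(ev, "->", "active" if tunnel.on_event(ev) else "inactive")
```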

23 Evaluation metrics of routing system: Two metrics evaluate a routing system. Availability refers to the ability of the routing system to deliver packets normally whether or not network failures happen. Stability refers to the routing dynamics of the routing system whether or not network failures happen. Routing paths provided by tunnels guarantee routing availability, while delaying route updates during long-term failures and eliminating route updates during short-term failures improve the stability of the routing system.
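
The slide defines the two metrics only in words. One possible way to quantify them, offered as an assumption rather than the authors' definition, is to measure availability as the fraction of probe packets delivered and stability as the rate of route updates observed in a time window (lower means more stable). The function names and sample numbers below are hypothetical.

```python
def availability(delivered, sent):
    """Fraction of probe packets that reached their destination."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return delivered / sent


def update_rate(update_timestamps, window_start, window_end):
    """Route updates per second within [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    count = sum(1 for t in update_timestamps
                if window_start <= t < window_end)
    return count / (window_end - window_start)


if __name__ == "__main__":
    # Hypothetical measurements around a single failure event.
    print("availability:", availability(990, 1000))
    print("updates per second:", update_rate([1, 2, 3, 15], 0, 10))
```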

24 Outline: Problems; Analysis of Self-Healing Routing; Existing Improvement Solutions; Our Self-Healing Solution; Conclusion and Future Work

25 Conclusion and Future Work: There are many interesting problems in the Internet. Routing issues in the Internet are being actively addressed. Many of the problems are hard: there are no easy solutions, and tradeoffs have to be made. Our solution addresses the self-healing problems of routing. Future work: further study and measurement of our solution; development of a prototype and experimental analysis on CERNET2.

26 Thanks Q&A liqi@csnet1.cs.tsinghua.edu.cn

