Resilient Routing Reconfiguration
Ye Wang, Hao Wang*, Ajay Mahimkar+, Richard Alimi, Yin Zhang+, Lili Qiu+, Yang Richard Yang
Yale University   * Google   + University of Texas at Austin
ACM SIGCOMM 2010, New Delhi, India
September 2, 2010

Motivation
- Failures are common in operational IP networks: accidents, attacks, hardware failures, misconfiguration, maintenance
- Multiple unexpected failures may overlap (e.g., concurrent fiber cuts in Sprint, 2006)
- Planned maintenance affects multiple network elements and may overlap with unexpected failures (e.g., due to inaccurate SRLGs)
- Increasingly stringent reliability requirements: VoIP, video conferencing, gaming, mission-critical apps, etc.
- SLAs have teeth: a violation directly affects ISP revenue
- Need resiliency: the network should recover quickly and smoothly from one or multiple overlapping failures

Challenge: Topology Uncertainty
- The number of failure scenarios explodes quickly: for a 500-link network and 3-link failures, there are > 20,000,000 scenarios!
- Difficult to optimize routing to avoid congestion under all possible failure scenarios
  - Brute-force failure enumeration is clearly infeasible; existing methods handle only hundreds of topologies
- Difficult to install fast rerouting: preconfigure 20,000,000 backup routes on routers?

Existing Approaches & Limitations
- Focus exclusively on reachability, e.g., FRR, FCP (Failure-Carrying Packets), Path Splicing
  - May suffer from congestion and unpredictable performance
  - Congestion is mostly caused by rerouting under failures [Iyer et al.]
  - Multiple network element failures have a domino effect on FRR rerouting, resulting in network instability [N. So & H. Huang]
- Consider only a small subset of failures, e.g., single-link failures [D. Applegate et al.]
  - Insufficient for demanding SLAs
- Online routing re-optimization after failures
  - Too slow; cannot support fast rerouting

R3: Resilient Routing Reconfiguration
A novel link-based routing protection scheme that
- requires no enumeration of failure scenarios
- is provably congestion-free for all up-to-F link failures
- is efficient w.r.t. router processing/storage overhead
- is flexible in supporting diverse practical requirements

Problem Formulation
- Goal: congestion-free rerouting under up-to-F link failures
- Input: topology G(V,E), link capacities c_e, traffic demand d
  - d_ab: traffic demand from router a to router b
- Output: base routing r, protection routing p
  - r_ab(e): fraction of d_ab carried by link e
  - p_l(e): (link-based) fast rerouting for link l
- Example (figure, four nodes a, b, c, d): d_ab = 6; base routing r_ab(ab) = 1; protection for link ab: p_ab(ac) = 1, p_ab(cb) = 1
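
To make the notation concrete, here is a minimal sketch (illustrative only, not from the paper) of the four-node example above, with r and p stored as nested dictionaries keyed by OD pair / protected link and by link:

```python
# Representation of base routing r and protection routing p (illustrative only).
d = {("a", "b"): 6}                        # traffic demand d_ab = 6

r = {("a", "b"): {("a", "b"): 1.0}}        # base routing: all of d_ab on link ab

p = {("a", "b"): {("a", "c"): 1.0,         # protection for link ab:
                  ("c", "b"): 1.0}}        # detour a -> c -> b

def link_load(r, d, e):
    """Load on link e under base routing r (no failures)."""
    return sum(dem * r[pair].get(e, 0.0) for pair, dem in d.items())

assert link_load(r, d, ("a", "b")) == 6.0
```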

From Topology Uncertainty to Traffic Uncertainty
- Instead of optimizing for the original traffic demand on all possible post-failure topologies,
- R3 optimizes a protection routing for a set of traffic demands on the original topology
  - The rerouting virtual demand set captures the effect of failures on the amount of rerouted traffic
  - The protection routing on the original topology can easily be reconfigured for use after a failure occurs

Rerouting Virtual Demand Set
- Failure scenario (f) → rerouted traffic (x)
- The rerouted traffic x_l after link l fails equals the base load on l given r (if r is congestion-free, then x_l ≤ c_l)
- Example (figure shows load/capacity on each link):

  Failure scenario | Rerouted traffic | Upper bound of rerouted traffic
  ac fails         | x_ac = 4         | x_ac ≤ 5 (= c_ac)
  ab fails         | x_ab = 2         | x_ab ≤ 10 (= c_ab)

- Rerouted traffic under all possible up-to-F-link failure scenarios (independent of r):
  X_F = { x | 0 ≤ x_l ≤ c_l for all l, Σ_l (x_l / c_l) ≤ F }   (a convex set)
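
A minimal membership check for X_F (illustrative sketch, not from the paper):

```python
def in_XF(x, c, F):
    """Check x ∈ X_F = { x | 0 <= x_l <= c_l, sum_l x_l/c_l <= F }.

    x, c: dicts mapping link -> rerouted load and link -> capacity.
    """
    if any(x[l] < 0 or x[l] > c[l] for l in x):
        return False
    return sum(x[l] / c[l] for l in x) <= F

# Example above: 4 units rerouted when link ac (capacity 5) fails.
capacities = {("a", "c"): 5, ("a", "b"): 10}
assert in_XF({("a", "c"): 4, ("a", "b"): 0}, capacities, F=1)
```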

R3 Overview
- Offline precomputation
  - Plan (r, p) together for the original demand d plus the rerouting virtual demand x, on the original topology G(V,E), to minimize congestion
  - p may use links that will later fail
- Online reconfiguration
  - Convert and use p for fast rerouting after failures

Offline Precomputation
- Compute (r, p) to minimize the MLU (Maximum Link Utilization) for the original demand d plus any rerouting demand x ∈ X_F
  - r carries d (original traffic); p carries x ∈ X_F (rerouting traffic)
- min_{(r,p)} MLU subject to:
  [1] r is a routing, p is a routing;
  [2] ∀x ∈ X_F, ∀e: [ Σ_{a,b ∈ V} d_ab · r_ab(e) + Σ_{l ∈ E} x_l · p_l(e) ] / c_e ≤ MLU
- Challenge: [2] has an infinite number of constraints
- Solution: apply LP duality → a polynomial number of constraints
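
As a concrete reading of constraint [2], here is a minimal sketch (illustrative only) that evaluates its left-hand side for one particular x; the LP must enforce the bound for every x in X_F, which is what the duality step makes tractable:

```python
def max_link_utilization(r, p, d, x, c):
    """Utilization of the most loaded link under demand d plus rerouted demand x.

    Constraint [2] requires this value to stay <= MLU for *every* x in X_F;
    here we only evaluate it for one given x (illustrative only).
    c: dict mapping link -> capacity.
    """
    util = 0.0
    for e in c:
        base = sum(dem * r[pair].get(e, 0.0) for pair, dem in d.items())
        reroute = sum(x_l * p.get(l, {}).get(e, 0.0) for l, x_l in x.items())
        util = max(util, (base + reroute) / c[e])
    return util
```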

Online Reconfiguration
- Step 1: fast rerouting after ac fails
  - Precomputed p for ac: p_ac(ac) = 1/3, p_ac(ab) = 1/3, p_ac(ad) = 1/3
  - ac fails → fast reroute using p_ac minus the failed link itself, which is equivalent to a rescaled ξ_ac:
    ξ_ac(ac) = 0, ξ_ac(ab) = 1/2, ξ_ac(ad) = 1/2
  - In general, when link l fails: ξ_l(e) = p_l(e) / (1 − p_l(l))
- (Figure: p_ac(e) and ξ_ac(e) on the four-node example a, b, c, d.)
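
A minimal sketch of the rescaling step (illustrative only; it assumes p_l(l) < 1, i.e., the protection routing does not place all of link l's rerouted traffic on l itself):

```python
def rescale(p_l, l):
    """xi_l(e) = p_l(e) / (1 - p_l(l)): protection routing for l with l removed.

    p_l: dict mapping link e -> p_l(e). Assumes p_l(l) < 1.
    """
    self_frac = p_l.get(l, 0.0)
    xi = {e: frac / (1.0 - self_frac) for e, frac in p_l.items()}
    xi[l] = 0.0                       # the failed link carries nothing
    return xi

p_ac = {("a", "c"): 1/3, ("a", "b"): 1/3, ("a", "d"): 1/3}
xi_ac = rescale(p_ac, ("a", "c"))     # -> {ac: 0.0, ab: 0.5, ad: 0.5}
```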

Online Reconfiguration (Cont.)
- Step 2: reconfigure p after the failure of ac
  - Apply the detour ξ_ac to every protection routing (for the other links) that was using ac
  - Current p for ab: p_ab(ac) = 1/2, p_ab(ad) = 1/2
  - ac fails → the 1/2 on ac must be detoured using ξ_ac:
    p_ab(ac) = 0, p_ab(ab) = 1/4, p_ab(ad) = 3/4
  - In general, when link l fails: for every other protected link l′, p_l′(e) ← p_l′(e) + p_l′(l) · ξ_l(e)
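
A minimal sketch of the reconfiguration step (illustrative only; it reuses the rescale function from the previous sketch):

```python
def reconfigure(p, l_failed):
    """R3 online reconfiguration after l_failed fails (illustrative sketch).

    Detours the fraction each protection routing placed on l_failed via the
    rescaled xi, then zeroes out the failed link everywhere.
    """
    xi = rescale(p[l_failed], l_failed)
    new_p = {}
    for l, p_l in p.items():
        if l == l_failed:
            continue                        # the failed link needs no protection now
        via_failed = p_l.get(l_failed, 0.0)
        updated = {e: p_l.get(e, 0.0) + via_failed * xi.get(e, 0.0)
                   for e in set(p_l) | set(xi)}
        updated[l_failed] = 0.0             # nothing may traverse the failed link
        new_p[l] = updated
    return new_p

p = {("a", "c"): p_ac,
     ("a", "b"): {("a", "c"): 0.5, ("a", "d"): 0.5}}
new_p = reconfigure(p, ("a", "c"))
# new_p[("a", "b")] -> {ac: 0.0, ab: 0.25, ad: 0.75}, matching the slide.
```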

R3 Guarantees
- Sufficient condition for congestion-free rerouting:
  if ∃(r, p) with MLU ≤ 1 under d + X_F, then there is no congestion under any failure scenario involving up to F links
- Necessary condition under single-link failures:
  if there exists a protection routing that guarantees no congestion under any single-link failure scenario, then ∃(r, p) with MLU ≤ 1 under d + X_1
  - So adding a superset of the rerouted traffic to the original demand is not as wasteful as it may seem
- Open problem: is R3 optimal for more than one link failure?
- R3 online reconfiguration is order-independent under multiple failures

R3 Extensions
- Fixed base routing: r can be given (e.g., as an outcome of OSPF)
- Trade-off between no-failure and failure protection: add a penalty envelope β (≥ 1) to bound no-failure performance
- Trade-off between MLU and end-to-end delay: add an envelope γ (≥ 1) to bound end-to-end path delay
- Prioritized traffic protection: associate different protection levels with traffic of different priorities
- Realistic failure scenarios: Shared Risk Link Groups, Maintenance Link Groups
- Traffic variations: optimize (r, p) for d ∈ D plus x ∈ X_F

Evaluation Methodology
- Network topology
  - Real: Abilene, US-ISP (PoP-level)
  - Rocketfuel: Level-3, SBC, and UUNet
  - Synthetic: GT-ITM
- Traffic demand
  - Real: Abilene, US-ISP
  - Synthetic: gravity model
- Failure scenarios
  - Abilene: all failure scenarios with up to 3 physical links
  - US-ISP: maintenance events (6 months) + all 1-link and 2-link failures + ~1,100 sampled 3-link failures
- Enumeration is only needed for evaluation, not for R3

Evaluation Methodology (cont.)
- R3 vs. other rerouting schemes
  - OSPF+R3: add R3 rerouting to OSPF
  - MPLS-ff+R3: ideal R3 (flow-based base routing)
  - OSPF+opt: benchmark, optimal rerouting (based on OSPF)
  - OSPF+CSPF-detour: commonly used
  - OSPF+recon: ideal OSPF reconvergence
  - FCP: Failure-Carrying Packets [Lakshminarayanan et al.]
  - PathSplice: Path Splicing (k=10, a=0, b=3) [Motiwala et al.]
- Performance metrics
  - MLU (Maximum Link Utilization) measures congestion; lower is better
  - Performance ratio = MLU of the algorithm / MLU of the optimal routing for the changed topology (corresponding to the failure scenario); always ≥ 1, and closer to 1 is better

US-ISP Single Failure
R3 achieves near-optimal performance and outperforms the other schemes significantly.

US-ISP Multiple Failures
Left: all two-link failure scenarios; Right: sampled three-link failure scenarios.
R3 consistently outperforms the other schemes by at least 50%.

US-ISP No Failure: Penalty Envelope
With a 10% penalty envelope, R3 is near optimal under no failures.

Experiments Using the Implementation
- R3 Linux software router implementation
  - Based on the Linux kernel and Linux MPLS
  - Implements flow-based fast rerouting and efficient R3 online reconfiguration by extending Linux MPLS
- Abilene topology emulated on Emulab
- 3 physical link failures (6 directed link failures)

R3 vs. OSPF+recon: Link Utilization
R3 outperforms OSPF+recon by a factor of ~3.

Precomputation Complexity
- Profiled computation time (in seconds) on a 2.33 GHz CPU with 4 GB memory
- Offline precomputation time < 36 minutes (on operational topologies, < 17 minutes)
- Computation time remains stable as the number of protected failures grows

Router Storage Overhead
- Estimated maximum router storage overhead for our implementation
- ILM (MPLS incoming label mapping) and NHLFE (MPLS next-hop label forwarding entry) entries are required to implement R3 protection routing
- Modest router storage overhead: FIB < 300 KB, RIB < 20 MB, #ILM < 460, #NHLFE < 2,500

Conclusions
- R3: Resilient Routing Reconfiguration
  - Provably congestion-free guarantee under multiple failures
  - Key idea: convert topology uncertainty into traffic uncertainty
  - Offline precomputation + online reconfiguration
  - Flexible extensions to support practical requirements
- Trace-driven simulation: R3 is near optimal (> 50% better than existing approaches)
- Linux implementation: demonstrates the feasibility and efficiency of R3 in real networks

Thank you!

Backup Slides

US-ISP Single Failure: Zoom In
- R3 achieves near-optimal performance (R3 vs. opt)
- For each hour (in one day), compare the worst-case failure scenario for each algorithm

Level-3 Multiple Failures
Left: all two-failure scenarios; Right: sampled three-failure scenarios.
R3 outperforms the other schemes by > 50%.

SBC Multiple Failures
Left: all two-failure scenarios; Right: sampled three-failure scenarios.
Ideal R3 outperforms OSPF+R3 in some cases.

Robustness to the Base Routing
- OSPFInvCap: link weights inversely proportional to bandwidth
- OSPF: optimized link weights
- Left: single failure; Right: two failures
- A better base routing leads to better routing protection

Flow RTT (Denver-Los Angeles)
The R3 implementation achieves smooth routing protection.

R3: OD-Pair Throughput
The traffic demand is carried by R3 under multiple-link failure scenarios.

R3: Link Utilization
The bottleneck link load is kept under 0.37 with R3.

Offline Precomputation Solution
- Constraint [C2] contains an infinite number of constraints due to x
- Consider the maximum extra load on link e caused by x: we need Σ_{l ∈ E} x_l · p_l(e) ≤ UB_e for all x ∈ X_F
- By LP duality, this holds iff there exist multipliers π_e(l) ≥ 0 and λ_e ≥ 0 satisfying suitable constraints
- This converts [C2] into a polynomial number of constraints
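
The slide elides the exact dual constraints; the following reconstruction (an assumption on my part, derived by the standard LP-duality argument for maximizing a linear objective over X_F; UB_e denotes the extra-load budget for link e) shows one consistent form:

```latex
% Dualizing max_{x in X_F} sum_l x_l p_l(e) <= UB_e (reconstruction, hedged):
\max_{x \in X_F} \sum_{l \in E} x_l \, p_l(e) \le \mathrm{UB}_e
\;\Longleftrightarrow\;
\exists \, \pi_e(l) \ge 0, \ \lambda_e \ge 0 :
\quad
\sum_{l \in E} \pi_e(l) + F \lambda_e \le \mathrm{UB}_e,
\qquad
\pi_e(l) + \lambda_e \ge c_l \, p_l(e) \ \ \forall l \in E.
```

(Here the substitution y_l = x_l / c_l turns X_F into { y | 0 ≤ y_l ≤ 1, Σ_l y_l ≤ F }, whose dual multipliers are π_e(l) for y_l ≤ 1 and λ_e for Σ_l y_l ≤ F.)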

Traffic Priority
- Service priority is a practical requirement of routing protection
- Traffic priority example:
  - TPRT (real-time IP) traffic should be congestion-free under up-to-3 link failures (to achieve a …% reliability SLA)
  - Private Transport (TPP) traffic should survive up-to-2 link failures
  - General IP traffic need only survive single-link failures

R3 with Traffic Priority
- Attribute a protection level to each class of traffic
  - TPRT: protection level 3
  - TPP: protection level 2
  - IP: protection level 1
- Traffic with a protection level greater than or equal to i should survive the failure scenarios covered by protection level i

R3 with Traffic Priority Algorithm
- Precomputation with traffic priority
  - Consider protection for each protection level i
  - Guarantee that each class of traffic experiences no congestion under the failure scenarios covered by its protection level

R3 with Traffic Priority Simulation
- Routing protection simulation: basic R3 vs. R3 with traffic priority
- Methodology
  - US-ISP: a large tier-1 operational network topology
  - Hourly PoP-level TMs for a tier-1 ISP (1 week in 2007)
  - Extract IPFR and PNT traffic from the traces, subtract both from the total traffic, and treat the remainder as general IP
- Protection levels
  - TPRT: up-to-4 link failures
  - TPP: up-to-2 link failures
  - IP: single-link failures
- Failure scenarios
  - All single-link failures (enumerated)
  - 100 worst cases of 2-link failures
  - 100 worst cases of 4-link failures

Traffic Protection Priority
- IP: up-to-1 failure protection; TPP: up-to-2 failure protection; TPRT: up-to-4 failure protection
- Left: single failure; Right top: worst two failures; Right bottom: worst four failures
- R3 respects the different traffic protection priorities

Linux Implementation
- MPLS-ff software router
  - Based on the Linux kernel and Linux MPLS
  - Implements flow-based routing for efficient R3 online reconfiguration
  - Extends the MPLS FWD (forward) data structure to enable per-hop traffic splitting for each flow
- Failure detection and response
  - Detection using Ethernet monitoring in the kernel
  - In operational networks, detection can be conservative (for SRLGs, MLGs)
  - Notification using ICMP 42 flooding; requires reachability under failures
  - Traffic rerouted via MPLS-ff label stacking
  - Online reconfiguration of traffic splitting ratios (locally at each router)

MPLS-ff
- R3 uses flow-based traffic splitting (for each OD pair)
  - e.g., p_newy→wash(chic→hous) = 0.2 means link chic→hous will carry 20% of the traffic originally carried by newy→wash when newy→wash fails
- Current MPLS routers support only path-based traffic splitting
  - Traffic load on equal-cost LSPs is proportional to the bandwidth requested by each LSP
  - Juniper J-, M-, T-series and Cisco 7200, 7500 series
  - e.g., multiple paths from newy to wash can be set up to protect link newy→wash

MPLS-ff
- Convert flow-based to path-based routing?
  - e.g., using a flow decomposition algorithm [Zheng et al.]
  - Makes R3 online reconfiguration expensive: LSPs would need to be recomputed and re-signaled after each failure
- Instead, extend MPLS to support flow-based routing → MPLS-ff
  - Enables next-hop traffic splitting ratios for each flow
  - Flow: the traffic originally carried by a protected link

MPLS-ff
- The MPLS FWD data structure is extended to support multiple Next-Hop Label Forwarding Entries (NHLFEs)
  - One NHLFE specifies one neighbor
  - One next-hop splitting ratio for each NHLFE
- Example: at router R1 (neighbors R2, R3, R4), Label 200 FWD → NHLFE R2: 50%, NHLFE R3: 50%, NHLFE R4: 0%
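
A minimal sketch (illustrative only; this is not the actual Linux MPLS code) of the extended FWD entry described above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NHLFE:
    """Next-Hop Label Forwarding Entry: one neighbor plus its splitting ratio."""
    next_hop: str          # e.g., "R2"
    ratio: float           # next-hop splitting ratio in [0, 1]

@dataclass
class FwdEntry:
    """Extended MPLS FWD entry: one incoming label mapped to several NHLFEs."""
    label: int
    nhlfes: List[NHLFE] = field(default_factory=list)

# The Label 200 example above, as seen at router R1:
fwd = FwdEntry(200, [NHLFE("R2", 0.5), NHLFE("R3", 0.5), NHLFE("R4", 0.0)])
assert abs(sum(n.ratio for n in fwd.nhlfes) - 1.0) < 1e-9
```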

MPLS-ff
- Implement the next-hop splitting ratio at router i, for next hop j and protected link (a, b)
- Packet hashing
  - Same hash value for all packets in the same TCP flow (avoids reordering)
  - Independent hash values on different routers for any particular TCP flow → hash over the packet flow fields plus the router ID
  - Why the router ID? With Hash = f(src, dst, srcPort, dstPort) alone, if router i forwards to j only packets with 40 < Hash < 64, then every packet j receives from i has 40 < Hash < 64: a skewed hash value distribution downstream
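
A minimal sketch of hash-based splitting (illustrative only; the hash width, field encoding, and salting scheme are assumptions, and FwdEntry comes from the previous sketch):

```python
import hashlib

HASH_SPACE = 64  # illustrative hash range, matching the 0..64 example above

def flow_hash(src, dst, sport, dport, router_id):
    """Per-flow hash salted with the router ID, so that hash values for the
    same TCP flow are independent across routers (as the slide requires)."""
    key = f"{src}|{dst}|{sport}|{dport}|{router_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % HASH_SPACE

def pick_next_hop(fwd, src, dst, sport, dport, router_id):
    """Map the flow's hash into cumulative splitting-ratio intervals."""
    h = flow_hash(src, dst, sport, dport, router_id)
    cumulative = 0.0
    for n in fwd.nhlfes:
        cumulative += n.ratio * HASH_SPACE
        if h < cumulative:
            return n.next_hop
    return fwd.nhlfes[-1].next_hop   # guard against floating-point rounding
```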

Failure Response
- Failure detection and notification
  - Detection: Ethernet monitoring in the Linux kernel
  - Notification: ICMP 42 flood
- Failure response: MPLS-ff label stacking
- Protection routing update: R3 online reconfiguration requires a local copy of p on each router

R3 Design: Routing Model
- Network topology as a graph G = (V, E)
  - V: set of routers
  - E: set of directed network links; link e = (i, j) has capacity c_e
- Traffic matrix (TM)
  - TM d is a set of demands: d = { d_ab | a, b ∈ V }
  - d_ab: traffic demand from a to b
- Flow-based routing representation
  - r = { r_ab(e) | a, b ∈ V, e ∈ E }
  - r_ab(e): the fraction of the traffic from a to b (d_ab) that is carried by link e
  - e.g., r_ab(e) = 0.25 means link e carries 25% of the traffic from a to b

Intuition behind R3
- Plan rerouting on the single original topology
  - Avoids enumeration of topologies (failure scenarios)
  - Compute (r, p) to guarantee congestion-freedom for d + x ∈ X_F on G
- Two puzzles
  - Add rerouted traffic before it appears: X_F = { x | 0 ≤ x_l ≤ c_l, Σ_l (x_l/c_l) ≤ F } is a superset of the rerouted traffic under failures (not real traffic demand!)
  - Use network resources that will disappear: the protection routing p uses links that will fail (a topology with links that will fail!)
- By doing these two counter-intuitive things, R3 achieves:
  - Congestion-freedom under multiple link failures, without enumeration of failure scenarios
  - Optimality for single-link failures
  - Fast rerouting

Online Reconfiguration
- Step 1: fast rerouting after ac fails
  - Precomputed p for link ac: p_ac(ac) = 0.4, p_ac(ab) = 0.3, p_ac(ad) = 0.3
  - ac fails → the 0.4 must be carried by ab and ad
  - Rescaled p_ac: ac → 0, ab → 0.5, ad → 0.5
  - Router a locally rescales p_ac and activates fast rerouting
  - Efficiently implemented by extending MPLS label stacking
- (Figure: load/capacity on the links of the four-node example before and after fast rerouting.)