
1 1 Limiting the Impact of Failures on Network Performance. Yashar Ganjali, Computer Systems Lab., Stanford University (yganjali@stanford.edu, http://www.stanford.edu/~yganjali). Joint work with Supratik Bhattacharyya and Christophe Diot. High Performance Networking Group, 25 Feb. 2004.

2 2 Motivation  The core of the Internet consists of several large networks (IP backbones).  IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.  Failures occur on a daily basis as a result of  Physical layer malfunction,  Router hardware/software failures,  Maintenance,  Human errors, …  Failures affect the quality of service delivered to backbone customers.

3 3 Outline  Background  Sprint’s IP backbone  Data  Impact Metrics  Time-based metrics  Link-based metrics  Measurements  Reducing the impact  Identifying critical failures  Causes analysis  Reducing critical failures

4 4 Background – Sprint’s IP backbone  IP layer operates above DWDM with SONET framing.  IS-IS protocol used to route traffic inside the network.  IP-level restoration  When an IP link fails, all routers in the network independently compute a new path around the failure  No protection in the underlying optical infrastructure.

5 5 Data  IS-IS Link State PDU logs  Collected by passive listeners from Sprint’s North America backbone.  Feb. 1st, 2003 to Jun. 30th, 2003.  SNMP logs  Link loads recorded once every 5 minutes.  SONET layer alarms  Corresponding to minor and major problems in the optical layer.  We are only interested in two alarms: SLOS and SLOS cleared.

6 6 Link Failures in Sprint’s IP Backbone – 9408 Failures

7 7 Inter-POP vs. Intra-POP [Figure: diagram with nodes ANA-1, ANA-2, ANA-3, ANA-4]

8 8 Outline  Background  Sprint’s IP backbone  Data  Impact Metrics  Time-based metrics  Link-based metrics  Measurements  Reducing the impact  Identifying critical failures  Causes analysis  Reducing critical failures

9 9 Inter-POP Link Failures in Sprint’s IP Backbone

10 10 Two Perspectives  For a given impact metric  Time-based analysis: Measure the impact of failures on the given metric as a function of time.  Link-based analysis: Measure the impact of failures on the given metric as a function of failing links.
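
The two perspectives can be read as two groupings of the same failure log. The sketch below is only an illustration: the record layout `(link_id, start_s, end_s)`, the sample records, the one-hour bin, and the helper name `aggregate` are all assumptions, and the impact metric is simply the number of failures.

```python
from collections import defaultdict

# Hypothetical failure records: (link_id, start_time_s, end_time_s).
failures = [
    ("A-B", 10, 40),
    ("A-B", 4000, 4100),
    ("C-D", 20, 50),
]

def aggregate(records, key, impact=lambda record: 1):
    """Sum an impact value over records, grouped by key(record)."""
    totals = defaultdict(int)
    for record in records:
        totals[key(record)] += impact(record)
    return dict(totals)

# Time-based analysis: group failures by the hour in which they start.
by_hour = aggregate(failures, key=lambda r: r[1] // 3600)

# Link-based analysis: group the same failures by the failing link.
by_link = aggregate(failures, key=lambda r: r[0])

print(by_hour)  # {0: 2, 1: 1}
print(by_link)  # {'A-B': 2, 'C-D': 1}
```

Passing a different `impact` function (for example, one returning rerouted traffic per failure) would give the other metrics under the same two groupings.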

11 11 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

12 12 Number of Simultaneous Failures

13 13 Number of Simultaneous Failures
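
One way to compute the number of simultaneous failures over time is a sweep over failure start and end events. This is a minimal sketch under assumptions not stated on the slides: each failure is a half-open interval `[start, end)` on a distinct link, and the function name and sample intervals are hypothetical.

```python
def simultaneous_failures(intervals):
    """Return [(time, count)] change points for the number of links down at once."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a link goes down
        events.append((end, -1))    # a link comes back up
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a recovery is processed before a new failure at the same instant.
    events.sort()

    timeline = []
    count = 0
    for time, delta in events:
        count += delta
        if timeline and timeline[-1][0] == time:
            timeline[-1] = (time, count)  # merge events that share a timestamp
        else:
            timeline.append((time, count))
    return timeline

# Hypothetical failure intervals in seconds.
intervals = [(0, 10), (5, 15), (10, 20)]
timeline = simultaneous_failures(intervals)
print(timeline)  # [(0, 1), (5, 2), (10, 2), (15, 1), (20, 0)]
```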

14 14 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

15 15 Number of Affected O-D Pairs [Figure: example topology with nodes A, B, C, D, E, F]

16 16 Number of Affected O-D Pairs
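
The slides do not spell out how an affected O-D pair is determined. A plausible reading, assumed here, is that an origin-destination pair is affected when its pre-failure path traverses the failed link. The sketch below uses hop-count shortest paths on a hypothetical four-node topology; the real backbone routes on IS-IS link weights, so this is a simplification.

```python
from collections import deque
from itertools import permutations

def shortest_path(adj, src, dst):
    """Hop-count shortest path via BFS; ties are broken by neighbor name."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for neighbor in sorted(adj[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    if dst not in parent:
        return None
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def affected_od_pairs(adj, failed_link):
    """O-D pairs whose pre-failure path uses failed_link in either direction."""
    u, v = failed_link
    affected = set()
    for src, dst in permutations(sorted(adj), 2):
        path = shortest_path(adj, src, dst)
        if path is None:
            continue
        hops = set(zip(path, path[1:]))
        if (u, v) in hops or (v, u) in hops:
            affected.add((src, dst))
    return affected

# Hypothetical square topology: A-B, A-C, B-D, C-D.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
affected = affected_od_pairs(adj, ("A", "B"))
print(len(affected))  # 6
```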

17 17 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

18 18 Number of Affected BGP Prefixes

19 19 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

20 20 Path Unavailability [Figure: example topology with nodes A, B, C, D, E, F]

21 21 Path Unavailability

22 22 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

23 23 Total Rerouted Traffic

24 24 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

25 25 Maximum Load Throughout the Network

26 26 Maximum Load Throughout the Network 96% of link failures were not followed by an immediate change in maximum load.

27 27 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

28 28 Number of Failures per Link

29 29 Number of Affected OD Pairs per Link

30 30 Number of Affected BGP Prefixes per Link

31 31 Path Coverage [Figure: example topology with nodes A, B, C, D, E, F]

32 32 Path Coverage of Links

33 33 Total Rerouted Traffic on a Link

34 34 Peak Factor of a Link

35 35 Link-based Impact Metrics 1. Number of Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path coverage 5. Total rerouted traffic 6. Peak factor

36 36 Outline  Background  Sprint’s IP backbone  Data  Impact Metrics  Time-based metrics  Link-based metrics  Measurements  Reducing the impact  Identifying critical failures  Causes analysis  Reducing critical failures

37 37 Critical Failures  For each time-based metric  Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5.  For each link-based metric  Removing failures on 1-7% of links improves the metric by a factor of at least 3.
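
The slide does not say how the improvement factor is measured. The sketch below assumes one possible reading: rank the per-period (or per-link) metric values, drop the largest few percent, and compare the original peak with the peak of what remains. The function name, the use of the peak as the summary, and the sample values are all assumptions.

```python
import math

def improvement_factor(values, percent):
    """Original peak divided by the peak left after dropping the top `percent`% of entries."""
    if not values:
        raise ValueError("values must be non-empty")
    # Integer ceiling of len(values) * percent / 100, avoiding float rounding.
    n_remove = (len(values) * percent + 99) // 100
    remaining = sorted(values, reverse=True)[n_remove:]
    if not remaining or max(remaining) == 0:
        return math.inf
    return max(values) / max(remaining)

# Hypothetical metric values for ten time periods (or ten links).
values = [50, 40, 5, 4, 3, 2, 2, 1, 1, 1]
print(improvement_factor(values, 20))  # 10.0
```

The entries that get dropped play the role of the critical set: a small share of periods or links that accounts for most of the impact.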

38 38 Critical Time Periods

39 39 Critical Links  Any link that has a critical failure is called a critical link.  We are interested in fixing such links.

40 40 Correlation of Critical Sets

41 41 Correlation of the Critical Sets
Metric                      Size  1     2     3     4     5     6     7     8     9     10
1) Simultaneous failures    11    -     0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs           9     -     -     0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes        6     -     -     -     0.18  0.32  0.09  0.05  0.1   0.07  0.03
4) Path unavailability      5     -     -     -     -     0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic   6     -     -     -     -     -     0.09  0.11  0.09  0.08
6) # of failures            2     -     -     -     -     -     -     0.29  0.31  0.25  0.17
7) # of O-D pairs           3     -     -     -     -     -     -     -     0.29  0.3   0.18
8) # of BGP prefixes        2     -     -     -     -     -     -     -     -     0.13  0.19
9) Path coverage            6     -     -     -     -     -     -     -     -     -     0.08
10) Total rerouted traffic  1     -     -     -     -     -     -     -     -     -     -
Overall 23% of all links are critical.

42 42 Cause Analysis  Markopoulou et al. used IS-IS update messages to classify link failures into the following categories [MIB+04]:  Maintenance  Unplanned  Shared failures: router-related, optical-related, unspecified  Individual failures (about 70% of all unplanned failures)

43 43 Matching SLOS Alarms with IP Link Failures [Figure: timeline marking IP link failure, SLOS (~20 ms), and SLOS cleared (~12 sec)]  58% of all link failures are due to optical layer problems.  84% of critical failures are due to optical layer problems.
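
Matching alarms to failures can be sketched as a per-link search for an alarm timestamp close to each IP link failure. Everything specific here is an assumption: the `(link_id, timestamp_s)` record layout, a shared link identifier between the IP and optical logs, and the `tolerance` parameter, which is left open because the slide does not define how its ~20 ms and ~12 sec figures are applied.

```python
from bisect import bisect_left

def match_failures_to_alarms(failures, alarms, tolerance):
    """For each (link, time) failure, report whether a SLOS alarm on the
    same link falls within +/- tolerance seconds of it."""
    alarms_by_link = {}
    for link, timestamp in alarms:
        alarms_by_link.setdefault(link, []).append(timestamp)
    for times in alarms_by_link.values():
        times.sort()

    matched = []
    for link, timestamp in failures:
        times = alarms_by_link.get(link, [])
        # First alarm at or after the start of the matching window.
        i = bisect_left(times, timestamp - tolerance)
        matched.append(i < len(times) and times[i] <= timestamp + tolerance)
    return matched

# Hypothetical logs: (link_id, timestamp in seconds).
ip_failures = [("L1", 10.0), ("L1", 50.0), ("L2", 5.0)]
slos_alarms = [("L1", 9.99), ("L2", 20.0)]
matched = match_failures_to_alarms(ip_failures, slos_alarms, tolerance=0.05)
print(matched)  # [True, False, False]
```

The share of `True` entries would correspond to the fraction of link failures attributed to optical layer problems.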

44 44 Reducing Critical Failures  Replace old optical fibers/parts.  Optical Protection.  Push the traffic away.  Also works for maximum load and peak factor.

45 45 Performance Improvement
Time-based metrics        % improvement
# of failures             41
# of affected O-D pairs   36
# of BGP prefixes         32
Path unavailability       39
Total rerouted traffic    29
Link-based metrics        % improvement
# of failures             45
# of affected O-D pairs   37
# of BGP prefixes         29
Path coverage             42
Total rerouted traffic    38

46 46 Reducing Link Down-time  Low-failure links:  Failures are very rare.  Damping doesn’t help.  High-failure links:  Failure rate changes very slowly.  Fixed damping is wasteful.

47 47 Adaptive Damping
Input:
  : time difference between the last two failures
  : threshold
  : constant
function Adaptive_Damping
begin
  if (  <  ) ADT :=  x  ;
  else ADT := 0;
end;
Output: ADT: Adaptive damping timer
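
The symbols in the Adaptive Damping pseudocode did not survive in this transcript, so the following is only a sketch with assumed names: `delta` for the time difference between the last two failures, `threshold` for the threshold, and `c` for the constant. It also assumes the timer is `c` times `delta`; the transcript does not show which quantity is multiplied by the constant.

```python
def adaptive_damping(delta, threshold, c):
    """Return the adaptive damping timer (ADT).

    delta:     time between the last two failures (assumed name)
    threshold: gap below which the link is considered to be flapping
    c:         constant multiplier (assumed to scale delta)
    """
    if delta < threshold:
        # Failures are close together: damp the link for a time
        # proportional to the observed gap (assumption, see above).
        return c * delta
    # Failures are far apart: no damping.
    return 0

print(adaptive_damping(10, 60, 3))   # 30
print(adaptive_damping(120, 60, 3))  # 0
```

Under this reading, links that fail rarely are never damped, while the timer on frequently failing links follows their recent failure spacing instead of a fixed value.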

48 48 Number – Duration Pareto Curve

49 49 Thank you!

