A survey of Internet routing reliability. Presented by Kundan Singh, IRT internal talk, April 9, 2003.


1 A survey of Internet routing reliability. Presented by Kundan Singh. IRT internal talk, April 9, 2003.

2 Agenda. Routing overview. Problems: route oscillations, slow convergence, scaling, configuration. Effect on VoIP.

3 Overview of Internet routing. [Figure: autonomous systems, including AT&T (international provider), MCI, two regional providers, a cable modem provider and a campus network; labels: OSPF (optimize path), BGP (policy based).]

4 Border gateway protocol (BGP). Runs over TCP; messages: OPEN, UPDATE, KEEPALIVE, NOTIFICATION. Hierarchical peering relationships. Export rule: all routes to customers; only customer and local routes to peers and providers. Path-vector protocol: optimal AS path satisfying policy. [Figure: AS topology with provider-customer, peer and backup links; advertised paths to destination d: 47, 247, 1247, 31247; e: 3125...]
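
The export rule on slide 4 can be sketched as a small predicate. This is a minimal illustration, assuming every route is tagged with the relationship of the neighbor it was learned from; the names `Rel` and `should_export` are hypothetical and not taken from the talk or from any BGP implementation.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship of a neighbor, or LOCAL for routes this AS originates."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"
    LOCAL = "local"


def should_export(learned_from: Rel, send_to: Rel) -> bool:
    """Return True if a route learned from `learned_from` may be advertised to `send_to`.

    Rule from the slide: customers receive all routes; peers and providers
    receive only customer and local routes.
    """
    if send_to is Rel.CUSTOMER:
        return True
    return learned_from in (Rel.CUSTOMER, Rel.LOCAL)
```

For example, `should_export(Rel.PEER, Rel.PROVIDER)` is false, so under this rule an AS does not carry traffic between its peers and its providers.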

5 Route selection, in order: (1) local AS preference; (2) AS path length; (3) multi-exit discriminator (MED); (4) prefer external BGP over internal BGP; (5) internal routing metrics (e.g., OSPF); (6) identifier as last tie breaker. [Figure: AS1, AS2, AS3, AS4 with nodes B1-B4, R1, R2, C1, C2.]
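
The selection order on slide 5 maps naturally onto a sort key. The sketch below is a simplification under stated assumptions: the `Route` fields and function names are hypothetical, higher local preference wins, lower values win for every other step, and MED is compared across all routes even though real routers normally compare it only between routes from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Route:
    local_pref: int            # higher is better
    as_path: Tuple[int, ...]   # shorter is better
    med: int                   # lower is better
    is_ebgp: bool              # external BGP preferred over internal BGP
    igp_metric: int            # lower internal (e.g., OSPF) cost is better
    router_id: int             # lowest identifier is the final tie breaker


def selection_key(r: Route):
    """Tuple ordered like the slide; the smallest tuple is the best route."""
    return (
        -r.local_pref,
        len(r.as_path),
        r.med,
        0 if r.is_ebgp else 1,
        r.igp_metric,
        r.router_id,
    )


def best_route(routes: Iterable[Route]) -> Route:
    return min(routes, key=selection_key)
```

Because tuples compare element by element, a later criterion is consulted only when all earlier ones tie, which is the behavior the slide describes.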

6 Route oscillation. Each AS policy is independent. Persistent vs transient. Does not occur if routing is distance based. Solutions: static graph analysis; policy guidelines; dynamic “flap” damping. [Figure: three-node example with nodes 0, 1, 2.]

7 Static analysis. Abstract models: Solvable? Resilient to link failure? Multiple solutions? Sometimes solvable? Does not work: NP-complete; relies on Internet routing registries.

8 Policy guidelines. An AS MUST: prefer customer over peer/provider; have lowest preference for backup paths; “avoidance level” increases as path traverses; MED must be used across all advertisements. Works even on failure and is consistent with current practice. Limits policy usage.

9 Convergence in intra-domain routing. IS-IS, millisecond convergence: detect change (hardware, keep-alive); improved incremental SPF; link “down” handled immediately, “up” delayed; propagate update before calculating SPF; keep-alive before data packets; detect duplicate updates. OSPF stability: sub-second keep-alive; randomization; multiple failures; loss resilience. Distance vector: count to infinity.

10 BGP convergence. [Figure: nodes 0, 1, 2 and destination R; path tables (R, 1R, 2R), (0R, 1R, R), (0R, R, 2R).]

11 BGP convergence. [Figure: nodes 0, 1, 2 and destination R; path tables ( -, 1R, 2R), (0R, 1R, - ), (0R, -, 2R). Messages: 0->1: 01R; 0->2: 01R; 1->0: 10R; 1->2: 10R; 2->0: 20R; 2->1: 20R.]

12 BGP convergence. [Figure: nodes 0, 1, 2 and destination R; path tables ( -, 1R, 2R), (01R, 1R, - ), ( -, -, 2R). Messages: 1->0: 10R; 1->2: 10R; 1->0: 12R; 1->2: 12R; 2->0: 20R; 2->1: 20R; 2->0: 21R; 2->1: 21R. Label: 01R.]

13 BGP convergence. [Figure: nodes 0, 1, 2 and destination R; path tables ( -, -, 2R), (01R, 10R, - ), ( -, -, 2R). Messages: 1->0: 12R; 1->2: 12R; 2->0: 20R; 2->1: 20R; 2->0: 21R; 2->1: 21R; 2->0: 201R; 2->1: 201R; 0->1: W; 0->2: W. Label: 10R.]

14 BGP convergence. MinRouteAdver applied to announcements: in 13 steps. Sender-side loop detection: one step. [Figure: nodes 0, 1, 2 and destination R; path tables ( -, -, - ) after 48 steps.]

15 BGP convergence [2]. Latency is due to path exploration. Fail-over latency = 30 n seconds, where n is the longest backup path length. Convergence within 3 min; some oscillations last up to 15 min. Loss and delay during convergence. “Up” converges faster than “down”. Verified by experiment.
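
As a worked example of the fail-over formula on slide 15, the sketch below evaluates latency = 30 n for a given backup path length. It assumes the factor 30 is in seconds (matching the commonly used 30-second MinRouteAdver timer); the function and constant names are hypothetical.

```python
# Assumption: the slide's factor of 30 is expressed in seconds.
SECONDS_PER_HOP = 30


def failover_latency_seconds(longest_backup_path_len: int) -> int:
    """Fail-over latency from the slide's formula: 30 * n."""
    if longest_backup_path_len < 0:
        raise ValueError("path length must be non-negative")
    return SECONDS_PER_HOP * longest_backup_path_len
```

Under that assumption, a longest backup path of 6 AS hops gives 180 seconds, i.e. 3 minutes.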

16 BGP convergence [3]. Path exploration leads to latency. Denser peering leads to more latency. Large providers: better convergence.

17 BGP convergence [4]. Route flap damping: to avoid excessive flaps, penalize updated routes; penalty decays exponentially; “suppression” and “reuse” thresholds; worsens convergence. Selective damping: do not penalize if path length keeps increasing; attach a preference to the route.
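
The damping mechanism on slide 17 (a per-route penalty that decays exponentially, with separate suppression and reuse thresholds) can be sketched as follows. The class name and all numeric defaults are illustrative assumptions, not values from the talk; timestamps are in seconds and must be passed in non-decreasing order.

```python
class DampedRoute:
    """Per-route flap-damping state with an exponentially decaying penalty."""

    def __init__(self, penalty_per_flap=1000.0, suppress=2000.0,
                 reuse=750.0, half_life=900.0):
        # All defaults are illustrative assumptions.
        self.penalty_per_flap = penalty_per_flap
        self.suppress = suppress      # suppress the route at or above this penalty
        self.reuse = reuse            # re-advertise once the penalty falls below this
        self.half_life = half_life    # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one route update (flap) at time `now`."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True

    def is_usable(self, now):
        """Return True if the route is not currently suppressed."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return not self.suppressed
```

The gap between the two thresholds provides hysteresis: a suppressed route stays suppressed well after the flapping stops, which is consistent with the slide's remark that damping worsens convergence.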

18 BGP convergence [5]. 12R and 235R are inconsistent: prefer the directly learnt 235R. Distinguish failure from policy change. Order-of-magnitude improvement. [Figure: example topology (labels 1, 20R, 35) with paths 12R, 235R and 2R.]
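
The consistency idea on slide 18 can be illustrated with a short check. This is a simplified sketch of the idea only, not the algorithm from the work the slide summarizes: it assumes a node holds one advertised path per neighbor, and it discards any path that passes through another direct neighbor whose own advertisement disagrees with the remainder of that path. The function name and data layout are hypothetical.

```python
def consistent_paths(adverts):
    """Filter out paths contradicted by a directly learnt advertisement.

    `adverts` maps each neighbor to the path it advertised, as a tuple
    that starts with that neighbor, e.g. {1: (1, 2, "R")}.
    """
    kept = {}
    for neighbor, path in adverts.items():
        ok = True
        for i, hop in enumerate(path[1:], start=1):
            direct = adverts.get(hop)
            # If `hop` is also a direct neighbor, its own advertisement
            # must match the tail of this path from `hop` onward.
            if direct is not None and tuple(direct) != tuple(path[i:]):
                ok = False
                break
        if ok:
            kept[neighbor] = path
    return kept
```

With the slide's example, neighbor 1 advertises 12R (implying that 2 uses 2R) while neighbor 2 itself advertises 235R, so 12R is dropped and the directly learnt 235R is kept.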

19 BGP scaling. Full-mesh logical connections within an AS. Add hierarchy.

20 BGP scaling [2]. Route reflector (RR): more popular; upgrade only the RR. Confederations: sub-divide the AS. Fewer updates and sessions.
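
To see why hierarchy helps, the sketch below counts internal BGP sessions for a full mesh and for a simple route-reflector layout. The reflector layout is an assumption made for illustration (reflectors fully meshed among themselves, every other router a client of exactly one reflector); the function names are hypothetical.

```python
def full_mesh_sessions(n: int) -> int:
    """Sessions needed when every pair of n routers peers directly."""
    return n * (n - 1) // 2


def route_reflector_sessions(n: int, reflectors: int) -> int:
    """Sessions with `reflectors` fully meshed RRs and one session per client.

    Assumption: each of the n - reflectors clients peers with exactly one RR.
    """
    if not 1 <= reflectors <= n:
        raise ValueError("need between 1 and n reflectors")
    clients = n - reflectors
    return full_mesh_sessions(reflectors) + clients
```

For 100 routers, the full mesh needs 4950 sessions, while two reflectors under this layout need 99.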

21 BGP scaling [3]. May have a loop if the signaling path is not the forwarding path. [Figure: two route reflectors (RR) with clients C1 and C2 and routes P and Q; legend: signaling path, logical BGP session, physical link; annotations “Choose Q” and “Choose P”.]

22 BGP scaling [4]. Persistent oscillations are possible. Modify to pass multiple route information within an AS.

23 BGP stability. Initial experiment (’96): 99% redundant updates, caused by implementation or configuration bugs. After bug fixes (’97-’98): well distributed across AS and prefix.

24 BGP stability [2]. Inter-domain experiment (’98): 9 months, 9 GB, 55,000 routes, 3 ISPs, 15 min filtering. 25-35% of routes are 99.99% available. 10% of routes are less than 95% available.
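
The availability figures on slide 24 are easier to compare as downtime. The helper below is a small worked conversion, assuming a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailable minutes per year at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR
```

At 99.99% availability this is about 52.6 minutes per year, while 95% corresponds to about 26,280 minutes, or roughly 18 days.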

25 BGP stability [3]. Failure: more than 50% of routes have MTTF > 15 days; 75% failed within 30 days; most fail-over/re-route within 2 days (increased since ’94). Repair: 40% of route failures repaired in < 10 min, 60% in 30 min. A small fraction of routes accounts for the majority of instability. Weekly/daily frequency suggests congestion as a possible cause.

26 BGP stability [4]. Backbone routers: interface MTTF 40 days; 80% of failures resolved in 2 hr; maintenance, power and PSTN are the major causes of outages (approx. 16% each); overall uptime of 99%. Popular destinations: quite robust; average duration is less than 20 s, due to convergence.

27 BGP under stress. Congestion: prioritize routing control messages over data. Routing table size: AS count, prefix length, multi-homing, NAT; effects: number of updates, convergence; configuration, no universal filter. Real routers: “malloc” failure; cascading effect; prefix limiting option; graceful restart. CodeRed/Nimda: quite robust; some features get activated during stress; improper rate limiting; misconfiguration: IGP instability propagated; bugs: duplicate announcements.

28 BGP misconfiguration. Failure to summarize, hijack, advertise internal prefix, or policy. 200-1200 prefixes each day; ¾ of new advertisements are a result; 4% of prefixes affect connectivity. Causes: initialization bug (22%), reliance on upstream filtering (14%), from IGP (32%); bad ACL (34%), prefix based (8%). Conclusion: user interface, authentication, consistency verification, transaction semantics for commands.

29 PSTN failures. Switch vendors aim for 99.999% availability. Network availability varies (domestic US calls > 99.9%). Study in ’97: overload caused 44% of customer-minutes; mostly short outages; human error caused 50% of outages, software only 14%. No convergence problem.

30 VoIP. Tier-1 backbone (Sprint) has good delay and loss characteristics. Average scattered loss 0.19% (mostly single-packet loss; use FEC). 99.9% of probes have < 33 ms delay. Most burst loss is due to routing problems. Customer sites have more problems.

31 VoIP [2]. Outage = more than 300 ms of loss. More than 23% of losses are outages. Outages are similar for different networks. Calls are aborted due to poor quality. Net availability = 98%.
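
The outage definition on slide 31 (more than 300 ms of loss) can be applied to a packet trace as sketched below. The trace format and the 20 ms packet interval are assumptions chosen for illustration, and the function name is hypothetical.

```python
def find_outages(received, packet_interval_ms=20, outage_threshold_ms=300):
    """Return (start_index, duration_ms) for each loss run longer than the threshold.

    `received` is a sequence of booleans, one per expected packet, where
    False marks a lost packet. The 20 ms spacing is an assumed default.
    """
    outages = []
    run_start = None
    # A trailing True sentinel closes any loss run that reaches the end.
    for i, ok in enumerate(list(received) + [True]):
        if not ok and run_start is None:
            run_start = i
        elif ok and run_start is not None:
            duration = (i - run_start) * packet_interval_ms
            if duration > outage_threshold_ms:
                outages.append((run_start, duration))
            run_start = None
    return outages
```

With 20 ms packets, a run of 3 lost packets (60 ms) counts as ordinary burst loss, while a run of 20 lost packets (400 ms) is reported as an outage.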

32 Future work. End-system and higher-layer protocol reliability and availability. Mechanisms to reduce the effect of outages on VoIP. Redundancy of VoIP systems during outages. Convergence and scaling of TRIP, which is similar to BGP.

