Jian Wu (University of Michigan)


1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network
Jian Wu (University of Michigan), Z. Morley Mao (University of Michigan), Jennifer Rexford (Princeton University), Jia Wang (AT&T Labs Research)

2 Motivation
[Figure: ASes AS1 to AS4 with a source and a destination; labels: Failure, Disruption, Congestion, Mitigation]
Backbone network is vulnerable to routing changes that occur in other domains.

3 Goal
Identify important anomalies:
  Lost reachability
  Persistent flapping
  Large traffic shifts
Contributions:
  Build a tool that identifies a small number of important routing disruptions from a large volume of raw BGP updates in real time.
  Use the tool to characterize routing disruptions that occur in an operational network.

4 Interdomain Routing: Border Gateway Protocol
[Figure: AS 1, AS 2 and AS 3 connected by eBGP sessions, with iBGP inside an AS; announcements “I can reach /24” and “I can reach /24 via AS 1”; data traffic shown]
Prefix-based: one route per prefix
Path-vector: list of ASes in path
Incremental: every update indicates a change
Policy-based: local ranking of routes

5 Capturing Routing Changes
Large operational network (8/16/2004 – 10/ )
[Figure: border routers receive updates over eBGP and send their best routes over iBGP to a BGP Monitor; a CPE is also shown]

6 Challenges
Large volume of BGP updates
  Millions daily, very bursty
  Too much for an operator to manage
Not the same as root-cause analysis
  Identify changes and their effects
  Focus on actionable events rather than diagnosis
  Diagnose causes in/near the AS

7 System Architecture
Pipeline: BGP Updates (10^6) -> BGP Update Grouping -> Events (10^5) -> Event Classification -> “Typed” Events -> Event Correlation -> Clusters (10^3) -> Traffic Impact Prediction (with Netflow Data) -> Large Disruptions (10^1)
Side outputs: Persistent Flapping Prefixes (10^1), Frequent Flapping Prefixes (10^1)
From millions of updates to a few dozen reports

8 Grouping BGP Updates into Events
Challenge: a single routing change
  leads to multiple update messages
  affects routing decisions at multiple routers
Solution:
  Group all updates for a prefix with inter-arrival time < 70 seconds into one event.
  Flag prefixes with changes lasting > 10 minutes as Persistent Flapping Prefixes.
[Pipeline stage: BGP Updates -> BGP Update Grouping -> Events, Persistent Flapping Prefixes]

9 Grouping Thresholds
Based on our understanding of BGP and data analysis
Event timeout: 70 seconds
  2 * MRAI timer + 10 seconds (a 30-second MRAI timer gives 2 * 30 + 10 = 70)
  98% of inter-arrival times < 70 seconds
Convergence timeout: 10 minutes
  BGP usually converges within a few minutes
  99.9% of events last < 10 minutes
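The grouping rule with these two thresholds can be sketched in a few lines of Python. This is an illustrative offline version, not the authors' online tool: the function name `group_updates` and the `(timestamp, prefix)` input shape are assumptions made for the example, and only the 70-second and 10-minute values come from the slides.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds, event timeout from the slide
CONVERGENCE_TIMEOUT = 600   # seconds (10 minutes), convergence timeout from the slide


def group_updates(updates):
    """Group (timestamp, prefix) updates into per-prefix events.

    Hypothetical helper: returns (events, persistent), where events is a
    list of (prefix, start, end, update_count) tuples and persistent is the
    set of prefixes with an event lasting longer than CONVERGENCE_TIMEOUT.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = last = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - last < EVENT_TIMEOUT:
                # Same event: the gap to the previous update is short.
                last, count = ts, count + 1
            else:
                # Gap too long: close the current event and open a new one.
                events.append((prefix, start, last, count))
                start = last = ts
                count = 1
        events.append((prefix, start, last, count))

    persistent = {p for p, start, end, _ in events
                  if end - start > CONVERGENCE_TIMEOUT}
    return events, persistent
```

A prefix that keeps receiving updates less than 70 seconds apart for more than 10 minutes ends up in `persistent`, which mirrors the Persistent Flapping Prefixes output of the grouping stage.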

10 Persistent Flapping Prefixes
Types of persistent flapping:
  Conservative damping parameters (78.6%)
  Protocol oscillations due to MED (18.3%)
  Unstable interface or BGP session (3.0%)
Surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.

11 Example: Unstable eBGP Session
[Figure: ISP routers B, C and D, a Peer, a Customer, and prefix p]
Flap damping parameters are session-based
Damping is not implemented for iBGP sessions

12 Event Classification
Challenge: major concerns in network management
  Changes in reachability
  Heavy load of routing messages on the routers
  Changes in the flow of traffic through the network
Solution: classify events by severity of their impact
[Pipeline stage: Events -> Event Classification -> “Typed” Events, e.g., Loss/Gain of Reachability]

13 Event Category – “No Disruption”
[Figure: ISP routers A, B, C and D, neighbors AS1 and AS2; label: No Traffic Shift]
“No Disruption”: no border router has any traffic shift. (50.3%)

14 Event Category – “Internal Disruption”
[Figure: ISP routers A, B, C and D, neighbors AS1 and AS2; label: Internal Traffic Shift]
“Internal Disruption”: all of the traffic shifts are internal traffic shifts. (15.6%)

15 Event Category – “Single External Disruption”
[Figure: ISP routers A, B, C and D, neighbors AS1 and AS2; label: External Traffic Shift]
“Single External Disruption”: only one of the traffic shifts is an external traffic shift. (20.7%)
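The category definitions can be read as a small decision procedure. The sketch below is one possible reading, not the authors' classifier: the function `classify_event`, its arguments, and the per-router labels "none", "internal" and "external" are hypothetical, and the rules for "Multiple External Disruption" and "Loss/Gain of Reachability" (categories named elsewhere in the talk but not defined on these slides) are assumptions.

```python
def classify_event(reachable_before, reachable_after, router_shifts):
    """Classify one event from per-router traffic shifts.

    router_shifts maps a border router name to "none", "internal" or
    "external" (hypothetical labels for the kind of traffic shift seen
    at that router during the event).
    """
    # Assumption: a prefix that appears or disappears entirely is a
    # reachability change, regardless of individual router shifts.
    if reachable_before != reachable_after:
        return "Loss/Gain of Reachability"

    shifts = [s for s in router_shifts.values() if s != "none"]
    if not shifts:
        return "No Disruption"            # no border router has any traffic shift

    external = sum(1 for s in shifts if s == "external")
    if external == 0:
        return "Internal Disruption"      # all traffic shifts are internal
    if external == 1:
        return "Single External Disruption"
    # Assumption: more than one external shift.
    return "Multiple External Disruption"
```

For example, `classify_event(True, True, {"A": "internal", "B": "none"})` falls into the internal category because the only shift observed is internal.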

16 Statistics on Event Classification
Category                        Events   Updates
No Disruption                   50.3%    48.6%
Internal Disruption             15.6%    3.4%
Single External Disruption      20.7%    7.9%
Multiple External Disruption    7.4%     18.2%
Loss/Gain of Reachability       6.0%     21.9%
The first 3 categories vary significantly from day to day.
Updates per event depend on the type of event and the number of affected routers.
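Dividing each category's share of updates by its share of events gives its updates per event relative to the overall average, which makes the last observation concrete. The snippet below is only a worked calculation on the percentages from this slide; the dictionary layout and function name are illustrative choices.

```python
# (share of events %, share of updates %) per category, from the slide.
SHARES = {
    "No Disruption": (50.3, 48.6),
    "Internal Disruption": (15.6, 3.4),
    "Single External Disruption": (20.7, 7.9),
    "Multiple External Disruption": (7.4, 18.2),
    "Loss/Gain of Reachability": (6.0, 21.9),
}


def relative_updates_per_event(shares):
    """Return update share divided by event share for each category.

    A value above 1 means the category produces more updates per event
    than the average event; below 1 means fewer.
    """
    return {name: updates / events for name, (events, updates) in shares.items()}


if __name__ == "__main__":
    for name, ratio in relative_updates_per_event(SHARES).items():
        print(f"{name}: {ratio:.2f}x the average updates per event")
```

On these figures a loss/gain-of-reachability event carries about 3.65 times the average number of updates, while an internal disruption carries about 0.22 times the average.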

17 Event Correlation
Challenge: a single routing change affects multiple destination prefixes
Solution: group events of the same type that occur close in time
[Pipeline stage: “Typed” Events -> Event Correlation -> Clusters]
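A minimal sketch of this grouping step follows. The slide does not say how close in time events must be, so the `window` parameter (and its default of 60 seconds) is purely an assumption, as are the function name `correlate` and the `(timestamp, event_type, prefix)` input shape.

```python
def correlate(events, window=60):
    """Cluster events of the same type that occur close in time.

    events: iterable of (timestamp, event_type, prefix) tuples.
    window: hypothetical maximum gap in seconds between consecutive
            events of one type inside a cluster (not given in the talk).
    Returns a list of dicts with keys "type", "start", "end", "prefixes".
    """
    clusters = []
    latest = {}  # event_type -> most recent cluster of that type
    for ts, etype, prefix in sorted(events):
        current = latest.get(etype)
        if current is not None and ts - current["end"] <= window:
            # Close enough to the previous event of this type: extend it.
            current["end"] = ts
            current["prefixes"].add(prefix)
        else:
            current = {"type": etype, "start": ts, "end": ts, "prefixes": {prefix}}
            clusters.append(current)
            latest[etype] = current
    return clusters
```

Each resulting cluster stands for one suspected routing change that touched all the prefixes it contains.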

18 eBGP Session Reset
Caused most of the “single external disruption” events
Check if the number of prefixes using that session as the best route changes dramatically
Validation with Syslog router report (95%)
[Plot: number of prefixes over time, with session failure and session recovery marked]
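The check described here can be sketched as a scan over a time series of per-session prefix counts. The talk does not quantify "dramatically", so the `change_fraction` threshold is an assumption, and the function name and input format are hypothetical.

```python
def detect_session_resets(counts, change_fraction=0.5):
    """Flag sharp changes in the number of prefixes whose best route uses a session.

    counts: time-ordered list of (timestamp, n_prefixes) for one eBGP session.
    change_fraction: hypothetical threshold; a step counts as dramatic when
                     it changes the count by at least this fraction of the
                     larger of the two values.
    Returns a list of (timestamp, "failure" | "recovery").
    """
    changes = []
    for (_, prev), (ts, cur) in zip(counts, counts[1:]):
        baseline = max(prev, cur)
        if baseline == 0:
            continue  # session carried no best routes before or after
        if abs(cur - prev) / baseline >= change_fraction:
            changes.append((ts, "failure" if cur < prev else "recovery"))
    return changes
```

A sudden collapse in the count suggests a session failure, and a sudden rebound suggests the session recovered.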

19 Hot-Potato Changes
“Hot-potato routing” = route to closest egress point
Caused “internal disruption” events
Validation with OSPF measurement (95%) [Teixeira et al – SIGMETRICS’ 04]
[Figure: ISP with routers A, B and C, destination P, and numeric labels 9, 11, 10]
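Hot-potato routing and the changes it produces can be illustrated with a short example. The helper names and the IGP cost values below are hypothetical and are not read off the slide's figure; the only rule taken from the talk is "route to closest egress point".

```python
def closest_egress(igp_cost):
    """Return the egress point with the lowest IGP cost.

    igp_cost maps egress router name -> IGP distance from the ingress router.
    Ties resolve to the first such entry in the dict (an arbitrary choice here).
    """
    return min(igp_cost, key=igp_cost.get)


def hot_potato_change(cost_before, cost_after):
    """Return (old_egress, new_egress) if an IGP cost change moves the
    closest egress point, or None if the choice stays the same."""
    before = closest_egress(cost_before)
    after = closest_egress(cost_after)
    return (before, after) if before != after else None


if __name__ == "__main__":
    # Illustrative costs: the path to B becomes more expensive than the path to C.
    print(hot_potato_change({"B": 5, "C": 8}, {"B": 12, "C": 8}))
```

No BGP route changes in this scenario; the shift comes entirely from the internal topology, which is why such events show up as internal disruptions.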

20 Traffic Impact Prediction
Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations
Solution: weigh each cluster by traffic volume
[Pipeline stage: Clusters + Netflow Data -> Traffic Impact Prediction -> Large Disruptions]

21 Traffic Impact Prediction
Traffic weight
  Per-prefix measurement from Netflow
  10% of prefixes account for 90% of traffic
Traffic weight of a cluster
  The sum of the traffic weights of its prefixes
A small number of large clusters have large traffic weight
  Mostly session resets and hot-potato changes
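The cluster weight defined above is a plain sum, sketched below. The function names, the treatment of prefixes with no measured traffic as weight 0, and the `threshold` used to pick out large disruptions are assumptions for illustration; the talk does not state a cutoff.

```python
def cluster_weight(cluster_prefixes, prefix_weight):
    """Sum the traffic weights of the prefixes in one cluster.

    prefix_weight maps prefix -> fraction of total traffic (from flow data).
    Prefixes without a measurement are assumed to carry no traffic.
    """
    return sum(prefix_weight.get(p, 0.0) for p in cluster_prefixes)


def large_disruptions(clusters, prefix_weight, threshold):
    """Return (weight, cluster_name) pairs at or above a hypothetical
    threshold, heaviest first.

    clusters maps a cluster name to the prefixes it contains.
    """
    weighted = ((cluster_weight(prefixes, prefix_weight), name)
                for name, prefixes in clusters.items())
    return sorted((pair for pair in weighted if pair[0] >= threshold), reverse=True)
```

Because a small share of prefixes carries most of the traffic, ranking clusters this way quickly separates the few that matter from the many that do not.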

22 Performance Evaluation
Memory
  Static memory: “current routes”, 600 MB
  Dynamic memory: “clusters”, 300 MB
Speed
  99% of 1-second intervals of updates can be processed within 1 second
  Occasional execution lag
  Every 70-second interval of updates can be processed within 70 seconds
Measurements were based on a 900 MHz CPU

23 Conclusion
BGP troubleshooting system
  Fast, online fashion
  Operators’ concerns (reachability, flapping, traffic)
  Significant information reduction: millions of updates -> a few dozen large disruptions
Uncovered important network behaviors
  Hot-potato changes
  Session resets
  Persistent flapping prefixes

