Troubleshooting Wireless Mesh Networks Victor Bahl joint work with Lili Qiu, Ananth Rao (UCB) & Lidong Zhou Microsoft Research April.


1 Troubleshooting Wireless Mesh Networks Victor Bahl bahl@microsoft.com joint work with Lili Qiu, Ananth Rao (UCB) & Lidong Zhou Microsoft Research April 1, 2004

2 Mesh Network Management ISO’s definition of network management: –Fault Management –Configuration Management –Security Management –Performance Management –Accounting Management “Network management is a process of controlling a complex data network so as to maximize its efficiency and productivity”

3 Goals Assist with mesh router configuration. Reactive and proactive troubleshooting: –Investigate reported performance problems: time-series analysis to detect deviation from normal behavior –Localize and isolate trouble spots: collect and analyze traffic reports from mesh nodes –Determine possible causes for the trouble spots: interference, hardware problems, network congestion, malicious nodes, ... Respond to trouble spots: –Re-route traffic –Rate limit –Change topology via power control & directional antenna control –Flag environmental changes & problems
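The "time-series analysis to detect deviation from normal behavior" goal can be illustrated with a minimal sketch. The slides do not say which statistical test is used, so the trailing-window mean and standard-deviation rule below, along with the names `detect_deviations`, `window`, and `k`, are assumptions made only for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_deviations(samples, window=5, k=3.0):
    """Return indices of samples that deviate from the trailing-window
    mean by more than k standard deviations (hypothetical rule, not
    taken from the slides)."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Guard against a perfectly flat history (sigma == 0).
            if abs(x - mu) > k * max(sigma, 1e-9):
                flagged.append(i)
                continue  # keep the outlier out of the baseline
        history.append(x)
    return flagged


# Throughput samples in Mbps; the drop to 3.0 is the injected anomaly.
throughput = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 3.0, 10.0]
anomalies = detect_deviations(throughput)
```

Skipping flagged samples when updating the window keeps one bad reading from inflating the baseline variance and hiding the next one.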

4 Nomenclature Mesh Management Module (M3) –Runs on every node Mesh Management Server (MMS) –Runs on gateway or designated nodes Mesh Network Management Protocol (MNMP) –Protocol (similar to SNMPv3) between M3 and MMS

5 Focus of this talk Data Gathering & Distribution, Data Cleaning, Fault Isolation & Diagnosis

6 Challenges in Fault Diagnosis Characteristics of multi-hop wireless networks –Unpredictable physical medium, prone to link errors –Network topology is dynamic –Resource limitation calls for a diagnosis approach with low overhead –Vulnerable to link attacks Identifying root causes –Just knowing link statistics is insufficient –Signature Based Techniques don’t work well –Determining normal behavior is hard Handling multiple faults –Complicated interactions between faults and traffic, and among faults themselves

7 Previous Approaches to Fault Diagnosis Protocols for network management: ANMP [singh99], Guerrilla [shen02]. Detecting routing and MAC misbehavior: Watchdog & pathrater [Baker00], MACMis [Vaidya03]. Fault management in infrastructure mode: AirWave, AirDefense, UniCenter, Symbol’s WNMS, IBM’s WSA, Wibhu’s SpectraMon, …

8 Our Approach Use a network simulator as a real-time diagnostic tool

9 Fault Detection, Isolation & Diagnosis Process [Figure: process diagram. Stages: Collect Data, Clean Data, Diagnose Faults, Simulate. Data labels: Raw Data, Measured Performance, Routes, Link Loads, Signal Strength, Inject Candidate Faults, Performance Estimate, Root Causes. Components: Agent Module, Manager Module. Data sources: SNMP MIBs, Performance Counters, WRAPI, MCL, NativeWiFi.]

10 Root Cause Analysis Module

11 Our Fault Diagnosis Framework Advantages –Flexible & customizable for a large class of networks –Captures complicated interactions within the network, between the network & environment, and among multiple faults –Extensible in its ability of detecting new faults –Facilitates what-if analysis Challenges –To accurately reproduce the behavior of the network inside a simulator –To build a fault diagnosis technique using the simulator as a diagnosis tool

12 Handling the Challenges Reproducing network behavior: identify the set of traces to collect; rule out erroneous data from the trace; drive the simulator with the cleaned traces. Building fault diagnosis: use performance results from trace-driven simulation to establish the normal behavior; deviation from the normal behavior indicates a potential fault; identify root causes by efficiently searching the fault space to reproduce the faulty symptoms

13 Why Simulator? Flow 1: 2.5 Mbps, Flow 2: 0.23 Mbps, Flow 3: 2.09 Mbps, Flow 4: 0.17 Mbps, Flow 5: 2.55 Mbps

14 Simulator Accuracy: RF Propagation RF propagation model versus measured signal strengths for IEEE 802.11a cards from different vendors

15 Simulator Accuracy Experiments –A single one-hop UDP flow –2 UDP flows within communication range –2 UDP flows within interference range –1 UDP flow with 2 hops where the src. & dest. are within communication range –1 UDP flow with 2 hops where the src. & dest. are within interference range but not communication range

16 Simulator Accuracy: Throughput Estimated versus actual throughput when channel conditions are good (IEEE 802.11a)

17 Simulator Accuracy: Throughput (2) Estimated matches measured throughput till the channel conditions become poor

18 Simulator Accuracy: Throughput
No. of Walls | Loss Rate | Measured Throughput | Simulated Throughput
4 | 11.0 % | 15.52 Mbps | 15.94 Mbps
5 | 7.01 % | 12.56 Mbps | 14.01 Mbps
6 | 3.42 % | 12.97 Mbps | 11.55 Mbps
Estimated matches measured throughput for poor channel conditions when loss rate is incorporated

19 How Stable is the Channel? Under good environmental conditions, received signal strength remains stable

20 Data Collection What should we collect? –Network Topology/Connectivity Info (Neighbor Table) –Noise level & signal strength –Traffic load to direct neighbor –Loss rate to direct neighbor (retransmission count)

21 Data Distribution Design Goal Minimize bandwidth consumption Techniques –Dynamic scoping Each node takes a local view of the network The coverage of the local view adapts to traffic patterns –Adaptive monitoring Minimize measurement overhead in normal case Change update period Push and pull –Delta compression –Multicast
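Two of the techniques listed above, delta compression and a changing update period, can be sketched as follows. The slides give no wire format or timing policy, so the report fields, the helper names (`delta_encode`, `delta_apply`, `next_update_period`), and the halve/double policy are hypothetical and serve only to illustrate the idea.

```python
def delta_encode(previous, current):
    """Keep only the fields that changed since the last report,
    plus the names of fields that disappeared."""
    changed = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    removed = [k for k in previous if k not in current]
    return {"changed": changed, "removed": removed}


def delta_apply(previous, delta):
    """Rebuild the full report on the receiving side."""
    state = dict(previous)
    state.update(delta["changed"])
    for k in delta["removed"]:
        state.pop(k, None)
    return state


def next_update_period(period, changed_fields, min_period=1.0, max_period=60.0):
    """Assumed policy: report faster while things change, back off when quiet."""
    if changed_fields:
        return max(min_period, period / 2)
    return min(max_period, period * 2)


# Hypothetical per-neighbor counters reported by one mesh node.
last_report = {"n2_sent": 100, "n2_rssi": -60, "n3_sent": 5}
new_report = {"n2_sent": 140, "n2_rssi": -60, "n4_sent": 1}

delta = delta_encode(last_report, new_report)
rebuilt = delta_apply(last_report, delta)
```

Only the changed counter, the new neighbor, and the departed neighbor travel over the air; the unchanged signal-strength field is not resent.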

22 Management Overhead [Figure labels: 40 Kb/sec, 25 Kb/sec, 15 Kb/sec; Avg: 1 to 5 hops] BW requirement does not go up much with network size. Info distributed: routing changes, traffic counters (e.g. pkts. sent & rcv.), signal strength

23 Measurement Overhead on Throughput

24 Data Cleaning Data may not be pristine. Why? –Liars, malicious users –Missing data –Measurement errors Clean the Data –Detect Liars Assumption: most nodes are honest Approach: –Neighborhood Watch –Find the smallest number of lying nodes to explain inconsistency in traffic reports –Smoothing & Interpolation

25 Example: Resiliency against Liars/Lossy Links Problem: –Identify nodes that report incorrect information (liars) –Detect lossy links Assume: –Nodes monitor neighboring traffic, build traffic reports and periodically share info –Most nodes provide reliable information Challenge: –Wireless links are error prone and unstable Approach: –Find the smallest number of lying nodes to explain inconsistency in traffic reports –Use the consistent information to estimate link loss rates Results
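The "smallest number of lying nodes" approach can be sketched as a small search problem. The sketch assumes a simplified report format, one packet count per (reporter, link) pair, and treats two reports on the same link as inconsistent when they differ by more than a tolerance; these modeling choices and the function names are assumptions, not the authors' implementation. The exhaustive search is only practical for small neighborhoods.

```python
from itertools import combinations


def inconsistent_pairs(reports, tolerance=0):
    """reports: {(reporter, link): packet_count}.
    Return the pairs of reporters that disagree about the same link."""
    by_link = {}
    for (reporter, link), count in reports.items():
        by_link.setdefault(link, []).append((reporter, count))
    pairs = set()
    for observations in by_link.values():
        for (r1, c1), (r2, c2) in combinations(observations, 2):
            if abs(c1 - c2) > tolerance:
                pairs.add(frozenset((r1, r2)))
    return pairs


def smallest_liar_set(reports, tolerance=0):
    """Smallest set of nodes such that every disagreement involves
    at least one node in the set (brute force, smallest size first)."""
    pairs = inconsistent_pairs(reports, tolerance)
    nodes = sorted({n for p in pairs for n in p})
    for size in range(len(nodes) + 1):
        for candidate in combinations(nodes, size):
            chosen = set(candidate)
            if all(p & chosen for p in pairs):
                return chosen
    return set()


# Hypothetical reports: C disagrees with everyone else on both links.
reports = {
    ("A", "A->B"): 100, ("B", "A->B"): 100, ("C", "A->B"): 60,
    ("C", "C->D"): 50, ("D", "C->D"): 80, ("B", "C->D"): 80,
}
liars = smallest_liar_set(reports)
```

Because wireless links are lossy, honest nodes rarely agree exactly, so a non-zero `tolerance` would be needed in practice.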

26 Fault Diagnosis Algorithm
1. Initialization: diagnosed fault set F = { }
2. Forward addition: while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) { Find a candidate fault that explains the mismatch between current and predicted performance the most, and add it to F }
3. Backward deletion: while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) { Find a fault in F that explains the mismatch the least. Delete it from F if excluding it results in little change }
4. Report F

27 Fault Diagnosis Algorithm (Cont’d) What does “fault A explains the mismatch between current and predicted performance the most” mean? –Diff(MeasuredPerf, PredictedPerf(with fault A)) is smallest –Probability(MeasuredPerf|Fault A) is largest These two criteria give us two effective search algorithms
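The forward-addition and backward-deletion search, using the first criterion (smallest Diff), can be sketched as below. The real system drives a network simulator; here `toy_simulate` and `toy_diff` are hypothetical stand-ins, and the sketch assumes that backward deletion removes a fault whenever the fit stays within the threshold without it. It is an illustration under those assumptions, not the authors' code.

```python
def diagnose(measured, candidates, simulate, diff, threshold):
    """Greedy search over the fault space (sketch)."""
    faults = set()
    # Forward addition: add the fault that shrinks the mismatch the most.
    while True:
        current = diff(measured, simulate(faults))
        remaining = set(candidates) - faults
        if current <= threshold or not remaining:
            break
        best = min(remaining,
                   key=lambda f: diff(measured, simulate(faults | {f})))
        if diff(measured, simulate(faults | {best})) >= current:
            break  # no candidate improves the fit
        faults.add(best)
    # Backward deletion: drop faults whose removal changes little.
    while faults:
        least = min(faults,
                    key=lambda f: diff(measured, simulate(faults - {f})))
        if diff(measured, simulate(faults - {least})) > threshold:
            break
        faults.remove(least)
    return faults


NODES = ("n1", "n2", "n3", "n4")
NORMAL_MBPS, FAULTY_MBPS = 2.0, 0.5  # made-up throughput levels


def toy_simulate(faults):
    """Stand-in simulator: a faulty node's throughput collapses."""
    return {n: (FAULTY_MBPS if n in faults else NORMAL_MBPS) for n in NODES}


def toy_diff(measured, predicted):
    """Sum of absolute per-node throughput differences."""
    return sum(abs(measured[n] - predicted[n]) for n in measured)


measured = {"n1": 1.95, "n2": 0.55, "n3": 2.0, "n4": 0.5}
diagnosed = diagnose(measured, set(NODES), toy_simulate, toy_diff, threshold=0.2)
```

The second criterion on the slide would replace `toy_diff` with a likelihood of the measured performance given the candidate fault, maximized rather than minimized.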

28 Performance
Number of faults | 4 | 6 | 8 | 10 | 12 | 14
Coverage | 1 | 1 | 0.75 | 0.7 | 0.92 | 0.86
False Positive | 0 | 0 | 0 | 0 | 0.25 | 0.29
Faults detected: - Random packet dropping - MAC misbehavior - External noise. 25 node random topology

29 Performance Evaluation Measurements show that performance from trace-driven simulation matches reality. We are able to diagnose random packet dropping, external noise sources, and MAC misbehavior –Diagnosed over 10 simultaneous faults of multiple types in a simulated 49-node network with 80% coverage and close to 0 false positives –Implemented our approach in a small multihop IEEE 802.11a testbed, and showed it can diagnose random packet dropping

30 What-if Analysis Improvement on removing flows
Action | Total Throughput (Mbps)
None | 1.064
Reduce Flow 8 by ½ | 1.148
Re-route Flow 8 around grid boundary | 1.217
Increase power from 15 dBm to 20 dBm | 0.99
Increase power from 15 dBm to 25 dBm | 1.661
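The what-if comparison amounts to ranking candidate actions by their change in total throughput relative to taking no action. The short sketch below uses the throughput values from this slide; the function name `rank_actions` and the percentage-gain metric are assumptions added for illustration.

```python
def rank_actions(baseline_mbps, outcomes):
    """Return (action, percent change vs. baseline) sorted best first."""
    gains = [(action, (mbps - baseline_mbps) / baseline_mbps * 100.0)
             for action, mbps in outcomes.items()]
    return sorted(gains, key=lambda item: item[1], reverse=True)


baseline = 1.064  # total throughput in Mbps with no action taken
outcomes = {
    "Reduce Flow 8 by 1/2": 1.148,
    "Re-route Flow 8 around grid boundary": 1.217,
    "Increase power from 15 dBm to 20 dBm": 0.99,
    "Increase power from 15 dBm to 25 dBm": 1.661,
}

ranking = rank_actions(baseline, outcomes)
for action, gain in ranking:
    print(f"{action}: {gain:+.1f}%")
```

The ranking makes the non-monotonic effect of transmit power visible: raising power to 20 dBm lowers total throughput, while raising it to 25 dBm gives the largest gain of the listed actions.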

31 Mesh Visualization Module

32

33 Thanks! http://www.research.microsoft.com/sn/mesh

34 Backup

35 Detection of Intentional Packet Drops Scenario - 49 node network - Randomly pick nodes that drop packets

