Debugging Intermittent Issues


1 Debugging Intermittent Issues
Lloyd Moore, President
Seattle Robotics Society, December 2016

2 Agenda
The problem with intermittent issues
Getting a baseline
Basic statistics
Making a change
Is it fixed?
Dealing with multi-causal issues

3 Problem with Intermittent Issues
Fundamentally you do not know what is causing the problem, therefore you cannot control for it
You never know from one run to the next whether the issue will show up
From a single run you cannot say anything, so you will need multiple runs
Once you have multiple runs you need statistics to identify and describe any change in behavior

4 Getting a Baseline
A baseline is a reference point or configuration where you control any factor that might be related to the issue
Since you really don't know what is causing the issue, you likely won't actually control for it, but at least you minimize the number of variables affecting future measurements
Ideally the goal is simply to control as much of the system as possible

5 Key Point
Intermittent failures are NOT really intermittent!!!
They are caused by some varying condition that you are not measuring, observing, and controlling for
If the varying condition happens in a particular way, you get a failure – every time
The debugging problem is to find the condition that is varying, quantify how it varies, then control it

6 Getting a Baseline
What you need to control will vary based on what you are debugging
"Pure software" issues will often be easier to control than "network connected" issues or "hardware related" issues
Note that "pure software" issues are also far less likely to be intermittent, as there are far fewer sources of randomness
The most notable source of pure software randomness is system loading leading to race conditions
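As an illustration (not from the original slides), a minimal Python sketch of the kind of load-dependent race condition mentioned above: two threads increment a shared counter without a lock, and the final total is intermittently wrong depending on how the scheduler interleaves them.

    # Minimal race-condition sketch (illustrative only, not from the talk):
    # two threads update a shared counter without a lock, so some updates
    # are intermittently lost depending on thread scheduling and system load.
    import threading

    counter = 0

    def worker(iterations):
        global counter
        for _ in range(iterations):
            value = counter      # read
            value += 1           # modify
            counter = value      # write back (may overwrite another thread's update)

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Expected 200000, but the result is often smaller -- and it varies run to run.
    print("counter =", counter)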

7 Items to look at controlling
Version of the software you are running
Physical location of the target system
Other processes running on the target system
Network traffic to the target system – ideally connect the system to a wired network connection on an isolated network
Other processes, and response time, of remotely connected systems that the target system depends on

8 More items to control
Data being manipulated by the system
State of the target system – does each run of the system begin in the same state, including the existence and size of debugging files
Depending on what is going on ANYTHING could affect the issue: lighting, time of day, people in the room, temperature – be creative
You may not get everything on the first pass – for really hard problems establishing a baseline becomes iterative
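Because the baseline is built iteratively, it helps to record the exact configuration of every run so later passes stay comparable. Below is a rough sketch of that idea; the field names and file path are made-up examples, not part of the talk.

    # Hedged sketch: snapshot the assumed baseline configuration next to each
    # run's outcome so iterative baseline changes remain comparable.
    import json
    import platform
    import time

    def snapshot_baseline(extra_factors):
        """Record everything we currently know how to control or measure."""
        return {
            "timestamp": time.time(),
            "host": platform.node(),
            "python": platform.python_version(),
            # extra_factors holds whatever else you chose to control or log:
            # software version, input data set, network setup, temperature, ...
            **extra_factors,
        }

    def log_run(outcome, extra_factors, path="baseline_runs.jsonl"):
        record = {"outcome": outcome, "config": snapshot_baseline(extra_factors)}
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Example use after one test run:
    log_run("fail", {"sw_version": "1.4.2", "dataset": "nightly", "network": "isolated-wired"})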

9 Can't Control It – Measure it
Many times you will not be able to control a factor that may be important – time of day, for example
If you cannot control a factor, make every attempt to measure and record its value during your test runs
Oftentimes, after hours of testing and looking at something else, a pattern will emerge in the collected data that strongly hints at the real issue
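One way to look for such a pattern after the fact – assuming runs were logged in a format like the earlier sketch – is to group outcomes by a measured-but-uncontrolled factor such as hour of day. The file name and fields below are assumptions carried over from that sketch.

    # Hedged sketch: look for a relationship between failures and an
    # uncontrolled, but measured, factor (hour of day here).
    import json
    from collections import Counter
    from datetime import datetime

    failures_by_hour = Counter()
    runs_by_hour = Counter()

    with open("baseline_runs.jsonl") as f:
        for line in f:
            record = json.loads(line)
            hour = datetime.fromtimestamp(record["config"]["timestamp"]).hour
            runs_by_hour[hour] += 1
            if record["outcome"] == "fail":
                failures_by_hour[hour] += 1

    for hour in sorted(runs_by_hour):
        rate = failures_by_hour[hour] / runs_by_hour[hour]
        print(f"{hour:02d}:00  runs={runs_by_hour[hour]:3d}  failure rate={rate:.0%}")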

10 Baseline Failure Rate
Part of the baseline configuration is the rate at which you experience the failure condition
Once you have the test environment as stable and quantified as possible, you need to run the system multiple times and quantify how often you see the failure
Three basic outcomes here:
The failure no longer occurs
The failure now happens all the time
The failure happens X of Y trials
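A minimal harness for measuring the baseline failure rate might look like the following sketch; run_system_once() is a hypothetical stand-in for however one run of the target system is invoked and checked.

    # Hedged sketch of a baseline measurement loop. run_system_once() is a
    # hypothetical hook that performs one run and returns True on success.
    def run_system_once():
        raise NotImplementedError("invoke the target system and check the result here")

    def measure_failure_rate(trials):
        failures = 0
        for _ in range(trials):
            if not run_system_once():
                failures += 1
        rate = failures / trials
        print(f"{failures} failures in {trials} trials -> observed failure rate {rate:.0%}")
        return rate

    # Example: 10 trials, as in the slides that follow.
    # measure_failure_rate(10)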

11 Baseline Failure Rate
If you no longer have an intermittent failure (no failure, or constant failure), something you changed in setting up the baseline is likely related to the issue – this is good, you can now systematically vary the baseline setup to debug
Most likely you will still get some failures; for the next set of slides we will assume that you have run the system 10 times and saw 3 failures...

12 Basic Statistics
3 failures in 10 runs is: 3/10 = 0.3 = 30% failure rate
Success rate is: 1.0 – 0.3 = 0.7 = 70%
Note that this number is an APPROXIMATION – you can really never know the actual rates
What is the chance of testing 10 times and not seeing a single failure?
Chance of success on first try * chance of success on second try * ...
0.7 * 0.7 * ... * 0.7 = 0.7^10 ≈ 0.028 = 2.8%
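The same arithmetic as a quick sketch, using the numbers from this slide:

    # The arithmetic from the slide, as a quick check.
    failures, trials = 3, 10
    failure_rate = failures / trials          # 0.3 -> 30%
    success_rate = 1.0 - failure_rate         # 0.7 -> 70%

    # Chance of 10 consecutive successes if the true success rate really is 70%:
    p_no_failure_in_10 = success_rate ** 10   # ~0.028 -> about 2.8%
    print(f"failure rate: {failure_rate:.0%}")
    print(f"success rate: {success_rate:.0%}")
    print(f"chance of seeing no failure in 10 runs: {p_no_failure_in_10:.1%}")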

13 How many times to test?
So if we CAN run the test 10 times and not see a failure, how many times do we really need to test?
Yes, there is a formula to estimate this – I have never actually used it in practice, as several of the variables involved are pretty hard to estimate, things like degrees of freedom
Simply use a rule of thumb here – you want to get the chance of not seeing a failure when there really is one below either 5% or 1%

14 How many times to test?
If I want to get below 1%, will 15 trials be enough?
0.70^15 ≈ 0.0047 = 0.47% – Yep, and this approach is good enough for day-to-day work
Those who are more math inclined could also solve the equation 0.01 = 0.70^X for X, then of course make sure to take the next full integer!
(Hint: you need log_b(a^r) = r·log_b(a), which gives X = log(0.01) / log(0.70) ≈ 12.9, i.e. 13 trials)
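A sketch of the same calculation in Python (the 1% target is the rule of thumb from the previous slide):

    # Solve 0.01 = 0.70 ** X for X and round up to the next whole trial.
    import math

    success_rate = 0.70   # per-run success rate estimated from the baseline
    target = 0.01         # acceptable chance of seeing no failure when one really exists

    trials_needed = math.ceil(math.log(target) / math.log(success_rate))
    print(trials_needed)          # 13
    print(0.70 ** 15)             # ~0.0047, the 0.47% quoted on the slide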

15 OK now what?
We have a controlled test setup
We have an approximation of the failure rate
We may also have some clues about what may be causing the problem
The next step is to attempt a change and observe the failure rate

16 Is it fixed?
I made a change and now I don't see the failure condition any longer – I'm done, right?
Hmm – NO!
From the earlier slides: what is the chance of testing 10 times and not seeing a single failure?
Chance of success on first try * chance of success on second try * ...
0.7 * 0.7 * ... * 0.7 = 0.7^10 ≈ 0.028 = 2.8%

17 Is it fixed?
One issue with problems that display intermittently is that you can never really know for sure if they are fixed
The best you can do is estimate the probability that they are fixed and decide if that is good enough for your application
You can always run more trials to increase the certainty that you really fixed the problem

18 Is it fixed?
Assume our original 30% failure rate, 70% success rate:

Trials   Formula   Chance of not seeing the failure
10       0.7^10    2.82%
15       0.7^15    0.474%
20       0.7^20    0.079%
25       0.7^25    0.013%
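The table above can be recomputed with a few lines of Python:

    # Recompute the table: chance of zero failures in N trials if the true
    # per-run success rate stayed at 70% (i.e. the bug is NOT actually fixed).
    success_rate = 0.70
    print(f"{'Trials':<8}{'Formula':<10}Chance of not seeing the failure")
    for trials in (10, 15, 20, 25):
        chance = success_rate ** trials
        print(f"{trials:<8}{f'0.7^{trials}':<10}{chance:.3%}")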

19 Multi-causal Issues
I made a change and my failure rate decreased, but not to zero – now what?
This is a very good indication that you have either affected the original problem or have actually solved one problem while another remains
The key to sorting this out is to look for details on the actual failure
Oftentimes multiple issues have a similar-looking failure, but are really two different things

20 Multi-Causal Issues
Attempt to get more information on the exact nature of the remaining failure; things to look at:
Stack traces
Timing values – how long into the run?
Variation in results – cluster plots are helpful here
More than one trigger for the failure
Many times you will stumble upon this information while debugging the original issue
It is very common for intermittent issues to be multi-causal, as they are often ignored when first seen

21 Multi-Causal Issues
If you suspect you have more than one issue, set up a separate debugging environment for each suspected issue
You may also be able to make a list of the separate failure cases and failure rates, decomposing the original numbers
It is helpful to remember that the original failure rate is simply the sum of the individual failure rates
Work the issue with the highest failure rate first
Keep working down the list until all are solved
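A rough sketch of that decomposition; the failure signatures and counts below are made-up examples, not numbers from the talk.

    # Hedged sketch: decompose an observed failure rate into per-signature
    # rates and work the largest contributor first.
    from collections import Counter

    # One entry per failed run, keyed by whatever detail distinguishes the
    # failures (stack trace hash, timing bucket, trigger, ...). Hypothetical data:
    observed_failures = ["timeout", "crash_in_parser", "timeout", "timeout", "crash_in_parser"]
    total_runs = 40

    per_issue = Counter(observed_failures)
    overall_rate = sum(per_issue.values()) / total_runs   # sum of the individual rates

    print(f"overall failure rate: {overall_rate:.1%}")
    for signature, count in per_issue.most_common():      # highest rate first
        print(f"  {signature:<16} {count / total_runs:.1%}")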

22 Summary
Intermittent issues are not really intermittent
You need to track down what unknown variable is changing and handle that variation
A baseline system configuration with failure rates is key to telling if a change occurred
Just because you don't see the failure any more is NOT a guarantee that it is fixed
Multi-causal issues can be separated and worked as individual issues once you have details on the failure

23 Questions?

