Slide 1/20 Automatic Problem Localization via Multi-dimensional Metric Profiling Ignacio Laguna 1, Subrata Mitra 2, Fahad A. Arshad 2, Nawanol Theera-Ampornpunt.


1 Slide 1/20 Automatic Problem Localization via Multi-dimensional Metric Profiling
Ignacio Laguna¹, Subrata Mitra², Fahad A. Arshad², Nawanol Theera-Ampornpunt², Zongyang Zhu², Saurabh Bagchi², Samuel P. Midkiff², Mike Kistler³, Ahmed Gheith³
¹ LLNL, ² Purdue University, ³ IBM Research Austin

2 Slide 2/20 Debugging large-scale systems is difficult
Bugs cause losses of millions of dollars.
Not all bugs can be eliminated.
Quick detection and diagnosis are needed.

3 Slide 3/20 Observation: Bugs Change Metric Behavior
Hadoop DFS file-descriptor leak in version 0.17
Correlations between metrics differ when the bug manifests
[Figure: healthy run vs. unhealthy run; the behavior is different]
Patch:
    +   } finally {
    +     IOUtils.closeStream(reader);
    +     IOUtils.closeSocket(dn);
    +     dn = null;
    +   }
      } catch (IOException e) {
        ioe = e;
        LOG.warn("Failed to connect to " + targetAddr + "...");

4 Slide 4/20 Diagnosis process: high-level idea
Manifested-in-metrics (MM) bugs: abnormal temporal pattern in one of the metrics
[Diagram: a program split into code regions, each profiled on Metric 1, Metric 2, Metric 3, …, Metric 100; abnormal code blocks are highlighted]

5 Slide 5/20 ORION: Framework for localizing origin of MM faults

6 Slide 6/20 ORION: Framework for localizing origin of MM faults
Overview of the workflow
Inputs: normal run, failed run
1. Select metrics: filter out noise, keep the ones that show a trend
2. Find abnormal windows: when the correlation model of the metrics broke
3. Find abnormal metrics: those that contributed most to the model breaking
4. Find abnormal code regions: instrumentation in the code is used to map metric values to code regions

7 Slide 7/20 Measurements for various metrics
Collect from different layers: hardware, OS, middleware, application
Use open-source monitoring tools: PAPI, /proc, and other middleware- and application-level information
Some examples:
– H/W: cache related, branching related, LD/ST counts
– OS: CPU/memory usage, context switches, file descriptors, disk I/O, network packets
– Middleware: busy threads, request processing time
– Application: per-servlet stats, exception count
There is no exhaustive list of metrics; include whatever might be relevant
ORION addresses the curse of dimensionality and filters out noise
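As an illustration of the OS layer, a few of the metrics above can be read from `/proc/<pid>/status` on Linux. The sketch below is an assumption about how such a collector might look, not part of ORION; the function name `parse_proc_status` and the default field selection are hypothetical.

```python
def parse_proc_status(text, fields=("VmRSS", "Threads",
                                    "voluntary_ctxt_switches",
                                    "nonvoluntary_ctxt_switches")):
    """Extract selected numeric fields from /proc/<pid>/status content."""
    wanted = set(fields)
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            # Values look like "  1234 kB" or "5"; keep the leading number.
            metrics[key] = int(value.split()[0])
    return metrics
```

On a Linux host, one sample of the current process could be taken with `parse_proc_status(open("/proc/self/status").read())`; a profiler would repeat this at a fixed interval and timestamp each sample.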

8 Slide 8/20 Why collect from different layers?
Faults come from: hardware, software, network
Bugs come from many components: application, libraries, OS and runtime system
It is necessary to monitor metrics from all layers
Example: array data of random nature
    Array.sort()
    for (int I = 0; I < Array.size; I++) {
        if (Array[I] > 50)
            do_some_thing
    }
With the sort commented out (//Array.sort()), the branch if (Array[I] > 50) is mispredicted: 2.4x slowdown

9 Slide 9/20 Application Profiling
Asynchronous: Process 1 (App) and Process 2 (Profiler)
– Metrics gathered by a separate process
– Lightweight, low interference with the app
– Requires offline processing to line up measurements
Synchronous: Process 1 (Profiler + App)
– Instruments binary code
– Collects measurements at the beginning and end of classes/methods
– Higher overhead, but more accurate
[Diagram: Function 1, Function 2, Function 3 shown in each mode]
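The offline line-up step of the asynchronous mode can be sketched as follows. This is a minimal illustration, assuming the application logs a timestamp whenever it enters a code region and that both processes share a clock; the function name `annotate_samples` and the data layout are hypothetical and not taken from ORION.

```python
from bisect import bisect_right


def annotate_samples(region_events, samples):
    """Attach to each metric sample the code region active at its timestamp.

    region_events: list of (timestamp, region_name) entry events, sorted by time.
    samples: list of (timestamp, metrics) from the separate profiler process.
    """
    starts = [t for t, _ in region_events]
    annotated = []
    for t, metrics in samples:
        # Index of the last region entered at or before time t.
        idx = bisect_right(starts, t) - 1
        region = region_events[idx][1] if idx >= 0 else None
        annotated.append((t, region, metrics))
    return annotated
```

Samples taken before the first recorded region entry are annotated with `None`. Very short code regions can fall between two samples, which is one reason the asynchronous mode is less accurate.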

10 Slide 10/20 Metric Selection for Accurate Diagnosis
Dimensionality reduction:
– Filter out redundant/noisy metrics
– Reduce computational overhead for subsequent steps
Heuristic based on PCA to rank metrics by their contribution to explaining the overall variance
For more detailed analysis after PCA: choose only the metrics whose rank changes between the normal run and the abnormal run
– Do a lightweight filtering before a heavyweight detailed analysis
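The slides do not spell out the PCA heuristic. The sketch below makes a simplifying assumption: metrics are ranked by their absolute loading on the first principal component of the correlation matrix, which is found by power iteration. All function names are hypothetical, and a real implementation would likely use several components and a linear-algebra library.

```python
import math


def correlation_matrix(samples):
    """Pearson correlation matrix of the columns (metrics) of `samples`."""
    n = len(samples)
    normalized = []
    for col in zip(*samples):
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        # A constant metric carries no variance; map it to all zeros.
        normalized.append([d / norm for d in dev] if norm else [0.0] * n)
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in normalized]
            for ci in normalized]


def first_component(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    m = len(matrix)
    # Non-uniform start, so that a component such as (1, -1) is not missed.
    vec = [float(i + 1) for i in range(m)]
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(m)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    return vec


def rank_metrics(samples):
    """Metric indices ordered by absolute loading on the first component."""
    loadings = first_component(correlation_matrix(samples))
    return sorted(range(len(loadings)), key=lambda i: -abs(loadings[i]))


def metrics_with_rank_change(normal, faulty):
    """Metrics whose rank position differs between the two runs."""
    pos_n = {m: r for r, m in enumerate(rank_metrics(normal))}
    pos_f = {m: r for r, m in enumerate(rank_metrics(faulty))}
    return sorted(m for m in pos_n if pos_n[m] != pos_f[m])
```

Each row of `normal` and `faulty` is one sample of all metrics. Only the metrics returned by `metrics_with_rank_change` would be passed on to the heavier window analysis.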

11 Slide 11/20 Selecting Abnormal Window via Nearest-Neighbor (NN)
Traces (normal run and faulty run):
▪ Samples of all metrics, e.g. 3, 55, 47, 0.7, … and 2, 54, 45, 0.8, …
▪ Annotated with code region
Split each trace into windows of fixed size (Window 1, Window 2, Window 3, …)
Compute a Correlation Coefficient Vector (CCV) per window: [cc 1,2, cc 1,3, …, cc n-1,n]
– Example CCVs (normal run, faulty run): 0.2, 0.8, 0, -0.6, … and 0.1, 0.6, 0, -0.5, …
Use nearest-neighbor to find outliers
Repeat with different window sizes
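The window step can be sketched as below. The assumptions here are mine, not the slides': windows are non-overlapping, the coefficients are Pearson correlations, and the outlier score of a faulty-run window is the Euclidean distance from its CCV to the nearest normal-run CCV. Function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def ccv(window):
    """Correlation coefficient vector [cc(1,2), cc(1,3), ..., cc(n-1,n)].

    `window` is a list of samples; each sample is a list of metric values.
    """
    columns = list(zip(*window))
    return [pearson(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)]


def split_windows(trace, size):
    """Cut a trace into consecutive, non-overlapping windows of fixed size."""
    return [trace[k:k + size] for k in range(0, len(trace) - size + 1, size)]


def nn_distances(normal_trace, faulty_trace, size):
    """Distance from each faulty window's CCV to its nearest normal CCV."""
    normal = [ccv(w) for w in split_windows(normal_trace, size)]
    faulty = [ccv(w) for w in split_windows(faulty_trace, size)]
    return [min(math.dist(f, n) for n in normal) for f in faulty]
```

Windows with the largest distances are the outlier candidates. Calling `nn_distances` again with other values of `size` corresponds to repeating the search with different window sizes.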

12 Slide 12/20 Selecting Abnormal Metrics by Frequency of Occurrence
Steps:
1. Get the top-k abnormal windows
2. Rank the correlation coefficients (CC) by their contribution to the distance, Distance(CCV 1, CCV 2), for each window
3. Select the most frequent metric(s)
Example:
    Window      Correlation coefficient    Contribution to the distance
    Window X    CC 6,1                     0.1
                CC 5,1                     0.7
                CC 10,11                   0.2
    Window Y    CC 5,2                     0.5
                CC 7,2                     0.05
                CC 3,12                    0.3
    Window Z    CC 15,16                   0.5
                CC 8,20                    0.05
                CC 19,5                    0.8
Top contributors: CC 5,1, CC 5,2, CC 19,5
Abnormal metric: 5
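The three steps can be sketched as follows, assuming the contribution of a metric pair is its squared difference between the abnormal window's CCV and the nearest normal CCV. The slides do not define the contribution measure, so this choice and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_contributors(faulty_ccv, normal_ccv, n_metrics, top=3):
    """Metric pairs whose coefficients differ most between two CCVs."""
    pairs = list(combinations(range(n_metrics), 2))
    # Contribution of each pair to the squared Euclidean distance.
    contrib = [(f - n) ** 2 for f, n in zip(faulty_ccv, normal_ccv)]
    ranked = sorted(((c, p) for c, p in zip(contrib, pairs) if c > 0),
                    reverse=True)
    return [pair for _, pair in ranked[:top]]


def most_frequent_metrics(abnormal_windows, n_metrics, top=3):
    """Count how often each metric appears among the top contributing pairs.

    `abnormal_windows` is a list of (faulty_ccv, nearest_normal_ccv) tuples,
    one per top-k abnormal window.
    """
    counts = Counter()
    for faulty_ccv, normal_ccv in abnormal_windows:
        for i, j in top_contributors(faulty_ccv, normal_ccv, n_metrics, top):
            counts[i] += 1
            counts[j] += 1
    return counts.most_common()
```

The first entry of the returned list is the metric that occurs most often across the abnormal windows, analogous to metric 5 in the slide's example.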

13 Slide 13/20 Pinpointing anomalous code regions
Traces (normal run and faulty run), split into Window 1, Window 2, Window 3, …
– Sample rows: 3, 55, 47, 0.7, … / 2, 55, 46, 0.7, … / 2, 95, 45, 0.6, … / 3, 55, 47, 0.7, …
Steps (only for the anomalous metric):
1. Find the top-k abnormal windows
2. Build a histogram of code regions
3. Output the top-3 most frequent code regions
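The histogram step can be sketched in a few lines, assuming each sample in an abnormal window carries the code-region annotation described earlier. The function name and the data layout are hypothetical.

```python
from collections import Counter


def top_code_regions(abnormal_windows, k=3):
    """Return the k code regions seen most often in the abnormal windows.

    Each window is a list of (code_region, metric_value) samples for the
    anomalous metric only.
    """
    histogram = Counter(region for window in abnormal_windows
                        for region, _ in window)
    return [region for region, _ in histogram.most_common(k)]
```

With `k=3` this yields the top-3 output described on the slide.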

14 Slide 14/20 Example 1: File descriptor leak in Hadoop DFS
45 Java classes and 358 methods were instrumented inside the hadoop/dfs package
Top abnormal metrics:
1. Minflt
2. Num_file_desc
…
Top abnormal code region:
1. /dfs/DFSClient
    Rank    Class            Average # file descriptors
    1       NamespaceInfo    6
    ...
    8       DFSClient        1.16
Patch:
    +   } finally {
    +     IOUtils.closeStream(reader);
    +     IOUtils.closeSocket(dn);
    +     dn = null;
    +   }
      } catch (IOException e) {
        ioe = e;
        LOG.warn("Failed to connect to " + targetAddr + "...");

15 Slide 15/20 Example 2: Failures in distributed regression test framework (MHM)
MHM: an application from IBM for testing architecture simulators
NFS connection fails intermittently
– Emulated by dropping outgoing NFS packets
Code annotation
– Asynchronous, manual

16 Slide 16/20 MHM: debugging results in asynchronous mode
Abnormal code region is selected almost correctly
– Reason for inaccuracy: the code region is very small
Abnormal metrics are correlated with the failure origin: the NFS connection
[Figure: abnormal code regions given by the tool vs. where the problem occurs]

17 Slide 17/20 Example 3: Debugging an unknown bug
StationsStat: multi-tier distributed application at Purdue
– Used by students to check the availability of workstations in computer labs throughout campus
Periodic failure: the application became unresponsive
Restart: the problem appears to go away temporarily
Particularly challenging: there was no error-free data
– ORION used data segments collected right after a restart to build its model
Anomalous metric that ORION identified: # active SQL connections
The SQL driver was in fact buggy; an upgrade fixed the problem

18 Slide 18/20 Overhead and performance
Profiling overhead is a function of the number and type of metrics collected
Asynchronous profiling has very little overhead but is less accurate and requires offline alignment
Localization time is a function of the size of the available profile log files
    Application     Training time (sec)    Localization time (sec)
    Hadoop          31.06                  213.06
    MHM             12.48                  16.66
    StationsStat    217.43                 1645.1

19 Slide 19/20 Conclusion
We present ORION, a tool for root cause analysis of MM failures
Pinpoints the metric that is highly affected by a failure and highlights the corresponding code regions
ORION models application behavior through pairwise correlation of multiple metrics
Our case studies with different applications show the effectiveness of the tool in detecting real-world bugs

20 Slide 20/20 Thank you. Questions?

21 Slide 21/20 Future directions
Improve scalability
Create a library for collecting various metrics as part of the tool

22 Slide 22/20 Current debugging techniques
Interactive, requiring manual intervention: gdb, TotalView
Memory and CPU profilers can identify bottlenecks: gprof, .NET memory profiler
Log analysis
Model checking
Some tools use a threshold on a single metric to identify bugs
– In advanced cases, thresholds are learned through training

