
1 Distributed Problem Diagnosis: Sherlock and X-Trace

2 Troubleshooting Networked Systems
Hard to develop, debug, deploy, and troubleshoot
No standard way to integrate debugging, monitoring, and diagnostics

3 Status quo: device centric
[Slide shows raw per-device logs, e.g.:]
Firewall: ... 28 03:55:38 PM fire... / 28 03:55:39 PM fire...
Load Balancer: [04:03:23 2006] [notice] Dispatch s1 ... [04:04:47 2006] [crit] Server s3 down...
Web 1 / Web 2: 72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga ... 66.249.72.163 - - [20/Aug/2006:09:15:11 -0700] "GET /ga ...
Database: LOG: statement: select oid... / LOG: statement: SELECT COU... / LOG: statement: SELECT g2_...

4 Status quo: device centric
Determining paths:
–Join logs on time and ad-hoc identifiers
Relies on:
–Well-synchronized clocks
–Extensive application knowledge
Requires all operations to be logged to guarantee complete paths

5–8 Examples
[Request-path diagram built up across four slides: User → DNS Server → Proxy → Web Server]

9 Approaches to Diagnosis
Passively learn the relationships
–Infer problems as deviations from the norm
Actively instrument the stack to learn the relationships
–Infer problems as deviations from the norm

10 Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula

11 Enterprise Management: Between a Rock and a Hard Place
Manageability: stick with tried software and never change the infrastructure; cheap, but upgrades are hard, so forget about innovation!
Usability: keep pace with technology; expensive
–IT staff in the 1000s
–72% of the MS IT budget is staff
–Reliability issues: the cost of down-time

12 Well-Managed Enterprises Still Unreliable
[Chart: fraction of requests vs. response time of a Web server (ms): 85% normal, 10% troubled, 0.7% down]
10% of responses take up to 10x longer than normal
How do we manage evolving enterprise networks?

13 Current Tools Miss the Forest for the Trees
Monitor individual boxes or protocols
Flood the admin with alerts
Don't convey the end-to-end picture
[Diagram: Client, DNS, Authentication Server, Web Server, SQL Backend]
But the primary goal of enterprise management is to diagnose user-perceived problems!

14 Sherlock: instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

15–16 Challenges for the End-to-End Approach
Don't know what a user's performance depends on (e.g., a Web connection)
–Dependencies are distributed
–Dependencies are non-deterministic
Don't know which dependency is causing the problem
–Server CPU at 70%, link dropped 10 packets, but which affected the user?
[Diagram: Client, DNS, Auth. Server, Web Server, SQL Backend]

17 Sherlock's Contributions
Passively infers dependencies from logs
Builds a unified dependency graph incorporating network, server, and application dependencies
Diagnoses user problems in the enterprise
Deployed in a part of the Microsoft enterprise

18 Sherlock’s Architecture

19 Sherlock's Architecture
[Diagram: user observations from clients (Web1 1000 ms, Web2 30 ms, File1 timeout) + a dependency graph over clients, servers, and the network feed an inference engine, which outputs a list of troubled components]
Sherlock works for various client-server applications

20 How do you automatically learn such distributed dependencies?
[Diagram: client, Video Server, Data Store, DNS]

21–23 Sherlock Exploits Timing Info
Strawman: instrument all applications and libraries → not practical
Instead: if my client talks to B within Δt whenever it talks to C, treat the connections as candidate dependents
But co-occurrence alone gives false dependences when B is accessed frequently
Dependent iff Δt << the inter-access time, i.e., the co-occurrence happens with probability higher than chance (see the sketch below)
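
A minimal sketch of that co-occurrence test, assuming time-sorted per-client connection logs; the record layout, the 10 ms window, and the chance estimate are illustrative stand-ins, not Sherlock's exact statistics:

```python
from collections import defaultdict

def infer_dependencies(events, window=0.010):
    """events: time-sorted (timestamp, service) pairs from one client's
    connection log. Flags (A, B) pairs where accessing A shortly before
    accessing B happens more often than chance predicts."""
    counts = defaultdict(int)      # accesses per service
    co_counts = defaultdict(int)   # times B was seen within `window` after A
    for i, (t_a, a) in enumerate(events):
        counts[a] += 1
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break
            if b != a:
                co_counts[(a, b)] += 1

    duration = events[-1][0] - events[0][0] or 1.0
    deps = []
    for (a, b), n in co_counts.items():
        chance = counts[a] * window / duration   # P(a random window contains an A)
        observed = n / counts[b]                 # fraction of B accesses preceded by A
        if observed > chance:                    # keep only above-chance pairs
            deps.append((a, b, round(observed, 3)))
    return deps

# Toy trace: DNS lookups immediately precede video fetches.
trace = [(0.000, "DNS"), (0.002, "Video"), (5.000, "DNS"), (5.001, "Video"),
         (9.000, "Store")]
print(infer_dependencies(trace))   # [('DNS', 'Video', 1.0)]
```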

24–25 Sherlock's Algorithm to Infer Dependencies
–Infer dependent connections from timing
–Infer topology from traceroutes and configurations
[Dependency graph: "Bill watches Video" depends on Bill → DNS, Bill → Video, and Video → Store]
Works with legacy applications; adapts to changing conditions

26 But hard dependencies are not enough…

27 But hard dependencies are not enough… need probabilities
E.g., if Bill caches the server's IP, DNS can be down but Bill still gets the video
Sherlock uses the frequency with which a dependence occurs in the logs as its edge probability
[Dependency graph with edge probabilities: Bill → DNS p1 = 10%, Bill → Video p2 = 100%, Video → Store p3]
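
With the same logs bucketed into per-request windows (each window being the set of services the client contacted), the edge probability is just a frequency; a sketch with hypothetical names:

```python
def edge_probability(windows, parent, child):
    """Fraction of accesses to `child` whose request window also touched
    `parent` -- e.g., Bill -> DNS comes out near 10% when Bill's cache
    usually spares him the lookup, while Bill -> Video stays at 100%."""
    child_windows = [w for w in windows if child in w]
    if not child_windows:
        return 0.0
    return sum(parent in w for w in child_windows) / len(child_windows)

# Ten video requests, only one of which needed a DNS lookup first.
windows = [{"DNS", "Video"}] + [{"Video"}] * 9
print(edge_probability(windows, "DNS", "Video"))   # 0.1
```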

28 How do we use the dependency graph to diagnose user problems?

29–30 Diagnosing User Problems
Which components caused the problem? Need to disambiguate!
Disambiguate by correlating
–Across logs from the same client (Bill watches Video, Bill sees Sales)
–Across clients (Paul watches Video2, which shares the Store)
Prefer simpler explanations

31 Will Correlation Scale?

32 Will Correlation Scale? The Dependency Graph is Huge
Microsoft internal network:
–O(100,000) client desktops
–O(10,000) servers
–O(10,000) apps/services
–O(10,000) network devices
[Diagram: building network, campus core, corporate core, data center]

33 Will Correlation Scale?
Can we evaluate all combinations of component failures? The number of fault combinations is exponential: impossible to compute!

34–36 Scalable Algorithm to Correlate: Exponential → Polynomial
Only a few faults happen concurrently
–But how many is few? Evaluate enough to cover 99.9% of faults
–For the MS network, at most 2 concurrent faults gives 99.9% accuracy
Only a few nodes change state
–Re-evaluate a node only if an ancestor changes state
–Reduces the cost of evaluating a case by 30x-70x
(See the sketch below.)
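
A sketch of the bounded search this buys, assuming a caller-supplied explains() predicate standing in for Sherlock's probabilistic scoring over the dependency graph:

```python
from itertools import combinations

def localize(components, observations, explains, max_faults=2):
    """Best fault hypothesis of size <= max_faults (2 covers 99.9% here).
    observations: user-perceived results, e.g. ("Bill watches Video", "slow").
    explains(hypothesis, obs): does assuming these components failed
    account for that observation?"""
    best, best_score = (), float("-inf")
    for k in range(1, max_faults + 1):
        for hypothesis in combinations(components, k):
            score = sum(explains(hypothesis, o) for o in observations)
            score -= 0.01 * k              # prefer simpler explanations
            if score > best_score:
                best, best_score = hypothesis, score
    return best

comps = ["LinkA", "ServerB", "DNS"]
obs = [("Bill watches Video", "slow"), ("Paul watches Video2", "slow")]
server_b_explains = lambda hyp, o: "ServerB" in hyp   # toy predicate
print(localize(comps, obs, server_b_explains))        # ('ServerB',)
```

With 358 components this checks 358 + C(358, 2) = 64,261 hypotheses instead of 2^358; caching nodes whose ancestors did not change state (slide 36) cuts the per-hypothesis cost further.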

37 Results

38 Experimental Setup
Evaluated on the Microsoft enterprise network
Monitored 23 clients and 40 production servers for 3 weeks
–Clients are at MSR Redmond
–An extra host on each server's Ethernet segment logs packets
Busy, operational network
–Main intranet Web site and software distribution file server
–Load-balancing front-ends
–Many paths to the data center

39 What Do Web Dependencies in the MS Enterprise Look Like?

40–42 What Do Web Dependencies in the MS Enterprise Look Like?
[Diagrams built up across three slides: dependencies for "Client accesses Portal" and "Client accesses Sales", both involving the Auth. Server]
Sherlock discovers the complex dependencies of real apps

43 What Do File-Server Dependencies Look Like?
[Diagram: client accesses the software distribution server via dependencies on the File Server, Auth. Server, WINS, DNS, Proxy, and Backend Servers 1-4, with edge probabilities of 100%, 10%, 6%, 5%, 2%, 8%, 5%, 1%, and 0.3%]
Sherlock works for many client-server applications

44–45 Sherlock Identifies Causes of Poor Performance
Inference graph: 2565 nodes; 358 components that can fail
[Chart: component index vs. time (days), marking localized faults]
87% of problems localized to 16 components
Corroborated the three significant faults

46 Sherlock Goes Beyond Traditional Tools
[Chart: SNMP-reported utilization on a link flagged by Sherlock; problems coincide with utilization spikes]
Sherlock identifies the troubled link, but SNMP cannot!

48 X-Trace
X-Trace records the events in a distributed execution and their causal relationships
Events are grouped into tasks
–A task is a well-defined starting event and all that is causally related to it
Each event generates a report, binding it to one or more preceding events
Captures the full happens-before relation

49 X-Trace Output
Task graph capturing the task execution
–Nodes: events across layers and devices
–Edges: causal relations between events
[Example graph: HTTP Client → HTTP Proxy → HTTP Server, with TCP 1 / TCP 2 start and end events and IP router hops beneath each HTTP hop]

50 Basic Mechanism
Each event is uniquely identified within a task: [TaskId, EventId]
[TaskId, EventId] is propagated along the execution path
For each event, create and log an X-Trace report with enough info to reconstruct the task graph
[Example report: TaskID: T, EventID: g, Edge: from a, f]
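
A toy sketch of the mechanism, with in-process propagation and an in-memory list standing in for the report infrastructure (real X-Trace carries [TaskId, EventId] in protocol headers across hosts):

```python
import itertools

REPORTS = []                      # stand-in for the report collection service
_ids = itertools.count()

def log_event(task_id, parents):
    """Create an event bound to its preceding events and report it,
    mirroring 'X-Trace Report / TaskID: T / EventID: g / Edge: from a, f'."""
    event_id = f"e{next(_ids)}"
    REPORTS.append({"TaskID": task_id, "EventID": event_id, "Edges": parents})
    return event_id               # carried onward as [TaskId, EventId]

def task_graph(task_id):
    """Reconstruct the task graph for one task from collected reports."""
    return [(p, r["EventID"]) for r in REPORTS
            if r["TaskID"] == task_id for p in r["Edges"]]

# One request crossing client -> proxy -> server within task T.
a = log_event("T", [])            # client start event
g = log_event("T", [a])           # proxy event, bound to a
m = log_event("T", [g])           # server event, bound to g
print(task_graph("T"))            # [('e0', 'e1'), ('e1', 'e2')]
```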

51 X-Trace Library API
Handles propagation within the app, for threads and event-based code (e.g., libasync)
Akin to a logging API:
–The main call is logEvent(message)
The library takes care of event-id creation, binding, reporting, etc.
Implementations in C++, Java, Ruby, and JavaScript
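
A sketch of what such a library can hide from the application, assuming thread-local context propagation and print() as the report sink; the names are illustrative, not the actual X-Trace bindings:

```python
import itertools
import threading

_ctx = threading.local()          # current [TaskId, EventId] for this thread
_ids = itertools.count()

def start_task(task_id):
    _ctx.task, _ctx.last = task_id, []

def log_event(message):
    """The one call the application makes; the library mints the event id,
    binds it to the preceding event, and emits the report."""
    event_id = f"e{next(_ids)}"
    print({"TaskID": _ctx.task, "EventID": event_id,
           "Edges": _ctx.last, "Msg": message})    # report sink stand-in
    _ctx.last = [event_id]        # the next event binds to this one

start_task("T")
log_event("request received")
log_event("response sent")        # automatically bound to the previous event
```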

52 Task Tree
X-Trace tags all network operations resulting from a particular task with the same task identifier
The task tree is the set of network operations connected with an initial task
The task tree can be reconstructed after collecting the trace data from reports

53 An example of the task tree: a simple HTTP request through a proxy

54 X-Trace Components
Data: X-Trace metadata
Network path: task tree
Reports: used to reconstruct the task tree

55–56 Propagation of X-Trace Metadata
[Diagrams built up across two slides: the propagation of X-Trace metadata through the task tree]

57 The X-Trace Metadata
Field        Usage
Flags        Bits that specify which of the three optional components are present
TaskID       A unique integer ID
TreeInfo     ParentID, OpID, EdgeType
Destination  The address that X-Trace reports should be sent to
Options      A mechanism to accommodate future extensions
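
These fields can be pictured as a small record attached to each message; a sketch using the table's field names (the encoding, the edge-type value, and the collector address are made up for illustration, not the wire format):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class XTraceMetadata:
    flags: int                          # which optional components are present
    task_id: int                        # unique integer ID shared by the whole task
    tree_info: Optional[Tuple] = None   # (ParentID, OpID, EdgeType)
    destination: Optional[str] = None   # where reports for this task should go
    options: List[bytes] = field(default_factory=list)  # future extensions

# Metadata as it might look at the proxy hop of the earlier example:
md = XTraceMetadata(flags=0b111, task_id=42, tree_info=("a", "g", "NEXT"),
                    destination="collector.example:7831")
```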

58–60 X-Trace Report Architecture
[Diagrams built up across three slides]

61 X-Trace-like Tracing in Google/Bing/Yahoo
Why?
–They own a large portion of the ecosystem
–They use RPC for communication
–They need to understand the time taken by a user request and the resource utilization per request

62 Discussion
Report loss
Non-tree request structures
Partial deployment
Managing report traffic
Security considerations

63 Sherlock vs. X-Trace
Overhead vs. accuracy
Deployment issues
–Invasiveness
–Code modification

64 Conclusions
Sherlock passively infers network-wide dependencies from logs and traceroutes
It diagnoses faults by correlating user observations
X-Trace actively discovers network-wide dependencies

