Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP.


1 Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP

2 Goals & Target Environment
- Improve the ability of large internet portals to gain insight into failures
- Non-goals:
  - masking failures
  - using machine learning to infer abnormal behavior

3 MSN Background
- Messenger, www.msn.com, Hotmail, Search, many other “properties”
- Large (> 100 million users)
- Sources of complexity:
  - multiple data-centers
  - large # of machines
  - complex internal network topology
  - diversity of applications and software infrastructure

4 The Plan
- Detecting, managing, and diagnosing failures
  - Review MSN’s current approaches
  - Describe our solution at a high level

5 Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability & quality of service using synthetic requests
- Customer complaints
  - Telephone, email
Problems:
- These approaches provide limited coverage – harder to catch failures that don’t affect every request
- Data on detected failures often lacks necessary detail to suggest a remedy:
  - which front end is flaky?
  - which app component caused end-user failure?

6 Managing Failures
Definition:
- Ability to prioritize failures
- Detect component service degradation
- Characterizing app-stability
- Capacity planning
- When server “x” fails, what is the impact of this failure?
- Better use of ops and engineering resources
- Current approach: no systematic attempt to provide this functionality

7 Our solution (in 2 steps)
Detecting and Managing Failures
- Step 1: Instrument applications to track user requests across the “service chain”
  - Each request is tagged with a unique id
  - Service chain is composed on-the-fly with help of app instrumentation
  - For each request:
    - Collect per-hop performance information
    - Collect per-request failure status
  - Centralized data collection
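The Step 1 idea above can be illustrated with a small sketch. It is not the MSN implementation: the names `HopRecord`, `Collector`, `new_request_id`, and `run_hop` are hypothetical, and the "centralized" collector is only an in-memory list. It assumes each hop wraps its work so that timing and failure status are reported under the request's unique id.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    """Per-hop performance and failure status for one request (hypothetical shape)."""
    request_id: str
    hop: str
    elapsed_ms: float
    ok: bool


@dataclass
class Collector:
    """Stand-in for centralized data collection; here just an in-memory list."""
    records: list = field(default_factory=list)

    def report(self, record: HopRecord) -> None:
        self.records.append(record)

    def chain(self, request_id: str) -> list:
        # The service chain is reconstructed from whichever hops reported this id.
        # Hops appear in report order, so the innermost hop comes first.
        return [r.hop for r in self.records if r.request_id == request_id]


def new_request_id() -> str:
    """Tag a request with a unique id (a random UUID is an assumption)."""
    return uuid.uuid4().hex


def run_hop(collector, request_id, hop, work):
    """Run one hop's work and report its duration and failure status."""
    start = time.perf_counter()
    ok = True
    try:
        return work()
    except Exception:
        ok = False
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        collector.report(HopRecord(request_id, hop, elapsed_ms, ok))


# Usage: a front end that calls a back end, both reporting under the same id.
collector = Collector()
rid = new_request_id()


def back_end():
    return "rows"


def front_end():
    return run_hop(collector, rid, "back-end", back_end)


run_hop(collector, rid, "front-end", front_end)
```

No hop needs to know the chain in advance: it is whatever set of hops reported the same id, which matches the "composed on-the-fly" bullet.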

8 What kinds of failures?
We can handle:
- Machine failures
- Network connectivity problems
Most:
- Misconfiguration
- Application bugs
But not all:
- Application errors where app itself doesn’t detect that there is a problem

9 Diagnosing Failures
- Assigning responsibility to a specific hw or sw component
- Insight into internals of a component
- Cross component interactions
- Current approach: instrument applications
  - App-specific log messages
- Problems
  - High request rates => log rollover
  - Perceived overhead => detailed logging enabled during testing, disabled in production

10 FUSE Background
- FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
- Lack of a positive ack => failure
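The "lack of a positive ack => failure" rule can be shown with a toy model. This is a deliberately simplified sketch and not the FUSE protocol from the OSDI 2004 paper: the class `FuseGroup`, its methods, and the single shared timeout are all assumptions made for illustration, and time is passed in explicitly so the behaviour is deterministic.

```python
class FuseGroup:
    """Toy failure-agreement group: the only shared state is 'failed or not'."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        self.failed = False
        # Every member is treated as having acked at creation time.
        self.last_ack = {m: now for m in members}

    def ack(self, member, now):
        """Record a positive ack. A failed group never recovers."""
        if not self.failed and member in self.last_ack:
            self.last_ack[member] = now

    def signal_failure(self):
        """Any participant may declare failure explicitly."""
        self.failed = True

    def check(self, now):
        """Silence from any member beyond the timeout counts as failure."""
        if any(now - t > self.timeout_s for t in self.last_ack.values()):
            self.failed = True
        return self.failed


# Usage: the back end goes quiet, so the whole group is declared failed.
group = FuseGroup(["front-end", "back-end"], timeout_s=5.0, now=0.0)
group.ack("front-end", now=3.0)
group.ack("back-end", now=3.0)
healthy = group.check(now=4.0)     # both acked recently
group.ack("front-end", now=7.0)    # the back end stays silent
failed = group.check(now=9.0)      # 9.0 - 3.0 exceeds the 5.0 s timeout
```

The point of the sketch is the asymmetry: success needs continued positive acks from everyone, while failure needs only one missing ack or one explicit signal.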

11 Step 2: Conditional Logging
- Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
- Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
- While fate is undecided: detailed log messages are stored in main memory
  - Common-case overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
  - Quantity of data generated is manageable when most requests are successful
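A minimal sketch of conditional logging as described above, under stated assumptions: `ConditionalLogger`, `log`, and `decide` are hypothetical names, the "sink" is any callable that persists a message, and the failure verdict is supplied by the caller (in the slides it would come from FUSE).

```python
from collections import defaultdict


class ConditionalLogger:
    """Buffer detailed logs in memory; persist them only for failed requests."""

    def __init__(self, sink):
        self._pending = defaultdict(list)  # request id -> buffered messages
        self._sink = sink                  # called as sink(request_id, message)

    def log(self, request_id, message):
        # While the request's fate is undecided, nothing is written out.
        self._pending[request_id].append(message)

    def decide(self, request_id, failed):
        """Resolve a request's fate; returns how many messages were saved."""
        messages = self._pending.pop(request_id, [])
        if not failed:
            return 0  # successful request: buffered logs are discarded
        for message in messages:
            self._sink(request_id, message)
        return len(messages)


# Usage: two requests are logged in detail, only the failed one is kept.
saved = []
logger = ConditionalLogger(lambda rid, msg: saved.append((rid, msg)))
logger.log("r1", "front end: parsed query")
logger.log("r1", "back end: returned results")
logger.log("r2", "front end: parsed query")
logger.log("r2", "back end: timed out")
logger.decide("r1", failed=False)
logger.decide("r2", failed=True)
```

After this runs, `saved` holds only the two `r2` messages, which is why the data volume stays manageable when most requests succeed.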

12 Example
[Figure: Client, Server1, Server2, Server3, and an X marker]
Benefits:
- FUSE allows monitoring of real transactions.
  - All transactions, or a sampled subset to control overhead.
- When a request fails, FUSE provides an audit trail
  - How far did it get?
  - How long did each step take?
  - Any additional application specific context.
- FUSE can be deployed incrementally.
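The slides do not say how a "sampled subset" is chosen. One possible approach, shown here purely as an assumption, is to derive the decision from a hash of the request id so that every server in the chain reaches the same answer without coordination. The function name `is_sampled` is hypothetical.

```python
import hashlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Return True for roughly `rate` of all request ids, deterministically."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    # Integer comparison keeps rate=1.0 always true and rate=0.0 always false.
    return bucket < int(rate * 2**64)


# Usage: every hop evaluates the same id and therefore agrees on the outcome.
decision_on_front_end = is_sampled("request-123", 0.25)
decision_on_back_end = is_sampled("request-123", 0.25)
```

Because the decision depends only on the id, a request is either tracked on every hop or on none, so sampled audit trails stay complete.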

13 Issues
- Overload policy: need to handle bursts of failures without inducing more failures
- How much effort to make apps FUSE enabled?
- Are the right components FUSE enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network load balancers

14 Status
- We’ve implemented FUSE for MSN, integrated with ASP.NET rendering engine
- Testing in progress
- Roll-out at end of summer

15 Backups

16 FUSE is Easy to Integrate

Example current code on Front End:

    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }

Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) {  // default value of f = null
        if (f != null)
            JoinFUSEGroup(f);
        …
        SendRequestToBackEnd(…, f);
    }

Current implementation is in C#, and consists of 2400 LOC

