Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP.



2 Goals & Target Environment
- Improve the ability of large internet portals to gain insight into failures
- Non-goals:
  - masking failures
  - use machine learning to infer abnormal behavior

3 MSN Background
- Messenger, Hotmail, Search, many other “properties”
- Large (> 100 million users)
- Sources of Complexity:
  - multiple data-centers
  - large # of machines
  - complex internal network topology
  - diversity of applications and software infrastructure

4 The Plan
- Detecting, managing, and diagnosing failures
- Review MSN’s current approaches
- Describe our solution at a high level

5 Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability & quality of service using synthetic requests
- Customer complaints
  - Telephone,
Problems:
- These approaches provide limited coverage – harder to catch failures that don’t affect every request
- Data on detected failures often lacks necessary detail to suggest a remedy:
  - which front end is flaky?
  - which app component caused end-user failure?

6 Managing Failures
Definition:
- Ability to prioritize failures
- Detect component service degradation
- Characterizing app-stability
- Capacity planning
- When server “x” fails, what is the impact of this failure?
- Better use of ops and engineering resources
- Current approach: no systematic attempt to provide this functionality

7 Our solution (in 2 steps)
Detecting and Managing Failures
- Step 1: Instrument applications to track user requests across the “service chain”
  - Each request is tagged with a unique id
  - Service chain is composed on-the-fly with help of app instrumentation
  - For each request:
    - Collect per-hop performance information
    - Collect per-request failure status
  - Centralized data collection
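Step 1 might be sketched roughly as follows, in C# since that is the language the talk reports for the implementation. This is only an illustration under stated assumptions: RequestContext, HopRecord, and Collector are hypothetical names rather than the FUSE API, the hops run in one process instead of on separate machines, and an in-memory list stands in for centralized data collection.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical per-hop record: which component handled the request,
// how long it took, and whether it reported a failure.
public sealed class HopRecord
{
    public Guid RequestId;
    public string Component;
    public TimeSpan Elapsed;
    public bool Failed;
}

// Hypothetical stand-in for centralized data collection.
public static class Collector
{
    public static readonly List<HopRecord> Records = new List<HopRecord>();

    public static void Report(HopRecord record)
    {
        lock (Records) Records.Add(record);
    }
}

// Context that travels with the request along the service chain.
public sealed class RequestContext
{
    public Guid RequestId { get; }

    public RequestContext() : this(Guid.NewGuid()) { }          // first hop: new id
    public RequestContext(Guid requestId) { RequestId = requestId; } // later hops: reuse id

    // Run one hop's work, timing it and recording success or failure.
    public void RunHop(string component, Action work)
    {
        Stopwatch timer = Stopwatch.StartNew();
        bool failed = false;
        try
        {
            work();
        }
        catch (Exception)
        {
            failed = true;
            throw; // the caller still sees the failure
        }
        finally
        {
            timer.Stop();
            Collector.Report(new HopRecord
            {
                RequestId = RequestId,
                Component = component,
                Elapsed = timer.Elapsed,
                Failed = failed
            });
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var ctx = new RequestContext();
        // The front end calls the back end as part of its own work.
        ctx.RunHop("FrontEnd", () => ctx.RunHop("BackEnd", () => { }));

        foreach (HopRecord r in Collector.Records)
            Console.WriteLine(r.RequestId + " " + r.Component + " failed=" + r.Failed);
    }
}
```

In a real deployment the request id would have to be carried inside each outgoing request so that the next machine can rebuild the context with the same id; how FUSE actually does this is not described in the slides.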

8 What kinds of failures?
We can handle:
- Machine failures
- Network connectivity problems
Most:
- Misconfiguration
- Application bugs
But not all:
- Application errors where app itself doesn’t detect that there is a problem

9 Diagnosing Failures
- Assigning responsibility to a specific hw or sw component
- Insight into internals of a component
- Cross component interactions
- Current approach: instrument applications
  - App-specific log messages
- Problems
  - High request rates => log rollover
  - Perceived overhead => detailed logging enabled during testing, disabled in production

10 FUSE Background
- FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
- Lack of a positive ack => failure
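The rule "lack of a positive ack => failure" can be illustrated with a small single-process sketch. It is not the distributed FUSE protocol: GroupMonitor, GroupFate, and the fixed deadline are assumptions made for illustration, and the clock is passed in explicitly so the behavior is deterministic.

```csharp
using System;
using System.Collections.Generic;

public enum GroupFate { Undecided, Succeeded, Failed }

// Hypothetical monitor for one group of participants in a service chain.
// The only thing it decides is whether a failure has occurred.
public sealed class GroupMonitor
{
    private readonly HashSet<string> _awaiting;
    private readonly DateTime _deadline;

    public GroupFate Fate { get; private set; } = GroupFate.Undecided;

    public GroupMonitor(IEnumerable<string> members, DateTime now, TimeSpan timeout)
    {
        _awaiting = new HashSet<string>(members);
        _deadline = now + timeout;
    }

    // A member positively acknowledges that its part succeeded.
    public void Ack(string member, DateTime now)
    {
        Check(now);
        if (Fate != GroupFate.Undecided) return; // fate is final once decided
        _awaiting.Remove(member);
        if (_awaiting.Count == 0) Fate = GroupFate.Succeeded;
    }

    // An explicit failure signal from any member fails the whole group.
    public void SignalFailure()
    {
        if (Fate == GroupFate.Undecided) Fate = GroupFate.Failed;
    }

    // Missing acks at the deadline are treated as failure.
    public GroupFate Check(DateTime now)
    {
        if (Fate == GroupFate.Undecided && now >= _deadline) Fate = GroupFate.Failed;
        return Fate;
    }
}

public static class Demo
{
    public static void Main()
    {
        DateTime start = DateTime.UtcNow;
        var monitor = new GroupMonitor(new[] { "FrontEnd", "BackEnd" }, start, TimeSpan.FromSeconds(5));
        monitor.Ack("FrontEnd", start.AddSeconds(1));
        // BackEnd never acks, so the group is declared failed at the deadline.
        Console.WriteLine(monitor.Check(start.AddSeconds(5)));
    }
}
```

Treating silence as failure means a crashed machine or a broken network link needs no special handling: either one simply prevents the ack from arriving.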

11 Step 2: Conditional Logging
- Step 2: Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
- Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
- While fate is undecided: detailed log messages are stored in main memory
  - Common-case overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
  - Quantity of data generated is manageable when most requests are successful
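The buffering behavior described on this slide could look something like the sketch below. ConditionalLog and its members are hypothetical names, not the actual implementation; the sink delegate stands in for whatever durable log store is used, and the decision about a request's fate is assumed to arrive from elsewhere (for example, from FUSE).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical conditional logger: detailed messages are held in memory
// per request and written out only if that request ends in failure.
public sealed class ConditionalLog
{
    private readonly Dictionary<Guid, List<string>> _pending =
        new Dictionary<Guid, List<string>>();
    private readonly Action<string> _sink;

    public ConditionalLog(Action<string> sink) { _sink = sink; }

    // Number of requests whose fate is still undecided.
    public int PendingCount
    {
        get { lock (_pending) return _pending.Count; }
    }

    // While the fate is undecided, messages only go to main memory.
    public void Log(Guid requestId, string message)
    {
        lock (_pending)
        {
            List<string> lines;
            if (!_pending.TryGetValue(requestId, out lines))
            {
                lines = new List<string>();
                _pending[requestId] = lines;
            }
            lines.Add(message);
        }
    }

    // Once the fate is decided: discard on success, save on failure.
    public void Resolve(Guid requestId, bool failed)
    {
        List<string> lines;
        lock (_pending)
        {
            if (!_pending.TryGetValue(requestId, out lines)) return;
            _pending.Remove(requestId);
        }
        if (!failed) return;
        foreach (string line in lines) _sink(requestId + ": " + line);
    }
}

public static class Demo
{
    public static void Main()
    {
        var log = new ConditionalLog(Console.WriteLine);
        Guid ok = Guid.NewGuid();
        Guid bad = Guid.NewGuid();

        log.Log(ok, "rendered page");
        log.Log(bad, "calling back end");
        log.Log(bad, "back end timed out");

        log.Resolve(ok, false); // successful request: nothing is written
        log.Resolve(bad, true); // failed request: both messages are written
    }
}
```

The sketch leaves out anything that bounds memory use for requests whose fate never arrives; a real system would need some expiry policy for those buffers.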

12 Example
Benefits:
- FUSE allows monitoring of real transactions.
  - All transactions, or a sampled subset to control overhead.
- When a request fails, FUSE provides an audit trail
  - How far did it get?
  - How long did each step take?
  - Any additional application specific context.
- FUSE can be deployed incrementally.
[Figure: service chain diagram with Client, Server1, Server2, Server3, and an X]
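The slides do not say how the sampled subset is chosen. One possible approach, shown here purely as an assumption, is to derive the decision from the request id so that every hop seeing the same id reaches the same answer without coordination. Sampler and IsSampled are hypothetical names.

```csharp
using System;

public static class Sampler
{
    // Decide from the request id alone whether this request is monitored.
    // percent is the share of requests to monitor, from 0 to 100.
    public static bool IsSampled(Guid requestId, int percent)
    {
        if (percent <= 0) return false;
        if (percent >= 100) return true;

        // Combine the first four bytes by hand so the result does not
        // depend on the endianness of the machine doing the check.
        byte[] bytes = requestId.ToByteArray();
        uint value = (uint)(bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | (bytes[3] << 24));
        return value % 100u < (uint)percent;
    }
}

public static class Demo
{
    public static void Main()
    {
        int monitored = 0;
        for (int i = 0; i < 10000; i++)
        {
            if (Sampler.IsSampled(Guid.NewGuid(), 10)) monitored++;
        }
        // With random ids, roughly one request in ten should be monitored.
        Console.WriteLine("monitored " + monitored + " of 10000 requests");
    }
}
```

This relies on request ids being close to uniformly distributed, which holds for randomly generated GUIDs but not necessarily for other id schemes.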

13 Issues
- Overload policy: need to handle bursts of failures without inducing more failures
- How much effort to make apps FUSE enabled?
- Are the right components FUSE enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network load balancers

14 Status
- We’ve implemented FUSE for MSN, integrated with ASP.NET rendering engine
- Testing in progress
- Roll-out at end of summer

15 Backups

16 FUSE is Easy to Integrate

Example current code on Front End:

    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }

Example code on Front End using FUSE:

    ReceiveRequestFromClient(…, FUSEinfo f) {
        // default value of f = null
        if ( f != null ) JoinFUSEGroup( f );
        …
        SendRequestToBackEnd(…, f );
    }

Current implementation is in C#, and consists of 2400 LOC