A Path-based Approach to Managing Failures and Evolution
Mike Chen, Anthony Accardi¹, Emre Kıcıman, Jim Lloyd², Dave Patterson, Armando Fox, Eric Brewer
UC Berkeley, Tellme¹, Stanford Univ., eBay²

NSDI 2004, Slide 2: Need for Fast Recovery
• Failures are common and costly
 – Daily partial site outages for large sites.
 – Downtime: $300K - $6 million/hr.
• Challenges:
 – Lots of potential sources of faults.
 – Multiple independent faults.
 – Distributed runtime behavior (e.g. load balancing).
• Observation: very short outages are "free"
 – Cost of downtime is not linear.

NSDI 2004, Slide 3: Need for Rapid Evolution
• Competition drives demand for new features and bug fixes
 – Switching cost is low.
 – Single administrative domain lowers the upgrade barrier.
• Challenges:
 – Short release cycles: weekly and bi-weekly for new features at eBay and Tellme, shorter for bug fixes.
 – Distributed runtime behavior.
• Observation: trend towards application server frameworks
 – E.g. J2EE, .NET, etc.

NSDI 2004, Slide 4: Current Approaches to Understanding Systems
• Two extremes of granularity
• Problems:
 – Dispersed execution context
 – Local context often insufficient
 – "Blackbox" components
[Figure: granularity spectrum from the external, end-to-end view to the "micro" view of code-level debuggers (e.g. inspecting X = 3, Y = true).]

NSDI 2004, Slide 5: "Macro" Approach
• Captures the relationship between components and their aggregate behavior
 – Complements both end-to-end tools and "micro" analysis tools.
[Figure: the "macro" view sits between the external (end-to-end) view and the "micro" code-level view, spanning the web server (WS), app server, and DB tiers.]

NSDI 2004, Slide 6: First Step: Path-based Analysis
• Paths record runtime properties of requests (a data-model sketch follows below)
 – Components used (name, version, etc.)
 – Timestamps
• Two principles:
 1. Use paths as the core abstraction
 2. Apply statistical analysis to a large number of paths
• Focus on correctness
 – In addition to performance (MSR's Magpie, HP's WebMon, and Project 5)
[Figure: a request flows through Web A, App A, App B, and DB B; the path records each component with a timestamp, e.g. "1. Web A, t = 1 … 4. DB B, t = 56".]
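To make the path abstraction concrete, here is a minimal sketch of how observations and paths could be represented in Python; the class and field names are illustrative assumptions, not the actual Pinpoint/ObsLogs/SuperCAL schema.

    # Sketch of a path as an ordered list of per-component observations
    # (illustrative field names, not the real tracing schema).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Observation:
        request_id: str      # ties the observation to one user request
        component: str       # e.g. "Web A", "App B", "DB B"
        version: str         # component version, useful for evolution analysis
        timestamp_ms: float  # when the component handled the request

    @dataclass
    class Path:
        request_id: str
        observations: List[Observation] = field(default_factory=list)

        def shape(self) -> List[str]:
            """Ordered component names: the path 'shape' used for anomaly detection."""
            ordered = sorted(self.observations, key=lambda o: o.timestamp_ms)
            return [o.component for o in ordered]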

NSDI 2004, Slide 7: Architecture
• Observation includes:
 – Component/resource names, version, …
 – Timestamps
• Application-generic tracing
 – By instrumenting the application servers (e.g. < 1K lines for JBoss, a J2EE app server)
 – Request-centric: associate system events with user-visible events
 – Performance overhead: 1-3% for eBay
[Figure: web, app, and DB tracers emit per-request observations to an aggregator, which assembles paths into storage; a query interface feeds analysis engines (detection, diagnosis, visualization) used by Ops/QA/Dev. A sketch of the aggregation step follows below.]
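As a way to picture the aggregator, the sketch below groups tracer observations by request ID and orders each group by timestamp. It assumes observation objects with request_id and timestamp_ms fields as in the previous sketch; those names are assumptions, not the real implementation.

    # Hypothetical aggregator: turn a stream of tracer observations into
    # per-request, time-ordered paths.
    from collections import defaultdict

    def aggregate(observations):
        """Group observations by request_id and sort each group by timestamp."""
        by_request = defaultdict(list)
        for obs in observations:
            by_request[obs.request_id].append(obs)
        return {rid: sorted(group, key=lambda o: o.timestamp_ms)
                for rid, group in by_request.items()}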

NSDI 2004, Slide 8: 3 Path-based Frameworks

  Paths framework | Site   | Description                          | Physical tiers | # of machines | # of requests    | Apps hosted
  Pinpoint        | -      | Research prototype based on J2EE     | 2-3            | -             | -                | Java
  ObsLogs         | Tellme | Enterprise voice application network | (5)            | Hundreds      | Millions per day | VoiceXML
  SuperCAL        | eBay   | Online auction                       |                |               | Billions per day | C++, Java

• eBay stats
 – 1TB raw logs/day (150GB gzipped), 200Mbps peak
 – 2K app servers, 40 SuperCAL machines

NSDI 2004, Slide 9: Talk Outline
• Motivation and Approach
• Failure Management
 – Failure detection via path anomalies
 – Failure diagnosis using machine learning methods
• Evolution Management
 – Application-generic dependency tracking
 – Detecting and diagnosing changes
• Conclusions

NSDI 2004, Slide 10: Failure Management
• Goal: minimize impact of failures
 – User-visible failures => $$$ lost
• 78% of recovery time is spent on detection and diagnosis
[Figure: failure-handling timeline (detection, diagnosis, repair/recovery, impact analysis, feedback), with detection and diagnosis taking 78% of it.]

NSDI 2004, Slide 11: Fast Recovery Challenges
• Many potential causes of failures
 – SW bugs, hardware, configuration, network, DB, …
 – Multiple independent failures
• Lots of data
 – Many small, but tolerable, failures
 – Real-time detection/diagnosis
• Root cause might not be captured in logs
 – Tradeoff between logging granularity and overhead
• Observation: the exact root cause may not be required for many recovery techniques

NSDI 2004, Slide 12: Failure Detection Concepts
• Path collisions
 – Incomplete paths interrupted by other requests.
• Structural anomalies
 – Learn a set of "good" paths, and flag unseen paths.
 – Extended to use probabilistic models.
[Figure: requests fanning out across the web, app, and DB components.]

NSDI 2004, Slide 13: Structural Anomalies in Path Shapes
• Probabilistic Context-Free Grammar (PCFG)
 – Represents likely calls made by each component
 – Learn probabilities of rules based on observed paths (a simplified sketch follows below)
• Anomalous path shapes
 – Score a path by calculating the deviation of P(observed calls) from the average.
• Detected 90% of injected faults in our experiments
[Figure: sample paths over components A, B, C and the learned PCFG rules with probabilities, e.g. S→A (p=1), A→B or A→BC (p=.5 each), B→C or B→$ (p=.5 each), C→$ (p=1).]
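A hedged sketch of the path-shape idea: rather than a full PCFG, it learns per-component call probabilities from known-good paths (a first-order approximation) and scores a new path by how improbable its calls are, so unseen calls stand out. Component names and the scoring rule are illustrative.

    # Simplified stand-in for the PCFG: learn call probabilities from good
    # paths, then score paths by average negative log-probability of their calls.
    from collections import defaultdict
    import math

    def learn(good_paths):
        counts = defaultdict(lambda: defaultdict(int))
        for path in good_paths:                      # each path: list of component names
            seq = ["START"] + path + ["END"]
            for caller, callee in zip(seq, seq[1:]):
                counts[caller][callee] += 1
        return {c: {n: k / sum(d.values()) for n, k in d.items()}
                for c, d in counts.items()}

    def score(path, probs, floor=1e-6):
        """Higher score = more anomalous path shape."""
        seq = ["START"] + path + ["END"]
        logp = [math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(seq, seq[1:])]
        return -sum(logp) / len(logp)

    good = [["WebA", "AppA", "DBA"], ["WebA", "AppB", "DBB"]]
    model = learn(good)
    print(score(["WebA", "AppC", "DBA"], model))     # unseen call gets a high score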

NSDI 2004, Slide 14: Failure Diagnosis Concepts
• Idea: all bad paths touch the root cause
 – Look for path properties common to failed requests, e.g. components used in all failed paths (a naive sketch follows below)
 – Extended to use probabilistic models.
• Limitation:
 – Inter-path dependency
[Figure: several requests' paths through the web, app, and DB tiers, with the failed paths sharing a common component.]
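In its simplest form the idea can be sketched as a set intersection over failed paths; the component names are made up and the probabilistic extension mentioned above is omitted.

    # Naive diagnosis sketch: candidates are components that appear in every
    # failed path but not in every successful path.
    def candidate_causes(failed_paths, ok_paths):
        common_failed = set.intersection(*(set(p) for p in failed_paths))
        in_every_ok = set.intersection(*(set(p) for p in ok_paths)) if ok_paths else set()
        return common_failed - in_every_ok

    failed = [["WebA", "AppB", "DBB"], ["WebB", "AppB", "DBA"]]
    ok = [["WebA", "AppA", "DBA"], ["WebB", "AppA", "DBB"]]
    print(candidate_causes(failed, ok))   # {'AppB'}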

NSDI 2004, Slide 15: Failure Diagnosis
• Summarize each path into its features:

  Path | Type | Name         | Pool | Host | Version | DB                    | Status
  1    | URL  | ViewFeedback | Cgi… | …    | …       | FeedbackDB, UserDB, … | NullPointer
  2    | URL  | Bid          | Cgi… | …    | …       | PriceDB               | Success
  3    | XML  | …            | …    | …    | …       | …                     | …

• What features of requests correlate with failures (e.g. NullPointerException)?
 – Request type, name, pool, host, version, DB, or a combination of these?
 – Different causes require different recovery techniques

NSDI 2004Slide 16 Machine X Machine Y Machine Borrow Statistical Learning Techniques  Cast as feature selection problem in machine learning  Use decision trees because results are easily interpretable 1.Learn the tree from data (with failed paths) 2.The edges that lead to failed nodes are the candidates Success Null- Pointer Success Time- out Respond MyFeedback ViewFeedba ck Login Request Name Request Name TypeNameMachineStatus URL My- Feedback X Null- Pointer URLLoginXSuccess XML View- Feedback YSuccess URLRespondYTimeout ………… Diagnosis: 1) Machine X and MyFeedback 2) Machine Y and Respond FeaturesClass Label

NSDI 2004, Slide 17: Diagnosis Results of Decision Trees
• Recall vs. precision tradeoff
 – Recall: % of true faults identified
 – Precision: 1 - false positive rate
• Decision trees
 – C4.5 w/ adaptation: a standard decision tree algorithm
 – MinEntropy: a greedy variant that finds the one leaf with the most failures (actual results from the eBay deployment)
 – Association rules: a data mining algorithm that computes the conditional probabilities for all combinations of features
[Figure: precision/recall plot comparing the algorithms against the "perfect" point.]

NSDI 2004, Slide 18: Talk Outline
• Motivation and Approach
• Failure Management
• Evolution Management
 – Application-generic dependency tracking
 – Detecting and diagnosing expected and unexpected changes
• Conclusions

NSDI 2004, Slide 19: Tracking Dependencies
• Current approaches
 – Manual approaches are error-prone and slow
 – Static analysis captures possible system behavior, vs. runtime analysis, which captures the actual behavior
• Paths directly capture application structure
 – Application-generic tracking of actual dependencies, with zero changes to applications
[Figure: dependency graph of Rubis, a J2EE auction application, hosted on Pinpoint/JBoss.]

NSDI 2004, Slide 20: Automatically Derived State Dependency
• Paths associate requests with internal state (a sketch follows below)
 – Coupling of requests through shared state
 – Easily extended to track fine-grained (e.g. row-level) state sharing
[Table: requests of PetStore, a J2EE e-commerce application hosted on Pinpoint/JBoss (VerifySignin, Cart, CommitOrder, Category, Search, ProductDetails, NewAccount, Checkout), vs. database tables (Product, Signon, Account, Banner, Inventory), with each cell marked R (read), W (write), or R/W.]
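A sketch of how the request-to-table matrix could be derived, assuming each path carries the database tables it read or wrote; the request names, table names, and path format below are illustrative.

    # Sketch: build a request-type x database-table read/write matrix from paths.
    from collections import defaultdict

    def state_dependency(paths):
        matrix = defaultdict(dict)                  # matrix[request][table] -> "R", "W", or "R/W"
        for request_type, table_accesses in paths:  # path: (request type, [(table, "R"/"W"), ...])
            for table, mode in table_accesses:
                prev = matrix[request_type].get(table, "")
                matrix[request_type][table] = "R/W" if prev and prev != mode else mode
        return matrix

    paths = [("Cart",        [("Product", "R"), ("Account", "R"), ("Inventory", "R"), ("Inventory", "W")]),
             ("CommitOrder", [("Account", "R"), ("Inventory", "W")])]
    print(dict(state_dependency(paths)))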

NSDI 2004, Slide 21: Detecting/Diagnosing Changes
• Paths provide a flexible mechanism to profile any sub-path
 – Take the interval between any two observations
 – Drill down to identify problematic sub-paths
• Statistical analysis simultaneously examines thousands of sub-paths
 – Use non-parametric tests, e.g. Mann-Whitney (a sketch follows below)
 – Thousands of sub-paths are tested for every Tellme release
[Figure: a path as a sequence of observations; any pair of observations defines a sub-path interval.]
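A minimal sketch of the per-sub-path comparison using SciPy's Mann-Whitney U test; the latency samples are synthetic, and in practice the significance threshold would need adjusting when thousands of sub-paths are tested at once.

    # Sketch: flag a sub-path whose latency distribution shifted between releases.
    from scipy.stats import mannwhitneyu

    old_latencies_ms = [12, 14, 13, 15, 12, 16, 14, 13, 15, 14]   # release N
    new_latencies_ms = [19, 22, 21, 18, 23, 20, 24, 19, 21, 22]   # release N+1

    stat, p_value = mannwhitneyu(old_latencies_ms, new_latencies_ms, alternative="two-sided")
    if p_value < 0.01:
        print(f"sub-path latency changed between versions (p = {p_value:.3g})")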

NSDI 2004, Slide 22: Detecting/Diagnosing App-level Changes
• Paths enable simultaneous testing of many sub-paths
 – Drill down to diagnose specific slow sub-paths
[Figure: latency box plots (lower quartile, median, upper quartile, outliers) for 2 versions of a Tellme application; a change is detected in 1 sub-path in 1 application.]

NSDI 2004, Slide 23: Detecting/Diagnosing App-level Changes
• Paths enable simultaneous testing of many sub-paths
 – Drilling down to diagnose the specific slow sub-paths
[Figure: latency box plots for 2 versions of 2 Tellme applications and 3 sub-paths; no changes detected.]

NSDI 2004, Slide 24: Detecting/Diagnosing App-level Changes
• Paths enable simultaneous testing of many sub-paths
 – Drilling down to diagnose the specific slow sub-paths
[Figure: latency box plots for 3 versions of 2 Tellme applications and 3 sub-paths; the app is fixed in the latest version.]

NSDI 2004, Slide 25: Detecting/Diagnosing Platform Changes
• Look for consistent deviation across applications
[Figure: 2 versions of a Tellme platform; a change is detected in 1 sub-path in 1 application.]

NSDI 2004, Slide 26: Detecting/Diagnosing Platform Changes
• Look for consistent deviation across applications
[Figure: 2 versions of a Tellme platform; consistent changes across all apps.]

NSDI 2004, Slide 27: Detecting/Diagnosing Platform Changes
• Look for consistent deviation across applications
[Figure: 3 versions of a Tellme platform; the platform is fixed in the latest version.]

NSDI 2004, Slide 28: Lessons Learned
• Separate the path analysis logic from the observation instrumentation
 – Improves maintainability and extensibility
• Data is cheap
 – Allows the use of simple statistical algorithms
• Live workload
 – Important to support online use of the tools
• Record "attempts"
 – Failed components/resources may not record observations properly

NSDI 2004, Slide 29: Summary
• Paths + statistical analysis:
 – Improves failure detection and diagnosis to support fast recovery.
 – Automates dependency tracking and change analysis to support rapid and correct evolution.
• Deployed and evaluated on real systems
 – Pinpoint, Tellme, and eBay
• Future work:
 – Wide-area systems and systems that span multiple administrative domains

NSDI 2004, Slide 30: Thank You
• Acknowledgements
 – Berkeley/Stanford ROC Research Group
 – Professor Michael Jordan and Alice Zheng
 – Shepherd Miguel Castro and the anonymous reviewers
• For more info:
 – Google, Yahoo, or MSN Search for Mike Chen

NSDI 2004, Slide 31: Backup Slides

NSDI 2004, Slide 32: Recovery Time Saving
• Expected time saved = E(manual diagnosis + recovery) - E(automated & manual diagnosis + recovery)
 – Using diagnosis times based on experience:
 – Diagnosis time: automated = 1 min, manual (perfect) = 15 min
 – Recovery time (w/ verification) = 5 min (worked numbers below)
[Figure: time saved (min) vs. noise-filtering threshold; $50K to $1 million saved.]
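With the illustrative figures on the slide, the per-failure saving works out as in the sketch below; this ignores the noise-filtering threshold shown in the figure and assumes the automated diagnosis is correct.

    # Back-of-the-envelope calculation using the numbers above.
    manual_diag, auto_diag, recovery = 15, 1, 5                    # minutes
    saved_per_failure = (manual_diag + recovery) - (auto_diag + recovery)
    print(saved_per_failure)                                       # 14 minutes per failure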

NSDI 2004, Slide 33: Show eBay's Complex System Diagram
• Show a few path examples

NSDI 2004, Slide 34: Failure Management Process
• Detection
• Isolation
• Diagnosis
• Impact Analysis
• Repair
• Feedback

NSDI 2004, Slide 35: MinEntropy
• Entropy measures the randomness of data
 – E.g. if failures are evenly distributed (very random), then entropy is high
• Rank features by normalized entropy (an entropy sketch follows below)
 – E.g. if the root cause is a machine failure, entropy will be low in the host dimension; since all types of requests fail on that host, entropy in the request-type dimension will be higher.
• Implemented at eBay
 – Greedy approach searches for the leaf node with the most failures
 – Pros: fast (< 1 s for 100K txns, and scales linearly)
 – Cons: optimized for single faults; features may not be independent (i.e. pool and host)
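A sketch of the ranking step: compute a normalized entropy over the failure counts in each feature dimension and treat the lowest-entropy dimension as the most suspicious. The counts are illustrative, loosely modeled on the example on the next slide.

    # Low entropy in a dimension = failures concentrate on few values there.
    import math

    def normalized_entropy(failure_counts):
        total = sum(failure_counts)
        probs = [c / total for c in failure_counts if c > 0]
        h = -sum(p * math.log2(p) for p in probs)
        return h / math.log2(len(failure_counts)) if len(failure_counts) > 1 else 0.0

    failures_by_pool = [12, 4002, 30, 8, 5]       # concentrated in one pool
    failures_by_url = [636, 512, 736]             # spread evenly across URLs
    print(normalized_entropy(failures_by_pool))   # low  -> suspicious dimension
    print(normalized_entropy(failures_by_url))    # high -> probably not the cause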

NSDI 2004, Slide 36: MinEntropy Example
• Errors broken down by feature dimension:
 – TxType: URL 4350, SQL …, XSLT 0, …
 – Pool: Cgi0 12, Cgi1 4002, Cgi2 30, Cgi3 8, Cgi4 5, …
 – Machine: Attila 1985, Lenin 2002, Marcus 4, Scipio 0, …
 – URL: MyEBay 636, MyEBaySeller 512, MyEBayLogin 736, …
 – Label: E…, …
• Alert: Build E293 causing URL error storm (not specific to any URL) in pool CGI1

NSDI 2004, Slide 37: Association Rules
• Data mining technique to compute item sets
 – E.g. shoppers who bought this item also shopped for …
• Metrics (a sketch of both follows below)
 – Confidence: (# of A & B) / (# of A), i.e. the conditional probability of B given A
 – Support: (# of A & B) / (total # of txns)
• Generates rules for all possible sets
 – E.g. machine=abc, txn=login => status=NullPointerException (conf: 0.1, support: 0.02)
• Applied to failure diagnosis
 – Find all rules that have a failed status on the right
 – Pros: looks at combinations of features
 – Cons: generates many rules
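A brute-force sketch of the two metrics for a single candidate rule "feature set A => Status=Failed"; real association-rule miners (e.g. Apriori) enumerate and prune item sets, and the transactions below are made up.

    # Confidence and support for a rule antecedent => Status=Failed.
    def rule_metrics(transactions, antecedent):
        """antecedent: dict of feature -> value, e.g. {"TxnName": "LeaveFeedback"}."""
        matches_a = [t for t in transactions if all(t.get(k) == v for k, v in antecedent.items())]
        matches_both = [t for t in matches_a if t["Status"] == "Failed"]
        confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
        support = len(matches_both) / len(transactions)
        return confidence, support

    txns = [{"TxnName": "LeaveFeedback", "Pool": "icgi2", "Status": "Failed"},
            {"TxnName": "LeaveFeedback", "Pool": "icgi2", "Status": "Success"},
            {"TxnName": "Bid",           "Pool": "icgi1", "Status": "Success"}]
    print(rule_metrics(txns, {"TxnName": "LeaveFeedback"}))   # (0.5, 0.333...)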

NSDI 2004, Slide 38: Adapting Association Rules
• Sample output (rules containing failures):
 – TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
 – Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
 – TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
 – TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
• Rank by the size of the item sets if support and confidence are equal
 – TxnName = LeaveFeedback

NSDI 2004, Slide 39: Failure Management
• Goal: minimize the impact of failures
 – User-visible failures => $$$ lost
• Fast failure detection and diagnosis are critical to availability
 – 78% of recovery time is spent on detection and diagnosis
[Figure: failure-handling timeline: detection, diagnosis, repair/recovery, impact analysis, feedback.]

NSDI 2004, Slide 40: PCFG Thresholding
• Set a threshold for declaring anomalies (a sketch follows below)
 – Static threshold: any request above the 99th or 99.5th percentile
 – Dynamic threshold: when proportions don't match the known-good distribution.
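A minimal sketch of the static-threshold variant, assuming per-path anomaly scores like those produced by the earlier path-shape sketch; the synthetic "good" scores and the 99th percentile cutoff are illustrative.

    # Static anomaly threshold at the 99th percentile of known-good scores.
    import numpy as np

    def make_detector(good_scores, percentile=99.0):
        threshold = np.percentile(good_scores, percentile)
        return lambda score: score > threshold      # True => flag the request as anomalous

    detect = make_detector(np.random.default_rng(0).normal(5.0, 1.0, 10_000))
    print(detect(9.0))   # a score far above typical good traffic -> True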

NSDI 2004, Slide 41: Failure Diagnosis Experiments
• Data set
 – 10 one-minute traces, 4 of them with 2 independent faults (14 independent faults in total)
 – About 1/8 of the whole site (640 potential single faults)
• Metrics
 – Recall: % of true faults identified = (# of identified faults) / (# of true faults)
 – Precision: 1 - false positive rate = (# of identified faults) / (# of predicted faults)
[Figure: path features (Type, Name, Pool, Machine, Version, Database, Status) and the injected fault combinations: Host; DB; Host, Host; Host, DB; Host, SW; DB, SW.]

NSDI 2004, Slide 42: eBay's Site
• 2 physical tiers
 – Web server/app server + DB
 – Apps in both Java (WebSphere) and C++
• SuperCAL (Centralized Application Logging)
 – API for app developers to log anything to CAL
 – The platform logs common path features: cookie, host, URL, DB table(s), status, etc.
• Stats
 – 1TB raw logs/day (150GB gzipped), 200Mbps peak
 – 2K app servers, 40 SuperCAL machines
• How to diagnose accurately and efficiently?