A Statistical Learning Approach to Diagnosing eBay’s Site Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer

1 A Statistical Learning Approach to Diagnosing eBay’s Site Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer mikechen@cs.berkeley.edu

2 Motivation
Path-based Diagnosis, Jan 12, 2004
• Fast failure detection and diagnosis are critical to high availability
  – But the exact root cause may not be required for many recovery techniques
• Many potential causes of failures
  – Software bugs, hardware, configuration, network, database, etc.
  – Manual diagnosis is slow and inconsistent
• Statistical approaches are ideal
  – Simultaneously examine many possible causes of failures
  – Robust to noise

3 Challenges
• Lots of (noisy) data
• Near real-time detection and diagnosis
• Multiple independent failures
• Root cause might not be captured in logs

4 Talk Outline
• Introduction
• eBay’s infrastructure
• 3 statistical approaches
• Early results

5 eBay’s Infrastructure
• 2 physical tiers
  – Web server/app server + DB
  – Migrating to Java (WebSphere) from C++
• SuperCAL (Centralized Application Logging)
  – API for app developers to log anything to CAL
  – Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
  – Supports nested transactions (txns)
  – A path can be identified via thread ID + host ID

6 SuperCAL Architecture
• Stats
  – 2K app servers, 40 SuperCAL machines
  – 1B URLs/day
  – 1TB raw logs/day (150GB gzipped), 200Mbps peak
[Diagram labels: App Servers, LB Switch, Real-time msg bus, detection, diagnosis]

7 Failure Analysis
• Summarize each transaction into:

  ID  Type  Name          Pool  Host  Version  DB                     Status
  1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB, …  NullPointer
  2   URL   Bid           Cgi2  231   1.0.3    PriceDB                Success
  3   XML   …             …     …     …        …                      …

  Columns Type through DB are the features; Status is the class.
• What features are causing requests to fail?
  – Txn type, txn name, pool, host, version, DB, or a combination of these?
  – Different causes require different recovery techniques

8 3 Approaches
• Machine learning
  – Decision trees
  – MinEntropy – eBay’s greedy variant of decision trees
• Data mining
  – Association rules

9 Decision Trees
• Classifiers developed in the statistical machine learning field
• Example: go skiing tomorrow?
• “Learning” => inferring the decision tree’s rules from data
[Figure: example decision trees splitting on New snow / No new snow and Cloudy / Sunny, with Y/N leaves]

10 Decision Trees
• Feature selection
  – Look for the features that best separate the classes
  – Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)
• The goal of a decision tree algorithm
  – Split nodes until the leaves are “pure” enough or until no further split is possible (pure => all data points have the same class label)
  – Use pruning heuristics to control over-fitting

  TxnName       Failed
  MyEBay        636
  MyEBaySeller  512
  MyEBayLogin   736
  …             …

  Machine  Failed
  Attila   2985
  Lenin    20
  Marcus   4
  Scipio   5
  …        …
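To make the feature-selection step concrete, here is a minimal Python sketch of information gain, the metric the slide attributes to C4.5. The toy rows, the machine and transaction names, and the helper names `entropy` and `information_gain` are hypothetical illustrations, not eBay data or the authors' code.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, feature, label="Status"):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    base = entropy([row[label] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: every failure is on machine m1, regardless of txn name.
rows = [
    {"Machine": "m1", "TxnName": "a", "Status": "Failed"},
    {"Machine": "m1", "TxnName": "b", "Status": "Failed"},
    {"Machine": "m2", "TxnName": "a", "Status": "Success"},
    {"Machine": "m2", "TxnName": "b", "Status": "Success"},
]

gains = {f: information_gain(rows, f) for f in ("Machine", "TxnName")}
best = max(gains, key=gains.get)  # feature that best separates the classes
print(best, gains)
```

In this toy case splitting on `Machine` yields pure leaves, while `TxnName` leaves the classes evenly mixed, so a tree learner would split on `Machine` first.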

11 Decision Trees – Sample Output
• Sample output; the counts on each leaf are (correct, incorrect):
  Pool = icgi1
  | TxnName = LeaveFeedback: failed (8,1)
  | TxnName = MyFeedback: failed (205,3)
  Pool = icgi2
  | TxnName = Respond: failed (1)
  | TxnName = ViewFeedback: failed (3554,52)
[Figure: tree splitting on Pool (icgi1, icgi2) with leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]
• Naïve diagnosis:
  1. Pool=icgi1 and TxnName=LeaveFeedback
  2. Pool=icgi1 and TxnName=MyFeedback
  3. Pool=icgi2 and TxnName=Respond
  4. Pool=icgi2 and TxnName=ViewFeedback

12 Feature Selection Heuristics
1. Ignore leaf nodes with no failed transactions
2. Problem: noisy leaves
  – Keep the top N leaves, or ignore nodes with < M% failures
3. Problem: features may not be independent
  – Drop ancestor nodes that are “subsumed” by the leaves
4. Rank by impact
  – Sort the predicted causes by failure count
[Figure: the sample tree pruned step by step: four leaves (8, 205, 1, 3554), then two leaves (205, 3554) under icgi1 and icgi2, then the same two leaves without the pool nodes]
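Heuristics 1, 2, and 4 can be sketched in a few lines of Python using the leaf counts from the sample output. This is only an illustration under stated assumptions: the function name `select_causes` is hypothetical, "M%" is read here as a leaf's share of all failed transactions (the slides do not define it), and the 1% threshold is chosen for the example. Heuristic 3 is omitted because deciding whether an ancestor is subsumed requires the raw transactions.

```python
# Leaves from the sample decision-tree output: (conditions, failed txn count).
leaves = [
    ((("Pool", "icgi1"), ("TxnName", "LeaveFeedback")), 8),
    ((("Pool", "icgi1"), ("TxnName", "MyFeedback")), 205),
    ((("Pool", "icgi2"), ("TxnName", "Respond")), 1),
    ((("Pool", "icgi2"), ("TxnName", "ViewFeedback")), 3554),
]


def select_causes(leaves, min_pct=1.0, top_n=None):
    """Filter noisy leaves and rank the rest by failure count.

    min_pct is interpreted (an assumption) as the minimum share, in percent,
    of all failed transactions that a leaf must account for.
    """
    total_failed = sum(count for _, count in leaves)
    if total_failed == 0:
        return []
    kept = [
        (cond, count)
        for cond, count in leaves
        if count > 0 and 100.0 * count / total_failed >= min_pct  # heuristics 1 and 2
    ]
    kept.sort(key=lambda item: item[1], reverse=True)  # heuristic 4: rank by impact
    return kept[:top_n] if top_n is not None else kept


causes = select_causes(leaves, min_pct=1.0)
for cond, count in causes:
    print(" and ".join(f"{k}={v}" for k, v in cond), count)
```

With these inputs only the ViewFeedback and MyFeedback leaves survive, ranked in that order.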

13 MinEntropy
• Entropy measures the randomness of data
  – E.g. if failures are evenly distributed (very random), then entropy is high
• Rank features by their normalized entropy
  – Greedy approach searches for the leaf node with the most failures
• Always produces exactly one diagnosis
• Deployed on the entire eBay site
  – Sends real-time alerts to ops
  – Pros: fast (<1s for 100K txns and scales linearly)
  – Cons: optimized for single faults

14 MinEntropy example

  TxnType  Errors
  URL      4350
  SQL      47
  EMAIL    12
  XSLT     0
  …        …

  Pool  Errors
  Cgi0  12
  Cgi1  4002
  Cgi2  30
  Cgi3  8
  Cgi4  5
  …     …

  Machine  Errors
  Attila   1985
  Lenin    2002
  Marcus   4
  Scipio   0
  …        …

  TxnName       Errors
  MyEBay        636
  MyEBaySeller  512
  MyEBayLogin   736
  …             …

  Version  Errors
  E293     3987
  E291     15

Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
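The ranking idea behind this example can be sketched in Python using only the error counts visible on the slide (the slide elides further rows, so the numbers are incomplete). The normalization used here, dividing by log2 of the number of distinct values, is an assumption; the slides do not define "normalized entropy", and `normalized_entropy` is a hypothetical helper, not eBay's implementation.

```python
import math


def normalized_entropy(counts):
    """Entropy of the error distribution, scaled to [0, 1].

    Assumption: normalize by log2(number of distinct values) so that features
    with different numbers of values are comparable.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0  # no errors or a single value: nothing to rank
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(len(counts))


# Error counts as shown on the slide (elided rows are not included).
error_counts = {
    "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
    "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    "TxnName": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Version": {"E293": 3987, "E291": 15},
}

# Low entropy means the errors concentrate on one value of that feature.
ranked = sorted(
    error_counts,
    key=lambda f: normalized_entropy(list(error_counts[f].values())),
)
for feature in ranked:
    values = error_counts[feature]
    top_value = max(values, key=values.get)
    score = normalized_entropy(list(values.values()))
    print(f"{feature:8s} entropy={score:.3f} top value={top_value}")
```

Under these assumptions Version has the lowest entropy and TxnName the highest, which is consistent with the alert naming version E293 while calling the failure "not specific to any URL".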

15 Association Rules
• Data mining technique to compute item sets
  – e.g. Shoppers who bought this item also shopped for …
• Metrics
  – Confidence: (# of A & B) / (# of A), i.e. the conditional probability of B given A
  – Support: (# of A & B) / (total # of txns)
• Generates rules for all possible sets
  – e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
• Applied to failure diagnosis
  – Find all rules that have a failed status on the right-hand side, then rank by confidence
  – Pros: looks at combinations of features
  – Cons: generates many rules
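The two metrics are simple ratios of counts. A minimal Python sketch follows; the five transactions and the helper names `matches` and `rule_metrics` are hypothetical and chosen only to make the arithmetic easy to check.

```python
def matches(txn, items):
    """True if the transaction has every feature=value pair in `items`."""
    return all(txn.get(key) == value for key, value in items.items())


def rule_metrics(txns, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent."""
    n_a = sum(matches(t, antecedent) for t in txns)
    n_ab = sum(matches(t, antecedent) and matches(t, consequent) for t in txns)
    confidence = n_ab / n_a if n_a else 0.0      # (# of A & B) / (# of A)
    support = n_ab / len(txns) if txns else 0.0  # (# of A & B) / (total # of txns)
    return confidence, support


# Hypothetical transactions, not eBay data.
txns = [
    {"machine": "abc", "txn": "login", "status": "NullPointer"},
    {"machine": "abc", "txn": "login", "status": "Success"},
    {"machine": "abc", "txn": "bid", "status": "Success"},
    {"machine": "xyz", "txn": "login", "status": "Success"},
    {"machine": "xyz", "txn": "bid", "status": "Success"},
]

confidence, support = rule_metrics(
    txns,
    antecedent={"machine": "abc", "txn": "login"},
    consequent={"status": "NullPointer"},
)
print(confidence, support)
```

Two transactions match the antecedent and one of them also fails, so the confidence is 1/2 and the support is 1/5.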

16 Association Rules – Sample Output
• Sample output (rules containing failures):
  TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
  TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
• Problem: features may not be independent
  – e.g. all LeaveFeedback txns are of type URL
  – Drop rules that are subsumed by more specific rules
• Diagnosis: TxnName=LeaveFeedback
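One way to arrive at the single diagnosis from the four sample rules is sketched below in Python. It rests on an assumed reading of the slide: when a rule's antecedent strictly contains another rule's antecedent and both have the same confidence, the extra features add no information, so the longer rule is dropped. The function name `drop_redundant` is hypothetical.

```python
import math

# The four sample rules: (antecedent as a set of feature/value pairs, confidence).
rules = [
    (frozenset({("TxnType", "URL"), ("Pool", "icgi2"), ("TxnName", "LeaveFeedback")}), 0.28),
    (frozenset({("Pool", "icgi2"), ("TxnName", "LeaveFeedback")}), 0.28),
    (frozenset({("TxnType", "URL"), ("TxnName", "LeaveFeedback")}), 0.28),
    (frozenset({("TxnName", "LeaveFeedback")}), 0.28),
]


def drop_redundant(rules):
    """Keep a rule only if no strictly smaller antecedent has the same confidence."""
    kept = []
    for antecedent, conf in rules:
        redundant = any(
            other < antecedent and math.isclose(other_conf, conf)  # `<` is strict subset
            for other, other_conf in rules
        )
        if not redundant:
            kept.append((antecedent, conf))
    return kept


diagnosis = drop_redundant(rules)
for antecedent, conf in diagnosis:
    print(" ".join(f"{k}={v}" for k, v in sorted(antecedent)), conf)
```

On the sample rules this leaves only TxnName=LeaveFeedback, matching the diagnosis on the slide.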

17 Experimental Setup
• Dataset
  – About 1/8 of the whole site
  – 10 one-minute traces, 4 with 2 concurrent faults, for a total of 14 independent faults
  – True faults identified through post-mortems, ops chat logs, application logs, etc.
• Metrics
  – Precision: (# of correctly identified faults) / (# of predicted faults)
  – Recall: (# of correctly identified faults) / (# of true faults)

  Type  Name  Pool  Machine  Version  Database  Status
  10    300   15    260      7        40        8

  Host  DB  Host, Host  Host, DB  Host, SW  DB, SW
  2     4   1           1         1         1
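As a small worked example of the two metrics, the Python sketch below uses the standard definitions: precision is the fraction of predicted faults that are true faults, and recall is the fraction of true faults that were predicted. The fault labels and the function name `precision_recall` are hypothetical.

```python
def precision_recall(predicted, true_faults):
    """Return (precision, recall) for a set of predicted faults."""
    predicted, true_faults = set(predicted), set(true_faults)
    correct = predicted & true_faults  # correctly identified faults
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(true_faults) if true_faults else 0.0
    return precision, recall


# Hypothetical trace with two concurrent true faults and four predictions.
true_faults = {("Host", "h1"), ("DB", "db1")}
predicted = {("Host", "h1"), ("Pool", "p1"), ("Version", "v1"), ("TxnName", "t1")}

precision, recall = precision_recall(predicted, true_faults)
print(precision, recall)
```

Here one of four predictions is correct (precision 0.25) and one of two true faults is found (recall 0.5).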

18 Results: DBs in Dataset
• True causes for DB-related failures are captured in the dataset
  – Variable number of DBs used by each txn
• Feature selection heuristics
  1. Ignore leaf nodes with no failed transactions
  2. Noise filtering
    – Ignore nodes with < M% failures (in this case, M = 10)
  3. Path trimming
    – Drop ancestor nodes subsumed by the leaf nodes

19 Results: DBs not in Dataset
• True cause not captured for DB-related failures
• C4.5 suffers from the unbalanced dataset
  – i.e. it produces a single rule that predicts every txn to be successful

20 What’s next?
• ROC curves
  – Show the tradeoff between precision and recall
• Transient failures
  – Up-sample to balance the dataset or use a cost matrix
• Some measure of the “confidence” of the prediction
• More data points
  – Have 20 hrs of logs that contain failures
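Up-sampling here means replicating the rare class until the classes are balanced. A minimal Python sketch of random over-sampling follows; the function name `upsample`, the fixed seed, and the toy rows are assumptions for illustration, since the slides do not say how the authors planned to do it.

```python
import random
from collections import Counter


def upsample(rows, label="Status", seed=0):
    """Randomly replicate rows of the smaller classes to match the largest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical unbalanced dataset: 2 failures among 100 transactions.
rows = [{"TxnName": "t", "Status": "Success"} for _ in range(98)]
rows += [{"TxnName": "t", "Status": "Failed"} for _ in range(2)]

balanced = upsample(rows)
class_counts = Counter(row["Status"] for row in balanced)
print(class_counts)
```

After up-sampling both classes have 98 rows, so a classifier can no longer reach high accuracy by predicting that every transaction succeeds.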

21 Open Questions
• How to deal with multiple symptoms?
  – E.g. DB outage causing multiple types of requests to fail
  – Treat it as multiple failures?
• Failure importance (count vs. rate)
  – Two failures may have similar failure counts
  – Low volume and higher failure rate vs. high volume and lower failure rate

