Using HTTP Access Logs To Detect Application-Level Failures In Internet Services Peter Bodík, UC Berkeley Greg Friedman, Lukas Biewald, Stanford University.


Slide 1: Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
Peter Bodík, UC Berkeley
Greg Friedman, Lukas Biewald, Stanford University
HT Levine, Ebates.com
George Candea, Stanford University

Slide 2: Motivation
problem:
– takes weeks/months to detect some failures in Internet services
assumption:
– users change their behavior in response to failures
– e.g., can’t follow a link from /shopping_cart to /checkout
goal:
– quickly detect changes/anomalies in users’ access patterns
– localize the cause of the change: which page is causing problems? did the page transitions change?

Slide 3: Outline
online algorithms for anomaly detection
demo of a GUI tool for real-time detection
questions we have
future work

Slide 4: Anomalies in user access patterns
why this approach to failure detection?
– leverages aggregate intelligence of people using the site
– identifying page access patterns can help localize failures
– don’t need any instrumentation
types of anomalies
– unexpected: signify failures/problems
– expected: verify the changes/updates to the website
what types of user patterns we can observe
– frequencies of individual pages
– page transitions
– user sessions
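The first two pattern types listed above (page frequencies and page transitions) can be counted directly from log records grouped by session. The following is a minimal sketch, not the authors' implementation: it assumes the access log has already been parsed into hypothetical `(session_id, timestamp, path)` tuples, and the function name `count_patterns` is illustrative.

```python
from collections import Counter, defaultdict


def count_patterns(records):
    """Count page hits and page-to-page transitions.

    records: iterable of (session_id, timestamp, path) tuples.
    This record shape is an assumption; real logs need parsing first.
    """
    page_hits = Counter()
    sessions = defaultdict(list)
    for session_id, timestamp, path in records:
        page_hits[path] += 1
        sessions[session_id].append((timestamp, path))

    transitions = Counter()
    for visits in sessions.values():
        visits.sort()  # order each session's hits by timestamp
        # Pair every hit with the next hit in the same session.
        for (_, src), (_, dst) in zip(visits, visits[1:]):
            transitions[(src, dst)] += 1
    return page_hits, transitions
```

A drop in the count for a pair such as `("/shopping_cart", "/checkout")` is the kind of transition change the Motivation slide describes.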

Slide 5: Real-world failures from Ebates.com
Ebates.com
– mid-sized e-commerce site
– provided 5 sets of HTTP logs (1-2 week period)
– have access to emails, chat logs from periods of problems
each data set contains one or more failures
– mostly site crash
examples
– problem with survey pages
– broken signup page
– bad DB query

Slide 6: Normal traffic: 11am – 3am
[Figure: traffic from 11am to 3am; scale: 1, 10, 100 hits / 5 minutes]

Slide 7: Anomaly: 7am – 1pm
[Figure: traffic from 7am to 1pm; scale: 1, 10, 100 hits / 2 minutes]

Slide 8: Online detection of anomalies
assign anomaly score to the current time interval
handling anomalous intervals in the past
1. use all intervals
2. less weight on the anomalous intervals
3. ignore anomalous intervals
localization of problems
– most anomalous pages
– changes in page transitions
[Figure; axis label: time]
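The three options for handling past anomalous intervals can be viewed as one weighting scheme: weight 1 uses all intervals, a weight between 0 and 1 down-weights anomalous ones, and weight 0 ignores them. The sketch below is an illustration under that assumption; the slides do not say how the weighting was implemented, and `weighted_mean_var` and `anomaly_weight` are hypothetical names.

```python
def weighted_mean_var(counts, anomalous, anomaly_weight):
    """Weighted mean and variance of one page's past hit counts.

    counts:         hits per past interval
    anomalous:      parallel list of booleans (interval was flagged)
    anomaly_weight: 1.0 = use all intervals, 0.0 = ignore flagged ones,
                    anything in between = give them less weight
    """
    weights = [anomaly_weight if flag else 1.0 for flag in anomalous]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable intervals in the history")
    mean = sum(w * c for w, c in zip(weights, counts)) / total
    var = sum(w * (c - mean) ** 2 for w, c in zip(weights, counts)) / total
    return mean, var
```

With counts `[10, 12, 11, 50]` and only the last interval flagged, a weight of 0 gives a mean of 11, while a weight of 1 pulls the mean up to 20.75, which would make a later spike look less unusual.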

Slide 9: Two algorithms
chi-square test
– count hits to top 40 pages in the past 6 hours and the past 10 minutes
– compare relative frequencies using the chi-square test
– more sensitive to frequent pages
– compare page transitions before and during the anomaly
Naive Bayes anomaly detection
– assume that page frequencies are independent
– model frequency of each page as a Gaussian
– learn mean and variance from the past
– anomaly score = 1 - Prob(current interval is normal)
– more sensitive to infrequent pages
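The chi-square comparison of the two windows can be sketched as a standard test of homogeneity on a 2 x k table (baseline window vs. recent window, one column per page). This is an illustration only: the slides do not give the exact formulation, and the function name and the choice to rank the top pages by baseline hits are assumptions.

```python
from collections import Counter


def chi_square_statistic(baseline, recent, top_n=40):
    """Chi-square statistic comparing page-hit distributions.

    baseline: dict page -> hits in the long window (e.g. past 6 hours)
    recent:   dict page -> hits in the short window (e.g. past 10 minutes)
    Pages are restricted to the top_n by baseline hits (an assumption).
    """
    pages = [page for page, _ in Counter(baseline).most_common(top_n)]
    total_base = sum(baseline.get(p, 0) for p in pages)
    total_recent = sum(recent.get(p, 0) for p in pages)
    if total_base == 0 or total_recent == 0:
        raise ValueError("both windows need at least one hit")
    grand = total_base + total_recent

    stat = 0.0
    for p in pages:
        b, r = baseline.get(p, 0), recent.get(p, 0)
        column = b + r
        if column == 0:
            continue
        # Expected counts if both windows shared the same page mix.
        exp_b = total_base * column / grand
        exp_r = total_recent * column / grand
        stat += (b - exp_b) ** 2 / exp_b + (r - exp_r) ** 2 / exp_r
    return stat
```

The statistic is 0 when both windows have identical relative frequencies and grows as the recent page mix drifts away from the baseline; with k pages it would be compared against a chi-square distribution with k - 1 degrees of freedom.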

Slide 10: Two Anomalies
[Figure: number of hits to the top 10 pages, with anomaly score and anomaly threshold, over time (hours); scale: 1, 10, 100 hits / 5 minutes]

Slide 11: GUI tool for real-time detection
why need GUI tool?
– build trust of the operators
– why should the operator believe the algorithm? “picture is worth a thousand words”
– manual monitoring/inspection of traffic by operators
– make SLT usable in real life
report 1 warning instead of anomalies every minute
compare:
Most anomalous pages:
/landing.jsp 19.55
/landing_merchant.jsp 19.50
/mall_ctrl.jsp 3.69
/malltop.go 2.63
/mall.go 2.18
warning #3:
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
significance = 7.05
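Reporting one warning rather than a separate anomaly every minute amounts to merging consecutive anomalous minutes into a single event with a start and an end. The sketch below shows one simple way to do that; the merging rule, the `AnomalyWarning` type, and the use of the peak score as a stand-in for the slide's "significance" are all assumptions, not the tool's actual logic.

```python
from dataclasses import dataclass


@dataclass
class AnomalyWarning:
    start: int    # index of the first anomalous minute
    end: int      # index of the last anomalous minute
    peak: float   # highest score seen (stand-in for "significance")


def group_warnings(scores, threshold):
    """Merge runs of consecutive above-threshold minutes into warnings."""
    warnings = []
    current = None
    for minute, score in enumerate(scores):
        if score > threshold:
            if current is None:
                current = AnomalyWarning(minute, minute, score)
            else:
                current.end = minute
                current.peak = max(current.peak, score)
        elif current is not None:
            # The run ended; close the open warning.
            warnings.append(current)
            current = None
    if current is not None:
        warnings.append(current)
    return warnings
```

For per-minute scores `[0.1, 0.9, 0.95, 0.2, 0.1, 0.8]` and a threshold of 0.5, this yields two warnings instead of three separate per-minute anomalies.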

Slide 12: Summary of successful results
October 2003 – broken signup page:
– noticed the problem 7 days earlier + correctly localized!
November 2003 – account page problem:
– 1st warning: 16 hours earlier
– 2nd warning: 1 hour earlier + correctly localized the bad page!
July 2001 – landing looping problem:
– warning 2 days earlier + correctly localized
detected a failure they didn’t tell us about
detected three other significant anomalies
– feedback: “these might have been important, but we didn’t know about them. definitely useful if detected in real-time.”

Slide 13: Oct 2003 – broken signup page (1)

Slide 14: Oct 2003 – broken signup page (2)

Slide 15: Oct 2003 – broken signup page (3)

Slide 16: Oct 2003 – broken signup page (4)

Slide 17: Nov 2003 – account page problem (1)

Slide 18: Nov 2003 – account page problem (2)
[Figure; time axis: 9am to 1pm]

Slide 19: How to evaluate?
information from HT Levine:
– time + root cause of major failures (site down,...)
– time of minor problems (DB alarm,...)
– harmless updates (code push, page update)
scenario:
1. page A pushed at 3:30pm, Monday
2. anomaly on page A at 6pm, Monday
3. mostly ok for next 48 hours
4. site down at 6pm, Wednesday
– would detecting the anomaly on Monday help?

Slide 20: What is a true/false positive?
detected a major/minor problem: GREAT
detected a regular site update: ???
detected a significant anomaly, BUT
– Ebates knows nothing about it
– no major problems at that time
– ???
detected anomalies almost every night
– certainly a false positive

Slide 21: Build a simulator?
site: PetStore, Rubis?
failures: try failures from Ebates
user simulator: based on real logs from Ebates
cons:
– less realistic (how to build a realistic simulator of users?)
pros:
– know exactly what happened in the system (measure TTD)
– try many different failures
– use for evaluating TCQ-based preprocessing

Slide 22: Localization
Naive Bayes better at localization
– likely reason: more sensitive to infrequent pages
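One way to see why a per-page Gaussian model helps with localization: under the independence assumption the interval's log-likelihood is a sum of per-page terms, so each page's term can be ranked on its own. The sketch below is an illustration, not the authors' code. It ranks pages by half the squared z-score, which is the data-dependent term of the negative Gaussian log-likelihood; the variance floor `min_variance` and all function names are assumptions, and the slides do not specify how the final score 1 - Prob(current interval is normal) is derived from these terms.

```python
def fit_gaussians(history, min_variance=1.0):
    """Learn a (mean, variance) pair per page from past intervals.

    history: list of dicts, page -> hits in one past interval.
    min_variance is an assumed floor to avoid division by zero.
    """
    pages = set().union(*history)
    model = {}
    for page in pages:
        values = [interval.get(page, 0) for interval in history]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        model[page] = (mean, max(var, min_variance))
    return model


def page_surprise(model, current):
    """Per-page half squared z-score for the current interval."""
    surprise = {}
    for page, (mean, var) in model.items():
        x = current.get(page, 0)
        surprise[page] = (x - mean) ** 2 / (2 * var)
    return surprise


def most_anomalous_pages(model, current, k=5):
    """Pages with the largest surprise, most anomalous first."""
    surprise = page_surprise(model, current)
    return sorted(surprise, key=surprise.get, reverse=True)[:k]
```

Because each page is scaled by its own variance, a rarely visited page that drops from about 10 hits to 0 can outrank a busy page that moves by a few hits, which is consistent with the slide's remark about sensitivity to infrequent pages.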

Slide 23: Future work
develop better quantitative measures for analysis
GUI tool
– deploy at Ebates
– make available as open source
– could help convince other companies to provide failure data
detect more subtle problems
– harder to detect using current methods
explore HCI aspects of our approach

Slide 24: Conclusions
very simple algorithms can detect serious failures
visualization helps understand the anomaly
have almost-perfect source of failure data
– complete HTTP logs
– operators willing to cooperate
– emails, chat logs from periods of problems
still hard to evaluate

