The Mystery Machine: End-to-end performance analysis of large-scale Internet services Michael Chow David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch.

Slides:



Advertisements
Similar presentations
A Provider-side View of Web Search Response Time
Advertisements

International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.
Trace Analysis Chunxu Tang. The Mystery Machine: End-to-end performance analysis of large-scale Internet services.
CLUE: SYSTEM TRACE ANALYTICS FOR CLOUD SERVICE PERFORMANCE DIAGNOSIS Hui Zhang 1, Junghwan Rhee 1, Nipun Arora 1, Sahan Gamage 2, Guofei Jiang 1, Kenji.
Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.
Analyzing Case Study Evidence
Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison
Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun Minlan Yu, Michael J. Freedman, Jennifer Rexford Princeton University.
Distributed Systems Spring 2009
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of.
McGraw-Hill/Irwin ©2008 The McGraw-Hill Companies, All Rights Reserved SECTION 12.1 PROJECT MANAGEMENT.
Learning Objectives 1 Problem Definition and the Research Process Copyright © 2002 South-Western /Thomson Learning CHAPTER three.
Flow Anomaly Detection in Firewalled Networks Research Report Mike Chapple December 15, 2005.
Delayed Internet Routing Convergence Craig Labovitz, Abha Ahuja, Abhijit Bose, Farham Jahanian Presented By Harpal Singh Bassali.
1 TVA: A DoS-limiting Network Architecture Xiaowei Yang (UC Irvine) David Wetherall (Univ. of Washington) Thomas Anderson (Univ. of Washington)
Adaptive Content Delivery for Scalable Web Servers Authors: Rahul Pradhan and Mark Claypool Presented by: David Finkel Computer Science Department Worcester.
OPERATIONS MANAGEMENT for MBAs Fourth Edition
Winter Retreat Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer
Failure Avoidance through Fault Prediction Based on Synthetic Transactions Mohammed Shatnawi 1, 2 Matei Ripeanu 2 1 – Microsoft Online Ads, Microsoft Corporation.
MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 11 Managing and Monitoring a Windows Server 2008 Network.
Chapter Three Copyright © 2004 John Wiley & Sons, Inc. Problem Definition and the Research Process Copyright © 2004 John Wiley & Sons, Inc. Chapter Three.
Game-based Analysis of Denial-of- Service Prevention Protocols Ajay Mahimkar Class Project: CS 395T.
New Challenges in Cloud Datacenter Monitoring and Management
Enabling Internet “Suspend/Resume” with Session Continuations Alex C. Snoeren MIT Laboratory for Computer Science (with Hari Balakrishnan, Frans Kaashoek,
Integrated Social and Quality of Service Trust Management of Mobile Groups in Ad Hoc Networks Ing-Ray Chen, Jia Guo, Fenye Bao, Jin-Hee Cho Communications.
Towards Modeling Legitimate and Unsolicited Traffic Using Social Network Properties 1 Towards Modeling Legitimate and Unsolicited Traffic Using.
Project Management An overview. What is a Project A temporary job to accomplish a specific task A temporary job to accomplish a specific task Attributes.
Accelerating Mobile Applications through Flip-Flop Replication
Module 8: Risk Management, Monitoring and Project Control We would like to acknowledge the support of the Project Management Institute and the International.
 Zhichun Li  The Robust and Secure Systems group at NEC Research Labs  Northwestern University  Tsinghua University 2.
Learning Objective Chapter 2 The Marketing Research Process and the Management of Marketing Research CHAPTER two The Marketing Research Process and the.
McGraw-Hill/Irwin © 2006 The McGraw-Hill Companies, Inc. All rights reserved. BUSINESS DRIVEN TECHNOLOGY Business Plug-In B10 Project Management.
UAB Dynamic Monitoring and Tuning in Multicluster Environment Genaro Costa, Anna Morajko, Paola Caymes Scutari, Tomàs Margalef and Emilio Luque Universitat.
Towards An Open Data Set for Trace-Oriented Monitoring Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Michael R. Lyu 1,2 1 National University.
BUSINESS PLUG-IN B15 Project Management.
Organizational Psychology: A Scientist-Practitioner Approach Jex, S. M., & Britt, T. W. (2014) Prepared by: Christopher J. L. Cunningham, PhD University.
Astro / Geo / Eco - Sciences Illustrative examples of success stories: Sloan digital sky survey: data portal for astronomy data, 1M+ users and nearly 1B.
Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.
College of Computer National University of Defense Technology Jingwen Zhou, Zhenbang Chen, Haibo Mi, and Ji Wang {jwzhou, This work.
Copyright  2003 by Dr. Gallimore, Wright State University Department of Biomedical, Industrial Engineering & Human Factors Engineering Human Factors Research.
Science Process Skills Vocabulary 8/17/15. Predicting Forming an idea of an expected result. Based on inferences.
CSC480 Software Engineering Lecture 10 September 25, 2002.
Allen D. Malony Department of Computer and Information Science TAU Performance Research Laboratory University of Oregon Discussion:
Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha.
D u k e S y s t e m s Asynchronous Replicated State Machines (Causal Multicast and All That) Jeff Chase Duke University.
MORTALITY CHANGE THE DETAILS ARE MESSY Year to year decline irregular Persistent, puzzling differentials Cause of death structure difficult to understand.
Information System Project Management.  Some problems that org faced with IS dev efforts include schedule delays, cost overrun, less functionality than.
SCIENCE There is a method to the madness!! SCIENTIFIC METHOD State the Problem State the Problem Gather Information Gather Information Form a Hypothesis.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Measurement in the Internet Measurement in the Internet Paul Barford University of Wisconsin - Madison Spring, 2001.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems and Models Chapter 03.
INTEGRATED SCIENCE PROCESS SKILLS BASIC SCIENCE PROCESS SKILLS
Exposure Assessment for Health Effect Studies: Insights from Air Pollution Epidemiology Lianne Sheppard University of Washington Special thanks to Sun-Young.
03/03/051 Performance Engineering of Software and Distributed Systems Research Activities at IIT Bombay Varsha Apte March 3 rd, 2005.
Chapter 1 Database Access from Client Applications.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Network Weather Service. Introduction “NWS provides accurate forecasts of dynamically changing performance characteristics from a distributed set of metacomputing.
1 / 21 Providing Differentiated Services from an Internet Server Xiangping Chen and Prasant Mohapatra Dept. of Computer Science and Engineering Michigan.
Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,
Best detection scheme achieves 100% hit detection with
A Validation System for the Complex Event Processing Directives of the ATLAS Shifter Assistant Tool G. Anders (CERN), G. Avolio (CERN), A. Kazarov (PNPI),
The Tail at Scale Dean and Barroso, CACM Feb 2013 Presenter: Chien-Ying Chen (cchen140) CS538 - Advanced Computer Networks 1.
Jacob R. Lorch Microsoft Research
Measurement-based Design
DQBarge: Improving data-quality
Quantifying Scale and Pattern Lecture 7 February 15, 2005
CHAPTER two The Marketing Research Process and the Management of
PMI Central Alabama Chapter Meeting 9/18/2018
Request Behavior Variations
Architectures of distributed systems
Presentation transcript:

The Mystery Machine: End-to-end performance analysis of large-scale Internet services Michael Chow David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch

Internet services are complex Michael Chow2 Tasks Time Scale and heterogeneity make Internet services complex

Internet services are complex Michael Chow3 Tasks Time Scale and heterogeneity make Internet services complex

Analysis Pipeline Michael Chow4

Step 1: Identify segments Michael Chow5

Step 2: Infer causal model Michael Chow6

Step 3: Analyze individual requests Michael Chow7

Step 4: Aggregate results Michael Chow8

Challenges Previous methods derive a causal model – Instrument scheduler and communication – Build model through human knowledge Michael Chow9 Need method that works at scale with heterogeneous components

Opportunities Component-level logging is ubiquitous Handle a large number of requests Michael Chow10 Tremendous detail about a request’s execution Coverage of a large range of behaviors

The Mystery Machine 1) Infer causal model from large corpus of traces – Identify segments – Hypothesize all possible causal relationships – Reject hypotheses with observed counterexamples 2) Analysis – Critical path, slack, anomaly detection, what-if Michael Chow11

Step 1: Identify segments Michael Chow12

Define a minimal schema Michael Chow13 Segment 1Segment 2 Task Event

Define a minimal schema Michael Chow14 Segment 1Segment 2 Task Event Request identifier Machine identifier Timestamp Task Event Aggregate existing logs using minimal schema

Step 2: Infer causal model Michael Chow15

Types of causal relationships Michael Chow16 RelationshipCounterexample Happens-Before Mutual Exclusion Pipeline B B A A A A B B B B A A A A B B OR B B A A B B A A C C B’ A’ C’ B B A A C C B’ C’ A’ t1t1 t2t2 t1t1 t2t2

Producing causal model Michael Chow17 Causal Model S1N1C1 S2N2C2

Producing causal model Michael Chow18 Causal Model S1N1C1 S2N2C2 C1N1S1 C2N2S2 Trace 1 Time

Producing causal model Michael Chow19 Causal Model S1N1C1 S2N2C2 C1N1S1 C2N2S2 Trace 1 Time

Producing causal model Michael Chow20 Causal Model S1N1C1 S2N2C2 C1N1S1 C2N2S2 Time Trace 2

Producing causal model Michael Chow21 Causal Model S1N1C1 S2N2C2 C1N1S1 C2N2S2 Time Trace 2

Producing causal model Michael Chow22 Causal Model S1N1C1 S2N2C2 C1N1S1 C2N2S2 Time Trace 2

Step 3: Analyze individual requests Michael Chow23

Critical path using causal model Michael Chow24 S1N1C1 S2N2C2 C1N1S1 C2N2S2 Trace 1 Time

Critical path using causal model Michael Chow25 S1N1C1 S2N2C2 C1N1S1 C2N2S2 Trace 1

Critical path using causal model Michael Chow26 S1N1C1 S2N2C2 C1N1 S1 C2N2S2 Trace 1

Step 4: Aggregate results Michael Chow27

Inaccuracies of Naïve Aggregation Michael Chow28

Inaccuracies of Naïve Aggregation Michael Chow29

Inaccuracies of Naïve Aggregation Michael Chow30 Need a causal model to correctly understand latency

High variance in critical path Breakdown in critical path shifts drastically – Server, network, or client can dominate latency Michael Chow31 ServerNetwork Client Percent of critical path cdf

High variance in critical path Breakdown in critical path shifts drastically – Server, network, or client can dominate latency Michael Chow32 ServerNetwork Client Percent of critical path cdf 20% of requests, server contributes 10% of latency

High variance in critical path Breakdown in critical path shifts drastically – Server, network, or client can dominate latency Michael Chow33 ServerNetwork Client Percent of critical path cdf 20% of requests, server contributes 50% or more of latency

Diverse clients and networks Michael Chow34 Server Network Client Server Network Client Server Network Client

Diverse clients and networks Michael Chow35 Server Network Client Server Network Client Server Network Client

Diverse clients and networks Michael Chow36 Server Network Client Server Network Client Server Network Client

Differentiated service Michael Chow37 Deliver data when needed and reduce average response time No slack in server generation time Produce data faster Decrease end-to-end latency Slack in server generation time Produce data slower End-to-end latency stays same

Additional analysis techniques Slack analysis What-if analysis – Use natural variation in large data set Michael Chow38 C1N1 S1 C2N2S2 Time

What-if questions Does server generation time affect end-to-end latency? Can we predict which connections exhibit server slack? Michael Chow39

Server slack analysis Michael Chow40 Slack < 25msSlack > 2.5s

Server slack analysis Michael Chow41 Slack < 25msSlack > 2.5s End-to-end latency increases as server generation time increases Server generation time has little effect on end- to-end latency

Predicting server slack Predict slack at the receipt of a request Past slack is representative of future slack Michael Chow42 First slack (ms) Second slack (ms)

Predicting server slack Predict slack at the receipt of a request Past slack is representative of future slack Michael Chow43 Classifies 83% of requests correctly Type II Error 9% Type I Error 8% First slack (ms) Second slack (ms)

Conclusion Michael Chow44

Questions Michael Chow45