The Mystery Machine: End-to-end performance analysis of large-scale Internet services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
– Scale and heterogeneity make Internet services complex
– (figure: many concurrent tasks executing over time)
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
Previous methods derive a causal model by:
– Instrumenting the scheduler and communication
– Building the model through human knowledge
Need a method that works at scale with heterogeneous components
Opportunities
Component-level logging is ubiquitous:
– Handles a large number of requests
– Tremendous detail about a request's execution
– Coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from a large corpus of traces
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analysis
– Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
– Each log entry records: request identifier, machine identifier, timestamp, task, event
– A segment is the interval between consecutive events of a task
– Aggregate existing logs using the minimal schema
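The minimal schema above can be sketched in code. This is an illustrative sketch, not the paper's implementation; the class and function names (`LogEvent`, `segments`) are assumptions for the example.

```python
# A minimal sketch of the talk's log schema: every component log entry is
# mapped onto (request id, machine id, timestamp, task, event), and a
# segment is the interval between consecutive events of the same task.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    request_id: str
    machine_id: str
    timestamp: float  # seconds
    task: str         # e.g. "server", "network", "client"
    event: str        # e.g. "start", "end"

def segments(events):
    """Group events by (request, task) and pair consecutive events into
    (request, task, from_event, to_event, start, end) segments."""
    by_task = defaultdict(list)
    for e in events:
        by_task[(e.request_id, e.task)].append(e)
    segs = []
    for (req, task), evs in by_task.items():
        evs.sort(key=lambda e: e.timestamp)
        for a, b in zip(evs, evs[1:]):
            segs.append((req, task, a.event, b.event, a.timestamp, b.timestamp))
    return segs
```

Because only these five fields are needed, existing heterogeneous logs can be aggregated into this form without re-instrumenting each component.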
Step 2: Infer causal model
Types of causal relationships (each hypothesis is rejected by an observed counterexample):
– Happens-before: A always completes before B starts
– Mutual exclusion: A and B never overlap
– Pipeline: the ordering of the A's is preserved in the corresponding B's
(figure: timeline examples of each relationship and its counterexample)
Producing causal model
– Begin with all hypothesized relationships among segments (C1, N1, S1, C2, N2, S2)
– Trace 1 refutes some hypothesized edges; Trace 2 refutes more
– Edges that survive all observed traces form the causal model
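The refinement loop above can be sketched for the happens-before case. This is a simplified illustration under the assumption that each trace maps a segment name to its (start, end) timestamps; the function name and trace shape are not from the paper.

```python
# A minimal sketch of causal-model inference by counterexample rejection:
# hypothesize A happens-before B for every ordered pair of segments, then
# discard any hypothesis contradicted by an observed trace.
from itertools import permutations

def infer_happens_before(traces):
    """traces: list of {segment_name: (start, end)} dicts."""
    names = set().union(*(t.keys() for t in traces))
    hypotheses = set(permutations(names, 2))  # all possible orderings
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                a_end, b_start = trace[a][1], trace[b][0]
                if a_end > b_start:  # counterexample: A did not finish before B began
                    hypotheses.discard((a, b))
    return hypotheses

# Example: two traces over client (C1), network (N1), server (S1) segments.
traces = [
    {"C1": (0, 1), "N1": (1, 2), "S1": (2, 3)},
    {"C1": (0, 2), "N1": (2, 4), "S1": (4, 5)},
]
model = infer_happens_before(traces)
```

With a large corpus of traces, spurious orderings that happen to hold in a few traces are eventually refuted, which is why the talk emphasizes inferring the model from many requests rather than a handful.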
Step 3: Analyze individual requests
Critical path using causal model
– Apply the causal model to an individual trace (segments C1, N1, S1, C2, N2, S2)
– The critical path is the chain of dependent segments that determines end-to-end latency
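Given a trace's per-segment latencies and the inferred happens-before edges, the critical path is a longest path over the dependency DAG. A minimal sketch (the function name and input shapes are illustrative, not from the paper):

```python
# A minimal sketch of critical-path computation: the dependency chain with
# the largest total latency determines end-to-end latency.
import functools

def critical_path(durations, edges):
    """durations: {segment: latency}; edges: set of (pred, succ) pairs.
    Returns (path as a list of segments, total critical-path latency)."""
    preds = {s: [a for (a, b) in edges if b == s] for s in durations}

    @functools.lru_cache(maxsize=None)
    def finish(seg):
        # Earliest finish time given all predecessors must complete first.
        start = max((finish(p) for p in preds[seg]), default=0.0)
        return start + durations[seg]

    end = max(durations, key=finish)
    # Walk backwards along the latest-finishing predecessor to recover the path.
    path = [end]
    while preds[path[-1]]:
        path.append(max(preds[path[-1]], key=finish))
    return list(reversed(path)), finish(end)
```

Segments off the critical path can slow down without affecting end-to-end latency, which is exactly the slack the later slides analyze.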
Step 4: Aggregate results
Inaccuracies of Naïve Aggregation
– Need a causal model to correctly understand latency
High variance in critical path
– Breakdown in critical path shifts drastically: server, network, or client can dominate latency
– (figure: CDF of percent of critical path for server, network, and client)
– For 20% of requests, the server contributes only 10% of latency
– For another 20% of requests, the server contributes 50% or more of latency
Diverse clients and networks
(figure: server, network, and client breakdowns for three different requests)
Differentiated service
Deliver data when needed and reduce average response time:
– No slack in server generation time: produce data faster to decrease end-to-end latency
– Slack in server generation time: produce data slower; end-to-end latency stays the same
Additional analysis techniques
– Slack analysis
– What-if analysis: use natural variation in the large data set
(figure: segment timeline C1, N1, S1, C2, N2, S2)
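Slack analysis can be sketched as a second pass over the same dependency DAG: a segment's slack is how much its latency could grow without lengthening the critical path. This is an illustrative sketch with assumed names (`slacks`, the input shapes), not the paper's code.

```python
# A minimal sketch of slack analysis: slack(s) = (critical-path length)
# minus (longest dependency chain passing through s).
from functools import lru_cache

def slacks(durations, edges):
    """durations: {segment: latency}; edges: set of (pred, succ) pairs."""
    preds = {s: [a for (a, b) in edges if b == s] for s in durations}
    succs = {s: [b for (a, b) in edges if a == s] for s in durations}

    @lru_cache(maxsize=None)
    def before(s):  # longest chain ending with s (inclusive of s)
        return max((before(p) for p in preds[s]), default=0.0) + durations[s]

    @lru_cache(maxsize=None)
    def after(s):   # longest chain strictly after s
        return max((after(t) + durations[t] for t in succs[s]), default=0.0)

    total = max(before(s) for s in durations)
    return {s: total - (before(s) + after(s)) for s in durations}
```

Segments on the critical path have zero slack; a segment with large slack (for example, early server generation on a slow connection) can be deliberately slowed without hurting end-to-end latency, which is what enables the differentiated-service idea on the previous slide.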
What-if questions
– Does server generation time affect end-to-end latency?
– Can we predict which connections exhibit server slack?
Server slack analysis
– Slack < 25 ms: end-to-end latency increases as server generation time increases
– Slack > 2.5 s: server generation time has little effect on end-to-end latency
Predicting server slack
– Predict slack at the receipt of a request
– Past slack is representative of future slack (scatter: first slack vs. second slack, in ms)
– Classifies 83% of requests correctly; Type I error 8%, Type II error 9%
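The predictor above can be sketched as a simple threshold classifier on a connection's previously observed slack. The 25 ms threshold echoes the bucket shown two slides earlier; everything else here (function names, evaluation shape) is an assumption for illustration, not the paper's actual classifier.

```python
# A minimal sketch of slack prediction: classify a request as "high slack"
# if the same connection's previous slack exceeded a threshold.
HIGH_SLACK_MS = 25.0  # assumed threshold, mirroring the slide's bucket

def predict_high_slack(past_slack_ms):
    """True if past slack suggests the next request will also have slack
    (so the server could safely generate data more slowly)."""
    return past_slack_ms > HIGH_SLACK_MS

def evaluate(pairs):
    """pairs: list of (first_slack_ms, second_slack_ms) per connection.
    Returns (accuracy, type_i_rate, type_ii_rate) of predicting the
    second request's slack class from the first."""
    correct = type_i = type_ii = 0
    for first, second in pairs:
        pred = predict_high_slack(first)
        actual = second > HIGH_SLACK_MS
        if pred == actual:
            correct += 1
        elif pred and not actual:
            type_i += 1   # predicted slack, but there was none
        else:
            type_ii += 1  # missed real slack
    n = len(pairs)
    return correct / n, type_i / n, type_ii / n
```

On the data in the talk, this style of history-based prediction classifies 83% of requests correctly with roughly symmetric Type I and Type II error rates.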
Conclusion
Questions