Hotspot Detection in a Service Oriented Architecture
Pranay Anchuri, Rensselaer Polytechnic Institute, Troy, NY
Roshan Sumbaly, Coursera, Mountain View, CA
Sam Shah, LinkedIn, Mountain View, CA
Introduction
Largest professional network. 300M members from 200 countries. 2 new members per second.
Service Oriented Architecture
What is a Hotspot? Hotspot: a service responsible for suboptimal performance of a user-facing functionality. Performance measures: latency, cost to serve, error rate.
Who uses hotspot detection? Engineering teams: minimize latency for the user and increase the throughput of the servers. Operations teams: reduce the cost of serving user requests.
Goal
Data: Service Call Graphs. Service call metrics are logged into a central system. The call graph structure is reconstructed from the random trace ID.
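A minimal sketch of how call graphs might be rebuilt from centrally logged call records, assuming each record carries the trace ID plus the caller and callee names. The field names `trace_id`, `caller`, and `callee` and the service names are hypothetical; the slides do not describe the log schema.

```python
from collections import defaultdict


def build_call_graphs(records):
    """Group logged calls by trace ID and return one adjacency map per request."""
    graphs = defaultdict(lambda: defaultdict(list))
    for rec in records:
        graphs[rec["trace_id"]][rec["caller"]].append(rec["callee"])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {tid: dict(edges) for tid, edges in graphs.items()}


# Hypothetical log records for two requests, interleaved as they might arrive.
records = [
    {"trace_id": "t1", "caller": "frontend", "callee": "profile"},
    {"trace_id": "t1", "caller": "profile", "callee": "content"},
    {"trace_id": "t2", "caller": "frontend", "callee": "profile"},
    {"trace_id": "t1", "caller": "profile", "callee": "visibility"},
]
graphs = build_call_graphs(records)
```

Keying edges by service name is a simplification: if the same service is called more than once in a request, a per-call identifier would be needed to keep the calls apart.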
Example of Service Call Graph [Figure: call graph for a Read profile request with nodes Content Service, Context Service, Content Service, Entitlements, Visibility]
Challenges in mining hotspots
Structure of call graphs: The structure of call graphs changes rapidly across requests. It depends on the member's attributes, A/B testing, and changes to the code base. Over 90% of structures are unique for the most requested services.
Asynchronous service calls: Calls A → B and A → C are either serial (C is called after B returns to A) or parallel (B and C are called at the same time or within a brief time span). Parallel service calls are particularly difficult to handle. The degree of parallelism is about 20 for some services.
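The serial/parallel distinction can be illustrated with a small sketch, assuming each subcall is recorded as a `(start, end)` interval. The function names and the `eps` threshold for "a brief time span" are hypothetical, not taken from the slides.

```python
def relation(b, c, eps=0.0):
    """Classify two sibling calls, given as (start, end), as serial or parallel."""
    first, second = sorted([b, c])
    # Parallel: the second call starts before the first returns,
    # or the two calls start within a brief time span of each other.
    if second[0] < first[1] or second[0] - first[0] <= eps:
        return "parallel"
    return "serial"


def max_parallelism(calls):
    """Largest number of sibling calls in flight at the same time."""
    events = []
    for start, end in calls:
        events.append((start, 1))
        events.append((end, -1))
    # Tuples sort ends (-1) before starts (+1) at equal timestamps,
    # so back-to-back calls are not counted as overlapping.
    events.sort()
    in_flight = best = 0
    for _, delta in events:
        in_flight += delta
        best = max(best, in_flight)
    return best
```

For instance, `relation((0, 3), (4, 11))` is serial because the second call starts after the first returns, while `relation((1, 2), (1.5, 3))` is parallel.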
Related Work: Hu et al. [SIGCOMM 04, INFOCOM 05]: tools to detect bottlenecks along network paths. Mann et al. [USENIX 11]: models to estimate latency as a function of RPC latencies.
Why don't existing methods work? The metric cannot be controlled, as it is in bottleneck detection algorithms. Millions of small networks must be analyzed. Service calls can be parallel.
Our approach
Optimize and summarize approach ● Given call graphs ● Hotspots in each call graph ● Ranking hotspots
What are the top-k hotspots in a call graph? Hotspots in a specific call graph, irrespective of other call graphs for the same type of request.
Key Idea: Which k services, had they already been optimized, would have led to the maximum reduction in the latency of the request? (Specific to a particular call graph.)
Quantifying impact of a service: What if a service was optimized by θ? (Think after the fact.) Its internal computations are θ times faster. There is no effect on the overall latency if its parent is waiting on another service call to return.
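A rough sketch of the local part of this "what if" question, assuming a service's internal computation time is its own interval minus the time covered by its subcalls. That definition and the helper names `union_length`, `self_time`, and `local_saving` are assumptions made for illustration.

```python
def union_length(intervals):
    """Total time covered by a set of (start, end) intervals, counting overlaps once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def self_time(interval, child_intervals):
    """Time a service spends on its own computation rather than waiting on subcalls."""
    start, end = interval
    return (end - start) - union_length(child_intervals)


def local_saving(interval, child_intervals, theta):
    """Time saved inside the service if its internal computations run theta times faster."""
    return self_time(interval, child_intervals) * (1.0 - 1.0 / theta)
```

A service running over [6, 9] with one subcall over [7, 8] has 2 units of internal computation, so a 2x speedup saves 1 unit locally. Whether that saving reaches the frontend depends on whether the parent is waiting on other calls.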
Example [Figure: call graph annotated with service intervals [0,11], [0,3], [1,2], [1.3, 1.6], [2.1, 2.5], [4,11], [6,9], [7,8]; one service is made 2x faster to show the effect of a 2x speedup]
Local effect of optimization
Negative example [Figure: call graph annotated with service intervals [0,11], [0,3], [1,2], [1.3, 1.6], [2.1, 2.5], [4,11], [6,9], [7,8]]
Effect propagation (A → B → C): Optimizing C reduces the run time of C. C returns to B earlier. B might return earlier. A might return earlier…
Propagation Assumption: A service propagates the effects to its parent only if doing so doesn't change the order of service calls made by the parent.
Example [Figure]
Example: After optimization [Figure]
Example: Under the propagation assumption [Figure]
Relaxation: A variation of the propagation assumption that allows a service to propagate fractional effects to its parent. It leads to a greedy algorithm.
Greedy algorithm to compute top-k hotspots: Given an optimization factor θ, repeatedly select the service that has the maximum impact on the frontend service. Update the times after each selection. Stop after k iterations.
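The greedy selection can be sketched on a deliberately simplified latency model. The model is an assumption, not the paper's formulation: each service runs a sequence of stages, where a stage is either local computation time or a group of subcalls issued in parallel, and "updating the times" is done by re-evaluating the frontend latency with the services chosen so far. The names `Service`, `latency`, and `greedy_hotspots` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    # Stages run one after another. A number is local computation time;
    # a list of Service objects is a group of subcalls issued in parallel.
    stages: list = field(default_factory=list)


def latency(node, optimized, theta):
    """Latency of `node` when services named in `optimized` compute theta times faster."""
    factor = theta if node.name in optimized else 1.0
    total = 0.0
    for stage in node.stages:
        if isinstance(stage, list):
            # The caller waits for the slowest call in a parallel group.
            total += max(latency(child, optimized, theta) for child in stage)
        else:
            total += stage / factor
    return total


def service_names(node):
    names = {node.name}
    for stage in node.stages:
        if isinstance(stage, list):
            for child in stage:
                names |= service_names(child)
    return names


def greedy_hotspots(root, k, theta):
    """Pick up to k services, each time taking the one that most reduces root latency."""
    chosen = []
    candidates = service_names(root)
    for _ in range(k):
        current = latency(root, set(chosen), theta)
        best, best_gain = None, 0.0
        for name in sorted(candidates - set(chosen)):
            gain = current - latency(root, set(chosen) | {name}, theta)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining service improves the frontend latency
            break
        chosen.append(best)
    return chosen


# Hypothetical request: frontend computes, calls B and C in parallel, computes again.
root = Service("frontend", [1.0, [Service("B", [4.0]), Service("C", [2.0])], 1.0])
top2 = greedy_hotspots(root, k=2, theta=2.0)
```

In this toy request, speeding up C alone changes nothing because the frontend is still waiting on B, which matches the observation that an optimization has no effect when the parent is waiting on another call.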
Ranking hotspots
Rest of the paper: A similar approach is applied to the cost-of-request metric. A generalized framework for optimizing arbitrary metrics. Other ranking schemes.
Results
Dataset
Request type | Avg # of call graphs per day* | Avg # of service calls per request | Avg # of subcalls per service | Max # of parallel subcalls
Home | 10.2 M | | |
Mailbox | 3.33 M | | |
Profile | 3.14 M | | |
Feed | 1.75 M | | |
* Scaled down by a constant factor
vs Baseline algorithm
User of the system
Impact of improvement factor
Consistency over a time period
Conclusion
Conclusions: Defined hotspots in service oriented architectures. Framework to mine hotspots w.r.t. various performance metrics. Experiments on real-world, large-scale datasets.
Thanks. Questions?