
1 NEW TECHNIQUES TO CURTAIL THE TAIL LATENCY IN STREAM PROCESSING SYSTEMS
Guangxiang Du*, Indranil Gupta
Department of Computer Science, University of Illinois, Urbana-Champaign
*Google (work done at UIUC)

2 Motivation
For stream processing systems, latency is critical.
Most existing work (e.g., traffic-aware scheduling and elastic scaling) focuses on lowering the average latency.
Some applications, such as interactive web services and security-related applications, require low tail latency.

3 Contributions of Our Work
We propose three techniques to lower tail latency in stream processing systems:
Adaptive Timeout Strategy
Improved Concurrency Model for the worker process
Latency feedback-based Load Balancing
We implement these three techniques in Apache Storm, one of the most popular stream processing systems, and evaluate them with a set of micro-benchmarks as well as real-world topologies.

4 System Model
[Figure: example topology, a directed acyclic graph of operators op1-op5]
Storm, Flink and Samza all fit this system model.
Topology: a directed acyclic graph (DAG) of operators.
Operators are stateless.
Each operator is split into multiple tasks.
Data flows through the topology in discrete units called tuples.
Operators are connected by shuffle-grouping streams.
Shuffle-grouping: tuples arriving at an operator are spread randomly across its constituent tasks (sketched below).
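A minimal sketch of shuffle-grouping under this model, assuming a hypothetical Task class with its own input queue (Storm's real routing lives inside its grouping implementation):

```python
import random
from collections import deque

class Task:
    """Hypothetical task of an operator, with a private input queue."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()

def shuffle_group(tup, downstream_tasks):
    # Shuffle-grouping: route the tuple to a uniformly random
    # constituent task of the downstream operator.
    random.choice(downstream_tasks).queue.append(tup)

tasks = [Task(f"op2_task{i}") for i in range(3)]
for i in range(9000):
    shuffle_group(("tuple", i), tasks)
print([len(t.queue) for t in tasks])  # roughly even, ~3000 each
```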

5 Adaptive Timeout Strategy
[Figure: tuples t1-t8 flowing from Op1 (source) through Op2 and Op3 to Op4 (sink)]
Storm has a built-in mechanism to guarantee message processing: if a tuple has not been completely processed within a timeout, it is replayed. However, the timeout is fixed and specified by users.
We propose to adjust the timeout adaptively, to catch and replay straggler tuples promptly.
Each tuple emitted from the source operator carries a timeout value.

6 Adaptive Timeout Strategy Contd
At moment t_i, set the timeout value for period P_i based on statistics of tuple latency in P_{i-1}.
Intuition: continuously collect statistics of tuple latency, and periodically adjust the timeout based on the latency distribution of recently issued tuples.
Based on how long the tail was in the last period, decide how aggressively to set the timeout, using heuristic rules. For example:
if (99th latency in P_{i-1}) > 2 * (90th latency in P_{i-1}),
then set the timeout for P_i to (90th latency in P_{i-1}).
Why: if the tail is very long, we set the timeout aggressively low; otherwise we set it conservatively to avoid unnecessary replays. (Whether a tail is long or short is a relative notion, e.g., how large the 99th percentile latency is relative to the 90th percentile.)
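A minimal sketch of this heuristic in Python. The 2x threshold and the 99th/90th percentile rule come from the example above; the conservative fallback to the 99th percentile is an assumption, and the actual system may use a richer rule set:

```python
def next_timeout(latencies_prev_period):
    """Choose the timeout for period P_i from tuple latencies seen in P_{i-1}.

    If the tail is long relative to the body (99th > 2 * 90th percentile),
    set the timeout aggressively at the 90th percentile; otherwise fall
    back conservatively to the 99th percentile (an assumed default).
    """
    xs = sorted(latencies_prev_period)
    p90 = xs[int(0.90 * (len(xs) - 1))]
    p99 = xs[int(0.99 * (len(xs) - 1))]
    if p99 > 2 * p90:
        return p90   # long tail: replay stragglers promptly
    return p99       # short tail: avoid unnecessary replays
```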

7 Improved Concurrency Model For Worker Process
[Figure: tasks t1-t8 of Op2 and Op3 spread across worker processes W1-W3, between Op1 (source) and Op4 (sink)]
That was our first technique; now we move to the second.
In Storm and Flink, by default each task/executor has an independent queue to buffer its incoming tuples.
The improved concurrency model merges the input queues of the tasks hosted in the same worker process, which reduces queueing delay: a task, whenever free, grabs the next available tuple from the shared input queue.
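A minimal sketch of the shared-queue model using Python threads; start_worker_process and process_fn are hypothetical names for illustration, not Storm APIs:

```python
import queue
import threading

def start_worker_process(num_tasks, process_fn):
    """Run num_tasks task threads that all pull from one shared input
    queue, so no tuple waits behind a busy task while another is idle."""
    shared_q = queue.Queue()

    def task_loop():
        while True:
            tup = shared_q.get()     # blocks until a tuple is available
            if tup is None:          # sentinel: shut this task down
                break
            process_fn(tup)

    for _ in range(num_tasks):
        threading.Thread(target=task_loop, daemon=True).start()
    return shared_q                  # upstream enqueues tuples here
```

Upstream code would call shared_q.put(tup) for each arriving tuple, and put one None per task to shut the worker down.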

8 Improved Concurrency Model For Worker Process Contd
In an M/M/c queue model:
λ: the queue's input rate
μ: each server's service rate
c: the number of servers for the queue
ρ = λ / (c μ): the utilization of the queue
Qt_avg: the average queueing time
For a given queue utilization ρ, increasing the number of servers c lowers the average queueing delay.
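The claim can be checked numerically with the standard Erlang C formula for the M/M/c average queueing time (textbook queueing theory, not code from the paper). Holding ρ fixed while increasing c shrinks Qt_avg:

```python
from math import factorial

def avg_queueing_time(lam, mu, c):
    """Average queueing time in an M/M/c queue via the Erlang C formula."""
    a = lam / mu                    # offered load
    rho = a / c                     # utilization; must be < 1 for stability
    assert rho < 1, "unstable queue"
    tail = a**c / (factorial(c) * (1 - rho))
    p_wait = tail / (sum(a**k / factorial(k) for k in range(c)) + tail)
    return p_wait / (c * mu - lam)

# Fix utilization rho = 0.8 and scale the arrival rate with c:
mu = 1.0
for c in (1, 4, 20):
    lam = 0.8 * c * mu
    print(c, round(avg_queueing_time(lam, mu, c), 3))
# c=1: 4.0, c=4: ~0.75, c=20: ~0.06 -- more servers, less waiting
```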

9 Latency-based Load Balancing
[Figure: tasks t1-t8 in workers W1-W3; each of an operator's three downstream tasks receives 33.33% of the traffic]
That was our second technique; now we move to the third.
Many stream processing systems run under heterogeneous conditions: the machines (or VMs) may be heterogeneous, the task assignment may be heterogeneous (machines host different numbers of tasks), and so on.
As a result, some tasks are faster than others within the same operator. Partitioning the incoming stream of tuples uniformly across tasks thus exacerbates the tail latency.

10 Latency-based Load Balancing Contd
Goal: faster tasks process more work and slower tasks process less, so that all tasks exhibit roughly the same latency.
[Figure: upstream tasks shifting traffic shares away from the uniform 33.33% split toward faster downstream tasks]
Key idea: each task (except sinks) periodically collects the latency of its immediate downstream tasks as feedback and sorts them from fastest to slowest. The fastest task is paired with the slowest, the 2nd fastest with the 2nd slowest, and so on.
Within each pair, the algorithm shifts 1% of the upstream task's outgoing traffic from the slower task to the faster one until the latencies of the two are close enough. This smooth, incremental adjustment suppresses load oscillation. A sketch follows below.
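A minimal sketch of one feedback round at an upstream task. The 1% shift per adjustment comes from the slide; the tol closeness test and the weight representation are assumptions for illustration:

```python
def rebalance(weights, latencies, step=0.01, tol=0.05):
    """One round of latency-feedback load balancing at an upstream task.

    weights:   fraction of outgoing traffic sent to each downstream task
    latencies: latest latency reported by each downstream task
    Pairs the fastest task with the slowest, the 2nd fastest with the
    2nd slowest, and so on, then shifts `step` (1%) of traffic within
    each pair whose latencies are not yet within `tol` of each other.
    """
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    for k in range(len(order) // 2):
        fast, slow = order[k], order[-1 - k]
        if latencies[slow] > (1 + tol) * latencies[fast] and weights[slow] >= step:
            weights[slow] -= step   # slower task receives less work
            weights[fast] += step   # faster task absorbs the difference
    return weights

print(rebalance([1/3, 1/3, 1/3], latencies=[10.0, 25.0, 12.0]))
# shifts 1% of traffic from the slowest task to the fastest
```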

11 Evaluation: Experimental Setup (Google Compute Engine)
We implement our techniques in Apache Storm and evaluate them.
1 VM for Nimbus & Zookeeper; 5 VMs for worker nodes. By default, each worker node runs one worker process.

Nodes    Machine configuration                      Role
1 VM     n1-standard-1 (1 vCPU, 3.75 GB memory)     Zookeeper & Nimbus
5 VMs    n1-standard-2 (2 vCPUs, 7.5 GB memory)     Worker node

12 Evaluation: Adaptive Timeout Strategy
4-operator "Exclamation Topology" from the Storm examples.
Comparison of the adaptive timeout strategy against different levels of replication:

Approach            99th latency (ms)    99.9th latency (ms)    Cost
default             29.2                 76.6                   --
adaptive timeout    24.1                 66.4                   2.92%
20% replication     25.5                 87.8                   20%
50% replication     22.1                 107.7                  50%
100% replication    17.9                 78.1                   100%

Why replication: it is a popular, well-known approach for cutting tail latency in networked systems and search engines, and both replication and the adaptive timeout strategy benefit from speculative execution.

13 Evaluation: Improved Concurrency For Worker Process
Micro-topology in which a spout connects to a bolt through a shuffle-grouping stream. The bolt has 20 tasks; each worker hosts 4 tasks.
Average queueing delay drops from 2.07 ms to [...] ms. The 90th, 99th and 99.9th percentile latencies improve by 3.49 ms (35.5%), 3.94 ms (24.9%) and 30.1 ms (36.2%), respectively.

14 Evaluation: Latency-based Load Balancing
Three kinds of heterogeneous scenarios:
Different Storm workers are assigned different numbers of tasks.
A subset of Storm workers compete for resources with external processes.
Storm workers are deployed on a cluster of heterogeneous VMs.

15 Evaluation: Latency-based Load Balancing Contd
Overall effect: load shifts gradually from slower tasks to faster tasks, achieving latency balance among the tasks of each operator.

Percentile       Improvement
90th latency     2.2% - 56%
99th latency     21.4% - [...]
99.9th latency   25% - [...]

16 Qualitative Conditions for the Techniques
When should we apply which technique? To find out, we conducted two experiments.
First: given a topology, vary the tasks' input-queue utilization and observe its effect on the adaptive timeout strategy and the improved concurrency model.
Improved concurrency model: there is a positive correlation between the tasks' input-queue utilization and its improvement in tail latency.
Adaptive timeout strategy: its improvement in tail latency weakens beyond a certain input-queue utilization.

17 Qualitative Conditions for the Techniques Contd
Second: vary the system workload and observe its effect on the latency-based load balancing and the adaptive timeout strategy.
Latency-based load balancing: works well under high workload, when the heterogeneity among VMs is most prominent.
Adaptive timeout strategy: achieves its improvement in tail latency under moderate or low system workload.

18 Qualitative Conditions for the Techniques Contd
Since the scopes of the different techniques hardly overlap, we recommend applying each of them exclusively in the situations it suits best.

19 Real-world Topologies Evaluation
Yahoo PageLoad Topology
Yahoo Processing Topology

20 Real-world Topologies Experimental Results
Improvement in tail latency over default Storm:

Technique                      90th latency    99th latency    99.9th latency
Adaptive Timeout Strategy      ------          28%-40%         24%-26%
Improved Concurrency Model     16%-19%         36%-42%         20%-32%
Latency-based Load Balance     22%-48%         50%-57%         21%-50%

21 Summary
We propose three novel techniques for reducing tail latency, based on a common system model of stream processing systems such as Storm:
Adaptive Timeout Strategy
Improved Concurrency Model for Worker Process
Latency-based Load Balancing
We provide guidelines for when to use which technique.
The techniques improve tail latency by up to 72.9% compared to Storm's default implementation.
DPRG:

