Published by Bryan Hunter. Modified over 9 years ago.
1
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
Authors: Raul Castro Fernandez, Matteo Migliavacca, et al.
Published conference: SIGMOD’13
Reporter: Ma Yuanwen
2
Introduction
Stream
– A sequence of tuples
– Examples: sensor networks, stock trading systems
Query plan
– A query is specified as a directed acyclic graph
Distributed stream processing
– A query is deployed on a set of nodes
State management
– Checkpoint, back up, restore, partition
Scale out
– Split an operator instance when it is overloaded
Fault tolerance
– Recover from failures without affecting processing results
3
Outline
Background
– Problem statement
– System model
State management
– Query state
– State operations
Scale out and fault tolerance
– Fault-tolerant scale out algorithm
– System architecture
– Bottleneck detection and scaling policy
Evaluation
Conclusions
4
Problem Statement
Operators
– Stateless operators (e.g. filter or map) and stateful operators (e.g. join or aggregation)
– Sliding window based state
– Entire history based state
Intra-query parallelism
– Query graph and execution graph
Fault tolerance
– Passive standby strategy
– Active standby strategy
– Upstream backup strategy
Example query: report the word frequencies over the most recent hour, about every 10 minutes
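The example query above is a stateful operator with sliding window based state. A minimal sketch follows, assuming tuples arrive as (timestamp in seconds, word) pairs in non-decreasing timestamp order. The class name, method names and parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter, deque


class SlidingWordCount:
    """Word frequencies over a sliding time window (illustrative only)."""

    def __init__(self, window_s=3600, slide_s=600):
        self.window_s = window_s     # window length: the most recent hour
        self.slide_s = slide_s       # report interval: every 10 minutes
        self.events = deque()        # (timestamp, word), oldest first
        self.counts = Counter()      # processing state: word -> count in window
        self.next_report = slide_s   # time at which the next report is due

    def process(self, ts, word):
        """Consume one tuple and return the reports that became due."""
        reports = []
        # Emit every report whose time has passed before admitting the tuple.
        while ts >= self.next_report:
            self._evict(self.next_report)
            reports.append((self.next_report, dict(self.counts)))
            self.next_report += self.slide_s
        self.events.append((ts, word))
        self.counts[word] += 1
        return reports

    def _evict(self, now):
        # Drop tuples that fell out of the window (now - window_s, now].
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old_word = self.events.popleft()
            self.counts[old_word] -= 1
            if self.counts[old_word] == 0:
                del self.counts[old_word]
```

The `counts` and `events` fields are the state that would have to be checkpointed, restored or partitioned if this operator failed or became a bottleneck.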
5
System model (1)

name     age  sex
Li Lei   16   male
Han Mei  15   female
Jim      17   male

name     age  sex
Li Lei   16   male

name     age  sex
Han Mei  15   female

name     age  sex
Jim      17   male
6
System model(2)
7
Query state (1)
8
Query state (2)
9
State operations
Operator state backup and restore
– Checkpoint the state of an operator and back up the state to an upstream operator
– Restore the state after a failure or during scale out
Operator state partitioning
– When a stateful operator scales out, its processing state must be split across the new partitioned operators
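The state operations above can be sketched in a few lines. This is a toy model under stated assumptions, not the paper's implementation: processing state is a plain key-value dictionary, the upstream operator keeps checkpoints in memory, and partitioning uses a CRC32 hash of the key. The names `Operator`, `backup_to`, `restore_from` and `partition_state` are hypothetical.

```python
import copy
import zlib


class Operator:
    """Toy operator holding keyed processing state (all names hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.state = {}     # processing state: key -> value
        self.backups = {}   # checkpoints stored on behalf of downstream operators

    def checkpoint(self):
        # Snapshot the state so that later updates do not alter the checkpoint.
        return copy.deepcopy(self.state)

    def backup_to(self, upstream):
        # Store the checkpoint at an upstream operator.
        upstream.backups[self.name] = self.checkpoint()

    def restore_from(self, upstream, source_name):
        # Load the latest checkpoint that upstream holds for source_name.
        self.state = copy.deepcopy(upstream.backups[source_name])


def partition_state(state, n):
    """Split keyed state into n disjoint parts using a stable hash of the key."""
    parts = [{} for _ in range(n)]
    for key, value in state.items():
        index = zlib.crc32(str(key).encode("utf-8")) % n
        parts[index][key] = value
    return parts
```

A stable hash is used instead of Python's built-in `hash`, because string hashes are randomized per process and every node must agree on which partition owns a key.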
10
Operator state backup and restore
12
Operator state partitioning
14
Scale out and Fault Tolerance
Scale out
– The stream processing system (SPS) partitions operators on demand in response to bottleneck operators
Fault tolerance
– If a node hosting an operator fails, the SPS must replace it with an operator on a new node
Operator recovery becomes a special case of scale out, in which a failed operator is scaled out to a parallelization level of 1
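The observation that recovery is scale out with a parallelization level of 1 can be illustrated as follows. This is a simplified sketch that assumes a backed-up checkpoint is a key-value dictionary and that keys are assigned by a CRC32 hash. It leaves out the other steps a real system needs, such as deploying the new operators and replaying tuples. The function names are hypothetical.

```python
import zlib


def scale_out(checkpoint, parallelism):
    """Build `parallelism` state partitions from one backed-up checkpoint."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    partitions = [{} for _ in range(parallelism)]
    for key, value in checkpoint.items():
        # Assumed partitioning rule: stable hash of the key modulo parallelism.
        index = zlib.crc32(str(key).encode("utf-8")) % parallelism
        partitions[index][key] = value
    return partitions


def recover(checkpoint):
    """Recovery as scale out to parallelism 1: one replacement gets all state."""
    return scale_out(checkpoint, 1)[0]
```

Because both paths go through `scale_out`, a single mechanism serves a bottleneck operator (parallelism greater than 1) and a failed operator (parallelism equal to 1).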
15
Fault-tolerant scale out algorithm
16
System architecture
Query manager
– Maps query operators to nodes and maintains the execution graph
Deployment manager
– Uses the execution graph to initialize nodes, deploy operators, set up stream communication and start processing
17
Bottleneck detection and scaling policy
18
Goals and deployment of evaluation
The goals of the experimental evaluation are to investigate
– The effectiveness of the stateful operator scale out approach
– The recovery time of the stateful recovery mechanism
– The impact of the state management approach on tuple processing latency
Experiment deployment
19
Experiment data
Linear Road Benchmark (LRB)
– It models a road toll network
– Queries: (1) provide toll notifications to vehicles within 5s; (2) detect accidents within 5s; (3) answer account balance queries about paid toll amounts
– The input rate for a single expressway (L=1) begins at 15 tuples/s and increases to 1700 tuples/s
Wikipedia
– A map/reduce-style top-k query that outputs, every 30 seconds, the ranking of the most visited Wikipedia language versions, based on Wikipedia data traces
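The shape of the map/reduce-style top-k query can be sketched as below. The record format is an assumption made for illustration: each visit is a dictionary with a hypothetical `lang` field. The actual trace format and the 30-second output trigger are not modelled here.

```python
from collections import Counter


def map_visit(record):
    """Map step: emit (language, 1) for one page-visit record."""
    return record["lang"], 1  # "lang" is a hypothetical field name


def reduce_counts(pairs):
    """Reduce step: sum the visit counts per language version."""
    totals = Counter()
    for language, n in pairs:
        totals[language] += n
    return totals


def top_k(totals, k):
    """Return the k most visited language versions, most visited first."""
    return totals.most_common(k)
```

In a streaming deployment the reduce step is the stateful operator: its per-language totals are the state that must be partitioned when the operator scales out.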
20
Dynamic scale out (1)
21
Dynamic scale out (2)
22
Failure recovery
Word count
23
State management overhead
24
Conclusions
Provide state management of stateful operators
– Checkpoint, back up, restore, partition
Present an integrated approach for scale out and failure recovery