
1 Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
Zaharia et al. (2012)

2 Streaming systems vs batch systems
Data are processed immediately as they arrive, rather than periodically in large batches, so low latency is the priority. Typical applications for streaming systems:
- Site activity statistics: instant statistics for advertisers (in Facebook's case, about 1 million events/s)
- Spam detection: quickly detect and remove spam
- Cluster monitoring: detect problems on clusters in datacenters
- Network intrusion detection: intrusions must be detected with high urgency

3 Record-at-a-time processing model
First, consider how current streaming systems process stream data:
- Nodes continuously receive records, process them and update internal state, then send new records downstream
- Fault tolerance through replication plus synchronization protocols (which keep messages in the same order across replicas)
- Or through upstream backup, where upstream nodes buffer the records they have sent and replay them to recover a failed node
A sketch of such a record-at-a-time operator follows.
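As a rough illustration (a hypothetical sketch in Scala, not code from the paper or any particular system), a record-at-a-time counting operator couples per-record processing with mutable internal state, which is exactly what replication or upstream backup must protect:

```scala
// Hypothetical record-at-a-time operator, for illustration only.
import scala.collection.mutable

final case class Record(key: String, value: String)

final class CountOperator(emit: Record => Unit) {
  // Internal mutable state: lost on failure unless the node is replicated
  // or rebuilt by replaying buffered records from upstream (upstream backup).
  private val counts = mutable.Map.empty[String, Long]

  def onRecord(r: Record): Unit = {
    val n = counts.getOrElse(r.key, 0L) + 1 // update state for this single record
    counts(r.key) = n
    emit(Record(r.key, n.toString))         // immediately send a new record downstream
  }
}
```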

4 Record-at-a-time processing model
This model has three problems:
- Faults and stragglers: replication needs twice the resources, while upstream backup takes longer to recover, and neither approach handles stragglers
- Consistency: it is difficult to obtain a consistent global state, since different nodes receive and process data at different times
- Compatibility with batch systems: the programming model is different, so it is hard to combine streaming with batch processing, which is sometimes needed to process the data further (e.g., joining a data stream with historical data)

5 Why not batch systems?
Batch systems have already solved these problems, but with much higher latency (on the order of minutes), whereas the streaming applications above need latency under about 2 s.

6 Discretized Streams (D-Streams)
Run each streaming computation as a series of deterministic batch computations on small time intervals, borrowing heavily from batch systems:
- State is held in Resilient Distributed Datasets (RDDs), an in-memory storage abstraction that tracks the lineage graph of operations used to build each dataset, allowing lost data to be recomputed
- Fast performance, with very short (ms-scale) tasks
- Parallel recovery: a failed node's state is rebuilt in parallel on multiple nodes across the cluster
- Straggler mitigation via speculative execution, which traditional record-at-a-time systems cannot use
- Compatibility with batch systems, because of the shared programming model
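As a concrete sketch, here is a streaming word count written against the open-source Spark Streaming API (used here as a stand-in; the paper's prototype interface may differ in its details):

```scala
// Word count as a D-Stream computation, sketched with the open-source
// Spark Streaming API (host/port and app name are made up for illustration).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Each 1 s of input becomes one small deterministic batch (an RDD)
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical text source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _) // deterministic batch op per interval

    counts.print()          // output operator: emit each interval's result
    ssc.start()
    ssc.awaitTermination()
  }
}
```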

7 D-Streams Input arrives at node 1 stored at an immutable partitioned dataset Then processed via deterministic parallel ops Output stored on another partition, which can be used on next interval Batch operation is at n second interval

8 Batch operation and lineage graph
In the figure, each dot is a partition of data and each oval is an RDD. RDDs are processed by deterministic operations such as map, reduce, and groupBy, and these operations are logged as a lineage graph used for node recovery. A node can also replicate an RDD as a checkpoint, but this does not need to happen for all data, since recovery can run in parallel and is therefore fast. Similarly, if a node straggles, copies of its tasks can be speculatively executed on other nodes.

9 D-Streams operations
Inputs are grouped into intervals based on arrival time.
Transformations:
- Stateless (no frame of reference beyond the current interval): map, reduce, groupBy, join
- Stateful (spanning intervals, e.g., a batch computation over [t, t+5)): sliding windows, incremental aggregation (reduceByWindow), state tracking
Output operators:
- Write results to an external storage system
- Let other programs pull results directly from the RDDs
A windowed example is sketched below.
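For the stateful case, a sliding-window count can be sketched with the same assumed API; the incremental form of reduceByKeyAndWindow adds the newest interval and subtracts the oldest rather than recomputing the whole window:

```scala
// Incremental windowed aggregation, sketched with the open-source Spark
// Streaming API (source, window, and slide durations are made up).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("WindowedCount"), Seconds(1))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // required by the incremental form

    val pairs = ssc.socketTextStream("localhost", 9999)
                   .flatMap(_.split(" "))
                   .map(word => (word, 1))

    // Count over a 30 s window sliding every 10 s, maintained incrementally
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // add counts entering the window
      (a: Int, b: Int) => a - b, // subtract counts leaving the window
      Seconds(30), Seconds(10))

    windowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```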

10 Timing considerations and consistency
Data may arrive out of order with respect to the external timestamp of the event: a record for a user click at time t may arrive at t+3, while a record for t+1 arrives at t+2. Two ways to handle this:
- Wait for a "slack time" before processing each batch: simple, but introduces latency
- Correct late records at the application level: the user must be aware of this and handle it explicitly (sketched below)
Consistency: in record-at-a-time systems, a straggling node can produce wrong results when all nodes are polled for an aggregate. For example, in a system tracking view counts, a straggler that has not finished its computation will return previous or incomplete counts. In D-Streams, all records within an interval are processed atomically, and a straggler's tasks are speculatively re-run on other nodes.
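One possible application-level correction (my sketch, not the paper's code) keeps running per-key state and folds records in whenever they arrive, so late arrivals incrementally update the result; `events` is a hypothetical keyed stream:

```scala
// Application-level correction of late records, sketched with Spark Streaming's
// updateStateByKey (requires a checkpoint directory to be set on the context).
import org.apache.spark.streaming.dstream.DStream

// `events`: hypothetical stream of (pageId, clicks) pairs, possibly arriving late.
def correctedCounts(events: DStream[(String, Int)]): DStream[(String, Long)] =
  events.updateStateByKey[Long] { (newValues: Seq[Int], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum) // fold late and on-time records alike
  }
```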

11 D-Stream Architecture
Implemented over Spark:
- Master: tracks the D-Stream lineage graph, schedules tasks, and also tracks data blocks
- Workers: receive data, store RDD partitions in a per-worker block store, and run tasks
- Clients: send input into the system
The computation is split into short, stateless tasks, which is better for load balancing, reacting to failures, and running speculative tasks. Clocks are synchronized by NTP, and scheduling takes data locality into account. The master is not fault tolerant, but workers are. Optimizations include smart block placement, asynchronous I/O, and pipelining.

12 Fault and Straggler Recovery
- RDDs are checkpointed to other nodes asynchronously; this is needed because otherwise the lineage graph would grow indefinitely
- Recovery can be spread across many nodes, each replaying part of the lineage graph, so there is less work per node; nodes keep receiving new data while performing recovery
- Speculative copies of a task are run when it takes 1.4x longer than the median task
(The graph shows the performance of parallel recovery: the x axis is the normalized system load before the failure, the y axis is recovery time in minutes, assuming the time since the last checkpoint is 1 minute.)
A configuration sketch follows.
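In the assumed open-source Spark Streaming API, bounding the lineage graph looks roughly like the following, where `ssc` and `stateStream` stand for the hypothetical StreamingContext and stateful stream from the earlier sketches:

```scala
// Sketch: truncate the lineage graph by periodic, asynchronous checkpointing.
import org.apache.spark.streaming.Seconds

ssc.checkpoint("hdfs:///tmp/checkpoints") // where checkpointed RDDs are written
stateStream.checkpoint(Seconds(10))       // checkpoint this stream's RDDs every 10 s
```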

13 Evaluation – throughput
Benchmarks: Grep, WordCount, TopKCount. Scaling is nearly linear (up to 100 nodes) even under a low latency bound, and throughput is higher than existing systems (S4, Storm). Under the 1 s latency bound, data are input every 500 ms; under the 2 s bound, every 1 s. Spark Streaming is compared against Storm on 30 nodes. A sketch of TopKCount appears below.
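TopKCount could plausibly be expressed as a per-interval top-k on word counts (my guess at the benchmark's shape, not the paper's code):

```scala
// Per-interval TopKCount sketch: keep the k most frequent words in each batch.
import org.apache.spark.streaming.dstream.DStream

def topK(counts: DStream[(String, Int)], k: Int): DStream[(String, Int)] =
  counts.transform { rdd =>
    // rdd.top returns a sorted Array; re-parallelize it into an RDD
    rdd.sparkContext.parallelize(rdd.top(k)(Ordering.by[(String, Int), Int](_._2)))
  }
```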

14 Evaluation – recovery
Average time to process each 1 s batch of data before, during, and after a failure, with a 10 s checkpoint interval on 20 nodes, varying the number of concurrent failures. WordCount shows less tolerance to failure because it has a bigger lineage graph.

15 Evaluation – recovery
Average time to process each 1 s batch of WordCount data before, during, and after a failure, varying the checkpoint interval and the number of nodes. A longer checkpoint interval means more lineage-graph operations must be re-run; more nodes mean more parallelism in fault recovery.

16 Evaluation – straggler mitigation
Processing time of intervals in normal operation, and in the presence of a straggler, with and without speculation enabled. The straggling node is not blacklisted; the authors argue performance would be even better if blacklisting of slow nodes were implemented.

17 Evaluation – real-world application
- Conviva: computes metrics such as unique viewers and session-level metrics; on 64 quad-core EC2 nodes it could support 3.8 million concurrent viewers
- Mobile Millennium: computation-intensive traffic estimation from GPS data, 10x faster than the batch version

18 Conclusion
D-Streams, a stream processing model for large clusters, has been implemented. It batches data into small time intervals and passes each batch through deterministic parallel operations, solving the problems of prior streaming systems (see table). However, it assumes a fault-free master and trades message-order consistency for latency.

