Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing (Zaharia et al., 2012)

Streaming systems vs. batch systems. In a streaming system, data are processed as they arrive rather than periodically in large batches, so urgency matters more. Typical streaming applications include site activity statistics (instant statistics for advertisers; in Facebook's case, on the order of 1 million events/s), spam detection (detect and remove spam quickly), cluster monitoring (detect problems on datacenter clusters), and network intrusion detection, all of which demand a high level of urgency.

Record-at-a-time processing model. This is how current streaming systems process stream data: nodes continuously receive records, process them and update internal state, then send new records downstream. Fault tolerance is provided either by replication with synchronization (to keep replicas processing messages in the same order) or by upstream backup, where upstream nodes buffer sent records and replay them for recovery.

Problems with the record-at-a-time model: faults and stragglers, consistency, and compatibility with batch systems. Replication needs twice the resources, while upstream backup has long recovery times, and neither approach handles stragglers. Consistent global state is difficult to obtain because different nodes receive and process data at different times. And because the programming model is different, it is hard to combine streaming with batch systems, which is often needed to further process data, e.g. joining a data stream with historical data.

Why not batch systems? Batch systems already solve these problems, but with much higher latency (minutes), whereas the streaming applications above (site activity statistics, spam detection, cluster monitoring, network intrusion detection) need latencies below about 2 seconds.

Discretized Streams (D-Streams). Run each streaming computation as a series of deterministic batch computations on small time intervals. The building block, borrowed from batch systems, is the Resilient Distributed Dataset (RDD), an in-memory storage abstraction whose lineage graph tracks the operations used to build it, so lost data can be recomputed by replaying those operations. Tasks are fast (50-200 ms each). Recovery is parallel: lost state is recomputed by many nodes on the cluster at once, something traditional streaming systems cannot do, and they also cannot use speculative execution, which D-Streams uses against stragglers. Because of its similarity to batch systems, the model is naturally compatible with them.
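As an illustration, here is a minimal word-count sketch in the Spark Streaming Scala API, which exposes this model; the socket source, host, port, and local master are assumptions made for the example, not details from the slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing (sketch only)
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")

    // The 1 s batch interval discretizes the stream: each second of input
    // becomes one immutable RDD, processed by a deterministic batch job.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical text source; each line is one record.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val words  = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.print()          // output operator, runs once per interval
    ssc.start()
    ssc.awaitTermination()
  }
}
```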

D-Streams. Input arriving during an interval is stored as an immutable, partitioned dataset, then processed via deterministic parallel operations; the output is stored as another partitioned dataset, which can be used as input or state in the next interval. This batch operation repeats every interval of n seconds.
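A sketch of carrying state from one interval to the next, continuing the word-count example above; updateStateByKey and StreamingContext.checkpoint are part of the Spark Streaming API, and the checkpoint path is an arbitrary choice for the example.

```scala
// State that persists across intervals: a running count per word.
// A checkpoint directory is required, otherwise the state RDD's lineage
// would grow without bound (see the recovery slides below).
ssc.checkpoint("/tmp/dstream-checkpoints")   // arbitrary path for the sketch

val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
  (newValues, oldCount) => Some(newValues.sum + oldCount.getOrElse(0))

// Each interval's new (word, 1) pairs are folded into the previous
// interval's state RDD, producing the next interval's state RDD.
val runningCounts = words.map(w => (w, 1)).updateStateByKey(updateCount)
runningCounts.print()
```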

Batch operation and lineage graph. (In the paper's figure, each dot is a partition of data and each oval is an RDD.) RDDs are built by deterministic operations such as map, reduce, and groupBy. These operations are logged in a lineage graph, which is used for node recovery. Nodes can also replicate RDDs as checkpoints, but this is not needed for all data, since lineage-based recovery is parallel and therefore fast. Similarly, if a node straggles, speculative copies of its tasks can be executed on other nodes.

D-Streams operations. Input records are assigned to intervals based on arrival time. Transformations are either stateless, applied independently to each interval (map, reduce, groupBy, join), or stateful, spanning intervals (sliding windows, incremental aggregation such as reduceByWindow, and state tracking). Output operators either write results to an external storage system or let other programs pull the resulting RDDs directly. A stateless operator has no frame of reference beyond the current interval; a stateful one does, e.g. a window that batch-processes the intervals in [t, t+5).
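A sketch of a stateful windowed transformation and an output operator, continuing the example above; the 30 s window, 1 s slide, and output path are illustrative choices.

```scala
// Stateful transformation: word counts over a 30 s sliding window that
// advances every 1 s, aggregated incrementally by adding the interval that
// enters the window and subtracting the one that leaves it.
// (Requires the checkpoint directory set earlier.)
val pairs = words.map(w => (w, 1))
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,    // add counts entering the window
  (a: Int, b: Int) => a - b,    // subtract counts leaving the window
  Seconds(30), Seconds(1))

// Output operator: write each interval's result to external storage.
windowedCounts.foreachRDD { rdd =>
  rdd.saveAsTextFile(s"/tmp/word-counts-${System.currentTimeMillis}")  // illustrative path
}
```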

Timing considerations and consistency. Records may arrive out of order with respect to their external event timestamps; for example, a click that happened at time t may arrive at t+3 while a record from t+1 arrives at t+2. Two remedies: wait for a "slack time" before processing each batch, which is simple but adds latency, or correct records at the application level, which the user must be aware of and handle explicitly. For consistency, all records are processed atomically with the interval they arrive in. This matters because polling all nodes for an aggregate (say, a view count) while one node is straggling and has not finished its computation would otherwise return previous or incomplete data; instead, a straggler's tasks are speculatively rerun on other nodes.
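One possible shape of application-level correction, as a sketch continuing the example above (not the paper's mechanism): it assumes each record arrives as "eventTimeMillis,word" and keys counts by the event-time bucket the record belongs to, so a late record is still added to the bucket of the interval in which it occurred.

```scala
// Application-level correction for late records (illustration only).
// Assumed record format: "eventTimeMillis,word". Counts are keyed by the
// 1 s event-time bucket of the record, so a record that arrives late is
// folded into the bucket it belongs to rather than the arrival interval.
val bucketed = lines.map { line =>
  val Array(ts, word) = line.split(",", 2)
  ((ts.toLong / 1000, word), 1)              // key = (event-time bucket, word)
}
val correctedCounts = bucketed.updateStateByKey(updateCount)
correctedCounts.print()
```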

D-Stream architecture. The system is implemented over Spark. The master tracks the D-Stream lineage graph and schedules tasks; workers receive input data, store RDD partitions, and execute tasks; clients send input. Computation is split into short, stateless tasks, which is better for load balancing, reacting to failures, and running speculative copies. Each worker manages its data in a block store, and the master also tracks block locations so that data locality can be considered when scheduling. Clocks are synchronized by NTP. Workers are fault tolerant, but the master is not. Optimizations include smart block placement, asynchronous I/O, and pipelining.

Fault and straggler recovery. RDDs are checkpointed to other nodes asynchronously; checkpoints are needed because the lineage to replay would otherwise grow indefinitely. Recovery is spread across many nodes, each replaying part of the lineage graph, so there is less work per node, and nodes keep receiving new data while recovering. Speculative jobs are run for stragglers when a task takes more than 1.4x the median task time. (The paper's graph of parallel-recovery performance plots recovery time in minutes against normalized system load before the failure, assuming one minute since the last checkpoint.)
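A sketch of how these knobs appear in Spark Streaming, continuing the example above: the 10 s checkpoint interval mirrors the evaluation setting, and the speculation properties are standard Spark configuration, used here on the assumption that they govern the speculative reruns described in the slides.

```scala
// Bound the amount of lineage to replay on failure by checkpointing the
// stateful stream periodically; 10 s matches the evaluation setting.
runningCounts.checkpoint(Seconds(10))

// Speculative execution is configured on the SparkConf, i.e. before the
// StreamingContext is created; a multiplier of 1.4 means a task is rerun
// elsewhere once it runs more than 1.4x longer than the median task.
val specConf = new SparkConf()
  .setAppName("DStreamRecovery")
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1.4")
```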

Evaluation – throughput. Benchmarks are Grep, WordCount, and TopKCount. Scaling is nearly linear up to 100 nodes, even with low latency, and throughput is higher than existing systems (S4, Storm); the comparison between Spark Streaming and Storm used 30 nodes. Results are reported under 1 s and 2 s latency bounds: for the 1 s bound data is input every 500 ms, for the 2 s bound every 1 s.

Evaluation – recovery. The metric is the average time to process each 1 s batch of data before, during, and after a failure, with a 10 s checkpoint interval on 20 nodes, while varying the number of concurrent failures. WordCount shows less tolerance to failures because it has a larger lineage graph to recompute.

Evaluation – recovery (continued). Same metric, on WordCount, varying the checkpoint interval and the number of nodes. A longer checkpoint interval means more lineage-graph operations must be replayed during recovery; more nodes means fault recovery is parallelized further.

Evaluation – straggler mitigation. The metric is the processing time of intervals in normal operation and in the presence of a straggler, with and without speculation enabled. The straggling node is not blacklisted; the authors argue performance would be even better if blacklisting were implemented.

Evaluation – real-world applications. Conviva used the system to compute metrics such as unique viewers and session-level statistics; on 64 quad-core EC2 nodes it could support 3.8 million concurrent viewers. Mobile Millennium used it for computation-intensive traffic estimation from GPS data and ran 10 times faster than the batch version.

Conclusion. D-Streams is a stream processing model for large clusters that batches data into small time intervals and passes each batch through deterministic parallel operations. It addresses the fault-tolerance, straggler, consistency, and batch-integration problems of prior streaming systems (see the comparison table in the paper). However, it assumes a fault-free master and trades off message-order consistency for latency.