Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.

Slides:



Advertisements
Similar presentations
Spark Streaming Large-scale near-real-time stream processing
Advertisements

Spark Streaming Large-scale near-real-time stream processing
Spark Streaming Real-time big-data processing
Shark:SQL and Rich Analytics at Scale
MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO
Spark Streaming Large-scale near-real-time stream processing
HDFS & MapReduce Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer.
Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Frameworks (and Stream Processing) Aditya Akella.
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Discretized Streams Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Berkley Data Analysis Stack (BDAS)
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Adaptive Stream Processing using Dynamic Batch Sizing Tathagata Das, Yuan Zhong, Ion Stoica, Scott Shenker.
Distributed Computations MapReduce
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
MAP REDUCE : SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS Presented by: Simarpreet Gill.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Resilient Distributed Datasets (NSDI 2012) A Fault-Tolerant Abstraction for In-Memory Cluster Computing Piccolo (OSDI 2010) Building Fast, Distributed.
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
MapReduce M/R slides adapted from those of Jeff Dean’s.
Spark Streaming Large-scale near-real-time stream processing
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Lecture 7: Practical Computing with Large Data Sets cont. CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks.
Data Engineering How MapReduce Works
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Ignite in Sberbank: In-Memory Data Fabric for Financial Services
Massive Data Processing – In-Memory Computing & Spark Stream Process.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Fast, Interactive, Language-Integrated Cluster Computing
Introduction to Spark Streaming for Real Time data analysis
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Curator: Self-Managing Storage for Enterprise Clusters
Some slides borrowed from the authors
Scaling Apache Flink® to very large State
Spark Presentation.
Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
Gregory Kesden, CSE-291 (Cloud Computing) Fall 2016
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
CSCI5570 Large Scale Data Processing Systems
COS 518: Advanced Computer Systems Lecture 11 Michael Freedman
Introduction to Spark.
COS 418: Distributed Systems Lecture 1 Mike Freedman
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
CS110: Discussion about Spark
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Apache Hadoop and Spark
Fast, Interactive, Language-Integrated Cluster Computing
Streaming data processing using Spark
COS 518: Advanced Computer Systems Lecture 12 Michael Freedman
Lecture 29: Distributed Systems
Presentation transcript:

Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1

Motivation Many important applications must process large data streams in real time Site activity statistics, cluster monitoring, spam filtering,... Would like to have… Second-scale latency Scalability to hundreds of nodes Second-scale recovery from faults and stragglers Cost-efficient Previous Streaming frameworks don’t meet these 2 goals together 2

Previous Streaming Systems Record-at-a-time processing model Fault tolerance via replication or upstream backup Replication Fast Recovery, but 2x hardware cost Upstream Backup No 2x hardware, slow to recovery Neither handles stragglers 3

Discretized Streams immutable dataset (output or state); stored in memory as Spark RDD 4 A streaming computation as a series of very small, deterministic batch jobs

Resilient Distributed Datasets(RDDs) An RDD is an immutable, in-memory, partitioned logical collection of records RDDs maintain lineage information that can be used to reconstruct lost partitions 5 lines errors HDFS errors time fields lines = spark.textFile("hdfs://...") errors=lines.filter(_.startsWith(“ERROR”)) HDFS_errors=filter(_.contains(“HDFS”))) HDFS_errors=map(_.split(‘\t’)(3))

Fault Recovery All dataset Modeled as RDDs with dependency graph Fault-tolerant without full replication Fault/straggler recovery is done in parallel on other nodes fast recovery 6

Discretized Streams Faults/Stragglers recovery without replication State between batches kept in memory Deterministic operations  fault-tolerant Second-scale performance Try to make batch size as small as possible Smaller batch size  lower end-to-end latency A rich set of operations Combined with historical datasets Interactive queries 7 Spark DStream Processing batches of X seconds live data stream processed results

Programming Interface Dstream: A stream of RDDs Transformations Map, filter, groupBy, reduceBy, sort, join… Windowing Incremental aggregation State tracking Output operator Save RDDs to outside systems(screen, external storage… ) 8

Windowing Count frequency of words received in last 5 seconds words = createNetworkStream(" ones = words.map(w => (w, 1)) freqs_5s = ones.reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1)) DStream freqs_5s Transformation Sliding Window Ops 9

Incremental aggregation Aggregation Function freqs = ones.reduceByKeyAndWindow (_ + _, Seconds(5), Seconds(1)) Invertible aggregation Function freqs = ones.reduceByKeyAndWindow (_ + _, _ - _, Seconds(5), Seconds(1)) 10

State Tracking A series of events  state changing 11 sessions = events.track( (key, ev) => 1, // initialize function (key, st, ev) => // update function ev == Exit ? null : 1, "30s") // timeout counts = sessions.count() // a stream of ints

Evaluation Throughputs Linear scalability to 100 nodes 12

Evaluation Compared to Storm Storm is 2x slower than Discretized Streams 13

Evaluation Fault Recovery Recovery is fast with at most 1 second delay 14

Evaluation Stragglers recovery Speculative execution improves response time significantly. 15

Comments Batch size Minimum latency is fixed based on batching data Burst workload Dynamic change the batch size? Driver/Master fault Periodically save master data into HDFS, probably need a manual setup Multiple masters, Zookeeper? Memory usage Higher than continuous operators with mutable state It may possible to reduce the memory usage by storing only Δ between RDDs 16

Comments No latency evaluation How does it perform compared to Storm/S4 Compute intervals need synchronization the nodes have their clocks synchronized via NTP 17