Spark Streaming Large-scale near-real-time stream processing

Slides:



Advertisements
Similar presentations
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma,
Advertisements

Spark Streaming Large-scale near-real-time stream processing
Spark Streaming Large-scale near-real-time stream processing
Spark Streaming Real-time big-data processing
Spark Streaming Large-scale near-real-time stream processing
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Programming Models for IoT and Streaming Data IC2E Internet of Things Panel Judy Qiu Indiana University.
Spark Lightning-Fast Cluster Computing UC BERKELEY.
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Spark: Cluster Computing with Working Sets
Spark Fast, Interactive, Language-Integrated Cluster Computing Wen Zhiguang
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.
Spark Fast, Interactive, Language-Integrated Cluster Computing.
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)
Discretized Streams Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Berkley Data Analysis Stack (BDAS)
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
Outline | Motivation| Design | Results| Status| Future
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
Big Data Infrastructure Week 12: Real-Time Data Analytics (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules Apache Spark Osman AIDEL.
Massive Data Processing – In-Memory Computing & Spark Stream Process.
Microsoft Ignite /28/2017 6:07 PM
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Big thanks to everyone!.
PySpark Tutorial - Learn to use Apache Spark with Python
Big Data is a Big Deal!.
PROTECT | OPTIMIZE | TRANSFORM
Fast, Interactive, Language-Integrated Cluster Computing
Introduction to Spark Streaming for Real Time data analysis
ITCS-3190.
Hadoop Tutorials Spark
Some slides borrowed from the authors
Applying Control Theory to Stream Processing Systems
Spark Presentation.
Berkeley Data Analytics Stack (BDAS) Overview
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
COS 518: Advanced Computer Systems Lecture 11 Michael Freedman
Ministry of Higher Education
Introduction to Spark.
Data-Intensive Distributed Computing
CS110: Discussion about Spark
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Spark and Scala.
Introduction to Spark.
CS639: Data Management for Data Science
Data science laboratory (DSLAB)
Fast, Interactive, Language-Integrated Cluster Computing
Streaming data processing using Spark
COS 518: Advanced Computer Systems Lecture 12 Michael Freedman
Big-Data Analytics with Azure HDInsight
CS639: Data Management for Data Science
Presentation transcript:

Spark Streaming Large-scale near-real-time stream processing Tathagata Das (TD) UC Berkeley UC BERKELEY

What is Spark Streaming? Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark’s batch and interactive processing Provides a simple batch-like API for implementing complex algorithm Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Motivation Many important applications must process large streams of live data and provide results in near-real-time Social network trends Website statistics Intrustion detection systems etc. Distributed stream processing framework is required to Scale to large clusters (100s of machines) Achieve low latency (few seconds)

Integration with Batch Processing Many environments require processing same data in live streaming as well as batch post processing Existing framework cannot do both Either do stream processing of 100s of MB/s with low latency Or do batch processing of TBs / PBs of data with high latency Extremely painful to maintain two different stacks Different programming models 2X implementation effort 2X bugs

Stateful Stream Processing Traditional streaming systems have a event-driven record-at-a-time processing model Each node has mutable state For each record, update state & send new records State is lost if node dies! Making stateful stream processing be fault-tolerant is challenging mutable state node 1 node 3 input records node 2 Traditional streaming systems have what we call a “record-at-a-time” processing model. Each node in the cluster processing a stream has a mutable state. As records arrive one at a time, the mutable state is updated, and a new generated record is pushed to downstream nodes. Now making this mutable state fault-tolerant is hard.

Existing Streaming Systems Storm Replays record if not processed by a node Processes each record at least once Double counting - may update mutable state twice! Not fault-tolerant - mutable state can be lost due to failure! Trident – Use transactions to update state Processes each record exactly once Transactions to external database for state updates reduces throughput

Spark Streaming

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs live data stream Chop up the live data stream into batches of X seconds Spark treats each batch of data as RDDs and processes them using RDD operations Finally, the processed results of the RDD operations are returned in batches Spark Streaming batches of X seconds Spark processed results

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs live data stream Batch sizes as low as ½ second, latency ~ 1 second Potential for combining batch processing and streaming processing in the same system Spark Streaming batches of X seconds Spark processed results

Example – Get hashtags from Twitter val tweets = ssc.twitterStream(authorization) DStream: a sequence of RDD representing a stream of data batch @ t+1 batch @ t batch @ t+2 Twitter Streaming API tweets DStream stored in memory as an RDD (immutable, distributed dataset)

Example – Get hashtags from Twitter val tweets = ssc.twitterStream(authorization) val hashTags = tweets.flatMap(status => getTags(status)) new DStream transformation: modify data in one Dstream to create another DStream batch @ t+1 batch @ t batch @ t+2 tweets DStream Twitter Streaming API flatMap flatMap flatMap … hashTags Dstream [#cat, #dog, … ] new RDDs created for every batch

Example – Get hashtags from Twitter val tweets = ssc.twitterStream(authorization) val hashTags = tweets.flatMap(status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: to push data to external storage RDD @ t RDD @ t+1 RDD @ t+2 tweets DStream flatMap flatMap flatMap hashTags DStream save every batch saved to HDFS

Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(authorization) val hashTags = tweets.flatMap(status => getTags(status)) hashTags.foreach(rdd => { ... }) foreach: do anything with the processed data RDD @ t RDD @ t+1 RDD @ t+2 tweets DStream flatMap flatMap flatMap hashTags DStream foreach foreach foreach Write to database, update analytics UI, do whatever you want

Function object to define the transformation Java Example Scala val tweets = ssc.twitterStream(authorization) val hashTags = tweets.flatMap(status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") Java JavaDStream<Status> tweets = ssc.twitterStream(authorization) JavaDstream<String> hashTags = tweets.flatMap(new Function<...> { }) Function object to define the transformation

Window-based Transformations val tweets = ssc.twitterStream(authorization) val hashTags = tweets.flatMap(status => getTags(status)) val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue() sliding window operation window length sliding interval window length DStream of data sliding interval

Demo

lost partitions recomputed in parallel on other workers Fault-tolerance RDDs remember the sequence of operations that created it from the original fault-tolerant input data Batches of input data are replicated in memory of multiple worker nodes to make them fault-tolerant Data lost due to worker failure, can be recomputed from input data tweets RDD input data replicated in memory flatMap hashTags RDD lost partitions recomputed in parallel on other workers

Arbitrary Stateful Computations Maintain arbitrary per-key state across time Specify function to generate new state based on previous state and new data Example: Maintain per-user mood as state, and update it with their tweets def updateMood(newTweets, lastMood) => newMood moods = tweets.updateStateByKey(updateMood _)

Arbitrary Combos of Batch and Streaming Computations Inter-mix RDD and DStream operations! Example: Join incoming tweets with a spam HDFS file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })

DStream Input Sources Out of the box we provide Kafka Flume HDFS Raw TCP sockets Etc. Simple interface to implement your own data receiver What to do when a receiver is started (create TCP connection, etc.) What to do when a received is stopped (close TCP connection, etc.)

Performance Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency Tested with 100 text streams on 100 EC2 instances

Comparison with Storm and S4 Higher throughput than Storm Spark Streaming: 670k records/second/node Storm: 115k records/second/node Apache S4: 7.5k records/second/node Streaming Spark offers similar speed while providing FT and consistency guarantees that these systems lack

Fast Fault Recovery Recovers from faults/stragglers within 1 sec

Real Applications: Mobile Millennium Project Traffic transit time estimation using online machine learning on GPS observations Markov chain Monte Carlo simulations on GPS observations Very CPU intensive, requires dozens of machines for useful computation Scales linearly with cluster size

Unifying Batch and Stream Processing Models Spark program on Twitter log file using RDDs val tweets = sc.hadoopFile("hdfs://...") val hashTags = tweets.flatMap(status => getTags(status)) hashTags.saveAsHadoopFile("hdfs://...") Spark Streaming program on Twitter stream using DStreams val tweets = ssc.twitterStream() hashTags.saveAsHadoopFiles("hdfs://...")

Vision - one stack to rule them all Explore data interactively using Spark Shell to identify problems Use same code in Spark stand- alone programs to identify problems in production logs Use similar code in Spark Streaming to identify problems in live log streams $ ./spark-shell scala> val file = sc.hadoopFile(“smallLogs”) ... scala> val filtered = file.filter(_.contains(“ERROR”)) scala> val mapped = filtered.map(...) object ProcessProductionData { def main(args: Array[String]) { val sc = new SparkContext(...) val file = sc.hadoopFile(“productionLogs”) val filtered = file.filter(_.contains(“ERROR”)) val mapped = filtered.map(...) ... } object ProcessLiveStream { def main(args: Array[String]) { val sc = new StreamingContext(...) val stream = sc.kafkaStream(...) val filtered = file.filter(_.contains(“ERROR”)) val mapped = filtered.map(...) ... }

Our Vision - one stack to rule them all Interactive Processing Batch Processing Stream Processing Spark + Shark BlinkDB Spark Streaming Tachyon A single, unified programming model for all computations Allows inter-operation between all the frameworks Save RDDs from streaming to Tachyon Run Shark SQL queries on streaming RDDs Run direct Spark batch and interactive jobs on streaming RDDs

Thank you! Conclusion Integrated with Spark as an extension Run it locally or EC2 Spark Cluster (takes 5 minute to setup) Streaming programming guide – http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html Paper – tinyurl.com/dstreams Thank you!