Spark Streaming Real-time big-data processing

Presentation transcript:

Spark Streaming: Real-time big-data processing
Tathagata Das (TD), UC Berkeley

What is Spark Streaming?
Extends Spark for big-data stream processing, alongside the other components built on Spark (Shark, MLlib, GraphX, BlinkDB, …).
Project started in early 2012; alpha released in spring 2013 with Spark 0.7.
Moving out of alpha in Spark 0.9.

Why Spark Streaming?
Many big-data applications need to process large data streams in real time, for example website monitoring, fraud detection, and ad monetization.

Why Spark Streaming?
Applications such as website monitoring, fraud detection, and ad monetization need a framework for big-data stream processing that scales to hundreds of nodes, achieves second-scale latencies, recovers efficiently from failures, and integrates with batch and interactive processing.

Integration with Batch Processing
Many environments require processing the same data both in live streaming and in batch post-processing. Existing frameworks cannot do both: they offer either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency. Maintaining two different stacks is extremely painful: the programming models differ, and the implementation effort doubles.

Stateful Stream Processing
Traditional model: a processing pipeline of nodes. Each node maintains mutable state; each input record updates the state, and new records are sent out. Mutable state is lost if a node fails, so making stateful stream processing fault tolerant is challenging!
Diagram: input records pass through a pipeline of nodes (node 1, node 2, node 3), each holding mutable state.
Speaker notes: Traditional stream processing uses the continuous operator model, where every node in the processing pipeline continuously runs an operator with in-memory mutable state. As each input record is received, the mutable state is updated and new records are sent out to downstream nodes. The problem with this model is that the mutable state is lost if the node fails. To deal with this, various techniques have been developed to make this state fault-tolerant. I am going to divide them into two broad classes and explain their limitations.

Existing Streaming Systems
Storm: replays a record if it is not processed by a node, so each record is processed at least once. A replayed record may update mutable state twice, and mutable state can still be lost due to failure!
Trident: uses transactions to update state, so each record is processed exactly once, but a per-state transaction to an external database is slow.
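
To make the "may update mutable state twice" problem concrete, here is a minimal sketch in plain Scala. It does not use Storm's API; `ReplayDemo` and `countWithReplay` are hypothetical names, and the redelivery is simulated by a set of record indices whose acknowledgement is assumed to have been lost.

```scala
import scala.collection.mutable

// Plain-Scala model of a stateful operator under at-least-once delivery.
// All names are illustrative; this is not Storm's API.
object ReplayDemo {
  // Counts words. Every record whose index is in `replayed` is delivered a
  // second time, simulating a lost acknowledgement followed by a replay.
  def countWithReplay(records: Seq[String], replayed: Set[Int]): Map[String, Int] = {
    val state = mutable.Map.empty[String, Int].withDefaultValue(0)
    for ((word, i) <- records.zipWithIndex) {
      state(word) += 1                            // first delivery
      if (replayed.contains(i)) state(word) += 1  // replay updates the state again
    }
    state.toMap
  }
}
```

With no replays the counts are exact; replaying a single record inflates its count by one, which is the inconsistency that the deterministic batch model is meant to avoid.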

Spark Streaming

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds. Spark treats each batch of data as RDDs and processes them using RDD operations. Finally, the processed results of the RDD operations are returned in batches.
Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.

Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs. Batch sizes can be as low as ½ second, with a latency of about 1 second. This opens up the potential for combining batch processing and stream processing in the same system.
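
The "chop up the live stream into batches of X seconds" step can be modelled with ordinary collections. This is only a sketch in plain Scala, not the Spark Streaming API: `MicroBatch`, `Record`, and `chop` are hypothetical names, and the `timestampMs` field is an assumed stand-in for the time at which a record arrives.

```scala
// Plain-Scala model of discretizing a stream into fixed-size batches.
// Not the Spark Streaming API; names are illustrative.
object MicroBatch {
  final case class Record(timestampMs: Long, value: String)

  // Assigns each record to the batch interval containing its timestamp and
  // returns (intervalStartMs, values) pairs ordered by interval start.
  // Intervals that received no records are simply absent from the result.
  def chop(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => (r.timestampMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, rs) => (start, rs.sortBy(_.timestampMs).map(_.value)) }
}
```

Each resulting batch is an immutable collection, so any deterministic function applied to it gives the same answer if it has to be run again.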

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
DStream: a sequence of RDDs representing a stream of data. The Twitter Streaming API feeds the tweets DStream; each batch (batch @ t, batch @ t+1, batch @ t+2) is stored in memory as an RDD (immutable, distributed).

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
Transformation: modify data in one DStream to create a new DStream. flatMap is applied to every batch of the tweets DStream (batch @ t, batch @ t+1, batch @ t+2), so new RDDs are created for every batch of the hashTags DStream, e.g. [#cat, #dog, …].

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
Output operation: push data to external storage. Every batch of the hashTags DStream is saved to HDFS.

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.foreach(hashTagRDD => { ... })
foreach: do whatever you want with the processed data in each batch, e.g. write to a database or update an analytics UI.
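
The slides never show `getTags`, so the following is only a guess at its shape, written in plain Scala over tweet text rather than Twitter `Status` objects. `HashtagDemo` and `hashTagsPerBatch` are hypothetical names; the second function mimics, with ordinary collections, how one flatMap on a DStream yields one new collection per batch.

```scala
// Plain-Scala sketch; not the Spark Streaming or Twitter API.
object HashtagDemo {
  // Hypothetical stand-in for the slides' getTags: whitespace-separated
  // words that start with '#' and have at least one more character.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(w => w.startsWith("#") && w.length > 1)

  // Applies the same flatMap to every batch, producing one output batch
  // per input batch.
  def hashTagsPerBatch(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(batch => batch.flatMap(getTags))
}
```

For example, `HashtagDemo.getTags("I love my #cat and #dog")` yields the two tags `#cat` and `#dog`.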

Demo

Java Example
Scala:
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")
Java (a Function object takes the place of the Scala closure):
    JavaDStream<Status> tweets = ssc.twitterStream()
    JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })

Window-based Transformations
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
Sliding window operation on a DStream of data: the first argument, Minutes(1), is the window length and the second, Seconds(5), is the sliding interval.
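
The interaction of window length and sliding interval is easier to see on a small example. The sketch below models it in plain Scala, with both parameters measured in batches; `WindowDemo` and `windowedCounts` are hypothetical names, not Spark Streaming methods. Under the assumption of 1-second batches, the slide's `window(Minutes(1), Seconds(5))` would correspond to `length = 60` and `slide = 5`.

```scala
// Plain-Scala model of a sliding-window count; not the Spark Streaming API.
object WindowDemo {
  // Every `slide` batches, counts each distinct value seen in the most recent
  // `length` batches (fewer at the start of the stream).
  // Assumes length >= 1 and slide >= 1.
  def windowedCounts(batches: Seq[Seq[String]], length: Int, slide: Int): Seq[Map[String, Int]] =
    (slide to batches.length by slide).map { end =>
      val window = batches.slice(math.max(0, end - length), end).flatten
      window.groupBy(identity).map { case (tag, occurrences) => (tag, occurrences.size) }
    }
}
```

When `length` is larger than `slide`, consecutive windows overlap, so the same batch contributes to several results.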

Arbitrary Stateful Computations
Specify a function to generate new state based on previous state and new data. Example: maintain per-user mood as state, and update it with their tweets.
    def updateMood(newTweets, lastMood) => newMood
    moods = tweetsByUser.updateStateByKey(updateMood _)
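
The `updateMood` line on the slide is pseudocode. A concrete version might look like the sketch below, which follows the `(Seq[V], Option[S]) => Option[S]` shape that Spark's `updateStateByKey` expects for its update function. Everything else is an assumption made for illustration: the smiley-based mood score, the `MoodDemo` object, and the `step` function, which imitates one batch of per-key state updates with plain Scala maps instead of DStreams.

```scala
// Plain-Scala sketch of per-key state updates; `step` models one batch of
// updateStateByKey and is not part of any Spark API.
object MoodDemo {
  // Hypothetical mood score: +1 for each tweet containing ":)" and
  // -1 for each tweet containing ":(".
  def updateMood(newTweets: Seq[String], lastMood: Option[Int]): Option[Int] = {
    val delta = newTweets.map { t =>
      (if (t.contains(":)")) 1 else 0) - (if (t.contains(":(")) 1 else 0)
    }.sum
    Some(lastMood.getOrElse(0) + delta)
  }

  // Runs updateMood for every user that has either existing state or new
  // tweets in this batch of (user, tweetText) pairs.
  def step(state: Map[String, Int], batch: Seq[(String, String)]): Map[String, Int] = {
    val byUser = batch.groupBy(_._1).map { case (user, pairs) => (user, pairs.map(_._2)) }
    (state.keySet ++ byUser.keySet).flatMap { user =>
      updateMood(byUser.getOrElse(user, Seq.empty), state.get(user)).map(mood => user -> mood)
    }.toMap
  }
}
```

Returning `None` from the update function would drop that user's state, which is how state for inactive keys can be discarded.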

Arbitrary Combinations of Batch and Streaming Computations
Intermix RDD and DStream operations! Example: join incoming tweets with a spam file on HDFS to filter out bad tweets.
    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
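
The slide elides the filter predicate and the layout of the spam file, so the sketch below fills them in with assumptions: tweets are `(user, text)` pairs and the static dataset maps each known user to an `isSpammer` flag. It uses plain Scala collections rather than RDDs, and `SpamFilterDemo` and `filterBatch` are hypothetical names.

```scala
// Plain-Scala model of joining one streaming batch with a static dataset.
// Not the Spark API; the data layout is assumed for illustration.
object SpamFilterDemo {
  // Inner-joins a batch of (user, text) pairs with a (user -> isSpammer)
  // table, then keeps only tweets from users flagged as non-spammers.
  def filterBatch(batch: Seq[(String, String)],
                  spamTable: Map[String, Boolean]): Seq[(String, String)] =
    batch
      .flatMap { case (user, text) =>
        spamTable.get(user).map(isSpammer => (user, (text, isSpammer)))
      }
      .filter { case (_, (_, isSpammer)) => !isSpammer }
      .map { case (user, (text, _)) => (user, text) }
}
```

Note that an inner join also drops tweets from users who are missing from the table; if those should be kept, a left outer join (`leftOuterJoin` on pair RDDs) would be the closer fit.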

DStreams + RDDs = Power
Online machine learning: continuously learn and update data models (updateStateByKey and transform).
Combine live data streams with historical data: generate historical data models with Spark, etc., then use the data models to process the live data stream (transform).
CEP-style processing: window-based operations (reduceByWindow, etc.).

Input Sources
Out of the box, we provide Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc. It is very easy to write a receiver for your own data source. You can also generate your own RDDs from Spark, etc., and push them in as a "stream".

Fault-tolerance
Batches of input data are replicated in memory for fault-tolerance. Data lost due to worker failure can be recomputed from the replicated input data. All transformations are fault-tolerant and exactly-once.
Diagram: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD (lost partitions recomputed on other workers).
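
Recomputation works because each transformation is deterministic: rerunning it on the replicated input reproduces exactly the partition that was lost. The following is a minimal sketch of that idea in plain Scala; `LineageDemo`, `transform`, and `recover` are hypothetical names, and partitions are modelled as ordinary sequences rather than RDD partitions.

```scala
// Plain-Scala model of recovering lost partitions by recomputation.
// Not the Spark API; names are illustrative.
object LineageDemo {
  // The deterministic transformation applied to every input partition.
  def transform(partition: Seq[String]): Seq[String] =
    partition.flatMap(_.split(" ").toSeq).filter(_.startsWith("#"))

  // `input` holds the replicated input partitions; `surviving` holds the
  // derived partitions that were not lost, keyed by partition index.
  // Any missing derived partition is rebuilt from its input partition.
  def recover(input: Vector[Seq[String]],
              surviving: Map[Int, Seq[String]]): Vector[Seq[String]] =
    input.indices.map(i => surviving.getOrElse(i, transform(input(i)))).toVector
}
```

Because the rebuilt partition is identical to the lost one, downstream results reflect every input record exactly once, regardless of how many times the transformation ran.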

Performance
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency.

Comparison with other systems
Higher throughput than Storm: Spark Streaming processes 670k records/sec/node, versus 115k records/sec/node for Storm. Commercial systems: 100-500k records/sec/node. Spark Streaming offers similar speed while providing fault-tolerance (FT) and consistency guarantees that these systems lack.
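
As a quick sanity check on the quoted figures, the arithmetic can be worked through in a few lines of Scala. The inputs come from the slides; the only assumption is that "GB" means decimal gigabytes. `ThroughputDemo` is a hypothetical name. The slides do not say which workloads produced each figure, so the derived per-node number is not expected to match the 670k figure exactly.

```scala
// Back-of-the-envelope arithmetic on the figures quoted in the slides.
object ThroughputDemo {
  val clusterRecordsPerSec = 60e6  // "60M records/sec"
  val clusterBytesPerSec   = 6e9   // "6 GB/sec", assuming decimal gigabytes
  val nodes                = 100

  // Per-node throughput implied by the cluster-wide figure.
  val recordsPerSecPerNode = clusterRecordsPerSec / nodes

  // Average record size implied by the two cluster-wide figures.
  val bytesPerRecord = clusterBytesPerSec / clusterRecordsPerSec

  // Ratio of the per-node figures quoted for Spark Streaming and Storm.
  val sparkVsStorm = 670e3 / 115e3
}
```

This works out to 600k records/sec/node and 100 bytes per record for the cluster-wide figure, and a ratio of roughly 5.8 between the Spark Streaming and Storm per-node figures.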

Fast Fault Recovery
Recovers from faults/stragglers within 1 sec.

Mobile Millennium Project
Traffic transit time estimation using online machine learning on GPS observations. Runs Markov-chain Monte Carlo simulations on GPS observations; this is very CPU intensive and requires dozens of machines for useful computation. Scales linearly with cluster size.

Advantage of a unified stack
Explore data interactively to identify problems:
    $ ./spark-shell
    scala> val file = sc.hadoopFile("smallLogs")
    ...
    scala> val filtered = file.filter(_.contains("ERROR"))
    scala> val mapped = filtered.map(...)
Use the same code in Spark for processing large logs:
    object ProcessProductionData {
      def main(args: Array[String]) {
        val sc = new SparkContext(...)
        val file = sc.hadoopFile("productionLogs")
        val filtered = file.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }
Use similar code in Spark Streaming for realtime processing:
    object ProcessLiveStream {
      def main(args: Array[String]) {
        val sc = new StreamingContext(...)
        val stream = sc.kafkaStream(...)
        val filtered = stream.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }

Roadmap
Spark 0.8.1: marked alpha, but has been quite stable. Master fault tolerance with manual recovery: restart the computation from a checkpoint file saved to HDFS.
Spark 0.9 in Jan 2014 – out of alpha! Automated master fault recovery; performance optimizations; web UI and better monitoring capabilities.

Roadmap
Long-term goals: Python API, MLlib for Spark Streaming, Shark Streaming.
Community feedback is crucial! It helps us prioritize the goals. Contributions are more than welcome!!

Today’s Tutorial
Process a Twitter data stream to find the most popular hashtags over a window. Requires a Twitter account: you need to set up Twitter OAuth keys to access tweets. All the instructions are in the tutorial. Your account will be safe! There is no need to enter your password anywhere, only the keys. Destroy the keys after the tutorial is done.

Conclusion
Streaming programming guide – spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
Research paper – tinyurl.com/dstreams
Thank you!