Massive Data Processing – In-Memory Computing & Spark Stream Process.

Slides:



Advertisements
Similar presentations
Spark Streaming Large-scale near-real-time stream processing
Advertisements

Spark Streaming Large-scale near-real-time stream processing
Spark Streaming Real-time big-data processing
Spark Streaming Large-scale near-real-time stream processing
UC Berkeley a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
Comp6611 Course Lecture Big data applications Yang PENG Network and System Lab CSE, HKUST Monday, March 11, 2013 Material adapted from.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Berkeley Data Analytics Stack
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Spark: Cluster Computing with Working Sets
Spark Fast, Interactive, Language-Integrated Cluster Computing Wen Zhiguang
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Frameworks (and Stream Processing) Aditya Akella.
Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.
Spark Fast, Interactive, Language-Integrated Cluster Computing.
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)
Matei Zaharia Large-Scale Matrix Operations Using a Data Flow Engine.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Discretized Streams Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li, Timothy Hunter, Scott Shenker, Ion.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Berkley Data Analysis Stack (BDAS)
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Fast and Expressive Big Data Analytics with Python
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Cluster Computing for Iterative and Interactive Applications
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Spark Resilient Distributed Datasets:
In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Cluster Computing for Iterative and Interactive Applications
UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Pregel: A System for Large-Scale Graph Processing Presented by Dylan Davis Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert,
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.
Spark Streaming Large-scale near-real-time stream processing
Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Lecture 7: Practical Computing with Large Data Sets cont. CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks.
Matei Zaharia Introduction to. Outline The big data problem Spark programming model User community Newest addition: DataFrames.
Data Engineering How MapReduce Works
Berkeley Data Analytics Stack Prof. Chi (Harold) Liu November 2015.
Other Map-Reduce (ish) Frameworks: Spark William Cohen 1.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
Big Data Infrastructure Week 12: Real-Time Data Analytics (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
CSCI5570 Large Scale Data Processing Systems Distributed Data Analytics Systems Slide Ack.: modified based on the slides from Matei Zaharia James Cheng.
Image taken from: slideshare
Spark: Cluster Computing with Working Sets
Berkeley Data Analytics Stack - Apache Spark
Fast, Interactive, Language-Integrated Cluster Computing
Introduction to Spark Streaming for Real Time data analysis
Spark.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Spark Presentation.
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Introduction to Spark.
CS110: Discussion about Spark
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Spark and Scala.
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Introduction to Spark.
Fast, Interactive, Language-Integrated Cluster Computing
Streaming data processing using Spark
Lecture 29: Distributed Systems
CS639: Data Management for Data Science
Presentation transcript:

Massive Data Processing – In-Memory Computing & Spark Stream Process

Background Commodity clusters have become an important computing platform for a variety of applications »In industry: search, machine translation, ad targeting, … »In research: bioinformatics, NLP, climate simulation, … High-level cluster programming models like MapReduce power many of these apps

Motivation Current popular programming models for clusters transform data flowing from stable storage to stable storage E.g., MapReduce: Map Reduce InputOutput

Motivation Map Reduce InputOutput Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures Current popular programming models for clusters transform data flowing from stable storage to stable storage E.g., MapReduce:

Motivation Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: »Iterative algorithms (many in machine learning) »Interactive data mining tools (R, Excel, Python) Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal Provide distributed memory abstractions for clusters to support apps with working sets Retain the attractive properties of MapReduce: »Fault tolerance (for crashes & stragglers) »Data locality »Scalability Solution: augment data flow model with “resilient distributed datasets” (RDDs)

Programming Model Resilient distributed datasets (RDDs) »Immutable collections partitioned across cluster that can be rebuilt if a partition is lost »Created by transforming data in stable storage using data flow operators (map, filter, group-by, …) »Can be cached across parallel operations Parallel operations on RDDs »Reduce, collect, count, save, … Restricted shared variables »Accumulators, broadcast variables

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2)) cachedMsgs = messages.cache() Block 1 Block 2 Block 3 Worker Driver cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count... tasks results Cache 1 Cache 2 Cache 3 Base RDD Transformed RDD Cached RDD Parallel operation Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

RDDs in More Detail An RDD is an immutable, partitioned, logical collection of records »Need not be materialized, but rather contains information to rebuild a dataset from stable storage Partitioning can be based on a key in each record (using hash or range partitioning) Built using bulk transformations on other RDDs Can be cached for future reuse

RDD Operations Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Parallel operations (return a result to driver) reduce collect count save lookupKey …

RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions Ex: cachedMsgs = textFile(...).filter(_.contains(“error”)).map(_.split(‘\t’)(2)).cache() HdfsRDD path: hdfs://… HdfsRDD path: hdfs://… FilteredRDD func: contains(...) FilteredRDD func: contains(...) MappedRDD func: split(…) MappedRDD func: split(…) CachedRDD

Benefits of RDD Model Consistency is easy due to immutability Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data) Locality-aware scheduling of tasks on partitions Despite being restricted, model seems applicable to a broad variety of applications

RDDs vs Distributed Shared Memory ConcernRDDsDistr. Shared Mem. ReadsFine-grained WritesBulk transformationsFine-grained ConsistencyTrivial (immutable)Up to app / runtime Fault recoveryFine-grained and low- overhead using lineage Requires checkpoints and program rollback Straggler mitigation Possible using speculative execution Difficult Work placementAutomatic based on data locality Up to app (but runtime aims for transparency)

Example: Logistic Regression Goal: find best line separating two sets of points + – – – – – – – – – + target – random initial line

Logistic Regression Code val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)

Example: MapReduce MapReduce data flow can be expressed using RDD transformations res = data.flatMap(rec => myMapFunc(rec)).groupByKey().map((key, vals) => myReduceFunc(key, vals)) Or with combiners: res = data.flatMap(rec => myMapFunc(rec)).reduceByKey(myCombiner).map((key, val) => myReduceFunc(key, val))

Word Count in Spark val lines = spark.textFile(“hdfs://...”) val counts = lines.flatMap(_.split(“\\s”)).reduceByKey(_ + _) counts.save(“hdfs://...”)

Example: Pregel Graph processing framework from Google that implements Bulk Synchronous Parallel model Vertices in the graph have state At each superstep, each node can update its state and send messages to nodes in future step Good fit for PageRank, shortest paths, …

Pregel Data Flow Input graph Vertex state 1 Messages 1 Superstep 1 Vertex state 2 Messages 2 Superstep 2... Group by vertex ID

PageRank in Pregel Input graph Vertex ranks 1 Contributions 1 Superstep 1 (add contribs) Vertex ranks 2 Contributions 2 Superstep 2 (add contribs)... Group & add by vertex

Pregel in Spark Separate RDDs for immutable graph state and for vertex states and messages at each iteration Use groupByKey to perform each step Cache the resulting vertex and message RDDs Optimization: co-partition input graph and vertex state RDDs to reduce communication

Other Spark Applications Twitter spam classification (Justin Ma) EM alg. for traffic prediction (Mobile Millennium) K-means clustering Alternating Least Squares matrix factorization In-memory OLAP aggregation on Hive data SQL on Spark (future work)

Overview Spark runs on the Mesos cluster manager [NSDI 11], letting it share resources with Hadoop & other apps Can read from any Hadoop input source (e.g. HDFS) Spark Hadoop MPI Mesos Node … ~6000 lines of Scala code thanks to building on Mesos

Language Integration Scala closures are Serializable Java objects »Serialize on driver, load & run on workers Not quite enough »Nested closures may reference entire outer scope »May pull in non-Serializable variables not used inside »Solution: bytecode analysis + reflection Shared variables implemented using custom serialized form (e.g. broadcast variable contains pointer to BitTorrent tracker)

Interactive Spark Modified Scala interpreter to allow Spark to be used interactively from the command line Required two changes: »Modified wrapper code generation so that each “line” typed has references to objects for its dependencies »Place generated classes in distributed filesystem Enables in-memory exploration of big data

RDD Internal API Set of partitions Preferred locations for each partition Optional partitioning scheme (hash or range) Storage strategy (lazy or cached) Parent RDDs (forming a lineage DAG)

Massive Data Processing – Spark Stream Process

What is Spark Streaming? Framework for large scale stream processing »Scales to 100s of nodes »Can achieve second scale latencies »Integrates with Spark’s batch and interactive processing »Provides a simple batch-like API for implementing complex algorithm »Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Motivation Many important applications must process large streams of live data and provide results in near-real-time »Social network trends »Website statistics »Intrustion detection systems »etc. Require large clusters to handle workloads Require latencies of few seconds

Need for a framework … … for building such complex stream processing applications But what are the requirements from such a framework?

Requirements Scalable to large clusters Second-scale latencies Simple programming model

Spark Streaming 32

Motivation Many important applications must process large streams of live data and provide results in near-real-time »Social network trends »Website statistics »Intrusion detection systems etc. »Require large clusters to handle workloads »Require latencies of few seconds

What is Spark Streaming? Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark’s batch and interactive processing Provides a simple batch-like API Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

What is Spark streaming? Spark streaming is an extension of the core Spark API that enables scalable, high-throughput stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets

How Spark streaming works Spark Streaming receives live input data streams and divides the data into batches. Each batch is processed by the Spark engine to generate the final stream of results.

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs

Example – Get hashtags from Twitter val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: to push data to external storage flatMap save t+1 t+2 tweets DStream hashTags DStream Every batch saved to HDFS, Write to database, update analytics UI, do whatever appropriate

Example ( Scala and Java) Scala val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...”) Java JavaDStream tweets = ssc.twitterStream() JavaDstream hashTags = tweets.flatMap(new Function { }) hashTags.saveAsHadoopFiles("hdfs://...")

Combinations of Batch and Streaming Computations »Combine batch RDD with streaming DStream  Example: Join incoming tweets with a spam HDFS file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })

Similar components Higher throughput than Storm Spark Streaming: 670k records/second/node Storm: 115k records/second/node Apache S4: 7.5k records/second/node

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs 42 Spark Streaming batches of X seconds live data stream processed results  Chop up the live stream into batches of X seconds  Spark treats each batch of data as RDDs and processes them using RDD operations  Finally, the processed results of the RDD operations are returned in batches

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs 43 Spark Streaming batches of X seconds live data stream processed results  Batch sizes as low as ½ second, latency ~ 1 second  Potential for combining batch processing and streaming processing in the same system

Fault-tolerance RDDs are remember the sequence of operations that created it from the original fault-tolerant input data Batches of input data are replicated in memory of multiple worker nodes, therefore fault-tolerant Data lost due to worker failure, can be recomputed from input data input data replicated in memory flatMap lost partitions recomputed on other workers tweets RDD hashTags RDD

Key concepts DStream – sequence of RDDs representing a stream of data »Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets Transformations – modify data from on DStream to another »Standard RDD operations – map, countByValue, reduce, join, … »Stateful operations – window, countByValueAndWindow, … Output Operations – send data to external entity »saveAsHadoopFiles – saves to HDFS »foreach – do anything with each batch of results