
1 Berkeley Data Analysis Stack (BDAS)
Mesos, Spark, Shark, Spark Streaming

2 Current Data Analysis Open Stack
Layers: Application, Data Processing, Storage, Infrastructure.
Characteristics: batch processing on on-disk data; not very efficient for "interactive" and "streaming" computations.

3 Goal

4 Berkeley Data Analytics Stack (BDAS)
Application – new apps: AMP-Genomics, Carat, ...
Data Processing – in-memory processing; trade between time, quality, and cost.
Data Management / Storage – efficient data sharing across frameworks.
Infrastructure / Resource Management – share infrastructure across frameworks (multi-programming for datacenters).
The new stack differs in two fundamental aspects. At the application layer we build new, real applications such as AMP-Genomics (a genomics pipeline) and Carat, an application I'm going to talk about soon; building real applications allows us to drive the features and design of the lower layers.

5 BDAS Components

6 Mesos A platform for sharing commodity clusters between diverse computing frameworks. B. Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, technical report, UC Berkeley, 2010.

7 Mesos “Resource Offers” to publish available resources
Mesos has to deal with framework-specific constraints without knowing the specific constraints, e.g. data locality. It therefore allows the framework scheduler to reject an offer if its constraints are not met. B. Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, technical report, UC Berkeley, 2010.
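A minimal conceptual sketch of this offer model in Scala (this is not the real Mesos API; the Offer class, the locality check, and the launch/decline stubs are all hypothetical): the framework's scheduler inspects each offer against its own constraints and either launches tasks on it or rejects it, so Mesos never needs to understand those constraints.

// Hypothetical, simplified model of a Mesos resource offer.
case class Offer(slaveId: String, cpus: Double, memGb: Double)

// Framework-specific constraint that Mesos knows nothing about,
// e.g. "only run on slaves holding a local copy of my input data".
def hasLocalData(slaveId: String): Boolean = slaveId.startsWith("rack1")  // placeholder check

def launchTasks(offer: Offer): Unit = ()   // stub: accept the offer and schedule tasks on it
def declineOffer(offer: Offer): Unit = ()  // stub: reject the offer and wait for a better one

def onResourceOffers(offers: Seq[Offer]): Unit =
  offers.foreach { offer =>
    if (hasLocalData(offer.slaveId) && offer.cpus >= 2.0) launchTasks(offer)
    else declineOffer(offer)
  }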

8 Mesos Other Issues: Resource Allocation Strategies: Pluggable
A fair-sharing plugin is implemented; allocations can be revoked.
Isolation: existing OS isolation techniques (Linux Containers).
Fault tolerance:
Master – standby master nodes and ZooKeeper.
Slaves – task/slave failures are reported to the framework, which handles them.
Framework scheduler failure – replicate the scheduler.
B. Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, technical report, UC Berkeley, 2010.

9 Spark
Current popular programming models for clusters transform data flowing from stable storage to stable storage.
Spark: in-memory cluster computing for iterative and interactive applications.
Assumptions: the same working data set is used across iterations or for a number of interactive queries; commodity cluster; local data partitions fit in memory.
Some slides taken from presentation:

10 Spark Acyclic data flows – powerful abstractions
But they are not efficient for iterative/interactive applications that repeatedly use the same "working data set". Solution: augment the data flow model with in-memory "resilient distributed datasets" (RDDs).

11 RDDs
An RDD is an immutable, partitioned, logical collection of records:
It need not be materialized; instead it carries the information needed to rebuild the dataset from stable storage (lazy loading and lineage), so a lost partition can be rebuilt (transform once, read many).
Partitioning can be based on a key in each record (using hash or range partitioning).
Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...).
Can be cached for future reuse.
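A small sketch of these properties in Spark's Scala API (the HDFS path and the tab-separated record format are placeholders, and sc stands for an existing SparkContext, which the later slides call spark):

// Each step only records lineage; nothing is computed yet.
val lines  = sc.textFile("hdfs://...")                  // base data in stable storage
val pairs  = lines.map(l => (l.split("\t")(0), 1))      // key each record by its first field
val counts = pairs.reduceByKey(_ + _)                   // partitioned by key (hash partitioning)
counts.cache()                                          // mark for in-memory reuse

counts.count()   // first action: materializes and caches the partitions
counts.count()   // reuses the cache; a lost partition is rebuilt from its lineage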

12 Generality of RDDs
Claim: Spark's combination of data flow with RDDs unifies many proposed cluster programming models.
General data flow models: MapReduce, Dryad, SQL.
Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing.
Key idea: instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

13 Programming Model Transformations (define a new RDD)
map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...
Parallel operations (return a result to the driver): reduce, collect, count, save, lookupKey, ...

14 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://...")             // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                    // cached RDD
cachedMsgs.filter(_.contains("foo")).count       // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the driver ships tasks to workers; each worker reads one block of the file (Block 1-3), caches its partition of the RDD in memory (Cache 1-3), and returns results to the driver.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

15 RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
Lineage: HdfsRDD (path: hdfs://...) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(...)) -> CachedRDD
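In current Spark releases the recorded lineage can also be inspected directly; a small sketch (the path is a placeholder, sc is an assumed SparkContext, and the exact output format varies by version):

val cachedMsgs = sc.textFile("hdfs://.../log")
                   .filter(_.contains("error"))
                   .map(_.split("\t")(2))
                   .cache()
// Prints the dependency chain this RDD would use to rebuild lost partitions.
println(cachedMsgs.toDebugString)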

16 Benefits of RDD Model Consistency is easy due to immutability
Inexpensive fault tolerance (log the lineage rather than replicating/checkpointing data)
Locality-aware scheduling of tasks on partitions
Despite being restricted, the model seems applicable to a broad variety of applications

17 Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Diagram: points of the two classes, a random initial line, and the target separating line.]
Note that the dataset is reused on each gradient computation.

18 Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
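The snippet leaves a few definitions to the reader; a sketch of what they might look like (the input format and the concrete Vector type are assumptions, not part of the original example):

import scala.math.exp   // used in the gradient expression above

val D = 10              // number of features (assumed)
val ITERATIONS = 5      // number of gradient steps (assumed)

case class Point(x: Vector, y: Double)   // y is the label, +1 or -1

// Hypothetical input format: the label followed by D space-separated features.
// Vector stands for a dense-vector class providing dot, +, -, and Vector.random(D).
def readPoint(line: String): Point = {
  val nums = line.split(' ').map(_.toDouble)
  Point(new Vector(nums.tail), nums.head)
}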

19 Logistic Regression Performance
Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations. (29 GB dataset on 20 EC2 m1.xlarge machines, 4 cores each.)

20 Page Rank: Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
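One way the two input RDDs might be built (a sketch; the tab-separated edge-list file format is an assumption, and sc is an assumed SparkContext):

// Hypothetical input: one line per edge, "url<TAB>neighbor".
val edges = sc.textFile("hdfs://.../edges.tsv")
              .map { line => val p = line.split("\t"); (p(0), p(1)) }
val links = edges.groupByKey().cache()   // (url, neighbors); reused every iteration, so cache it
var ranks = links.mapValues(_ => 1.0)    // start every page with rank 1.0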

21 Spark Summary
Fast, expressive cluster computing system compatible with Apache Hadoop; works with any Hadoop-supported storage system (HDFS, S3, Avro, ...).
Improves efficiency through in-memory computing primitives and general computation graphs (up to 100× faster).
Improves usability through rich APIs in Java, Scala, Python and an interactive shell (often 2-10× less code).

22 Spark Streaming
Framework for large-scale stream processing:
Scales to 100s of nodes.
Can achieve second-scale latencies.
Integrates with Spark's batch and interactive processing.
Provides a simple batch-like API for implementing complex algorithms.
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

23 Requirements Integrated with batch & interactive processing
Scalable to large clusters. Second-scale latencies. Simple programming model.

24 Stateful Stream Processing
Traditional streaming systems have an event-driven, record-at-a-time processing model: each node holds mutable state, and for each record it updates that state and sends new records downstream. The state is lost if the node dies, so making stateful stream processing fault-tolerant is challenging.
[Diagram: input records flow through nodes 1-3, each holding mutable state.]

25 Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs: Spark Streaming chops the live data stream into batches of X seconds, Spark treats each batch of data as an RDD and processes it using RDD operations, and the processed results of the RDD operations are returned in batches.
[Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results.]
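A minimal sketch of setting up such a computation (the examples on the next slides assume a StreamingContext named ssc already exists; the package and constructor follow the later org.apache.spark.streaming API, sparkConf is an assumed existing SparkConf, and the 1-second batch interval is just an example):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 1 second: the live stream is chopped into 1-second RDDs.
val ssc = new StreamingContext(sparkConf, Seconds(1))
// ... define DStream sources and transformations here (see the examples below) ...
ssc.start()               // start receiving data and running the batch jobs
ssc.awaitTermination()    // block until the streaming computation is stopped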

26 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data (here fed by the Twitter Streaming API).
[Diagram: the tweets DStream as a timeline of batches t, t+1, t+2, each stored in memory as an RDD (immutable, distributed).]

27 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
flatMap is a transformation: it modifies data in one DStream to create a new DStream.
[Diagram: flatMap is applied to each batch t, t+1, t+2 of the tweets DStream, producing the hashTags DStream ([#cat, #dog, ...]); new RDDs are created for every batch.]

28 Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
saveAsHadoopFiles is an output operation: it pushes data to external storage.
[Diagram: every batch t, t+1, t+2 of the hashTags DStream is saved to HDFS.]

29 Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant.
Data lost due to worker failure can be recomputed from the replicated input data.
[Diagram: the tweets RDD (input data replicated in memory) flatMaps into the hashTags RDD; lost partitions are recomputed on other workers.]

30 Key concepts
DStream – sequence of RDDs representing a stream of data; sources include Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actors, and TCP sockets.
Transformations – modify data from one DStream to another: standard RDD operations (map, countByValue, reduce, join, ...) and stateful operations (window, countByValueAndWindow, ...).
Output operations – send data to an external entity: saveAsHadoopFiles (saves to HDFS), foreach (do anything with each batch of results).
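For example, the stateful window operations let Example 1 count hashtags over a sliding window; a sketch building on the hashTags DStream from the earlier slides (the 1-minute window and 5-second slide are just illustrative values):

import org.apache.spark.streaming.{Minutes, Seconds}

// Count each hashtag over the last minute, recomputed every 5 seconds.
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
tagCounts.print()   // output operation: prints a few counts from every batch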

31 Comparison with Storm and S4
Higher throughput than Storm:
Spark Streaming: 670k records/second/node.
Storm: 115k records/second/node.
Apache S4: 7.5k records/second/node.
Spark Streaming offers similar speed while providing fault-tolerance and consistency guarantees that these systems lack.

