In-Memory Cluster Computing for Iterative and Interactive Applications

Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, Justin Ma, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Background Commodity clusters have become an important computing platform for a variety of applications. In industry: search, machine translation, ad targeting, … In research: bioinformatics, NLP, climate simulation, … High-level cluster programming models like MapReduce power many of these apps. Theme of this work: provide similarly powerful abstractions for a broader class of applications. Trends in data collection rates versus processor and I/O speeds are driving this shift.

Motivation Current popular programming models for clusters transform data flowing from stable storage to stable storage. E.g., MapReduce: Input → Map → Reduce → Output. The same applies to Dryad, SQL, etc. Benefits of this data flow model: the runtime can decide where to run tasks and can automatically recover from failures.

Motivation Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning) and interactive data mining tools (R, Excel, Python). Spark makes working sets a first-class concept to efficiently support these apps.

Spark Goal Provide distributed memory abstractions for clusters to support apps with working sets, while retaining the attractive properties of MapReduce: fault tolerance (for crashes & stragglers), data locality, and scalability. Solution: augment the data flow model with “resilient distributed datasets” (RDDs), a first-class way to manipulate and persist intermediate datasets.

Generality of RDDs We conjecture that Spark’s combination of data flow with RDDs unifies many proposed cluster programming models. General data flow models: MapReduce, Dryad, SQL. Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing. Key idea: instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

Outline Spark programming model Example applications Implementation Demo Future work

Programming Model Resilient distributed datasets (RDDs): immutable collections partitioned across the cluster that can be rebuilt if a partition is lost. RDDs are created by transforming data in stable storage using data flow operators (map, filter, group-by, …) and can be cached across parallel operations. Parallel operations on RDDs: reduce, collect, count, save, … Restricted shared variables: accumulators and broadcast variables. You write a single driver program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new idea is that they can be reused across operations. Variables in the driver program can be used in parallel operations: accumulators are useful for sending information back to the driver, while broadcast variables are an optimization for shipping read-only data. Everything is designed to be easy to distribute in a fault-tolerant fashion.
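A minimal Scala sketch of the two restricted shared variable types, using today's Spark API (sc.longAccumulator, sc.broadcast); the prototype described in these slides exposed the same concepts under slightly different calls. The SparkContext sc and all names are illustrative:

import org.apache.spark.SparkContext

// Illustrative sketch, not the prototype's exact API.
def sharedVariablesDemo(sc: SparkContext): Unit = {
  val badRecords = sc.longAccumulator("badRecords")  // workers add, driver reads
  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only, shipped once per worker

  val data = sc.parallelize(Seq("a", "b", "c"))
  val mapped = data.map { rec =>
    if (!lookup.value.contains(rec)) badRecords.add(1) // accumulate on workers
    lookup.value.getOrElse(rec, 0)
  }
  println(mapped.reduce(_ + _)) // parallel operation returning a result to the driver
  println(badRecords.value)     // read the accumulator back on the driver
}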

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                  // cached RDD

cachedMsgs.filter(_.contains("foo")).count     // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to workers; each worker reads one block of the file and keeps its partition of the cached RDD in memory.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

RDDs in More Detail An RDD is an immutable, partitioned, logical collection of records. It need not be materialized; rather, it contains enough information to rebuild the dataset from stable storage. Partitioning can be based on a key in each record (using hash or range partitioning). RDDs are built using bulk transformations on other RDDs and can be cached for future reuse.
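As a concrete illustration of key-based partitioning, here is a short sketch using the released Spark API's HashPartitioner (names are illustrative; a RangePartitioner exists for range partitioning):

import org.apache.spark.{HashPartitioner, SparkContext}

// Hash-partition a pair RDD by key; later joins/group-bys using the same
// partitioner can then avoid a network shuffle.
def partitionByKey(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
  println(partitioned.partitioner) // Some(HashPartitioner)
}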

RDD Operations
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations (return a result to the driver): reduce, collect, count, save, lookupKey, …

RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions. Example:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(…)) → MappedRDD (func: split(…)) → CachedRDD

Benefits of RDD Model Consistency is easy due to immutability. Fault tolerance is inexpensive (log lineage rather than replicating/checkpointing data). Tasks can be scheduled on partitions in a locality-aware way. Despite being restricted, the model seems applicable to a broad variety of applications.

RDDs vs Distributed Shared Memory

Concern              | RDDs                                         | Distr. Shared Mem.
Reads                | Fine-grained                                 | Fine-grained
Writes               | Bulk transformations                         | Fine-grained
Consistency          | Trivial (immutable)                          | Up to app / runtime
Fault recovery       | Fine-grained and low-overhead using lineage  | Requires checkpoints and program rollback
Straggler mitigation | Possible using speculative execution         | Difficult
Work placement       | Automatic based on data locality             | Up to app (but runtime aims for transparency)

Related Work
DryadLINQ: language-integrated API with SQL-like operations on lazy datasets; cannot have a dataset persist across queries.
Relational databases: lineage/provenance, logical logging, materialized views.
Piccolo: parallel programs with shared distributed tables; similar to distributed shared memory.
Iterative MapReduce (Twister and HaLoop): cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively.
RAMCloud: allows random read/write to all cells, requiring logging much like distributed shared memory systems.

Outline Spark programming model Example applications Implementation Demo Future work

Example: Logistic Regression Goal: find the best line separating two sets of points. [Figure: two classes of points (+ and –); a random initial line converges toward the target separator.] Note that the dataset is reused on each gradient computation.

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance Hadoop: 127 s / iteration. Spark: 174 s for the first iteration (loading and caching the data), 6 s for each further iteration. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

Example: MapReduce MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))

Word Count in Spark

val lines = spark.textFile("hdfs://...")
val counts = lines.flatMap(_.split("\\s"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.save("hdfs://...")

Example: Pregel Pregel is a graph processing framework from Google that implements the Bulk Synchronous Parallel model. Vertices in the graph have state. At each superstep, each vertex can update its state and send messages to other vertices for the next superstep. Good fit for PageRank, shortest paths, …

Pregel Data Flow [Figure: the input graph, vertex state 1, and messages 1 are grouped by vertex ID for superstep 1, producing vertex state 2 and messages 2; these are grouped by vertex ID again for superstep 2, and so on.]

PageRank in Pregel [Figure: the input graph, vertex ranks 1, and contributions 1 are grouped and added by vertex in superstep 1, producing vertex ranks 2 and contributions 2; the process repeats each superstep.]

Pregel in Spark Use separate RDDs for the immutable graph state and for the vertex states and messages at each iteration. Use groupByKey to perform each superstep, and cache the resulting vertex and message RDDs. Optimization: co-partition the input graph and vertex state RDDs to reduce communication, as in the sketch below.
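A sketch of this pattern for PageRank, using the released Spark API (vertex IDs, partition count, and damping constants are illustrative). The per-superstep group-by collapses to reduceByKey here because PageRank only sums its messages:

import org.apache.spark.{HashPartitioner, SparkContext}

def pagerank(sc: SparkContext, iterations: Int): Unit = {
  val part = new HashPartitioner(8)
  // Immutable graph state: (vertex, neighbors), co-partitioned with the ranks.
  val links = sc.parallelize(Seq(1 -> Seq(2, 3), 2 -> Seq(1), 3 -> Seq(1)))
                .partitionBy(part).cache()
  var ranks = links.mapValues(_ => 1.0) // vertex state for the current superstep

  for (i <- 1 to iterations) {
    // "Messages": each vertex sends rank/degree to its neighbors...
    val contribs = links.join(ranks).flatMap {
      case (_, (nbrs, rank)) => nbrs.map(n => (n, rank / nbrs.size))
    }
    // ...which are then grouped and added by destination vertex.
    ranks = contribs.reduceByKey(part, _ + _).mapValues(0.15 + 0.85 * _)
  }
  ranks.collect().foreach(println)
}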

Other Spark Applications Twitter spam classification (Justin Ma). EM algorithm for traffic prediction (Mobile Millennium). K-means clustering. Alternating least squares matrix factorization. In-memory OLAP aggregation on Hive data. SQL on Spark (future work).

Outline Spark programming model Example applications Implementation Demo Future work

Overview Spark runs on the Mesos cluster manager [NSDI ’11], letting it share resources with Hadoop and other apps, and can read from any Hadoop input source (e.g., HDFS). [Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes.] The system is designed to be fault-tolerant, and is only ~6,000 lines of Scala code thanks to building on Mesos.

Language Integration Scala closures are serializable Java objects: serialize on the driver, load and run on workers. This alone is not quite enough: nested closures may reference their entire outer scope and may pull in non-serializable variables not actually used inside. Solution: bytecode analysis + reflection. Shared variables are implemented using custom serialized forms (e.g., a broadcast variable contains a pointer to a BitTorrent tracker).
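The outer-scope problem shows up in user code like the following hedged sketch (LogQuery is a made-up class; Spark's actual closure cleaner handles many, but not all, such cases automatically):

import org.apache.spark.rdd.RDD

class LogQuery(val connection: java.sql.Connection) { // non-serializable field
  val pattern = "ERROR"

  // The closure references this.pattern, so it captures the whole LogQuery
  // object (including connection); serialization typically fails.
  def bad(lines: RDD[String]): RDD[String] =
    lines.filter(line => line.contains(pattern))

  // Copying the field into a local first means the closure captures only a String.
  def good(lines: RDD[String]): RDD[String] = {
    val p = pattern
    lines.filter(line => line.contains(p))
  }
}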

Interactive Spark We modified the Scala interpreter to allow Spark to be used interactively from the command line. This required two changes: (1) modify wrapper code generation so that each “line” typed has references to objects for its dependencies; (2) place generated classes in a distributed filesystem. This enables in-memory exploration of big data.
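A hypothetical session in the modified interpreter might look like this (the path and result counts are illustrative):

scala> val lines = spark.textFile("hdfs://.../logs")
scala> val errors = lines.filter(_.contains("ERROR")).cache()
scala> errors.count()
res0: Long = 42
scala> errors.filter(_.contains("timeout")).count()
res1: Long = 7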

Outline Spark programming model Example applications Implementation Demo Future work

Future Work Further extend RDD capabilities: control over storage layout (e.g., column-oriented) and additional caching options (e.g., on disk, replicated). Leverage lineage for debugging: replay any task, rebuild any intermediate RDD. Adaptive checkpointing of RDDs. Higher-level analytics tools built on top of Spark.

Conclusion By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics. RDDs provide lineage information for fault recovery and debugging, adjustable in-memory caching, and locality-aware parallel operations. We plan to make Spark the basis of a suite of batch and interactive data analysis tools.

RDD Internal API Set of partitions Preferred locations for each partition Optional partitioning scheme (hash or range) Storage strategy (lazy or cached) Parent RDDs (forming a lineage DAG)
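As a rough Scala sketch, the interface described above might look like the following (names approximate the slide's description, not Spark's actual source):

// Illustrative shape of the internal RDD interface.
trait Split { def index: Int } // one partition of a dataset
trait Partitioner              // hash or range partitioning scheme

sealed trait StorageLevel
object StorageLevel {
  case object Lazy extends StorageLevel   // recompute from lineage on demand
  case object Cached extends StorageLevel // keep materialized in memory
}

abstract class RDD[T](val parents: List[RDD[_]]) {  // parents form the lineage DAG
  def splits: Array[Split]                          // set of partitions
  def preferredLocations(s: Split): Seq[String]     // locality hints per partition
  def partitioner: Option[Partitioner]              // optional partitioning scheme
  def compute(s: Split): Iterator[T]                // derive a partition's records
  var storageLevel: StorageLevel = StorageLevel.Lazy // storage strategy
}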