Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma,

Slides:



Advertisements
Similar presentations
Introduction to Apache HIVE
Advertisements

Spark Streaming Large-scale near-real-time stream processing
Spark Streaming Large-scale near-real-time stream processing
Machine Learning on Spark
Shark:SQL and Rich Analytics at Scale
Shark Hive SQL on Spark Michael Armbrust.
The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
Spark Streaming Large-scale near-real-time stream processing
Berkley Data Analysis Stack Shark, Bagel. 2 Previous Presentation Summary Mesos, Spark, Spark Streaming Infrastructure Storage Data Processing Application.
UC Berkeley a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Spark Lightning-Fast Cluster Computing UC BERKELEY.
Spark in the Hadoop Ecosystem Eric Baldeschwieler (a.k.a. Eric14)
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Spark: Cluster Computing with Working Sets
Spark Fast, Interactive, Language-Integrated Cluster Computing Wen Zhiguang
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Spark Fast, Interactive, Language-Integrated Cluster Computing.
Matei Zaharia Large-Scale Matrix Operations Using a Data Flow Engine.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Fast and Expressive Big Data Analytics with Python
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Cluster Computing for Iterative and Interactive Applications
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
In-Memory Cluster Computing for Iterative and Interactive Applications
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
Outline | Motivation| Design | Results| Status| Future
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
UC Berkeley Spark A framework for iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Storage in Big Data Systems
HadoopDB Presenters: Serva rashidyan Somaie shahrokhi Aida parbale Spring 2012 azad university of sanandaj 1.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Tachyon: memory-speed data sharing Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica Good morning everyone. My name is Haoyuan,
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Spark Streaming Large-scale near-real-time stream processing
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
Spark. Spark ideas expressive computing system, not limited to map-reduce model facilitate system memory – avoid saving intermediate results to disk –
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Lecture 7: Practical Computing with Large Data Sets cont. CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks.
Matei Zaharia Introduction to. Outline The big data problem Spark programming model User community Newest addition: DataFrames.
Data Engineering How MapReduce Works
Matthew Winter and Ned Shawa
Operating Systems and The Cloud, Part II: Search => Cluster Apps => Scalable Machine Learning David E. Culler CS162 – Operating Systems and Systems Programming.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
Spark Debugger Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica UC BERKELEY.
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules Apache Spark Osman AIDEL.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
CSCI5570 Large Scale Data Processing Systems Distributed Data Analytics Systems Slide Ack.: modified based on the slides from Matei Zaharia James Cheng.
Spark: Cluster Computing with Working Sets
Fast, Interactive, Language-Integrated Cluster Computing
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
ITCS-3190.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Spark Presentation.
Data Platform and Analytics Foundational Training
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Introduction to Spark.
Sajitha Naduvil-vadukootu
Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC
Overview of big data tools
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC
Charles Tappert Seidenberg School of CSIS, Pace University
Fast, Interactive, Language-Integrated Cluster Computing
Presentation transcript:

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin UC Berkeley spark-project.org Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data UC BERKELEY

What is Spark? Not a modified version of Hadoop Separate, fast, MapReduce-like engine »In-memory data storage for very fast iterative queries »General execution graphs and powerful optimizations »Up to 40x faster than Hadoop Compatible with Hadoop’s storage APIs »Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc

What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc) Similar speedups of up to 40x

Project History Spark project started in 2009, open sourced 2010 Shark started summer 2011, alpha April 2012 In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others 200+ member meetup, 500+ watchers on GitHub

This Talk Spark programming model User applications Shark overview Demo Next major addition: Streaming Spark

Why a New Programming Model? MapReduce greatly simplified big data analysis But as soon as it got popular, users wanted more: »More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning) »More interactive ad-hoc queries Both multi-stage and interactive apps require faster data sharing across parallel jobs

Data Sharing in MapReduce iter. 1 iter Input HDFS read HDFS write HDFS read HDFS write Input query 1 query 2 query 3 result 1 result 2 result 3... HDFS read Slow due to replication, serialization, and disk IO

iter. 1 iter Input Data Sharing in Spark Distributed memory Input query 1 query 2 query 3... one-time processing × faster than network and disk

Spark Programming Model Key idea: resilient distributed datasets (RDDs) »Distributed collections of objects that can be cached in memory across cluster nodes »Manipulated through various parallel operators »Automatically rebuilt on failure Interface »Clean language-integrated API in Scala »Can be used interactively from Scala console

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2)) cachedMsgs = messages.cache() Block 1 Block 2 Block 3 Worker Driver cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count... tasks results Cache 1 Cache 2 Cache 3 Base RDD Transformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data E.g: messages = textFile(...).filter(_.contains(“error”)).map(_.split(‘\t’)(2)) HadoopRDD path = hdfs://… HadoopRDD path = hdfs://… FilteredRDD func = _.contains(...) FilteredRDD func = _.contains(...) MappedRDD func = _.split(…) MappedRDD func = _.split(…)

Example: Logistic Regression val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w) Initial parameter vector Repeated MapReduce steps to do gradient descent Load data in memory once

Logistic Regression Performance 127 s / iteration first iteration 174 s further iterations 6 s

Supported Operators map filter groupBy sort join leftOuterJoin rightOuterJoin reduce count reduceByKey groupByKey first union cross sample cogroup take partitionBy pipe save...

Other Engine Features General graphs of operators (e.g. map-reduce-reduce) Hash-based reduces (faster than Hadoop’s sort) Controlled data partitioning to lower communication

Spark Users

User Applications In-memory analytics & anomaly detection (Conviva) Interactive queries on data streams (Quantifind) Exploratory log analysis (Foursquare) Traffic estimation w/ GPS data (Mobile Millennium) Twitter spam classification (Monarch)...

Conviva GeoReport Group aggregations on many keys w/ same filter 40× gain over Hive from avoiding repeated reading, deserialization and filtering Time (hours)

Mobile Millennium Project Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edutraffic.berkeley.edu Iterative EM algorithm scaling to 160 nodes Estimate city traffic from crowdsourced GPS data

Shark: Hive on Spark

Motivation Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes Scala is good for programmers, but many data users only know SQL Can we extend Hive to run on Spark?

Hive Architecture Meta store HDFS Client Driver SQL Parser Query Optimizer Physical Plan Execution CLIJDBC MapReduce

Shark Architecture Meta store HDFS Client Driver SQL Parser Physical Plan Execution CLIJDBC Spark Cache Mgr. Query Optimizer [Engle et al, SIGMOD 2012]

Efficient In-Memory Storage Simply caching Hive records as Java objects is inefficient due to high per-object overhead Instead, Shark employs column-oriented storage using arrays of primitive types 1 Column Storage 23 johnmikesally Row Storage 1john4.1 2mike3.5 3sally6.4

Efficient In-Memory Storage Simply caching Hive records as Java objects is inefficient due to high per-object overhead Instead, Shark employs column-oriented storage using arrays of primitive types 1 Column Storage 23 johnmikesally Row Storage 1john4.1 2mike3.5 3sally6.4 Benefit: similarly compact size to serialized data, but >5x faster to access

Using Shark CREATE TABLE mydata_cached AS SELECT … Run standard HiveQL on it, including UDFs »A few esoteric features are not yet supported Can also call from Scala to mix with Spark Early alpha release at shark.cs.berkeley.edushark.cs.berkeley.edu

Benchmark Query 1 SELECT * FROM grep WHERE field LIKE ‘%XYZ%’;

Benchmark Query 2 SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL WHERE V.visitDate BETWEEN ‘ ’ AND ‘ ’ GROUP BY V.sourceIP ORDER BY earnings DESC LIMIT 1;

Demo

What’s Next? Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps) Another emerging use case that needs fast data sharing is stream processing »Track and update state in memory as events arrive »Large-scale reporting, click analysis, spam filtering, etc

Streaming Spark Extends Spark to perform streaming computations Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs Intermix seamlessly with batch and ad-hoc queries tweetStream.flatMap(_.toLower.split).map(word => (word, 1)).reduceByWindow(“5s”, _ + _) T=1 T=2 … mapreduceByWindow [Zaharia et al, HotCloud 2012]

Streaming Spark Extends Spark to perform streaming computations Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs Intermix seamlessly with batch and ad-hoc queries tweetStream.flatMap(_.toLower.split).map(word => (word, 1)).reduceByWindow(5, _ + _) T=1 T=2 … mapreduceByWindow [Zaharia et al, HotCloud 2012] Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

Streaming Spark Extends Spark to perform streaming computations Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs Intermix seamlessly with batch and ad-hoc queries tweetStream.flatMap(_.toLower.split).map(word => (word, 1)).reduceByWindow(5, _ + _) T=1 T=2 … mapreduceByWindow [Zaharia et al, HotCloud 2012] Alpha coming this summer

Conclusion Spark and Shark speed up your interactive and complex analytics on Hadoop data Download and docs: »Easy to run locally, on EC2, or on Mesos and soon YARN User meetup: meetup.com/spark-usersmeetup.com/spark-users Training camp at Berkeley in August!

Behavior with Not Enough RAM

Software Stack Local mode Spark Bagel (Pregel on Spark) Bagel (Pregel on Spark) Shark (Hive on Spark) … Streaming Spark EC2 Apache Mesos YARN