Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica (University of California, Berkeley)
Presented by Qi Gao, adapted from Matei's NSDI'12 presentation (EECS 582 – W16)

Overview
- A new programming model (RDDs)
  - Parallel/distributed computing
  - In-memory data sharing
  - Fault tolerance
- An implementation of RDDs: Spark

Motivation
- MapReduce greatly simplifies "big data" analysis on large, unreliable clusters
  - Simple interface: map and reduce
  - Hides the details of parallelism, data partitioning, fault tolerance, load balancing, ...
- Problems
  - Cannot support complex applications efficiently
  - Cannot support interactive applications efficiently
- Root cause: inefficient data sharing
In MapReduce, the only way to share data across jobs is stable storage -> slow!

Motivation
[Figure: data sharing in MapReduce goes through stable storage]

Goal: In-Memory Data Sharing

Challenges
Memory is much faster than network/disk, but how to achieve fault tolerance efficiently?
- Data replication?
- Log fine-grained updates to mutable state?
- Network bandwidth is a scarce resource
- Disk I/O is slow
- Costly for data-intensive apps

Observation
Coarse-grained operations: in many distributed computing applications, the same operation is applied to many data items in parallel

RDD Abstraction
- Restricted form of distributed shared memory
  - Immutable, partitioned collection of records
  - Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
- Efficient fault tolerance using lineage
  - Log coarse-grained operations instead of fine-grained data updates
  - An RDD has enough information about how it was derived from other datasets
  - Recompute lost partitions on failure
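
To make the abstraction above concrete, a minimal Scala sketch of an RDD built through coarse-grained deterministic transformations (a sketch, assuming a SparkContext named sc; the input path is a placeholder):

  // Each step applies one operation to every record; together the steps
  // form the RDD's lineage.
  val raw     = sc.textFile("hdfs://...")        // base RDD read from stable storage
  val numbers = raw.map(_.trim.toInt)            // derived via map
  val large   = numbers.filter(_ > 100)          // derived via filter
  // If a partition of `large` is lost, Spark can recompute just that partition
  // by replaying map and filter on the corresponding partition of the base data.
  println(large.toDebugString)                   // prints the lineage of transformations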

Fault-tolerance
[Figure: a lost RDD partition is recomputed by replaying the lineage of transformations that built it]

Spark
- Implements Resilient Distributed Datasets (RDDs)
- Operations on RDDs
  - Transformations: define a new dataset based on previous ones
  - Actions: start a job to execute on the cluster
- Well-designed interface to represent RDDs
  - Makes it very easy to implement transformations
  - Most Spark transformations are implemented in fewer than 20 lines of code
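
To make the transformation/action split above concrete, a minimal sketch in Spark's Scala API (a sketch, assuming a SparkContext named sc; the data is purely illustrative):

  // Transformations only describe new RDDs; no cluster work happens yet.
  val nums    = sc.parallelize(1 to 1000)
  val squares = nums.map(n => n * n)
  val evens   = squares.filter(_ % 2 == 0)
  // Actions launch jobs on the cluster and return results to the driver.
  val total   = evens.reduce(_ + _)   // sums the even squares
  val sample  = evens.take(5)         // fetches the first five elements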

Simple Yet Powerful
- WordCount implementation: Hadoop vs. Spark
- Pregel: iterative graph processing, 200 LoC using Spark
- HaLoop: iterative MapReduce, 200 LoC using Spark
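
As a hint of what the WordCount comparison looks like on the Spark side, a minimal sketch (assuming a SparkContext named sc; the input and output paths are placeholders); a typical Hadoop MapReduce version needs separate mapper, reducer, and driver classes:

  val counts = sc.textFile("hdfs://...")            // placeholder input path
    .flatMap(line => line.split(" "))               // split lines into words
    .map(word => (word, 1))                         // pair each word with a count of 1
    .reduceByKey(_ + _)                             // sum the counts per word
  counts.saveAsTextFile("hdfs://...")               // placeholder output path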

Spark Example: Log Mining
Load error messages from a log into memory, then run interactive queries on them.
[Figure: master (driver) and workers; workers cache data in memory]

  val lines = spark.textFile("hdfs://...")           // base RDD
  val errors = lines.filter(_.startsWith("ERROR"))   // transformation
  errors.persist()
  errors.filter(_.contains("Foo")).count()           // action!
  errors.filter(_.contains("Bar")).count()

Result: full-text search of 1 TB of data in 5-7 sec, vs. 170 sec with on-disk data!
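
A self-contained version of the same pipeline, as a sketch for readers who want to run it (assumes Spark on the classpath; the app name, master setting, and input path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  object LogMining {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("log-mining").setMaster("local[*]")
      val sc = new SparkContext(conf)
      val lines = sc.textFile("hdfs://...")                 // base RDD from the log file
      val errors = lines.filter(_.startsWith("ERROR"))      // transformation: error lines only
      errors.persist()                                      // keep the error lines in memory
      println(errors.filter(_.contains("Foo")).count())     // first interactive query
      println(errors.filter(_.contains("Bar")).count())     // second query reuses the cached RDD
      sc.stop()
    }
  }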

Evaluation
[Figure: running time per iteration on 100 GB of data on a cluster of machines]

Evaluation
[Figure: iterations on 54 GB of data with approximately 4M articles; speedups of 2.4x and 7.4x]

Evaluation
[Figure: running time for 10 iterations of k-means on 75 nodes; each iteration runs 400 tasks on 100 GB of data]

Conclusion
- RDDs offer a simple yet efficient programming model for a broad range of distributed applications
- RDDs provide outstanding performance and efficient fault tolerance

Thank you!

Backup

Evaluation
[Figure: performance of logistic regression on 100 GB of data using 25 machines]

Example: PageRank

Example: PageRank
A good example of the advantage of in-memory sharing over MapReduce's stable-storage sharing
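
To illustrate, a sketch of PageRank on RDDs in the style of the well-known Spark example (a sketch, assuming a SparkContext named sc; the input path, iteration count, and damping constants are placeholders): the links RDD is built once, cached in memory, and reused by every iteration instead of being re-read from stable storage.

  // Input lines of the form "sourcePage targetPage"
  val links = sc.textFile("hdfs://...")
    .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
    .groupByKey()
    .persist()                                  // reused in memory across all iterations
  var ranks = links.mapValues(_ => 1.0)

  for (_ <- 1 to 10) {                          // placeholder iteration count
    val contribs = links.join(ranks).values.flatMap { case (dests, rank) =>
      dests.map(dest => (dest, rank / dests.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  }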

Data Layout Optimization
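
In the RDD paper's PageRank example, this optimization controls partitioning so that the links and ranks RDDs are co-partitioned, letting each iteration's join run without shuffling the large links dataset over the network. A minimal sketch of that idea using Spark's HashPartitioner (a sketch, assuming a SparkContext named sc; the partition count and input path are placeholders):

  import org.apache.spark.HashPartitioner

  val partitioner = new HashPartitioner(100)    // placeholder partition count
  val links = sc.textFile("hdfs://...")
    .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
    .groupByKey(partitioner)                    // hash-partition links by page once, up front
    .persist()
  // mapValues preserves the partitioner, so ranks is co-partitioned with links
  var ranks = links.mapValues(_ => 1.0)
  // links.join(ranks) can now join each partition locally, with no shuffle of links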

Another interesting evaluation