Data Engineering: How MapReduce Works

Data Engineering: How MapReduce Works (Shivnath Babu)

Lifecycle of a MapReduce Job
[Figure: a program with its Map function and Reduce function highlighted; the program is submitted to run as a MapReduce job.]
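
The Map and Reduce functions on this slide can be thought of as plain functions that the framework drives. The sketch below is a minimal conceptual illustration in plain Scala (not the Hadoop API), using a hypothetical word-count input; it simulates the contract locally, including the shuffle that groups map output by key.

// The "Map function": called once per input record; emits (word, 1) pairs.
def mapFn(offset: Long, line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

// The "Reduce function": called once per key with all values for that key.
def reduceFn(word: String, counts: Iterable[Int]): (String, Int) =
  (word, counts.sum)

// Local simulation of the framework: map, shuffle (group by key), reduce.
val lines   = Seq("to be or not to be", "to do")
val mapped  = lines.zipWithIndex.flatMap { case (l, i) => mapFn(i.toLong, l) }
val grouped = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }
val reduced = grouped.map { case (w, cs) => reduceFn(w, cs) }
reduced.foreach(println)   // e.g. (to,3), (be,2), (or,1), (not,1), (do,1)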

Lifecycle of a MapReduce Job
[Figure: a timeline in which the input splits are consumed by Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2.]

Job Submission

Initialization

Scheduling

Execution

Map Task

Reduce Tasks

Hadoop Distributed File-System (HDFS)
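
As a hedged illustration of what sits underneath this slide, the sketch below is Scala calling the standard Hadoop FileSystem API to read a file from HDFS; the namenode address and path are placeholders.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val conf = new Configuration()
val fs   = FileSystem.get(new URI("hdfs://namenode:8020"), conf)

// The client asks the NameNode for block locations, then streams the
// blocks directly from the DataNodes that hold them.
val in = fs.open(new Path("/wiki/pagecounts"))
try IOUtils.copyBytes(in, System.out, conf)   // copy file contents to stdout
finally in.close()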

Lifecycle of a MapReduce Job
[Figure: the same timeline of input splits, map waves, and reduce waves.]
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc. determined?

[Figure: the job drawn as Map Waves 1-3 feeding a Shuffle and then the Reduce tasks.]
What if the number of reduces is increased to 9?
[Figure: the same map waves and reduce tasks drawn again for comparison.]
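
A hedged configuration sketch for the question above (Scala over the standard Hadoop MapReduce API; the path and sizes are illustrative): the number of reduce tasks is set explicitly on the job, while the number of map tasks follows from the input splits, whose size can only be influenced through the input format.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance(new Configuration(), "example-job")

// Reduce side: ask for 9 reduce tasks (the "reduce waves" are these tasks
// scheduled onto the available reduce slots/containers).
job.setNumReduceTasks(9)

// Map side: one map task per input split; nudge the split size via the
// input format rather than setting a task count directly.
FileInputFormat.addInputPath(job, new Path("/input"))
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024)   // ~128 MB splits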

Spark Programming Model
Driver program (written by the user/developer):
sc = new SparkContext(...)
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)
[Figure: the Driver Program holds the SparkContext and talks to the Cluster Manager; Worker Nodes run Executors, each with a Cache and Tasks, reading from and writing to HDFS DataNodes.]
Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12
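
A self-contained version of the driver sketch above (Scala RDD API; the HDFS path and the ERROR filter are illustrative assumptions, not from the slide):

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-sketch"))

    val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
    val errors = lines.filter(_.contains("ERROR"))
    errors.cache()                              // keep this RDD in executor memory
    println(errors.count())                     // action: triggers execution on the cluster
    val lengths = errors.map(_.length)          // further (lazy) transformation
    println(lengths.take(5).mkString(", "))

    sc.stop()
  }
}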

Spark Programming Model
Driver program (same sketch as before): sc = new SparkContext(...); rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)
RDD (Resilient Distributed Dataset):
- Immutable data structure
- In-memory (explicitly)
- Fault tolerant
- Parallel data structure
- Controlled partitioning to optimize data placement
- Can be manipulated using a rich set of operators
Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets the programmer perform in-memory parallel computations on large clusters, in a highly fault-tolerant manner. This is the central concept around which the whole Spark framework revolves. There are currently two ways to create RDDs:
- Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into; ideally 2-3 slices per CPU.
- Hadoop datasets: created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.), using SparkContext's textFile method. The default is one slice per file block.
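
A short sketch of the two RDD-creation paths described above (assumes an existing SparkContext named sc; the path is a placeholder):

// 1. Parallelized collection: slice an in-driver Scala collection.
val nums = sc.parallelize(1 to 1000000, numSlices = 8)   // roughly 2-3 slices per CPU

// 2. Hadoop dataset: by default, one partition per HDFS file block.
val logs = sc.textFile("hdfs://namenode:8020/logs/access.log")

println(nums.getNumPartitions)   // 8
println(logs.getNumPartitions)   // depends on the number of blocks in the file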

RDD Programming Interface: the programmer can perform three types of operations.
- Transformations: create a new dataset from an existing one. Lazy in nature; they are executed only when some action is performed. Examples: map(func), filter(func), distinct().
- Actions: return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take().
- Persistence: cache datasets in memory for future operations, with the option to store on disk, in RAM, or mixed (storage level). Examples: persist(), cache().
Transformations such as map take an RDD as input, pass each element through a function, and return a new, transformed RDD as output. By default, each transformed RDD is recomputed every time you run an action on it, unless you ask for the RDD to be cached in memory, in which case Spark will try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for an RDD; cache is just shorthand for the default storage level, MEMORY_ONLY.
Storage levels:
- MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
- MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk and read them from there when they are needed.
- MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they are needed.
- DISK_ONLY: store the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
Which storage level is best? Try to keep as much in memory as possible; try not to spill to disk unless your datasets are expensive to recompute; use the replicated levels only if you want fault tolerance.
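
The three operation types in one small sketch (assumes an existing SparkContext named sc; the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs://namenode:8020/data/words.txt")
              .flatMap(_.split("\\s+"))

// Transformations: lazy, they only extend the lineage.
val distinctWords = words.filter(_.nonEmpty).distinct()

// Persistence: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
distinctWords.persist(StorageLevel.MEMORY_AND_DISK)

// Actions: trigger execution and return values to the driver.
println(distinctWords.count())
distinctWords.take(10).foreach(println)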

How Spark Works
An RDD is a parallel collection with partitions. A user application creates RDDs, transforms them, and runs actions on them. This results in a DAG (Directed Acyclic Graph) of operators. The DAG is compiled into stages, and each stage is executed as a collection of tasks (one task per partition).
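
One way to see this DAG and its stage boundaries is an RDD's toDebugString, which prints the lineage. The sketch below (assumes an existing SparkContext named sc) uses the pagecounts example built up on the following slides, where reduceByKey introduces the shuffle that cuts the job into two stages.

val counts = sc.textFile("/wiki/pagecounts")
  .map(_.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _)          // shuffle => stage boundary

// Indentation changes in the output mark the shuffle boundaries
// at which the DAG is cut into stages.
println(counts.toDebugString)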

Summary of Components
- Task: the fundamental unit of execution in Spark
- Stage: a collection of tasks that run in parallel
- DAG: the logical graph of RDD operations
- RDD: a parallel dataset of objects with partitions

Example
sc.textFile("/wiki/pagecounts")
[Diagram: textFile yields an RDD[String].]

Example
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
[Diagram: textFile yields an RDD[String]; map yields an RDD[List[String]].]

Example
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R[0], int(R[1])))
[Diagram: textFile yields an RDD[String]; the two maps yield an RDD[List[String]] and then an RDD[(String, Int)].]

Example
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R[0], int(R[1])))
  .reduceByKey(_+_)
[Diagram: textFile, map, map as before; reduceByKey yields another RDD[(String, Int)].]

Example
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R[0], int(R[1])))
  .reduceByKey(_+_, 3)
  .collect()
[Diagram: collect returns an Array[(String, Int)] to the driver.]
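
The indexing and int() syntax on these slides is pseudocode; a hedged, runnable Scala equivalent of the full example (keeping the path and the 3 reduce partitions from the slides) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

object PageCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagecounts"))

    val result: Array[(String, Int)] = sc.textFile("/wiki/pagecounts")
      .map(line => line.split("\t"))
      .map(r => (r(0), r(1).toInt))           // (page name, count)
      .reduceByKey(_ + _, numPartitions = 3)
      .collect()                              // Array[(String, Int)] at the driver

    result.take(10).foreach(println)
    sc.stop()
  }
}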

Execution Plan
[Figure: the operator graph (textFile, map, map, reduceByKey, collect) grouped into Stage 1 and Stage 2.]
Stages are sequences of RDDs that don't have a shuffle in between.

Execution Plan
Stage 1: read the HDFS split, apply both maps, start the partial reduce, write shuffle data.
Stage 2: read shuffle data, perform the final reduce, send the result to the driver program.

Stage Execution
- Create a task for each partition in the input RDD
- Serialize the task
- Schedule and ship tasks to slaves
All of this happens internally (you don't have to do anything).

Spark Executor (Slave)
[Figure: each executor core (Core 1, Core 2, Core 3) repeatedly runs the loop: fetch input, execute task, write output.]