MapReduce Based on: MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat. Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. "Data-Intensive Text Processing with MapReduce" - Jimmy Lin & Chris Dyer.

Contents Why do we need distributed computing for big data? What is MapReduce? Functional programming review. MapReduce concept. First example – word counting. Fault tolerance. Optimizations. More examples. Complexity. Real world example.

Why do we need distributed computing for big data? A single computer does not have enough: RAM. HD capacity and IOPS. Network bandwidth. CPU.

What is MapReduce MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. There are many other MapReduce frameworks built for different environments (Hadoop is the leading open source implementation). Why not another framework (like MPI)?

Functional programming review Functional operations do not modify data structures: they always create new ones. The original data still exists in unmodified form. No side effects (reading input from the user, networking, etc.). Data flows are explicit in the program design: there are no references or pointers whose data could be touched by unknown code at unknown times. Order of operations does not matter: fun foo (I : int list) = sum(I) + mul(I) + length(I) Functions can be passed as arguments.

Map map f [] = [] map f (a:as) = f(a) :: map f as Creates a new list by applying f to each element of the input list; returns output in order. Example: upper(x) : char -> char Input: lst = [a, b, c] Operation: map upper lst Output: [A, B, C] Google's video slides - Cluster Computing and MapReduce
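The same idea can be written as a short Python sketch (the names `map_list` and `upper` are mine, not from the original slides):

```python
def map_list(f, lst):
    # Build a new list by applying f to each element, in order.
    # The input list is never modified: no side effects.
    if not lst:
        return []
    return [f(lst[0])] + map_list(f, lst[1:])

def upper(c):
    # char -> char, as in the slide's upper(x) example.
    return c.upper()

print(map_list(upper, ["a", "b", "c"]))  # ['A', 'B', 'C']
```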

Fold fun foldl f z [] = z | foldl f z (x::xs) = foldl f (f(z,x)) xs; Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list. Note: the output can itself be a list. Example: we wish to write a sum function that receives an int list and returns its sum: fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst Google's video slides - Cluster Computing and MapReduce
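A Python sketch of the same fold (iterative rather than recursive, but the accumulator threading is the same; `foldl` and `sum_list` are hypothetical names):

```python
def foldl(f, z, lst):
    # Thread an accumulator z through the list, left to right;
    # f takes (accumulator, element) and returns the next accumulator.
    for x in lst:
        z = f(z, x)
    return z

def sum_list(lst):
    # The slide's sum example: fold addition over the list, starting at 0.
    return foldl(lambda a, x: a + x, 0, lst)

print(sum_list([1, 2, 3, 4]))  # 10
```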

"Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

Google’s map Map (in_key, in_value) -> (out_key, intermediate_value) list Example: Map(“play.txt”, “to be or not to be”) Will emit: (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

Google’s reduce Reduce (out_key, intermediate_value list) -> (key, out_value) list Example: reduce(“to”,[1,1,1]) Will emit: [(“to”,3)] "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
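Put together, the word-count map and reduce above can be sketched in Python (function names are mine; the real Google interface emits pairs through the runtime rather than returning lists):

```python
def wc_map(in_key, in_value):
    # (filename, contents) -> list of (word, 1) intermediate pairs.
    return [(word, 1) for word in in_value.split()]

def wc_reduce(out_key, values):
    # (word, [1, 1, ...]) -> [(word, total count)]
    return [(out_key, sum(values))]

print(wc_map("play.txt", "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(wc_reduce("to", [1, 1, 1]))  # [('to', 3)]
```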

Partition and combine functions Partition: a simple hash function, hash(key) mod R. The key used may differ, e.g. hash(hostname(url)) mod R. (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) -> hash(key) mod 2 -> (“to”,1), (“be”,1), (“to”,1), (“be”,1) and (“or”,1), (“not”,1) Combine: similar to the reduce function, applied on the local worker (more details will follow). (“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1) -> (“to”,2), (“be”,2), (“or”,1), (“not”,1)
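A sketch of both helpers in Python (hypothetical names; real frameworks let you plug in your own partitioner and combiner):

```python
from collections import defaultdict

def partition(pairs, R):
    # Route each (key, value) pair to one of R reduce regions by hashing the key.
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[hash(key) % R].append((key, value))
    return buckets

def combine(pairs):
    # Reduce-like local pre-aggregation on a single worker,
    # shrinking what must later cross the network.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print(combine(pairs))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```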

The MapReduce concept MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept Split the work into pieces. Start running code on workers. One of the workers is the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

The MapReduce concept Assign reducers Assign mappers MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept Mappers read input MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept A worker that finishes: writes the output of its map into R regions according to the partitioning function, and registers the results with the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

The MapReduce concept Reducers read the input, sort it and start reducing. Copying can start before all the mappers have finished, but the reduce itself can start only once all the mappers are done. Each reducer takes its own partition. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

The MapReduce concept Each reducer stores its output on GFS and informs the master. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

First example – word counting "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer

Example 2 – reverse a list of links mapper input: ( url, web page content) Mapper function: reduce function :

Example 2 – reverse a list of links mapper input: ( url, web page content) ( themarker.com, <HEAD> … href="ynet.com”..) ( calcalist.com, <HEAD> … href="ynet.com”..) mapper function: (url,web page content) -> (target,source) list [(ynet.com, themarker.com)] [(ynet.com, calcalist.com)] reduce function : (target,source) list -> (target, source list) (ynet.com,[themarker.com, calcalist.com])
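The link-reversal example can be sketched end to end in Python (the names and the simple `href` regex are mine; a dict stands in for the framework's group-by-key shuffle):

```python
import re
from collections import defaultdict

def reverse_map(url, content):
    # (url, page content) -> [(target, source), ...]
    return [(target, url) for target in re.findall(r'href="([^"]+)"', content)]

def reverse_reduce(target, sources):
    # (target, [source, ...]) -> (target, source list)
    return (target, sources)

pairs = (reverse_map("themarker.com", '<HEAD> ... href="ynet.com" ...') +
         reverse_map("calcalist.com", '<HEAD> ... href="ynet.com" ...'))
shuffled = defaultdict(list)  # the framework's group-by-key step
for target, source in pairs:
    shuffled[target].append(source)
print([reverse_reduce(t, s) for t, s in shuffled.items()])
# [('ynet.com', ['themarker.com', 'calcalist.com'])]
```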

Example 3 – distributed grep Given a word and a list of text files, returns the files and lines that the word appears in. Mapper input: (docId, docContent) Mapper function: Reduce function:

Example 3 – distributed grep mapper input: ( docId, docContent) mapper function: ( docId, docContent) -> (docId, line that match pattern) reduce function : Identity function
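A Python sketch of the grep pair (the pattern is passed as an extra argument here for simplicity; in the paper it would be baked into the map function):

```python
def grep_map(doc_id, doc_content, pattern):
    # Emit (docId, line) for every line of the document containing the pattern.
    return [(doc_id, line)
            for line in doc_content.splitlines()
            if pattern in line]

def grep_reduce(doc_id, lines):
    # Identity: the matching lines are already the answer.
    return (doc_id, lines)

print(grep_map("play.txt", "to be\nor not\nto be", "to"))
# [('play.txt', 'to be'), ('play.txt', 'to be')]
```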

Example 4 – BFS Given a source node N, returns the nodes in the graph; each node will include its distance from N. Mapper input: (nodeId, N) // N.distance – distance from the source node // N.AdjacencyList Mapper function: Reduce function:

Example 4 – BFS The graph is finite; we run iterations until no node is left with infinite distance. Each iteration requires an external driver program that checks this condition and launches the next round if needed; Hadoop provides such an interface. "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

Example 4 – BFS "Data-Intensive Text Processing with MapReduce", jimmy Lin & Chris dyer
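One BFS iteration can be sketched in Python. Assumptions of mine: a node is a (distance, adjacency list) pair, and a small driver plays the roles of the shuffle and of the external program that reruns iterations:

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node_id, node):
    # node = (distance from source, adjacency list)
    distance, adj = node
    out = [(node_id, node)]  # pass the graph structure through
    if distance != INF:
        for m in adj:        # candidate distance for each neighbour
            out.append((m, (distance + 1, [])))
    return out

def bfs_reduce(node_id, values):
    # Keep the smallest candidate distance and the real adjacency list.
    best = min(d for d, _ in values)
    adj = next((a for _, a in values if a), [])
    return (node_id, (best, adj))

graph = {"A": (0, ["B"]), "B": (INF, ["C"]), "C": (INF, [])}
shuffled = defaultdict(list)
for nid, node in graph.items():
    for key, value in bfs_map(nid, node):
        shuffled[key].append(value)
graph = dict(bfs_reduce(k, vs) for k, vs in shuffled.items())
print(graph)  # after one iteration B is at distance 1; C is still at inf
```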

Example 5 – Matrix Multiplication MN = P. Iteration 1: Mapper input: the entries m_ij of M and n_jk of N. Mapper function: m_ij => (j, (M, i, m_ij)); n_jk => (j, (N, k, n_jk)). Reduce function: (j, [(M, i, m_ij), …, (N, k, n_jk), …]) => [((i,k), m_ij * n_jk), …]

Example 5 – Matrix Multiplication Iteration 2: Mapper input: ((i,k), m_ij * n_jk) Mapper function: identity function Reduce function: ((i,k), [m_ij * n_jk for all j]) => ((i,k), sum over j of m_ij * n_jk)
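Both iterations can be sketched in Python over a tiny example (the function names and the dict representation of sparse matrices are my assumptions):

```python
from collections import defaultdict

def mm_iter1_map(name, row, col, value):
    # m_ij is keyed by its column j; n_jk by its row j, so entries
    # that must be multiplied meet at the same reducer.
    if name == "M":
        return (col, ("M", row, value))
    return (row, ("N", col, value))

def mm_iter1_reduce(j, values):
    # Join every M entry with every N entry sharing index j.
    ms = [(i, v) for name, i, v in values if name == "M"]
    ns = [(k, v) for name, k, v in values if name == "N"]
    return [((i, k), mv * nv) for i, mv in ms for k, nv in ns]

def mm_iter2_reduce(ik, products):
    # Sum the partial products m_ij * n_jk over j for each output cell.
    return (ik, sum(products))

M = {(0, 0): 1, (0, 1): 2}   # 1x2 matrix
N = {(0, 0): 3, (1, 0): 4}   # 2x1 matrix
shuffled = defaultdict(list)
for (i, j), v in M.items():
    key, val = mm_iter1_map("M", i, j, v)
    shuffled[key].append(val)
for (j, k), v in N.items():
    key, val = mm_iter1_map("N", j, k, v)
    shuffled[key].append(val)
partials = defaultdict(list)
for j, vals in shuffled.items():
    for ik, p in mm_iter1_reduce(j, vals):
        partials[ik].append(p)
P = dict(mm_iter2_reduce(ik, ps) for ik, ps in partials.items())
print(P)  # {(0, 0): 11}, i.e. 1*3 + 2*4
```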

Fault tolerance During the MapReduce task, the master pings all the workers. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

Fault tolerance 1) An in-progress map or reduce task is restarted on another machine. 2) A completed map task is restarted on another machine (its output was stored on the local disk of the failed worker). 3) A completed reduce task is not restarted, since its results are stored on GFS. 4) If a few mappers fail on the same input, the input is marked as non-valid. This is the power of functional programming: the same task can simply be handed to another worker, and that is fine. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

Fault tolerance The master is a single point of failure. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

Optimizations The master tries to allocate a mapper that is closest to the machine that stores the input file. Combine is used to reduce network bandwidth consumption; e.g., it is better to transmit (“pig”,3) than (“pig”,1), (“pig”,1), (“pig”,1). Some mappers may lag behind, so the master allocates backup workers near the end of the map phase.

Optimizations Some mappers may lag behind, so the master allocates backup workers near the end of the map phase. There are no side effects, so there is no problem running the same function in parallel on several machines. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

BW over time In the middle graph there is a long tail of stragglers; in the right graph, 200 out of 1,760 workers were killed. MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

Google mapReduce usage MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat

Complexity Theory for MapReduce We wish to: Shrink the wall-clock time. Execute each reducer in main memory. We will look at two parameters of the algorithm: Reducer size (q): an upper bound on the number of values that are allowed to appear in the list associated with a single key. Replication rate (r): the number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. The smaller the reducer size, the more likely the reducer fits in RAM and the shorter the processing; in addition, there is more parallelism.

Complexity - Example Similarity joins: given a large set of elements X and a similarity measure s(x, y) that tells how similar two elements x and y of the set X are. Example input: 1M images, 1MB each.

Complexity - Example Mapper input: (i, Pi) Mapper function: (i, Pi) -> [((i,j), Pi) for every other index j] Reduce function: ((i,j), [Pi, Pj]) -> ((i,j), s(Pi, Pj)) Reducer size (q): 2 Replication rate (r): for each input the mapper creates ~1M outputs, so r = 1M * 1M / 1M = 1M. Total network: 1M * 1M * 1MB = 1 exabyte. On a one-gigabit network it would take about 300 years.

Complexity - Example (fixed) Mapper input: (i, Pi). We partition the inputs into g groups (G). Mapper function: (i, Pi) -> {((u,v), Pi) | u, v ∈ G, i ∈ u} Reduce function: ((u,v), [Pi, Pj, …]) -> [((i,j), s(Pi, Pj)), …] Reducer size (q): 2M/g Replication rate (r): g. Total network: g * 1M * 1MB = g terabytes, i.e. about g times half a day on a one-gigabit network.
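A small Python check of the replication rate in the grouped scheme (the routing function `reducers_for` is my sketch: element i in group g is sent to every reducer whose key pair contains g):

```python
def reducers_for(g, n_groups):
    # All unordered group pairs (u, v), u <= v, that contain group g.
    # Each such pair is the key of one reducer.
    keys = []
    for u in range(n_groups):
        for v in range(u, n_groups):
            if g in (u, v):
                keys.append((u, v))
    return keys

# Each input is replicated to (n_groups - 1) cross pairs plus its own
# (g, g) pair, so the replication rate r equals n_groups (i.e. r = g
# in the slide's notation).
for g in range(4):
    assert len(reducers_for(g, 4)) == 4
print(reducers_for(2, 4))  # [(0, 2), (1, 2), (2, 2), (2, 3)]
```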

Real world example A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables. P(x_1, x_2, …, x_n) ψ_i is the prior on variable i.

Real world example Distributed Message Passing for Large Scale Graphical Models, Alexander Schwing

Real world example Iteration 1: The input entry for a single map task will be as follows:

Real world example Iteration 1:

Real world example Iteration 1: mapper output

Real world example

Real world example Iteration 2:

Real world example

Hands on

Conclusions The good: 1) Simple. 2) Proven. 3) Many implementations for different platforms and languages. The bad: 1) Performance improvements enabled by common database techniques are prevented. 2) MapReduce algorithms are not always easy to design. 3) Not all algorithms can be converted to work efficiently on MapReduce.

References MapReduce: Simplified Data Processing on Large Clusters - Jeffrey Dean and Sanjay Ghemawat. "Data-Intensive Text Processing with MapReduce" - Jimmy Lin & Chris Dyer. "Pro Hadoop" - Jason Venner. Series of Google's videos - Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html

Partial Implementations list
The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
The Hadoop project is a free open source Java MapReduce implementation.
Twister is an open source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
GridGain is a free open source Java MapReduce implementation.
Phoenix is a shared-memory implementation of MapReduce implemented in C.
FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
MapReduce has also been implemented for the Cell Broadband Engine, also in C.
Mars: MapReduce has been implemented on NVIDIA GPUs (graphics processors) using CUDA.
Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
CouchDB uses a MapReduce framework for defining views over distributed documents and is implemented in Erlang.
Skynet is an open source Ruby implementation of Google’s MapReduce framework.
Disco is an open source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
Misco is an open source MapReduce designed for mobile devices and is implemented in Python.
Qizmt is an open source MapReduce framework from MySpace written in C#.
The open-source Hive framework from Facebook, which provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.
The Holumbus Framework: distributed computing with MapReduce in Haskell (Holumbus-MapReduce).
BashReduce: MapReduce written as a Bash script, written by Erik Frey of Last.fm.
MapReduce for Go.
Meguro - a Javascript MapReduce framework.
MongoDB is a scalable, high-performance, open source, schema-free, document-oriented database written in C++ that features MapReduce.
Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
MapReduce on volunteer computing.
Secure MapReduce.
MapReduce with MPI implementation.
http://sites.google.com/site/cloudcomputingsystem/research/programming-model