Intro to Map-Reduce (Feb 4, 2015)

map-reduce?
A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Feb 4, 2015 | CS512 | Spring 2015

Why?
Process and analyze huge volumes of data: Facebook posts and photos, Twitter streams, …
Facebook:
• 220 million new photos per week
• 25 TB of storage per week
This was in 2009!
Source: Needle in a haystack, Facebook

Easy…
Computational resources are cheap: Amazon EC2, Microsoft Azure, …

Not really!
Code running on one CPU is simple, two is a headache, four is a nightmare, … You get the picture.

Multithreaded Programming
[Image: theory vs. practice]

Pipe Dream
• Forget multiple machines.
• Write code imagining only one thread.
• Someone else takes care of running the code on thousands of machines.
Free the programmer from the unnecessary details.

Map-Reduce
The map-reduce programming model to the rescue.

Long, long ago…
Lisp, 1958: a programming language that introduced several innovative ideas.
Reference: Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. John McCarthy, MIT, April 1960.

Lisp
Introduced map and reduce.

Map
map(m_f, [a_1, a_2, …, a_n]) -> [b_1, b_2, …, b_n]
Accepts two arguments: a function and a list of values.
Generates output by applying the function to each value in the list.

Reduce
reduce(r_f, [b_1, b_2, …, b_n]) -> c
Accepts two arguments: a function and a list of values.
Generates output by reducing the list of input values to a single value using the function.

Simple composition
reduce(r_f, map(m_f, [a_1, a_2, …, a_n])) -> c
Map’s output is a list of values, which reduce can accept as one of its arguments.

Simple composition
reduce(sum, map(square, [1, 2, …, 10])) -> 1^2 + 2^2 + … + 10^2
Sum of squares
Ignore map/reduce; observe the functions.
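
The sum-of-squares composition can be sketched in Java. This is only an illustration: the class name SumOfSquares and the helper names are made up for this example, and the reduce helper takes an extra identity argument (the value returned for an empty list) that the two-argument form on the slides does not have.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SumOfSquares {
    // map(m_f, [a_1, ..., a_n]) -> [b_1, ..., b_n]: apply m_f to every element.
    static <A, B> List<B> map(Function<A, B> mf, List<A> values) {
        return values.stream().map(mf).collect(Collectors.toList());
    }

    // reduce(r_f, [b_1, ..., b_n]) -> c: fold the list into one value.
    // The identity parameter is an assumption added for this sketch.
    static <B> B reduce(BinaryOperator<B> rf, B identity, List<B> values) {
        B acc = identity;
        for (B v : values) {
            acc = rf.apply(acc, v);
        }
        return acc;
    }

    static int sumOfSquares(int n) {
        List<Integer> input = IntStream.rangeClosed(1, n).boxed().collect(Collectors.toList());
        List<Integer> squares = map(x -> x * x, input);   // m_f = square
        return reduce(Integer::sum, 0, squares);          // r_f = sum
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10)); // prints 385
    }
}
```

Note that map and reduce themselves know nothing about squares or sums; all problem-specific logic lives in the two functions passed in.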

Analogy
Map
• Break the large problem into small pieces
• Code m_f to solve one piece
• Run map to apply m_f to the small pieces and generate nuggets of solutions

Analogy
Reduce
• Code r_f to combine the nuggets
• Run reduce to apply r_f to the nuggets and output the complete solution

Example: Word Count
Input: 1 TB file

    wc = {}                  // maps each word to its count, e.g. {‘hello’: 0}
    for line in file {
        for word in line {
            wc[word]++
        }
    }
    print(wc)

You don’t want to write a multi-threaded version! At least, not the night before the deadline.

Example: Word Count
1 TB file split into 100,000 pieces (chunks)
With each piece:
• Split each line into words
• For each unique word, maintain a counter
• Output the word-count mappings
Merge the word-count mappings together

Word Count: Split and count
‘There are three types of lies – lies, damn lies, and statistics’
{‘There’ : 1, ‘are’ : 1, ‘lies’ : 3, …}

Word Count: Merge
{‘There’ : 1, ‘lies’ : 1, …} + {‘lies’ : 2, …} -> {‘There’ : 1, ‘lies’ : 3, …}
{‘are’ : 1} + {‘types’ : 1} -> {‘are’ : 1, ‘types’ : 1, …}
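
The split-count-merge idea can be sketched on a single machine as follows. The class and method names (ChunkedWordCount, countChunk, merge) are hypothetical, the input is two tiny in-memory chunks rather than pieces of a 1 TB file, and treating every run of non-letters as a word separator is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkedWordCount {
    // Split and count: build one word -> count table for a single chunk.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.split("[^A-Za-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Merge: add up the per-chunk tables into one table.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String chunk : chunks) {
            partials.add(countChunk(chunk));
        }
        return merge(partials);
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
                "There are three types of lies",
                "lies, damn lies, and statistics");
        System.out.println(wordCount(chunks).get("lies")); // prints 3
    }
}
```

Because countChunk looks only at its own chunk, the per-chunk calls are independent of one another; only merge needs to see all the partial tables.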

A slightly different map-reduce
Map: copies a function onto a number of machines and applies each copy to a different piece of the input.

A slightly different map-reduce
Reduce: combines the map outputs from the different machines into a final solution.

Map-reduce reintroduced…
• Google created the awareness
• Hadoop made it into a sensation
Hadoop is an open-source map-reduce implementation based on Google’s paper.
Reference: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI'04: Sixth Symposium on Operating System Design and Implementation. December,

Hadoop: Simplified

Example

Mapper
An instance of the map function.

Reducer
An instance of the reduce function.

Job
User’s implementation of the map and reduce functions.

Input Splits?
How is the input data split into equal parts and provided to the mappers?

Input Splits
A distributed filesystem, viz. HDFS, splits large files into smaller chunks. Chunks are distributed and replicated over many machines.

Trackers
JobTracker
• Manages cluster resources
• Schedules all user jobs
TaskTracker
• Runs map or reduce tasks on a machine

Trackers
There is one TaskTracker for each machine. The JobTracker monitors and controls the TaskTrackers.

Hadoop: Job

Parallel Mappers
• Mappers run in parallel.
• Simple and straightforward: each mapper operates on its own set of chunks, so the inputs to mappers are disjoint.

Data Locality
• How to schedule mappers to machines?
• As close to the data as possible: data-local, rack-local

Map Output
• Mappers write their output to local disk.
• Writing to the DFS is expensive, and if a map fails, its output (and its copies) is useless.

Fault Tolerance
• Mapper failures
• Easiest to handle! Re-run failed mappers.
• Assuming idempotence
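
Re-running a failed mapper can be sketched as a simple retry loop. This is a single-process illustration, not how Hadoop implements it: RetryingMapper, runWithRetry, and the maxAttempts parameter are hypothetical names, and a machine failure is simulated by a thrown exception. The retry is only safe because the map task is assumed to be idempotent, so running it again on the same chunk gives the same output.

```java
import java.util.function.Function;

class RetryingMapper {
    // Re-run an idempotent map task on the same chunk until it succeeds
    // or the attempt budget is used up.
    static <I, O> O runWithRetry(Function<I, O> mapTask, I chunk, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return mapTask.apply(chunk);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException(
                "map task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated mapper that crashes on its first two attempts.
        Function<String, Integer> flakyMapper = chunk -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated machine failure");
            }
            return chunk.split("\\s+").length;
        };
        System.out.println(runWithRetry(flakyMapper, "lies damn lies", 5)); // prints 3
    }
}
```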

Fault Tolerance
• Reducer failures
• Re-run failed reducers
• Complicated to re-run reducers!

Fault Tolerance
• Speculative execution
• Do we have to wait to see a failure?
• Check for stragglers
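
One way to picture speculative execution: start a backup copy of a task that looks like a straggler and take whichever copy finishes first. The sketch below uses the standard ExecutorService.invokeAny, which returns the result of one successfully completed task and cancels the rest. The class and method names (SpeculativeRun, firstToFinish) are hypothetical, and both copies are assumed to compute the same result.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeRun {
    // Run the original attempt and a backup attempt side by side;
    // return the result of whichever completes successfully first.
    static <T> T firstToFinish(Callable<T> primary, Callable<T> backup) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(List.of(primary, backup));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("both attempts failed", e);
        } finally {
            pool.shutdownNow(); // stop the attempt that lost the race
        }
    }

    public static void main(String[] args) {
        // The straggler sleeps to simulate a slow machine; both copies
        // produce the same value, so it does not matter which one wins.
        Callable<Integer> straggler = () -> {
            Thread.sleep(2000);
            return 42;
        };
        Callable<Integer> backup = () -> 42;
        System.out.println(firstToFinish(straggler, backup)); // prints 42
    }
}
```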

Hadoop: Overview

Input to reducers
• The number of reducers (n) is known a priori.
• The number of partitions equals the number of reducers.
• Hash on the keys of the mapper outputs: partition = hash(key) mod n
• Load balancing by randomization.
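
The rule partition = hash(key) mod n can be sketched as below. HashPartition and partitionFor are hypothetical names. The sign bit of the hash is masked off before taking the remainder, because Java's hashCode() can be negative and the % operator would then give a negative partition number.

```java
class HashPartition {
    // partition = hash(key) mod n, kept in the range [0, n).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int n = 4; // number of reducers, known before the job starts
        for (String word : new String[] {"There", "are", "lies", "statistics"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, n));
        }
    }
}
```

Every occurrence of a key hashes to the same partition, so all counts for one word end up at the same reducer no matter which mapper emitted them.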

Wait before reducing…
Reducers cannot start before the mappers complete.
Synchronization barrier between the map and reduce phases.
Is it necessary?
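
A small sketch of the barrier, with threads standing in for machines. ExecutorService.invokeAll returns only after every submitted task has completed, so here it plays the role of the barrier: the reduce step cannot begin until all map tasks are done. The names MapThenReduce and totalWords are hypothetical, and the job is simplified to counting the total number of words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MapThenReduce {
    static int totalWords(List<String> chunks, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Map phase: one task per chunk, each counting its own words.
            List<Callable<Integer>> mapTasks = new ArrayList<>();
            for (String chunk : chunks) {
                mapTasks.add(() -> chunk.trim().isEmpty()
                        ? 0
                        : chunk.trim().split("\\s+").length);
            }
            // Barrier: invokeAll blocks until every map task has finished.
            List<Future<Integer>> mapOutputs = pool.invokeAll(mapTasks);

            // Reduce phase: combine the map outputs.
            int total = 0;
            for (Future<Integer> output : mapOutputs) {
                total += output.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during the map phase", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(totalWords(List.of("a b c", "d e", ""), 2)); // prints 5
    }
}
```

One reason to wait: any mapper may emit a value for any key, so a reducer cannot be sure it has seen every value for its keys until the last mapper has finished.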

Embarrassing Parallelism

Not a panacea!
If your workload exhibits embarrassing parallelism, Hadoop might be the ideal framework. If not, look for other parallel programming paradigms.

Word Count using Hadoop

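Mapper

    // Imports belong at the top of the enclosing source file.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // MapReduceBase supplies the no-op configure() and close() that the Mapper interface requires.
    public static class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Split the given line (in value) into words and emit the tuple <word, 1> for each word
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }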

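Reducer

    // Imports belong at the top of the enclosing source file.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public static class MyReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Add up the counts emitted for this word and emit <word, total>
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }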

Next steps?
CPS 516: Data-Intensive Computing Systems, Shivnath Babu
There are lots of folks in our department working on cool projects involving distributed systems!