MapReduce and Hadoop

References:
Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113. (An earlier version appeared at OSDI 2004.)
Hadoop materials:
http://www.cs.colorado.edu/~kena/classes/5448/s11/presentations/hadoop.pdf
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Job+Configuration

Introduction to MapReduce

Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113. (An earlier version appeared at OSDI 2004.)
- Designed to process large amounts of data in Google applications: crawled documents, web request logs, the graph structure of web documents, the most frequent queries in a given day, etc.
- The computations are conceptually simple, but the input data is huge; the framework hides the complexity of partitioning the input and supports parallel execution.

MapReduce programming paradigm

- Input: a set of key/value pairs. Output: a set of key/value pairs.
- The computation consists of two user-defined functions: map and reduce.
- Map: takes an input pair and produces a set of intermediate key/value pairs.
- The MapReduce library groups all intermediate values associated with the same intermediate key and passes them to the reduce function.
- Reduce: takes an intermediate key and the set of values for that key, and merges the values into a (typically smaller) set of output values.

MapReduce example: wordcount

The MapReduce runtime library takes care of most of the details. Besides the map and reduce functions, users also supply a MapReduce specification object (names of the input/output files, etc.).

Types in the map and reduce functions

Map: (k1, v1) -> list(k2, v2)
Reduce: (k2, list(v2)) -> list(v2)
- The input keys (k1) and values (v1) are drawn from a different domain than the output keys and values.
- The intermediate keys (k2) and values (v2) are from the same domain as the output keys and values.
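In Hadoop these types appear directly as generic parameters on the Mapper and Reducer base classes. A signature-only sketch using the newer org.apache.hadoop.mapreduce API and the wordcount types used later in these slides (note that Hadoop is slightly more general than the model above: the reduce output types are allowed to differ from the intermediate types):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<k1, v1, k2, v2>: input key/value types, then intermediate types.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* body on a later slide */ }

// Reducer<k2, v2, k3, v3>: intermediate types, then output key/value types.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* body on a later slide */ }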

More examples

- Distributed grep. Map: emit a line if it matches the given pattern. Reduce: copy the intermediate data to the output. (A sketch of such a mapper follows this list.)
- Count of URL access frequency. Input: URL access logs. Map: emit <url, 1>. Reduce: add up the values for the same URL and emit <url, total count>.
- Reverse web-link graph. Map: output <target, source> for each link to a target URL found in a source page. Reduce: concatenate all sources for a target and emit <target, list(source)>.
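A minimal sketch of the distributed-grep mapper, assuming the newer org.apache.hadoop.mapreduce API. The pattern is hard-coded here for brevity (in practice it would come from the job configuration), and the reduce side can simply be an identity reducer:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical GrepMapper: emits every input line that matches the pattern.
// Keying by the line itself lets an identity reducer write matching lines out.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private static final String PATTERN = "error";  // hypothetical pattern

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.toString().contains(PATTERN)) {
      context.write(line, NullWritable.get());
    }
  }
}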

MapReduce execution overview

Fault tolerance

- The master keeps track of everything: the status of workers, the locations and sizes of intermediate files, etc.
- Worker failure: likely in a large cluster. The master detects a failed worker (one that stops responding to pings) and re-executes its tasks on another node.
- Master failure: unlikely (a single machine). Periodic checkpoints of the master state allow recovery.

Hadoop: an implementation of MapReduce

Two components:
- MapReduce: enables applications to work with thousands of nodes.
- Hadoop distributed file system (HDFS): allows reads and writes of data in parallel to and from multiple tasks.
Combined: a reliable shared storage and analysis system. Hadoop MapReduce programs are mostly written in Java (the Hadoop streaming API supports other languages).

HDFS

- A given file is broken down into blocks (default block size: 64MB); blocks are replicated across the cluster (default number of copies: 3).
- NameNode: maintains the mapping from file names to blocks, and from blocks to the DataNode slaves that store them.
- DataNode: stores and serves blocks of data.

HDFS data replication

- A given file is broken down into blocks (default block size: 64MB); blocks are replicated across the cluster (default number of copies: 3).
- The NameNode uses a transaction log to maintain persistent file system metadata.
- The NameNode uses heartbeats to make sure the DataNodes are alive, and automatically re-replicates and redistributes blocks as needed.
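Both defaults can be overridden per cluster. A minimal hdfs-site.xml sketch for a Hadoop 1.x deployment (property names follow the r1.2.1 documentation; the values shown simply make the defaults explicit):

<?xml version="1.0"?>
<configuration>
  <!-- Number of copies kept for each block. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes (64 MB). -->
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>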

HDFS robustness

Design goal: keep data reliable in the presence of failures.
- Data disk failure and network failure: handled with heartbeats and re-replication.
- Corrupted data: the typical method for ensuring data integrity is a checksum.
- Metadata disk failure: the NameNode is a single point of failure, but dealing with failure in a single node is relatively easy (e.g., use transaction logs).
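The checksum idea can be illustrated in a few lines of Java. This is a toy sketch using CRC32, not HDFS's actual implementation (HDFS checksums each block in small chunks and stores the checksums alongside the data):

import java.util.zip.CRC32;

public class ChecksumDemo {
  // Compute a CRC32 checksum over a block of data.
  static long checksum(byte[] block) {
    CRC32 crc = new CRC32();
    crc.update(block, 0, block.length);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] block = "some block contents".getBytes();
    long stored = checksum(block);               // computed when the block is written

    block[0] ^= 1;                               // simulate one corrupted bit on disk
    boolean intact = checksum(block) == stored;  // verified on every read
    System.out.println("block intact: " + intact);  // prints false
  }
}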

Hadoop architecture

Hadoop uses a master/slave architecture.

Hadoop MapReduce

- One JobTracker per master: responsible for scheduling tasks on the slaves, monitoring slave progress, and re-executing failed tasks.
- One TaskTracker per slave: executes tasks as directed by the master.

Hadoop MapReduce

Two steps: map and reduce.
- Map step: the master node takes the large problem input, slices it into smaller sub-problems, and distributes the sub-problems to worker nodes. A worker node may do the same (giving a multi-level tree structure). Each worker processes its smaller problem and returns the result.
- Reduce step: the master node takes the results of the map step and combines them to get the answer to the original problem.

Hadoop MapReduce data flow

- Input reader: divides the input into splits, each of which is assigned to a map function.
- Map function: maps file data to intermediate <key, value> pairs.
- Partition function: determines which reducer receives a pair, based on the key and the number of reducers (see the sketch below).
- Compare function: as reduce inputs are pulled from the map intermediate output, they are sorted according to this function.
- Reduce function: as defined by the user.
- Output writer: writes the file output.
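As an example, Hadoop's stock partition function simply hashes the key modulo the number of reducers; this sketch is equivalent to the built-in HashPartitioner:

import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer by hash; masking with Integer.MAX_VALUE
// clears the sign bit so the modulo result is never negative.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}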

An implementation of Google's MapReduce execution

A code example (wordcount)

Two input files:
File1: "hello world hello moon"
File2: "goodbye world goodnight moon"
Three operations: map, combine, reduce.

Output for each step
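A reconstruction of what each step produces for the two input files above, assuming one map task per file and a combiner that pre-aggregates counts within each map task:

Map output:
  File1: <hello, 1>, <world, 1>, <hello, 1>, <moon, 1>
  File2: <goodbye, 1>, <world, 1>, <goodnight, 1>, <moon, 1>

Combine output (per map task, partial sums):
  File1: <hello, 2>, <moon, 1>, <world, 1>
  File2: <goodbye, 1>, <goodnight, 1>, <moon, 1>, <world, 1>

Reduce output (after shuffling and sorting both map outputs by key):
  <goodbye, 1>, <goodnight, 1>, <hello, 2>, <moon, 2>, <world, 2>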

Hadoop wordcount: main and run methods
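A sketch of a typical wordcount driver, assuming the newer org.apache.hadoop.mapreduce API and the TokenizerMapper and IntSumReducer classes shown on the following slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "wordcount");
    job.setJarByClass(WordCount.class);        // so Hadoop can locate the job jar

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // reducer doubles as combiner
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }
}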

Job details

- The job object is the interface for a user to describe a MapReduce job to the Hadoop framework for execution.
- The user specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat, OutputFormat, etc.
- Jobs can be monitored.
- Users can chain multiple MapReduce jobs together to accomplish more complex tasks.

Map class

One line at a time: tokenize the line and emit <word, 1> pairs to the context.
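A sketch of such a map class (essentially the standard wordcount mapper, newer org.apache.hadoop.mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Tokenize one input line and emit <word, 1> for every token.
    StringTokenizer itr = new StringTokenizer(line.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}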

Mapper

- Implementing classes extend Mapper and override map().
- Main mapper engine: Mapper.run(), which calls setup(), then map() for each input record, then cleanup(); see the sketch below.
- Output data is emitted from the mapper via the Context object.
- The Hadoop MapReduce framework spawns one map task for each logical unit of input (one InputSplit).
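The engine loop itself is short; the default Mapper.run() in the newer API is essentially:

// Default run() loop of org.apache.hadoop.mapreduce.Mapper (paraphrased):
public void run(Context context) throws IOException, InterruptedException {
  setup(context);                         // once, before any input
  while (context.nextKeyValue()) {        // iterate over the task's input split
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);                       // once, after all input
}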

Reduce class

The framework groups all pairs with the same key and passes them to the reduce function (sorted using the compare function).
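A sketch of the corresponding reduce class (the standard wordcount reducer, newer API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the 1's (or combiner partial sums) for this word.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}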

Reducer

- The framework sorts and partitions the mapper outputs.
- Reduce engine: receives a Context containing the job's configuration; Reducer.run() calls setup(), then reduce() once per key associated with the reduce task, then cleanup().
- Reducer.reduce(): called once per key; passed an Iterable of the values associated with that key; writes output via Context.write().
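Mirroring the mapper engine, the default Reducer.run() in the newer API is essentially:

// Default run() loop of org.apache.hadoop.mapreduce.Reducer (paraphrased):
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKey()) {             // one iteration per distinct key
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}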

Task execution and environment

- The TaskTracker executes each Mapper/Reducer task as a child process in a separate JVM.
- A child task inherits the environment of its parent TaskTracker.
- The user can specify environment variables and settings controlling memory, parallel computation, segment size, etc.

Running a Hadoop job

- Specify the job configuration: input/output locations, map and reduce functions.
- The job client then submits the job (jar/executables) and its configuration to the JobTracker:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
A complete example can be found at https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Job+Configuration

Communication in MapReduce applications

- No communication within the map or reduce phase.
- Dense communication with large messages between the map and reduce stages.
- Needs a high-throughput network for server-to-server traffic; the network workload is different from HPC workloads or traditional Internet application workloads.

MapReduce limitations

- Good at dealing with unstructured data: the map and reduce functions are user defined and can handle data of any kind. For structured data, MapReduce may not be as efficient as a traditional database.
- Good for applications that generate smaller final data from large source data.
- Does not work well for applications that require fine-grained communication, which includes almost all HPC-style applications: there is no communication within map or reduce.

Summary

- MapReduce is a simple and elegant programming paradigm that fits certain types of applications very well. Its main strengths are fault tolerance on large-scale distributed platforms and handling large amounts of unstructured data.
- MapReduce applications create a very different traffic workload from both HPC and Internet applications.
- Hadoop MapReduce is an open-source framework that supports the MapReduce programming paradigm.