Introduction to MapReduce ECE7610


The Age of Big Data
- The big-data age
  - Facebook collects 500 terabytes a day (2011)
  - Google collects 20,000 PB a day (2011)
- Data is an important asset to any organization
  - Finance companies, insurance companies, internet companies
- We need new
  - Algorithms, data structures, and programming models

What to do? (Word Count)
- Consider a large data collection and count the occurrences of the different words
- Data collection: {web, weed, green, sun, moon, land, part, web, green, …}
- Resulting counts: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
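As a baseline, the word count above can be sketched on a single machine in plain Java. This is a minimal illustration only; the class name `SequentialWordCount` is hypothetical and not part of the slides.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SequentialWordCount {
    // Count how often each word occurs, keeping first-seen order.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> data = List.of("web", "weed", "green", "sun", "moon",
                                    "land", "part", "web", "green");
        System.out.println(count(data));
        // prints {web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1}
    }
}
```

Everything happens in one thread and one map, which is what stops scaling once the collection no longer fits on one machine.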

What to do? (Word Count)
- Multi-threaded: several threads read the data collection and update one shared table of counts (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1)
- Lock on shared data
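A minimal sketch of the multi-threaded variant, assuming one thread per slice of the input and a single lock guarding the shared table. The class `LockedWordCount` and its methods are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LockedWordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    // Every update takes the same lock, so threads serialize on the shared map.
    private synchronized void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Start one thread per slice; all threads write into the same shared table.
    static Map<String, Integer> count(List<List<String>> slices) {
        LockedWordCount shared = new LockedWordCount();
        Thread[] workers = new Thread[slices.size()];
        for (int i = 0; i < workers.length; i++) {
            List<String> slice = slices.get(i);
            workers[i] = new Thread(() -> slice.forEach(shared::add));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // wait until every worker has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.counts;
    }
}
```

The result is correct, but the lock is a point of contention: the more threads there are, the more time they spend waiting for each other.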

What to do? (Word Count)
- A single machine cannot serve all the data: you need a special distributed (file) system
- Large number of commodity hardware disks: say, 1000 disks of 1 TB each
- Critical aspects: fault tolerance, replication, load balancing, monitoring
- Exploit the parallelism afforded by splitting parsing and counting
- Provision and locate computing at the data locations

What to do? (Word Count)
- Separate data: the data collection is divided into several data collections
- Separate counters: each part is counted on its own
- KEY: web, weed, green, sun, moon, land, part, web, green, …
- VALUE (counts): web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
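The idea of separate data and separate counters can be sketched as follows: each split is counted into its own private map, so no lock is needed, and the partial counts are merged at the end. The class `PartitionedWordCount` is a hypothetical name, and the use of a parallel stream to run the splits concurrently is an assumption made for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class PartitionedWordCount {
    // Each worker counts its own split into a private map: no shared state.
    static Map<String, Integer> countSplit(List<String> split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : split) {
            local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    // Merging the per-split counters is the only step that combines results.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> count(List<List<String>> splits) {
        List<Map<String, Integer>> partials = splits.parallelStream()
                .map(PartitionedWordCount::countSplit)
                .collect(Collectors.toList());
        return merge(partials);
    }
}
```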

It is not easy to parallelize…
- Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
- Different programming models: message passing, shared memory
- Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, cache coherence, …
- Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, …
- Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …
- Actually, a programmer’s nightmare…

MapReduce: Automated for You
- An important distributed parallel programming paradigm for large-scale applications.
- It has become one of the core technologies powering big IT companies such as Google, IBM, Yahoo and Facebook.
- The framework runs on a cluster of machines, automatically partitions each job into a number of small tasks, and processes them in parallel.
- Features: fairness, task data locality, fault tolerance.

MapReduce
- Split the data to supply multiple processors: Data Collection split 1, split 2, …, split n
- MAP: input data → (key, value) pair
- Each split is handled by a Map, which emits KEY/VALUE pairs such as: web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …

MapReduce
- Split the data to supply multiple processors: Data Collection split 1, split 2, …, split n
- MAP: input data → (key, value) pair
- REDUCE: (key, value) pair → result
- The output of the Maps is passed on to Reduce
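The two phases can be imitated in a single process to make the data flow concrete. The sketch below is not Hadoop code: `MiniMapReduce` and its methods are hypothetical names, and the grouping step between Map and Reduce (often called the shuffle) is written out explicitly here even though the framework normally does it for you.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // MAP: one input split -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Grouping step: collect all values that share a key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    // REDUCE: (key, list of values) -> result, here the sum of the ones.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String s : splits) {
            mapOutputs.add(map(s)); // in a real cluster these run in parallel
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapOutputs).forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }
}
```

Calling `MiniMapReduce.run(List.of("web weed green", "sun moon land part", "web green"))` yields a sorted map in which `web` and `green` each have count 2 and every other word has count 1.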

[Figure: large-scale data splits feed Map tasks (parse, hash), whose output goes to Reducers (say, Count). Footer: Wayne State.]
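The "hash" step in the figure decides which reducer receives a given key. A minimal sketch, assuming a plain mask-and-modulo rule on the key's hash code (the same idea Hadoop's default hash partitioner uses); `HashPartition` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class HashPartition {
    // Mask the sign bit so the index is never negative, then wrap to the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Distribute keys into one bucket per reducer.
    static List<List<String>> partition(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partitionFor(k, numReducers)).add(k);
        }
        return buckets;
    }
}
```

Because the rule depends only on the key, every occurrence of the same word lands at the same reducer, which is what allows that reducer to produce the complete count for it.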

MapReduce

How to store the data?
[Figure: compute nodes]
What’s the problem here?

Distributed File System
- Don’t move data to the workers… move the workers to the data!
  - Store data on the local disks of the nodes in the cluster
  - Start the workers on the nodes that hold the data locally
- Why?
  - Not enough RAM to hold all the data in memory
  - The network is the bottleneck; disk throughput is good
- A distributed file system is the answer
  - GFS (Google File System)
  - HDFS for Hadoop

GFS/HDFS Design
- Commodity hardware over “exotic” hardware
  - High component failure rates
- Files stored as chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Simplify the API
  - Push some of the issues onto the client
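The chunking and replication figures above translate into simple arithmetic. The sketch below assumes the 64 MB chunk size from the slide, read as 64 × 1024 × 1024 bytes; `ChunkMath` and its methods are hypothetical names.

```java
final class ChunkMath {
    static final long CHUNK_BYTES = 64L * 1024 * 1024; // 64 MB fixed chunk size

    // Number of chunks needed for a file: ceiling of fileBytes / CHUNK_BYTES.
    static long chunkCount(long fileBytes) {
        return (fileBytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    // Total chunk copies stored in the cluster for a given replication factor.
    static long replicaCount(long fileBytes, int replication) {
        return chunkCount(fileBytes) * replication;
    }
}
```

For example, a file of 2^40 bytes occupies 16,384 chunks, and with three replicas the cluster stores 49,152 chunk copies.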

GFS/HDFS

MapReduce Data Locality
- Master scheduling policy
  - Asks HDFS for the locations of the replicas of the input file blocks
  - Map input is typically split into 64 MB pieces (== GFS block size)
  - Locality levels: node locality / rack locality / off-rack
  - Map tasks are scheduled as close to their input data as possible
- Effect
  - Thousands of machines read input at local disk speed. Without this, rack switches limit the read rate and network bandwidth becomes the bottleneck.
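The three locality levels can be sketched as a small classification plus a "pick the closest free worker" rule. This illustrates the policy described above, not the actual Hadoop scheduler; `LocalityScheduler`, `Node`, and the rack names used in examples are hypothetical.

```java
import java.util.List;

final class LocalityScheduler {
    // Declared from best to worst, so a smaller ordinal means closer to the data.
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    // A machine is identified here by its name and the rack it sits in.
    static final class Node {
        final String name;
        final String rack;

        Node(String name, String rack) {
            this.name = name;
            this.rack = rack;
        }
    }

    // How close is this worker to any replica of the input block?
    static Locality classify(Node worker, List<Node> replicas) {
        boolean sameRack = false;
        for (Node r : replicas) {
            if (r.name.equals(worker.name)) {
                return Locality.NODE_LOCAL;
            }
            if (r.rack.equals(worker.rack)) {
                sameRack = true;
            }
        }
        return sameRack ? Locality.RACK_LOCAL : Locality.OFF_RACK;
    }

    // Among the free workers, choose the one with the best locality level.
    static Node pickWorker(List<Node> freeWorkers, List<Node> replicas) {
        Node best = null;
        Locality bestLevel = null;
        for (Node w : freeWorkers) {
            Locality level = classify(w, replicas);
            if (bestLevel == null || level.compareTo(bestLevel) < 0) {
                best = w;
                bestLevel = level;
            }
        }
        return best; // null when no worker is free
    }
}
```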

MapReduce Fault Tolerance
- Reactive way
  - Worker failure
    - Heartbeat: workers are periodically pinged by the master; no response = failed worker
    - If the processor of a worker fails, the tasks of that worker are reassigned to another worker
  - Master failure
    - The master writes periodic checkpoints
    - Another master can be started from the last checkpointed state
    - If the master eventually dies, the job is aborted
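The heartbeat rule for worker failure can be sketched as a table of last-seen timestamps kept by the master. Timestamps are passed in explicitly so the behavior is easy to follow; `HeartbeatMonitor`, its methods, and the timeout value in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, List<String>> tasksByWorker = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a task runs on a worker; the worker counts as seen at assignment time.
    void assign(String worker, String task, long now) {
        tasksByWorker.computeIfAbsent(worker, w -> new ArrayList<>()).add(task);
        lastSeen.putIfAbsent(worker, now);
    }

    // Called whenever a worker answers the master's ping.
    void heartbeat(String worker, long now) {
        lastSeen.put(worker, now);
    }

    // Workers silent for longer than the timeout are dropped; their tasks
    // are returned so the master can reassign them to other workers.
    List<String> reapFailed(long now) {
        List<String> orphaned = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > timeoutMillis) {
                List<String> tasks = tasksByWorker.remove(e.getKey());
                if (tasks != null) {
                    orphaned.addAll(tasks);
                }
                it.remove();
            }
        }
        return orphaned;
    }
}
```

With a timeout of 1000 ms, a worker that last answered at t = 0 is declared failed when the master checks at t = 1500, and its tasks come back for rescheduling.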

MapReduce Fault Tolerance
- Proactive way (speculative execution)
  - The problem of “stragglers” (slow workers)
    - Other jobs consuming resources on the machine
    - Bad disks with soft errors transfer data very slowly
    - Weird things: processor caches disabled (!!)
  - When the computation is almost done, reschedule the in-progress tasks as backup executions
  - Whenever either the primary or the backup execution finishes, mark the task as completed
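The "whichever finishes first wins" rule can be imitated on one machine with two threads. The sketch below uses the JDK's `ExecutorService.invokeAny` as a stand-in for the master's bookkeeping; `SpeculativeRun` and `firstToFinish` are hypothetical names, and a real framework would launch the backup on a different machine.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SpeculativeRun {
    // Runs the primary and the backup attempt side by side and returns
    // the result of whichever completes successfully first.
    static <T> T firstToFinish(Callable<T> primary, Callable<T> backup) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(List.of(primary, backup));
        } catch (Exception e) {
            throw new IllegalStateException("both attempts failed", e);
        } finally {
            pool.shutdownNow(); // interrupt the attempt that lost the race
        }
    }
}
```

If the primary attempt is a straggler (for example, it sleeps for two seconds) and the backup returns immediately, the backup's result is used and the straggler is cancelled.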

MapReduce Scheduling
- Fair Sharing
  - Conducts fair scheduling, using a greedy method to maintain data locality
- Delay
  - Uses the delay scheduling algorithm to achieve good data locality by slightly relaxing the fairness restriction
- LATE (Longest Approximate Time to End)
  - Improves the performance of MapReduce applications in heterogeneous environments, such as virtualized environments, through accurate speculative execution
- Capacity
  - Introduced by Yahoo; supports multiple queues for shared users and guarantees each queue a fraction of the cluster’s capacity
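As the name suggests, LATE speculates on the task expected to finish last. A simplified sketch, assuming the estimate "time left = (1 − progress) / progress rate" with progress rate = progress / elapsed time; the thresholds and caps of the full algorithm are left out, and `LateEstimator` and `Task` are hypothetical names.

```java
import java.util.List;

final class LateEstimator {
    static final class Task {
        final String id;
        final double progress;       // fraction of work done, between 0 and 1
        final double elapsedSeconds; // time the task has been running

        Task(String id, double progress, double elapsedSeconds) {
            this.id = id;
            this.progress = progress;
            this.elapsedSeconds = elapsedSeconds;
        }
    }

    // Estimated seconds until the task ends, extrapolating its observed rate.
    static double timeLeft(Task t) {
        if (t.progress <= 0.0) {
            return Double.POSITIVE_INFINITY; // no progress yet: treat as slowest
        }
        double rate = t.progress / t.elapsedSeconds;
        return (1.0 - t.progress) / rate;
    }

    // The running task with the longest estimated time to end is the candidate.
    static Task pickForSpeculation(List<Task> running) {
        Task slowest = null;
        for (Task t : running) {
            if (slowest == null || timeLeft(t) > timeLeft(slowest)) {
                slowest = t;
            }
        }
        return slowest;
    }
}
```

A task that is 90% done after 90 seconds has about 10 seconds left, while one that is 50% done after 100 seconds has about 100 seconds left, so the second one is chosen for a backup copy.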

MapReduce Cloud Service
- Providing MapReduce frameworks as a service in clouds has become an attractive usage model for enterprises.
- A MapReduce cloud service allows users to cost-effectively access a large amount of computing resources without creating their own cluster.
- Users are able to adjust the scale of their MapReduce clusters in response to changes in the resource demand of their applications.

Amazon Elastic MR
[Figure: You and Your Hadoop Cluster on EC2]
0. Allocate Hadoop cluster
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

New Challenges
- Interference between co-hosted VMs
  - Slows down job times
- Locality-preserving policy no longer effective
  - Loses more than 20% locality (depends)
- Need a specifically designed scheduler for virtual MapReduce clusters
  - Interference-aware
  - Locality-aware

MapReduce Programming
- Hadoop implementation of MR in Java (version 1.0.4)
- WordCount example: hadoop /src/examples/org/apache/hadoop/examples/WordCount.java

MapReduce Programming

Map
- Implement your own map class extending the Mapper class

Reduce
- Implement your own reducer class extending the Reducer class

Main()

Demo