Project Presentation Schedule

Student            Date           Time
Wei Li             Nov 30, 2015   Monday 9:00-9:25am
Shubbhi Taneja     Nov 30, 2015   Monday 9:25-9:50am
Rodrigo Sanandan   Dec 2, 2015    Wednesday 9:00-9:25am
Yi Zhou            Dec 2, 2015    Wednesday 9:25-9:50am
Moayad Almohaishi  Dec 4, 2015    Friday 9:00-9:25am

COMP7330/7336 Advanced Parallel and Distributed Computing
MapReduce – Job Processing
Dr. Xiao Qin, Auburn University

Review: Map-Reduce Framework

Grep
– Input consists of (url+offset, single line) pairs
– map(key=url+offset, val=line): if the line matches the regexp, emit (line, "1")
– reduce(key=line, values=uniq_counts): do nothing with the values; just emit the line
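The grep job above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `grep_map`, `grep_reduce`, and `run_grep`, the in-memory grouping step, and the sample records are hypothetical and are not part of Google's library or of Hadoop.

```python
import re
from collections import defaultdict


def grep_map(key, line, pattern):
    """map(key=url+offset, val=line): emit (line, "1") if the line matches."""
    if re.search(pattern, line):
        yield (line, "1")


def grep_reduce(line, values):
    """reduce(key=line, values=...): ignore the values and emit the line."""
    yield line


def run_grep(records, pattern):
    # Group map output by key, standing in for the framework's shuffle.
    grouped = defaultdict(list)
    for key, line in records:
        for out_key, out_val in grep_map(key, line, pattern):
            grouped[out_key].append(out_val)
    # Reduce each unique key once; sorted() only makes the order deterministic.
    return [out for line in sorted(grouped)
            for out in grep_reduce(line, grouped[line])]


# Hypothetical input: (url+offset, single line) pairs.
records = [
    ("http://example.com/a+0", "hadoop runs map tasks"),
    ("http://example.com/a+22", "nothing to see here"),
    ("http://example.com/b+0", "hadoop runs map tasks"),
    ("http://example.com/b+22", "reduce tasks follow map tasks"),
]
matches = run_grep(records, r"map")
```

Because the matching line is the map output key, identical lines from different URLs collapse into a single reduce call, so each distinct matching line is emitted once.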

MapReduce at Google
A C++ library linked into user programs
Status of implementation (OSDI '04)
– 100s to 1000s of 2-CPU x86 machines with 2-4 GB of memory
– Limited bisection bandwidth
– Storage is on local IDE disks
– GFS: a distributed file system manages the data (SOSP '03)
– Job scheduling system: jobs are made up of tasks, and the scheduler assigns tasks to machines

Execution Overview*
How is this distributed?
– Partition the input key/value pairs into chunks and run map() tasks in parallel
– After all map()s are complete, consolidate all emitted values for each unique emitted key
– Now partition the space of map output keys and run reduce() in parallel
If map() or reduce() fails, re-execute!
* Adapted from Google slides
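A single-process sketch of these steps may help make the flow concrete. It assumes a thread pool in place of a cluster, and the names `run_with_retry` and `map_reduce`, the chunk and partition counts, and the word-count job are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, arg, attempts=3):
    """Re-execute a failed task, giving up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise


def map_reduce(pairs, map_fn, reduce_fn, num_chunks=3, num_partitions=2):
    # 1. Partition the input key/value pairs into chunks.
    chunks = [pairs[i::num_chunks] for i in range(num_chunks)]

    def map_task(chunk):
        return [out for k, v in chunk for out in map_fn(k, v)]

    with ThreadPoolExecutor() as pool:
        # 2. Run the map tasks in parallel (threads stand in for machines).
        map_outputs = list(
            pool.map(lambda c: run_with_retry(map_task, c), chunks))

        # 3. After all maps finish, consolidate values per unique emitted key.
        grouped = defaultdict(list)
        for output in map_outputs:
            for k, v in output:
                grouped[k].append(v)

        # 4. Partition the map output keys and run reduce tasks in parallel.
        keys = sorted(grouped)
        partitions = [keys[i::num_partitions] for i in range(num_partitions)]

        def reduce_task(part):
            return [(k, reduce_fn(k, grouped[k])) for k in part]

        reduce_outputs = list(
            pool.map(lambda p: run_with_retry(reduce_task, p), partitions))

    return dict(kv for part in reduce_outputs for kv in part)


# Hypothetical job: word count over three (offset, line) records.
def word_map(_, line):
    return [(word, 1) for word in line.split()]


def word_reduce(_, counts):
    return sum(counts)


counts = map_reduce([(0, "a b a"), (1, "b c"), (2, "a")], word_map, word_reduce)
```

Re-execution is safe here because map and reduce tasks are pure functions of their input, so running one twice gives the same result.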

Job Processing
Nodes: one JobTracker and six TaskTrackers (TaskTracker 0 through TaskTracker 5)
1. The client submits the "grep" job, indicating the code and the input files.
2. The JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the TaskTrackers.
3. After map(), the TaskTrackers exchange map output to build the reduce() keyspace.
4. The JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work.
5. reduce() output may go to NDFS.
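Steps 2 and 4 can be sketched as follows. The helper names (`split_input`, `partition_for`, `assign`), the round-robin assignment, and the CRC32-based hash are assumptions made for illustration; a real JobTracker also weighs factors such as data locality and free task slots, which this sketch ignores.

```python
import zlib


def split_input(lines, k):
    """Break the input into k contiguous chunks of near-equal size."""
    size, extra = divmod(len(lines), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def partition_for(key, m):
    """Pick one of m reduce partitions using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % m


def assign(tasks, trackers):
    """Round-robin assignment of task ids to tracker names (an assumption)."""
    assignment = {tracker: [] for tracker in trackers}
    for i, task in enumerate(tasks):
        assignment[trackers[i % len(trackers)]].append(task)
    return assignment


trackers = [f"TaskTracker {i}" for i in range(6)]

# Step 2: k = 6 input chunks, one map task per chunk.
chunks = split_input([f"line {i}" for i in range(20)], k=6)
map_plan = assign(range(len(chunks)), trackers)

# Step 4: m = 6 partitions of the reduce() keyspace, one reduce task each.
reduce_plan = assign(range(6), trackers)
```

A stable hash matters in step 3: every TaskTracker must send a given key to the same reduce partition, which Python's built-in `hash()` on strings does not guarantee across processes.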

Execution

Parallel Execution

Task Granularity and Pipelining
Fine-granularity tasks: map tasks >> machines
– Minimizes time for fault recovery
– Can pipeline shuffling with map execution
– Better dynamic load balancing
Often uses 200,000 map tasks and 5,000 reduce tasks running on 2,000 machines
Why do map tasks 1 and 3 have different execution times?
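With the figures above, each machine handles on average 200,000 / 2,000 = 100 map tasks. The load-balancing benefit can be illustrated with a small simulation in which every idle machine pulls the next task from a shared queue. The function name `makespan` and the task durations are made up for this sketch and are not measurements.

```python
import heapq


def makespan(task_times, machines):
    """Finish time when each idle machine pulls the next task from a queue."""
    free_at = [0.0] * machines          # time at which each machine is idle
    heapq.heapify(free_at)
    for duration in task_times:
        start = heapq.heappop(free_at)  # earliest-idle machine takes the task
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# 24 units of work on 4 machines in both cases (hypothetical durations).
coarse = [9, 5, 5, 5]          # one large task per machine, one of them slow
fine = [3, 3, 3] + [1] * 15    # the same total work cut into 18 smaller tasks

coarse_finish = makespan(coarse, machines=4)  # limited by the 9-unit task
fine_finish = makespan(fine, machines=4)      # work spreads evenly
```

With coarse tasks the job finishes only when the slowest task does, while with fine tasks the machines that finish early pick up the remaining work.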

Big Data: Solution
"Googled" MapReduce!
– Divide and conquer.
– Google File System (GFS) to store data.
Apache Hadoop
– Framework for running applications on large clusters of commodity hardware.
– Storage: HDFS.
– Processing: MapReduce.

Hadoop in Data Centers
Hadoop is
– Economical
– Easy to use
– Portable
– Reliable
The infrastructure it needs is found in data centers.
Facebook's Hadoop cluster has 30 PB of storage.
Yahoo!, Amazon, and Google all have Hadoop data centers.

Hadoop Architecture
– Distributed storage (HDFS)
– Distributed processing (MapReduce)

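Master-Slave-Client Architecture

Layer      Master                          Slave                         Client
HDFS       NameNode (metadata management)  DataNode (stores data)        Client (file I/O operations)
MapReduce  JobTracker (task scheduling)    TaskTracker (executes tasks)  Job Client (job submission)

The NameNode manages the DataNodes; the JobTracker assigns work to the TaskTrackers.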

HDFS
– Data is organized into files and directories.
– Files are divided into uniform-sized blocks and distributed across cluster nodes.
– Blocks are replicated to handle hardware failure.
– The filesystem keeps checksums of data for corruption detection and recovery.
– HDFS exposes block placement so that computation can be migrated to data.
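The points above can be sketched with a toy model. The 32-byte block size, the round-robin replica placement, the node names, and the CRC32 checksum are assumptions chosen to keep the example small; they do not describe how HDFS itself places blocks or computes checksums.

```python
import zlib


def split_into_blocks(data, block_size):
    """Divide file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Map each block index to `replication` distinct nodes, round-robin."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def checksum(block):
    """Checksum stored alongside a block to detect corruption on read."""
    return zlib.crc32(block)


def nodes_holding(block_index, placement):
    """Expose block placement so a task can be scheduled next to its data."""
    return placement[block_index]


nodes = ["DataNode 1", "DataNode 2", "DataNode 3", "DataNode 4"]
data = b"abcdefghij" * 10                  # a 100-byte "file"
blocks = split_into_blocks(data, block_size=32)
placement = place_replicas(len(blocks), nodes)
stored_sums = [checksum(b) for b in blocks]

# Simulate corruption of one replica of block 0 and detect it on read.
corrupted = b"X" + blocks[0][1:]
is_corrupt = checksum(corrupted) != stored_sums[0]
```

When a replica fails its checksum, a reader can fall back to another node returned by `nodes_holding`, which is what replication makes possible.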

[Figure: HDFS Write. Components: NameNode (metadata management); DataNode 1, DataNode 2, DataNode 3 (store data); Client. Labels: report, write, Block A, Block B, Block C.]

[Figure: HDFS Read. Components: NameNode (metadata management); DataNode 1, DataNode 2, DataNode 3 (store data); Client. Labels: report, read.]
