Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.

Presentation transcript:

Hadoop MapReduce
- MapReduce is one of the most successful realizations of large-scale "data-parallel" distributed analytics engines
- Hadoop is an open source implementation of MapReduce
- Hadoop MapReduce uses the Hadoop Distributed File System (HDFS) as a distributed storage layer
- HDFS is an open source implementation of GFS

GFS Data Distribution Policy
- The Google File System (GFS) is a scalable DFS for data-intensive applications
- GFS divides large files into multiple pieces called chunks or blocks (by default 64 MB) and stores them on different data servers
- This design is referred to as block-based design
- Each GFS chunk has a unique 64-bit identifier and is stored as a file in the lower-layer local file system on the data server
- GFS distributes chunks across cluster data servers using a random distribution policy

GFS Random Distribution Policy
[Figure: a large file is divided into seven 64 MB blocks, Blk 0 to Blk 6, starting at offsets 0M, 64M, 128M, 192M, 256M, 320M, and 384M. Server 0 (the writer) holds all seven blocks; the other copies are scattered across Servers 1, 2, and 3.]
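
The chunking and random placement described above can be sketched in a few lines of Python. This is a minimal illustration, not GFS code: the function names, the replication factor of 3, and the use of random bits as a stand-in for the 64-bit chunk identifier are all assumptions made for the example.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size quoted on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (start, end) byte ranges, one per chunk; the last one may be short."""
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


def place_chunks(num_chunks, servers, replicas=3, seed=None):
    """Choose `replicas` distinct servers uniformly at random for every chunk.

    replicas=3 is an assumption for illustration; the slide does not state it.
    """
    rng = random.Random(seed)
    placement = {}
    for _ in range(num_chunks):
        # Hypothetical stand-in for the unique 64-bit chunk identifier.
        chunk_id = rng.getrandbits(64)
        while chunk_id in placement:
            chunk_id = rng.getrandbits(64)
        placement[chunk_id] = rng.sample(servers, replicas)
    return placement


# A 448 MB file yields seven chunks, as in the figure.
ranges = split_into_chunks(448 * 1024 * 1024)
placement = place_chunks(len(ranges), ["server0", "server1", "server2", "server3"], seed=1)
print(len(ranges), "chunks placed on", len(placement), "replica sets")
```

Passing a seed makes the placement repeatable, which is convenient for experiments; without it every run scatters the chunks differently.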

GFS Architecture
- GFS adopts a master-slave architecture
[Figure: a GFS client sends a file name to the master and receives a contact address. The client then sends a chunk ID and byte range to a chunk server, which returns the chunk data. Each chunk server stores its chunks in a local Linux file system.]
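
The exchange in the architecture figure can be mimicked with two small classes. This is a toy sketch under stated assumptions: the class and method names (`Master.lookup`, `ChunkServer.read`, `client_read`) are hypothetical, everything lives in one process, and replication, caching, and failure handling are left out.

```python
class Master:
    """Holds metadata only: which chunks make up a file and where they live."""

    def __init__(self):
        self.files = {}      # file name -> list of chunk ids, in file order
        self.locations = {}  # chunk id -> list of chunk servers holding it

    def lookup(self, file_name, chunk_index):
        """Answer 'file name' with a chunk id and the servers to contact."""
        chunk_id = self.files[file_name][chunk_index]
        return chunk_id, self.locations[chunk_id]


class ChunkServer:
    """Stores chunk contents; a dict stands in for the local Linux file system."""

    def __init__(self):
        self.chunks = {}     # chunk id -> bytes

    def read(self, chunk_id, start, end):
        """Answer 'chunk id, range' with the requested chunk data."""
        return self.chunks[chunk_id][start:end]


def client_read(master, file_name, chunk_index, start, end):
    # Step 1: metadata request to the master.
    chunk_id, servers = master.lookup(file_name, chunk_index)
    # Step 2: data request goes straight to a chunk server, not through the master.
    return servers[0].read(chunk_id, start, end)


# Hypothetical one-chunk file served by a single chunk server.
server = ChunkServer()
server.chunks[42] = b"hello gfs"
master = Master()
master.files["/logs/a.txt"] = [42]
master.locations[42] = [server]
print(client_read(master, "/logs/a.txt", 0, 0, 5))
```

The point of the split is that the master handles only small metadata requests, while the bulk data moves directly between clients and chunk servers.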

The Problem Scope
- Hadoop MapReduce is used for powerful and efficient analytics over Big Data
- The power of MapReduce lies in its ability to scale to 100s and even 1000s of machines
- What amount of work can MapReduce handle?
  - Big Data in the order of 100s of GBs, TBs or PBs
- It is unlikely that datasets of such sizes can fit on a single machine
- Hence, a storage layer like HDFS is required!

Hadoop MapReduce: A System’s View
- Hadoop MapReduce incorporates two phases, Map and Reduce phases, which encompass multiple Map and Reduce tasks
[Figure: a dataset stored in HDFS is read as HDFS blocks and presented as input splits (Split 0 to Split 3). In the Map phase, each split is processed by a Map task, whose output is divided into partitions. The Reduce phase consists of a Shuffle stage, a Merge stage, and a Reduce stage; each Reduce task writes its output to HDFS.]
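
The phases and stages named above can be traced in a single-process sketch. It is not Hadoop code: `run_job` and `partition_of` are hypothetical names, the CRC-based partitioner is only a deterministic stand-in for hash partitioning, and the tasks run one after another rather than in parallel on a cluster.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def partition_of(key, num_reducers):
    """Assign a key to a Reduce task (deterministic stand-in for hash partitioning)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    # Map phase: one Map task per split; each task partitions its own output.
    map_outputs = []
    for split in splits:
        partitions = defaultdict(list)
        for key, value in split:
            for k2, v2 in map_fn(key, value):
                partitions[partition_of(k2, num_reducers)].append((k2, v2))
        map_outputs.append(partitions)

    results = {}
    for r in range(num_reducers):
        # Shuffle stage: Reduce task r fetches partition r from every Map task.
        fetched = [pair for out in map_outputs for pair in out.get(r, [])]
        # Merge stage: sort by key so that equal keys become adjacent.
        fetched.sort(key=lambda kv: kv[0])
        # Reduce stage: call the Reduce function once per distinct key.
        for key, group in groupby(fetched, key=lambda kv: kv[0]):
            values = [v for _, v in group]
            for k3, v3 in reduce_fn(key, values):
                results[k3] = v3
    return results


def parity_map(record_id, number):
    yield ("even" if number % 2 == 0 else "odd"), number


def sum_reduce(key, values):
    yield key, sum(values)


splits = [[(0, 1), (1, 2)], [(2, 3), (3, 4)]]
print(run_job(splits, parity_map, sum_reduce))
```

Because all pairs with the same key land in the same partition, each key is reduced exactly once no matter how many Map tasks emitted it.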

Data Structure: Keys and Values
- The MapReduce programmer has to specify only two “sequential” functions, the Map and the Reduce functions
- These functions will be translated “automatically” into multiple Map and Reduce tasks
- In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
- In particular, the Map and Reduce functions receive and emit (K, V) pairs
[Figure: Input Splits, (K, V) pairs -> Map Function -> Intermediate Outputs, (K’, V’) pairs -> Reduce Function -> Final Outputs, (K’’, V’’) pairs]
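
The shapes of the two functions can be written down as Python type aliases. This is a sketch: the alias names and the example functions are hypothetical, and `apply` is a purely sequential helper that only groups values by key. The Reduce function is given a key together with the list of all values emitted for it, which is how the intermediate (K’, V’) pairs reach it in practice.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# Map:    (K, V)     -> zero or more (K', V') pairs
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# Reduce: (K', [V']) -> zero or more (K'', V'') pairs
ReduceFn = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]


def first_letter_lengths(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    """Example Map function: emit (first letter, word length) for each word."""
    for word in line.split():
        yield word[0].lower(), len(word)


def longest(letter: str, lengths: List[int]) -> Iterable[Tuple[str, int]]:
    """Example Reduce function: keep the longest word length per first letter."""
    yield letter, max(lengths)


def apply(map_fn: MapFn, reduce_fn: ReduceFn, records):
    """Run both functions sequentially, grouping intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)
    return [pair for k2 in sorted(grouped) for pair in reduce_fn(k2, grouped[k2])]


print(apply(first_letter_lengths, longest, [(0, "map reduce"), (11, "merge run")]))
```

Neither function knows anything about tasks, machines, or files, which is what allows a framework to turn them into many parallel tasks.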

WordCount: An Application View

A text file:
  Mohammad is delivering a lecture at CMUQ
  CMUQ is a member of QF

Each chunk of the file is processed by a Map function (Parse & Count).

First Map function, input:
  Key1  Value1
  0     Mohammad is
  20    delivering a
  18    lecture at CMUQ

First Map function, output:
  Key2        Value2
  Mohammad    1
  is          1
  delivering  1
  a           1
  lecture     1
  at          1
  CMUQ        1

Second Map function, input:
  Key1  Value1
  0     CMUQ is a
  17    member of QF

Second Map function, output:
  Key2    Value2
  CMUQ    1
  is      1
  a       1
  member  1
  of      1
  QF      1

A Reduce function (Iterate & Sum), output:
  Key2        Value2
  Mohammad    1
  is          2
  delivering  1
  a           2
  lecture     1
  at          1
  CMUQ        2
  member      1
  of          1
  QF          1
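
The WordCount flow above can be reproduced with two short functions. This is a sequential sketch rather than Hadoop code: `wc_map`, `wc_reduce`, and `word_count` are hypothetical names, and the key passed to the Map function is assumed to be a running character offset of the line within its chunk.

```python
from collections import defaultdict


def wc_map(offset, line):
    """Parse & Count: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    """Iterate & Sum: add up all the 1s emitted for this word."""
    total = 0
    for count in counts:
        total += count
    yield word, total


def word_count(chunks):
    intermediate = defaultdict(list)
    for chunk in chunks:                 # one Map function call sequence per chunk
        offset = 0
        for line in chunk:
            for word, one in wc_map(offset, line):
                intermediate[word].append(one)
            offset += len(line) + 1      # assumed: one separator character per line
    return dict(pair
                for word, counts in intermediate.items()
                for pair in wc_reduce(word, counts))


chunks = [
    ["Mohammad is", "delivering a", "lecture at CMUQ"],
    ["CMUQ is a", "member of QF"],
]
counts = word_count(chunks)
print(counts)
```

The words "is", "a", and "CMUQ" appear in both chunks, so each receives two 1s from the Map side and the Reduce function sums them to 2.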