Based on the text by Jimmy Lin and Chris Dyer, and on the Yahoo! tutorial on MapReduce at index.html



• Namenode responsibilities:
  1. Namespace management: file names, locations, access privileges, etc.
  2. Coordinating client operations: directing clients to datanodes, garbage collection, etc.
  3. Maintaining the overall health of the system: replication factor, replica balancing, etc.
  4. The namenode does not take part in any computation.

• A MapReduce job uses individual files as the basic unit for splitting input data.
• Workloads are batch-oriented, dominated by long streaming reads and large sequential writes.
• Applications are aware of the distributed file system.
• The file system can be implemented in an environment of cooperative users.
• See Figure 2.6 and understand it.
• Operations: (mapper, reducer) {combiner} [partitioner, shuffle and sort]. These operations have specific meanings in the MR context; you must understand them fully before using them.
• Finally, study the job configuration: the items you can specify declaratively and how to specify these attributes.

• Module 4 in the Yahoo! tutorial
• Read every line of the Functional programming section
• Understand the mapper, the reducer and, most importantly, the driver method (job config)
• Module 5: read the details about the partitioner
• Metrics
• Monitoring: web monitoring possible

• Figure 2.1: map and fold
• Map is a "transformation" function that can be carried out in parallel: it can work on the elements of a list in parallel.
• Fold is an "aggregation" function that has restrictions on data locality: it requires the elements of the list to be brought together before the operation.
• For operations that are associative and commutative, significant performance gains can be achieved by local aggregation and sorting.
• The user specifies the map and reduce operations, and the execution framework coordinates the execution of the programs and the data movement.
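The map and fold idea can be sketched in a few lines of Python. This is only an illustration on an in-memory list, not part of the original slides; the function `square` and the sample data are assumptions chosen for the example.

```python
from functools import reduce


def square(x):
    # The per-element "transformation"; each call is independent of the others.
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply the transformation to every element. Because the calls do not
# depend on one another, a framework could run them in parallel.
squared = list(map(square, numbers))

# Fold: an aggregation that needs the elements brought together.
total = reduce(lambda acc, x: acc + x, squared, 0)

# Addition is associative and commutative, so the list can be split into
# chunks, each chunk folded locally, and the partial results folded again.
chunks = [squared[:2], squared[2:]]
partials = [reduce(lambda acc, x: acc + x, chunk, 0) for chunk in chunks]
total_from_partials = reduce(lambda acc, x: acc + x, partials, 0)
```

Folding the partial results gives the same total as folding the whole list, which is the property that local aggregation relies on.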

• Imposes structure on data
  ◦ Example 1:
  ◦ Example 2:
• map: (k1, v1) → [(k2, v2)]
• reduce: (k2, [v2]) → [(k3, v3)]
• Map generates intermediate values; these are implicitly grouped by key (a "group by" operation), and the keys arrive in sorted order within a given reducer.
• Each reducer's output is written to an external file.
• The reduce method is called once for each key in the part of the key space assigned to that reducer.
• A mapper with an identity reducer is essentially a sorter.
• A typical MapReduce job processes data in a distributed file system and writes its output back to the same file system.
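The map and reduce signatures above can be tried out with a small single-process sketch. It is an assumption-laden stand-in for the real framework: `run_job`, `wc_mapper`, `wc_reducer` and the identity functions are hypothetical names, and the group-by step is done with an in-memory dictionary rather than a distributed shuffle.

```python
from collections import defaultdict


def wc_mapper(doc_id, text):
    # map: (k1, v1) -> [(k2, v2)]; emit (word, 1) for every word in the text.
    return [(word, 1) for word in text.split()]


def wc_reducer(word, counts):
    # reduce: (k2, [v2]) -> [(k3, v3)]; sum the counts for one word.
    return [(word, sum(counts))]


def identity_mapper(key, value):
    return [(key, value)]


def identity_reducer(key, values):
    return [(key, value) for value in values]


def run_job(records, mapper, reducer):
    # Map phase: run the mapper on every input record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Implicit "group by": collect all values that share a key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: the reducer is called once per key, in sorted key order.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_job(docs, wc_mapper, wc_reducer)

# With identity functions the job just returns the pairs sorted by key.
sorted_pairs = run_job([("b", 1), ("a", 2)], identity_mapper, identity_reducer)
```

The second call illustrates the remark that a mapper with an identity reducer behaves like a sorter: nothing is computed, but the output comes back ordered by key.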

• Data storage: output from MR could go into a sparse multi-dimensional table, called BigTable in Google's system.
• The Apache open source version is HBase.
• HBase is a column-based table.
• Rows, and column families each with many columns.
• In a relational schema, data is stored normalized.
• Data in HBase is not normalized, by choice and by design.
• Column families are stored together, and the storage methods are optimized for this.
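One way to picture a sparse table of rows and column families is as nested dictionaries. The sketch below is a toy model only: `SparseTable` and its methods are hypothetical names and do not correspond to the HBase or BigTable API.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are fixed up front; each family keeps its own store,
        # mirroring the idea that a family's data is stored together.
        self._store = {cf: defaultdict(dict) for cf in column_families}

    def put(self, row, family, column, value):
        if family not in self._store:
            raise KeyError(f"unknown column family: {family}")
        self._store[family][row][column] = value

    def get(self, row, family, column, default=None):
        # Missing cells cost nothing to store; they simply are not present.
        return self._store[family].get(row, {}).get(column, default)

    def row(self, row):
        # Gather the cells of one row across all families that contain it.
        return {
            cf: dict(rows[row])
            for cf, rows in self._store.items()
            if row in rows
        }


table = SparseTable(["info", "links"])
table.put("page1", "info", "title", "MapReduce Basics")
table.put("page2", "links", "out", "page1")
```

Each row may use a different set of columns within a family, so most cells are simply absent. That is the sense in which the table is sparse and not normalized.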

• Very interesting since there are many tasks to manage.
• Transparent, policy-driven, predictable multi-user scheduling
• Speculative scheduling: because of the barrier between map and reduce, the map phase is only as fast as its slowest map task; speculative scheduling is a way of managing such stragglers.
• But how to handle skew in the data? Better local aggregation.
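The effect of a straggler on the map phase can be shown with simple arithmetic. The sketch below is a simplified model, not how any real scheduler is implemented: the task durations, the function names and the backup finishing time are all made-up values for illustration.

```python
def phase_time(task_times):
    # With a barrier, reduce cannot start until every map task has finished,
    # so the phase takes as long as its slowest task.
    return max(task_times)


def phase_time_with_speculation(task_times, backup_finish_times):
    # backup_finish_times maps a task index to the time (measured from the
    # start of the phase) at which a speculative backup copy would finish.
    # Whichever copy finishes first wins.
    effective = list(task_times)
    for index, finish in backup_finish_times.items():
        effective[index] = min(effective[index], finish)
    return max(effective)


map_times = [10, 11, 9, 40]          # task 3 is a straggler
without_backup = phase_time(map_times)
with_backup = phase_time_with_speculation(map_times, {3: 22})
```

In this model a backup copy helps only when the task is slow because of its machine. If the task is slow because its input is skewed, the backup copy is just as slow, which is why the slide points to better local aggregation for skew.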

• Data/operation co-location
• Synchronization: intermediate data is copied to the reducers while the map phase is still running; a barrier exists between map and reduce
• Error and fault tolerance: hardware as well as software

• Partitioners divide the intermediate key space and assign the parts to the reducers.
• Combiners are an optimization that performs local aggregation before the shuffle and sort.
• Thus a complete MR job consists of a mapper, reducer, combiner, partitioner and job configuration; the rest is taken care of by the execution framework.
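The roles of the partitioner and the combiner can be sketched as follows. This is a single-process illustration under stated assumptions: `partition`, `combine` and `shuffle` are hypothetical names, and hashing with CRC32 is just one possible way to divide the key space.

```python
import zlib
from collections import Counter, defaultdict


def partition(key, num_reducers):
    # Divide the intermediate key space: every occurrence of a key maps to
    # the same reducer. CRC32 is used because Python's built-in hash() of
    # strings changes from one process to the next.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    # Local aggregation of one mapper's output before it leaves the mapper.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffle(mapper_outputs, num_reducers):
    # Route each combined pair to the reducer chosen by the partitioner.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],   # output of mapper 1
    [("a", 1), ("c", 1)],             # output of mapper 2
]
combined_first = combine(mapper_outputs[0])
buckets = shuffle(mapper_outputs, 2)
```

The combiner shrinks mapper 1's three pairs to two before anything is sent, and the partitioner guarantees that all values for a given key end up at a single reducer.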