Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Hung-chih Yang (Yahoo!), Ali Dasdan (Yahoo!), Ruey-Lung Hsiao (UCLA), D. Stott Parker (UCLA)

Presentation transcript:

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (Yahoo!), Ali Dasdan (Yahoo!), Ruey-Lung Hsiao (UCLA), D. Stott Parker (UCLA)
SIGMOD 2007 (Industrial)
Presented by Kisung Kim

Contents
• Introduction
• Map-Reduce
• Map-Reduce-Merge
• Applications to Relational Data Processing
• Case Study
• Conclusion

Introduction
• New challenges of data processing
  – A vast amount of data collected from the entire WWW
• Solutions of search engine companies
  – Customized parallel data processing systems
  – Use large clusters of shared-nothing commodity nodes
  – Examples: Google’s GFS, BigTable, and MapReduce; Ask.com’s Neptune; Microsoft’s Dryad; Yahoo!’s Hadoop

Introduction
• Properties of data-intensive systems
  – Simple: adopt only a selected subset of database principles
  – Sufficiently generic and effective
  – Parallel data processing system deployed on large clusters of shared-nothing commodity nodes
  – Refactoring of data processing into two primitives: the Map function and the Reduce function
• Map-Reduce lets users ignore the nuisance details of:
  – Coordinating parallel sub-tasks
  – Maintaining distributed file storage
• This abstraction can greatly increase user productivity

Introduction
• The Map-Reduce framework is best at handling homogeneous datasets
  – Example: joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
• Extending Map-Reduce to process heterogeneous datasets simultaneously
  – Processing data relationships is ubiquitous
  – A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
  – Include relational algebra in the subset of database principles
• Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge

Map-Reduce
• Input dataset is stored in GFS
• Mapper
  – Reads splits of the input dataset
  – Applies the map function to the input records
  – Produces intermediate key/value sets
  – Partitions the intermediate sets into as many sets as there are reducers
• Reducer
  – Reads its part of the intermediate sets from the mappers
  – Applies the reduce function to the values that share a key
  – Outputs the final results
• Signatures of the Map and Reduce functions
  – map: (k1, v1) → [(k2, v2)]
  – reduce: (k2, [v2]) → [v3]
[Figure: input splits → mappers → intermediate sets → reducers → final results]
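The mapper and reducer steps above can be sketched in a few lines of single-process Python. This is only an illustration of the two signatures, not the GFS-backed system the slide describes: `run_map_reduce`, `word_map`, `count_reduce`, and the sample documents are hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn, num_reducers=2):
    """Toy driver: map every record, partition by key, then reduce each key."""
    # One dictionary per reducer; each maps an intermediate key to its values.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k1, v1 in records:
        # map: (k1, v1) -> [(k2, v2)]
        for k2, v2 in map_fn(k1, v1):
            partitions[hash(k2) % num_reducers][k2].append(v2)

    results = {}
    for partition in partitions:
        for k2, values in partition.items():
            # reduce: (k2, [v2]) -> [v3]
            results[k2] = reduce_fn(k2, values)
    return results


def word_map(doc_id, text):
    """Emit (word, 1) for every word in the document."""
    return [(word, 1) for word in text.split()]


def count_reduce(word, counts):
    """Sum the counts collected for one word."""
    return [sum(counts)]


docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
counts = run_map_reduce(docs, word_map, count_reduce)
```

Whichever reducer a key is assigned to, all of its values arrive at the same reducer, so the result does not depend on the number of reducers.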

Join using Map-Reduce: Homogenization
• Use a homogenization procedure
  – Apply one map/reduce task on each dataset
  – Insert a data-source tag into every value
  – Extract a key attribute common to all heterogeneous datasets
  – Transformed datasets now have two common attributes: key and data source
• Problems
  – Takes lots of extra disk space and incurs excessive map-reduce communication
  – Limited to queries that can be rendered as equi-joins

Join using Map-Reduce: Homogenization
Dataset 1
  Key   Value
  10    1, “Value1”
  85    1, “Value2”
  320   1, “Value3”
Dataset 2
  Key   Value
  10    2, “Value4”
  54    2, “Value5”
  320   2, “Value6”
[Figure: map/reduce tasks over Dataset 1 and Dataset 2; records with the same key are collected]
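As a rough sketch of the homogenization idea, the snippet below tags every value with its data source, groups the tagged records by key, and pairs up values from the two sources. It runs in one process and ignores disk and network costs; `homogenize`, `equi_join`, and the employee/department records are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def homogenize(dataset, source_tag):
    """Tag each value with its data source: (key, value) -> (key, (tag, value))."""
    return [(key, (source_tag, value)) for key, value in dataset]


def equi_join(dataset1, dataset2):
    """Collect tagged records with the same key, then pair values across sources."""
    grouped = defaultdict(list)
    for key, tagged in homogenize(dataset1, 1) + homogenize(dataset2, 2):
        grouped[key].append(tagged)

    joined = []
    for key in sorted(grouped):
        left = [value for tag, value in grouped[key] if tag == 1]
        right = [value for tag, value in grouped[key] if tag == 2]
        # Keys present in only one source produce no output (inner equi-join).
        joined.extend((key, lv, rv) for lv in left for rv in right)
    return joined


employees = [(10, "alice"), (85, "bob"), (320, "carol")]
departments = [(10, "sales"), (54, "legal"), (320, "ops")]
result = equi_join(employees, departments)
```

Because matching is done purely by grouping on an equal key, the sketch also shows why this approach is limited to equi-joins.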

Map-Reduce-Merge
• Signatures
  – Map-Reduce
      map: (k1, v1) → [(k2, v2)]
      reduce: (k2, [v2]) → [v3]
  – Map-Reduce-Merge (as given in the paper)
      map: (k1, v1)α → [(k2, v2)]α
      reduce: (k2, [v2])α → (k2, [v3])α
      merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
  – α, β, γ represent dataset lineages
  – The reduce function produces a key/value list instead of just values
  – The merge function reads data from both lineages
• These three primitives can be used to implement parallel versions of several join algorithms
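A minimal sketch of how the three primitives compose is shown below, assuming a reduce function that keeps its key and a merge function that receives one reduced pair from each lineage. The driver compares every pair of reduced outputs, which is a simplification for illustration; all function names and the bonus example are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group by key, then reduce; reduce returns (key, values)."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


def map_reduce_merge(alpha, beta, map_a, reduce_a, map_b, reduce_b, merge_fn):
    """Process two lineages separately, then merge their reduced outputs."""
    reduced_a = map_reduce(alpha, map_a, reduce_a)
    reduced_b = map_reduce(beta, map_b, reduce_b)
    merged = []
    for pair_a in reduced_a:
        for pair_b in reduced_b:
            # merge_fn decides whether a pair of records produces output.
            merged.extend(merge_fn(pair_a, pair_b))
    return merged


# Lineage alpha: (employee id, (department id, bonus)).
employees = [("e1", ("d1", 100)), ("e2", ("d1", 50)), ("e3", ("d2", 70))]
# Lineage beta: (department id, bonus multiplier).
departments = [("d1", 2), ("d2", 3)]


def emp_map(emp_id, info):
    return [info]  # re-key each bonus by department id


def emp_reduce(dept_id, bonuses):
    return (dept_id, [sum(bonuses)])


def dept_map(dept_id, multiplier):
    return [(dept_id, multiplier)]


def dept_reduce(dept_id, multipliers):
    return (dept_id, multipliers)


def bonus_merge(emp_pair, dept_pair):
    (k_emp, totals), (k_dept, multipliers) = emp_pair, dept_pair
    if k_emp != k_dept:
        return []
    return [(k_emp, totals[0] * multipliers[0])]


adjusted = map_reduce_merge(
    employees, departments, emp_map, emp_reduce, dept_map, dept_reduce, bonus_merge
)
```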

Merge Modules
• Merge function
  – Processes two pairs of key/values
• Processor function
  – Processes data from one source only
  – Users can define two processor functions
• Partition selector
  – Determines from which reducers this merger retrieves its input data, based on the merger number
• Configurable iterator
  – A merger has two logical iterators
  – Controls their movement relative to each other
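One way these modules might fit together is sketched below for a sort-merge style merger. It assumes each reducer output is sorted by key and holds at most one record per key; the selector policy (merger i reads partition i on both sides) and all names are hypothetical choices for the example.

```python
def select_partitions(merger_id, num_left, num_right):
    """Hypothetical partition selector: merger i reads partition i of each side."""
    return merger_id % num_left, merger_id % num_right


def identity(record):
    """Default processor: leave a record from one source unchanged."""
    return record


def merge_sorted(left, right, merge_fn, process_left=identity, process_right=identity):
    """Advance two iterators over key-sorted inputs, merging records with equal keys."""
    left = [process_left(record) for record in left]
    right = [process_right(record) for record in right]
    i = j = 0
    output = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # only the left iterator moves
        elif left_key > right_key:
            j += 1  # only the right iterator moves
        else:
            output.extend(merge_fn(left[i], right[j]))
            i += 1
            j += 1
    return output


def run_merger(merger_id, left_parts, right_parts, merge_fn):
    """Pick this merger's input partitions, then merge them."""
    li, ri = select_partitions(merger_id, len(left_parts), len(right_parts))
    return merge_sorted(left_parts[li], right_parts[ri], merge_fn)


def pair_values(left_record, right_record):
    """Merge function: combine the values of two records that share a key."""
    return [(left_record[0], (left_record[1], right_record[1]))]


left_parts = [[(1, "a"), (3, "c")], [(2, "b"), (4, "d")]]
right_parts = [[(1, "x"), (5, "z")], [(2, "y"), (4, "w")]]
merged_0 = run_merger(0, left_parts, right_parts, pair_values)
merged_1 = run_merger(1, left_parts, right_parts, pair_values)
```

Changing how `i` and `j` advance is what a configurable iterator would control; for instance, a nested-loop pattern would reset one iterator each time the other moves.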

Merge Modules
[Figure labels: Reducers for 1st Dataset, Reducers for 2nd Dataset, Reducer Output, Partition Selector, Processor, Iterator, Merge]

Applications to Relational Data Processing
• Map-Reduce-Merge can be used to implement primitive and some derived relational operators
  – Projection
  – Aggregation
  – Generalized selection
  – Joins
  – Set union
  – Set intersection
  – Set difference
  – Cartesian product
  – Rename
• Map-Reduce-Merge is relationally complete, while being load-balanced, scalable, and parallel
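To make a few of these operators concrete, here is a small single-process sketch in which projection and selection behave like per-record mappers and aggregation behaves like a grouping step followed by a reduce. The function names, the row layout, and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def project(rows, columns):
    """Projection as a mapper: keep only the named columns of each row."""
    return [{column: row[column] for column in columns} for row in rows]


def select(rows, predicate):
    """Selection as a mapper: emit a row only if the predicate holds."""
    return [row for row in rows if predicate(row)]


def aggregate(rows, key, column, fn=sum):
    """Aggregation: group by key (map side), then fold each group (reduce side)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {group_key: fn(values) for group_key, values in groups.items()}


rows = [
    {"dept": "d1", "name": "a", "salary": 10},
    {"dept": "d1", "name": "b", "salary": 20},
    {"dept": "d2", "name": "c", "salary": 5},
]
names = project(rows, ["name"])
well_paid = select(rows, lambda row: row["salary"] > 8)
totals = aggregate(rows, "dept", "salary")
```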

Example: Hash Join
[Figure: two pipelines of split → mapper → reducer, both feeding one merger]
• Mapper: use a hash partitioner
• Reducer: read from every mapper for one designated partition
• Merger: read from two sets of reducer outputs that share the same hashing buckets; one is used as the build set and the other as the probe set
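The build-and-probe step can be sketched as follows, assuming both inputs are partitioned with the same hash function so that matching keys land in buckets with the same number. This is an in-memory illustration with hypothetical names and data, not the distributed implementation.

```python
def hash_partition(records, num_buckets):
    """Partition (key, value) records into buckets using a shared hash function."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets


def join_bucket(build_side, probe_side):
    """Build a hash table on one input, then probe it with the other."""
    table = {}
    for key, value in build_side:
        table.setdefault(key, []).append(value)
    return [
        (key, build_value, probe_value)
        for key, probe_value in probe_side
        for build_value in table.get(key, [])
    ]


def hash_join(left, right, num_buckets=4):
    """Join bucket i of the left input only with bucket i of the right input."""
    left_buckets = hash_partition(left, num_buckets)
    right_buckets = hash_partition(right, num_buckets)
    output = []
    for bucket in range(num_buckets):
        output.extend(join_bucket(left_buckets[bucket], right_buckets[bucket]))
    return sorted(output)


left = [(1, "a"), (2, "b"), (5, "e")]
right = [(1, "x"), (2, "y"), (2, "z"), (7, "q")]
joined = hash_join(left, right)
```

Each bucket pair is independent of the others, which is what would let separate mergers process them in parallel.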

Case Study: TPC-H Query 2
• Involves 5 tables, 1 nested query, 1 aggregate and group-by clause, and 1 order-by clause

Case Study: TPC-H Query 2
• Map-Reduce-Merge workflow
  – 13 passes of Map-Reduce-Merge: 10 mappers, 10 reducers, and 4 mergers
  – After combining phases, 6 passes of Map-Reduce-Merge: 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer

Conclusion
• Map-Reduce-Merge programming model
  – Retains Map-Reduce’s many great features
  – Adds relational algebra to the list of database principles it upholds
  – Contains several configurable components that enable many data-processing patterns
• Next step
  – Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow
  – This work can readily reuse well-studied RDBMS techniques