MapReduce VS Parallel DBMSs

MapReduce VS Parallel DBMSs Presenter: Ran Ding

Outline
1. Introduction
2. Where MR wins
3. DBMS “sweet spot” tests
4. Why the parallel DBMS wins
5. Conclusion

Introduction: MR
The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.
Hadoop is a well-known open-source implementation.

Introduction: Parallel DBMS
Parallel DBMSs appeared in the mid-1980s.
The Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

Introduction: Horizontal partitioning
The rows of a relational table are distributed across the nodes of the cluster so they can be processed in parallel.

Introduction: DBMS
One benefit is that the system automatically manages the alternative partitioning strategies for the tables involved in a query, such as hash, range, and round-robin.
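
The three partitioning strategies named above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular DBMS implements them: the function names, the row layout, and the use of Python's built-in `hash` on integer keys are all assumptions made for the example.

```python
from bisect import bisect_right

def hash_partition(rows, key, n_nodes):
    """Place each row on node hash(key) mod n_nodes (integer keys assumed)."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

def range_partition(rows, key, boundaries):
    """Place each row by key range; `boundaries` is a sorted list of split
    points, giving len(boundaries) + 1 nodes."""
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes

def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes

# Hypothetical table: six rows with an integer id and an amount.
rows = [{"id": i, "amount": i * 10} for i in range(6)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "amount", [20, 40])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning keep rows with equal or nearby keys on the same node, which helps joins and range scans; round-robin only balances the load.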

Introduction: Mapping parallel DBMS onto MapReduce
It is not easy!
UDFs (user-defined functions) help.
The grouping done before Reduce resembles GROUP BY in SQL.
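
To make the GROUP BY comparison concrete, here is a small single-process sketch of the map, group, and reduce steps. It is only an illustration under assumed names: `run_mapreduce`, `map_fn`, `reduce_fn`, and the `visits` records are hypothetical and not part of Hadoop or any DBMS API.

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit (key, value) pairs; here one (url, revenue) pair per record."""
    yield record["url"], record["revenue"]

def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Hypothetical input records.
visits = [
    {"url": "a", "revenue": 1},
    {"url": "b", "revenue": 2},
    {"url": "a", "revenue": 3},
]
totals = run_mapreduce(visits, map_fn, reduce_fn)
```

Under these assumptions the job computes roughly what `SELECT url, SUM(revenue) FROM visits GROUP BY url` would compute in SQL, with `map_fn` and `reduce_fn` playing the role of user-defined functions.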

Where MR wins
1. ETL and “read once” data sets
2. Complex analytics
3. Semi-structured data
4. Quick-and-dirty analyses
5. Limited-budget operations

ETL and “read once” data sets
ETL stands for extract-transform-load.
An MR system can be considered a general-purpose parallel ETL system.
A DBMS may also perform the ETL.

Complex analytics
Some analyses cannot be structured as single SQL aggregate queries.
MR is a good candidate for them.

Semi-structured data
MR-style systems easily store and process semi-structured data.
A DBMS requires wide tables with many attributes to hold it.
MR systems are a good fit when the data is being prepared for loading into a back-end system.

Quick-and-dirty analyses
A DBMS needs the programmer to write the schema and then load the data.
With MR, you just copy the data in!

Limited-budget operations
MR systems are mostly open source and free.
Parallel DBMSs come at a huge cost.

DBMS “Sweet Spot” Test

Why the parallel DBMS wins
1. Repetitive record parsing
2. Compression
3. Pipelining
4. Scheduling
5. Column-oriented storage

Repetitive record parsing
In MR, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type.
A DBMS parses records once, when the data is initially loaded.
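
The cost difference can be shown with a toy counter. This is a sketch under assumptions: the `user|amount` line format, the `ParseCounter` helper, and the function names are invented for the example and do not reflect how Hadoop or any DBMS stores data.

```python
from dataclasses import dataclass

@dataclass
class ParseCounter:
    """Parses 'user|amount' lines and counts how many lines it has parsed."""
    calls: int = 0

    def parse(self, line):
        self.calls += 1
        user, amount = line.split("|")
        return user, int(amount)

RAW = ["alice|10", "bob|5", "alice|7"]  # hypothetical raw text records

def mr_style_total(raw_lines, counter):
    """MR style: every job re-parses the raw text before using it."""
    return sum(counter.parse(line)[1] for line in raw_lines)

def dbms_style_load(raw_lines, counter):
    """DBMS style: parse once at load time and keep typed tuples."""
    return [counter.parse(line) for line in raw_lines]

def dbms_style_total(table):
    """Queries read the already-typed tuples; no parsing happens here."""
    return sum(amount for _, amount in table)

# Run the same query three times under each approach.
mr_counter, db_counter = ParseCounter(), ParseCounter()
mr_results = [mr_style_total(RAW, mr_counter) for _ in range(3)]
table = dbms_style_load(RAW, db_counter)
db_results = [dbms_style_total(table) for _ in range(3)]
```

Both approaches return the same totals, but the parse count grows with every query in the MR-style version and stays fixed after loading in the DBMS-style version.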

Compression
The effect is hard to pin down.
Commercial DBMSs may use carefully tuned compression algorithms.

Pipelining
In a parallel DBMS, data is streamed from producer to consumer; the intermediate data is never written to disk.
In an MR system, the producer writes its results to local data structures, and consumers read from them.
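
The two strategies can be contrasted in a short sketch. It is only an analogy under stated assumptions: a Python generator stands in for streaming between operators, and a temporary JSON-lines file stands in for the intermediate data an MR system materializes; the function names are hypothetical.

```python
import json
import os
import tempfile

def produce(n):
    """Producer: yields one row at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}

def pipelined_sum(n):
    """Pipelined: rows flow from producer to consumer without being stored."""
    return sum(row["value"] for row in produce(n) if row["id"] % 2 == 0)

def materialized_sum(n):
    """Materialized: the producer writes every row to a local file first,
    and the consumer then reads the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.jsonl")
        with open(path, "w") as out:
            for row in produce(n):
                out.write(json.dumps(row) + "\n")
        with open(path) as src:
            rows = (json.loads(line) for line in src)
            return sum(row["value"] for row in rows if row["id"] % 2 == 0)
```

Both functions return the same answer; the materialized version pays for an extra write and read of the intermediate data, which in exchange makes it easier to restart a failed consumer.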

Scheduling
In a parallel DBMS, every node knows what it should do.
In an MR system, each task is scheduled on processing nodes one storage block at a time.

Column-oriented storage
Vertica reads only the attributes necessary for solving the user query.
DBMS-X and Hadoop are both row stores.
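
A toy comparison shows why reading only the needed attributes matters. This sketch is an assumption-laden illustration, not Vertica's storage format: the sample rows, the `fields_read` counter, and the function names are invented for the example.

```python
# Hypothetical table with four attributes per record.
ROWS = [
    {"id": 1, "url": "a", "revenue": 10, "agent": "x"},
    {"id": 2, "url": "b", "revenue": 20, "agent": "y"},
    {"id": 3, "url": "a", "revenue": 5, "agent": "z"},
]

def to_columns(rows):
    """Pivot a list of records into one list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def row_store_sum(rows, attr):
    """Row store: each whole record is read even though one field is used."""
    total, fields_read = 0, 0
    for row in rows:
        fields_read += len(row)
        total += row[attr]
    return total, fields_read

def column_store_sum(columns, attr):
    """Column store: only the requested attribute's column is read."""
    column = columns[attr]
    return sum(column), len(column)

row_total, row_fields = row_store_sum(ROWS, "revenue")
col_total, col_fields = column_store_sum(to_columns(ROWS), "revenue")
```

With four attributes per record, the row-store version touches four times as many fields as the column-store version for the same single-attribute query.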

What MR should learn from parallel DBMSs
MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

Conclusion
MR systems are powerful tools for ETL-style applications and for complex analytics.
If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

Thank you~~ Questions?