HadoopDB, Inneke Ponet. Contents: Introduction, Technologies for data analysis, HadoopDB, Desired properties, Layers of HadoopDB, HadoopDB Components.


HadoopDB Inneke Ponet

- Introduction
- Technologies for data analysis
- HadoopDB
- Desired properties
- Layers of HadoopDB
- HadoopDB Components

Introduction
- More and more data needs to be stored and processed.
- People want to perform increasingly complex calculations on the data they collect.
- Analytical databases are moving from high-end machines towards cheaper, lower-end machines.
- The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually.

Technologies for data analysis
Parallel databases:
- good performance,
- good efficiency.
MapReduce-based systems:
- superior scalability,
- good fault tolerance,
- good flexibility to handle unstructured data.

Parallel databases
- Support for standard relational tables and SQL.
- Implement techniques for better performance: indexing, compression, materialized views, result caching, I/O sharing.
- Data is partitioned (shared-nothing architecture), transparently to the end user.

Shared-nothing architecture
The DBMSs of most analytical databases are deployed on a shared-nothing architecture:
- A collection of machines that
  - are independent,
  - are possibly virtual,
  - have their own local disk and local main memory,
  - are connected by a high-speed network.
- Scalability of machines.
- Analysis tasks are easy to parallelize.

MapReduce
A technology from Google:
- processes (un)structured data that is distributed over many nodes in a shared-nothing cluster;
- works at enormous scale.
Map and Reduce:
- tasks run in parallel without communicating with each other;
- Map-repartition-Reduce cycles.
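The Map-repartition-Reduce cycle can be illustrated with a small single-process sketch. It is not Hadoop code: the function names (`map_phase`, `repartition`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs from one record, independently of all other records."""
    for word in record.split():
        yield word.lower(), 1

def repartition(pairs, num_reducers):
    """Repartition: send every pair with the same key to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)

def run_job(records, num_reducers=2):
    """Run one Map-repartition-Reduce cycle in a single process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    result = {}
    for partition in repartition(pairs, num_reducers):
        for key, values in partition.items():
            reduced_key, reduced_value = reduce_phase(key, values)
            result[reduced_key] = reduced_value
    return result

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
```

In a real cluster each call to `map_phase` and each partition's reduce work would run on a different node; only the repartition step moves data between nodes.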

MapReduce: advantages
- No detailed query execution plan is fixed in advance; scheduling happens at runtime:
  - adjusts to node failures and slow nodes,
  - (re)assigns tasks to faster nodes.
- Checkpoints the output to local disk, minimizing the work that must be redone after a failure.
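Runtime task assignment can be sketched as a tiny scheduler simulation. This is a simplified model under stated assumptions, not Hadoop's scheduler: the function `run_with_reassignment`, the task and worker names, and the set of failing workers are all hypothetical.

```python
from collections import deque

def run_with_reassignment(tasks, workers, failing):
    """Assign tasks at runtime; a task whose worker fails goes back into the queue."""
    pending = deque(tasks)
    healthy = deque(workers)
    assignment = {}
    while pending:
        if not healthy:
            raise RuntimeError("no healthy workers left")
        task = pending.popleft()
        worker = healthy.popleft()
        if worker in failing:
            # The failed worker is dropped and only this task is rescheduled;
            # work already finished elsewhere is kept.
            pending.appendleft(task)
            continue
        assignment[task] = worker
        healthy.append(worker)
    return assignment

assignment = run_with_reassignment(["t1", "t2", "t3"], ["a", "b", "c"], failing={"b"})
```

Because no plan binds a task to a node in advance, the failure of worker `b` costs one retry of `t2` rather than a restart of the whole job.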

HadoopDB
Hybrid database: a combination of
- a traditional DBMS,
- MapReduce technology.
Developed by Yale University students Azza Abouzeid and Kamil Bajda-Pawlikowski.
It is free and open source.

Desired properties
A. Performance
B. Fault tolerance
C. Heterogeneous environment
D. Flexible query interface
E. Scalability

A. Performance
- Primary characteristic used to distinguish systems.
- MapReduce: skips modeling and loading the data before processing => slower performance than parallel databases.
- Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware.
Verdict: parallel databases strong, MapReduce weak.

B. Fault tolerance
- Successfully commit transactions.
- Make progress on a workload.
- Heterogeneity and scalability => more faults, BUT MapReduce has good fault tolerance:
  - reassigning tasks;
  - sub-tasks minimize the effect of faults.
- Parallel databases: assume that failures are rare; more testing => slower performance.
Verdict: parallel databases weak, MapReduce strong.

C. Heterogeneous environment
- Nodes don't always run on
  - identical hardware,
  - an identical virtual machine.
- Nodes therefore differ in performance.
- Parallel databases: not tested on more than 100 nodes.
Verdict: parallel databases weak, MapReduce strong.

D. Flexible query interface
- Easy to write queries:
  - SQL and non-SQL interface languages,
  - use of tools.
- Robust mechanism for writing UDFs.
- Parallel databases: SQL, ODBC and UDFs.
- MapReduce-based systems: it is possible (Hive), but not always (Hadoop).
Verdict: parallel databases strong, MapReduce mixed.

E. Scalability
- Traditional DBMS: only scalable to 100 nodes.
- MapReduce-based systems: designed to scale to thousands of nodes in a shared-nothing architecture.
Reasons         Assumption
Failures        Failures are rare.
Heterogeneity   Homogeneous array of machines.
Not tested      There are no applications with more than a few dozen nodes.
Verdict: parallel databases weak, MapReduce strong.

Desired properties
Property                    Parallel databases   MapReduce
Performance                 strong               weak
Fault tolerance             weak                 strong
Heterogeneous environment   weak                 strong
Flexible query interface    strong               mixed
Scalability                 weak                 strong

Layers of HadoopDB
- Communication: Hadoop
- Database: PostgreSQL
- Translation: Hive

Hadoop
- Communication layer of HadoopDB.
- Hadoop framework, two layers:
  - Hadoop Distributed File System (HDFS),
  - MapReduce framework.
- Cost: free, open-source MapReduce implementation.

PostgreSQL
- Relational DBMS.
- (Possible) database layer of HadoopDB.
- Cost: free/open source.

Hive
- Translation layer.
- Processing of a SQL query:
  - Query => abstract syntax tree.
  - MetaStore: schema of the table(s).
  - Logical query plan: DAG of relational operators.
  - Optimized plan.
  - Physical executable plan: MapReduce job(s).
  - XML plan: serialized DAG.
  - Hive Driver executes a Hadoop job.

HadoopDB Components
- Database Connector:
  - interface between independent database systems;
  - extends the InputFormat class (of Hadoop);
  - connects to any JDBC-compliant database.
- Catalog:
  - meta-information about the databases:
    - connection parameters,
    - metadata.
  - XML file in HDFS, accessed by:
    - the master node,
    - the worker/slave nodes.
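The interplay of catalog and connector can be sketched as follows. The XML layout, the host names and the helper names (`load_catalog`, `read_records`) are hypothetical, since the slides do not show the real catalog schema; an in-memory SQLite database stands in for a JDBC-compliant database on a worker node.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical catalog layout; the actual HadoopDB catalog schema is not given in the slides.
CATALOG_XML = """
<catalog>
  <node host="worker1">
    <database url=":memory:" table="sales" partition="0"/>
  </node>
  <node host="worker2">
    <database url=":memory:" table="sales" partition="1"/>
  </node>
</catalog>
"""

def load_catalog(xml_text):
    """Return {host: connection parameters and metadata} from the catalog document."""
    root = ET.fromstring(xml_text)
    catalog = {}
    for node in root.findall("node"):
        database = node.find("database")
        catalog[node.get("host")] = dict(database.attrib)
    return catalog

def read_records(connection, query):
    """Connector-style reader: run a query and hand the rows back as records."""
    yield from connection.execute(query)

catalog = load_catalog(CATALOG_XML)
conn = sqlite3.connect(catalog["worker1"]["url"])  # SQLite stands in for the node's database
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
rows = list(read_records(conn, "SELECT id, amount FROM sales ORDER BY id"))
```

The point of the design is that the framework only needs the catalog entry to reach a node's database; the rows then enter the MapReduce job like any other input records.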

HadoopDB Components (2)
- Data loader:
  - Global hasher:
    - custom MapReduce job on files in HDFS;
    - repartitions the data upon loading.
  - Local hasher:
    - copies a partition from HDFS to the local file system;
    - partitions the file into smaller chunks.
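The two hashing stages of the data loader can be sketched in a few lines. This is a minimal in-memory illustration, not HadoopDB's loader: the names `stable_hash`, `global_hasher` and `local_hasher`, the use of CRC32, and the sample rows are assumptions.

```python
import zlib

def stable_hash(key, salt=""):
    """Deterministic hash (Python's built-in hash() of a str changes between runs)."""
    return zlib.crc32(f"{salt}{key}".encode("utf-8"))

def global_hasher(rows, key_index, num_nodes):
    """First stage: repartition all rows across the nodes by hashing the key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[stable_hash(row[key_index]) % num_nodes].append(row)
    return partitions

def local_hasher(partition, key_index, num_chunks):
    """Second stage: split one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        # A salted hash keeps the chunk choice independent of the node choice.
        chunks[stable_hash(row[key_index], salt="local:") % num_chunks].append(row)
    return chunks

rows = [(i, f"user{i % 5}") for i in range(20)]
node_partitions = global_hasher(rows, key_index=1, num_nodes=3)
node_chunks = [local_hasher(p, key_index=1, num_chunks=2) for p in node_partitions]
```

Rows with the same key always land on the same node and in the same chunk, which is what later allows work on that key to stay local.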

HadoopDB Components (3)
- SQL to MapReduce:
  - parallel database front-end to process SQL queries.
- HiveQL is transformed into MapReduce jobs, which:
  - connect to tables stored in HDFS;
  - consist of DAGs of relational operators that operate as iterators.
- Hive assumes no collocation of tables:
  - operations on multiple tables are performed in the Reduce function.
- NOT in HadoopDB: a join operation can be pushed to the database layer.
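When tables are not collocated, a join has to wait until the Reduce function sees all rows that share a key. A minimal sketch of such a reduce-side join follows; the function names, the `"L"`/`"R"` tags and the sample tables are assumptions for illustration and are not taken from Hive.

```python
from collections import defaultdict

def map_tagged(table_name, rows, key_index):
    """Map: key every row by its join key and tag it with its source table."""
    for row in rows:
        yield row[key_index], (table_name, row)

def reduce_join(tagged_rows, left, right):
    """Reduce: all rows for one key arrive together; pair left rows with right rows."""
    left_rows = [row for tag, row in tagged_rows if tag == left]
    right_rows = [row for tag, row in tagged_rows if tag == right]
    for left_row in left_rows:
        for right_row in right_rows:
            yield left_row + right_row

def reduce_side_join(left_rows, right_rows, left_key, right_key):
    """Group the tagged map output by key (the repartition step), then join per key."""
    groups = defaultdict(list)
    for key, tagged in map_tagged("L", left_rows, left_key):
        groups[key].append(tagged)
    for key, tagged in map_tagged("R", right_rows, right_key):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key], "L", "R"))
    return joined

users = [(1, "ann"), (2, "bob")]
visits = [(1, "/home"), (1, "/docs"), (3, "/faq")]
joined = reduce_side_join(users, visits, left_key=0, right_key=0)
```

Every row of both tables crosses the repartition step here, which is the cost HadoopDB avoids when the join can be pushed into the database layer.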

HadoopDB Components (4)
- SQL/SMS planner:
  - Modifies Hive:
    - updates the MetaStore.
  - Two passes over the physical plan:
    1. Determine the partition keys for the Reduce Sink Operators.
    2. Operators are:
       - converted into SQL queries;
       - pushed into the database layer.
  - Only filter, select and aggregation operators.
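Pushing filter and aggregation operators into the database layer can be sketched like this. It is an illustration under stated assumptions, not the planner's actual output: the table `sales`, the query in `PUSHED_SQL` and the helper names are hypothetical, and in-memory SQLite databases stand in for the per-node databases.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one node's local database holding its partition of the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Filter and aggregation are expressed as SQL and evaluated inside each node's database.
PUSHED_SQL = "SELECT region, SUM(amount) FROM sales WHERE amount > ? GROUP BY region"

def map_on_node(conn, min_amount):
    """Map task: run the pushed-down SQL locally and emit partial aggregates."""
    yield from conn.execute(PUSHED_SQL, (min_amount,))

def reduce_totals(partials):
    """Reduce task: merge the per-node partial sums into global sums."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)

nodes = [
    make_node([("east", 10), ("west", 3), ("east", 7)]),
    make_node([("west", 20), ("east", 1)]),
]
partials = [p for conn in nodes for p in map_on_node(conn, 2)]
totals = reduce_totals(partials)
```

Only the small partial aggregates leave each node, so the database does the heavy scanning and MapReduce merely combines the results.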

HadoopDB Components (5)

Questions?