HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.

1 HadoopDB Inneke Ponet

2  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components

3  More and more data needs to be stored and processed.  People want to do increasingly complex calculations on their collected data.  Analytical databases are moving from high-end machines towards cheaper lower-end machines.  The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually. Introduction

4 Parallel databases:  good performance,  good efficiency. MapReduce-based systems:  superior scalability,  good fault tolerance,  good flexibility to handle unstructured data. Technologies for data analysis

5  Support standard relational tables and SQL.  Implement techniques for better performance:  Indexing, compression, materialized views, result caching, I/O sharing.  Data is partitioned (shared-nothing architecture)  transparent to the end user. Parallel databases

6 The DBMSs of most analytical databases are deployed on a shared-nothing architecture:  A collection of machines that  are independent,  are possibly virtual,  have their own local disk and local main memory,  are connected by a high-speed network.  Scalability of machines.  Analysis tasks are easy to parallelize. Shared-nothing architecture

7 A technology from Google:  processes (un)structured data that is distributed on many nodes in a shared-nothing cluster;  works at enormous scale. Map and Reduce:  parallel without communicating;  Map-repartition-Reduce cycles. MapReduce
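
The Map-repartition-Reduce cycle on this slide can be illustrated with a small single-process sketch. It is not Hadoop code: the function names (`map_phase`, `repartition`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration, and the "nodes" are just Python lists.

```python
import zlib
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def repartition(pairs, num_reducers):
    """Repartition: route every key to exactly one reducer and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # crc32 gives a hash that is stable across runs (unlike built-in hash on str).
        target = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[target][key].append(value)
    return partitions

def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

def run_job(records, num_reducers=2):
    """Run one Map-repartition-Reduce cycle over the input records."""
    # Each record is mapped independently, so map tasks need no communication.
    mapped = [pair for record in records for pair in map_phase(record)]
    result = {}
    for partition in repartition(mapped, num_reducers):
        # Each reducer only sees the keys routed to it.
        for key, values in partition.items():
            out_key, total = reduce_phase(key, values)
            result[out_key] = total
    return result

counts = run_job(["hadoop db", "hadoop hive db", "hive"])
```

Map tasks and Reduce tasks never talk to each other directly; the only data exchange is the repartition step between them, which is what lets both phases run in parallel.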

8 No detailed query execution plan in advance  at runtime:  adjusts to node failures and slow nodes  (re)assigns tasks to faster nodes. Checkpoints the output to local disk  minimizes the work that must be redone after a failure. MapReduce: advantages
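
A toy sketch of runtime task reassignment, under strong simplifying assumptions: tasks are handed out round-robin, failures are known up front, and the function name `run_with_reassignment` is hypothetical. Real schedulers detect failures and slow nodes while tasks are running.

```python
def run_with_reassignment(tasks, nodes, failed_nodes):
    """Return a mapping task -> node that finally completed it.

    Tasks first go round-robin to all nodes. Tasks that landed on a failed
    node are rerun on healthy nodes; completed tasks are never redone.
    """
    healthy = [n for n in nodes if n not in failed_nodes]
    if not healthy:
        raise RuntimeError("no healthy nodes left")

    completed = {}
    pending = []
    for i, task in enumerate(tasks):
        node = nodes[i % len(nodes)]
        if node in failed_nodes:
            pending.append(task)      # only this sub-task is lost
        else:
            completed[task] = node    # output is kept, no rework needed

    # Reassign the lost sub-tasks across the remaining healthy nodes.
    for i, task in enumerate(pending):
        completed[task] = healthy[i % len(healthy)]
    return completed

done = run_with_reassignment(["t0", "t1", "t2", "t3"], ["a", "b"], {"b"})
```

Because the work is split into small sub-tasks, a node failure costs only the sub-tasks that node held, not the whole query.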

9 Hybrid database:  a combination of:  traditional DBMS,  MapReduce technology. Developed by Yale University students: Azza Abouzeid and Kamil Bajda-Pawlikowski  It is free and open source. HadoopDB

10 A.Performance B.Fault tolerance C.Heterogeneous environment D.Flexible query interface E.Scalability Desired properties

11  Primary characteristic used to distinguish systems.  MapReduce: gives up the benefits of first modeling and loading data before processing  slower performance than parallel databases.  Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware. A. Performance  Parallel databases  MapReduce

12  Successfully commit transactions.  Make progress on a workload.  Heterogeneity and scalability  more faults BUT MapReduce  good fault tolerance:  reassigning tasks;  sub-tasks minimize the effect of faults.  Parallel databases: assume that failures are rare; more testing => slower performance. B. Fault tolerance  Parallel databases  MapReduce

13  Nodes don’t always run on  identical hardware,  an identical virtual machine.  Different performance.  Parallel databases: not tested on more than 100 nodes. C. Heterogeneous environment  Parallel databases  MapReduce

14  Easy to make queries:  SQL and non-SQL interface languages,  Use of tools.  Robust mechanism for writing UDFs.  Parallel databases: SQL, ODBC and UDFs.  MapReduce-based systems: it is possible (Hive), but not always (Hadoop). D. Flexible query interface  Parallel databases  /  MapReduce

15 Traditional DBMS:  only scalable to 100 nodes.  MapReduce-based systems:  designed to scale to thousands of nodes in a shared-nothing architecture. E. Scalability Reasons and assumptions:  Failures: failures are rare.  Heterogeneity: homogeneous array of machines.  Not tested: there are no applications with more than a few dozen nodes.  Parallel databases  MapReduce

16 Parallel databases vs. MapReduce:  Performance  Fault tolerance  Heterogeneous environment  Flexible query interface  //  Scalability  Desired properties

17  Communication: Hadoop  Database: PostgreSQL  Translation: Hive Layers of HadoopDB

18  Communication layer of HadoopDB.  Hadoop framework  two layers:  Hadoop Distributed File System (HDFS),  MapReduce framework.  Cost: free/open source  MapReduce. Hadoop

19  Relational DBMS.  (Possible) database layer of HadoopDB.  Cost: free/open source. PostgreSQL

20  Translation layer.  Processing of a SQL query:  Query  Abstract Syntax Tree.  MetaStore: schema of the table(s).  Logical query plan: DAG of relational operators.  Optimized plan.  Physical executable plan: MapReduce job(s).  XML plan: DAG  serialized.  Hive Driver executes a Hadoop job. Hive

21  Database Connector:  Interface between independent database systems;  Extends the InputFormat class (of Hadoop);  Connects to any JDBC-compliant database.  Catalog:  Meta-information about the databases:  connection parameters,  metadata.  XML file in HDFS  accessed by:  Master node,  Worker/Slave nodes. HadoopDB components
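
To make the Catalog idea concrete, here is a minimal sketch of reading connection parameters and table metadata from an XML document. The element and attribute names (`catalog`, `node`, `database`, `table`, `partition_key`), the hosts and the JDBC URLs are all hypothetical; the slides do not give HadoopDB's real catalog schema, and the sketch reads from a string rather than from HDFS.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout, invented for illustration only.
CATALOG_XML = """
<catalog>
  <node host="worker1">
    <database url="jdbc:postgresql://worker1:5432/part0"
              user="hdb" driver="org.postgresql.Driver">
      <table name="sales" partition_key="customer_id"/>
    </database>
  </node>
  <node host="worker2">
    <database url="jdbc:postgresql://worker2:5432/part1"
              user="hdb" driver="org.postgresql.Driver">
      <table name="sales" partition_key="customer_id"/>
    </database>
  </node>
</catalog>
"""

def load_catalog(xml_text):
    """Parse the catalog into {host: [database description, ...]}."""
    root = ET.fromstring(xml_text)
    catalog = {}
    for node in root.findall("node"):
        databases = []
        for db in node.findall("database"):
            databases.append({
                # Connection parameters a connector would need.
                "url": db.get("url"),
                "user": db.get("user"),
                "driver": db.get("driver"),
                # Metadata: which tables live here and how they are partitioned.
                "tables": {t.get("name"): t.get("partition_key")
                           for t in db.findall("table")},
            })
        catalog[node.get("host")] = databases
    return catalog

catalog = load_catalog(CATALOG_XML)
```

With a structure like this, the master node can decide which worker holds which partition, and each worker can look up how to connect to its local database.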

22  Data loader:  Global hasher:  Custom MapReduce job  files in HDFS;  Repartitioning data upon loading.  Local hasher:  Copies partition from HDFS to local file system;  Partitions the file into smaller chunks. HadoopDB Components (2)
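
The two-level partitioning done by the data loader can be sketched as follows. This is an in-memory illustration, not the loader itself: `global_hash` and `local_hash` are hypothetical names, rows are plain tuples, and the choice of `crc32` and of hash-based chunking on the local level are assumptions.

```python
import zlib

def _stable_hash(value):
    """Hash that is identical across processes and runs."""
    return zlib.crc32(str(value).encode("utf-8"))

def global_hash(rows, key_index, num_nodes):
    """Global step: split all rows into one partition per node by key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[_stable_hash(row[key_index]) % num_nodes].append(row)
    return partitions

def local_hash(partition, key_index, num_nodes, num_chunks):
    """Local step: split one node's partition into smaller chunks."""
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        # Divide by num_nodes first so the chunk choice is not tied to the
        # remainder already used by the global step.
        h = _stable_hash(row[key_index]) // num_nodes
        chunks[h % num_chunks].append(row)
    return chunks

rows = [(i % 5, "v%d" % i) for i in range(20)]
node_partitions = global_hash(rows, key_index=0, num_nodes=3)
node_chunks = [local_hash(p, key_index=0, num_nodes=3, num_chunks=2)
               for p in node_partitions]
```

Every row with the same key ends up on the same node and in the same chunk, which is what later allows work on that key to stay local.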

23  SQL to MapReduce:  Parallel database front-end to process SQL queries.  HiveQL  transformed into MapReduce jobs:  Connect to tables stored in HDFS;  Consist of DAGs of relational operators that operate as iterators.  Hive assumes no collocation of tables:  Operations on multiple tables  Reduce function.  This does NOT hold in HadoopDB: a join operation can be pushed to the database layer. HadoopDB Components (3)

24  SQL/SMS planner:  Modifies Hive:  Updates the MetaStore  Two passes over the physical plan: 1. Determine the partition keys for the Reduce Sink Operators. 2. Operators are:  converted into SQL queries;  pushed into the database layer.  Only filter, select and aggregation operators. HadoopDB Components (4)
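
The effect of pushing filter, select and aggregation operators into the database layer can be sketched with two in-memory databases standing in for two worker nodes. Several things here are assumptions: `sqlite3` replaces PostgreSQL so the sketch runs anywhere, the `sales` table and the query are invented, and the SQL string is written by hand rather than generated by a planner.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create one worker's local database holding its partition of 'sales'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Filter (WHERE), select (column list) and aggregation (SUM ... GROUP BY)
# all run inside the node's database instead of in MapReduce code.
PUSHED_SQL = ("SELECT region, SUM(amount) FROM sales "
              "WHERE amount > 10 GROUP BY region")

def map_task(conn):
    """Map side: run the pushed-down SQL locally, emit partial aggregates."""
    return conn.execute(PUSHED_SQL).fetchall()

def reduce_task(partials):
    """Reduce side: merge the per-node partial sums by key."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)

nodes = [
    make_node([("eu", 20), ("us", 5), ("eu", 30)]),
    make_node([("us", 40), ("eu", 15), ("us", 8)]),
]
partials = [row for conn in nodes for row in map_task(conn)]
totals = reduce_task(partials)
```

Only the small per-node aggregates cross the network; the filtering and most of the summing happen next to the data, which is the point of pushing operators into the database layer.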

25 HadoopDB Components (5)

26 Questions?
