Published by Macy Sadlier. Modified over 9 years ago.
1
HadoopDB Inneke Ponet
2
Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components
3
More and more data needs to be stored and processed. People want to run increasingly complex calculations on the data they collect. Analytical databases are moving from high-end machines towards cheaper, lower-end machines. The analytical database market is 27% of the database software market and is growing at a rate of 10.3% annually. Introduction
4
Parallel databases: good performance, good efficiency. MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructured data. Technologies for data analysis
5
Support standard relational tables and SQL. Implement techniques for better performance: indexing, compression, materialized views, result caching, I/O sharing. Data is partitioned across nodes (shared-nothing architecture), transparently to the end user. Parallel databases
6
The DBMSs of most analytical databases are deployed on a shared-nothing architecture: a collection of machines that are independent, are possibly virtual, have their own local disk and local main memory, and are connected by a high-speed network. Scales by adding machines. Analysis tasks are easy to parallelize. Shared-nothing architecture
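The claim that analysis tasks parallelize easily on independent nodes can be illustrated with a minimal sketch. It assumes rows are hash-partitioned on an integer key and that an average is computed from per-node partial results; the names `partition_rows`, `local_aggregate` and `global_average` and the sample data are hypothetical and do not come from any particular system.

```python
def partition_rows(rows, num_nodes, key):
    """Assign each row to exactly one node by hashing its partition key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[key] % num_nodes].append(row)
    return nodes


def local_aggregate(rows):
    """Runs on one node, touching only that node's local data."""
    return (sum(r["amount"] for r in rows), len(rows))


def global_average(partials):
    """Runs on a coordinator: merges the small per-node (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Hypothetical sample data: six rows spread over three nodes.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 7)]
nodes = partition_rows(rows, 3, "customer_id")
avg = global_average([local_aggregate(n) for n in nodes])
```

Only the small partial results cross the network; the raw rows never leave the node that stores them.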
7
A technology from Google: processes (un)structured data that is distributed on many nodes in a shared-nothing cluster; works at enormous scale. Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles. MapReduce
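A Map-repartition-Reduce cycle can be sketched in a single process as a word count. This is an illustration of the programming model only, not Hadoop's API; the function names and the toy partitioning rule (sum of character codes modulo the number of reducers) are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def repartition(pairs, num_reducers):
    """Repartition: send each key to one reducer bucket and group its values."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Toy, run-stable partitioner (assumption); real systems hash the key.
        idx = sum(map(ord, key)) % num_reducers
        buckets[idx][key].append(value)
    return buckets


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    # Each record is mapped independently, with no communication between calls.
    mapped = [pair for record in records for pair in map_phase(record)]
    result = {}
    for bucket in repartition(mapped, num_reducers):
        for key, values in bucket.items():
            word, total = reduce_phase(key, values)
            result[word] = total
    return result


counts = run_job(["Hadoop stores data", "Hive queries data on Hadoop"])
```

Because every key lands in exactly one bucket, the reducers also work without talking to each other.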
8
No detailed query execution plan is fixed in advance: scheduling happens at runtime and adjusts to node failures and slow nodes by (re)assigning tasks to faster nodes. Output is checkpointed to local disk, which minimizes the work that must be redone after a failure. MapReduce: advantages
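Runtime reassignment can be sketched with a toy scheduler. It assumes tasks are handed out round-robin, that a failure is detected when a task is given to a failing worker, and that finished tasks count as checkpointed and are never rerun; `run_with_reassignment` and the worker names are hypothetical, and this is not how Hadoop's scheduler is implemented.

```python
from collections import deque


def run_with_reassignment(tasks, workers, failing):
    """Hand out tasks round-robin; re-queue a task when its worker fails."""
    pending = deque(tasks)
    healthy = list(workers)
    done = {}  # task -> worker that completed it (treated as checkpointed)
    turn = 0
    while pending:
        if not healthy:
            raise RuntimeError("no healthy workers left")
        task = pending.popleft()
        worker = healthy[turn % len(healthy)]
        turn += 1
        if worker in failing:
            # Drop the failed worker and reschedule only this one task.
            healthy.remove(worker)
            pending.append(task)
            continue
        done[task] = worker
    return done


done = run_with_reassignment(["t0", "t1", "t2", "t3"], ["w0", "w1", "w2"], {"w1"})
```

The job finishes even though `w1` fails, and only the task that was running on `w1` is executed again.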
9
Hybrid database: a combination of a traditional DBMS and MapReduce technology. Developed by Yale University students Azza Abouzeid and Kamil Bajda-Pawlikowski. It is free and open source. HadoopDB
10
A. Performance B. Fault tolerance C. Heterogeneous environment D. Flexible query interface E. Scalability Desired properties
11
The primary characteristic used to distinguish systems. MapReduce: data is not first modeled and loaded before processing, so indexing and similar techniques are unavailable, which gives slower performance than parallel databases. Cost saving: a faster software product is cheaper than a hardware upgrade or buying additional hardware. A. Performance (parallel databases vs. MapReduce)
12
For transactional workloads: successfully commit transactions. For analytical workloads: keep making progress on a workload. Heterogeneity and scalability lead to more faults, BUT MapReduce has good fault tolerance: tasks are reassigned, and splitting work into sub-tasks minimizes the effect of a fault. Parallel databases: assume that failures are rare; more testing => slower performance. B. Fault tolerance (parallel databases vs. MapReduce)
13
Nodes do not always run on identical hardware or on an identical virtual machine, so their performance differs. Parallel databases: not tested on more than 100 nodes. C. Heterogeneous environment (parallel databases vs. MapReduce)
14
Queries should be easy to write: SQL and non-SQL interface languages, use of tools. A robust mechanism for writing user-defined functions (UDFs). Parallel databases: SQL, ODBC and UDFs. MapReduce-based systems: possible with Hive, but not with plain Hadoop. D. Flexible query interface (parallel databases vs. MapReduce)
15
Traditional DBMS: only scalable to 100 nodes. MapReduce-based systems: designed to scale to thousands of nodes in a shared-nothing architecture. Reasons and the assumptions behind them: Failures: failures are rare. Heterogeneity: a homogeneous array of machines. Not tested: there are no applications with more than a few dozen nodes. E. Scalability (parallel databases vs. MapReduce)
16
Comparison of parallel databases and MapReduce on: performance, fault tolerance, heterogeneous environment, flexible query interface, scalability. Desired properties
17
Communication: Hadoop Database: PostgreSQL Translation: Hive Layers of HadoopDB
18
Communication layer of HadoopDB. The Hadoop framework has two layers: the Hadoop Distributed File System (HDFS) and the MapReduce framework. Cost: free, open-source MapReduce implementation. Hadoop
19
Relational DBMS. (Possible) database layer of HadoopDB. Cost: free/open source. PostgreSQL
20
Translation layer. Processing of a SQL query: Query Abstract Syntax Tree. MetaStore: schema of the table(s). Logical query plan: DAG of relational operators. Optimized plan. Physical executable plan: MapReduce job(s). XML plan: DAG serialized. Hive Driver executes a Hadoop job. Hive
21
Database Connector: interface between the independent database systems and Hadoop; extends Hadoop's InputFormat class; connects to any JDBC-compliant database. Catalog: meta-information about the databases: connection parameters, metadata. Stored as an XML file in HDFS, accessed by the master node and the worker/slave nodes. HadoopDB components
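How a node might look up connection parameters in the catalog can be sketched as follows. The XML layout, element names and attribute names here are entirely hypothetical; the slide only states that the catalog is an XML file holding connection parameters and metadata. The sketch reads the XML from a string rather than from HDFS.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout (assumption): one <node> per worker, each
# listing its local databases and the table partitions they hold.
CATALOG_XML = """
<catalog>
  <node host="worker1">
    <database url="jdbc:postgresql://worker1:5432/db0">
      <table name="sales" partition_key="customer_id"/>
    </database>
  </node>
  <node host="worker2">
    <database url="jdbc:postgresql://worker2:5432/db0">
      <table name="sales" partition_key="customer_id"/>
      <table name="customers" partition_key="customer_id"/>
    </database>
  </node>
</catalog>
"""


def load_catalog(xml_text):
    """Flatten the catalog into one entry per (host, database, table)."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for node in root.findall("node"):
        for db in node.findall("database"):
            for table in db.findall("table"):
                entries.append({
                    "host": node.get("host"),
                    "url": db.get("url"),
                    "table": table.get("name"),
                    "partition_key": table.get("partition_key"),
                })
    return entries


def connections_for(entries, table_name):
    """Return the JDBC URLs of every database holding part of a table."""
    return sorted({e["url"] for e in entries if e["table"] == table_name})


entries = load_catalog(CATALOG_XML)
sales_urls = connections_for(entries, "sales")
```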
22
Data loader: Global hasher: a custom MapReduce job over the files in HDFS; repartitions the data upon loading. Local hasher: copies a partition from HDFS to the node's local file system; partitions the file into smaller chunks. HadoopDB Components (2)
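The two hashing steps can be sketched on in-memory rows. This assumes integer partition keys, a modulo hash for the global step and a different hash (integer division, then modulo) for the local step so that chunks stay balanced; the function names and the `max_chunk_rows` parameter are hypothetical, and the real loader works on files rather than Python lists.

```python
def global_hash(rows, key, num_nodes):
    """Global step: repartition all rows into one partition per node."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[row[key] % num_nodes].append(row)
    return parts


def local_hash(partition, key, num_nodes, max_chunk_rows):
    """Local step: split one node's partition into smaller chunks.

    max_chunk_rows is a target used to pick the number of chunks; hashing
    does not guarantee that every chunk stays below it.
    """
    num_chunks = max(1, -(-len(partition) // max_chunk_rows))  # ceiling division
    chunks = [[] for _ in range(num_chunks)]
    for row in partition:
        # Use a different hash than the global step, otherwise all rows of a
        # partition would share the same remainder and land in one chunk.
        chunks[(row[key] // num_nodes) % num_chunks].append(row)
    return chunks


rows = [{"id": i} for i in range(12)]
partitions = global_hash(rows, "id", num_nodes=2)
layout = [local_hash(p, "id", num_nodes=2, max_chunk_rows=3) for p in partitions]
chunk_sizes = [[len(c) for c in node_chunks] for node_chunks in layout]
```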
23
SQL to MapReduce: a parallel database front-end that processes SQL queries. HiveQL is transformed into MapReduce jobs, which connect to tables stored in HDFS and consist of DAGs of relational operators that operate as iterators. Hive assumes no collocation of tables, so operations on multiple tables are processed in the Reduce function. This does NOT hold in HadoopDB: a join operation can be pushed to the database layer. HadoopDB Components (3)
24
SQL/SMS planner: modifies Hive. Updates the MetaStore. Makes two passes over the physical plan: 1. Determine the partition keys for the Reduce Sink operators. 2. Convert operators into SQL queries and push them into the database layer. Only filter, select and aggregation operators are pushed. HadoopDB Components (4)
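The effect of pushing filter and aggregation operators into the database layer can be sketched with Python's standard `sqlite3` module standing in for the per-node PostgreSQL instances. The `sales` table, its columns and the sample rows are assumptions made for the example; each Map task runs the pushed-down SQL locally and the Reduce step merges the partial aggregates on the grouping key.

```python
import sqlite3


def make_node(rows):
    """Create one node-local database (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


# Filter, select and aggregation are expressed as SQL and run inside the DBMS.
PUSHED_SQL = "SELECT region, SUM(amount) FROM sales WHERE year = ? GROUP BY region"


def map_task(conn, year):
    """Map: execute the pushed-down query against the local database."""
    return conn.execute(PUSHED_SQL, (year,)).fetchall()


def reduce_task(partials):
    """Reduce: merge per-node partial sums, keyed on region."""
    totals = {}
    for region, subtotal in partials:
        totals[region] = totals.get(region, 0.0) + subtotal
    return totals


nodes = [
    make_node([("EU", 2009, 10.0), ("US", 2009, 5.0), ("EU", 2008, 99.0)]),
    make_node([("EU", 2009, 2.5), ("US", 2009, 1.5)]),
]
partials = [row for conn in nodes for row in map_task(conn, 2009)]
totals = reduce_task(partials)
```

Each database returns at most one row per region, so far less data reaches the Reduce step than if the raw rows were shipped.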
25
HadoopDB Components (5)
26
Questions?