Institut für Scientific Computing – Universität Wien, P. Brezany: Parallel Databases


Parallel Databases
Univ.-Prof. Dr. Peter Brezany, Institut für Scientific Computing, Universität Wien
Tel: Office hours: Tue, LV-Portal:

Introduction
Distributed DB technology can be naturally extended to implement parallel database systems, i.e., database systems on parallel computers. Parallel database systems exploit the parallelism in data management in order to deliver high-performance and high-availability database servers.
Parallel processing exploits multiprocessor computers to run application programs by using several processors cooperatively, in order to improve performance. Parallel DB systems combine DB management and parallel processing to increase performance and availability.
Performance was also the objective of the database machines of the 1970s and 1980s. Contributions:
- Increasing the I/O bandwidth through parallelism: if we store a DB of size D on a single disk with throughput T, the system throughput is bounded by T. If we partition the DB across n disks, each with capacity D/n and throughput T' (hopefully equivalent to T), we get an ideal throughput of n × T', which can be better consumed by multiple processors (ideally n).
- Therefore the next objective was to develop software-oriented solutions in order to exploit multiprocessor hardware.

Introduction (cont.)
The objectives of parallel DB systems can be achieved by extending distributed DB technology, for example, by partitioning the DB across multiple (small) disks so that much inter- and intra-query parallelism can be obtained. This can lead to significant improvements in both response time and throughput (number of transactions per second).
Examples of parallel DB products: Teradata, implementations of DB2/PE (Parallel Edition), INFORMIX and ORACLE on parallel computers.
A parallel DB system acts as a server for multiple application servers in the now common client/server organization in computer networks.

General Architecture of a Parallel DB System

General Architecture of a Parallel DB System (cont.)
1. Session Manager. It provides support for client interactions with the server. It performs the connections and disconnections between the client processes and the two other subsystems. Therefore, it initiates and closes user sessions (which may contain multiple transactions).
2. Request Manager. It receives client requests related to query compilation and execution. It can access the DB directory, which holds all meta-information about data and programs. The directory itself should be managed as a DB in the server. It activates the various compilation phases, triggers query execution and returns the results as well as error codes to the client applications. It may trigger the recovery procedure in case of transaction failure. To speed up query execution, it may optimize and parallelize the query at compile time.
3. Data Manager. It provides all the low-level functions needed to run compiled queries in parallel, i.e., database operator execution, parallel transaction support, cache management, etc.

A Step Towards Parallelization

A Step Towards Parallelization (cont.)

Parallel DBMS Techniques: Data Placement
Data placement in a parallel DB exhibits similarities with data fragmentation in a distributed DB. (Fragmentation can be used to increase parallelism.) In parallel DB terminology, the terms partitioning and partition are used instead of fragmentation and fragment, respectively.
Load balancing is much more difficult to achieve in the presence of a large number of nodes.
Data placement must be done so as to maximize system performance, which can be measured by combining the total amount of work done by the system and the response time of individual queries.

Different Partitioning Schemes (cont.)
Round-robin partitioning: With n partitions, the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables sequential access to a relation to be done in parallel. However, direct access to individual tuples, based on a predicate, requires accessing the entire relation. Round-robin partitioning is excellent if all applications want to access the relation by sequentially scanning all of it on each query. The problem with round-robin partitioning is that applications frequently want to access tuples associatively, meaning that the application wants to find all the tuples having a particular attribute value. The SQL query looking for the Smiths in the phone book is an example of an associative search.
Hash partitioning applies a hash function to some attribute, which yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node and all other queries to be processed by all the nodes in parallel. Hash partitioning is ideally suited for applications that want only sequential and associative access to the data. Tuples are placed by applying a hashing function to an attribute of each tuple. The function specifies the placement of the tuple on a particular disk. Associative access to the tuples with a specific attribute value can be directed to a single disk, avoiding the overhead of starting queries on multiple disks. Hash partitioning mechanisms are provided by Arbre, Bubba, Gamma, and Teradata.
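The two placement rules above can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: the function names, the toy `phone_book` relation, and the choice of CRC32 as the hash function are assumptions made for the example, and tuples are counted from 0.

```python
from zlib import crc32

def round_robin_partition(tuples, n):
    """Assign the i-th tuple (0-based, in insertion order) to partition i mod n."""
    partitions = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        partitions[i % n].append(t)
    return partitions

def hash_partition_id(value, n):
    """Map an attribute value to a partition number (CRC32 is an illustrative choice)."""
    return crc32(str(value).encode("utf-8")) % n

def hash_partition(tuples, attr, n):
    """Place each tuple on the partition selected by hashing its `attr` value."""
    partitions = [[] for _ in range(n)]
    for t in tuples:
        partitions[hash_partition_id(t[attr], n)].append(t)
    return partitions

# Hypothetical phone book relation.
phone_book = [
    {"name": "Smith", "phone": "555-0101"},
    {"name": "Jones", "phone": "555-0102"},
    {"name": "Smith", "phone": "555-0103"},
    {"name": "Meyer", "phone": "555-0104"},
    {"name": "Smith", "phone": "555-0105"},
]

n = 4
rr = round_robin_partition(phone_book, n)
hp = hash_partition(phone_book, "name", n)

# Associative search under hashing: only one partition has to be read.
target = hash_partition_id("Smith", n)
smiths = [t for t in hp[target] if t["name"] == "Smith"]

# Under round-robin the same search has to read every partition.
smiths_rr = [t for p in rr for t in p if t["name"] == "Smith"]
```

Both searches return the same three tuples; the difference is that the hashed layout touches a single partition while the round-robin layout scans all of them.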

Different Partitioning Schemes (cont.)
Range partitioning distributes tuples based on the value intervals (ranges) of some attribute. In addition to supporting exact-match queries as with hashing, it is well suited for range queries. For instance, a query with a predicate "A between A1 and A2" may be processed by only the node(s) containing tuples whose A value is in [A1, A2]. However, range partitioning can result in high variation in partition size.
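A small Python sketch can make both points concrete: which nodes a range predicate has to visit, and how uneven the partitions can become. The boundary values, the attribute name `A`, and the helper names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

def range_partition_id(value, boundaries):
    """Partition k holds values v with boundaries[k-1] < v <= boundaries[k];
    the last partition holds everything above the final boundary."""
    return bisect_left(boundaries, value)

def range_partition(tuples, attr, boundaries):
    """Distribute tuples over len(boundaries) + 1 partitions by the value of `attr`."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        partitions[range_partition_id(t[attr], boundaries)].append(t)
    return partitions

def nodes_for_range(a1, a2, boundaries):
    """Partitions that can contain tuples with a1 <= A <= a2."""
    first = range_partition_id(a1, boundaries)
    last = range_partition_id(a2, boundaries)
    return list(range(first, last + 1))

# Assumed split points: (-inf, 10], (10, 20], (20, 30], (30, +inf)
BOUNDARIES = [10, 20, 30]

rows = [{"A": v} for v in (1, 2, 3, 4, 5, 6, 7, 8, 15, 35)]
partitions = range_partition(rows, "A", BOUNDARIES)
sizes = [len(p) for p in partitions]

range_nodes = nodes_for_range(12, 25, BOUNDARIES)   # range query touches two nodes
exact_nodes = nodes_for_range(20, 20, BOUNDARIES)   # exact match touches one node
```

With these sample values most tuples land in the first partition, which is the size variation the text warns about.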

Query Parallelism
Inter-query parallelism enables the parallel execution of multiple queries generated by concurrent transactions, in order to increase the transactional throughput.
Within a query (intra-query parallelism), inter-operator and intra-operator parallelism are used to decrease response time.
Inter-operator parallelism is obtained by executing several operators of the query tree in parallel on several processors.
Intra-operator parallelism: the same operator is executed by many processors, each one working on a subset of the data.

Intra-operator Parallelism
It is based on the decomposition of one operator into a set of independent sub-operators, called operator instances. Each operator instance then processes one relation partition, also called a bucket.
Example: Consider a simple select query.
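As a rough sketch of the select example, each operator instance below filters one bucket and the partial results are concatenated. The relation, the predicate, and the function names are hypothetical; a thread pool stands in for separate processors purely for illustration, since Python threads do not give real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def select_instance(bucket, predicate):
    """One operator instance: apply the select predicate to a single bucket."""
    return [t for t in bucket if predicate(t)]

def parallel_select(buckets, predicate):
    """Run one select instance per bucket and concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(buckets))) as pool:
        partial_results = pool.map(lambda b: select_instance(b, predicate), buckets)
        return [t for part in partial_results for t in part]

# A relation that has already been partitioned into three buckets.
buckets = [
    [{"id": 1, "age": 25}, {"id": 2, "age": 41}],
    [{"id": 3, "age": 37}],
    [{"id": 4, "age": 19}, {"id": 5, "age": 52}],
]

result = parallel_select(buckets, lambda t: t["age"] > 30)
result_ids = sorted(t["id"] for t in result)
```

The instances share no state, which is what makes the decomposition into independent sub-operators possible.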

Inter-Operator Parallelism
With pipeline parallelism, several operators with a producer-consumer link are executed in parallel. For example, the select operator in the figure below will be executed in parallel with the subsequent join operator.
- The advantage is that the intermediate result is not materialized, saving memory and disk accesses; e.g., only S and R may fit in memory.
Independent parallelism is achieved when there is no dependency between the operators executed in parallel. Example: the two select operators in the figure below.
Figure (labels only): Join, Select, S, R, Intermediate Results.
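The producer-consumer link can be illustrated with Python generators, where the join consumes each selected tuple as soon as the select produces it, so the intermediate result is never stored as a whole. This is only a sketch: generators interleave the two operators rather than running them on separate processors, the join shown is a simple in-memory hash join chosen for brevity, and the relations and names are made up.

```python
def scan(relation):
    """Producer: emit the tuples of a stored relation one at a time."""
    for t in relation:
        yield t

def select(tuples, predicate):
    """Pipelined select: pass on each qualifying tuple as soon as it arrives."""
    for t in tuples:
        if predicate(t):
            yield t

def hash_join(probe_stream, build_relation, key):
    """Consumer: build a hash table on one input, then probe it with the stream."""
    table = {}
    for s in build_relation:
        table.setdefault(s[key], []).append(s)
    for r in probe_stream:
        for s in table.get(r[key], []):
            yield {**s, **r}

R = [{"k": 1, "a": 10}, {"k": 2, "a": 20}, {"k": 3, "a": 30}]
S = [{"k": 2, "b": "x"}, {"k": 3, "b": "y"}, {"k": 3, "b": "z"}]

# select(R) feeds the join directly; no list of selected tuples is ever built.
pipeline = hash_join(select(scan(R), lambda t: t["a"] >= 20), S, "k")
joined = list(pipeline)
```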

Parallel Data Processing
It should exploit intra-operator parallelism. The focus of research is mainly on parallel algorithms for the select and join operators.
The parallel processing of select in a partitioned data placement context is identical to that in a fragmented distributed DB. Depending on the select predicate, the operator may be executed at a single node (in the case of an exact-match predicate) or, in the case of arbitrarily complex predicates, at all the nodes over which the relation is partitioned.
The parallel processing of join is significantly more involved than that of select. The parallel nested loop (PNL) algorithm is a basic parallel join algorithm (see next slide).
Remark: "for i from 1 to n do in parallel action A" indicates that the action A is to be executed by n nodes in parallel.

PNL - Example
The figure below shows the application of PNL with m = n = 2.
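Since the pseudocode of PNL is not part of the text above, the following Python sketch relies on the form in which the algorithm is commonly described, which is an assumption here: in a first phase every fragment of R is sent to each node holding a fragment of S, and in a second phase each of those nodes joins all of R with its local S fragment by nested loops. The relations, the join attribute `k`, and the function names are hypothetical, and a thread pool merely stands in for the n nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def nested_loop_join(r_tuples, s_tuples, key):
    """Local join at one node: compare every R tuple with every S tuple."""
    return [{**r, **s} for r in r_tuples for s in s_tuples if r[key] == s[key]]

def parallel_nested_loop_join(r_fragments, s_fragments, key):
    # Phase 1 (assumed): each of the m R fragments is sent to all n S nodes,
    # so every S node ends up with a full copy of R.
    received_r = [[t for frag in r_fragments for t in frag] for _ in s_fragments]

    # Phase 2 (assumed): each S node j joins R with its local fragment S_j.
    with ThreadPoolExecutor(max_workers=max(1, len(s_fragments))) as pool:
        partial = pool.map(
            lambda pair: nested_loop_join(pair[0], pair[1], key),
            zip(received_r, s_fragments),
        )
        return [t for part in partial for t in part]

# m = n = 2: two fragments of R and two fragments of S.
R1 = [{"k": 1, "a": "r1"}, {"k": 2, "a": "r2"}]
R2 = [{"k": 3, "a": "r3"}]
S1 = [{"k": 2, "b": "s1"}]
S2 = [{"k": 3, "b": "s2"}, {"k": 4, "b": "s3"}]

pnl_result = parallel_nested_loop_join([R1, R2], [S1, S2], "k")
pnl_keys = sorted(t["k"] for t in pnl_result)
```

The union of the n local results is the full join, because every S tuple lives on exactly one node and every node sees all of R.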

Parallel Query Optimization
It has similarities with distributed query processing. It should take advantage of both intra-operator parallelism (e.g., using PNL) and inter-operator parallelism (using some of the techniques devised for distributed DB systems).
Parallel query optimization refers to the process of producing an execution plan for a given query that minimizes an objective cost function.
Components:
- Search space: the set of alternative execution plans that represent the input query; they are semantically equivalent.
- Cost model: predicts the cost of a given execution plan.
- Search strategy: explores the search space and selects the best plan. It defines which plans are examined and in which order.

Load Balancing
The response time of a set of parallel operators is that of the longest one.
Load balancing problems can appear with intra-operator parallelism (variation in partition size), namely data skew, and with inter-operator parallelism (variation in the complexity of operators).
Inter-operator load balancing
It is necessary to choose, for each operator, how many and which processors to assign for its execution. Suppose we have a perfect cost model which allows evaluating the sequential execution time of each operator.
- We then need to find a way to assign processors to operators in order to obtain the best load balancing.
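One way to picture the assignment problem is a simple greedy heuristic, sketched below in Python. It is not a method given in the text: it assumes the sequential times come from the perfect cost model mentioned above, that an operator on p processors takes exactly 1/p of its sequential time (ideal speedup), and the operator names and timings are invented.

```python
import heapq

def assign_processors(seq_times, total_processors):
    """Greedy sketch: give every operator one processor, then hand each
    remaining processor to the operator that is currently the slowest."""
    if total_processors < len(seq_times):
        raise ValueError("need at least one processor per operator")
    alloc = {op: 1 for op in seq_times}
    # Max-heap on current parallel time (negated, since heapq is a min-heap).
    heap = [(-t, op) for op, t in seq_times.items()]
    heapq.heapify(heap)
    for _ in range(total_processors - len(seq_times)):
        _, op = heapq.heappop(heap)
        alloc[op] += 1
        heapq.heappush(heap, (-seq_times[op] / alloc[op], op))
    return alloc

def response_time(seq_times, alloc):
    """Response time of the operator set is that of the longest operator."""
    return max(seq_times[op] / alloc[op] for op in seq_times)

# Hypothetical sequential execution times (arbitrary time units).
seq_times = {"scan1": 40.0, "scan2": 20.0, "join": 120.0}

alloc = assign_processors(seq_times, total_processors=9)
balanced_time = response_time(seq_times, alloc)
naive_time = response_time(seq_times, {"scan1": 3, "scan2": 3, "join": 3})
```

With nine processors the heuristic gives most of them to the join, and the response time drops compared with an even three-way split.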

Intra-Operator Load Balancing
Effects of skewed data distribution on a parallel execution:
- Attribute value skew (AVS) is skew inherent in the dataset (e.g., there are more citizens in Vienna than in St. Pölten).
- Tuple placement skew (TPS) is the skew induced when the data is initially partitioned (on disk) (e.g., with range partitioning).
- Selectivity skew (SS) is induced when there is variation in the selectivity of select predicates on each node.
- Redistribution skew (RS) occurs in the redistribution step between two operators. It is similar to TPS.
- Join product skew (JPS) occurs because the join selectivity may vary between nodes.
The next figure illustrates the above classification on a query applied to relations R and S which are poorly partitioned.
- Such poor partitioning stems from either the data (AVS) or the partitioning function (TPS). The boxes are proportional to the size of the corresponding partitions.
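To put a number on how poorly a relation is partitioned, one can compare the largest partition with the average one. The Python sketch below uses this ratio as a hypothetical "skew factor"; the metric name, the one-time-unit-per-tuple cost assumption, and the sample sizes are illustrative and not taken from the text.

```python
def skew_factor(partition_sizes):
    """Ratio of the largest partition to the mean partition size.
    1.0 means perfectly even placement; larger values mean more skew."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

def scan_response_time(partition_sizes, cost_per_tuple=1.0):
    """A parallel scan finishes only when its largest partition is done."""
    return max(partition_sizes) * cost_per_tuple

even = [100, 100, 100, 100]     # well partitioned
skewed = [250, 50, 50, 50]      # e.g. the result of AVS or TPS

even_factor = skew_factor(even)
skewed_factor = skew_factor(skewed)
slowdown = scan_response_time(skewed) / scan_response_time(even)
```

Both layouts hold 400 tuples, yet under these assumptions the skewed one takes 2.5 times as long, because three of the four instances sit idle while the largest partition is processed.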

Intra-Operator Load Balancing (cont.)

Intra-Operator Load Balancing (cont.)
The processing times of the two instances of scan1 and scan2 are not equal. The case of the join operator is worse.
- The number of tuples received differs from one instance to another because of poor redistribution of the partitions of R (RS) or variable selectivity according to the partition of R processed (SS).
- The uneven size of the S partitions (AVS/TPS) yields different processing times for tuples sent by the scan operator, and the result size differs from one partition to the other due to join selectivity (JPS).
It seems difficult to propose a solution based on estimates on the relations involved. Such a strategy would require statistics such as histograms on the join attribute, the fragmentation attribute, and the attributes involved in predicates, potentially on all attributes of all relations. Furthermore, a cost model would be necessary to evaluate the distribution of intermediate results.
A more reasonable strategy is to use a dynamic approach, i.e., to redistribute the load dynamically in order to balance the execution.
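The benefit of a dynamic approach can be sketched with a small simulation in Python. It is one possible reading of "redistribute the load dynamically", not the scheme of any particular system: partitions are cut into fixed-size chunks and each chunk goes to whichever worker becomes free first. The chunk size, the cost of one time unit per tuple, and the omission of data transfer costs are all simplifying assumptions.

```python
import heapq

def static_makespan(partition_sizes):
    """Static execution: each worker processes exactly its own partition."""
    return max(partition_sizes)

def dynamic_makespan(partition_sizes, chunk_size, workers):
    """Dynamic execution: work is cut into chunks and the next chunk always
    goes to the worker that becomes free first (transfer cost ignored)."""
    chunks = []
    for size in partition_sizes:
        full, rest = divmod(size, chunk_size)
        chunks.extend([chunk_size] * full + ([rest] if rest else []))
    finish_times = [0] * workers      # min-heap of per-worker finish times
    heapq.heapify(finish_times)
    for chunk in chunks:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + chunk)
    return max(finish_times)

skewed_partitions = [250, 50, 50, 50]

static_time = static_makespan(skewed_partitions)
dynamic_time = dynamic_makespan(skewed_partitions, chunk_size=25, workers=4)
```

No statistics about the data are needed here; the balance emerges at run time because idle workers keep pulling work from the oversized partition.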