Parallel Database Systems

Slides:



Advertisements
Similar presentations
Parallel Databases These slides are a modified version of the slides of the book “Database System Concepts” (Chapter 18), 5th Ed., McGraw-Hill, by Silberschatz,
Advertisements

Parallel Databases By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
04/25/2005Yan Huang - CSCI5330 Database Implementation – Parallel Database Parallel Databases.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 21: Parallel Databases.
Parallel Database Systems The Future Of High Performance Database Systems David Dewitt and Jim Gray 1992 Presented By – Ajith Karimpana.
Parallel Database Systems
Data Warehousing 1 Lecture-25 Need for Speed: Parallelism Methodologies Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Parallel DBMS Slides adapted from textbook; from Joe Hellerstein; and from Jim Gray, Microsoft Research. DBMS Textbook Chapter 22.
José Alferes Versão modificada de Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan Chapters 20: Database System Architectures.
Chapter 1 Introduction 1.1A Brief Overview - Parallel Databases and Grid Databases 1.2Parallel Query Processing: Motivations 1.3Parallel Query Processing:
Parallel DBMS Chapter 22, Part A
CMSC724: Database Management Systems Instructor: Amol Deshpande
Institut für Scientific Computing – Universität WienP.Brezany Parallel Databases Univ.-Prof. Dr. Peter Brezany Institut für Scientific Computing Universität.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
1 Parallel DBMS Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also:
TDD: Topics in Distributed Databases
Fall 2008Parallel Databases1. Fall 2008Parallel Databases2 Ideal Parallel Systems Two key properties:  Linear Speedup: Twice as much hardware can perform.
Parallel Algorithms for Relational Operations Class ID: 21 Name: Shujia Zhang.
Database System Concepts, 5th Ed. Chapter 20: Database System Architectures.
Chapter 5 Parallel Join 5.1Join Operations 5.2Serial Join Algorithms 5.3Parallel Join Algorithms 5.4Cost Models 5.5Parallel Join Optimization 5.6Summary.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
Sorting and Query Processing Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 29, 2005.
Database System Architectures  Client-server Database System  Parallel Database System  Distributed Database System Wei Jiang.
Parallel & Distributed databases Agenda –The problem domain of design parallel & distributed databases (chp 18-20) –The data allocation problem –The data.
Shilpa Seth.  Centralized System Centralized System  Client Server System Client Server System  Parallel System Parallel System.
PMIT-6102 Advanced Database Systems
Parallel DBMS Instructor : Marina Gavrilova
17.1Database System Concepts - 6 th Edition Chapter 17: Database System Architectures Centralized and Client-Server Systems Server System Architectures.
Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures.
Database Administration Database Architecture. 2 Outlines Centralized Architectures Client-Server Architectures Parallel Systems Distributed Systems Network.
Data Warehousing 1 Lecture-24 Need for Speed: Parallelism Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Chapter 21: Parallel Databases Chapter 21: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation.
Chapter 18 Database System Architectures Debbie Hui CS 157B.
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
April 26, CSE8380 Parallel and Distributed Processing Presentation Hong Yue Department of Computer Science & Engineering Southern Methodist University.
Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 18: Parallel.
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
Radix Sort and Hash-Join for Vector Computers Ripal Nathuji 6.893: Advanced VLSI Computer Architecture 10/12/00.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts 3 rd Edition Module 18: Database System Architectures Centralized Systems Client--Server.
Mapping the Data Warehouse to a Multiprocessor Architecture
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 6 th Edition Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism.
Lecture 14- Parallel Databases Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan Chapter 21: Parallel Databases.
©Silberschatz, Korth and Sudarshan20.1Database System Concepts 3 rd Edition Chapter 20: Parallel Databases Introduction I/O Parallelism Interquery Parallelism.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Parallel tree search: An algorithmic approach for multi- field packet classification Authors: Derek Pao and Cutson Liu. Publisher: Computer communications.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
Unit - 4 Introduction to the Other Databases.  Introduction :-  Today single CPU based architecture is not capable enough for the modern database.
CS 540 Database Management Systems
CS 440 Database Management Systems Parallel DB & Map/Reduce Some slides due to Kevin Chang 1.
Implementation of Database Systems, Jarek Gryz 1 Parallel DBMS Chapter 22, Part A.
©Silberschatz, Korth and Sudarshan16.1Database System Concepts 3 rd Edition Database System Architectures Centralized Systems Client--Server Systems Parallel.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 28, 2005 Session 29.
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
Chapter 20: Database System Architectures
Parallel Databases.
Interquery Parallelism
April 30th – Scheduling / parallel
Cse 344 May 2nd – Map/reduce.
Chapter 17: Database System Architectures
Akshay Tomar Prateek Singh Lohchubh
CSE8380 Parallel and Distributed Processing Presentation
Parallel DBMS Chapter 22, Part A
Parallel DBMS Chapter 22, Sections 22.1–22.6
Database System Architectures
The Gamma Database Machine Project
Parallel DBMS DBMS Textbook Chapter 22
Presentation transcript:

Parallel Database Systems Parallel Database Systems The Future of High Performance Database Systems Authors: David DeWitt and Jim Gray Presenter: Jennifer Tom Based on Presentation by Ajith Karimpana

Outline Why Parallel Databases? Parallel Database Implementation Parallel Database Systems Future Directions and Research Problems

Why Parallel Databases? Relational Data Model – Relational queries are ideal candidates for parallelization Multiprocessor systems using inexpensive microprocessors provide more power and scalability than expensive mainframe counterparts Why do the relational queries allow for parallelization? Operations can be executed in parallel on parallel streams of data Operations can be composed into dataflow graphs

Properties of Ideal Parallelism Linear Speedup – An N-times large system yields a speedup of N Linear Scaleup – An N-times larger system performs an N-times larger job in the same elapsed time as the original system Why are linear speedup and linear scaleup indicators of good parallelism? Linear Speedup: Performance improvements grows linearly as the number of processors increases (good scalability) Linear Scaleup: A system has the same performance when a job and the number of processors are both increased N times (maintain performance with larger jobs by increasing the number of processors)

Types of Scaleup Transactional – N-times as many clients submit N-times as many requests against an N-times larger database. Batch – N-times larger problem requires an N-times larger computer. Should we look at both transactional scaleup and batch scaleup for all situations when assessing performance? Transactional scaleup: Banking,

Barriers to Linear Speedup and Linear Scaleup Startup: The time needed to start a parallel operation dominates the total execution time Interference: A slowdown occurs as new processes are added when accessing shared resources Skew: A slowdown occurs when the variance exceeds the mean in job service times Give example situations where you might come across interference and skew barriers?

Shared-Nothing Machines Shared-memory – All processors have equal access to a global memory and all disks Shared-disk – Each processor has its own private memory, but has equal access to all disks Shared-nothing – Each processor has its own private memory and disk(s) Why do shared-nothing systems work well for database systems? Shared-memory: Interference. The interconnect must have the bandwidth of the sum of the processors and disks. Need a large private cache, but this degrades processor performance. As parallelism increases, interference on shared resources limits performance Shared-disk: Ok for read-only db with no concurrent sharing, but not effective for concurrent read and writes to a shared db. Shared-nothing: Reduces interference and it scales well. Good for shared database applications.

Parallel Dataflow Approach Relational data model operators take relations as input and produce relations as output Allows operators to be composed into dataflow graphs (operations can be executed in parallel) How would you compose a set of operations for an SQL query to be done in parallel?

Pipelined vs. Partitioned Parallelism Pipelined: The computation of one operator is done at the same time as another as data is streamed into the system. Partitioned: The same set of operations is performed on tuples partitioned across the disks. The set of operations are done in parallel on partitions of the data. Can you give brief examples of each? Which parallelism is generally better for databases? Partitioned Parallelism: Better opportunities for speedup and scaleup. Uses divide and conquer to achieve increased performance.

Data Partitioning Round-robin – For n processors, the ith tuple is assigned to the “i mod n” disk. Hash partitioning – Tuples are assigned to the disks using a hashing function applied to select fields of the tuple. Range partitioning – Clusters tuples with similar attribute values on the same disk. What operations/situations are ideal (or not ideal) for the above partitioning types? What issues to database performance may arise due to partitioning? Round-Robin Ideal for applications that wish to read entire relation sequentially for each query. Not ideal for point and range queries, since each of the n disks must be searched. Hash Ideal for point queries based on the partitioning attribute. Ideal for sequential scans of the entire relation. Not ideal for point queries on non-partitioning attributes. Not ideal for range queries on the partitioning attribute. Range Ideal for point and range queries on the partitioning attribute.

Skew Problems Data skew – The number of tuples vary widely across the partitions. Execution skew – Due to data skew, more time is spent to execute an operation on partitions that contain larger numbers of tuples. How could these skew problems be minimized? Hashing and round-robin. Range partitioning: pick non-uniformly-distributed partitioning criteria.

Parallel Execution of Relational Operators Rather than write new parallel operators, use parallel streams of data to execute existing relational operators in parallel Relational operators have (1) a set of input ports and (2) an output port Parallel dataflow – partitioning and merging of data streams into the operator ports Are there situations in which parallel dataflow does not work well? Sort-merge join and data skew

Simple Relational Dataflow Graph

Parallelization of Hash-Join Sort-merge join does not work well if there is data skew Hash-join is more resistant to data skew To join 2 relations, we divide the join into a set of smaller joins The union of these smaller joins should result in the join of the 2 relations How might other relational algorithms/operations (like sorting) be parallelized to improve query performance?

Parallel Database Systems Teradata – Shared-nothing systems that use commodity hardware. Near linear speedup and scaleup on relational queries. Tandem NonStop SQL – Shared-nothing architecture. Near linear speedup and scaleup for large relational queries. Gamma – Shared-nothing system. Provides hybrid-range partitioning. Near linear speed and scaleup measured. The Super Database Computer – Has special-purpose components. However, is shared-nothing architecture with dataflow. Bubba – Shared-nothing system, but uses shared-memory for message passing. Uses FAD rather than SQL. Uses single-level storage Should we focus on the use of commodity hardware to build parallel database systems?

Database Machines and Grosch’s Law Grosch’s Law – The performance of computer systems increase as the square of their cost. The more expensive systems have better cost to performance ratios. Less expensive shared-nothing systems like Gamma, Tandem, and Teradata achieve near-linear performance (compared to mainframes) Suggests that Grosch’s Law no longer applies

Future Directions and Research Problems Concurrent execution of simple and complex queries Problem with locks and large queries Priority scheduling (priority inversion scheduling) Parallel Query Optimization – Optimizers do not take into account parallel algorithms. Data skew creates poor query plan cost estimates. Application Program Parallelization – Database applications are not parallelized. No effective way to automate the parallelization of applications.

Future Directions and Research Problems Physical Database Design – Tools are needed to determine indexing and partitioning schemes given a database and its workload. On-line Data Reorganization – Availability of data for concurrent reads and writes as database utilities create indices, reorganize data, etc. This process must be (1) online, (2) incremental, (3) parallel, and (4) recoverable. Which of these issues may be extremely difficult to resolve?