Parallel Database Systems
Instructor: Dr. Yingshu Li
Student: Chunyu Ai

Outline
– Why Parallelism:
  – technology push
  – application pull
  – parallel database architectures
– Parallel Database Techniques:
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators

Main Message
– Technology trends give us many processors and storage units, inexpensively.
– To analyze large quantities of data:
  – sequential (regular) access patterns are 100x faster than random access
  – parallelism is 1000x faster still (it trades time for money)
– Relational systems offer many parallel algorithms.

Databases Store ALL Data Types
– The Old World: millions of objects, 100-byte records (e.g., a People table with Name and Address columns holding rows such as Mike, Won, David with NY, Berk, Austin).
– The New World: billions of objects; big objects (1 MB); objects have behavior (methods); records carry rich attributes such as papers, pictures, and voice.
– Drivers: the paperless office, the Library of Congress online, all information online (entertainment, publishing, business); visions such as the Information Network, the Knowledge Navigator, and "Information at your fingertips".
(The slide's figure contrasts the old narrow People table with a new People table that adds Papers, Picture, and Voice columns.)

Implications of Hardware Trends
– Large disc farms will be inexpensive (10 k$/TB).
– Large RAM databases will be inexpensive (1 k$/GB).
– Processors will be inexpensive.
– So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth (the slide's figure shows a node with a 1k SPECint CPU, 2 GB RAM, and a 500 GB disc).

Why Parallel Access To Data?
Bandwidth: at 10 MB/s it takes about 1.2 days to scan a terabyte; with 1,000-way parallelism the same scan takes roughly 1.5 minutes.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
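A quick back-of-the-envelope check of those numbers, as a minimal Python sketch; the 1 TB table size is my assumption, since the slide gives only the transfer rate and the elapsed times:

# Back-of-the-envelope check of the scan-time claim.
# Assumes a 1 TB table, which the slide does not state explicitly.
table_bytes = 10**12              # 1 TB
rate_bytes_per_s = 10 * 10**6     # 10 MB/s per disc
serial_s = table_bytes / rate_bytes_per_s     # ~100,000 s
parallel_s = serial_s / 1000                  # 1,000-way parallel scan
print(serial_s / 86400, "days for the serial scan")       # ~1.16 days
print(parallel_s / 60, "minutes for the parallel scan")   # ~1.7 minutes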

Implication of Hardware Trends: Clusters
Future servers are CLUSTERS of processors and discs (the slide's figure shows a node with a CPU, 5 GB RAM, and a 50 GB disc).
Thesis: Many Little will Win over Few Big.

Parallel Database Architectures
(The slide's figure, not reproduced here, contrasts the main designs: shared-memory, shared-disc, and shared-nothing.)

Parallelism: Performance is the Goal
The goal is to get "good" performance.
– Law 1: a parallel system should be faster than the corresponding serial system.
– Law 2: a parallel system should give near-linear scaleup, near-linear speedup, or both.

Speed-Up and Scale-Up
– Speedup: a fixed-size problem executing on a small system is given to a system that is N times larger.
  – Measured by: speedup = (small-system elapsed time) / (large-system elapsed time).
  – Speedup is linear if this ratio equals N.
– Scaleup: increase the size of both the problem and the system.
  – An N-times-larger system is used to perform an N-times-larger job.
  – Measured by: scaleup = (small-system, small-problem elapsed time) / (big-system, big-problem elapsed time).
  – Scaleup is linear if this ratio equals 1.
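The same definitions written as formulas; the symbols T (elapsed time) and N (growth factor) are my notation, not the slide's:

\[
\text{speedup}(N) = \frac{T_{\text{small system}}(\text{job})}{T_{N\times\text{ larger system}}(\text{job})}, \qquad \text{linear if } \text{speedup}(N) = N
\]
\[
\text{scaleup}(N) = \frac{T_{\text{small system}}(\text{small job})}{T_{N\times\text{ larger system}}(N\times\text{ larger job})}, \qquad \text{linear if } \text{scaleup}(N) = 1
\]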

Kinds of Parallel Execution
– Pipeline parallelism: the output of one (otherwise sequential) operator is streamed directly into the next, so the stages run concurrently.
– Partition parallelism: the inputs are split N ways, each partition is processed by its own copy of the same sequential program, and the outputs are merged M ways.
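A minimal Python sketch of the two forms, using generators as a stand-in for streaming operators; the stages (scan, filter, project) and the data are illustrative, not from the slides:

# Pipeline parallelism: each stage streams rows to the next stage.
# (In a real system the stages would run as separate processes or threads.)
rows = range(10)

def scan(src):            return (r for r in src)
def filter_even(it):      return (r for r in it if r % 2 == 0)
def project_square(it):   return (r * r for r in it)

pipelined = list(project_square(filter_even(scan(rows))))

# Partition parallelism: split the input N ways, run the same sequential
# program on every partition, then merge the outputs.
N = 2
partitions = [[r for r in rows if r % N == i] for i in range(N)]
merged = [x for part in partitions
            for x in project_square(filter_even(scan(part)))]

print(pipelined, sorted(merged))   # both give the same answer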

The Drawbacks of Parallelism
– Startup: creating processes, opening files, optimization.
– Interference: device contention (CPU, disc, bus) and logical contention (locks, hotspots, server, log, ...).
– Communication: sending data among nodes.
– Skew: if tasks get very small, the variance in completion time exceeds the average service time.

Parallelism: Speedup & Scaleup
– Speedup: same job (e.g., 100 GB), more hardware, less time.
– Batch scaleup: bigger job (100 GB grows to 1 TB), more hardware, same elapsed time.
– Transaction scaleup: more clients and more servers (1k clients grows to 10k clients), same response time.

Database Systems "Hide" Parallelism
– Automate system management via tools: data placement, data organization (indexing), periodic tasks (dump / recover / reorganize).
– Automatic fault tolerance: duplex & failover, transactions.
– Automatic parallelism: among transactions (locking) and within a transaction (parallel execution).

Automatic Data Partitioning
Split a SQL table across a subset of nodes and disks. Partition within that set by:
– Range: good for equi-joins, range queries, and group-by.
– Hash: good for equi-joins.
– Round robin: good for spreading load evenly.
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing systems benefit from "good" partitioning.
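A minimal Python sketch of the three placement policies; the node count, range boundaries, and keys are made up for illustration (a real system would also use a stable hash function rather than Python's salted hash):

NUM_NODES = 4
RANGE_BOUNDS = ["F", "M", "S"]     # keys <= "F" -> node 0, <= "M" -> node 1, ...

def range_partition(key):
    for node, bound in enumerate(RANGE_BOUNDS):
        if key <= bound:
            return node
    return len(RANGE_BOUNDS)       # the last node gets the tail of the range

def hash_partition(key):
    return hash(key) % NUM_NODES   # illustrative; not stable across runs

_next_node = 0
def round_robin_partition(_row):
    global _next_node
    node = _next_node
    _next_node = (_next_node + 1) % NUM_NODES
    return node

for name in ["Alice", "Mike", "Won", "Zoe"]:
    print(name, range_partition(name[0]), hash_partition(name), round_robin_partition(name))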

Index Partitioning
– Hash indices partition by hash.
– B-tree indices partition as a forest of trees, one tree per range (e.g., A..C, D..F, G..M, N..R, S..Z).
– The primary index clusters the data.

Partitioned Execution
Spreads computation and I/O among processors. Partitioned data gives NATURAL parallelism.

N x M-way Parallelism
N inputs, M outputs, no bottlenecks: partitioned data feeding partitioned and pipelined data flows.

Blocking Operators = Short Pipelines
An operator is blocking if it produces no output until it has consumed all of its input. Examples: sort, aggregates, hash join (which reads all of one operand).
Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
(The slide's figure shows a database-load pipeline: scan the source file or SQL table, sort runs, merge runs, insert into the table, then insert into indices 1-3; this load template has three blocked phases.)
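As a tiny illustration (mine, not from the slides) of why a blocking operator shortens the pipeline, a sort written as a Python generator cannot emit anything until it has buffered its entire input:

def blocking_sort(rows):
    buffered = list(rows)      # must consume ALL input before producing output
    buffered.sort()
    yield from buffered        # downstream operators only start here

print(list(blocking_sort([3, 1, 2])))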

Parallel Aggregates
For an aggregate function, we need a decomposition strategy:
– count(S) = Σ count(s(i)), and likewise for sum()
– avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
– and so on for the other aggregates.
For group-by, sub-aggregate the groups close to the source and drop the sub-aggregates into a hash river.
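A minimal Python sketch of the decomposition: each node computes a partial (count, sum) over its own partition, and a coordinator combines the partials; the partitioned data is made up:

partitions = [[3, 5, 8], [1, 9], [4, 4, 4, 6]]     # one list of values per node

def local_agg(part):
    return len(part), sum(part)                    # (count(s(i)), sum(s(i)))

partials = [local_agg(p) for p in partitions]      # computed close to the source
total_count = sum(c for c, _ in partials)          # count(S) = sum of local counts
total_sum   = sum(s for _, s in partials)          # sum(S)   = sum of local sums
avg = total_sum / total_count                      # avg(S) from the sums and counts
print(total_count, total_sum, avg)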

Parallel Sort
M inputs, N outputs. A scan (or other source) feeds a range- or hash-partitioned "river"; on each output node the sub-sorts generate runs, which are then merged. The disk and merge phases are not needed if the sort fits in memory.
Sort is the benchmark from hell for shared-nothing machines: net traffic equals disk bandwidth, and there is no data filtering at the source.
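A minimal Python sketch of a range-partitioned parallel sort: scatter rows into ranges (the "river"), sort each range independently (on separate nodes in a real system), and concatenate in range order; the bounds and data are illustrative:

data = [17, 3, 42, 8, 25, 31, 5, 12, 40, 29]
bounds = [10, 20, 30]                      # 4 ranges: <=10, <=20, <=30, the rest

def target_range(x):
    for i, b in enumerate(bounds):
        if x <= b:
            return i
    return len(bounds)

rivers = [[] for _ in range(len(bounds) + 1)]
for x in data:                             # scatter phase: the partitioned river
    rivers[target_range(x)].append(x)

result = []
for r in rivers:                           # each node sorts its own range locally
    result.extend(sorted(r))               # concatenating the ranges gives a global sort
print(result)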

Hash Join: Combining Two Tables
– Hash the smaller table into N buckets (hope N = 1).
– If N = 1: read the larger table and probe the smaller table's hash.
– Else: hash the outer table to disk, then do a bucket-by-bucket hash join.
– Purely sequential data behavior; always beats sort-merge and nested-loops joins unless the data is already clustered.
– Good for equi-, outer-, and exclusion joins. Lots of papers; hashing reduces skew.
(The slide's figure shows the left and right tables scattered into matching hash buckets.)
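A minimal Python sketch of the N = 1 case: build an in-memory hash table on the smaller input, then stream the larger input past it and emit matches on the equi-join key; the rows are made up (reusing names from the earlier People example):

build = [(1, "Mike"), (2, "Won"), (3, "David")]              # smaller table: (id, name)
probe = [(2, "Berk"), (3, "Austin"), (3, "NY"), (9, "LA")]   # larger table: (id, city)

hash_table = {}
for key, name in build:                         # build phase: hash the smaller table
    hash_table.setdefault(key, []).append(name)

result = []
for key, city in probe:                         # probe phase: scan the larger table
    for name in hash_table.get(key, []):        # emit matches on the join key
        result.append((key, name, city))
print(result)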

Q&A
Thank you!