Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.


1 Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai

2 Outline
- Why parallelism:
  - technology push
  - application pull
  - parallel database architectures
- Parallel database techniques:
  - partitioned data
  - partitioned and pipelined execution
  - parallel relational operators

3 Main Message
- Technology trends give us many processors and storage units, inexpensively.
- To analyze large quantities of data:
  - sequential (regular) access patterns are 100x faster
  - parallelism is 1000x faster (trades money for time)
- Relational systems admit many parallel algorithms.

4 Databases Store ALL Data Types
- The Old World: millions of objects, 100-byte records (e.g. a People table with Name and Address columns).
- The New World: billions of objects; big objects (1 MB); objects have behavior (methods) (e.g. a People table with Name, Address, Papers, Picture, and Voice).
- Drivers: the paperless office, the Library of Congress online, all information online (entertainment, publishing, business): the "Information Network", "Knowledge Navigator", "information at your fingertips".

5 Implications of Hardware Trends
- Large disc farms will be inexpensive (10k$/TB).
- Large RAM databases will be inexpensive (1k$/GB).
- Processors will be inexpensive.
- So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth (sketched on the slide as a 1k SPECint CPU with 2 GB RAM and 500 GB of disc).

6 Why Parallel Access To Data?
At 10 MB/s it takes 1.2 days to scan 1 TB; with 1,000-way parallelism it becomes a 1.5-minute scan. The issue is bandwidth.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
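A back-of-the-envelope check of the scan numbers above, sketched in Python (the 1 TB figure and 10 MB/s bandwidth are the slide's; the decimal byte units are an assumption):

```python
TB = 10**12          # bytes (decimal units assumed)
MB = 10**6           # bytes
bandwidth = 10 * MB  # bytes/second from one disc

serial_seconds = TB / bandwidth            # one disc scans the whole terabyte
parallel_seconds = serial_seconds / 1000   # 1,000 discs scan 1 GB each

print(f"serial:   {serial_seconds / 86400:.1f} days")    # ~1.2 days
print(f"parallel: {parallel_seconds / 60:.1f} minutes")  # ~1.7 min (the slide rounds to 1.5)
```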

7 Implications of Hardware Trends: Clusters
Future servers are CLUSTERS of processors and discs (each node sketched as a CPU with 5 GB RAM and 50 GB of disc).
Thesis: many little nodes will win over few big ones.

8 Parallel Database Architectures

9 Parallelism: Performance is the Goal
The goal is to get "good" performance.
- Law 1: a parallel system should be faster than the serial system.
- Law 2: a parallel system should give near-linear scaleup or near-linear speedup, or both.

10 Speed-Up and Scale-Up
- Speedup: a fixed-size problem executing on a small system is given to a system that is N times larger.
  - Measured by: speedup = (small-system elapsed time) / (large-system elapsed time)
  - Speedup is linear if the ratio equals N.
- Scaleup: increase the size of both the problem and the system.
  - An N-times larger system is used to perform an N-times larger job.
  - Measured by: scaleup = (small-system, small-problem elapsed time) / (big-system, big-problem elapsed time)
  - Scaleup is linear if the ratio equals 1.
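The two ratios above can be sketched directly; the timings below are hypothetical, purely to illustrate the definitions:

```python
def speedup(small_sys_time, large_sys_time):
    """Same job on an N-times larger system; linear if the result equals N."""
    return small_sys_time / large_sys_time

def scaleup(small_job_small_sys_time, big_job_big_sys_time):
    """N-times larger job on an N-times larger system; linear if the result equals 1."""
    return small_job_small_sys_time / big_job_big_sys_time

# Hypothetical timings: an 8-node system runs the fixed job in 130 s
# where 1 node took 1000 s -> sub-linear speedup (startup, skew, ...).
print(speedup(1000, 130))   # ~7.7, vs. an ideal of 8
print(scaleup(1000, 1000))  # 1.0: perfectly linear scaleup
```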

11 Kinds of Parallel Execution
- Pipeline: one (otherwise sequential) program streams its output into the next.
- Partition: inputs split N ways, each partition processed by a copy of the sequential program, outputs merged M ways.
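The two shapes above can be rendered as a toy sketch. Python generators capture the dataflow shape of a pipeline (tuples stream stage to stage without materializing), though a real system runs the stages on separate processors; the operator names are illustrative:

```python
def scan(rows):                  # stage 1: source
    yield from rows

def select(rows, pred):          # stage 2: consumes as stage 1 produces
    return (r for r in rows if pred(r))

def project(rows, col):          # stage 3
    return (r[col] for r in rows)

# Pipeline: three stages over one data stream.
pipeline = project(select(scan([(1, "a"), (2, "b"), (3, "c")]),
                          lambda r: r[0] > 1), 1)
print(list(pipeline))            # ['b', 'c']

# Partition: split the input N ways, run the same sequential program on
# each partition, then merge the outputs.
def partitioned(rows, n, program):
    parts = [rows[i::n] for i in range(n)]         # round-robin split
    return [x for p in parts for x in program(p)]  # merge

print(partitioned([1, 2, 3, 4, 5, 6], 3, lambda p: [x * x for x in p]))
```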

12 The Drawbacks of Parallelism
- Startup: creating processes, opening files, optimization.
- Interference: device (CPU, disc, bus) and logical (lock, hotspot, server, log, ...).
- Communication: sending data among nodes.
- Skew: if tasks get very small, variance exceeds service time.

13 Parallelism: Speedup & Scaleup
- Speedup: same job (100 GB), more hardware, less time.
- Scaleup: bigger job (100 GB -> 1 TB), more hardware, same time.
- Transaction scaleup: more clients and more servers (1k clients -> 10k clients), same response time.

14 Database Systems "Hide" Parallelism
- Automate system management via tools: data placement, data organization (indexing), periodic tasks (dump / recover / reorganize).
- Automatic fault tolerance: duplexing & failover, transactions.
- Automatic parallelism: among transactions (locking), within a transaction (parallel execution).

15 Automatic Data Partitioning
Split a SQL table across a subset of nodes and disks. Partition within the set by:
- Range: good for equi-joins, range queries, group-by.
- Hash: good for equi-joins.
- Round robin: good for spreading load.
Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.
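The three strategies above can be sketched as partitioning functions mapping a key (or a row, for round robin) to a node number. The function names and boundaries are illustrative, not from any particular system:

```python
import itertools

def range_partition(key, boundaries):
    """Range partitioning: good for equi-joins, range queries, group-by."""
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)           # the last node takes everything above

def hash_partition(key, n_nodes):
    """Hash partitioning: good for equi-joins."""
    return hash(key) % n_nodes

def make_round_robin(n_nodes):
    """Round-robin partitioning: good for spreading load evenly."""
    counter = itertools.count()
    return lambda _row: next(counter) % n_nodes

print(range_partition(25, [9, 19, 29, 39]))  # 2 (the 20..29 bucket)
print(hash_partition("dept42", 4))           # some node in 0..3
rr = make_round_robin(3)
print([rr(None) for _ in range(5)])          # [0, 1, 2, 0, 1]
```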

16 Index Partitioning
- Hash indices partition by hash (e.g. buckets 0..9, 10..19, 20..29, 30..39, 40..).
- B-tree indices partition as a forest of trees, one tree per range (e.g. A..C, D..F, G..M, N..R, S..Z).
- A primary index clusters the data.

17 Partitioned Execution Spreads computation and IO among processors Partitioned data gives NATURAL parallelism

18 N x M-Way Parallelism
N inputs, M outputs, no bottlenecks: partitioned data feeding partitioned and pipelined data flows.

19 Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output until it has consumed all its input.
Examples: sort, aggregates, hash join (reads all of one operand).
Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
Example: the database-load template on the slide has three blocked phases: scan the SQL table and sort runs to a tape file, merge the runs and insert into the table, then build the indices (Index 1, Index 2, Index 3).
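A toy sketch of why the operators above stall a pipeline: a streaming operator can emit its first output after one input tuple, while a sort must consume everything before emitting anything. The operator names are illustrative:

```python
def pipelined_filter(rows):
    for r in rows:             # emits immediately: downstream can overlap
        if r % 2 == 0:
            yield r

def blocking_sort(rows):
    buffered = list(rows)      # consumes ALL input first -> the pipeline stalls here
    buffered.sort()
    yield from buffered

out = blocking_sort(pipelined_filter([5, 2, 8, 4]))
print(list(out))               # [2, 4, 8]; nothing emerged until sort had everything
```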

20 Parallel Aggregates
For an aggregate function, we need a decomposition strategy:
- count(S) = Σ count(s(i)), ditto for sum()
- avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
- and so on ...
For groups, sub-aggregate close to the source and drop the sub-aggregates into a hash river.
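The decomposition rules above, sketched for hypothetical partitions s(1)..s(N) of a column S: each node computes local (sum, count) pairs, and a combiner adds them.

```python
partitions = [[3, 5], [2], [4, 6, 8]]    # hypothetical partitioned column S

# Each node computes sub-aggregates locally ...
subs = [(sum(p), len(p)) for p in partitions]

# ... and a combiner adds them:
#   count(S) = sum of count(s(i)), ditto sum(),
#   avg(S)   = (sum of sum(s(i))) / (sum of count(s(i)))
total_sum = sum(s for s, _ in subs)
total_count = sum(c for _, c in subs)
print(total_count, total_sum, total_sum / total_count)  # 6 28 4.666...
```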

21 Parallel Sort
M inputs, N outputs: a scan (or other source) range- or hash-partitions tuples into a "river"; sub-sorts generate runs; then the runs are merged.
Disk and merge are not needed if the sort fits in memory.
Sort is the benchmark from hell for shared-nothing machines: network traffic equals disk bandwidth, and there is no data filtering at the source.
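The run/merge structure above can be sketched in a few lines: each input stream sorts its partition into a run, and `heapq.merge` stands in for the N-way merge (in a real system the runs live on separate nodes and the merge streams over the network):

```python
import heapq

def parallel_sort(partitions):
    runs = [sorted(p) for p in partitions]   # "sub-sorts generate runs"
    return list(heapq.merge(*runs))          # "merge runs"

print(parallel_sort([[9, 1], [4, 7, 2], [8, 3]]))  # [1, 2, 3, 4, 7, 8, 9]
```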

22 Hash Join: Combining Two Tables
- Hash the smaller table into N buckets (hope N = 1).
- If N = 1: read the larger table and probe the hashed smaller one.
- Else: hash the outer to disk, then join bucket by bucket.
- Purely sequential data behavior.
- Always beats sort-merge and nested-loops unless the data is clustered.
- Good for equi-, outer, and exclusion joins; hashing reduces skew. (Lots of papers on this.)
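A minimal in-memory sketch of the N = 1 case above: build a hash table on the smaller table, then stream the larger one past it. The tables and column indices are hypothetical:

```python
def hash_join(small, large, key_small, key_large):
    buckets = {}
    for row in small:                          # build phase on the smaller table
        buckets.setdefault(row[key_small], []).append(row)
    out = []
    for row in large:                          # probe phase: purely sequential read
        for match in buckets.get(row[key_large], []):
            out.append(match + row)
    return out

depts = [("d1", "Sales"), ("d2", "Eng")]                       # smaller table
emps = [("alice", "d2"), ("bob", "d1"), ("carol", "d2")]       # larger table
print(hash_join(depts, emps, 0, 1))
```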

23 Q&A
Thank you!

