
Gamma DBMS Part 1: Physical Database Design Shahram Ghandeharizadeh Computer Science Department University of Southern California.


1 Gamma DBMS Part 1: Physical Database Design Shahram Ghandeharizadeh Computer Science Department University of Southern California

2 Outline
- Alternative architectures:
  - Shared-Disk versus Shared-Nothing
- Declustering techniques.

3 Shared-Disk Architecture
- Emerged in the 1980s:
  - Many clients share storage and data: data remains available when a client fails.
[Figure: clients connected through a network to shared data.]

6 Shared-Disk Architecture
- Advantages:
  - Many clients share storage and data.
  - Redundancy is implemented in one place, protecting all clients from disk failure.
  - Centralized backup: the administrator does not care/know how many clients are on the network sharing storage.
[Figure: network with shared storage, labeled High Availability, Data Backup, Data Sharing.]

7 Network failures
- What about network failures?
  - Two host bus adapters per server,
  - Each server connected to a different switch.

8 Shared-Disk Architecture
- Storage Area Network (SAN):
  - Block-level access,
  - Writes to storage are immediate,
  - Specialized hardware including switches, host bus adapters, disk chassis, battery-backed caches, etc.
  - Expensive,
  - Supports transaction processing systems.
- Network Attached Storage (NAS):
  - File-level access,
  - Writes to storage might be delayed,
  - Generic hardware,
  - Inexpensive,
  - Not appropriate for transaction processing systems.

9 Concepts and Terminology
- Virtualization:
  - Available storage is represented as one HUGE disk drive, e.g., a SAN with a thousand 1.5 TB disks provides on the order of 1 Petabyte of storage,
  - Available storage is partitioned into Logical Unit Numbers (LUNs),
  - A LUN is presented to one or more servers,
  - A LUN appears as a disk drive to a server,
  - The SAN places blocks across physical disks intelligently to balance load.
- What to do when a PC fails?

10 Shared-Nothing
- Each node (blade) consists of one processor, memory, and a disk drive.
[Figure: nodes CPU 1 … CPU N connected by a network.]

11 Shared-Nothing
- Each node (blade) may consist of one or several processors, memory, and one or several disk drives.
[Figure: Node 1 … Node M connected by a network, each with CPU 1 … CPU n and DRAM 1 … DRAM D.]

12 Shared-Nothing
- Partition resources to construct logical nodes. With an 8 CPU PC, construct eight logical nodes, each with a CPU, a fraction of memory, and one disk drive.
[Figure: logical nodes CPU 1 … CPU nM connected by a network.]

13 Data Declustering
- Data is partitioned across the nodes (why?):
  - Random/round-robin,
  - Hash partitioning,
  - Range partitioning.
- Each piece of a table is termed a fragment.
- Single-attribute declustering strategies.
- Two multi-attribute declustering strategies:
  1. Multi-Attribute GrId deClustering (MAGIC)
  2. Bubba’s Extended Range Declustering (BERD)

14 Horizontal Declustering
[Figure: the logical view of Emp (below) and its physical view.]

Logical view: Emp
name    age  salary
Bob     20   10K
Shideh  18   35K
Ted     50   60K
Kevin   62   120K
Angela  55   140K
Mike    45   90K

15 Horizontal Declustering
- No partitioning attribute: Random and Round-robin.
- Single-attribute declustering strategies:
  - Hash,
  - Range.
- Note: the database administrator must choose one attribute as the partitioning attribute.

16 Hash Declustering
salary is the partitioning attribute; the hash function is salary % 3 (salary in thousands).

Logical view: Emp
name    age  salary
Bob     20   10K
Shideh  18   35K
Ted     50   60K
Kevin   62   120K
Angela  55   140K
Mike    45   90K

Physical view:
salary % 3   name    age  salary
0            Ted     50   60K
0            Kevin   62   120K
0            Mike    45   90K
1            Bob     20   10K
2            Shideh  18   35K
2            Angela  55   140K

17 Hash Declustering
- Selections with equality predicates referencing the partitioning attribute are directed to a single node:
  - Retrieve Emp where salary = 60K
    SELECT * FROM Emp WHERE salary=60K
- Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes:
  - Retrieve Emp where age = 20
  - Retrieve Emp where salary < 20K
    SELECT * FROM Emp WHERE salary<20K
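The routing rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: three nodes numbered 0 to 2, salary expressed in thousands, and the hash function salary % 3 from the previous slide. The names `hash_node` and `nodes_for_query` are hypothetical and are not part of Gamma.

```python
# Emp rows from the slides: (name, age, salary in thousands).
EMP = [
    ("Bob", 20, 10), ("Shideh", 18, 35), ("Ted", 50, 60),
    ("Kevin", 62, 120), ("Angela", 55, 140), ("Mike", 45, 90),
]
NUM_NODES = 3  # assumption: three nodes, numbered 0..2


def hash_node(salary_k: int) -> int:
    """Node that stores a row when hashing with salary % 3."""
    return salary_k % NUM_NODES


def nodes_for_query(attribute: str, op: str, value: int) -> list[int]:
    """Nodes a selection must visit when Emp is hash declustered on salary."""
    if attribute == "salary" and op == "=":
        # Equality on the partitioning attribute: exactly one node.
        return [hash_node(value)]
    # Range predicates and non-partitioning attributes: every node.
    return list(range(NUM_NODES))


# Build the physical view (fragment per node).
fragments = {node: [] for node in range(NUM_NODES)}
for name, age, salary_k in EMP:
    fragments[hash_node(salary_k)].append(name)

print(fragments)
print(nodes_for_query("salary", "=", 60))   # one node
print(nodes_for_query("age", "=", 20))      # all nodes
print(nodes_for_query("salary", "<", 20))   # all nodes
```

Note that with salary % 3 applied literally, Mike's salary of 90K hashes to 0, the same fragment as Ted and Kevin.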

18 Range Declustering
salary is the partitioning attribute.

Logical view: Emp
name    age  salary
Bob     20   10K
Shideh  18   35K
Ted     50   60K
Kevin   62   120K
Angela  55   140K
Mike    45   90K

Physical view:
salary range   name    age  salary
0-50K          Bob     20   10K
0-50K          Shideh  18   35K
51K-100K       Ted     50   60K
51K-100K       Mike    45   90K
101K-∞         Kevin   62   120K
101K-∞         Angela  55   140K

19 Range Declustering
- Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes:
  - Retrieve Emp where salary = 60K
  - Retrieve Emp where salary < 20K
- Predicates referencing a non-partitioning attribute are directed to all nodes.
- In our example, both queries are directed to one node.
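A minimal sketch of the same routing decision under range declustering, assuming the three salary ranges from the example (0-50K, 51K-100K, 101K and above), nodes numbered 0 to 2, and a predicate given as an inclusive interval in thousands. The function names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds (in thousands) of every fragment except the last, open-ended one.
UPPER_BOUNDS_K = [50, 100]  # fragments: 0-50K, 51K-100K, 101K and above


def range_node(salary_k: int) -> int:
    """Node whose salary range contains the given value."""
    return bisect_left(UPPER_BOUNDS_K, salary_k)


def nodes_for_range(low_k: int, high_k: int) -> list[int]:
    """Nodes overlapped by the inclusive predicate low_k <= salary <= high_k."""
    return list(range(range_node(low_k), range_node(high_k) + 1))


print(nodes_for_range(60, 60))   # salary = 60K
print(nodes_for_range(0, 19))    # salary < 20K
print(nodes_for_range(40, 120))  # a wide range touches several nodes
```

Both example queries from the slide resolve to a single node, while a predicate on age would carry no information about salary ranges and would have to be sent to every node.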

20 An iPSC/2 Intel Hypercube
- Year is 1988!
- 32-processor hypercube.
- Each node consists of:
  - 80386 processor (12 MHz)
  - 2 MB DRAM
  - 333 MB disk
  - A hypercube interconnect supporting parallel transmission of messages among nodes.

21 Software Architecture
- Each node stores its fragment on its local disk drive.
- Each node may build a B+-tree (clustered/non-clustered) and hash index on its fragment of a relation.
- Each node has its own concurrency control and crash recovery mechanism.

22 Software Architecture

26
- Processes executing on one node shared memory, identical to today’s threads!
- At initialization time, a node would start a fixed number of threads (processes).
- All threads listen on a well-defined socket, waiting for the Scheduler to dispatch work to them.
- A message contains the identity that the operator should assume:
  - A “switch” statement would enable a thread to become a select, project, hash-join build, hash-join probe, etc.
  - The message specifies the role of the thread.
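The dispatch idea can be illustrated with a small Python sketch. It is not Gamma's code: an in-process queue stands in for the socket, a dictionary lookup stands in for the "switch" statement, and the operator bodies are placeholders that only return a label. All names here are hypothetical.

```python
import queue
import threading


# Placeholder operators: each just reports which role it assumed.
def select(args): return f"select {args}"
def project(args): return f"project {args}"
def hash_join_build(args): return f"hash-join build {args}"
def hash_join_probe(args): return f"hash-join probe {args}"


# Stand-in for the "switch" statement: role name -> operator.
OPERATORS = {
    "select": select,
    "project": project,
    "hash_join_build": hash_join_build,
    "hash_join_probe": hash_join_probe,
}


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A pre-started thread: wait for a message, assume the role it names."""
    while True:
        message = inbox.get()
        if message is None:          # shutdown signal
            break
        role, args = message
        outbox.put(OPERATORS[role](args))


def run(messages, num_threads=2):
    """Start a fixed pool of threads and let a 'scheduler' dispatch messages."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for m in messages:
        inbox.put(m)
    for _ in threads:
        inbox.put(None)
    for t in threads:
        t.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results)           # thread completion order is not deterministic


print(run([("select", "Emp"), ("project", "name")]))
```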

27 A Comparison of Range & Hash
- Closed simulation model:
  - A client generates a range selection predicate: X < age < Y.
  - The age attribute value is unique, with values ranging from 0 to 999,999 (1 million rows).
  - A client does not generate a new request until its pending request is processed by Gamma and returned.
  - The system is multi-programmed by increasing the number of clients in the system.
  - A multi-programming level of 8 means there are 8 clients generating requests to the system (independent of one another).
[Figure: clients issuing requests to a 32-node Gamma.]

29 A Comparison of Range & Hash
- Closed simulation model:
  - A client generates a range selection predicate: X < age < Y.
  - The age attribute value is unique, with values ranging from 0 to 999,999 (1 million rows).
  - A client does not generate a new request until its pending request is processed by Gamma and returned.
  - A 0.01% selection predicate retrieves 100 rows.
  - With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages.
  - With range partitioning, the predicate is processed by one node.
  - With hash partitioning, the predicate is processed by all 32 nodes, with the scheduler coordinating the execution of the predicate on each node and gathering the results from every node.
[Figure: 32-node Gamma, range partitioned on age: 0 – 31,249, 31,250 – 62,499, …, 968,750 – 999,999.]

30 Declustering Techniques: Tradeoffs
- Range selection predicate using a clustered B+-tree, 0.01% selectivity (10 records).
[Figure: throughput (queries/second) versus multiprogramming level, for Range and for Hash/Random/Round-robin.]

32 A Comparison of Range & Hash
- Closed simulation model:
  - A client generates a range selection predicate: X < age < Y.
  - The age attribute value is unique, with values ranging from 0 to 999,999 (1 million rows).
  - A client does not generate a new request until its pending request is processed by Gamma and returned.
  - A 1% selection predicate retrieves 10,000 rows.
  - With a clustered B+-tree index, the 10,000 rows are grouped together.
  - With Range partitioning, the predicate is processed using one or two nodes.
  - With Hash partitioning, the predicate is processed by all the nodes, with the scheduler coordinating the execution of the predicate.
[Figure: 32 nodes, range partitioned on age: 0 – 31,249, 31,250 – 62,499, …, 968,750 – 999,999.]
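The "one or two nodes" claim follows from the range boundaries and can be checked with a short sketch. It assumes the configuration described on the slide (1 million unique age values, 32 nodes, equal-width ranges) and treats a predicate as an inclusive interval of values; the function names are hypothetical.

```python
NUM_ROWS = 1_000_000
NUM_NODES = 32
ROWS_PER_NODE = NUM_ROWS // NUM_NODES  # 31,250 values per range


def range_nodes(low: int, high: int) -> list[int]:
    """Nodes visited under range partitioning for low <= age <= high."""
    return list(range(low // ROWS_PER_NODE, high // ROWS_PER_NODE + 1))


def hash_nodes(low: int, high: int) -> list[int]:
    """Under hash partitioning a range predicate is sent to every node."""
    return list(range(NUM_NODES))


# A 1% predicate (10,000 consecutive values) inside one range.
print(len(range_nodes(500_000, 509_999)))
# A 1% predicate that straddles the boundary at 31,250.
print(len(range_nodes(30_000, 39_999)))
# The same predicate under hash partitioning.
print(len(hash_nodes(30_000, 39_999)))
```

Because a 10,000-value interval is narrower than a 31,250-value range, it can cross at most one boundary, which is why range partitioning never needs more than two nodes in this setup.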

33 Tradeoffs (Cont…)
- Range selection predicate using a clustered B+-tree, 1% selectivity (1000 records).
[Figure: throughput (queries/second) versus multiprogramming level, for Range and for Hash/Random/Round-robin.]

34 Why Does Range Perform Poorly?
- Note: Range performed poorly because the query (1% selection) imposed a high workload onto a node!
  - For a query with a minimal workload requirement (0.01% selection), Range is ideal!
- Two reasons:
  - Random generation of selection predicates does NOT mean a uniform distribution of workload across nodes.
  - The number of ranges is the same as the number of nodes, causing the tail-end servers to observe a lower load.

35
[Figure: enumeration of every way to assign 3 requests (R1, R2, R3) to the 3 nodes.]
- 27 ways to assign 3 requests to the 3 nodes!
- Only 6 result in a uniform distribution of requests (the ideal cases); the other 21 do not.
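The counts on this slide can be reproduced by brute-force enumeration. The sketch below assumes each request is assigned to exactly one node and calls an assignment uniform when every node receives the same number of requests; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def assignments(num_requests: int, num_nodes: int) -> list[tuple[int, ...]]:
    """Every way to map requests to nodes; entry i is the node of request i."""
    return list(product(range(num_nodes), repeat=num_requests))


def is_uniform(assignment: tuple[int, ...], num_nodes: int) -> bool:
    """True when every node receives the same number of requests."""
    counts = Counter(assignment)
    share = len(assignment) // num_nodes
    return all(counts[node] == share for node in range(num_nodes))


all_ways = assignments(3, 3)
uniform = [a for a in all_ways if is_uniform(a, 3)]
print(len(all_ways), len(uniform), len(all_ways) - len(uniform))
```

With 3 requests and 3 nodes this prints the total number of assignments, the number of uniform ones, and the number of skewed ones, matching the 27, 6, and 21 on the slide.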

36 Tradeoffs (Cont…)
- Simple range partitioning may lead to load imbalance for queries with high selectivity:
  - Low performance: increased response time and low system throughput.
- Consider a table that maintains the grades of students for different exams, range partitioned on the grade.
[Figure: five ranges, one per node: 0-19, 20-39, 40-59, 60-79, 80-100.]

37 Tradeoffs (Cont…)
- Assume a range predicate overlaps 3 partitions, e.g.,
  - 0 < grade < 45
  - 45 < grade < 90
[Figure: each predicate drawn over the five ranges 0-19, 20-39, 40-59, 60-79, 80-100.]

38 Tradeoffs (Cont…)
- Higher response time because 2 nodes sit idle while 3 nodes process the query (assuming the overhead of parallelism is negligible).
[Figure: the predicate 45 < grade < 90 overlaps ranges 40-59, 60-79, and 80-100.]

39 Tradeoffs (Cont…)
- Lower throughput because node 3 becomes a bottleneck.
  - Assuming even distribution of access to ranges, when node 3 is utilized 100%, nodes 2 and 4 have a 66% utilization, while nodes 1 and 5 are utilized 33%.
[Figure: the five ranges 0-19, 20-39, 40-59, 60-79, 80-100.]

40 Hybrid Range Partitioning [VLDB’90]
- To minimize the impact of load imbalance, construct more ranges than nodes, e.g., 10 ranges for a 5 node system.
  - Predicates such as “0 < grade < 45” are now directed to all nodes.
  - Assuming even distribution of access to ranges, where the workload consists of predicates utilizing 3 sequential ranges, when node 3 becomes 100% utilized, nodes 2 and 4 are now utilized 83%, while nodes 1 and 5 are utilized 66%.

node   ranges
1      0-10, 51-60
2      11-20, 61-70
3      21-30, 71-80
4      31-40, 81-90
5      41-50, 91-100
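The utilization figures quoted for simple and hybrid range partitioning can be derived with a small calculation. The sketch assumes that every window of 3 consecutive ranges is requested equally often, that each range in a window contributes one unit of work, and that ranges are assigned to nodes round-robin as in the table above. The function name is hypothetical.

```python
from fractions import Fraction


def relative_utilization(num_ranges: int, num_nodes: int, window: int = 3):
    """Load of each node relative to the busiest node."""
    load = [0] * num_nodes
    # Slide a window of `window` consecutive ranges over all ranges.
    for start in range(num_ranges - window + 1):
        for r in range(start, start + window):
            load[r % num_nodes] += 1   # range r lives on node r mod num_nodes
    peak = max(load)
    return [Fraction(units, peak) for units in load]


simple = relative_utilization(num_ranges=5, num_nodes=5)
hybrid = relative_utilization(num_ranges=10, num_nodes=5)
print([str(u) for u in simple])   # 5 ranges on 5 nodes
print([str(u) for u in hybrid])   # 10 ranges on 5 nodes
```

With 5 ranges the nodes are loaded in the ratio 1/3, 2/3, 1, 2/3, 1/3 (the 33%, 66%, 100% of the previous slide); with 10 ranges the ratio becomes 2/3, 5/6, 1, 5/6, 2/3 (66%, 83%, 100%), so the gap between the busiest and the idlest node narrows.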

41 Multi-Attribute Declustering [SIGMOD’92]
- Queries with minimal resource requirements should be directed to a few processors. Why?
  - Overhead of parallelism:
    1. Impacts query response time adversely,
    2. Wastes system resources, reducing throughput.
  - OLTP has come a long way:
    - The heaviest transaction in TPC-C reads approximately 400 records.
    - Assuming no disk accesses, a low-end PC processes this transaction in under 1 ms.
    - Transactions should be single-sited!
[Figure labels: Range, Round-robin.]

42 Multi-Attribute Declustering (E.g.)
- Recall the Emp(name, age, salary) table.
- Workload consists of two queries, each with a 50% frequency of occurrence:
  - Query A, a range query referencing the age attribute. On average, retrieves 5 tuples.
    - Retrieve Emp where age > 21 and age < 22.
  - Query B, a range query referencing the salary attribute. On average, retrieves 10 tuples.
    - Retrieve Emp where salary > 50K and salary < 50.5K
  - Access methods:
    - A non-clustered B+-tree index on age
    - A clustered B+-tree index on salary
- Ideally, both queries should be directed to one node.

43 Multi-Attribute Declustering (E.g. Cont...)
- Range decluster Emp using age as the partitioning attribute.
- Assuming a system configured with nine nodes, the number of employed nodes is:

          Range      Ideal
A         50% * 1    50% * 1
B         50% * 9    50% * 1
Average   5          1
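The averages in this comparison are frequency-weighted sums, which a few lines of Python make explicit. The workload dictionaries restate the slide's numbers; the function name `average_nodes` is hypothetical.

```python
def average_nodes(workload: dict[str, tuple[float, int]]) -> float:
    """Workload maps a query name to (frequency, number of nodes employed)."""
    return sum(freq * nodes for freq, nodes in workload.values())


# Range declustering on age, nine nodes: query A uses 1 node, query B uses all 9.
range_on_age = {"A": (0.5, 1), "B": (0.5, 9)}
# Ideal: both queries are directed to a single node.
ideal = {"A": (0.5, 1), "B": (0.5, 1)}

print(average_nodes(range_on_age))  # 0.5 * 1 + 0.5 * 9
print(average_nodes(ideal))         # 0.5 * 1 + 0.5 * 1
```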

44 MAGIC
- Construct a multi-attribute grid directory on the Emp table.
- Each dimension corresponds to a partitioning attribute.
- Each cell represents a fragment of the relation.

Grid directory (axes: Salary and Age; each cell shows the node that stores the fragment):

1 1 4 4 7 7
1 1 4 4 7 7
2 2 5 5 8 8
2 2 5 5 8 8
3 3 6 6 0 0
3 3 6 6 0 0

Axis ranges as labeled in the figure: 0-20, 21-25, 26-30, 31-35, 36-40, 41-70 and 10-20, 21-25, 26-30, 31-35, 36-40, 41-60.
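A minimal sketch of how a grid directory narrows the set of nodes a query must visit, using the nine-node grid from this slide. Which attribute runs along the rows and which along the columns is an assumption made here for illustration (rows as one attribute's slices, columns as the other's); the function names are hypothetical.

```python
# Grid directory: GRID[i][j] is the node holding the fragment in row i, column j.
GRID = [
    [1, 1, 4, 4, 7, 7],
    [1, 1, 4, 4, 7, 7],
    [2, 2, 5, 5, 8, 8],
    [2, 2, 5, 5, 8, 8],
    [3, 3, 6, 6, 0, 0],
    [3, 3, 6, 6, 0, 0],
]


def nodes_for_row(i: int) -> list[int]:
    """Nodes visited by a predicate that selects one slice of the row attribute."""
    return sorted(set(GRID[i]))


def nodes_for_column(j: int) -> list[int]:
    """Nodes visited by a predicate that selects one slice of the column attribute."""
    return sorted({row[j] for row in GRID})


print(nodes_for_row(0))
print(nodes_for_column(0))
# Every single-slice predicate, on either attribute, touches 3 of the 9 nodes.
print({len(nodes_for_row(i)) for i in range(6)} | {len(nodes_for_column(j)) for j in range(6)})
```

Either kind of query is confined to three nodes, rather than one attribute getting a single node at the price of the other being broadcast to all nine.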

45 MAGIC (Low Correlation)
- Low correlation between salary and age attribute values:
[Figure: the grid directory with the distribution of tuples marked as dots.]

        MAGIC      Range      Ideal
A       50% * 3    50% * 1    50% * 1
B       50% * 3    50% * 9    50% * 1
Avg     3          5          1

46 MAGIC (High Correlation)
- High correlation between salary and age attribute values:
[Figure: the grid directory with the distribution of tuples marked as dots.]

        MAGIC      Range      Ideal
A       50% * 1    50% * 1    50% * 1
B       50% * 1    50% * 9    50% * 1
Avg     1          5          1

47 BERD
- Range partition Emp using the salary attribute.
- For the age attribute, construct an auxiliary relation containing:
  1. The age attribute value of each record
  2. The node containing that record
- Range partition the auxiliary relation using the age attribute value.

48 BERD
salary is the primary partitioning attribute.

Logical view: Emp
name    age  salary
Bob     20   10K
Shideh  18   35K
Ted     50   60K
Kevin   62   120K
Angela  55   140K
Mike    45   90K

Physical view:
salary range   name    age  salary
0-50K          Bob     20   10K
0-50K          Shideh  18   35K
51K-100K       Ted     50   60K
51K-100K       Mike    45   90K
101K-∞         Kevin   62   120K
101K-∞         Angela  55   140K

49 BERD, Auxiliary relation

Auxiliary relation:
age  node
20   0
18   0
50   1
45   1
62   2
55   2

Emp fragments (range partitioned on salary):
salary range   name    age  salary
0-50K          Bob     20   10K
0-50K          Shideh  18   35K
51K-100K       Ted     50   60K
51K-100K       Mike    45   90K
101K-∞         Kevin   62   120K
101K-∞         Angela  55   140K

50 BERD, Auxiliary relation
Range partition the auxiliary relation using the age attribute.

Auxiliary relation:
age  node
20   0
18   0
50   1
45   1
62   2
55   2

Fragments of the auxiliary relation:
age range   age  node
0-20        20   0
0-20        18   0
21-52       50   1
21-52       45   1
53-∞        62   2
53-∞        55   2

51 BERD, Auxiliary relation
The fragments of the auxiliary relation (Aux.age) and the fragments of Emp (Salary):

Aux.age range   age  node
0-20            20   0
0-20            18   0
21-52           50   1
21-52           45   1
53-∞            62   2
53-∞            55   2

Salary range   name    age  salary
0-50K          Bob     20   10K
0-50K          Shideh  18   35K
51K-100K       Ted     50   60K
51K-100K       Mike    45   90K
101K-∞         Kevin   62   120K
101K-∞         Angela  55   140K
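The two-step lookup that BERD performs for a query on age can be sketched as follows, using the example data above. Two assumptions are made for illustration: auxiliary fragment i is stored on node i, and the predicate is an inclusive age interval. The names are hypothetical and do not come from Bubba or Gamma.

```python
from bisect import bisect_left

AGE_UPPER_BOUNDS = [20, 52]  # auxiliary fragments: 0-20, 21-52, 53 and above

# Auxiliary relation, range partitioned on age: (age, node holding the Emp row).
AUX_FRAGMENTS = {
    0: [(20, 0), (18, 0)],
    1: [(50, 1), (45, 1)],
    2: [(62, 2), (55, 2)],
}

# Emp, range partitioned on salary: (name, age, salary in thousands).
EMP_FRAGMENTS = {
    0: [("Bob", 20, 10), ("Shideh", 18, 35)],
    1: [("Ted", 50, 60), ("Mike", 45, 90)],
    2: [("Kevin", 62, 120), ("Angela", 55, 140)],
}


def aux_node(age: int) -> int:
    """Node holding the auxiliary fragment whose age range contains `age`."""
    return bisect_left(AGE_UPPER_BOUNDS, age)


def query_by_age(low: int, high: int):
    """Step 1: consult the auxiliary relation. Step 2: visit only the named nodes."""
    aux_nodes = list(range(aux_node(low), aux_node(high) + 1))
    data_nodes = sorted({node
                         for a in aux_nodes
                         for age, node in AUX_FRAGMENTS[a]
                         if low <= age <= high})
    rows = [row for n in data_nodes for row in EMP_FRAGMENTS[n]
            if low <= row[1] <= high]
    return aux_nodes, data_nodes, rows


print(query_by_age(45, 50))
print(query_by_age(18, 20))
```

In this example age and salary are highly correlated, so the auxiliary lookup and the data access land on a single node each.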

52 BERD (Cont…)
- High correlation between age and salary attribute values:

        BERD       Range      Ideal
A       50% * 1    50% * 1    50% * 1
B       50% * 1    50% * 9    50% * 1
Avg     1          5          1

53 BERD (Cont…)
- Low correlation between age and salary attribute values:

        BERD       Range      Ideal
A                  50% * 1    50% * 1
B                  50% * 9    50% * 1
Avg     5          5          1

- Is it possible to avoid lookup in the auxiliary table?

54 Experimental environment
- Verified simulation model of the Gamma database machine.
- A 32-processor system.
- Database consists of a 100,000-tuple table based on the Wisconsin Benchmark.

55 Experimental Design
- Correlation between partitioning attribute values: Low, High.
- Workload characteristics (A, B): (Low, Low), (Low, Moderate), (Moderate, Low), (Moderate, Moderate).
- Multiprogramming level.

56 Low-Low Query Mix (Low Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

57 Low-Low Query Mix (High Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

58 Low-Moderate Mix (Low Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

59 Low-Moderate Mix (High Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

60 Moderate-Moderate Mix (Low Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

61 Moderate-Moderate Mix (High Correlation)
[Figure: throughput (queries/second) versus multiprogramming level.]

62 Advantages of MAGIC
- Provides superior performance when compared to BERD and Range.
- Constructs the grid directory using the workload of the relation. Changes the shape of the grid directory in order to compensate for the different frequencies of access to the partitioning attributes.
- Minimizes the overhead of parallelism.
- Supports partial declustering of a relation in large systems.

63 Summary
- Given the fast speed of CPUs, each query/transaction should ideally be processed by one node.

64 Parallelism versus Efficient Servers
- Even if all queries and transactions become single-sited, parallelism is no substitute for smart algorithms that make a single server efficient.
- Why?

65 Why?
- Assume a single server that can process one request per second.
- Two choices:
  1. Extend it with Flash and obtain a throughput of 3 requests per second.
  2. Buy two additional servers and partition the data across the 3 servers.
- Given 3 simultaneous requests issued to each alternative:
  - The single-processor system will process 3 requests per second.
  - The 3-node system may not provide a throughput of 3 requests per second.

66
[Figure: enumeration of every way to assign 3 requests (R1, R2, R3) to the 3 nodes, with the 6 ideal cases marked.]
- 27 ways to assign 3 requests to the 3 nodes!

68 Brain Teaser
- Given N servers and M requests, compute:
  - The probability of M/N requests per node.
  - The number of ways M requests may map onto N servers and the probability of each scenario.
- Reward for correct answer:
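One way to approach the perfectly balanced case is sketched below. It rests on assumptions the slide does not state: each request is assigned to a server independently and uniformly at random, requests are distinguishable, and N divides M. It covers only the uniform outcome, not the probability of every scenario.

```latex
% Number of ways to map M distinguishable requests onto N servers:
\[
  \#\text{mappings} = N^{M}
\]
% Number of mappings that place exactly M/N requests on every server
% (a multinomial coefficient), and the resulting probability:
\[
  P\!\left(\tfrac{M}{N}\ \text{requests per node}\right)
  = \frac{M!}{\bigl((M/N)!\bigr)^{N}\, N^{M}}
\]
% Check against the 3-request, 3-node example:
\[
  N = M = 3:\qquad \frac{3!}{(1!)^{3}\cdot 3^{3}} = \frac{6}{27} = \frac{2}{9}
\]
```

The check agrees with the earlier enumeration: 6 of the 27 assignments are uniform.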

