Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the.

Schema Tuning

Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the decomposition? Normalized Normal form Schema refinement –Vertical partitioning and Antipartitioning –Denormalization –Horizontal decomposition

Design Steps The design steps: 1.Real-World 2.ER model 3.Relational Schema 4. Better relational Schema 5. Relational DBMS Guideline for Step (3) to step (4) –“design theory” for relations –Or “normalization theory”. Why normalization theory –Automatic mappings from ER to relations may not produce the best relational design possible. –Database designers may go directly from (1) to (3), in which case, the relational design can be really bad.

Database Schema A relation schema is a relation name and a set of attributes R(a int, b varchar[20]); A relation instance for R is a set of records over the attributes in the schema for R.

The Evils of Redundancy Redundancy is the root of several problems associated with relational schemas: –redundant storage –insert/delete/update anomalies Consider relation obtained from Hourly_Emps: –Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) Notation: We will denote this relation schema by listing the attributes: SNLRWH –This is really the set of attributes Hourly_Emps {S,N,L,R,W,H}.

Example Problems due to R  W : –Update anomaly: Can we change W in just the 1st tuple of SNLRWH? –Insertion anomaly: Suppose W is not allowed null. What if we want to insert an employee and don’t know the hourly wage for his rating? –Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!

Functional Dependencies (FDs) A functional dependency X  Y holds over relation R if, for every allowable instance r of R: –i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.) K is a key for relation R if: 1. K determines all attributes of R. 2. For no proper subset of K is (1) true. If K satisfies only (1), then K is a superkey.

Example Consider relation Hourly_Emps: –Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) v FD is a key: –ssn is the key u S  SNLRWH v FDs give more detail than the mere assertion of a key. –rating determines hrly_wages u R  W

Decomposition of a Relation Scheme A decomposition of R(A1... An) consists of replacing R by two or more relations such that: –Each new relation scheme contains a subset of the attributes of R Implies that there exists no attributes that do not appear in R –Every attribute of R appears as an attribute of one of the new relations. E.g., –SNLRWH can be decomposed into SNLRH and RW.

Problems with Decompositions Three potential problems to consider: ¶ Performance: Some queries become more expensive. e.g., How much did sailor Joe earn? (salary = W*H) · Lossless Join Decompositions: Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! Fortunately, not in the SNLRWH example. ¸ Dependency Preserving Decomposition: Checking some dependencies may require joining the instances of the decomposed relations. Fortunately, not in the SNLRWH example. Tradeoff: Must consider these issues vs. redundancy.

Lossless Join Decompositions Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F: –  X (r)   Y (r)=r It is always true that r  X (r)   Y (r) –In general, the other direction does not hold! If it does, the decomposition is lossless-join. Definition extended to decomposition into 3 or more relations in a straightforward way.

Dependency Preserving Decomposition Consider CSJDPQV, C is key, JP →C and SD → P. – BCNF decomposition: CSJDQV and SDP –Problem: Checking JP → C requires a join! Dependency preserving decomposition (Intuitive): – If R is decomposed into X, Y and Z, and we enforce the FDs that hold on X, on Y and on Z, then all FDs that were given to hold on R must also hold. –No join is necessary to enforce original FD

Normalization A relation is normalized if every interesting functional dependency X  A involving attributes in R has the property that X is a key of R. –BCNF –Kent ： “Each attribute must describe [an entity or relationship identified by] the key, the whole key, and nothing but the key.“ No partial dependency No transitive denepdency BCNF Partial dependency Transitive dependency

Example –Schema1 (not normalized): OnOrder1(supplier_id, part_id, quantity, supplier_address) –Schema 2 (Normalized): OnOrder2(supplier_id, part_id, quantity); Supplier(supplier_id, supplier_address);

Example #1 Suppose that a bank associates each customer with his or her home branch. Each branch is in a specific legal jurisdiction( 权限 ). –Is the relation R(customer, branch, jurisdiction) normalized?

Example #1 What are the functional dependencies? –customer  branch –branch  jurisdiction –customer  jurisdiction –Customer is the key, but a functional dependency exists where customer is not involved. –R is not normalized.

Example #2 Suppose that a doctor can work in several hospitals and receives a salary from each one. –Is R(doctor, hospital, salary) normalized?

Example #2 What are the functional dependencies? –doctor, hospital  salary The key is doctor, hospital The relation is normalized.

Example #3 Same relation R(doctor, hospital, salary) and we add the attribute primary_home_address. Each doctor has a primary home address and several doctors can have the same primary home address. Is R(doctor, hospital, salary, primary_home_address) normalized?

Example #3 What are the functional dependencies? –doctor, hospital  salary –doctor  primary_home_address –doctor, hospital  primary_home_address The key is “doctor, hospital”, however doctor (a subset) determines one attribute, hence not normalized A normalized decomposition would be: –R1(doctor, hospital, salary) –R2(doctor, primary_home_address)

Normal Forms Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed! If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help.

Normal forms Universe of relations 1 NF 2NF 3NF BCNF 4NF 5NF

Practical Schema Design Identify entities in the application (e.g., doctors, hospitals, suppliers). Each entity has attributes (an hospital has an address, a juridiction, …). There are two constraints on attributes: –An attribute cannot have attribute of its own. –The entity associated with an attribute must functionally determine that attribute.

Practical Schema Design Each entity becomes a relation To those relations, add relations that reflect relationships between entities. –Worksin (doctor_ID, hospital_ID) Identify the functional dependencies among all attributes and check that the schema is normalized: –If functional dependency AB  C holds, then ABC should be part of the same relation.

Vertical Partitioning Three attributes: account_ID, balance, address. Functional dependencies: –account_ID  balance –account_ID  address Two normalized schema design: –(account_ID, balance, address) or –(account_ID, balance) –(account_ID, address) Which design is better?

Vertical Partitioning Which design is better depends on the query pattern: –The application that sends a monthly statement is the principal user of the address of the owner of an account –The balance is updated or examined several times a day. The second schema might be better because the relation (account_ID, balance) can be made smaller: –More account_ID, balance pairs fit in memory, thus increasing the hit ratio –A scan performs better because there are fewer pages.

Tuning Normalization Rule of thumb: –A single normalized relation XYZ is better than two normalized relations XY and XZ if X, Y and Z are frequently accessed together –Why? No costly join operations. –The two-relation design is better iff: Users access tend to partition between the two sets Y and Z most of the time Attributes Y or Z have large values (one-third the page size or larger) –Too many accesses across pages

Vertical Partitioning and Scan R (X,Y,Z) –X is an integer –YZ are large strings Scan Query Vertical partitioning exhibits poor performance when XYZ are accessed together. Vertical partitioning provides a speedup if only two of the attributes (XY) are accessed.

Vertical Partitioning and Point Queries R (X,Y,Z) –X is an integer –YZ are large strings A mix of point queries access either XYZ or XY. Vertical partitioning gives a performance advantage if the proportion of queries accessing only XY is greater than 20%. The join is not expensive compared to a simple look-up.

Vertical Antipartitioning Brokers’ bond-buying decisions are based on the price trends of those bonds. The database holds the closing price for the last 3000 trading days, however the 10 most recent trading days are especially important. –(bond_id, issue_date, maturity, …) (bond_id, date, price) Vs. –(bond_id, issue_date, maturity, today_price, …10dayago_price) (bond_id, date, price)

Tuning Denormalization Denormalizing means violating normalization for the sake of performance: –Denormalization speeds up performance when attributes from different normalized relations are often accessed together –Denormalization hurts performance for relations that are often updated.

Denormalizing -- data Settings: lineitem ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT ); region( R_REGIONKEY, R_NAME, R_COMMENT ); nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT,); supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT); –600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers

Denormalizing -- transactions L_REGIONNAME lineitemdenormalized ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME); –600000 rows in lineitemdenormalized –Cold Buffer –Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.

Queries on Normalized vs. Denormalized Schemas Queries: select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, R_NAME from LINEITEM, REGION, SUPPLIER, NATION where L_SUPPKEY = S_SUPPKEY and S_NATIONKEY = N_NATIONKEY and N_REGIONKEY = R_REGIONKEY and R_NAME = 'EUROPE'; select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT, L_REGIONNAME from LINEITEMDENORMALIZED where L_REGIONNAME = 'EUROPE';

Denormalization TPC-H schema Query: find all lineitems whose supplier is in Europe. With a normalized schema this query is a 4-way join. If we denormalize lineitem and introduce the name of the region for each lineitem we obtain a 30% throughput improvement.

Horizontal Decompositions Relation decomposition: Relation is replaced by a collection of relations that are projections. Horizontal Decomposition ： Replace relation by a collection of relations that are selections. –Each new relation has same schema as the original, but a subset of the rows. –Collectively, new relations contain all rows of the original. Typically, the new relations are disjoint.

Horizontal Decompositions (Contd.) Suppose that contracts with value > 10000 are subject to different rules. This means that Contracts with val>10000 will often be queried One way to deal with this is to build a clustered B+ tree index on the val field of Contracts. A second approach is to replace contracts by two new relations: LargeContracts and SmallContracts, with the same attributes Performs like index on such queries, but no index overhead. Can build clustered indexes on other attributes, in addition!

Masking Conceptual Schema Changes The replacement of Contracts by LargeContracts and SmallContracts can be masked by the view. However, queries with the condition val>10000 must be asked wrt LargeContracts for efficient execution: so users concerned with performance have to be aware of the change CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val) AS SELECT * FROM LargeContracts UNION SELECT * FROM SmallContracts

Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the.

Similar presentations

Presentation on theme: "Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the.

Similar presentations

Presentation on theme: "Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the."— Presentation transcript:

Similar presentations

About project

Feedback