Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,

Similar presentations


Presentation on theme: "1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,"— Presentation transcript:

1 1 Database Theory and Methodology

2 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design, other than our intuition. Bad database design –Redundant Information –Update anomalies –Wasting storage with null values –Generation of invalid data during joins Good database design –Easy to explain its semantics –No redundancy (or at least reduced redundancy) –No information loss during joins We need formal concepts to define “goodness” and “badness” of relational schemas.

3 3 The Evils of Redundancy Redundancy is at the root of several problems associated with relational schemas: – Redundant storage – Insertion anomalies – Deletion anomalies – Update anomalies

4 4 Example: Bad Design FLIGHTS flt# date airline plane# DL242 10/23/00 Delta k-yo-33297 DL242 10/24/00 Delta t-up-73356 DL242 10/25/00 Delta o-ge-98722 AA121 10/24/00 American p-rw-84663 AA121 10/25/00 American q-yg-98237 AA411 10/22/00 American h-fe-65748 redundancy: airline name repeated for the same flight inconsistency: when airline name for a flight changes, it must be changed in many places

5 5 Bad Database Design insertion anomalies: how do we represent that SK912 is flown by Scandinavian without there being a date and a plane assigned? deletion anomalies: cancelling AA411 on 10/22/00 makes us lose that it is flown by American. update anomalies: if DL242 is flown by Sabena, we must change it everywhere. FLIGHTS flt# date airline plane# DL242 10/23/00 Delta k-yo-33297 DL242 10/24/00 Delta t-up-73356 DL242 10/25/00 Delta o-ge-98722 AA121 10/24/00 American p-rw-84663 AA121 10/25/00 American q-yg-98237 AA411 10/22/00 American h-fe-65748

6 6 Bad Database Design - decomposition FLIGHTS flt# date airline plane# DL242 10/23/00 Delta k-yo-33297 DL242 10/24/00 Delta t-up-73356 DL242 10/25/00 Delta o-ge-98722 AA121 10/24/00 American p-rw-84663 AA121 10/25/00 American q-yg-98237 AA411 10/22/00 American h-fe-65748 FLIGHTS-AIRLINE flt# airline DL242 Delta AA121 American AA411 American DATE-AIRLINE-PLANE date airline plane# 10/23/00 Delta k-yo-33297 10/24/00 Delta t-up-73356 10/25/00 Delta o-ge-98722 10/24/00 American p-rw-84663 10/25/00 American q-yg-98237 10/22/00 American h-fe-65748

7 7 Bad Database Design - information loss FLIGHTS flt# date airline plane# DL242 10/23/00 Delta k-yo-33297 DL242 10/24/00 Delta t-up-73356 DL242 10/25/00 Delta o-ge-98722 AA121 10/24/00 American p-rw-84663 AA121 10/25/00 American q-yg-98237 AA211 10/22/00 American h-fe-65748 AA411 10/24/00 American p-rw-84663 AA411 10/25/00 American q-yg-98237 AA411 10/22/00 American h-fe-65748 DATE-AIRLINE-PLANE date airline plane# 10/23/00 Delta k-yo-33297 10/24/00 Delta t-up-73356 10/25/00 Delta o-ge-98722 10/24/00 American p-rw-84663 10/25/00 American q-yg-98237 10/22/00 American h-fe-65748 FLIGHTS-AIRLINE flt# airline DL242 Delta AA121 American AA411 American information loss: we polluted the database with false facts; we can’t find the true facts.

8 8 Good Database Design no redundancy of FACT (!) no inconsistency no insertion, deletion or update anomalies no information loss FLIGHTS-DATE-PLANE flt# date plane# DL242 10/23/00 k-yo-33297 DL242 10/24/00 t-up-73356 DL242 10/25/00 o-ge-98722 AA121 10/24/00 p-rw-84663 AA121 10/25/00 q-yg-98237 AA411 10/22/00 h-fe-65748 FLIGHTS-AIRLINE flt# airline DL242 Delta AA121 American AA411 American

9 9 Goal — Devise a Theory for the Following Decide whether a particular relation R is in “good” form. In the case that a relation R is not in “good” form, decompose it into a set of relations {R 1, R 2,..., R n } such that –each relation is in good form –the decomposition is a lossless-join decomposition Our theory is based on: –functional dependencies –multivalued dependencies

10 10 Functional Dependencies (FDs) A functional dependency is a constraint between two sets of attributes from the database. Definition: A functional dependency, denoted by X  Y, holds over relation R if, for every allowable instance r of R: – t1  r, t2  r, t1.X = t2.X implies t1.Y =t2.Y – i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.) An FD is a statement about all allowable relations. – Must be identified based on semantics of application. – Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R! If X  Y in R, this does not say whether or not Y  X.

11 11 Functional Dependencies and Keys A key constraint is a special case of an FD. Note the following: –If X is a candidate key, X  Y for any subset of attributes Y of R. –X  Y does not require that the set X be minimal; the additional minimality condition must be met for X to be a key. –If X  Y holds where Y is the set of all attributes in R, and there is some subset V of X s.t. V  Y, then X is a superkey.

12 12 Example: Constraints on Entity Set Consider relation obtained from Hourly_Emps: – Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked) Notation: We will denote this relation schema by listing the attributes: SNLRWH – This is really the set of attributes {S,N,L,R,W,H}. – Sometimes, we will refer to all attributes of a relation by using the relation name. (e.g., Hourly_Emps for SNLRWH) Some FDs on Hourly_Emps: – ssn is the key: S  SNLRWH – rating determines hrly_wages: R  W

13 13 Example (Contd.) Problems due to R  W : – Update anomaly: Can we change W in just the 1st tuple of SNLRWH? – Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating? – Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! Will two smaller tables be better? Wages Hourly_Emps2

14 14 Closure of a set of FDs Given some FDs, we can usually infer additional FDs: – ssn  did, did  lot implies ssn  lot – mgr_ssn  mgr_phone, dept_no  mgr_ssn implies dept_no  mgr_phone An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold. Formally, the set of all dependencies that include F as well as all dependencies that can be inferred from F is called the closure of F; it is denoted by F +.

15 15 Inference Rules for FDs Armstrong’s Axioms (X, Y, Z are sets of attributes): – Reflexivity: If Y  X then X  Y. – Augmentation: If X  Y, then XZ  YZ for any Z. – Transitivity: If X  Y and Y  Z, then X  Z. These are sound and complete inference rules for FDs! – By sound we mean that they generate only FDs in F + when applied to a set F of FDs. – By complete we mean that repeated application of these rules will generate all FDs in the closure of F +.

16 16 Derived inference rules Union: if X  Y and X  Z, then X  YZ. Decomposition: if X  YZ, then X  Y and X  Z. Pseudotransitivity: if X  Y and WY  Z, then WX  Z. These additional rules are not essential; their soundness can be proved using Armstrong’s Axioms. Exercise: Prove rules Union, Decomposition and Pseudotransitivity using A.A.

17 17 Example Consider the following schema: Contracts(cid,sid,jid,did,pid,qty,value) – C is the key: C  CSJDPQV – Project purchases each part using single contract: JP  C – Dept purchases at most one part from a supplier: SD  P JP  C, C  CSJDPQV imply JP  CSJDPQV SD  P implies SDJ  JP SDJ  JP, JP  CSJDPQV imply SDJ  CSJDPQV –We can infer several additional FDs that are in the closure by using augmentation or decomposition. e.g. C  CSJDPQV implies C  C, C  S,C  J, C  D, C  P, C  Q, C  V. –We have also a number of trivial FDs from reflexivity rule.

18 18 Reasoning About FDs Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!) Typically, we just want to check if a given FD X  Y is in the closure of a set of FDs F. An efficient check: 1. Compute attribute closure of X (denoted X + ) wrt F: Set of all attributes A such that X  A is in F +. There is a linear time algorithm to compute this. 2. Check if Y is in X +. Does F = {A  B, B  C, CD  E } imply A  E? – i.e, is A  E in the closure F + ? (Equivalently, is E in X + )?

19 19 Attribute Closure X + = X; repeat { oldX + = X + ; for each functional dependency Y  Z in F do if Y  X + then X + = X +  Z; until X + = oldX +

20 20 Example Given the following set of FDs: F= {Ssn  Ename, Pnumber  {Pname, Plocation}, {Ssn,Pnumber}  Hours } The attribute closures: {Ssn} + = {Ssn, Ename} {Pnumber} + = {Pnumber, Pname,Plocation} {Ssn,Pnumber} + = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

21 21 Example R = (A, B, C, G, H, I) F = {A  B A  C CG  H CG  I B  H} (AG) + 1. result = AG. 2. result = ABCG(A  C and A  B) 3. result = ABCGH(CG  H and CG  AGBC) 4.result = ABCGHI(CG  I and CG  AGBCH) Is AG a candidate key? 1. Is AG a super key? Does AG  R? == Is (AG) +  R 2. Is any subset of AG a superkey? Does A  R? == Is (A) +  R Does G  R? == Is (G) +  R

22 22 Finding a Key for a Relation Algorithm: Finding a Key K for R, given a set of FDs 1.Set K = R. 2.For each attribute A in K { Compute (K-A) + wrt F; If (K-A) + contains all the attributes in R, then set K = K – {A} };

23 23 Example Consider the following schema and the set of FDs: R = (Ssn, Pnumber, Ename, Pname, Plocation, Hours) F = {Ssn  Ename, Pnumber  {Pname, Plocation}, {Ssn,Pnumber}  Hours} The Key is {Ssn,Pnumber}, since {Ssn,Pnumber} + = {Ssn, Pnumber, Ename, Pname, Plocation, Hours}

24 24 Equivalence of two sets of FDs Definition: A set of FDs F covers another set of FDs E if every FD in E is also in F +. Definition: F and E are equivalent if F + = E + (i.e. F covers E and E covers F). We can determine whether F covers E by calculating X + with respect to F for each FD X  Y in E, and then checking whether this X + includes the attributes in Y. If this is the case for every FD in E, then F covers E. F E F+F+

25 25 Minimal Cover Sets of functional dependencies may have redundant dependencies that can be inferred from the others –Eg: A  C is redundant in: {A  B, B  C, A  C} –Parts of a functional dependency may be redundant E.g. on RHS: {A  B, B  C, A  CD} can be simplified to {A  B, B  C, A  D} E.g. on LHS: {A  B, B  C, AC  D} can be simplified to {A  B, B  C, A  D} Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies

26 26 Minimal Sets of FDs A set of FDs F is minimal if: Every dependency in F has a single attribute for its right-hand side. We can’t replace any dependency X  A in F with a dependency Y  A where Y  X and still have a set of dependencies equivalent with F. –This ensures that there are no redundancies by having redundant attributes on the left-hand side of a dependency. We can’t remove any dependency from F and still have a set of dependencies equivalent with F. –This ensures that there are no redundancies by having a dependency that can be inferred from the remaining FDs in F.

27 27 Minimal Cover of a set of FDs A minimal cover of a set of FDs E is a minimal set of dependencies F that is equivalent to E. There can be several minimal covers for a set of FDs. Algorithm: Finding Minimal Cover F for a set of Fds E. 1. Set F = E. 2. Replace each Fd X  {A1,..., An} in F by the n Fds X  A1,..., X  An. 3. For each Fd X  A in F For each attribute B  X If {{F - {X  A}}  {(X - {B})  A}}  F Then replace X  A with (X - {B})  A in F. 4. For each remaining Fd X  A in F If {F - {X  A}}  F, then remove X  A from F.

28 28 Examples E = { A  B, ABCD  E, EF  GH, ACDF  EG } has the following minimal cover: F = {A  B, ACD  E, EF  G, EF  H } E = {Ssn  Ename, Pnumber  {Pname, Plocation}, {Ssn,Pnumber}  Hours}} F= {Ssn  Ename, Pnumber  Pname, Pnumber  Plocation, {Ssn,Pnumber}  Hours}

29 29 Normal Forms Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed! If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. This can be used to help us decide whether decomposing the relation will help. Role of FDs in detecting redundancy: – Consider a relation R with 3 attributes, ABC. No FDs hold: There is no redundancy here. Given A  B: Several tuples could have the same A value, and if so, they’ll all have the same B value!

30 30 Normalization Normalization of data is a process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of –minimizing redundancy –minimizing update, insert and delete anomalies. The process of normalization through decomposition must also confirm two additional properties: 1.Lossless join property (nonadditive join property) 2.Dependency preservation property

31 31 NF 2 1NF 2NF 3NF BCNF Overview of NFs

32 32 Normal Forms NF 2 : non-first normal form. 1NF: R is in 1NF iff all domain values are atomic. 2NF: R is in 2. NF iff R is in 1NF and every non-key attribute is fully dependent on the key. 3NF: R is in 3NF iff R is 2NF and every nonkey attribute is non- transitively dependent on the key BCNF: R is in BCNF iff every determinant is a candidate key –Determinant: an attribute on which some other attribute(s) is fully functionally dependent.

33 33 1NF: Atomic Values FLT-SCHEDULE flt# weekday airline dtime from atime to DL242 MO WE FR DELTA 10:40 ATL 12:30 BOS SK912 SA SU SAS 12:00 CPH 15:30 JFK AA242 MO FR AA 08:00 CHI 10:10 ATL Attributes must be defined over domains with atomic values FLT-SCHEDULE flt# weekday airline dtime from atime to DL242 MO DELTA 10:40 ATL 12:30 BOS SK912 SA SAS 12:00 CPH 15:30 JFK AA242 MO AA 08:00 CHI 10:10 ATL DL242 WE DELTA 10:40 ATL 12:30 BOS DL242 FR DELTA 10:40 ATL 12:30 BOS SK912 SU SAS 12:00 CPH 15:30 JFK AA242 FR AA 08:00 CHI 10:10 ATL

34 34 2NF: Full Functional Dependency An FD X  Y is a full functional dependency if any attribute A  X, X – {A} does not functionally determine Y. An FD X  Y is a partial functional dependency if some attribute A  X, (X – {A})  Y. E.g. {ssn, proj_num}  hours is a full FD. {ssn, proj_num}  ename is a partial FD. Definition: A relation schema R is in 2NF if R is in 1NF and every nonprime attribute A is fully functionally dependent on the primary key.

35 35 Third Normal Form (3NF) Relation R with FDs F is in 3NF if, for all X  A in F + – A  X (called a trivial FD), or – X contains a key for R, or – A is part of some key for R. Minimality of a key is crucial in third condition above! If R is in 3NF, obviously in 2NF. If R is in 3NF, some redundancy is still possible. It is a compromise, used when BCNF not achievable (e.g., no ``good’’ decomposition, or performance considerations). – Lossless-join, dependency-preserving decomposition of R into a collection of 3NF relations always possible.

36 36 What Does 3NF Achieve? If 3NF is violated by X  A, one of the following holds: – X is a subset of some key K We store (X, A) pairs redundantly. – X is not a proper subset of any key. There is a chain of FDs K  X  A, which means that we cannot associate an X value with a K value unless we also associate an A value with an X value. But: even if relation is in 3NF, these problems could arise. – e.g., Reserves SBDC, S  C, C  S is in 3NF, but for each reservation of sailor S, same (S, C) pair is stored.

37 37 Boyce-Codd Normal Form (BCNF) Reln R with FDs F is in BCNF if, for all X  A in F + – A X (called a trivial FD), or – X contains a key for R. In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints. – No dependency in R that can be predicted using FDs alone. – If we are shown two tuples that agree upon the X value, we cannot infer the A value in one tuple from the A value in the other. – If example relation is in BCNF, then 2 tuples must be identical (since X is a key).


Download ppt "1 Database Theory and Methodology. 2 The Good and the Bad So far we have not developed any measure of “goodness” to measure the quality of the design,"

Similar presentations


Ads by Google