Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Design Normal forms & Normalization Compiled by S. Z.

Similar presentations


Presentation on theme: "Database Design Normal forms & Normalization Compiled by S. Z."— Presentation transcript:

1 Database Design Normal forms & Normalization Compiled by S. Z.

2 Learning Objectives Data Integrity Data loss vs data lossless
Problems with bad database design: Incomplete (easy to fix, collect all necessary attributes) redundancy Attribute decomposition Identify key attributes and functional dependencies among attributes Normal forms: 1NF, 2NF, 3NF, BCNF, 4NF What normalization is? How normal forms can be transformed from lower normal forms to higher normal forms What role it plays in the database design process? How normalization and ER modeling are used concurrently to produce a good database design? How some situations require denormalization to generate information efficiently?

3 Table design is more challenging than manipulating tables.

4 Data Integrity Data integrity refers to maintaining and assuring the accuracy and consistency of data over its entire life-cycle, and is a critical aspect to the design, implementation and usage of database system which stores, processes, or retrieves data. Data integrity is the opposite of  Incomplete data, data corruption, which is a form of data loss. Data Redundancy

5 Problems with unnormalized data
Contains redundancy Contains multi-values, non atomic field. Does not have a primary key identified Cause data anomaly Etc.

6 Example of Unnormalized Data

7 The idea is simple! This is a table design problem.

8 Issues of bad database design
Another Example: Company that manages building projects Charges its clients by billing hours spent on each contract Hourly billing rate is dependent on employee’s position Periodically, report is generated that contains information displayed in Table 5.1

9 Issues of bad database design

10 Issues of bad database design

11 Issues of bad database design
Structure of data set in Figure 5.1 does not handle data very well The table structure appears to work; report generated with ease Unfortunately, report may yield different results depending on what data anomaly has occurred

12 How to measure goodness of database design?
In the relational model, methods exist for quantifying how efficient a database is. These classifications are called normal forms (or NF).

13 Normal Form Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. The normal forms are progressive, so to achieve Second Normal Form, the tables must already be in First Normal Form. 2NF is better than 1NF; 3NF is better than 2NF

14 More advanced normal forms
There are now others (BCNF, 4NF, 5NF) that are generally accepted, but 3NF is widely considered to be sufficient for most applications. Is out the scope of 242 course. Will be studied in 342 course. Most tables when reaching 3NF are also in BCNF (Boyce-Codd Normal Form). Highest level of normalization is not always most desirable. Also tradeoffs of various factors need to be considered. For example, tables are sometimes denormalized to yield less I/O which increases processing speed

15 When do you need to check normal forms
Tables mapped from ER model usually meets 3NF, but there is no guarantee. Therefore once you have a database design, with tables mapped from the constructs of ER diagram, you need to check each table for up to 3NF. If a table meets 3NF, usually the table is in the good shape. Otherwise, the table usually needs to be further decomposed.

16 What keys are important?
Candidate keys vs primary keys

17 Dependency and partial dependency
What is dependency? If you look at two attributes (in a table), there are two kinds of relationship. Independent from each other, for example age and state in student table. One depends on the other, or in order words, one decide the other. For example, ssn and age. Age depends on ssn. If you know one’s ssn, you know his/her age. This shows the dependency of age on ssn, which usually is key. Partial dependency (in case when the primary key consists of multiple fields.) Fields within the table are dependent only on part of the primary key

18 Dependency vs. determinant
A determinant is the reversed concept of dependency. A determinant is any attribute (simple or composite) on which some other attribute is fully functionally dependent. Guide to Oracle 10g

19 First normal form (1NF) Applies to every relation
Primary key field identified No multi-valued attributes, no composite attributes, i.e. each attribute is atomic, one value for each attribute. Applies to every relation

20 Another Example Table 2 Title Author1 Author2 ISBN Subject Pages
Publisher Database System Concepts Abraham Silberschatz Henry F. Korth MySQL, Computers 1168 McGraw-Hill Operating System Concepts Computers 944

21 Similar problems with Table 2
This table is not very efficient with storage. This design does not protect data integrity. Third, this table does not scale well.

22 First Normal Form In our Table 2, we have two violations of First Normal Form: First, we have more than one author field, Second, our subject field contains more than one piece of information. With more than one value in a single field, it would be very difficult to search for all books on a given subject.

23 First Normal Table Table 3 Title Author ISBN Subject Pages Publisher
Database System Concepts Abraham Silberschatz MySQL 1168 McGraw-Hill Henry F. Korth Computers Operating System Concepts 944

24 continue We now have two rows for a single book. Additionally, we would be violating the Second Normal Form… A better solution to our problem would be to separate the data into separate tables- an Author table and a Subject table to store our information, removing that information from the Book table:

25 Subject Table Author Table Book Table Subject_ID Subject 1 MySQL 2
Computers Author Table Author_ID Last Name First Name 1 Silberschatz Abraham 2 Korth Henry Book Table ISBN Title Pages Publisher Database System Concepts 1168 McGraw-Hill Operating System Concepts 944

26 Each table has a primary key, used for joining tables together when querying the data. A primary key value must be unique with in the table (no two books can have the same ISBN number), and a primary key is also an index, which speeds up data retrieval based on the primary key. Now to define relationships between the tables

27 Relationships Book_Author Table Book_Subject Table ISBN Author_ID
1 2 ISBN Subject_ID 1 2

28 Second normal form (2NF) Normalization (continued)
In 1NF No partial dependencies

29 Normalization (continued)
Basic procedure for identifying partial dependency: Look at each field that is not part of the composite primary key Make certain you are required to have both parts of the composite field to determine the value of the data element

30 Second Normal Form (2NF)
As the First Normal Form deals with redundancy of data across a horizontal row, Second Normal Form (or 2NF) deals with redundancy of data in vertical columns. The Book Table will be used for the 2NF example

31 2NF Table Publisher Table Book Table Publisher_ID Publisher Name 1
McGraw-Hill Book Table ISBN Title Pages Publisher_ID Database System Concepts 1168 1 Operating System Concepts 944

32 2NF Here we have a one-to-many relationship between the book table and the publisher. A book has only one publisher, and a publisher will publish many books. When we have a one-to-many relationship, we place a foreign key in the Book Table, pointing to the primary key of the Publisher Table. The other requirement for Second Normal Form is that you cannot have any data in a table with a composite key that does not relate to all portions of the composite key.

33 Bad Example studenttable
(student ID, course ID, course Name, grade) In this case, course name depends on courseID only, so called partial dependency. Thus violate 2NF.

34 Third normal form (3NF) Normalization (continued)
In 2NF No transitive dependencies, i.e. the non – primary key attributes should be mutually independent Table is in 3NF when it is in 2NF and there are no transitive dependencies Transitive dependency Field is dependent on another field within the table that is not the primary key field

35 Third Normal Form Third normal form (3NF) requires that there are no functional dependencies of non-key attributes on something other than a candidate key. A table is in 3NF if all of the non-primary key attributes are mutually independent There should not be transitive dependencies

36 Bad Example Studenttable2 (sid, student, state, state governor)
In this case, state depends on sid, while state governor depends on state, through which, depends on sid, so exist a transitive dependency from state governor to sid via state Thus violate 3NF

37 For most business database design purposes, 3NF is as high as we need to go in normalization process
3NF does not deal satisfactorily with the case of a relation with overlapping candidate keys, i.e. multiple composite candidate keys with at least one attribute in common.

38 The Boyce-Codd Normal Form (BCNF)
BCNF requires that the table is 3NF and only determinants are the candidate keys Every determinant in table is a candidate key Has same characteristics as primary key, but for some reason, not chosen to be primary key When table contains only one candidate key, the 3NF and the BCNF are equivalent BCNF can be violated only when table contains more than one candidate key

39 The Boyce-Codd Normal Form (BCNF) (continued)
Most designers consider the BCNF as special case of 3NF, therefore, BCNF sometimes is called 3.5NF. Table can be in 3NF and fails to meet BCNF No partial dependencies, nor does it contain transitive dependencies A nonkey attribute is the determinant of a key attribute

40 The Boyce-Codd Normal Form (BCNF) (continued)

41 The Boyce-Codd Normal Form (BCNF) (continued)

42 The Boyce-Codd Normal Form (BCNF) (continued)

43 If a relational schema is in BCNF then all redundancy based on functional dependency has been removed, although other types of redundancy may still exist.

44 Fourth normal form (4NF)
Table in 3NF may contain multivalued dependencies that produce either numerous null values or redundant data It may be necessary to convert 3NF table to fourth normal form (4NF) by Splitting table to remove multivalued dependencies

45 Fourth Normal Form (4NF)
A relation is in 4NF if it is already in 3NF and has no multi-valued dependencies. Table is in fourth normal form (4NF) when both of the following are true: It is in 3NF Has no multiple sets of independent (will be discussed in later slides) multivalued dependencies (no multiple of multiple, or at most one multiple dependencies). i.e., a record type can contain at most one multi-valued facts about an entity. 4NF is largely academic if tables conform to following two rules: All attributes must be dependent on primary key, but independent of each other No row contains two or more multivalued facts about an entity

46 Multiple Multi-valued dependency
Multiple multi-valued dependencies exist when There are at least three attributes A, B, and C in a relation and For each value of A there is a well-defined set of values for B, and a well-defined set of values for C, but the set of values of B is independent of set C. Every possible combination of the two multi-valued attributes have to be stored in the database thus leading to redundancy and consequent anomalies.

47 Course ID Instructor Textbook CSCI242 Zhang Intro to MYSQL Allison MYSQL Oracle This is ok, by designating all three fields combined to serve as primary key of the table. However, this contain multiple (two in this case) multi-value sets (instructors and textbook, with respect to course ID, respectively).

48 By splitting the above relation into two relations and placing the multi-valued attributes in each table by themselves, we can convert the above to 4NF Course-INST(course-ID, Instructor) Course-TEXT(course-ID, Textbook)

49 Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy: If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent. Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages. Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted. Fourth normal form minimizes such update problems.

50 Independence  We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed. In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct: | EMPLOYEE | SKILL | LANGUAGE | | | | Smith | cook | French | | Smith | type | French | | Smith | type | German | | Smith | type | Greek | Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

51 4.1.2 Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent. We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense: An employee uses certain skills on certain projects. An employee uses certain languages on certain projects. If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

52 Fourth Normal Form (4NF) (continued)

53 Fourth Normal Form (4NF) (continued)

54 Fifth normal form (5NF) Fifth normal form (5NF), also known as project-join normal form (PJ/NF) is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A relation is said to be in the 5NF if and only if every non-trivial join dependency in it is implied by the candidate keys.

55 A join dependency *{A, B, … Z} on R is implied by the candidate key(s) of R if and only if each of A, B, …, Z is a superkey for R.

56 Normal Forms 1NF 2NF 3NF BCNF 4NF 5NF

57 UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent.

58 INTER-RECORD REDUNDANCY
The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundancies. Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy: | EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION | | EMPLOYEE | LOCATION | In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy. Beyond the scope of this course. .

59 7 CONCLUSION While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed: Single-valued vs. multi-valued facts. Dependency on the entire key. Independent vs. dependent facts. The presence of mutual constraints. The presence of non-unique or non-singular representations. And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications. 7 CONCLUSION While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed: Single-valued vs. multi-valued facts. Dependency on the entire key. Independent vs. dependent facts. The presence of mutual constraints. The presence of non-unique or non-singular representations. And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.

60 To judge a person is healthy is probably easier than
to make a person become healthy, to say the least in some cases!! Now we know how to verify whether or not a database design is conforming to normal forms. Still we want to know how to design databases to meet expectations of the the normal forms? There are algorithms for converting a given “bad” database design to increasingly better design.

61 Database Normalization
Database normalization is the process of removing redundant data from your tables in to improve storage efficiency, data integrity, and scalability, through removing un-normalized relationship between the attributes in the same table. Normalization generally involves splitting existing tables into multiple ones, which may need to be re-joined or linked each time when any query involving the multiple tables is issued. If you want to normalize data, normalize at the higher level first, i.e., normalize the table, the meta data of data.

62 History Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form in his paper A Relational Model of Data for Large Shared Data Banks. Codd stated: “There is, in fact, a very simple elimination procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by ‘domains whose elements are atomic (nondecomposable) values.’”

63 Database Tables and Normalization
Step-by-step process used to determine which data elements should be stored in which tables Purpose evaluate and correct table structures minimize data data redundancy without losing information Reduces data anomalies Multiple levels of normalization Works through a series of stages called normal forms: First normal form (1NF) Second normal form (2NF) Third normal form (3NF)

64 The Normalization Process
Each table represents a single subject No data item will be unnecessarily stored in more than one table All attributes in a table are dependent on the primary key

65 The Normalization Process (continued)

66 Conversion to First Normal Form
Repeating group Derives its name from the fact that a group of multiple entries of same type can exist for any single key attribute occurrence Relational table must not contain repeating groups Normalizing table structure will reduce data redundancies Normalization is three-step procedure

67 Conversion to First Normal Form (continued)
Step 1: Eliminate the Repeating Groups Present data in tabular format, where each cell has single value and there are no repeating groups Eliminate repeating groups, eliminate nulls by making sure that each repeating group attribute contains an appropriate data value

68 Conversion to First Normal Form (continued)

69 Conversion to First Normal Form (continued)
Step 2: Identify the Primary Key Primary key must uniquely identify attribute value New key must be composed

70 Conversion to First Normal Form (continued)
Step 3: Identify All Dependencies Dependencies can be depicted with help of a diagram Dependency diagram: Depicts all dependencies found within given table structure Helpful in getting bird’s-eye view of all relationships among table’s attributes Makes it less likely that will overlook an important dependency

71 Conversion to First Normal Form (continued)

72 Conversion to First Normal Form (continued)
First normal form describes tabular format in which: All key attributes are defined There are no repeating groups in the table All attributes are dependent on primary key All relational tables satisfy 1NF requirements Some tables contain partial dependencies Dependencies based on only part of the primary key Sometimes used for performance reasons, but should be used with caution Still subject to data redundancies

73 Conversion to Second Normal Form
Relational database design can be improved by converting the database into second normal form (2NF) Two steps

74 Conversion to Second Normal Form (continued)
Step 1: Write Each Key Component on a Separate Line Write each key component on separate line, then write original (composite) key on last line Each component will become key in new table

75 Conversion to Second Normal Form (continued)
Step 2: Assign Corresponding Dependent Attributes Determine those attributes that are dependent on other attributes At this point, most anomalies have been eliminated

76 Conversion to Second Normal Form (continued)

77 Conversion to Second Normal Form (continued)
Table is in second normal form (2NF) when: It is in 1NF and It includes no partial dependencies: No attribute is dependent on only portion of primary key

78 Conversion to Third Normal Form
Data anomalies created are easily eliminated by completing three steps Step 1: Identify Each New Determinant For every transitive dependency, write its determinant as PK for new table Determinant Any attribute whose value determines other values within a row

79 Conversion to Third Normal Form (continued)
Step 2: Identify the Dependent Attributes Identify attributes dependent on each determinant identified in Step 1 and identify dependency Name table to reflect its contents and function

80 Conversion to Third Normal Form (continued)
Step 3: Remove the Dependent Attributes from Transitive Dependencies Eliminate all dependent attributes in transitive relationship(s) from each of the tables that have such a transitive relationship Draw new dependency diagram to show all tables defined in Steps 1–3 Check new tables as well as tables modified in Step 3 to make sure that each table has determinant and that no table contains inappropriate dependencies

81 Conversion to Third Normal Form (continued)

82 Conversion to Third Normal Form (continued)
A table is in third normal form (3NF) when both of the following are true: It is in 2NF It contains no transitive dependencies

83 Improving the Design Table structures are cleaned up to eliminate troublesome initial partial and transitive dependencies Normalization cannot, by itself, be relied on to make good designs It is valuable because its use helps eliminate data redundancies

84 Improving the Design (continued)
Issues to address in order to produce a good normalized set of tables: Evaluate PK Assignments Evaluate Naming Conventions Refine Attribute Atomicity Identify New Attributes Identify New Relationships Refine Primary Keys as Required for Data Granularity Maintain Historical Accuracy Evaluate Using Derived Attributes

85 Improving the Design (continued)

86 Improving the Design (continued)

87 Surrogate Key Considerations
When primary key is considered to be unsuitable, designers use surrogate keys Data entries in Table 5.3 are inappropriate because they duplicate existing records Yet there has been no violation of either entity integrity or referential integrity

88 Surrogate Key Considerations (continued)

89 Normalization and Database Design
Normalization should be part of design process Make sure that proposed entities meet required normal form before table structures are created Many real-world databases have been improperly designed or burdened with anomalies if improperly modified during course of time You may be asked to redesign and modify existing databases

90 Normalization and Database Design (continued)
ER diagram Provides big picture, or macro view, of an organization’s data requirements and operations Created through an iterative process Identifying relevant entities, their attributes and their relationship Use results to identify additional entities and attributes

91 Normalization and Database Design (continued)
Normalization procedures Focus on characteristics of specific entities Represents micro view of entities within ER diagram Difficult to separate normalization process from ER modeling process Two techniques should be used concurrently

92 Normalization and Database Design (continued)

93 Normalization and Database Design (continued)

94 Normalization and Database Design (continued)

95 Normalization and Database Design (continued)

96 Normalization and Database Design (continued)

97 Denormalization Creation of normalized relations is important database design goal Processing requirements should also be a goal If tables decomposed to conform to normalization requirements: Number of database tables expands

98 Denormalization (continued)
Joining the larger number of tables takes additional input/output (I/O) operations and processing logic, thereby reducing system speed Conflicts between design efficiency, information requirements, and processing speed are often resolved through compromises that may include denormalization

99 Denormalization (continued)
Unnormalized tables in production database tend to suffer from these defects: Data updates are less efficient because programs that read and update tables must deal with larger tables Indexing is more cumbersome Unnormalized tables yield no simple strategies for creating virtual tables known as views

100 Denormalization (continued)
Use denormalization cautiously Understand why—under some circumstances—unnormalized tables are better choice

101 Summary Normalization is technique used to design tables in which data redundancies are minimized First three normal forms (1NF, 2NF, and 3NF) are most commonly encountered Table is in 1NF when all key attributes are defined and when all remaining attributes are dependent on primary key

102 Summary (continued) Table is in 2NF when it is in 1NF and contains no partial dependencies Table is in 3NF when it is in 2NF and contains no transitive dependencies Table that is not in 3NF may be split into new tables until all of the tables meet 3NF requirements Normalization is important part—but only part—of design process

103 Summary (continued)

104 Summary (continued)

105 Summary (continued)

106 References


Download ppt "Database Design Normal forms & Normalization Compiled by S. Z."

Similar presentations


Ads by Google