Data Modeling and Database Design

Data storage is a critical component of most information systems; some people consider it to be the critical component. This chapter covers the design and construction of both logical and physical databases. The logical part is often called data modeling. Although the textbook (chapter 7) treats data design as part of the design phase, logical data modeling more often takes place in the analysis phase. In this class, however, we discuss both logical and physical database design this week because of time constraints and the structure of the textbook.

Overview of Database Design
Capture users’ “views” of data
  Forms, screens, reports
Data modeling (entity-based ERD)
Add keys, attributes, relationships (key-based and fully described ERD)
Consolidate data models into one big ERD
Determine database schema (database design)

The first step is to develop a data model for the system being replaced by capturing the views of the users. The views include the forms, screens, or reports that hold meaningful data for the particular users. Next, a new conceptual data model is built that includes all the requirements of the new system: add, delete, and modify the entities in the “as-is” data views. Users have different views from each other, yet they share some data entities. Consolidate the views into one big ER diagram by taking, so to speak, the least common multiple: join all the entities and delete the redundancies. In the design stage, the conceptual data model is translated into a physical design by specifying data characteristics such as data type, field length, null or not null, default values, and referential integrity.

Data Modeling
Data modeling is a technique for organizing and documenting a system’s data. Data modeling is sometimes called database modeling because a data model is usually implemented as a database. A couple of weeks back, you were introduced to activities that called for drawing process models. Process models and data models are important parts of system modeling in structured analysis and design, and system models play an important role in systems development. Many experts consider data modeling to be the most important of the modeling techniques.

Why is data modeling considered crucial?
Data is viewed as a resource to be shared by as many processes as possible. As a result, data must be organized in a way that is flexible and adaptable to unanticipated business requirements, and that is the purpose of data modeling.
Data structures and properties are reasonably permanent, certainly a great deal more stable than the processes that use the data. Often the data model of a current system is nearly identical to that of the desired system.
Data models are much smaller than process and object models and can be constructed more rapidly.
The process of constructing data models helps analysts and users quickly reach consensus on business terminology and rules.

06.5.18 Figure 6-1 An Entity Relationship Data Model
The diagram in the figure above makes the following business assertions: We need to store data about CUSTOMERs, ORDERs, and PRODUCTs. A CUSTOMER places zero, one, or more ORDERs. An ORDER is placed by exactly one CUSTOMER. An ORDER sells one or more PRODUCTs. Thus, an ORDER must contain at least one PRODUCT. A PRODUCT may have been sold on zero, one, or more ORDERs.

System Concepts
Systems thinking is the application of formal systems theory and concepts to systems problem solving. An ERD depicts data in terms of the entities and relationships described by the data. Most people can learn the techniques of systems analysis. But if they understand the underlying concepts, they can adapt the techniques to ever-changing problems and conditions. They can also improve upon the techniques, recognize the advantages of new techniques, and see opportunities to integrate different techniques. Therein lies your true opportunity for competitive advantage and security in today’s business world. If you understand theory, concepts, and techniques, you will be able to do more than just solve textbook exercises: you will be able to solve real-world problems and command the premium salaries paid to today’s best problem solvers.

There are several notations for ERDs. Most are named after their inventor (e.g., Chen, Martin, Bachman, Merise) or after a published standard (e.g., IDEF1X). These data modeling “languages” generally support the same fundamental concepts and constructs. We have adopted the “Crowsfoot” notation because of its widespread use and Visio support.

Entities
An entity is something about which we want to store data. More precisely, an entity is a class of persons, places, objects, events, or concepts about which we need to capture and store data. An entity instance is a single occurrence of an entity.

Consider a school system. A school system includes data that describes things such as STUDENTs, TEACHERs, COURSEs, and CLASSROOMs. For any of these things, it is not difficult to imagine some of the data that describes a given instance of the thing. For example, the data that describes a particular STUDENT might include name, address, phone number, date of birth, gender, race, major, and grade point average, to name but a few data items.

Examples of entities include:
Persons: AGENCY, CONTRACTOR, CUSTOMER, DEPARTMENT, DIVISION, EMPLOYEE, INSTRUCTOR, OFFICE, STUDENT, SUPPLIER. Notice that a person entity can represent individuals, groups, or organizations.
Places: SALES REGION, BUILDING, ROOM, BRANCH OFFICE, CAMPUS.
Objects: BOOK, MACHINE, PART, PRODUCT, RAW MATERIAL, SOFTWARE LICENSE, SOFTWARE PACKAGE, TOOL, VEHICLE MODEL, VEHICLE. An object entity can represent actual objects (such as SOFTWARE LICENSE) or specifications for a type of object (such as SOFTWARE PACKAGE).
Events: APPLICATION, AWARD, CANCELLATION, CLASS, FLIGHT, INVOICE, ORDER, REGISTRATION, RENEWAL, REQUISITION, RESERVATION, SALE, TRIP.
Concepts: ACCOUNT, BLOCK OF TIME, BOND, COURSE, FUND, QUALIFICATION, STOCK.

The entity STUDENT may have multiple instances: Mary, Joe, Mark, Susan, Deborah, and so forth. In data modeling, we do not concern ourselves with individual STUDENTs because we recognize that each STUDENT is described by similar pieces of data.

Attributes
An attribute is a descriptive property or characteristic of an entity. Synonyms include element, property, and field. As noted in the discussion of entities, each instance of the entity STUDENT might be described by the following attributes: NAME, ADDRESS, PHONE NUMBER, DATE OF BIRTH, GENDER, RACE, MAJOR, GRADE POINT AVERAGE, and others.

Some attributes can be logically grouped into super-attributes called compound attributes. A compound attribute is one that actually consists of more primitive attributes. Synonyms in different data modeling languages are numerous: concatenated attribute, composite attribute, and data structure. For example, a STUDENT’s NAME is actually a compound attribute that consists of LAST NAME, FIRST NAME, and MIDDLE INITIAL.

Attributes - Identification
Every entity must have an identifier or key. A key is an attribute, or a group of attributes, that assumes a unique value for each entity instance. For example, each instance of the entity STUDENT might be uniquely identified by the key STUDENT ID NUMBER. No two STUDENTs can have the same STUDENT ID NUMBER.

Sometimes more than one attribute is required to uniquely identify an instance of an entity. A group of attributes that uniquely identifies an instance of an entity is called a concatenated key. Synonyms include composite key and compound key. For example, each TAPE entity instance in a video store might be uniquely identified by the concatenation of TITLE NUMBER plus COPY NUMBER. TITLE NUMBER by itself would be inadequate because we may own many copies of a single title. COPY NUMBER by itself would also be inadequate since we would have a copy #1 for every single title we own. We need both pieces of data to identify a specific tape (e.g., copy #7 of Da Vinci Code).
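The TAPE example can be tried out in a few lines. The following is only a sketch, assuming Python's built-in sqlite3 module and hypothetical table and column names (tape, title_number, copy_number) that do not come from the textbook; it shows that neither attribute is unique on its own, while the pair is.

```python
import sqlite3

# Hypothetical names, loosely based on the TAPE example in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tape (
        title_number INTEGER NOT NULL,
        copy_number  INTEGER NOT NULL,
        PRIMARY KEY (title_number, copy_number)  -- concatenated key
    )
    """
)

# Two copies of one title, and copy #1 of another title, are all distinct.
conn.execute("INSERT INTO tape VALUES (100, 1)")
conn.execute("INSERT INTO tape VALUES (100, 2)")
conn.execute("INSERT INTO tape VALUES (200, 1)")

# Repeating an existing (title, copy) pair violates the key.
try:
    conn.execute("INSERT INTO tape VALUES (100, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

tape_count = conn.execute("SELECT COUNT(*) FROM tape").fetchone()[0]
print(duplicate_rejected, tape_count)
```

Title 100 appears twice and copy number 1 appears twice, yet every row is still uniquely identified by the combination.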

Relationships
A relationship is a natural business association that exists between one or more entities. The relationship may represent an event that links the entities, or merely a logical affinity that exists between the entities. A connecting line between two entities on an ERD represents a relationship. A verb phrase describes the relationship. All relationships are implicitly bidirectional, meaning that they can be interpreted in both directions.

Consider, for example, the entities ORDER and PRODUCT. We can make the following business assertions that link ORDERs and PRODUCTs:
An ORDER CONTAINS one or more PRODUCTs.
A PRODUCT IS CONTAINED BY zero, one, or more ORDERs.

The verb phrases (CONTAINS, IS CONTAINED BY) define business relationships that exist between the two entities. Because an ORDER can contain many PRODUCTs, and a PRODUCT can be contained in many ORDERs, this is often called a many-to-many relationship.

Figure 6-2 A Relationship (Many-to-Many)
The figure above also shows the complexity or degree of each relationship. For the business assertions above, we must also answer the following questions:
Must there exist an instance of ORDER for each instance of PRODUCT? No!
Must there exist an instance of PRODUCT for each instance of ORDER? Yes!
How many instances of PRODUCT can exist for each instance of ORDER? Many!
How many instances of ORDER can exist for each instance of PRODUCT? Many!

Cardinality
Each relationship on an ERD also depicts its complexity or degree, which is called cardinality. Cardinality defines the minimum and maximum number of occurrences of one entity for a single occurrence of the related entity. Because all relationships are bidirectional, cardinality must be defined in both directions for every relationship.

Figure 6-3 Cardinality Notations
Note that in Visio, the optional (0) is represented as a white circle, not a black one. Also, if the cardinality is minimum one and maximum one (exactly one), there are two vertical lines across the relationship (not just one line).

Conceptually, cardinality tells us the following rules about the data we want to store:
When we insert an ORDER instance in the database, we must link (associate) that ORDER to at least one instance of PRODUCT. In business terms, “an order cannot be filed without ordering a product.”
An ORDER can contain more than one PRODUCT, and we must be able to store data that indicates all PRODUCTs for a given ORDER.
We must insert a PRODUCT before we can link (associate) ORDERs to that PRODUCT. That is why a PRODUCT can have zero ORDERs: no orders have yet been placed for that PRODUCT.
Once a PRODUCT has been inserted into the database, we can link (associate) many ORDERs with that PRODUCT.

Foreign Keys
A relationship implies that instances of one entity are related to instances of another entity. To be able to identify those instances for any given entity, the primary key of one entity must be migrated into the other entity as a foreign key. A foreign key is a primary key of one entity that is contributed to (duplicated in) another entity for the purpose of identifying instances of a relationship. A foreign key (always in a child entity) always matches the primary key (in a parent entity).

For example, consider a relationship between the entities CUSTOMER and ORDER.
An ORDER is placed by exactly one CUSTOMER. For an ORDER, which CUSTOMER placed it?
A CUSTOMER places zero, one, or more ORDERs. For a CUSTOMER, which ORDERs has that CUSTOMER placed?

Figure 6-4 How to Show Foreign Keys
In the figure above, we demonstrate the concept of foreign keys with our simple data model. In this case, CUSTOMER is called the parent entity and ORDER is the child entity. The primary key is always contributed by the parent to the child as a foreign key. Thus, an instance of ORDER now has a foreign key CustomerID whose value points to the correct instance of CUSTOMER that placed that order. (Foreign keys are never contributed from child to parent.)

A relationship in which you cannot differentiate between parent and child is called a non-specific relationship. A non-specific relationship (or many-to-many relationship) is one in which many instances of one entity are associated with many instances of another entity. Such relationships are suitable only for preliminary data models and should be resolved as quickly as possible. All non-specific relationships can be resolved into a pair of one-to-many relationships by inserting an associative entity between the two original entities.
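The parent-to-child migration of a key can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module; the table names (customer, customer_order) and the sample rows are hypothetical, and customer_order is used only because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE customer (               -- parent entity
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (         -- child entity
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL      -- foreign key contributed by the parent
                    REFERENCES customer (customer_id)
    );
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Mary')")
conn.execute("INSERT INTO customer_order VALUES (10, 1)")
conn.execute("INSERT INTO customer_order VALUES (11, 1)")

# For an ORDER, which CUSTOMER placed it?
placed_by = conn.execute(
    """
    SELECT c.name
    FROM customer_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    WHERE o.order_id = ?
    """,
    (10,),
).fetchone()[0]

# For a CUSTOMER, which ORDERs were placed?
orders = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM customer_order WHERE customer_id = ? ORDER BY order_id",
        (1,),
    )
]

# An order that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO customer_order VALUES (12, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(placed_by, orders, orphan_rejected)
```

Both questions from the text are answered through the single foreign key column in the child; the parent table carries no reference to its orders.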

Figure 6-5 Resolving Nonspecific Relationships with an Associative Entity (panels a and b)
In figure (a) above, we see that a PRODUCT is contained in zero, one, or more ORDERs. At the same time, we see that an ORDER contains one or more PRODUCTs. The maximum cardinality on both sides is “many”. So, which is the parent and which is the child? You can’t tell! This is called a non-specific relationship. All non-specific relationships can be resolved into a pair of one-to-many relationships. As illustrated in figure (b) above, each entity becomes a parent. A new, associative entity is introduced as the child of each parent. Each instance of ORDERED_PRODUCT represents one ORDER’s placement of one PRODUCT. If an order places two products, that order will have two instances of the entity ORDERED_PRODUCT.
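Resolving a many-to-many relationship can be sketched as three tables. This is a minimal illustration, assuming Python's built-in sqlite3 module; the column names, the quantity attribute, and the sample rows are hypothetical and are not taken from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce foreign keys

conn.executescript(
    """
    CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY);
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL
    );
    -- Associative entity: child of both parents, one row per product on an order.
    CREATE TABLE ordered_product (
        order_id   INTEGER NOT NULL REFERENCES customer_order (order_id),
        product_id INTEGER NOT NULL REFERENCES product (product_id),
        quantity   INTEGER NOT NULL DEFAULT 1,   -- hypothetical non-key attribute
        PRIMARY KEY (order_id, product_id)
    );
    """
)

conn.executemany("INSERT INTO customer_order VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO product VALUES (?, ?)", [(10, "CD"), (20, "DVD")])

# Order 1 places two products, so it gets two ordered_product rows.
conn.executemany(
    "INSERT INTO ordered_product (order_id, product_id) VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 10)],
)

rows_for_order_1 = conn.execute(
    "SELECT COUNT(*) FROM ordered_product WHERE order_id = 1"
).fetchone()[0]
orders_for_product_10 = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM ordered_product WHERE product_id = 10 ORDER BY order_id"
    )
]
print(rows_for_order_1, orders_for_product_10)
```

Each original entity is now the "one" side of a one-to-many relationship, and the "many" on both sides lives in the associative table.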

Data Modeling During Systems Analysis
A key-based data model will be drawn first. A fully attributed data model will be constructed along with the process of analysis and design. Each attribute is defined in the repository with data types, domains, and defaults. The completed data model represents all of the business requirements for a system’s database.

How to Construct Data Models
1st Step - Entity Discovery
The first task is to discover the fundamental entities in the system. There are several techniques that may be used to identify entities:
During interviews or JAD sessions with system owners and users, pay attention to key words in their discussion.
During interviews or JAD sessions, specifically ask the system owners and users to identify things about which they would like to capture, store, and produce information.
Study existing forms and files.
Use CASE tools to reverse-engineer existing files and databases into physical data models.

For example, during an interview with an individual discussing PlayItAgain’s business environment and activities, a user may state that “We have to keep track of all our customers and the many clubs in which they are enrolled.” Notice that the key words in this statement are CUSTOMERs and CLUBs. Both are entities!

Another technique for identifying entities is to study existing forms and files. Some forms identify event entities. Examples include ORDERs, REQUISITIONs, PAYMENTs, DEPOSITs, and so forth. But most of these same forms also contain data that describe other entities. Consider a registration form used in your school’s course registration system. A REGISTRATION is itself an event entity. But the average registration form also contains data that describe other entities, such as STUDENT (a person), COURSEs (which are concepts), INSTRUCTORs (other persons), ADVISOR (yet another person), DIVISIONs (another concept), and so forth. These same entities could also be discovered by studying the computerized registration system’s files, databases, or outputs.

Some CASE tools can reverse-engineer existing files and databases into physical data models. The analyst must usually clean up the resulting model by replacing physical names, codes, and comments with their logical, business-friendly equivalents.

1st Step - Entity Discovery
An entity has multiple instances. Entities should be named with nouns that describe the person, event, place, or intangible thing about which we want to store data. Names may include appropriate adjectives or clauses to better describe the entity; for instance, an externally generated CUSTOMER ORDER must be distinguished from an internally generated PURCHASE ORDER. Try not to abbreviate or use acronyms. Names should be singular so as to distinguish the logical concept of the entity from the actual instances of the entity.

Define each entity in business terms. Don’t define the entity in technical terms, and don’t define it as “data about …”. Your entity names and definitions should establish an initial glossary of business terminology that will serve both you and future analysts and users for years to come.

2nd Step - The Context Data Model
The second task in data modeling is to construct the context data model. The context data model includes the fundamental or independent entities that were previously discovered. An independent entity is one that exists regardless of the existence of any other entity. Its primary key contains no attributes that would make it dependent on the existence of another entity. Independent entities are almost always the first entities discovered in your conversations with the users.

Relationships should be named with verb phrases that, when combined with the entity names, form simple business sentences or assertions. If only one-way naming is used, always name the relationship from parent to child. Some CASE tools, such as Visible Analyst, let you name the relationships in both directions.

Figure 6-6 The PlayItAgain Context Data Model
The ERD communicates the following:
A MEMBER belongs to one or more CATEGORYs. A CATEGORY enrolls zero, one, or more MEMBERs. Why zero? The club may be new.
Each month or quarter, a CATEGORY sponsors zero, one, or more PROMOTIONs. Why zero? Again, a club may be just starting and not yet offering promotions. A PROMOTION is sponsored by exactly one CATEGORY.
Each PROMOTION features exactly one PRODUCT. A PRODUCT is featured in zero, one, or more PROMOTIONs. For example, a CD that appeals to both country/western and light rock audiences might be featured in the promotion for both. Since products greatly outnumber promotions, most products are never featured in a promotion.
A PROMOTION generates many MEMBER ORDERs. These are dated orders to which a member must reply by the specified date; otherwise the order is filled (sound familiar? Or not anymore?). The promotion always generates more than one order; in fact, it generates one order per member. A MEMBER ORDER is generated for zero or one PROMOTION. Why zero? In the desired system, a member can initiate their own order.

It is permissible for more than one relationship to exist between the same two entities if the separate relationships communicate different business events or associations. Thus:
A MEMBER responds to zero, one, or more MEMBER ORDERs. This relationship supports the promotion-generated orders.
A MEMBER places zero, one, or more MEMBER ORDERs. This relationship supports member-initiated orders.
In both cases, a MEMBER ORDER is placed by (is responded to by) exactly one MEMBER.

A MEMBER ORDER sells one or more PRODUCTs. A PRODUCT is sold on zero, one, or more MEMBER ORDERs.

3rd Step - The Key-Based Data Model
The third task is to identify the keys of each entity. If you cannot define keys for an entity, it may be that the entity doesn’t really exist; that is, multiple occurrences of the so-called entity do not exist.

The following guidelines are suggested for keys:
The value of a key should not change over the lifetime of each entity instance. NAME would be a poor key since a person’s last name could change by marriage or divorce.
The value of a key cannot be null.
Controls must be installed to ensure that the value of a key is valid.
Some experts suggest that you avoid intelligent keys because the key may change over the lifetime of the entity instance. An intelligent key is a business code whose structure communicates data about an entity instance (such as its classification, size, or other properties). A code is a group of characters and/or digits that identifies and describes something in the business system. Other experts suggest that you use intelligent keys because business codes can return value to the organization, and they can be quickly processed by humans without the assistance of a computer.
Consider inventing a surrogate key to substitute for large concatenated keys of independent entities. This suggestion is not practical for associative entities, since each part of the concatenated key is a foreign key that must precisely match its parent entity’s primary key.
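The surrogate-key guideline can be made concrete with a short sketch. It assumes Python's built-in sqlite3 module and an entirely hypothetical SUPPLIER entity whose natural key spans three columns; none of these names come from the PlayItAgain case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE supplier (
        supplier_id         INTEGER PRIMARY KEY,  -- surrogate key assigned by the DBMS
        country_code        TEXT NOT NULL,
        region_code         TEXT NOT NULL,
        registration_number TEXT NOT NULL,
        name                TEXT NOT NULL,
        -- The large natural key is no longer the primary key, but stays unique.
        UNIQUE (country_code, region_code, registration_number)
    )
    """
)

insert_sql = (
    "INSERT INTO supplier (country_code, region_code, registration_number, name) "
    "VALUES (?, ?, ?, ?)"
)
first_id = conn.execute(insert_sql, ("US", "IN", "000123", "Acme")).lastrowid

# Descriptive data can change without touching the key value.
conn.execute("UPDATE supplier SET name = ? WHERE supplier_id = ?", ("Acme Music", first_id))
name_after_update = conn.execute(
    "SELECT name FROM supplier WHERE supplier_id = ?", (first_id,)
).fetchone()[0]

# The natural key is still protected against duplicates.
try:
    conn.execute(insert_sql, ("US", "IN", "000123", "Duplicate"))
    natural_duplicate_rejected = False
except sqlite3.IntegrityError:
    natural_duplicate_rejected = True

print(first_id, name_after_update, natural_duplicate_rejected)
```

Child entities would then carry the single supplier_id column as their foreign key instead of all three natural-key columns.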

Figure 6-7 The PlayItAgain Key-Based Data Model
The figure above is the key-based data model for the PlayItAgain project. We have eliminated all non-specific relationships by resolving them into associative entities and one-to-many relationships (as described earlier). We call your attention to the following noteworthy items:
Many entities have a simple, single-attribute primary key (second row).
Notice how the primary key for PROMOTION was constructed. It is a concatenated key. Part of that key is inherited from the parent entities CATEGORY and PRODUCT. You can tell because CategoryName and ProductID are also foreign keys (FK). When one entity contributes its key to another entity across a relationship, the relationship is said to be identifying, because it helps to identify the child entity.
We resolved the non-specific relationships, e.g., between ORDER and PRODUCT, by introducing the associative entity PRODUCT_ORDERED. Each associative entity instance represents one product on one order. The parent entities contributed their own primary keys to make up the associative entity’s concatenated key. Also notice that each attribute in that concatenated key is a foreign key that points back to the correct parent instance.

All relationships contribute foreign keys from parent to child. You just learned that if the contributed foreign key helps to uniquely identify instances of the child entity, the relationship is said to be identifying. On the other hand, if the foreign key plays no role in identifying instances of the child entity, then it is recorded as non-key data in our model. Its only purpose is to point to a child entity’s specific parent. For example, MemberID in the MEMBER ORDER entity serves only to point to the correct MEMBER entity instance for an order. In this case, the relationship is called non-identifying.

4th Step - Generalized Hierarchies
At this time, it would be useful to identify any generalization hierarchies in a business problem.

5th Step - The Fully Attributed Data Model
The fifth task is to identify the remaining data attributes.

6th Step - The Fully Described Model
The last task is to fully describe the data model. Most CASE tools provide extensive facilities for describing the data types, domains, and defaults for all attributes to the repository.

It is advisable to have a fully described data model at this stage. For one, data models are more stable than process models, so there are likely to be fewer changes to the data structure itself. Another reason is that a deeper understanding of the business data makes it easier to (physically) design other aspects of the system, namely input forms and output reports. Nonetheless, it is a time-consuming process, and given the time we have left, I will not require fully described data models. You should still have a firm grasp of the data structure.

Conventional Files Versus the Database
Files are collections of similar records.
Databases are collections of interrelated files.
Discuss the advantages and disadvantages of each.
Compare and contrast.

A database is not merely a collection of files. The records in each file must allow for relationships (think of them as “pointers”) to the records in other files. For example, a SALES database might contain ORDER records that are somehow “linked” to their corresponding CUSTOMER and PRODUCT records. The database is not necessarily dependent on the applications that will use it. In other words, given a database, new applications can be built to share that database.

Each environment has its advantages and disadvantages. Conventional files are relatively easy to design and implement because they are normally based on a single application or information system. Historically, another advantage of conventional files has been processing speed. Duplication of data items in multiple files is normally cited as the principal disadvantage of file-based systems. Another significant disadvantage of files is their inflexibility and non-scalability.

The principal advantage of a database is the ability to share the same data across multiple applications and systems. Database technology offers the advantage of storing data in flexible formats. Databases allow the use of the data in ways not originally specified by the end users; this is called data independence. The database scope can even be extended without impacting existing programs that use it.

Databases also have disadvantages. Database technology is more complex than file technology. A DBMS is still somewhat slower than file technology. Database technology requires a significant investment. In order to achieve the benefits of database technology, analysts and database specialists must adhere to rigorous design principles. Another potential problem with the database approach is the increased vulnerability inherent in the use of shared data.

Databases
A database management system (DBMS) is specialized computer software available from computer vendors that is used to create, access, control, and manage the database. Data becomes a business resource in a database environment. Information systems are built around this resource to give both computer programmers and end users flexible access to data.

Operational databases: These systems were (and are) developed over time to replace the conventional files that used to support the applications. Access to these databases is limited to computer programs that use the DBMS to process transactions, maintain the data, and generate regularly scheduled management reports. Some query access may also be provided.

Figure 6-8: A Typical Modern Data Architecture
The figure above illustrates the data architecture into which many companies have evolved. As shown in the figure, most companies still have numerous conventional file-based information system applications, most of which were developed prior to the emergence of high-performance database technology. In many cases, the processing efficiency of these files, or the projected cost of redesigning them, has slowed conversion of the systems to databases.

Many information systems shops hesitate to give end users access to operational databases, because the volume of unscheduled reports and queries could overload the computers and hamper business operations. To remedy that problem, data warehouses were developed. Data warehouses store data that is extracted from the production databases and conventional files. Fourth-generation programming languages, query tools, and decision support tools are then used to generate reports and analyses from these data warehouses.

The database management system is purchased from a database technology vendor such as Oracle, IBM, Microsoft, or Sybase.

A Simple Logical Data Model
Figure 6-9: A Simple, Logical Data Model

Figure 6-10: A Simple, Physical Database Schema

The Database Schema
Data type. Each DBMS supports different data types, and different terms for those data types. For example, different systems may designate a large alphanumeric field differently (e.g., MEMO in Access and LONG VARCHAR in Oracle). Also, some databases allow the choice of no compression versus compression of unused space (e.g., CHAR versus VARCHAR in Oracle).
Size of the field. Different DBMSs express the precision of real numbers differently. For example, in Oracle, a size specification of NUMBER(3,2) supports a range from -9.99 to 9.99.
Required or not required (NULL or NOT NULL). Must the field have a value before the record can be committed to storage? Again, different DBMSs may require different reserved words to express this property. Primary keys can never be allowed to have null values.
Domains. Many DBMSs can automatically edit data to ensure that fields contain legal data. This can be a great benefit to ensuring data integrity independent of the application programs: if the programmer makes a mistake, the DBMS catches it. But for DBMSs that support data integrity, the rules must be precisely specified in a language that is understood by the DBMS.
Default. Many DBMSs allow a default value to be automatically set in the event that a user or programmer submits a record without a value.
Referential integrity. An integrity constraint specifying that the value (or existence) of an attribute in one relation depends on the value (or existence) of the same attribute in another relation. Referential integrity is specified in the form of deletion rules:
  No restriction. Any record in the table may be deleted without regard to any records in any other tables.
  Delete:Cascade. A deletion of a record in the table must be automatically followed by the deletion of matching records in a related table.
  Delete:Restrict. A deletion of a record in the table must be disallowed until any matching records are deleted from a related table.
  Delete:Set Null. A deletion of a record in the table must be automatically followed by setting any matching keys in a related table to the value NULL.
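The three restrictive deletion rules can be compared side by side. The sketch below assumes Python's built-in sqlite3 module, whose SQL dialect spells the rules ON DELETE CASCADE, ON DELETE RESTRICT, and ON DELETE SET NULL; other DBMSs may use different reserved words. The parent and child table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores the rules without this

conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY);

    CREATE TABLE child_cascade (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent (id) ON DELETE CASCADE
    );
    CREATE TABLE child_set_null (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent (id) ON DELETE SET NULL
    );
    CREATE TABLE child_restrict (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent (id) ON DELETE RESTRICT
    );

    INSERT INTO parent VALUES (1), (2);
    INSERT INTO child_cascade  VALUES (1, 1);  -- points at parent 1
    INSERT INTO child_set_null VALUES (1, 1);  -- points at parent 1
    INSERT INTO child_restrict VALUES (1, 2);  -- points at parent 2
    """
)

# Deleting parent 1 removes the cascade child and nulls the set-null child's key.
conn.execute("DELETE FROM parent WHERE id = 1")
cascade_rows_left = conn.execute("SELECT COUNT(*) FROM child_cascade").fetchone()[0]
set_null_value = conn.execute("SELECT parent_id FROM child_set_null").fetchone()[0]

# Deleting parent 2 is disallowed while a restricting child still matches it.
try:
    conn.execute("DELETE FROM parent WHERE id = 2")
    restrict_blocked = False
except sqlite3.IntegrityError:
    restrict_blocked = True

parents_left = [row[0] for row in conn.execute("SELECT id FROM parent")]
print(cascade_rows_left, set_null_value, restrict_blocked, parents_left)
```

Note that the set-null rule only works here because parent_id in child_set_null is allowed to be NULL; a NOT NULL foreign key could not use it.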

Other Considerations
Denormalization
Choices of storage formats
File organizations
Backup/recovery
Security of data

Denormalization
The process of splitting or combining normalized relations into physical tables based on affinity of use of rows and fields.

Design Goals
Efficient use of secondary storage (disk space)
  Disks are divided into units that can be read in one machine operation.
  Space is used most efficiently when the physical length of a table row divides close to evenly into the storage unit.
Efficient data processing
  Data are most efficiently processed when stored next to each other in secondary memory.

File Organization
A technique for physically arranging the records of a file.
Sequential: The rows in the file are stored in sequence according to a primary key value.
Indexed: The rows are stored either sequentially or nonsequentially, and an index is created that allows software to locate individual rows.
Hashed: The address for each row is determined using an algorithm.
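The three file organizations can be mimicked with in-memory Python lists. This is only a toy sketch under stated assumptions: the rows, the bucket count, and the modulo hashing algorithm are invented for illustration, and a real DBMS works with disk pages rather than lists.

```python
from bisect import bisect_left

# Rows as (primary key, data), in arrival order. All values are made up.
arrival_order = [(317, "Mark"), (105, "Mary"), (442, "Susan"), (230, "Joe")]

# Sequential: rows are stored in primary key order, so a lookup can
# binary-search the keys.
sequential_file = sorted(arrival_order)
sequential_keys = [key for key, _ in sequential_file]

def find_sequential(key):
    position = bisect_left(sequential_keys, key)
    if position == len(sequential_keys) or sequential_keys[position] != key:
        raise KeyError(key)
    return sequential_file[position][1]

# Indexed: rows stay in arrival order; a separate index maps key -> position.
index = {key: position for position, (key, _) in enumerate(arrival_order)}

def find_indexed(key):
    return arrival_order[index[key]][1]

# Hashed: an algorithm turns the key into the address (bucket) of the row.
BUCKET_COUNT = 4
buckets = [[] for _ in range(BUCKET_COUNT)]

def address(key):
    return key % BUCKET_COUNT  # assumed hashing algorithm, for illustration only

for key, data in arrival_order:
    buckets[address(key)].append((key, data))  # colliding keys share a bucket

def find_hashed(key):
    for stored_key, data in buckets[address(key)]:
        if stored_key == key:
            return data
    raise KeyError(key)

print(find_sequential(230), find_indexed(442), find_hashed(105), address(317))
```

Keys 317 and 105 hash to the same bucket in this example, which is why each bucket holds a small list rather than a single row.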
