1 Physical level: The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.

Logical level: The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data. The logical level thus describes the entire database in terms of a small number of relatively simple structures. Although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. Database administrators, who must decide what information to keep in the database, use the logical level of abstraction.

View level: The highest level of abstraction describes only part of the entire database. Even though the logical level uses simpler structures, complexity remains because of the variety of information stored in a large database. Many users of the database system do not need all this information; instead, they need to access only a part of the database. The view level of abstraction exists to simplify their interaction with the system. The system may provide many views for the same database.
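
A minimal sketch of the logical and view levels using SQLite from Python (the physical level, pages and B-trees on disk, stays hidden inside the engine); the customer table, its columns, and the view are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical level: what data are stored and how they relate.
conn.execute("""CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY,
    name    TEXT,
    city    TEXT,
    balance REAL)""")

# View level: a clerk sees only part of the database -- no balances.
conn.execute("""CREATE VIEW customer_contact AS
    SELECT cust_id, name, city FROM customer""")

# The physical level (how pages and index structures are laid out on
# disk) is managed entirely by SQLite and is invisible at both levels above.
```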

2 An architecture for a database system

3 Similar to types and variables in programming languages:

Schema – the logical structure of the database. Example: the database consists of information about a set of customers and accounts and the relationship between them. Analogous to the type information of a variable in a program. Physical schema: database design at the physical level. Logical schema: database design at the logical level.

Instance – the actual content of the database at a particular point in time. Analogous to the value of a variable.

Physical data independence – the ability to modify the physical schema without changing the logical schema. Applications depend on the logical schema. In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
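
The type/value analogy in miniature, as a hedged Python sketch; the Account record and the snapshot values are invented:

```python
from dataclasses import dataclass

# Schema ~ type information: the structure, fixed at design time.
@dataclass
class Account:
    account_number: str
    balance: float

# Instance ~ the value of a variable: the actual content at one point in time.
snapshot = [Account("A-101", 500.0), Account("A-102", 400.0)]
```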

4 Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to changes in the definition and organization of data. Physical data independence deals with hiding the details of the storage structure from user applications. Data independence and operation independence together give the feature of data abstraction. There are two levels of data independence.

5 First level: The logical structure of the data is known as the schema definition. In general, if a user application operates on a subset of the attributes of a relation, it should not be affected later when new attributes are added to the same relation. Logical data independence indicates that the conceptual schema can be changed without affecting the existing external schemas.

6 Second level: The physical structure of the data is referred to as the "physical data description". Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues, since, conceptually, there is no difference in the operations carried out against the data. There are two types of data independence:

Logical data independence: the ability to change the logical (conceptual) schema without changing the external schemas (user views). For example, the addition or removal of entities, attributes, or relationships in the conceptual schema should be possible without having to change existing external schemas or having to rewrite existing application programs.

Physical data independence: the ability to change the physical schema without changing the logical schema. For example, a change to the internal schema, such as using a different file organization, storage structure, storage device, or indexing strategy, should be possible without having to change the conceptual or external schemas.

View-level data independence: the view level is always independent, since no level exists above it.

7 We know that the three view levels are described by means of three schemas, which are stored in the data dictionary. In a DBMS, each user refers only to its own external schema. Hence, the DBMS must transform a request on a specified external schema into a request against the conceptual schema, and then into a request against the internal schema, to store and retrieve data to and from the database. The process of converting a request (and its result) between view levels is called mapping. The mapping defines the correspondence between the three view levels, and its description is also stored in the data dictionary. The DBMS is responsible for mapping between these three types of schemas. There are two types of mapping: (i) external-conceptual mapping and (ii) conceptual-internal mapping.

8 External-conceptual: An external-conceptual mapping defines the correspondence between a particular external view and the conceptual view. It tells the DBMS which objects on the conceptual level correspond to the objects requested on a particular user's external view. If changes are made to either an external view or the conceptual view, the mapping must be changed accordingly.

Conceptual-internal: The conceptual-internal mapping defines the correspondence between the conceptual view and the internal view, i.e. the database stored on the physical storage device. It describes how conceptual records are stored and retrieved to and from the storage device, i.e. how the conceptual records are physically represented. If the structure of the stored database is changed, the mapping must be changed accordingly. It is the responsibility of the DBA to manage such changes.

9 DBMS (Database Management System) acts as an interface between the user and the database. The user requests the DBMS to perform various operations (insertion, deletion, update, and retrieval) on the database. The components of the DBMS perform these requested operations on the database and provide the necessary data to the users.

10 [Figure: system architecture diagram; the only surviving label is "(web browser)"]

11

12

13 Database Users and User Interfaces

Naive users: unsophisticated users who interact with the system by invoking one of the application programs that have been written previously. End users or naive users use the database system through a menu-oriented application program, where the type and range of responses is always displayed on the screen. The user need not be aware of the presence of the database system and is instructed through each step. A user of an ATM falls into this category.

Application programmers: computer professionals who write application programs. Application programmers can choose from many tools to develop user interfaces. Rapid application development (RAD) tools are used to construct forms and reports with minimal programming effort.

14 Database Users and User Interfaces

Sophisticated users: interact with the system without writing programs. Instead, they form their requests in a database query language. They submit each such query to a query processor, whose function is to break down DML statements into instructions that the storage manager understands. Analysts who submit queries to explore data in the database fall into this category.

Specialized users: sophisticated users who write specialized database applications that do not fit into the traditional data-processing framework. Among these applications are computer-aided design systems, knowledge-based and expert systems that store data with complex data types (graphics and audio data), and environment-modeling systems.

15 The Query Processor

DDL interpreter: interprets DDL statements and records the definitions in the data dictionary.

DML compiler: translates DML statements in a query language into an evaluation plan consisting of low-level instructions that the query evaluation engine understands. A query can usually be translated into any of a number of alternative evaluation plans that all give the same result. The DML compiler also performs query optimization; that is, it picks the lowest-cost evaluation plan from among the alternatives.

Query evaluation engine: executes low-level instructions generated by the DML compiler.
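
One concrete way to watch a DML compiler choose an evaluation plan is SQLite's EXPLAIN QUERY PLAN, sketched below; the table is invented, and the exact plan text varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance REAL)")

# Ask for the chosen evaluation plan rather than the query's result.
# SQLite reports whether it will scan the table or use an index.
query = "SELECT * FROM account WHERE account_number = 'A-101'"
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)   # e.g. (..., "SEARCH account USING INDEX sqlite_autoindex_account_1 ...")
```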

16 Storage Manager

A storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible for the interaction with the file manager: it translates the various DML statements into low-level file-system commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in the database.

Authorization and integrity manager: tests for the satisfaction of integrity constraints and checks the authority of users to access data.

Transaction manager: ensures that the database remains in a consistent state despite system failures, and that concurrent transaction executions proceed without conflicting.

File manager: manages the allocation of space on disk storage and the data structures used to represent information stored on disk.

Buffer manager: is responsible for fetching data from disk storage into main memory and deciding what data to cache in main memory.

17 A collection of tools for describing data, data relationships, data semantics, and data constraints. Models include: the relational model; the entity-relationship data model (mainly for database design); object-based data models (object-oriented and object-relational); the semistructured data model (XML); and other, older models: the network model and the hierarchical model.

18 (RDBMS – relational database management system) A database based on the relational model developed by E. F. Codd. A relational database allows the definition of data structures, storage and retrieval operations, and integrity constraints. In such a database the data and the relations between them are organized in tables. A table is a collection of records, and each record in a table contains the same fields.

Properties of relational tables: values are atomic; each row is unique; column values are of the same kind; the sequence of columns is insignificant; the sequence of rows is insignificant; each column has a unique name.

Certain fields may be designated as keys, which means that searches for specific values of that field will use indexing to speed them up.
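
A sketch of such a table in SQLite, invoked from Python; the account relation and its columns are invented for illustration. The PRIMARY KEY column is a designated key, so searches on it are served by an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
    account_number TEXT PRIMARY KEY, -- designated key: each row unique
    branch_name    TEXT,             -- column values of the same kind
    balance        REAL              -- atomic values, typed columns
)""")
conn.execute("INSERT INTO account VALUES ('A-101', 'Downtown', 500.0)")
# Lookup by the key field can use the primary-key index.
print(conn.execute(
    "SELECT balance FROM account WHERE account_number = 'A-101'").fetchone())
```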

19 Models an enterprise as a collection of entities and relationships Entity: a thing or object in the enterprise that is distinguishable from other objects Described by a set of attributes Relationship: an association among several entities Represented diagrammatically by an entity-relationship diagram:

20 According to Rao (1994), "The object-oriented database (OODB) paradigm is the combination of object-oriented programming language (OOPL) systems and persistent systems. The power of the OODB comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs." Object DBMSs add database functionality to object programming languages. They bring much more than persistent storage of programming language objects: object DBMSs extend the semantics of the C++, Smalltalk, and Java object programming languages to provide full-featured database programming capability. In contrast to a relational DBMS, where a complex data structure must be flattened out to fit into tables or joined together from those tables to form the in-memory structure, object DBMSs have no performance overhead to store or retrieve a web or hierarchy of interrelated objects. This one-to-one mapping of object programming language objects to database objects has two benefits over other storage approaches: it provides higher-performance management of objects, and it enables better management of the complex interrelationships between objects. This makes object DBMSs better suited to support applications such as financial portfolio risk analysis systems, telecommunications service applications, World Wide Web document structures, design and manufacturing systems, and hospital patient record systems, which have complex relationships between data.

21 Object/relational database management systems (ORDBMSs) add new object storage capabilities to the relational systems at the core of modern information systems. These new facilities integrate management of traditional fielded data, complex objects such as time-series and geospatial data and diverse binary media such as audio, video, images, and applets. By encapsulating methods with data structures, an ORDBMS server can execute complex analytical and data manipulation operations to search and transform multimedia and other complex objects.

22 In the semistructured data model, the information that is normally associated with a schema is contained within the data, which is sometimes called "self-describing". In such a database there is no clear separation between the data and the schema, and the degree to which it is structured depends on the application. In some forms of semistructured data there is no separate schema; in others it exists but only places loose constraints on the data. Semistructured data is naturally modeled in terms of graphs whose labels give semantics to the underlying structure. Semistructured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it as semistructured for the purposes of browsing.

23 The popularity of the network data model coincided with the popularity of the hierarchical data model. Some data were more naturally modeled with more than one parent per child, so the network model permitted the modeling of many-to-many relationships in data. In 1971, the Conference on Data Systems Languages (CODASYL) formally defined the network model. The basic data modeling construct in the network model is the set construct. A set consists of an owner record type, a set name, and a member record type. A member record type can have that role in more than one set, hence the multiparent concept is supported. An owner record type can also be a member or owner in another set. The data model is a simple network, and link and intersection record types may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pairwise sets; in each set one record type is the owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted. The CODASYL network model is based on mathematical set theory.

24 The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child data segments. This structure implies that a record can have repeating information, generally in the child data segments. Data is stored in a series of records, each of which has a set of field values attached to it. All instances of a specific record are collected together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent-child relationships: 1:N mappings between record types, implemented using trees (a notion "borrowed" from mathematics, much as the relational model borrowed set theory). For example, an organization might store information about an employee, such as name, employee number, department, and salary. The organization might also store information about an employee's children, such as name and date of birth. The employee and children data form a hierarchy, where the employee data represents the parent segment and the children data represents the child segment. If an employee has three children, then there would be three child segments associated with one employee segment. In a hierarchical database the parent-child relationship is one-to-many; this restricts a child segment to having only one parent segment. Hierarchical DBMSs were popular from the late 1960s, with the introduction of IBM's Information Management System (IMS) DBMS, through the 1970s.

25 Active database: An active database is a database that includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering, and authorization.

Cloud database: A cloud database is a database that relies on cloud technology. Both the database and most of its DBMS reside remotely, "in the cloud," while its applications are developed by programmers and later maintained and used by the application's end users through a web browser and open APIs. More and more such database products are emerging, both from new vendors and from virtually all established database vendors.

26 Data warehouse: Data warehouses archive data from operational databases and often from external sources such as market research firms. Often operational data undergo transformation on their way into the warehouse: they are summarized, reclassified, etc. The warehouse becomes the central source of data for use by managers and other end users who may not have access to operational data. Some basic and essential components of data warehousing include retrieving, analyzing, and mining data, and transforming, loading, and managing data so as to make them available for further use.

Distributed database: The term typically refers to a modular DBMS architecture that allows distinct DBMS instances to cooperate as a single DBMS over processes, computers, and sites, while managing a single database that is itself distributed over multiple computers and sites.

27 Document-oriented database: A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are one of the main categories of so-called NoSQL databases, and the popularity of the term "document-oriented database" (or "document store") has grown with the use of the term NoSQL itself.

Embedded database: An embedded database system is a DBMS which is tightly integrated with application software that requires access to stored data, in such a way that the DBMS is hidden from the application's end user and requires little or no ongoing maintenance. It is actually a broad technology category that includes DBMSs with differing properties and target markets. The term "embedded database" can be confusing because only a small subset of embedded database products is used in real-time embedded systems such as telecommunications switches and consumer electronics devices.

28 End-user database: These databases consist of data developed by individual end users. Examples are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support such databases. Some of them are much simpler than full-fledged DBMSs, with more elementary DBMS functionality (e.g., not supporting multiple concurrent end users on the same database), basic programming interfaces, and a relatively small footprint (not much code to run, compared with "regular" general-purpose databases).

Federated database and multi-database: A federated database is an integrated database that comprises several distinct databases, each with its own DBMS. It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (which makes it a heterogeneous database), and provides them with an integrated conceptual view. The constituent databases are interconnected via a computer network and may be geographically decentralized.

29 Graph database Hypermedia databases Hypertext database In-memory database Knowledge base Operational database Parallel database Real-time database Spatial database Temporal database Unstructured-data database

30 So far we have studied the DBMS at the level of the logical model. The logical model of a database system is the correct level for database users to focus on: the goal of a database system is to simplify and facilitate access to data. As members of the development staff and as potential database administrators, we need to understand the physical level better than a typical user does.

Overview of Physical Storage Media. Storage media are classified by speed of access, cost per unit of data, and reliability. Unfortunately, as speed and cost go up, reliability goes down.

1. Cache is the fastest and most costly form of storage. The type of cache referred to here is the type typically built into the CPU chip, of 256KB, 512KB, or 1MB. Cache is used by the operating system and has no application to the database per se.

2. Main memory is the volatile memory in the computer system that is used to hold programs and data. While prices have been dropping at a staggering rate, demand for memory has been increasing faster. Today's 32-bit computers are limited to 4GB of memory, which may not be sufficient to hold the entire database and all the associated programs; still, the more memory is available, the better the response time of the DBMS. There are attempts underway to create cost-effective systems with as much memory as possible and to reduce the functionality of the operating system so that only the DBMS is supported, in order to improve system response. However, the contents of main memory are lost if a power failure or system crash occurs.

3. Flash memory is also referred to as electrically erasable programmable read-only memory (EEPROM). Since it is small (5 to 10MB) and expensive, it has little or no application to the DBMS.

31 4. Magnetic-disk storage is the primary medium for long-term on-line storage today. Prices have been dropping significantly, with a corresponding increase in capacity; new disks today are in excess of 20GB. Unfortunately, demand has been increasing and the volume of data has been increasing faster, so organizations using a DBMS are always trying to keep up with the demand for storage. This medium is the most cost-effective on-line storage for large databases.

5. Optical storage is very popular, especially CD-ROM systems. This is limited to data that is read-only. It can be reproduced at very low cost and is expected to grow in popularity, especially for replacing written manuals. Recently a new optical format, the digital video disk (DVD), has become standard; these disks hold between 4.7 and 17GB of data. WORM (write once, read many) disks are popular for archival storage of data, since they have a high capacity (about 500MB), a longer lifetime than hard disks, and can be removed from the drive, which makes them good for audit trails (hard to tamper with).

6. Magnetic tape storage is used for backup and archival data. It is cheaper and slower than all the other forms, but there is no limit on the amount of data that can be stored, since more tapes can be purchased. As tapes gain capacity, however, restoration of data takes longer and longer, especially when only a small amount of data is to be restored; this is because retrieval is sequential, the slowest possible method. 8mm tape drives have the highest density; a 350-foot tape stores 5GB of data.

32 Disks are actually relatively simple. There is normally a collection of platters on a spindle. Each platter is coated with a magnetic material on both sides, and the data is stored on the surfaces. There is a read-write head for each surface, mounted on an arm assembly that moves back and forth. A motor spins the platters at a high constant speed (60, 90, or 120 revolutions per second). Each surface is divided into a set of tracks (circles), and the tracks are divided into sectors, the smallest unit of data that can be written or read at one time. Sectors range in size from 32 bytes to 4096 bytes, with 512 bytes being the most common. The collection of a specific track from both surfaces of all the platters is called a cylinder. Platters range in size from 1.8 inches to 14 inches; today, 5 1/4 inches and 3 1/2 inches are the most common, because they combine low seek times with low cost.

A disk controller interfaces the computer system and the actual hardware of the disk drive. The controller accepts high-level commands to read or write sectors and converts them into the necessary specific low-level commands. The controller also attempts to protect the integrity of the data by computing and using checksums for each sector: when reading data back, the controller recalculates the checksum and makes several attempts to correctly read the data and get matching checksums. If the controller is unsuccessful, it notifies the operating system of the failure. The controller can also handle the problem of eliminating bad sectors: should a sector go bad, the controller logically remaps it to one of the extra unused sectors that disk vendors provide, so that the reliability of the disk system is higher. It is cheaper to produce disks with a greater number of sectors than advertised and map out bad sectors than to produce disks with no bad sectors or an extremely limited possibility of sectors going bad.

33 One other characteristic of disks that affects performance is the distance from the read-write head to the surface of the platter. The smaller this gap, the smaller the area in which data can be written, so the tracks can be closer together and the disk has a greater capacity; the distance is often measured in microns. However, a smaller gap increases the possibility of the head touching the surface. When the head touches the surface while it is spinning at high speed, the result is called a "head crash", which scratches the surface and damages the head. The bottom line is that someone must replace the disk.

Storage Access. Seek time is the time to reposition the head; it increases with the distance the head must move. Seek times range from 2 to 30 milliseconds. Average seek time is the average over all seek times and is normally one-third of the worst-case seek time. Rotational latency time is the time from when the head is over the correct track until the data rotates around and is under the head and can be read. At 120 rotations per second, the rotation time is 8.33 milliseconds; the average rotational latency is normally one-half of the rotation time. Access time is the time from when a read or write request is issued to when the data transfer begins: the sum of the seek time and the latency time. Data-transfer rate is the rate at which data can be retrieved from the disk and sent to the controller, measured in megabytes per second. Mean time to failure is the number of hours (on average) until a disk fails; typical times today range from 30,000 to 800,000 hours (about 3.4 to 91 years).
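
The access-time arithmetic above, worked through in a few lines; the 120 rotations/second comes from the slide, while the 12 ms average seek is an assumed figure from the 2-30 ms range given.

```python
# Back-of-envelope access-time arithmetic from the figures above.
avg_seek_ms = 12.0                        # assumed average seek time
rotation_time_ms = 1000.0 / 120           # ~8.33 ms per full rotation
avg_latency_ms = rotation_time_ms / 2     # on average, half a rotation
access_time_ms = avg_seek_ms + avg_latency_ms
print(f"average access time ~= {access_time_ms:.2f} ms")  # ~16.17 ms
```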

34 Redundant Array of Independent (or Inexpensive) Disks: a category of disk drives that employ two or more drives in combination for fault tolerance and performance. RAID disk drives are used frequently on servers but are not generally necessary for personal computers. RAID allows you to store the same data redundantly (in multiple places) in a balanced way to improve overall performance. There are a number of different RAID levels:

Level 0 – Striped disk array without fault tolerance: provides data striping (spreading out blocks of each file across multiple disk drives) but no redundancy. This improves performance but does not deliver fault tolerance; if one drive fails, all data in the array is lost.

Level 1 – Mirroring and duplexing (provides disk mirroring): Level 1 provides twice the read transaction rate of single disks.

Level 2 – Error-correcting coding: not a typical implementation and rarely used; it stripes data at the bit level rather than the block level.

Level 3 – Bit-interleaved parity: provides byte-level striping with a dedicated parity disk. Level 3, which cannot service multiple requests simultaneously, is also rarely used.

Level 4 – Dedicated parity drive: a commonly used implementation of RAID, Level 4 provides block-level striping (like Level 0) with a parity disk. If a data disk fails, the parity data is used to create a replacement disk. A disadvantage of Level 4 is that the parity disk can create write bottlenecks.

35 There are a number of further RAID levels:

Level 5 – Block-interleaved distributed parity: provides block-level data striping, with the parity information also striped across the disks. This results in excellent performance and good fault tolerance. Level 5 is one of the most popular implementations of RAID.

Level 6 – Independent data disks with double parity: provides block-level striping with two sets of parity data distributed across all disks.

Level 0+1 – A mirror of stripes: not one of the original RAID levels; two RAID 0 stripes are created, and a RAID 1 mirror is created over them. Used for both replicating and sharing data among disks.

Level 10 – A stripe of mirrors: not one of the original RAID levels; multiple RAID 1 mirrors are created, and a RAID 0 stripe is created over these.

Level 7 – A trademark of Storage Computer Corporation that adds caching to Levels 3 or 4.

RAID S (also called Parity RAID) – EMC Corporation's proprietary striped-parity RAID system used in its Symmetrix storage systems.

Need for RAID: an array of multiple disks accessed in parallel gives greater throughput than a single disk, and redundant data on multiple disks provides fault tolerance.

36 Each file is partitioned into fixed-length storage units, called blocks, which are the unit of both storage allocation and data transfer. It is desirable to keep as many blocks as possible in main memory. Usually we cannot keep all blocks in main memory, so we need to manage the allocation of available main memory space. We need to use disk storage for the database, and to transfer blocks of data between main memory and disk. We also want to minimize the number of such transfers, as they are time-consuming. The buffer is the part of main memory available for storage of copies of disk blocks.

37 An RDBMS needs to maintain data about the relations, such as the schema. This is stored in a data dictionary (sometimes called a system catalog): names of the relations; names of the attributes of each relation; domains and lengths of attributes; names of views defined on the database, and definitions of those views; integrity constraints; names of authorized users; accounting information about users; number of tuples in each relation; method of storage for each relation (clustered/non-clustered); name of each index; name of the relation being indexed; attributes on which the index is defined; type of index formed.

38 Programs in a DBMS make requests (that is, calls) on the buffer manager when they need a block from disk. If the block is already in the buffer, the requester is passed the address of the block in main memory. If the block is not in the buffer, the buffer manager first allocates space in the buffer for the block, throwing out some other block, if required, to make space for the new one. If the block to be thrown out has been modified, it must first be written back to the disk. The internal actions of the buffer manager are transparent to the programs that issue disk-block requests. The buffer manager must use some sophisticated techniques in order to provide good service:

Replacement strategy. When there is no room left in the buffer, a block must be removed from the buffer before a new one can be read in. Typically, operating systems use a least recently used (LRU) scheme. There is also a most recently used (MRU) scheme that can be more optimal for DBMSs.

Pinned blocks. A block that is not allowed to be written back to disk is said to be pinned. This could be used for data that has not been committed yet.

Forced output of blocks. There are situations in which it is necessary to write a block back to the disk even though the buffer space it occupies is not currently needed; this is called forced output of the block. It is needed because main memory contents are lost in a crash, while disk data usually survives.
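
A toy sketch of LRU replacement with pinned blocks, as described above; the block ids, the capacity, and the simulated disk read are all invented, and no real DBMS buffer manager is this simple.

```python
from collections import OrderedDict

class BufferManager:
    """LRU buffer pool; pinned blocks are never chosen as eviction victims."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.buffer = OrderedDict()          # block_id -> (data, pinned)

    def fetch(self, block_id):
        if block_id in self.buffer:          # hit: mark most recently used
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id][0]
        if len(self.buffer) >= self.capacity:
            for victim, (_, pinned) in self.buffer.items():  # LRU-first order
                if not pinned:               # skip pinned blocks
                    del self.buffer[victim]  # write-back of a dirty block would go here
                    break
        data = f"<contents of block {block_id}>"  # simulated disk read
        self.buffer[block_id] = (data, False)
        return data

bm = BufferManager()
for b in [1, 2, 3, 1, 4]:     # block 2 is least recently used when 4 arrives
    bm.fetch(b)
print(list(bm.buffer))        # [3, 1, 4] -- block 2 was evicted
```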

39 A record is the unit in which data is usually stored. Each record is a collection of related data items, where each item is formed of one or more bytes and corresponds to a particular field of the record. Records usually describe entities and their attributes. A collection of field (item) names and their corresponding data types constitutes a record type. In short, a record type corresponds to an entity type, and a record of a specific type represents an instance of the corresponding entity type.

40 1. A file is organized logically as a sequence of records. 2. Records are mapped onto disk blocks. 3. Files are provided as a basic construct in operating systems, so we assume the existence of an underlying file system. 4. Blocks are of a fixed size determined by the operating system. 5. Record sizes vary. 6. In a relational database, tuples of distinct relations may be of different sizes. 7. One approach to mapping the database to files is to store records of a single length in a given file. 8. An alternative is to structure files to accommodate variable-length records.

41 To understand file organization we have to cover the following points: 1. Fixed-length and variable-length records in files. 2. Fixed-length representation for variable-length records. 3. Allocating records to blocks. 4. File headers. 5. Operations on files: find (or locate), read (or get), find next, delete, modify, insert, find all, reorganize, open, close. These operations depend on the file organization and the access method.

42 Suppose we have a table with the following organization:

type deposit = record
    branch-name : char(22);
    account-number : char(10);
    balance : real;
end

If each character occupies 1 byte and a real occupies 8 bytes, then this record occupies 40 bytes. If the first record occupies the first 40 bytes and the second record occupies the second 40 bytes, etc., we have some problems. It is difficult to delete a record, because there is no way to indicate that the record is deleted. (At least one system automatically adds one byte to each record as a flag to show if the record is deleted.) Unless the block size happens to be a multiple of 40 (which is extremely unlikely), some records will cross block boundaries, and it would require two block accesses to read or write such a record. One solution might be to compress the file after each deletion; this incurs a major amount of overhead processing, especially on larger files, and there is the same problem on inserts. Another solution would be to have two sets of pointers: one that links the current record to the next logical record (a linked list), plus a free list (a list of free slots). This increases the size of the file.
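
A sketch of the 40-byte fixed-length deposit record using Python's struct module; the field widths follow the record declaration above, and the sample values are invented.

```python
import struct

# 22 bytes branch-name + 10 bytes account-number + 8-byte real = 40 bytes.
# "=" disables alignment padding so the sizes add up exactly.
RECORD = struct.Struct("=22s10sd")
print(RECORD.size)   # 40

rec = RECORD.pack(b"Perryridge", b"A-102", 400.0)   # short strings are null-padded
branch, acct, balance = RECORD.unpack(rec)
print(branch.rstrip(b"\x00"), acct.rstrip(b"\x00"), balance)
```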

43 We can use variable-length records for: storage of multiple record types in one file; record types that allow variable lengths for one or more fields; record types that allow repeating fields. A simple method for implementing variable-length records is to attach a special end-of-record symbol at the end of each record. But this has problems: it is not easy to reuse the space formerly occupied by a deleted record, and there is in general no space for records to grow. If a variable-length record is updated and needs more space, it must be moved, which can be very costly. This can be solved by converting a variable-length record into a fixed-length representation, or by using pointers to point to fixed-length records chained together by pointers.

44 Heap file organization: any record can be placed anywhere in the file. There is no ordering of records, and there is a single file for each relation.

Sequential file organization: records are stored in sequential order based on the value of the search key (primary key).

Hashing file organization: any record can be placed anywhere in the file. A hash function is computed on some attribute of each record; the function specifies in which block the record should be placed.

Clustering file organization: several different relations can be stored in the same file. Related records of the different relations can be stored in the same block, so that one I/O operation fetches related records from all the relations.

45 An index is a small table having only two columns. The first column contains a copy of the primary or candidate key of a table, and the second column contains a set of pointers holding the address of the disk block where that particular key value can be found. The advantage of an index is that it makes search operations very fast. Suppose a table has several rows of data, each row 20 bytes wide. If you want to search for record number 100, the management system must read each and every row, and will find record number 100 only after reading 99 x 20 = 1980 bytes. With an index, the management system searches for record number 100 not in the table but in the index. The index, containing only two columns, may be just 4 bytes wide in each of its rows. After reading only 99 x 4 = 396 bytes of data from the index, the management system finds an entry for record number 100, reads the address of the disk block where record number 100 is stored, and points directly at the record in the physical storage device. The result is much quicker access to the record (a speed advantage of 1980:396). The only minor disadvantage of using an index is that it takes up a little more space than the main table alone. Additionally, the index needs to be updated for insertion or deletion of records in the main table. However, the advantages are so huge that these disadvantages can be considered negligible.
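
The slide's arithmetic, reproduced directly:

```python
# Byte counts from the example above.
row_width, index_width, target = 20, 4, 100
without_index = (target - 1) * row_width    # 1980 bytes scanned in the table
with_index = (target - 1) * index_width     # 396 bytes scanned in the index
print(without_index, with_index, without_index / with_index)  # 1980 396 5.0
```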

46 In an ordered index, index entries are stored sorted on the search-key value (e.g., an author catalog in a library). Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file; also called a clustering index. The search key of a primary index is usually, but not necessarily, the primary key. Secondary index: an index whose search key specifies an order different from the sequential order of the file; also called a non-clustering index. Index-sequential file: an ordered sequential file with a primary index.

47

48 In a primary index, there is a one-to-one relationship between the entries in the index table and the records in the main table. A primary index can be of two types: Dense primary index: the number of entries in the index table is the same as the number of entries in the main table. In other words, each and every record in the main table has an entry in the index.

49 Sparse or non-dense primary index: for large tables, the dense primary index itself begins to grow in size. To keep the size of the index smaller, instead of pointing to each and every record in the main table, the index points to records in the main table at intervals. See the following example.

50 It may happen sometimes that we are asked to create an index on a non-unique key, such as Dept-id. There could be several employees in each department. Here we use a clustering index, where all employees belonging to the same Dept-id are considered to be within a single cluster, and the index pointers point to the cluster as a whole.

51 The previous scheme might become a little confusing because one disk block might be shared by records belonging to different clusters. A better scheme is to use separate disk blocks for separate clusters.

52 While creating the index, generally the index table is kept in primary memory (RAM) and the main table, because of its size, is kept in secondary memory (hard disk). Theoretically, a table may contain millions of records (like the telephone directory of a large city), for which even a sparse index becomes so large that we cannot keep it in primary memory. And if we cannot keep the index in primary memory, we lose the advantage of the speed of access. For a very large table, it is better to organize the index in multiple levels. See the following example.

53 We can use tree-like structures as indexes as well. For example, a binary search tree can be used as an index. If we want to find a particular record in a binary search tree, we have the added advantage of the binary search procedure, which makes searching even faster. A binary tree can be considered a 2-way search tree, because it has two pointers in each of its nodes that can guide the search in two distinct directions. Remember that for every node storing 2 pointers, the number of values to be stored in the node is one less than the number of pointers, i.e. each node contains 1 value.

54 M-way search tree: the above-mentioned concept can be further expanded with the notion of the m-way search tree, where m represents the number of pointers in a particular node. If m = 3, then each node of the search tree contains 3 pointers, and each node would then contain 2 values. A sample m-way search tree with m = 3 is given below.

55 Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created; periodic reorganization of the entire file is required. Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions; reorganization of the entire file is not required to maintain performance. (Minor) disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead. The advantages of B+-trees outweigh the disadvantages, and B+-trees are used extensively.

56 Typical node: the K_i are the search-key values; the P_i are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes). The search keys in a node are ordered: K_1 < K_2 < K_3 < ... < K_n-1. Usually the size of a node is that of a block.

57

58 All paths from the root to a leaf are of the same length. Each node that is neither a root nor a leaf has between ⌈n/2⌉ and n children. A leaf node has between ⌈(n-1)/2⌉ and n-1 values. Special cases: if the root is not a leaf, it has at least 2 children; if the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and n-1 values.

59 For i = 1, 2, ..., n-1, pointer P_i either points to a file record with search-key value K_i, or to a bucket of pointers to file records, each record having search-key value K_i. The bucket structure is needed only if the search key does not form a primary key. If L_i and L_j are leaf nodes and i < j, then L_i's search-key values are less than L_j's search-key values. P_n points to the next leaf node in search-key order.

60 Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers: all the search keys in the subtree to which P_1 points are less than K_1; for 2 ≤ i ≤ m-1, all the search keys in the subtree to which P_i points have values greater than or equal to K_i-1 and less than K_i; all the search keys in the subtree to which P_m points have values greater than or equal to K_m-1.

61 Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n = 5). Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5). The root must have at least 2 children.
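
A minimal search-only sketch of a B+-tree with the node layout described above (ordered keys K_1 < ... < K_n-1 and pointers P_1 ... P_n). Insertion and node splitting are omitted, and the small tree is built by hand for illustration.

```python
import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children   # non-leaf: subtree pointers P_1 ... P_m
        self.records = records     # leaf: record pointers, one per key

def search(node, key):
    while node.children is not None:
        # Descend into the subtree whose key range contains `key`:
        # keys < K_1 go to P_1, keys in [K_i-1, K_i) go to P_i, etc.
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]
    return None                    # key not present

leaf1 = Node(keys=[10, 20], records=["r10", "r20"])
leaf2 = Node(keys=[30, 40], records=["r30", "r40"])
root = Node(keys=[30], children=[leaf1, leaf2])
print(search(root, 30))   # r30
print(search(root, 15))   # None
```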

62 A bucket is a unit of storage containing one or more records (a bucket is typically a disk block). In a hash file organization, we obtain the bucket of a record directly from its search-key value using a hash function. A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. The hash function is used to locate records for access, insertion, and deletion. Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

63 Hash file organization of an account file, using branch_name as the key. There are 10 buckets. The binary representation of the i-th character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10. E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3.
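
A sketch of this hash function in Python, with one substitution: the slide treats the i-th alphabet character as the integer i, while the sketch uses Python's ord(), so the computed bucket numbers differ from the slide's h values.

```python
def h(branch_name, buckets=10):
    # Sum the integer codes of the characters, modulo the bucket count.
    # ord() is our stand-in for the slide's character-to-integer mapping.
    return sum(ord(c) for c in branch_name) % buckets

for name in ["Perryridge", "Round Hill", "Brighton"]:
    print(name, "->", h(name))
```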

64 The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file. An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values. An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file. Typical hash functions perform computation on the internal binary representation of the search key. For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets returned.

65 Bucket overflow can occur because of insufficient buckets or skew in the distribution of records. Skew can occur for two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values. Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

66 Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list. The above scheme is called closed hashing. An alternative, called open hashing, does not use overflow buckets and is not suitable for database applications.

67 Hashing can be used not only for file organization, but also for index-structure creation. A hash index organizes the search keys, with their associated record pointers, into a hash file structure

68 In static hashing, the function h maps search-key values to a fixed set B of bucket addresses, but databases grow or shrink with time. If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows. If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull); if the database shrinks, again space will be wasted. One solution is periodic reorganization of the file with a new hash function, but this is expensive and disrupts normal operations. A better solution is to allow the number of buckets to be modified dynamically.

69 Dynamic hashing is good for a database that grows and shrinks in size; it allows the hash function to be modified dynamically. Extendable hashing is one form of dynamic hashing. The hash function generates values over a large range, typically b-bit integers, with b = 32 (note that 2^32 is quite large!). At any time, only a prefix of the hash value is used to index into a table of bucket addresses. Let the length of the prefix be i bits, 0 ≤ i ≤ 32; the bucket address table size is then 2^i. Initially i = 0; the value of i grows and shrinks as the size of the database grows and shrinks. Multiple entries in the bucket address table may point to the same bucket, so the actual number of buckets is ≤ 2^i. The number of buckets also changes dynamically due to coalescing and splitting of buckets.
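
A sketch of the bucket-addressing step of extendable hashing: the first i bits of a 32-bit hash index into a table of size 2^i. The use of MD5 to produce the 32-bit value, and the sample table contents, are assumptions for illustration; splitting and coalescing are not shown.

```python
import hashlib

def bucket_index(key, i):
    # 32-bit hash of the key; any well-mixed hash would do.
    h32 = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h32 >> (32 - i) if i > 0 else 0   # keep only the i-bit prefix

i = 2                     # prefix length; bucket address table size 2**i = 4
table = [0, 0, 1, 1]      # four table entries sharing two actual buckets
for key in ["Perryridge", "Brighton", "Downtown"]:
    slot = bucket_index(key, i)
    print(key, "-> slot", slot, "-> bucket", table[slot])
```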

70 A key is a single field or a combination of multiple fields. Its purpose is to access or retrieve data rows from a table according to a requirement. Keys are defined in tables to access or sequence the stored data quickly and smoothly; they are also used to create links between different tables.

Superkey – a superkey is defined in the relational model as a set of attributes of a relation variable for which it holds that, in all relations assigned to that variable, there are no two distinct tuples (rows) that have the same values for the attributes in this set.

Candidate key – a minimal superkey is called a candidate key. A candidate key is a field or combination of fields that can act as the primary key of a table, uniquely identifying each record in it.

Primary key – a primary key is a value that can be used to identify a unique row in a table. Attributes are associated with it; examples of primary keys are Social Security numbers (associated with a specific person) or ISBNs (associated with a specific book). In the relational model of data, a primary key is a candidate key chosen as the main method of uniquely identifying a tuple in a relation.

71 Foreign key – a foreign key (FK) is a field or group of fields in a database record that points to a key field or group of fields of another database record in some (usually different) table. Usually a foreign key in one table refers to the primary key (PK) of another table. This way, references can be made to link information together, and it is an essential part of database normalization.

Alternate key – an alternate key is any candidate key which is not selected to be the primary key.

Compound key – a compound key (also called a composite key or concatenated key) is a key that consists of 2 or more attributes.
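
The key varieties above, made concrete in a short SQLite sketch; all table and column names and the sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.execute("""CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY,           -- candidate key chosen as primary key
    ssn     TEXT UNIQUE NOT NULL           -- alternate key: candidate key not chosen
)""")
conn.execute("""CREATE TABLE loan (
    loan_no INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customer(cust_id)  -- foreign key into customer
)""")
conn.execute("INSERT INTO customer VALUES (1, '123-45-6789')")
conn.execute("INSERT INTO loan VALUES (17, 1)")    # OK: customer 1 exists
# conn.execute("INSERT INTO loan VALUES (18, 9)")  # would fail: no customer 9
```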

72 A database can be modeled as a collection of entities and relationships among entities. An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant. Entities have attributes. Example: people have names and addresses. An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

73 [E-R diagram; attribute labels: cust_id, cust_name, cust_street, cust_no, loan_no, amount]

74 A relationship is an association among several entities. Example: Hayes (customer entity) – depositor (relationship set) – A-102 (account entity). A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets: {(e_1, e_2, ..., e_n) | e_1 ∈ E_1, e_2 ∈ E_2, ..., e_n ∈ E_n}, where (e_1, e_2, ..., e_n) is a relationship. Example: (Hayes, A-102) ∈ depositor.

75

76 An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.

77 An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set. Example: customer = (customer_id, customer_name, customer_street, customer_city), loan = (loan_number, amount). Domain – the set of permitted values for each attribute. Attribute types: simple and composite attributes; single-valued and multi-valued attributes (example of a multivalued attribute: phone_numbers); derived attributes, which can be computed from other attributes (example: age, given date_of_birth).

78 Composite attributes are flattened out by creating a separate attribute for each component attribute. Example: given entity set customer with composite attribute name, with component attributes first_name and last_name, the schema corresponding to the entity set has two attributes: name.first_name and name.last_name. A multivalued attribute M of an entity E is represented by a separate schema EM. Schema EM has attributes corresponding to the primary key of E and an attribute corresponding to the multivalued attribute M. Example: the multivalued attribute dependent_names of employee is represented by a schema employee_dependent_names = (employee_id, dname). Each value of the multivalued attribute maps to a separate tuple of the relation on schema EM. For example, an employee entity with primary key and dependents Jack and Jane maps to two tuples: ( , Jack) and ( , Jane).
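
A sketch in Python of the flattening just described; the employee_id value "E-217" is invented, since the slide leaves the key value blank.

```python
# Flatten the multivalued attribute dependent_names into the schema
# employee_dependent_names = (employee_id, dname).
employee = {"employee_id": "E-217",            # hypothetical primary key value
            "dependent_names": ["Jack", "Jane"]}

employee_dependent_names = [
    (employee["employee_id"], d) for d in employee["dependent_names"]
]
print(employee_dependent_names)   # [('E-217', 'Jack'), ('E-217', 'Jane')]
```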

79

80 Mapping cardinalities express the number of entities to which another entity can be associated via a relationship set. They are most useful in describing binary relationship sets. For a binary relationship set, the mapping cardinality must be one of the following types: one-to-one, one-to-many, many-to-one, many-to-many.

81 [Figures: one-to-one and one-to-many relationship sets]

82 [Figures: many-to-one and many-to-many relationship sets]

83

84

85 We express cardinality constraints by drawing either a directed line (→), signifying "one", or an undirected line (—), signifying "many", between the relationship set and the entity set. One-to-one relationship: a customer is associated with at most one loan via the relationship borrower, and a loan is associated with at most one customer via borrower.

86 In the one-to-many relationship a loan is associated with at most one customer via borrower, a customer is associated with several (including 0) loans via borrower

87 In a many-to-one relationship a loan is associated with several (including 0) customers via borrower, a customer is associated with at most one loan via borrower

88 A customer is associated with several (possibly 0) loans via borrower A loan is associated with several (possibly 0) customers via borrower

89 Cardinality limits can also express participation constraints

90 Total participation (indicated by a double line): every entity in the entity set participates in at least one relationship in the relationship set. E.g. the participation of loan in borrower is total: every loan must have a customer associated with it via borrower. Partial participation: some entities may not participate in any relationship in the relationship set. Example: the participation of customer in borrower is partial.

91 Rectangles represent entity sets. Diamonds represent relationship sets. Lines link attributes to entity sets and entity sets to relationship sets. Ellipses represent attributes Double ellipses represent multivalued attributes. Dashed ellipses denote derived attributes. Underline indicates primary key attributes (will study later)

92

93

94 Entity sets of a relationship need not be distinct The labels manager and worker are called roles; they specify how employee entities interact via the works for relationship set. Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles. Role labels are optional, and are used to clarify semantics of the relationship

95

96 In general, any non-binary relationship can be represented using binary relationships by creating an artificial entity set. Replace R between entity sets A, B and C by an entity set E and three relationship sets: 1. R_A, relating E and A; 2. R_B, relating E and B; 3. R_C, relating E and C. Create a special identifying attribute for E, and add any attributes of R to E. For each relationship (a_i, b_i, c_i) in R: 1. create a new entity e_i in the entity set E; 2. add (e_i, a_i) to R_A; 3. add (e_i, b_i) to R_B; 4. add (e_i, c_i) to R_C.

97 An entity set that does not have a primary key is referred to as a weak entity set. The existence of a weak entity set depends on the existence of an identifying entity set: it must relate to the identifying entity set via a total, one-to-many relationship set from the identifying to the weak entity set. The identifying relationship is depicted using a double diamond. The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set. The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.

98 We depict a weak entity set by double rectangles. We underline the discriminator of a weak entity set with a dashed line. payment_number – discriminator of the payment entity set Primary key for payment – (loan_number, payment_number)

99 A top-down design process: we designate subgroupings within an entity set that are distinct from other entities in the set. These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set. Depicted by a triangle component labeled ISA (e.g. customer ISA person). Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

100

101 A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set. Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way. The terms specialization and generalization are used interchangeably. Can have multiple specializations of an entity set based on different features. E.g. permanent_employee vs. temporary_employee, in addition to officer vs. secretary vs. teller Each particular employee would be a member of one of permanent_employee or temporary_employee, and also a member of one of officer, secretary, or teller The ISA relationship also referred to as superclass - subclass relationship

102 Consider the ternary relationship works_on, which we saw earlier. Suppose we want to record managers for tasks performed by an employee at a branch.

103 Relationship sets works_on and manages represent overlapping information: every manages relationship corresponds to a works_on relationship. However, some works_on relationships may not correspond to any manages relationship, so we can't discard the works_on relationship set. We eliminate this redundancy via aggregation: treat the relationship as an abstract entity, which allows relationships between relationships, i.e. the abstraction of a relationship into a new entity. Without introducing redundancy, the resulting diagram represents: an employee works on a particular job at a particular branch; an employee, branch, job combination may have an associated manager.

104

105 The use of an attribute or entity set to represent an object. Whether a real-world concept is best expressed by an entity set or a relationship set. The use of a ternary relationship versus a pair of binary relationships. The use of a strong or weak entity set. The use of specialization/generalization – contributes to modularity in the design. The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.

106

