Physical Database Design
- Purpose: to translate the logical description of data into the technical specifications for storing and retrieving data.
- Goal: to create a design for storing data that provides adequate performance and ensures database integrity, security, and recoverability.
Physical Design Information
Information needed for physical file and database design includes:
- Normalized relations, plus size estimates for them
- Definitions of each attribute
- Descriptions of where and when data are used (entered, retrieved, deleted, updated) and how often
- Expectations and requirements for response time, data security, backup, recovery, retention, and integrity
- Descriptions of the technologies used to implement the database
Physical Design Decisions
During this phase, decisions are taken on:
- Storage format
- Physical record composition
- Data arrangement
- Indexes
- Query optimization and performance tuning
Physical Design Decisions (continued)
During this phase, decisions are also taken on how to:
- Create base relations: name, attributes, primary key, foreign keys, alternate keys, indexes
- Implement integrity rules: domain, enterprise, referential (no action, cascade, set null, set default, or no check, for deleting and updating), and entity
Storage Format
- Choose the storage format of each field (attribute).
- The DBMS provides a set of data types that can be used for the physical storage of fields in the database.
- The data type (format) is chosen to minimize storage space and maximize data integrity.
Objectives of Data Type Selection
- Minimize storage space
- Represent all possible values
- Improve data integrity
- Support all data manipulations
The correct data type should, in minimal space, represent every possible value of the associated attribute (while excluding illegal values) and support the required data manipulations (e.g. numerical or string operations).
Choosing Data Types
- CHAR – fixed-length character
- VARCHAR – variable-length character (memo)
- LONG – large number (the meaning is DBMS-specific; in Oracle, LONG holds long character data)
- NUMBER – positive/negative number
- DATE – actual date
- BLOB – binary large object (good for graphics, sound clips, etc.)
Designing Physical Records
- A physical record is a group of fields stored in adjacent memory locations and retrieved together as a unit.
- Records can be either fixed length or variable length.
Data Storage
Data is stored in memory, which can be classified as:
- Cache memory
- Primary (main) memory
- Secondary memory: disk, tape
The Memory Hierarchy
- Processor cache: access time 10 ns; storage capacity 512 KB
- Main memory (= disk cache): volatile; storage capacity 256 MB-1 GB; access time: nanoseconds
- Disk: persistent; storage capacity in GB; access time: milliseconds
- Tape: 1.5 MB/s transfer rate; 280 GB typical capacity; only sequential access
Main Memory
- Fastest and most expensive storage (excluding cache)
- Today, 512 MB is common even on PCs
- Many databases could fit in memory
- New industry trend: main-memory databases, e.g. TimesTen
- The main issue is volatility
Secondary Storage
- Secondary storage means disks.
- Disks are slower and cheaper than main memory.
- Disk storage is non-volatile, i.e. the data is stored permanently.
- The unit of disk I/O is the block; typically 1 block = 4 KB.
- A disk block is also called a disk page, or simply a page.
- The blocking factor (bfr) of a file is the average number of records stored in a disk block.
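As a rough illustration of the blocking factor, the sketch below assumes fixed-length records that do not span blocks, in which case bfr is the block size divided by the record size, rounded down. The record size and record count are made-up numbers, not taken from the slides.

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Records per block, assuming fixed-length, unspanned records."""
    return block_size // record_size

def blocks_needed(num_records: int, bfr: int) -> int:
    """Number of blocks needed to hold the whole file."""
    return math.ceil(num_records / bfr)

BLOCK_SIZE = 4096      # 4k block, as on the slide
RECORD_SIZE = 100      # assumed record size in bytes
NUM_RECORDS = 10_000   # assumed number of records in the file

bfr = blocking_factor(BLOCK_SIZE, RECORD_SIZE)
print(bfr, blocks_needed(NUM_RECORDS, bfr))
```

With these assumed numbers, 96 bytes per block are left unused, which is the price of not spanning records across blocks.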
The Mechanics of Disk
Mechanical characteristics:
- Rotation speed (5400 RPM)
- Number of platters (1-30)
- Number of tracks (<= 10000)
- Number of sectors (256 per track)
- Number of bytes per sector (2^9 = 512)
- Block size (2^12 = 4096)
[Figure: disk assembly showing platters, spindle, disk head, arm assembly and arm movement, tracks, sector, cylinder]
Important Disk Access Characteristics
- Block access time = disk latency + transfer time
- Disk latency = seek time + rotational latency
- Seek time = time for the head to reach the right track: 10 ms-40 ms
- Rotational latency = rotation time to get to the right sector
  - Time for one rotation = 10 ms
  - Average rotational latency = half a rotation = 5 ms
- Transfer rate is typically 5-10 MB/s
- Disks read/write one block at a time (typically 4 KB)
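The formulas above can be combined into a small calculation. This is only a sketch: it takes the average rotational latency to be half a rotation, treats 1 MB as 10^6 bytes, and uses the low end of the seek range and the high end of the transfer range as example inputs.

```python
def block_access_time_ms(seek_ms: float, rotation_ms: float,
                         block_bytes: int, transfer_mb_per_s: float) -> float:
    """Block access time = seek time + rotational latency + transfer time."""
    # Assumption: on average the head waits half a rotation.
    rotational_latency_ms = rotation_ms / 2
    # Time to move one block at the given transfer rate (1 MB = 10**6 bytes).
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

# Example inputs drawn from the ranges on the slide.
t = block_access_time_ms(seek_ms=10, rotation_ms=10,
                         block_bytes=4096, transfer_mb_per_s=10)
print(f"{t:.2f} ms")
```

Even with the most favourable inputs, the mechanical delays (seek plus rotation) dominate; the transfer itself is well under a millisecond.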
Representing Data Elements
Relational database elements:
    CREATE TABLE Product (
        pid INT PRIMARY KEY,
        name CHAR(20),
        description VARCHAR(200),
        maker CHAR(10) REFERENCES Company(name)
    )
A tuple is represented as a record.
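Record Formats: Fixed Length
- All fields in the record have a fixed length, so the record length is fixed and all records are equal in length.
- With base address B and fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, the address of a field is B plus the lengths of the fields before it; for example, the address of F3 = B + L1 + L2.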
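The address rule for fixed-length records (Address = B + L1 + L2 for the third field) generalizes to a running sum of field lengths. A minimal sketch, with a made-up base address and made-up field lengths:

```python
from itertools import accumulate

def field_addresses(base: int, lengths: list[int]) -> list[int]:
    """Start address of each field: base plus the lengths of all earlier fields."""
    offsets = [0, *accumulate(lengths)][:-1]
    return [base + off for off in offsets]

B = 1000                   # assumed base address of the record
LENGTHS = [4, 20, 10, 8]   # assumed lengths L1..L4 in bytes

addrs = field_addresses(B, LENGTHS)
print(addrs)  # addrs[2] is the address of F3, i.e. B + L1 + L2
```

Because every record has the same layout, these offsets can be computed once per relation rather than stored in each record.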
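Record Header
[Figure: record with a header followed by fields F1-F4 of lengths L1-L4; the header holds a pointer to the schema, the record length, and a timestamp]
Need the header because:
- The schema may change; for a while, new and old may coexist
- Records from different relations may coexist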
Variable Length Records
[Figure: record with a header (record length and other header information) followed by fields F1-F4 of lengths L1-L4]
- Place the fixed-length fields first: F1, F2
- Then the variable-length fields: F3, F4
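One possible byte layout for such a record is sketched below. The exact header contents are an assumption for illustration (a 2-byte record length and a 2-byte offset to F4); the slides only say that the header holds the length and other information. F1 is assumed to be a 4-byte integer and F2 a CHAR(10).

```python
import struct

HEADER = struct.Struct("<HH")    # hypothetical header: record length, offset of F4
FIXED = struct.Struct("<i10s")   # fixed fields first: F1 (int), F2 (CHAR(10))

def pack_record(f1: int, f2: str, f3: str, f4: str) -> bytes:
    """Header, then fixed-length fields, then variable-length fields."""
    f3_bytes, f4_bytes = f3.encode(), f4.encode()
    f3_offset = HEADER.size + FIXED.size   # F3 always starts right after the fixed part
    f4_offset = f3_offset + len(f3_bytes)
    length = f4_offset + len(f4_bytes)
    return (HEADER.pack(length, f4_offset)
            + FIXED.pack(f1, f2.encode().ljust(10))   # pad F2 with spaces
            + f3_bytes + f4_bytes)

def unpack_record(buf: bytes):
    """Recover the four fields using only the header and the fixed layout."""
    length, f4_offset = HEADER.unpack_from(buf, 0)
    f1, f2 = FIXED.unpack_from(buf, HEADER.size)
    f3_offset = HEADER.size + FIXED.size
    f3 = buf[f3_offset:f4_offset].decode()
    f4 = buf[f4_offset:length].decode()
    return f1, f2.decode().rstrip(), f3, f4

rec = pack_record(7, "Acme", "widget", "a small widget")
print(len(rec), unpack_record(rec))
```

Placing the fixed-length fields first means their offsets never change, so only the variable-length fields need offsets in the header.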
Records With Referencing Fields
[Figure: record with a header (record length and other header information) followed by fields F1-F3 of lengths L1-L3]
- E.g. to represent one-many or many-many relationships
Storing Records in Blocks
- Blocks have a fixed size (typically 4 KB)
[Figure: one block holding records R1, R2, R3, R4]
Spanning Records Across Blocks
[Figure: records R1, R2, R3 stored across two blocks, each with its own block header]
BLOB
- Binary large objects
- Supported by modern database systems
- E.g. images, sounds, etc.
- Storage: attempt to cluster blocks together
Modifications: Insertion
- File is unsorted: add the record to the end
- File is sorted:
  - Is there space in the right block? Yes: we are lucky, store it there
  - Is there space in a neighboring block? Look 1-2 blocks to the left/right and shift records
  - If all else fails, create an overflow block
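The insertion strategy for a sorted file can be sketched as follows. This is a simplified illustration, not a full implementation: blocks are modelled as Python lists of keys, the capacity of 3 records per block is made up, and only the immediate left and right neighbours are checked (the slide suggests looking 1-2 blocks away).

```python
from bisect import insort

CAPACITY = 3  # records per block (assumed, for illustration)

def find_block(blocks, key):
    """Index of the first block whose largest key is >= key, else the last block."""
    for i, blk in enumerate(blocks):
        if blk and key <= blk[-1]:
            return i
    return len(blocks) - 1

def insert_sorted(blocks, overflow, key):
    """Insert key into a sorted file; returns which case applied."""
    i = find_block(blocks, key)
    if len(blocks[i]) < CAPACITY:                      # space in the right block
        insort(blocks[i], key)
        return "in place"
    if i + 1 < len(blocks) and len(blocks[i + 1]) < CAPACITY:
        insort(blocks[i], key)                         # shift the largest record right
        blocks[i + 1].insert(0, blocks[i].pop())
        return "shifted right"
    if i > 0 and len(blocks[i - 1]) < CAPACITY:
        insort(blocks[i], key)                         # shift the smallest record left
        blocks[i - 1].append(blocks[i].pop(0))
        return "shifted left"
    overflow.setdefault(i, []).append(key)             # everything else failed
    return "overflow"

blocks = [[10, 20, 30], [40, 50], [70, 80, 90]]
overflow = {}
results = [insert_sorted(blocks, overflow, k) for k in (25, 45, 95)]
print(results, blocks, overflow)
```

Once neighbouring blocks fill up, every further insert in that key range lands in an overflow block, which is why such files eventually need reorganizing.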
Overflow Blocks
[Figure: blocks n-1, n, n+1, with an overflow block]
- After a while the file starts being dominated by overflow blocks: time to reorganize
Modifications: Deletions
- Free space in the block, shift records
- May be able to eliminate an overflow block
Modifications: Updates
- If the new record is shorter than the previous one: easy
- If it is longer: need to shift records, create overflow blocks
Physical Addresses
Each block and each record has a physical address that consists of:
- The disk
- The cylinder number
- The track number
- The block within the track
- For records: an offset in the block
Logical Addresses
- Logical address: a string of bytes (10-16)
- More flexible: can move blocks/records around
- But needs a translation table:
  Logical address | Physical address
  L1              | P1
  L2              | P2
  L3              | P3
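The translation table can be pictured as a simple map from logical to physical addresses. In this sketch the physical address is a tuple of disk, cylinder, track, and block, and all table contents are hypothetical.

```python
from typing import NamedTuple

class PhysicalAddress(NamedTuple):
    disk: int
    cylinder: int
    track: int
    block: int

# Hypothetical table contents, for illustration only.
translation_table = {
    "L1": PhysicalAddress(0, 12, 3, 7),
    "L2": PhysicalAddress(0, 12, 3, 8),
    "L3": PhysicalAddress(1, 40, 0, 2),
}

def resolve(logical: str) -> PhysicalAddress:
    """Translate a logical address into its current physical address."""
    return translation_table[logical]

def move_block(logical: str, new_address: PhysicalAddress) -> None:
    """Relocate a block: only the table entry changes, not the logical address."""
    translation_table[logical] = new_address

move_block("L2", PhysicalAddress(1, 5, 2, 0))
print(resolve("L2"))
```

Anything that refers to the block by its logical address keeps working after the move, which is the flexibility the slide refers to; the cost is one extra lookup per access.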
Main Memory Address
- When a block is read into main memory, it receives a main memory address
- The buffer manager has another translation table:
  Memory address | Logical address
  M1             | L1
  M2             | L2
  M3             | L3
Physical Design
- Interface 1: User request to the DBMS. The user presents a query, and the DBMS determines which physical databases are needed to resolve it.
- Interface 2: The DBMS uses an internal model access method to access the data stored in a logical database.
- Interface 3: The internal model access methods and OS access methods access the physical records of the database.
Physical File Design
- A physical file is a portion of secondary storage (disk space) allocated for storing physical records
- Pointer: a field of data that can be used to locate a related field or record of data
- Access method: an operating system algorithm for storing and locating data in secondary storage
- Page: the amount of data read or written in one disk input or output operation
Internal Model Access Methods
Many types of access methods:
- Physical Sequential
- Indexed Sequential
- Indexed Random
- Direct
- Hashed
They differ in:
- Access efficiency
- Storage efficiency
Physical Sequential
- Key values of the physical records are in logical sequence
- Main use is for “dump” and “restore”
- Access method may be used for storage as well as retrieval
- Storage efficiency is near 100%
- Access efficiency is poor (unless the physical records have a fixed size)
Sequential File Organization
- A sequential file is one in which the records are stored in sorted order of one or more key fields.
- Sequential access means that data is accessed in an ordered sequence.
- Sequential access is sometimes the only way of accessing the data, for example on tape.
- Records are usually stored on tape and processed one after the other.
Advantages
- Simple file design
- Very efficient when most of the records must be processed, e.g. payroll
- Very efficient if the data has a natural order
- Can be stored on inexpensive devices such as magnetic tape
Disadvantages
- The entire file must be processed even when searching for a single record
- Transactions have to be sorted before processing
- Overall processing is slow, because you have to go through each record until you reach the one you want
Sequential File Organization
- A collection of records stored in key sequence
- Adding or deleting a record requires making a new file (so that the sequence is maintained)
- Used as master files
Indexed Sequential
- Key values of the physical records are in logical sequence
- Access method may be used for storage and retrieval
- An index of key values is maintained, with entries for the highest key values per block(s)
- Access efficiency depends on the levels of index, the storage allocated for the index, the number of database records, and the amount of overflow
- Storage efficiency depends on the size of the index and the volatility of the database
Indexed sequential file
- Each record of a file has a key field which uniquely identifies that record.
- An index consists of keys and addresses, just like an index in a book.
- The pages in a book are stored sequentially, so you can read through it page by page, OR you can look up the page you want in the index and flick straight to it.
Indexed sequential file
- An indexed sequential file is a sequential file (i.e. sorted in order of a key field) which has an index.
- A full index to a file is one in which there is an entry for every record.
- Because each record has an index entry, we can access individual records directly, without having to scroll through all the other records first.
Indexed sequential file
Indexed sequential files are important for applications where data needs to be accessed:
- sequentially, one record after another, OR
- randomly, using the index.
An example of an Indexed Sequential file
- A company may store details about its employees as an indexed sequential file.
- Sometimes the file is accessed sequentially, for example when the whole file is processed to produce pay slips at the end of the month.
- Sometimes the file is accessed randomly, for example when an employee changes address, or gets married and changes surname.
Indexed sequential file
- An indexed sequential file can only be stored on a random access device, e.g. magnetic disk or CD.
- This is because we need a device that allows direct access to individual records, rather than the sequential access that magnetic tape allows.
Advantages
- Provides flexibility for users who need both types of access to the same file
- Faster than sequential
Disadvantages
- Extra storage space is required for the index, just like in a book: your textbook would be 372 pages without the index (go on, check!) but is 380 pages with it.
Indexed Sequential
Index:
  Actual Value | Address (Block Number)
  Dumpling     | 1
  Harty        | 2
  Texaci       | 3
  ...          | ...
Data file:
  Block 1: Adams, Becker, Dumpling
  Block 2: Getta, Harty
  Block 3: Mobile, Sunoci, Texaci
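A lookup in this structure can be sketched as a binary search over the index followed by a scan of one block. The names come from the figure; reading the index as "highest key in each block" follows the Indexed Sequential slide, and the in-memory lists stand in for disk blocks.

```python
from bisect import bisect_left

# One index entry per block: the highest key stored in that block.
INDEX_KEYS = ["Dumpling", "Harty", "Texaci"]
BLOCKS = [
    ["Adams", "Becker", "Dumpling"],   # block 1
    ["Getta", "Harty"],                # block 2
    ["Mobile", "Sunoci", "Texaci"],    # block 3
]

def lookup(key: str):
    """Return the 1-based block number holding key, or None if absent."""
    # First index entry whose key is >= the search key.
    pos = bisect_left(INDEX_KEYS, key)
    if pos == len(INDEX_KEYS):
        return None                    # larger than every key in the file
    # Scan only that one block sequentially.
    return pos + 1 if key in BLOCKS[pos] else None

print(lookup("Getta"), lookup("Texaci"), lookup("Zed"))
```

Only the index and a single data block are touched per lookup, while reading the blocks in order still gives the whole file in key sequence.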
Indexed Sequential: Two Levels
[Figure: a top-level index of (key value, address) entries pointing to blocks 7, 8, 9, ... of a second-level index, whose (key value, address) entries point to data blocks 1-2, 3-4, and 5-6]
Indexed Random
- Key values of the physical records are not necessarily in logical sequence
- The index may be stored and accessed with the Indexed Sequential Access Method
- The index has an entry for every database record; the index keys are in ascending logical sequence
- Database records are not necessarily in ascending sequence
- Access method may be used for storage and retrieval
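Indexed Random
[Figure: an index of (actual value, address/block number) entries in key order (Adams, Becker, Dumpling, Getta, Harty) pointing into a data file whose records are stored in the order Becker, Harty, Adams, Getta, Dumpling]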
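The indexed random arrangement can be sketched with a dense, sorted index over an unsorted data file. The record order is the one shown in the figure; using list positions as addresses is an assumption made for the example.

```python
from bisect import bisect_left

# Data file in arrival order (not sorted); the position stands in for the address.
DATA_FILE = ["Becker", "Harty", "Adams", "Getta", "Dumpling"]

# Dense index: one (key, address) entry per record, kept in ascending key order.
INDEX = sorted((key, addr) for addr, key in enumerate(DATA_FILE))
KEYS = [key for key, _ in INDEX]

def find(key: str):
    """Return the address of key in the data file, or None if absent."""
    pos = bisect_left(KEYS, key)
    if pos < len(KEYS) and KEYS[pos] == key:
        return INDEX[pos][1]
    return None

print(find("Adams"), find("Harty"), find("Zed"))
```

Since the data file is unordered, the index needs an entry for every record; in exchange, new records can simply be appended to the data file.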
Btree
[Figure: B-tree with root node F | P | Z, index nodes B | D | F, H | L | P and R | S | Z, and data records Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles]
Direct (Random) File Organization
- Records are read directly from, or written directly to, the file.
- Records are stored at known addresses.
- The address is calculated by applying a mathematical function to the key field.
Direct
- Key values of the physical records are not necessarily in logical sequence
- There is a one-to-one correspondence between a record key and the physical address of the record
- May be used for storage and retrieval
- No duplicate keys permitted
Hashing
- A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
- In a hash file organization, we obtain the bucket of a record directly from its search-key value using a hash function.
- A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
Hashing Organization
- The hash function is used to locate records for access, insertion, and deletion.
- Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
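EXAMPLE: insertion (2 records/bucket)
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0
  Bucket 0: d
  Bucket 1: a, c
  Bucket 2: b
Then insert e with h(e) = 1: bucket 1 is already full, so e goes into an overflow block.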
EXAMPLE: deletion
Delete: e, f
[Figure: buckets holding records a, b, c, d, e, f, g before and after the deletions; after f is deleted, maybe move “g” up]
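The insertion and deletion examples can be sketched with a small hash file that holds 2 records per bucket and one overflow list per bucket. The hash values for a, b, c, and e are those of the insertion example; h(d) = 0 and the number of buckets (4) are assumptions, and `HashFile` is a hypothetical name used only for this illustration.

```python
BUCKET_CAPACITY = 2   # 2 records/bucket, as in the example
NUM_BUCKETS = 4       # assumed; the slides do not state it

# Hash values from the insertion example; h(d) = 0 is an assumption.
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}

class HashFile:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]
        self.overflow = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key):
        b = H[key]
        if len(self.buckets[b]) < BUCKET_CAPACITY:
            self.buckets[b].append(key)
        else:
            self.overflow[b].append(key)       # bucket full: use overflow

    def search(self, key):
        b = H[key]                             # go straight to the bucket,
        return key in self.buckets[b] or key in self.overflow[b]  # then scan it

    def delete(self, key):
        b = H[key]
        if key in self.overflow[b]:
            self.overflow[b].remove(key)
        elif key in self.buckets[b]:
            self.buckets[b].remove(key)
            if self.overflow[b]:               # move an overflow record up
                self.buckets[b].append(self.overflow[b].pop(0))

f = HashFile()
for k in "abcde":
    f.insert(k)
print(f.buckets, f.overflow)
```

Deleting a record from a full bucket pulls a record up from the overflow list, which mirrors the "maybe move up" step in the deletion example and keeps overflow chains short.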