3DisclaimerThe contents of the slides are solely for the purpose of teaching students at SRM University. All copyrights and Trademarks of organizations/persons apply even if not specified explicitly.
4Classification of Physical Storage Media Speed with which data can be accessedCost per unit of dataReliabilitydata loss on power failure or system crashphysical failure of the storage deviceCan differentiate storage into:volatile storage: loses contents when power is switched offnon-volatile storage:Contents persist even when power is switched off.Includes secondary and tertiary storage, as well as battery-backed up main-memory.
5Physical Storage Media Cache – fastest and most costly form of storage; volatile; managed by the computer system hardwareMain memory:fast access (10s to 100s of nanoseconds; 1 nanosecond = 10–9 seconds)generally too small (or too expensive) to store the entire databasecapacities of up to a few Gigabytes widely used currentlyCapacities have gone up and per-byte costs have decreased steadily and rapidly (roughly factor of 2 every 2 to 3 years)Volatile — contents of main memory are usually lost if a power failure or system crash occurs.
6Physical Storage Media (Cont.) Flash memoryData survives power failureData can be written at a location only once, but location can be erased and written to againCan support only a limited number (10K – 1M) of write/erase cycles.Erasing of memory has to be done to an entire bank of memoryReads are roughly as fast as main memoryBut writes are slow (few microseconds), erase is slower
7Physical Storage Media (Cont.) Flash memoryNOR FlashFast reads, very slow erase, lower capacityUsed to store program code in many embedded devicesNAND FlashPage-at-a-time read/write, multi-page eraseHigh capacity (several GB)Widely used as data storage mechanism in portable devices
8Physical Storage Media (Cont.) Magnetic-diskData is stored on spinning disk, and read/written magneticallyPrimary medium for the long-term storage of data; typically stores entire database.Data must be moved from disk to main memory for access, and written back for storagedirect-access – possible to read data on disk in any order, unlike magnetic tapeSurvives power failures and system crashesdisk failure can destroy data: is rare but does happen
9Physical Storage Media (Cont.) Optical storagenon-volatile, data is read optically from a spinning disk using a laserCD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular formsWrite-one, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R)Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)Reads and writes are slower than with magnetic diskJuke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks available for storing large volumes of data
10Physical Storage Media (Cont.) Tape storagenon-volatile, used primarily for backup (to recover from disk failure), and for archival datasequential-access – much slower than diskvery high capacity (40 to 300 GB tapes available)tape can be removed from drive storage costs much cheaper than disk, but drives are expensiveTape jukeboxes available for storing massive amounts of datahundreds of terabytes (1 terabyte = 109 bytes) to even a petabyte (1 petabyte = 1012 bytes)
12RAID RAID: Redundant Arrays of Independent Disks disk organization techniques that manage a large numbers of disks, providing a view of a single disk ofhigh capacity and high speed by using multiple disks in parallel, andhigh reliability by storing data redundantly, so that data can be recovered even if a disk failsThe chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of hours (approx. 41 days)
13Improvement in Performance via Parallelism Two main goals of parallelism in a disk system:1. Load balance multiple small accesses to increase throughput2. Parallelize large accesses to reduce response time.Improve transfer rate by striping data across multiple disks.Bit-level striping – split the bits of each byte across multiple disksBut seek/access time worse than for a single diskBit level striping is not used much any moreBlock-level striping – with n disks, block i of a file goes to disk (i mod n) + 1Requests for different blocks can run in parallel if the blocks reside on different disksA request for a long sequence of blocks can utilize all disks in parallel
14RAID LevelsRAID organizations, or RAID levels, have differing cost, performance and reliability characteristicsRAID Level 0: Block striping; non-redundant.Used in high-performance applications where data lost is not critical.RAID Level 1: Mirrored disks with block stripingOffers best write performance.Popular for applications such as storing log files in a database system.
15RAID Levels (Cont.)RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.RAID Level 3: Bit-Interleaved Paritya single parity bit is enough for error correction, not just detectionWhen writing data, corresponding parity bits must also be computed and written to a parity bit diskTo recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)
16RAID Levels (Cont.) RAID Level 3 (Cont.) Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O.RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks.When writing data block, corresponding block of parity bits must also be computed and written to parity diskTo find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block) from other disks.
17RAID Levels (Cont.) RAID Level 4 (Cont.) Provides higher I/O rates for independent block reads than Level 3block read goes to a single disk, so blocks stored on different disks can be read in parallelBefore writing a block, parity data must be computedCan be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes)Or by recomputing the parity value using the new values of blocks corresponding to the parity blockMore efficient for writing large amounts of data sequentiallyParity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
18RAID Levels (Cont.)RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
19RAID Levels (Cont.) RAID Level 5 (Cont.) Higher I/O rates than Level 4.Block writes occur in parallel if the blocks and their parity blocks are on different disks.Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures.Better reliability than Level 5 at a higher cost; not used as widely.
20Transaction A transaction is a unit of program execution that accesses and possibly updates various data items.A transaction must see a consistent database.During transaction execution the database may be inconsistent.When the transaction is committed, the database must be consistent.
21To ensure integrity of data, the database system must maintain: ACID PropertiesTo ensure integrity of data, the database system must maintain:Atomicity. Either all operations of the transaction are properly reflected in the database or none are.Consistency. Execution of a transaction in isolation preserves the consistency of the database.Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions.That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started execution after Ti finished.Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
22Example Of TransferTransaction to transfer $100 from Checking account A to Saving account B:1. read(A)2. A := A – 1003. write(A)4. read(B)5. B := B + 1006. write(B)Consistency requirement – the sum of A and B is unchanged by the execution of the transaction.Atomicity requirement — if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result.
23Transfer Example (Cont.) Durability requirement — once the user has been notified that the transaction has completed (i.e., the transfer of the $100 has taken place), the updates to the database by the transaction must persist despite failures.Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database.
24Transaction StateActive, the initial state; the transaction stays in this state while it is executingPartially committed, after the final statement has been executed.Failed, after the discovery that normal execution can no longer proceed.Aborted, after the transaction has been rolled back and the database restored to its state prior to the start of the transaction.1) Restart the transaction – only if no internal logical error2) kill the transactionCommitted, after successful completion..
26Implementation of Atomicity and Durability The shadow-database scheme:assume that only one transaction is active at a time.a pointer called db_pointer always points to the current consistent copy of the database.all updates are made on a shadow copy of the database, and db_pointer is made to point to the updated shadow copy only after the transaction reaches partial commit and all updated pages have been flushed to disk.in case transaction fails, old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
28Concurrent Executions Multiple transactions are allowed to run concurrently in the system. Advantages are:increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the diskreduced waiting time for transactions: short transactions need not wait behind long ones.
29SchedulesSchedules – sequences that indicate the chronological order in which instructions of concurrent transactions are executeda schedule for a set of transactions must consist of all instructions of those transactionsmust preserve the order in which the instructions appear in each individual transaction.
30Example SchedulesLet T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T1 is followed by T2.
31Cont.Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
32Cont.The following concurrent schedule does not preserve the value of the the sum A + B.
33SerializabilityBasic Assumption – Each transaction preserves database consistency.Thus serial execution of a set of transactions preserves database consistency.A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:1. conflict serializability2. view serializability
34Conflict Serializability Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.1. li = read(Q), lj = read(Q). li and lj don’t conflict. 2. li = read(Q), lj = write(Q). They conflict. 3. li = write(Q), lj = read(Q). They conflict 4. li = write(Q), lj = write(Q). They conflict
35Conflict Serializability (Cont.) If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.We say that a schedule S is conflict serializable if it is conflict equivalent to a serial scheduleExample of a schedule that is not conflict serializable:T3 T4 read(Q) write(Q) write(Q) We are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 >, or the serial schedule < T4, T3 >.
36Conflict Serializability (Cont.) Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by series of swaps of non- conflicting instructions. Therefore Schedule 3 is conflict serializable.
37View SerializabilityLet S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.2. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj .3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.As can be seen, view equivalence is also based purely on readsand writes alone.
38View Serializability (Cont.) A schedule S is view serializable it is view equivalent to a serial schedule.Every conflict serializable schedule is also view serializable.Schedule 9 (from text) — a schedule which is view-serializable but not conflict serializable. Every view serializable schedule that is not conflict serializable has blind writes.
39Levels of Consistency in SQL-92 Serializable — defaultRepeatable read — only committed records to be read, repeated reads of same record must return same value. However, a transaction may not be serializable – it may find some records inserted by a transaction but not find others.Read committed — only committed records can be read, but successive reads of record may return different (but committed) values.Read uncommitted — even uncommitted records may be read.
40Lock-Based ProtocolsA lock is a mechanism to control concurrent access to a data itemData items can be locked in two modes :1. exclusive (X) mode. Data item can be both read as well aswritten. X-lock is requested using lock-X instruction.2. shared (S) mode. Data item can only be read. S-lock isrequested using lock-S instruction.Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted.
41Lock-Based Protocols (Cont.) Lock-compatibility matrixA transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactionsAny number of transactions can hold shared locks on an item,but if any transaction holds an exclusive on the item no other transaction may hold any lock on the item.If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
42Lock-Based Protocols (Cont.) Example of a transaction performing locking:T2: lock-S(A);read (A);unlock(A);lock-S(B);read (B);unlock(B);display(A+B)Locking as above is not sufficient to guarantee serializability — if A and B get updated in-between the read of A and B, the displayed sum would be wrong.A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
43Pitfalls of Lock-Based Protocols Consider the partial scheduleNeither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.Such a situation is called a deadlock.To handle a deadlock one of T3 or T4 must be rolled back and its locks released.
44Pitfalls of Lock-Based Protocols (Cont.) The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.Starvation is also possible if concurrency control manager is badly designed. For example:A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.The same transaction is repeatedly rolled back due to deadlocks.Concurrency control manager can be designed to prevent starvation.
45The Two-Phase Locking Protocol This is a protocol which ensures conflict-serializable schedules.Phase 1: Growing Phasetransaction may obtain lockstransaction may not release locksPhase 2: Shrinking Phasetransaction may release lockstransaction may not obtain locksThe protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
46The Two-Phase Locking Protocol (Cont.) Two-phase locking does not ensure freedom from deadlocksCascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
47The Two-Phase Locking Protocol (Cont.) There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
48Lock Conversions Two-phase locking with lock conversions: – First Phase:can acquire a lock-S on itemcan acquire a lock-X on itemcan convert a lock-S to a lock-X (upgrade)– Second Phase:can release a lock-Scan release a lock-Xcan convert a lock-X to a lock-S (downgrade)This protocol assures serializability. But still relies on the programmer to insert the various locking instructions.
49Deadlock Handling T1 T2 lock-X on X write (X) lock-X on Y write (X) Consider the following two transactions:T1: write (X) T2: write(Y)write(Y) write(X)Schedule with deadlockT1T2lock-X on Xwrite (X)lock-X on Ywrite (X)wait for lock-X on Xwait for lock-X on Y
50Deadlock HandlingSystem is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies :Require that each transaction locks all its data items before it begins execution (predeclaration).Impose partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).
51More Deadlock Prevention Strategies Following schemes use transaction timestamps for the sake of deadlock prevention alone.wait-die scheme — non-preemptiveolder transaction may wait for younger one to release data item. Younger transactions never wait for older ones; they are rolled back instead.a transaction may die several times before acquiring needed data itemwound-wait scheme — preemptiveolder transaction wounds (forces rollback) of younger transaction instead of waiting for it. Younger transactions may wait for older ones.may be fewer rollbacks than wait-die scheme.
52Deadlock prevention (Cont.) Both in wait-die and in wound-wait schemes, a rolled back transactions is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.Timeout-Based Schemes :a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.thus deadlocks are not possiblesimple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval.
53Deadlock DetectionDeadlocks can be described as a wait-for graph, which consists of a pair G = (V,E),V is a set of vertices (all the transactions in the system)E is a set of edges; each element is an ordered pair Ti Tj.If Ti Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.When Ti requests a data item currently being held by Tj, then the edge Ti Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
54Deadlock Detection (Cont.) Wait-for graph without a cycleWait-for graph with a cycle
55Deadlock Recovery When deadlock is detected : Some transaction will have to rolled back (made a victim) to break deadlock. Select that transaction as victim that will incur minimum cost.Rollback -- determine how far to roll back transactionTotal rollback: Abort the transaction and then restart it.More effective to roll back transaction only as far as necessary to break deadlock.Starvation happens if same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation
56Failure Classification Transaction failure :Logical errors: transaction cannot complete due to some internal error conditionSystem errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock)System crash: a power failure or other hardware or software failure causes the system to crash.Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by system crashDatabase systems have numerous integrity checks to prevent corruption of disk dataDisk failure: a head crash or similar disk failure destroys all or part of disk storageDestruction is assumed to be detectable: disk drives use checksums to detect failures
57Recovery AlgorithmsRecovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failuresFocus of this chapterRecovery algorithms have two partsActions taken during normal transaction processing to ensure enough information exists to recover from failuresActions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
58Parallel databasesParallel machines are becoming quite common and affordablePrices of microprocessors, memory and disks have dropped sharplyRecent desktop computers feature multiple processors and this trend is projected to accelerateDatabases are growing increasingly largelarge volumes of transaction data are collected and stored for later analysis.multimedia objects like images are increasingly stored in databasesLarge-scale parallel database systems increasingly used for:storing large volumes of dataprocessing time-consuming decision-support queriesproviding high throughput for transaction processing
59Parallelism in Databases Data can be partitioned across multiple disks for parallel I/O.Individual relational operations (e.g., sort, join, aggregation) can be executed in paralleldata can be partitioned and each processor can work independently on its own partition.Queries are expressed in high level language (SQL, translated to relational algebra)makes parallelization easier.Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.Thus, databases naturally lend themselves to parallelism.
60I/O ParallelismReduce the time required to retrieve relations from disk by partitioningthe relations on multiple disks.Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.Partitioning techniques (number of disks = n):Round-robin:Send the ith tuple inserted in the relation to disk i mod n.Hash partitioning:Choose one or more attributes as the partitioning attributes.Choose hash function h with range 0…n - 1Let i denote result of hash function h applied to the partitioning attribute value of a tuple. Send tuple to disk i.
61I/O Parallelism (Cont.) Partitioning techniques (cont.):Range partitioning:Choose an attribute as the partitioning attribute.A partitioning vector [vo, v1, ..., vn-2] is chosen.Let v be the partitioning attribute value of a tuple. Tuples such that vi vi+1 go to disk I + 1. Tuples with v < v0 go to disk 0 and tuples with v vn-2 go to disk n-1.E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk2.
62Distributed Database System (ddbs) DDBS: Multiple logically interrelated databases distributed over a computer networkA distributed database system consists of loosely coupled sites that share no physical componentDatabase systems that run on each site are independent of each otherTransactions may access data at one or more sites
63Homogeneous Distributed Databases In a homogeneous distributed databaseAll sites have identical softwareAre aware of each other and agree to cooperate in processing user requests.Each site surrenders part of its autonomy in terms of right to change schemas or softwareAppears to user as a single systemIn a heterogeneous distributed databaseDifferent sites may use different schemas and softwareDifference in schema is a major problem for query processingDifference in software is a major problem for transaction processingSites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing
64Data ReplicationSystem maintains multiple copies of data, stored in different sites, for faster retrieval and fault toleranceA relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.Full replication of a relation is the case where the relation is stored at all sites.Fully redundant databases are those in which every site contains a copy of the entire database.
65Data Replication (Cont.) Advantages of ReplicationAvailability: failure of site containing relation r does not result in unavailability of r is replicas exist.Parallelism: queries on r may be processed by several nodes in parallel.Reduced data transfer: relation r is available locally at each site containing a replica of r.Disadvantages of ReplicationIncreased cost of updates: each replica of relation r must be updated.Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.One solution: choose one copy as primary copy and apply concurrency control operations on primary copy
66Data FragmentationDivision of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.Horizontal fragmentation: each tuple of r is assigned to one or more fragmentsVertical fragmentation: the schema for relation r is split into several smaller schemasAll schemas must contain a common candidate key (or superkey) to ensure lossless join property.A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key.Example : relation account with following schemaAccount = (account_number, branch_name , balance )
69Advantages of Fragmentation Horizontal:allows parallel processing on fragments of a relationallows a relation to be split so that tuples are located where they are most frequently accessedVertical:allows tuples to be split so that each part of the tuple is stored where it is most frequently accessedtuple-id attribute allows efficient joining of vertical fragmentsVertical and horizontal fragmentation can be mixed.Fragments may be successively fragmented to an arbitrary depth.Replication and fragmentation can be combinedRelation is partitioned into several fragments: system maintains several identical replicas of each such fragment.
70Bibliography1. Raghu Ramakrishnan, Johannes Gehrke, Database Management System, McGraw Hill., 3rd Edition 2003.2. Elmashri & Navathe, Fundamentals of Database System, Addison-Wesley Publishing, 3rd Edition,2000.3. Date C.J, An Introduction to Database, Addison-Wesley Pub Co, 7th Edition , 2001.4. Jeffrey D. Ullman, Jennifer Widom, A First Course in Database System, Prentice Hall, AWL 1st Edition ,2001.5. Peter rob, Carlos Coronel, Database Systems – Design, Implementation, and Management, 4th Edition, Thomson Learning, 2001.
71Review questions Define flash memory. Define log disk. What is meant by log-based file system?Define a transaction.List the required properties of a transaction to ensure integrity of the data.What is meant by cascading rollback?Define concurrency control.Define locking protocol.Define cache coherency.Define parallel aggregation.Define query optimizationDefine fuzzy checkpoint.What is meant by Write-ahead-logging (WAL) rule?