Presentation on theme: "G.Vadivu & P.Subha, Assistant Professor, SRM University, Kattankulathur 8/22/2011School of Computing, Department of IT."— Presentation transcript:
G.Vadivu & P.Subha, Assistant Professor, SRM University, Kattankulathur 8/22/2011School of Computing, Department of IT
The contents of the slides are solely for the purpose of teaching students at SRM University. All copyrights and Trademarks of organizations/persons apply even if not specified explicitly.
Speed with which data can be accessed Cost per unit of data Reliability data loss on power failure or system crash physical failure of the storage device Can differentiate storage into: volatile storage: loses contents when power is switched off non-volatile storage: Contents persist even when power is switched off. Includes secondary and tertiary storage, as well as battery-backed up main-memory.
Cache – fastest and most costly form of storage; volatile; managed by the computer system hardware Main memory: fast access (10s to 100s of nanoseconds; 1 nanosecond = 10 –9 seconds) generally too small (or too expensive) to store the entire database capacities of up to a few Gigabytes widely used currently Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly factor of 2 every 2 to 3 years) Volatile contents of main memory are usually lost if a power failure or system crash occurs.
Flash memory Data survives power failure Data can be written at a location only once, but location can be erased and written to again Can support only a limited number (10K – 1M) of write/erase cycles. Erasing of memory has to be done to an entire bank of memory Reads are roughly as fast as main memory But writes are slow (few microseconds), erase is slower
Flash memory NOR Flash Fast reads, very slow erase, lower capacity Used to store program code in many embedded devices NAND Flash Page-at-a-time read/write, multi-page erase High capacity (several GB) Widely used as data storage mechanism in portable devices
Magnetic-disk Data is stored on spinning disk, and read/written magnetically Primary medium for the long-term storage of data; typically stores entire database. Data must be moved from disk to main memory for access, and written back for storage direct-access – possible to read data on disk in any order, unlike magnetic tape Survives power failures and system crashes disk failure can destroy data: is rare but does happen
Optical storage non-volatile, data is read optically from a spinning disk using a laser CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms Write-one, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R, DVD+R) Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM) Reads and writes are slower than with magnetic disk Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks available for storing large volumes of data
Tape storage non-volatile, used primarily for backup (to recover from disk failure), and for archival data sequential-access – much slower than disk very high capacity (40 to 300 GB tapes available) tape can be removed from drive storage costs much cheaper than disk, but drives are expensive Tape jukeboxes available for storing massive amounts of data hundreds of terabytes (1 terabyte = 10 9 bytes) to even a petabyte (1 petabyte = bytes)
RAID: Redundant Arrays of Independent Disks disk organization techniques that manage a large numbers of disks, providing a view of a single disk of high capacity and high speed by using multiple disks in parallel, and high reliability by storing data redundantly, so that data can be recovered even if a disk fails The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)
Two main goals of parallelism in a disk system: 1.Load balance multiple small accesses to increase throughput 2.Parallelize large accesses to reduce response time. Improve transfer rate by striping data across multiple disks. Bit-level striping – split the bits of each byte across multiple disks But seek/access time worse than for a single disk Bit level striping is not used much any more Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1 Requests for different blocks can run in parallel if the blocks reside on different disks A request for a long sequence of blocks can utilize all disks in parallel
RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics RAID Level 0: Block striping; non-redundant. Used in high-performance applications where data lost is not critical. RAID Level 1: Mirrored disks with block striping Offers best write performance. Popular for applications such as storing log files in a database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping. RAID Level 3: Bit-Interleaved Parity a single parity bit is enough for error correction, not just detection When writing data, corresponding parity bits must also be computed and written to a parity bit disk To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)
RAID Level 3 (Cont.) Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O. RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks. When writing data block, corresponding block of parity bits must also be computed and written to parity disk To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block) from other disks.
RAID Level 4 (Cont.) Provides higher I/O rates for independent block reads than Level 3 block read goes to a single disk, so blocks stored on different disks can be read in parallel Before writing a block, parity data must be computed Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes) Or by recomputing the parity value using the new values of blocks corresponding to the parity block More efficient for writing large amounts of data sequentially Parity block becomes a bottleneck for independent block writes since every block write also writes to parity disk
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk. E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
RAID Level 5 (Cont.) Higher I/O rates than Level 4. Block writes occur in parallel if the blocks and their parity blocks are on different disks. Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk. RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used as widely.
A transaction is a unit of program execution that accesses and possibly updates various data items. A transaction must see a consistent database. During transaction execution the database may be inconsistent. When the transaction is committed, the database must be consistent.
To ensure integrity of data, the database system must maintain: Atomicity. Either all operations of the transaction are properly reflected in the database or none are. Consistency. Execution of a transaction in isolation preserves the consistency of the database. Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. That is, for every pair of transactions T i and T j, it appears to T i that either T j, finished execution before T i started, or T j started execution after T i finished. Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Transaction to transfer $100 from Checking account A to Saving account B: 1.read(A) 2.A := A – write(A) 4.read(B) 5.B := B write(B) Consistency requirement – the sum of A and B is unchanged by the execution of the transaction. Atomicity requirement if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result.
Durability requirement once the user has been notified that the transaction has completed (i.e., the transfer of the $100 has taken place), the updates to the database by the transaction must persist despite failures. Isolation requirement if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database.
Active, the initial state; the transaction stays in this state while it is executing Partially committed, after the final statement has been executed. Failed, after the discovery that normal execution can no longer proceed. Aborted, after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. 1) Restart the transaction – only if no internal logical error 2) kill the transaction Committed, after successful completion..
The shadow-database scheme: assume that only one transaction is active at a time. a pointer called db_pointer always points to the current consistent copy of the database. all updates are made on a shadow copy of the database, and db_pointer is made to point to the updated shadow copy only after the transaction reaches partial commit and all updated pages have been flushed to disk. in case transaction fails, old consistent copy pointed to by db_pointer can be used, and the shadow copy can be deleted.
Multiple transactions are allowed to run concurrently in the system. Advantages are: increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk reduced waiting time for transactions: short transactions need not wait behind long ones.
Schedules – sequences that indicate the chronological order in which instructions of concurrent transactions are executed a schedule for a set of transactions must consist of all instructions of those transactions must preserve the order in which the instructions appear in each individual transaction.
Let T 1 transfer $50 from A to B, and T 2 transfer 10% of the balance from A to B. The following is a serial schedule (Schedule 1 in the text), in which T 1 is followed by T 2.
Let T 1 and T 2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1.
The following concurrent schedule does not preserve the value of the the sum A + B.
Basic Assumption – Each transaction preserves database consistency. Thus serial execution of a set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of: 1.conflict serializability 2.view serializability
Instructions l i and l j of transactions T i and T j respectively, conflict if and only if there exists some item Q accessed by both l i and l j, and at least one of these instructions wrote Q. 1. l i = read(Q), l j = read(Q). l i and l j dont conflict. 2. l i = read(Q), l j = write(Q). They conflict. 3. l i = write(Q), l j = read(Q). They conflict 4. l i = write(Q), l j = write(Q). They conflict
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule Example of a schedule that is not conflict serializable: T 3 T 4 read(Q) write(Q) write(Q) We are unable to swap instructions in the above schedule to obtain either the serial schedule, or the serial schedule.
Schedule 3 below can be transformed into Schedule 1, a serial schedule where T 2 follows T 1, by series of swaps of non- conflicting instructions. Therefore Schedule 3 is conflict serializable.
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met: 1.For each data item Q, if transaction T i reads the initial value of Q in schedule S, then transaction T i must, in schedule S´, also read the initial value of Q. 2.For each data item Q if transaction T i executes read(Q) in schedule S, and that value was produced by transaction T j (if any), then transaction T i must in schedule S´ also read the value of Q that was produced by transaction T j. 3.For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´. As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable it is view equivalent to a serial schedule. Every conflict serializable schedule is also view serializable. Schedule 9 (from text) a schedule which is view-serializable but not conflict serializable. Every view serializable schedule that is not conflict serializable has blind writes.
Serializable default Repeatable read only committed records to be read, repeated reads of same record must return same value. However, a transaction may not be serializable – it may find some records inserted by a transaction but not find others. Read committed only committed records can be read, but successive reads of record may return different (but committed) values. Read uncommitted even uncommitted records may be read.
A lock is a mechanism to control concurrent access to a data item Data items can be locked in two modes : 1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction. 2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction. Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted.
Lock-compatibility matrix A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive on the item no other transaction may hold any lock on the item. If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking: T 2 : lock-S(A); read (A); unlock(A); lock-S(B); read (B); unlock(B); display(A+B) Locking as above is not sufficient to guarantee serializability if A and B get updated in-between the read of A and B, the displayed sum would be wrong. A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
Consider the partial schedule Neither T 3 nor T 4 can make progress executing lock-S(B) causes T 4 to wait for T 3 to release its lock on B, while executing lock-X(A) causes T 3 to wait for T 4 to release its lock on A. Such a situation is called a deadlock. To handle a deadlock one of T 3 or T 4 must be rolled back and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil. Starvation is also possible if concurrency control manager is badly designed. For example: A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item. The same transaction is repeatedly rolled back due to deadlocks. Concurrency control manager can be designed to prevent starvation.
This is a protocol which ensures conflict-serializable schedules. Phase 1: Growing Phase transaction may obtain locks transaction may not release locks Phase 2: Shrinking Phase transaction may release locks transaction may not obtain locks The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
Two-phase locking does not ensure freedom from deadlocks Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts. Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking is used. However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense: Given a transaction T i that does not follow two-phase locking, we can find a transaction T j that uses two-phase locking, and a schedule for T i and T j that is not conflict serializable.
Two-phase locking with lock conversions: – First Phase: can acquire a lock-S on item can acquire a lock-X on item can convert a lock-S to a lock-X (upgrade) – Second Phase: can release a lock-S can release a lock-X can convert a lock-X to a lock-S (downgrade) This protocol assures serializability. But still relies on the programmer to insert the various locking instructions.
Consider the following two transactions: T 1 : write (X) T 2 : write(Y) write(Y) write(X) Schedule with deadlock T1T1 T2T2 lock-X on X write (X) lock-X on Y write (X) wait for lock-X on X wait for lock-X on Y
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set. Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies : Require that each transaction locks all its data items before it begins execution (predeclaration). Impose partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).
Following schemes use transaction timestamps for the sake of deadlock prevention alone. wait-die scheme non-preemptive older transaction may wait for younger one to release data item. Younger transactions never wait for older ones; they are rolled back instead. a transaction may die several times before acquiring needed data item wound-wait scheme preemptive older transaction wounds (forces rollback) of younger transaction instead of waiting for it. Younger transactions may wait for older ones. may be fewer rollbacks than wait-die scheme.
Both in wait-die and in wound-wait schemes, a rolled back transactions is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided. Timeout-Based Schemes : a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back. thus deadlocks are not possible simple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval.
Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E), V is a set of vertices (all the transactions in the system) E is a set of edges; each element is an ordered pair T i T j. If T i T j is in E, then there is a directed edge from T i to T j, implying that T i is waiting for T j to release a data item. When T i requests a data item currently being held by T j, then the edge T i T j is inserted in the wait-for graph. This edge is removed only when T j is no longer holding a data item needed by T i. The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
Wait-for graph without a cycle Wait-for graph with a cycle
When deadlock is detected : Some transaction will have to rolled back (made a victim) to break deadlock. Select that transaction as victim that will incur minimum cost. Rollback -- determine how far to roll back transaction Total rollback: Abort the transaction and then restart it. More effective to roll back transaction only as far as necessary to break deadlock. Starvation happens if same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation
Transaction failure : Logical errors: transaction cannot complete due to some internal error condition System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock) System crash: a power failure or other hardware or software failure causes the system to crash. Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by system crash Database systems have numerous integrity checks to prevent corruption of disk data Disk failure: a head crash or similar disk failure destroys all or part of disk storage Destruction is assumed to be detectable: disk drives use checksums to detect failures
Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures Focus of this chapter Recovery algorithms have two parts 1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures 2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature multiple processors and this trend is projected to accelerate Databases are growing increasingly large large volumes of transaction data are collected and stored for later analysis. multimedia objects like images are increasingly stored in databases Large-scale parallel database systems increasingly used for: storing large volumes of data processing time-consuming decision-support queries providing high throughput for transaction processing
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel data can be partitioned and each processor can work independently on its own partition. Queries are expressed in high level language (SQL, translated to relational algebra) makes parallelization easier. Different queries can be run in parallel with each other.Concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk. Partitioning techniques (number of disks = n): Round-robin: Send the i th tuple inserted in the relation to disk i mod n. Hash partitioning: Choose one or more attributes as the partitioning attributes. Choose hash function h with range 0…n - 1 Let i denote result of hash function h applied tothe partitioning attribute value of a tuple. Send tuple to disk i.
Partitioning techniques (cont.): Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v o, v 1,..., v n-2 ] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that v i v i+1 go to disk I + 1. Tuples with v < v 0 go to disk 0 and tuples with v v n-2 go to disk n-1. E.g., with a partitioning vector [5,11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk2.
DDBS: Multiple logically interrelated databases distributed over a computer network A distributed database system consists of loosely coupled sites that share no physical component Database systems that run on each site are independent of each other Transactions may access data at one or more sites
In a homogeneous distributed database All sites have identical software Are aware of each other and agree to cooperate in processing user requests. Each site surrenders part of its autonomy in terms of right to change schemas or software Appears to user as a single system In a heterogeneous distributed database Different sites may use different schemas and software Difference in schema is a major problem for query processing Difference in software is a major problem for transaction processing Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing
System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites. Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication Availability: failure of site containing relation r does not result in unavailability of r is replicas exist. Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site containing a replica of r. Disadvantages of Replication Increased cost of updates: each replica of relation r must be updated. Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution: choose one copy as primary copy and apply concurrency control operations on primary copy
Division of relation r into fragments r 1, r 2, …, r n which contain sufficient information to reconstruct relation r. Horizontal fragmentation: each tuple of r is assigned to one or more fragments Vertical fragmentation: the schema for relation r is split into several smaller schemas All schemas must contain a common candidate key (or superkey) to ensure lossless join property. A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key. Example : relation account with following schema Account = (account_number, branch_name, balance )
Horizontal: allows parallel processing on fragments of a relation allows a relation to be split so that tuples are located where they are most frequently accessed Vertical: allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed tuple-id attribute allows efficient joining of vertical fragments Vertical and horizontal fragmentation can be mixed. Fragments may be successively fragmented to an arbitrary depth. Replication and fragmentation can be combined Relation is partitioned into several fragments: system maintains several identical replicas of each such fragment.
1. Raghu Ramakrishnan, Johannes Gehrke, Database Management System, McGraw Hill., 3 rd Edition Elmashri & Navathe, Fundamentals of Database System, Addison-Wesley Publishing, 3 rd Edition, Date C.J, An Introduction to Database, Addison-Wesley Pub Co, 7 th Edition, Jeffrey D. Ullman, Jennifer Widom, A First Course in Database System, Prentice Hall, AWL 1 st Edition, Peter rob, Carlos Coronel, Database Systems – Design, Implementation, and Management, 4 th Edition, Thomson Learning, 2001.
Define flash memory. Define log disk. What is meant by log-based file system? Define a transaction. List the required properties of a transaction to ensure integrity of the data. What is meant by cascading rollback? Define concurrency control. Define locking protocol. Define cache coherency. Define parallel aggregation. Define query optimization Define locking protocol. Define fuzzy checkpoint. What is meant by Write-ahead-logging (WAL) rule?