2 Databases
- Data Models: conceptual representation of the data
- Data Retrieval
  - How to ask questions of the database
  - How to answer those questions
- Data Storage: how/where to store data, how to access it
- Data Integrity
  - Manage crashes, concurrency
  - Manage semantic inconsistencies
4 Storage Hierarchy
- Tradeoffs between speed and cost of access
- Volatile vs nonvolatile
  - Volatile: loses contents when power is switched off
- Sequential vs random access
  - Sequential: read the data contiguously
  - Random: read the data from anywhere at any time
6 Storage Hierarchy
- Cache
  - Super fast; volatile
  - Typically on chip
  - L1 vs L2 vs L3 caches ???
    - Huge L3 caches available nowadays
  - Becoming more and more important to care about this
    - Cache misses are expensive
  - Similar tradeoffs as were seen between main memory and disks
  - Cache-coherency ??
7 Storage Hierarchy
- Main memory
  - 10s or 100s of ns; volatile
  - Pretty cheap and dropping: 1 GByte < $100
  - Main memory databases feasible nowadays
- Flash memory (EEPROM)
  - Limited number of write/erase cycles
  - Non-volatile, slower than main memory (especially writes)
  - Examples ?
- Question
  - How does what we discuss next change if we use flash memory only ?
  - Key issue: random access as cheap as sequential access
8 Storage Hierarchy
- Magnetic Disk (Hard Drive)
  - Non-volatile
  - Sequential access much, much faster than random access
  - Discussed in more detail later
- Optical Storage: CDs/DVDs; Jukeboxes
  - Used more as backups… Why ?
  - Very slow to write (if possible at all)
- Tape storage
  - Backups; super-cheap; painful to access
  - IBM just released a secure tape drive storage solution
9 Jim Gray’s Storage Latency Analogy: How Far Away is the Data?
  Storage level           Relative access time   Analogy: where   Analogy: how long
  Registers               1                      My Head          1 min
  On Chip Cache           2                      This Room        (not given)
  On Board Cache          10                     This Hotel       10 min
  Memory                  100                    Sacramento       1.5 hr
  Disk                    10^6                   Pluto            2 Years
  Tape / Optical Robot    10^9                   Andromeda        2,000 Years
10 Storage…
- Primary: e.g. main memory, cache; typically volatile, fast
- Secondary: e.g. disks; non-volatile
- Tertiary: e.g. tapes; non-volatile, super cheap, slow
18 Accessing Data
- Accessing a sector
  - Time to seek to the track (seek time): average 4 to 10 ms
  - + Waiting for the sector to get under the head (rotational latency): average 4 to 11 ms
  - + Time to transfer the data (transfer time): very low
- About 10 ms per access
  - So if blocks are accessed randomly, can only do about 100 block transfers per second
  - 100 x 512 bytes = 50 KB/s
- Data transfer rates
  - Rate at which data can be transferred (w/o any seeks)
  - 30-50 MB/s (compare to above)
- Seeks are bad !
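The comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: the constants are the round numbers from the slide (10 ms per random access, 512-byte blocks, the low end of the 30-50 MB/s sequential rate), and the function and variable names are made up for illustration.

```python
# Back-of-the-envelope numbers taken from the slide; names are illustrative.
ACCESS_TIME_S = 0.010      # ~10 ms per random access (seek + rotational latency)
BLOCK_BYTES = 512          # one sector-sized block
SEQUENTIAL_BPS = 30e6      # low end of the 30-50 MB/s transfer rate


def random_throughput(access_time_s: float, block_bytes: int) -> float:
    """Bytes per second when every block costs one full random access."""
    accesses_per_second = 1.0 / access_time_s
    return accesses_per_second * block_bytes


random_bps = random_throughput(ACCESS_TIME_S, BLOCK_BYTES)
slowdown = SEQUENTIAL_BPS / random_bps

print(f"random access:  {random_bps / 1000:.1f} KB/s")
print(f"sequential is roughly {slowdown:.0f}x faster")
```

Even against the low end of the sequential rate, random block-at-a-time access is slower by a factor of several hundred, which is the point of "Seeks are bad !".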
19 Reliability
- Mean time to/between failure (MTTF/MTBF): 57 to 136 years
- Consider:
  - 1000 new disks
  - 1,200,000 hours of MTTF each
  - On average, one will fail every 1200 hours = 50 days !
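The 50-day figure follows from dividing the per-disk MTTF by the number of disks. The sketch below assumes disks fail independently and at a constant rate, which is a simplification; the function name is illustrative.

```python
def hours_between_failures(num_disks: int, mttf_hours: float) -> float:
    """Expected time between failures somewhere in a pool of identical disks.

    Assumes independent failures at a constant rate (a simplification).
    """
    return mttf_hours / num_disks


hours = hours_between_failures(num_disks=1000, mttf_hours=1_200_000)
days = hours / 24

print(f"one failure about every {hours:.0f} hours ({days:.0f} days)")
```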
20 Disk Controller
- Interface between the disk and the CPU
- Accepts the commands
- Uses checksums to verify correctness
- Remaps bad sectors
21 Optimizing block accesses
- Typically sectors are too small
- Block: a contiguous sequence of sectors
  - 512 bytes to several Kbytes
  - All data transfers done in units of blocks
- Scheduling of block access requests ?
  - Considerations: performance and fairness
  - Elevator algorithm
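The elevator algorithm can be sketched as follows: the head keeps moving in one direction, serving every pending request on the way, and then reverses. This is a minimal illustration, not a real scheduler; the function name, the track numbers, and the assumption that no new requests arrive during the sweep are all made up for the example.

```python
def elevator_order(pending: list[int], head: int, moving_up: bool = True) -> list[int]:
    """Order in which pending track requests are served by one sweep and its return.

    Assumes the request queue does not change while the sweep is in progress.
    """
    lower = sorted(t for t in pending if t < head)
    upper = sorted(t for t in pending if t >= head)
    if moving_up:
        # Serve everything ahead of the head first, then reverse direction.
        return upper + lower[::-1]
    return lower[::-1] + upper


requests = [98, 183, 37, 122, 14, 124, 65, 67]   # hypothetical track numbers
order = elevator_order(requests, head=53)
print(order)
```

Compared with first-come-first-served, the head never doubles back within a sweep, which keeps total seek distance low while still guaranteeing that every request is eventually served.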
23 RAID
- Redundant array of independent disks
- Goal:
  - Disks are very cheap
  - Failures are very costly
  - Use “extra” disks to ensure reliability
    - If one disk goes down, the data still survives
  - Also allows faster access to data
- Many RAID “levels”
  - Different reliability and performance properties
24 RAID Levels
- (a) No redundancy.
- (b) Make a copy of the disks.
  - If one disk goes down, we have a copy.
  - Reads: can go to either disk, so higher data rate possible.
  - Writes: need to write to both disks.
25 RAID Levels
- (c) Memory-style error correcting
  - Keep extra bits around so we can reconstruct.
  - Superseded by below.
- (d) One disk contains “parity” for the main data disks.
  - Can handle a single disk failure.
  - Little overhead (only 25% in the above case).
26 RAID Level 5
- Distributed parity “blocks” instead of bits
- Subsumes Level 4
- Normal operation:
  - “Read”: directly from the disk. Uses all 5 disks
  - “Write”: need to read and update the parity block
    - To update 9 to 9’:
      - read 9 and P2
      - compute P2’ = P2 xor 9 xor 9’
      - write 9’ and P2’
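The parity update rule P2’ = P2 xor 9 xor 9’ can be checked on toy data. The sketch below models blocks as short byte strings; the block contents and the helper name `xor_blocks` are invented for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: four data blocks (made-up contents) plus their parity block.
stripe = [b"\x0f\x10", b"\xa5\x5a", b"\x33\xcc", b"\x00\xff"]
parity = xor_blocks(*stripe)

# Small write: replace one data block without touching the other three.
old_block = stripe[1]
new_block = b"\x12\x34"
new_parity = xor_blocks(parity, old_block, new_block)   # P' = P xor old xor new
stripe[1] = new_block

# The shortcut agrees with recomputing parity from the whole stripe.
assert new_parity == xor_blocks(*stripe)
print("parity updated with 2 reads and 2 writes")
```

This is why a single-block write costs two block reads (old data, old parity) and two block writes (new data, new parity), regardless of how many disks are in the array.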
27 RAID Level 5
- Failure operation (disk 3 has failed)
  - “Read block 0”: read it directly from disk 2
  - “Read block 1” (which is on disk 3)
    - Read P0, 0, 2, 3 and compute 1 = P0 xor 0 xor 2 xor 3
  - “Write”: to update 9 to 9’
    - read 9 and P2
    - Oh… P2 is on disk 3
    - So no need to update it
    - Write 9’
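Reconstructing a lost block from the surviving blocks and parity can be sketched the same way. The block contents below are made up, and `xor_blocks` is a hypothetical helper, not part of any RAID implementation.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Stripe 0: data blocks 0..3 (invented contents) and their parity P0.
data = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC", 3: b"DDDD"}
p0 = xor_blocks(*data.values())

# The disk holding block 1 has failed: rebuild it from everything else.
surviving = [data[0], data[2], data[3]]
rebuilt = xor_blocks(p0, *surviving)        # 1 = P0 xor 0 xor 2 xor 3

assert rebuilt == data[1]
print("block 1 reconstructed:", rebuilt)
```

A read of a block on the failed disk therefore touches every surviving disk in the stripe, which is why performance degrades until the failed disk is replaced and rebuilt.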
28 Choosing a RAID level
- Main choice between RAID 1 and RAID 5
- Level 1 has better write performance than level 5
  - Level 5: 2 block reads and 2 block writes to write a single block
  - Level 1: only requires 2 block writes
  - Level 1 preferred for high update environments such as log disks
- Level 5 has lower storage cost
  - Level 1 needs 60% more disks
  - Level 5 is preferred for applications with low update rate and large amounts of data
30 Buffer Manager
- Data must be in RAM for DBMS to operate on it!
- Buffer Mgr hides the fact that not all data is in RAM
[Figure: page requests from higher levels go to the buffer pool in main memory, which holds disk pages and free frames; pages are read from the DB on disk; choice of frame dictated by replacement policy]
31 Buffer Manager
- Similar to virtual memory manager
- Buffer replacement policies
  - What page to evict ?
  - LRU: Least Recently Used
    - Throw out the page that has gone unused for the longest time
  - MRU: Most Recently Used
    - The opposite
    - Why ?
  - Clock ?
    - An efficient approximation of LRU
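A minimal sketch of LRU replacement, assuming a fixed number of frames and a caller-supplied `read_page` function standing in for the disk read. The class and its methods are illustrative only; a real buffer manager also tracks pinned and dirty pages, which are omitted here.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Toy buffer pool: fixed number of frames, least recently used page evicted."""

    def __init__(self, num_frames, read_page):
        self.num_frames = num_frames
        self.read_page = read_page      # stand-in for reading a page from disk
        self.frames = OrderedDict()     # page_id -> contents, oldest use first
        self.evicted = []               # eviction history, kept for illustration

    def get(self, page_id):
        if page_id in self.frames:
            # Hit: mark the page as most recently used.
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:
            # Miss with a full pool: evict the least recently used page.
            victim, _ = self.frames.popitem(last=False)
            self.evicted.append(victim)
        self.frames[page_id] = self.read_page(page_id)
        return self.frames[page_id]


pool = LRUBufferPool(3, read_page=lambda pid: f"contents of page {pid}")
for pid in [1, 2, 3, 1, 4, 5]:
    pool.get(pid)

print("evicted:", pool.evicted)
print("resident:", list(pool.frames))
```

Page 1 survives because it was touched again before the pool filled up. MRU would make the opposite choice, which can be better for repeated sequential scans larger than the pool.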
32 Buffer Manager
- Pinning a block
  - Not allowed to write back to the disk
- Force-output (force-write)
  - Force the contents of a block to be written to disk
- Order the writes
  - This block must be written to disk before this block
- Critical for fault tolerant guarantees
  - Otherwise the database has no control over what's on disk and what's not on disk
34 File Organization
- How are the relations mapped to the disk blocks ?
  - Use a standard file system ?
    - High-end systems have their own OS/file systems
    - OS interferes more than helps in many cases
  - Mapping of relations to files ?
    - One-to-one ?
    - Advantages in storing multiple relations clustered together
  - A file is essentially a collection of disk blocks
- How are the tuples mapped to the disk blocks ?
  - How are they stored within each block
35 File Organization
- Goals:
  - Allow insertion/deletions of tuples/records
  - Fetch a particular record (specified by record id)
  - Find all tuples that match a condition (say SSN = 123) ?
- Simplest case
  - Each relation is mapped to a file
  - A file contains a sequence of records
  - Each record corresponds to a logical tuple
- Next:
  - How are tuples/records stored within a block ?
36 Fixed Length Records
- n = number of bytes per record
  - Store record i at position: n * (i - 1)
- Records may cross blocks
  - Not desirable
  - Stagger so that that doesn’t happen
- Inserting a tuple ?
  - Depends on the policy used
  - One option: simply append at the end of the file
- Deletions ?
  - Option 1: Rearrange
  - Option 2: Keep a free list and use for next insert
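The position formula and the "stagger" idea can be sketched as below. The record and block sizes are arbitrary example values, and both function names are illustrative.

```python
def naive_offset(i: int, record_bytes: int) -> int:
    """Byte position of 1-based record i when records are packed back to back."""
    return record_bytes * (i - 1)


def staggered_location(i: int, record_bytes: int, block_bytes: int) -> tuple[int, int]:
    """(block number, offset within block) so that no record spans two blocks.

    Each block holds only whole records; leftover bytes at the end stay unused.
    """
    per_block = block_bytes // record_bytes
    block, slot = divmod(i - 1, per_block)
    return block, slot * record_bytes


RECORD_BYTES, BLOCK_BYTES = 40, 512      # example sizes: 12 whole records per block

start = naive_offset(13, RECORD_BYTES)   # packed layout: bytes 480..519
crosses = start // BLOCK_BYTES != (start + RECORD_BYTES - 1) // BLOCK_BYTES
print("record 13 crosses a block boundary when packed:", crosses)
print("staggered location of record 13:", staggered_location(13, RECORD_BYTES, BLOCK_BYTES))
```

Staggering wastes the last 32 bytes of each block in this example, in exchange for reading any record with a single block access.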
37 Variable-length Records
- Slotted page structure
- Indirection:
  - The records may move inside the page, but the outside world is oblivious to it
  - Why ?
    - The headers are used as an indirection mechanism
    - Record ID 1000 is the 5th entry in page number X
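A minimal sketch of the indirection idea, assuming a page is just a growable byte buffer plus a slot directory. Real slotted pages have a fixed size and grow records from the end of the page toward the header; that detail is left out. The class and method names are illustrative.

```python
class SlottedPage:
    """Toy slotted page: slot directory entries point at variable-length records."""

    def __init__(self):
        self.data = bytearray()   # record bytes, packed together
        self.slots = []           # slot number -> (offset, length), or None if deleted

    def insert(self, record: bytes) -> int:
        self.slots.append((len(self.data), len(record)))
        self.data += record
        return len(self.slots) - 1          # the slot number is the stable record id

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]   # raises TypeError for a deleted slot
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        """Free a slot and compact; surviving records move but keep their slot numbers."""
        self.slots[slot] = None
        compacted = bytearray()
        for s, entry in enumerate(self.slots):
            if entry is not None:
                offset, length = entry
                self.slots[s] = (len(compacted), length)
                compacted += self.data[offset:offset + length]
        self.data = compacted


page = SlottedPage()
a = page.insert(b"alice")
b = page.insert(b"bob")
c = page.insert(b"carol-long")
page.delete(b)

# The third record moved inside the page, but slot c still finds it.
print(page.slots[c], page.read(c))
```

Indexes and other pages refer to a record by (page number, slot number), so the page is free to reorganize its contents without anyone outside having to be updated.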
38 File Organization
- Which block of a file should a record go to ?
  - Anywhere ?
    - How to search for “SSN = 123” ?
    - Called “heap” organization
  - Sorted by SSN ?
    - Called “sequential” organization
    - Keeping it sorted would be painful
    - How would you search ?
  - Based on a “hash” key
    - Called “hashing” organization
    - Store the record with SSN = x in block number x % 1000
    - Why ?
39 Sequential File Organization
- Keep sorted by some search key
- Insertion
  - Find the block in which the tuple should be
  - If there is free space, insert it
  - Otherwise, must create overflow pages
- Deletions
  - Delete and keep the free space
  - Databases tend to be insert heavy, so free space gets used fast
- Can become fragmented
  - Must reorganize once in a while
40 Sequential File Organization
- What if I want to find a particular record by value ?
  - Account info for SSN = 123
- Binary search
  - Takes log2(n) disk accesses
    - Random accesses
  - Too much
    - n = 1,000,000, log2(n) ≈ 20
    - Recall each random access takes approx 10 ms
    - 200 ms to find just one account's information
    - About 5 requests satisfied per second
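A quick way to check this kind of estimate, assuming each binary-search probe costs one random disk access of about 10 ms. The function name and parameters are illustrative. For one million records it gives about 20 probes; 30 probes corresponds to roughly one billion records.

```python
import math


def binary_search_cost(n_records: int, access_ms: float = 10.0):
    """(probes, latency in ms, lookups per second) for a disk-based binary search.

    Assumes every probe is one random access and nothing is cached.
    """
    probes = math.ceil(math.log2(n_records))
    latency_ms = probes * access_ms
    return probes, latency_ms, 1000.0 / latency_ms


for n in (1_000_000, 1_000_000_000):
    probes, latency_ms, per_second = binary_search_cost(n)
    print(f"n = {n:>13,}: {probes} probes, {latency_ms:.0f} ms, {per_second:.1f} lookups/s")
```

Either way, a single disk can answer only a handful of lookups per second, which motivates the index structures that follow.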
42 Index
- A data structure for efficient search through large databases
- Two key ideas:
  - The records are mapped to the disk blocks in specific ways
    - Sorted, or hash-based
  - Auxiliary data structures are maintained that allow quick search
- Think library index/catalogue
- Search key:
  - Attribute or set of attributes used to look up records
  - E.g. SSN for a persons table
- Two types of indexes
  - Ordered indexes
  - Hash-based indexes
43 Ordered Indexes
- Primary index
  - The relation is sorted on the search key of the index
- Secondary index
  - It is not
- Can have only one primary index on a relation
[Figure: index and relation]
44 Primary Sparse Index
- Not every key has to appear in the index
- Allows for very small indexes
  - Better chance of fitting in memory
  - Tradeoff: must access the relation file even if the record is not present
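A sparse index lookup can be sketched with a sorted list of "first key in each block". The data below is invented, and the layout (three records per block, integer keys) is an assumption made only to keep the example small.

```python
from bisect import bisect_right

# Sorted relation stored in blocks of three records (made-up integer keys).
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Sparse index: one entry per block, holding the first key in that block.
sparse_index = [block[0] for block in blocks]


def lookup(key: int):
    """Return (block number, found?) for key, reading at most one block."""
    pos = bisect_right(sparse_index, key) - 1   # last index entry <= key
    if pos < 0:
        return None, False                      # smaller than every key in the file
    return pos, key in blocks[pos]              # must read the block to know for sure


print(lookup(50))
print(lookup(55))
print(lookup(5))
```

The lookup for 55 shows the tradeoff: the index alone cannot tell that the key is absent, so the block still has to be fetched.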
45 Secondary Index
- Relation sorted on branch
- But we want an index on balance
- Must be dense
  - Every search key must appear in the index
46 Multi-level Indexes
- What if the index itself is too big for memory ?
  - Relation size = n = 1,000,000,000
  - Block size = 100 tuples per block
  - So, number of pages = 10,000,000
  - Keeping one entry per page takes too much space
- Solution
  - Build an index on the index itself
47 Multi-level Indexes
- How do you search through a multi-level index ?
- What about keeping the index up-to-date ?
  - Tuple insertions and deletions
- This is a static structure
  - Need overflow pages to deal with insertions
- Works well if no inserts/deletes
- Not so good when inserts and deletes are common
50 B+-Tree Node Structure
- Typical node
  - Ki are the search-key values
  - Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search-keys in a node are ordered
  - K1 < K2 < K3 < … < Kn-1
51 Properties of B+-Trees
- It is balanced
  - Every path from the root to a leaf is the same length
- Leaf nodes (at the bottom)
  - P1 contains the pointers to tuple(s) with key K1
  - …
  - Pn is a pointer to the next leaf node
  - Must contain at least n/2 entries
53 Properties
- Interior nodes
  - All tuples in the subtree pointed to by P1 have search key < K1
  - To find a tuple with key K1’ < K1, follow P1
  - …
  - Finally, search keys in the tuples contained in the subtree pointed to by Pn are all larger than Kn-1
  - Must contain at least n/2 entries (unless root)
55 B+-Trees - Searching
- How to search ?
  - Follow the pointers
- Logarithmic
  - log_{B/2}(N), where B = number of entries per block
  - B is also called the order of the B+-Tree index
  - Typically 100 or so
- If a relation contains 1,000,000,000 entries, takes only about 5 random accesses
  - The top levels are typically in memory
  - So only requires 1 or 2 random accesses per request
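Plugging numbers into the formula gives a rough feel for tree height. The sketch below counts how many levels are needed for a given fanout, using integer arithmetic to avoid rounding surprises; the function name is illustrative, and fanouts of 50 and 100 correspond to half-full and full nodes when B = 100.

```python
def levels_needed(num_entries: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= num_entries."""
    levels, capacity = 0, 1
    while capacity < num_entries:
        capacity *= fanout
        levels += 1
    return levels


N = 1_000_000_000
for fanout in (50, 100):
    print(f"fanout {fanout:>3}: {levels_needed(N, fanout)} levels")
```

The height grows very slowly with N, so keeping the top one or two levels in the buffer pool removes a large share of the disk accesses for every lookup.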
56 Tuple Insertion
- Find the leaf node where the search key should go
- If already present
  - Insert record in the file. Update the bucket if necessary
    - This would be needed for secondary indexes
- If not present
  - Insert the record in the file
  - Adjust the index
    - Add a new (Ki, Pi) pair to the leaf node
    - Recall the keys in the nodes are sorted
  - What if there is no space ?
57 Tuple Insertion
- Splitting a node
  - Node has too many key-pointer pairs
    - Needs to store n, only has space for n-1
  - Split the node into two nodes
    - Put about half in each
  - Recursively go up the tree
    - May result in splitting all the way to the root
    - In fact, may end up adding a level to the tree
  - Pseudocode in the book !!
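The leaf-level step of a split can be sketched as follows. This is not the book's pseudocode: it handles only the keys of a single leaf (pointers and the recursive step up the tree are omitted), and the function name, the capacity parameter, and the choice of which half gets the extra key are assumptions for illustration.

```python
from bisect import insort


def insert_into_leaf(keys: list, key, capacity: int):
    """Insert key into a sorted leaf holding at most `capacity` keys.

    Returns (left, right, separator). If the leaf did not overflow,
    right and separator are None. On overflow the keys are split about
    in half and the first key of the right node is the separator that
    the parent would receive.
    """
    keys = list(keys)
    insort(keys, key)                    # keep the leaf sorted
    if len(keys) <= capacity:
        return keys, None, None
    mid = (len(keys) + 1) // 2           # left half gets the extra key, if any
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


# A full leaf with room for 2 keys (example branch names).
left, right, separator = insert_into_leaf(["Brighton", "Downtown"], "Clearview", capacity=2)
print(left, right, separator)
```

If the parent has no room for the new separator, the same split is applied to the parent, which is how a split can propagate to the root and add a level.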
58 B+-Trees: Insertion
[Figure: B+-Tree before and after insertion of “Clearview”]
59 Updates on B+-Trees: Deletion
- Find the record, delete it.
- Remove the corresponding (search-key, pointer) pair from a leaf node
  - Note that there might be another tuple with the same search-key
  - In that case, this is not needed
- Issue:
  - The leaf node now may contain too few entries
    - Why do we care ?
- Solution:
  - See if you can borrow some entries from a sibling
  - If all the siblings are also just barely full, then merge (opposite of split)
  - May end up merging all the way to the root
  - In fact, may reduce the height of the tree by one
60 Examples of B+-Tree Deletion
[Figure: before and after deleting “Downtown”]
- Deleting “Downtown” causes merging of under-full leaves
  - A leaf node can become empty only for n=3!
61 Examples of B+-Tree Deletion
[Figure: deletion of “Perryridge” from result of previous example]
62 Example of B+-tree Deletion
[Figure: before and after deletion of “Perryridge” from earlier example]
- Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling
- Search-key value in the parent’s parent changes as a result
63 B+ Trees in Practice
- Typical order: 100. Typical fill-factor: 67%.
  - Average fanout = 133
- Typical capacities:
  - Height 3: 133^3 = 2,352,637 entries
  - Height 4: 133^4 = 312,900,721 entries
- Can often hold top levels in buffer pool:
  - Level 1 = 1 page = 8 Kbytes
  - Level 2 = 133 pages = 1 Mbyte
  - Level 3 = 17,689 pages = 133 MBytes
64 B+ Trees: Summary
- Searching:
  - log_d(n), where d is the order and n is the number of entries
- Insertion:
  - Find the leaf to insert into
  - If full, split the node, and adjust index accordingly
  - Similar cost as searching
- Deletion
  - Find the leaf node
  - Delete
  - May not remain half-full; must adjust the index accordingly
65 More…
- Primary vs Secondary Indexes
- More B+-Trees
- Hash-based Indexes
  - Static Hashing
  - Extendible Hashing
  - Linear Hashing
- Grid-files
- R-Trees
- etc…
66 Secondary Index
- If relation not sorted by search key, called a secondary index
  - Not all tuples with the same search key will be together
  - Searching is more expensive
67 B+-Tree File Organization
- Store the records at the leaves
- Sorted order etc..
68 B-Tree
- Predates B+-Trees
- Different treatment of search keys
- Less storage
- Significantly harder to implement
- Not used.
69 Hash-based File Organization
- Store record with search key k in block number h(k)
  - e.g. for a person file, h(SSN) = SSN % 4
- Blocks called “buckets”
- What if the block becomes full ?
  - Overflow pages
- Uniformity property:
  - Don’t want all tuples to map to the same bucket
  - h(SSN) = SSN % 2 would be bad
[Figure: buckets
  Block 0: (4044, “C”, …)
  Block 1: (401, “Ax”, …), (21, “Bx”, …)
  Block 2: (1002, “Ay”, …), (10, “By”, …)
  Block 3: (1003, “Az”, …), (35, “Bz”, …)]
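The bucket assignment and the uniformity concern can be reproduced with the keys from the slide. The helper name `build_buckets` is made up, and overflow pages are ignored: each bucket is simply a Python list.

```python
def build_buckets(keys, hash_fn, num_buckets):
    """Place each key in the bucket chosen by hash_fn (no overflow handling)."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[hash_fn(key)].append(key)
    return buckets


ssns = [4044, 401, 21, 1002, 10, 1003, 35]      # search keys from the slide

good = build_buckets(ssns, lambda k: k % 4, 4)   # h(SSN) = SSN % 4
bad = build_buckets(ssns, lambda k: k % 2, 4)    # h(SSN) = SSN % 2 uses only 2 of 4 blocks

for name, buckets in (("SSN % 4", good), ("SSN % 2", bad)):
    print(name, buckets)
```

With SSN % 2, half of the blocks stay empty while the other half fill up and spill into overflow pages, so lookups degrade toward scanning long chains.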
70 Hash-based File Organization
- Hashed on “branch-name”
- Hash function:
  - a = 1, b = 2, .., z = 26
  - h(abz) = (1 + 2 + 26) % 10 = 9
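The letter-sum hash can be written out directly. This sketch assumes the function sums the letter positions and takes the result modulo 10, as in the h(abz) example; ignoring case and skipping non-letter characters are extra assumptions, and the function name is illustrative.

```python
def branch_hash(name: str, num_buckets: int = 10) -> int:
    """Sum of letter positions (a = 1, ..., z = 26) modulo the number of buckets."""
    total = sum(ord(c) - ord("a") + 1 for c in name.lower() if "a" <= c <= "z")
    return total % num_buckets


print(branch_hash("abz"))   # (1 + 2 + 26) % 10
```

A hash like this is easy to compute by hand, though names made of the same letters in a different order always collide, which is one reason real systems use stronger hash functions.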
71 Hash Indexes
- Extends the basic idea
- Search:
  - Find the block with search key
  - Follow the pointer
- Range search ?
  - a < X < b ?
72 Hash Indexes
- Very fast search on equality
- Can’t search for “ranges” at all
  - Must scan the file
- Inserts/Deletes
  - Overflow pages can degrade the performance
  - Two approaches
    - Dynamic hashing
    - Extendible hashing
73 Grid Files
- Multidimensional index structure
- Can handle:
  - X = x1 and Y = y1
  - a < X < b and c < Y < d
[Figure: a grid cell stores pointers to tuples with branch-name between Mianus and Perryridge and balance < 1k]
74 R-Trees
- For spatial data (e.g. maps, rectangles, GPS data etc)
75 Conclusions
- Indexing goal: “Quickly find the tuples that match certain conditions”
- Equality and range queries most common
  - Hence B+-Trees are the predominant structure for on-disk representation
  - Hashing is used more commonly for in-memory operations
- Many, many more types of indexing structures exist
  - For different types of data
  - For different types of queries
    - E.g. “nearest-neighbor” queries