CS346: Advanced Databases

1 CS346: Advanced Databases
Graham Cormode
Storage, Files and Indexing

2 Outline
Part 1: Disk properties and file storage
File organizations: ordered, unordered, and hashed
Storage topics: RAID and storage area networks
Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe
Part 2: Indexes
CS346 Advanced Databases

3 Why?
Important to understand how high-level abstractions (databases) map down to low-level concepts (disks, files)
Get a sense of the scale of the quantities involved (seek times, overhead of inefficient solutions)
Appreciate the difference that smart solutions can bring
Understand where the bottlenecks lie
Give a “bottom-up” perspective on data management
See the whole picture starting from the low level
Demystify some aspects that can seem opaque (B-trees, hashing, file organization)
Apply to many areas of computer science (OS, algorithms…)

4 The Memory Hierarchy
[Figure: the memory hierarchy; label: Flash Storage]

5 Data on Disks
Databases ultimately rely on non-volatile disk storage
Data typically does not fit in (volatile) memory
Physical properties of disks affect performance of the DBMS
Need to understand some basics of disks
A few exceptions to disk-based databases:
Some real-time applications use “in-memory databases”
Some legacy/massive applications use tape storage as well
Different tradeoffs with flash-based storage
Much faster to read, but limits on the number of erase (write) cycles
No major difference between random access and linear scan
“Flash databases” are a niche but growing area

6 Rotating Disk
Rotation speed: 5000 – 10000 RPM
Sector size: 0.5KB – 4KB, basic unit of data transfer from disk
Seek time: move read head into position, currently ~4ms
Includes rotational delay: wait for sector to come under read head
Random access: 1/0.004 * 4KB = 1MB/second: quite slow
Track-to-track move, currently ~0.4ms: 10 times faster
Sustained read/write rate: 100MB/second (caching can improve)
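
The random-versus-sequential gap on this slide can be checked with a few lines of arithmetic. This is only a sketch using the slide's figures (4ms per random access, 4KB sectors, 100MB/second sustained); the helper name is made up, and decimal units are assumed as on the slide.

```python
# Figures taken from the slide; decimal KB/MB assumed, as in the slide's own arithmetic.
SEEK_TIME_S = 0.004                          # ~4 ms per random access (seek + rotational delay)
SECTOR_BYTES = 4 * 1000                      # one 4KB sector transferred per access
SEQUENTIAL_BYTES_PER_S = 100 * 1000 * 1000   # 100MB/second sustained rate


def random_access_throughput(seek_time_s, bytes_per_access):
    """Bytes per second when every access pays one full seek (hypothetical helper)."""
    return bytes_per_access / seek_time_s


random_bps = random_access_throughput(SEEK_TIME_S, SECTOR_BYTES)
ratio = SEQUENTIAL_BYTES_PER_S / random_bps
print(f"random: {random_bps / 1e6:.1f} MB/s, sequential is about {ratio:.0f}x faster")
```

With these inputs the ratio comes out at roughly 100, which matches the "factors of up to 100s" quoted on the next slide.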

7 Disk properties: the fundamental contrast
Random access is slow, sequential access is fast
By factors of up to 100s
Want to design storage of data to avoid or minimize random access and make data access as fast as possible
Buffering can help in multithreaded systems:
Work on other processes while waiting for data to arrive
Double buffering: maintain two buffers of data
Work on current buffer of data, while other buffer fills from disk
Maximizes parallel utilization, but doesn’t make my thread faster

8 Records: the basic unit of the database
Databases fundamentally composed of records
Each record describes an object with a number of fields
Fields have a type (integer, float, string, time, compound…)
Fixed or variable length
Need to know when one field ends and the next begins
Field length codes
Field separators (special characters)
Leads to variable length records
How to effectively search through data with variable length records?

10 Records and Blocks
Records get stored on disks organized into blocks
Small records: pack an integer number into each block
Leaves some space left over in blocks
Blocking Factor: (average) number of records per block
Large records: may not be effective to leave slack
Records may span across multiple blocks (spanned organization)
May use a pointer at end of block to point to next block

11 Files
A sequence of records is stored as a file
Either using OS file system support, or handled by DBMS
Database requires support for various file operations:
Open file, return new file handle
Scan for the next record that satisfies a search condition
Read the next record from disk into memory
Delete the current record and (eventually) update file on disk
Modify the current record and (eventually) update file on disk
Insert a new record at the current location
Close the file, flush any buffers and postponed operations
Need suitable file layout and indices to allow fast scan operation

12 File organization: unordered
Just dump the records on disk in no particular order
Insert is very efficient: just add to last block
Scan is very inefficient: need to do a linear search
Read half the file on average
Delete could be inefficient:
Read whole file, write it back with deleted record omitted
Instead, just “mark” record as deleted
Periodically remove marked records

13 File organization: ordered
Keep records ordered on some (key) attribute
Can scan through records in that order very easily
Can search for a value (or range of values) by binary search
Binary search: log2(b) seeks to find desired record out of b blocks
Linear search: b/2 seeks on average to find record
Insertion is rather more expensive and complex to do well
Keep recent records in “overflow buffer” for periodic merge
If modifying the key field, treat as a deletion and an insertion
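
The log2(b) bound for an ordered file can be illustrated with a small simulation. This is a sketch only: the file is modelled as an in-memory list of sorted blocks, every block inspected counts as one seek, and the function name and sample data are invented for the illustration.

```python
def find_in_ordered_file(blocks, key):
    """Binary search over sorted blocks; returns (found, blocks_read)."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # stands in for one block read from disk
        blocks_read += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return key in block, blocks_read
    return False, blocks_read


# Assumed sample data: 3000 blocks of 10 records each, keys 0..29999.
blocks = [list(range(start, start + 10)) for start in range(0, 30000, 10)]
found, reads = find_in_ordered_file(blocks, 12345)
print(found, reads)
```

For 3000 blocks the loop never reads more than 12 blocks, compared with 1500 on average for a linear scan.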

15 File organization: hashed
Use hashing to ensure records with same key are grouped together
Arrange file blocks into M equal sized buckets
Often, 1 block = 1 bucket
Apply hash function to key field to determine its bucket
Usual hash table concerns emerge
Need to deal with collisions, e.g. by open addressing, or chaining
Deletions also get messy, depending on collision method used

16 External hashing
Don’t store records directly in buckets, store pointers to records
Pointers are small, fit more in a block
“All problems in computer science can be solved by another level of indirection” – David Wheeler

17 External hashing: issues
Aim for 70-90% occupancy of the hash table
Not too much wastage, not too many collisions
Hash function should spread records evenly across buckets
If very skewed distribution, we lose benefits of hashing
Still costly if access to records ordered by key is required
And doesn’t help with accessing records not by key
Main disadvantage: hard to adjust if number of records grows
Need to resize the hash table
What if too many records hash to the same bucket?
Can handle extra records by “chaining” to overflow buckets
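
A minimal sketch of a hashed file with overflow chaining may make the bucket mechanics concrete. It is an in-memory model, not a disk format: the class name, the use of Python's built-in `hash`, and the sample keys are all assumptions made for the illustration.

```python
class HashedFile:
    """M buckets; each bucket is a chain of fixed-capacity blocks (hypothetical model)."""

    def __init__(self, num_buckets, bucket_capacity):
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        # Each bucket starts as a chain holding one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # Assumed hash function: Python's built-in hash, reduced modulo M.
        return hash(key) % self.num_buckets

    def insert(self, key, record):
        chain = self.buckets[self._bucket_for(key)]
        if len(chain[-1]) >= self.bucket_capacity:
            chain.append([])            # last block is full: chain an overflow block
        chain[-1].append((key, record))

    def lookup(self, key):
        # Reads the primary block and every overflow block chained to it.
        chain = self.buckets[self._bucket_for(key)]
        return [rec for block in chain for k, rec in block if k == key]


f = HashedFile(num_buckets=4, bucket_capacity=2)
for k in range(20):
    f.insert(k, f"r{k}")
print(f.lookup(7), len(f.buckets[3]))
```

Here 20 records land in 4 buckets of capacity 2, so every bucket needs overflow blocks and each lookup reads several blocks. That is the cost of a table that is too small for its data.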

18 Hashing: Overflow buckets

19 Extendible hashing
Hashing scheme that allows the hash table to grow and shrink
Avoid wasted space and avoid excessive collisions
Makes use of a directory of bucket addresses
Directory size is a power of two, 2^d
So can double or halve the directory size as needed
The first d bits of the hash value are used to index into the directory
Directory entries point to disk blocks storing records
Contiguous directory entries can point to same disk block
Disk blocks can have a local value of d, denoted d’
Insertions into a block may cause it to overflow and split in two
The directory is then updated accordingly

20 Extendible hashing example
Some values of d’ less than global d

21 Extendible Hashing: Updating d
If a bucket becomes full, may need to increase d ← d + 1
Double the size of the directory
Similarly, if all buckets have local d’ < d, can decrease d ← d – 1
Halve the size of the directory
Other adaptive hashing variants exist
Dynamic hashing: binary tree directory
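
The directory doubling and bucket splitting described on the last few slides can be sketched as follows. This is an illustrative in-memory model, not the textbook's code: the class and method names, the 16-bit hash width, and the multiplicative scrambling constant are all assumptions. It only grows (no directory halving), and it would fail if more than `capacity` keys shared the same full 16-bit hash value.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth   # d' on the slides
        self.capacity = capacity
        self.items = {}                  # key -> record


class ExtendibleHash:
    def __init__(self, capacity=2, hash_bits=16):
        self.global_depth = 0            # d on the slides
        self.capacity = capacity
        self.hash_bits = hash_bits
        self.directory = [Bucket(0, capacity)]   # always 2**d entries

    def _index(self, key):
        # Assumed hash: scramble Python's hash into hash_bits bits,
        # then use the first (most significant) d bits as the directory index.
        h = (hash(key) * 40503) & ((1 << self.hash_bits) - 1)
        return h >> (self.hash_bits - self.global_depth)

    def lookup(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, record):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = record
                return
            self._split(bucket)          # bucket overflowed: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # d <- d + 1: each old directory entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        # Entries pointing at this bucket are contiguous; the upper half
        # (next hash bit = 1) is redirected to the new sibling bucket.
        idxs = [i for i, b in enumerate(self.directory) if b is bucket]
        for i in idxs[len(idxs) // 2:]:
            self.directory[i] = sibling
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v


table = ExtendibleHash(capacity=2)
for k in range(100):
    table.insert(k, f"r{k}")
print(table.global_depth, len(table.directory), table.lookup(42))
```

After the inserts, the directory has exactly 2^d entries, and several adjacent entries typically share a bucket whose local depth d’ is smaller than d.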

22 RAID disk technology
RAID originally a way to combine multiple cheap disks for reliability
“Redundant Array of Inexpensive Disks” (1980s)
Now general purpose approach to providing reliability
“Redundant Array of Independent Disks”
Sets of different levels of replication
RAID 0: spread data over multiple disks (striping)
Increases throughput, but increases risk of data loss

23 Important RAID levels
RAID 1: duplication of data across multiple disks (mirroring)
Data copied to 2 (or more) disks
Disk reliability measured in “mean time between failures” (MTBF)
Typical MTBF is 100K hours – 1M hours (1M hours ≈ 1 century)
Chance of both disks failing at same time is small
So enough time to recover a copy
RAID 5: block level striping and parity coding spread over disks
Parity coding: allows recovery of 1 missing disk
[Figure: data bits and parity bit]
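
Parity coding can be shown in a few lines: the parity block is the bytewise XOR of the data blocks, so any single missing block is the XOR of everything that survives. This is a sketch of the idea only; the function names and the three 4-byte "disks" are made up, and real RAID 5 also rotates the parity block across disks.

```python
from functools import reduce


def parity_block(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single lost block from the survivors plus the parity block."""
    return parity_block(surviving_blocks + [parity])


data = [b"disk", b"zero", b"raid"]     # three hypothetical data blocks
parity = parity_block(data)
recovered = recover_missing([data[0], data[2]], parity)   # pretend disk 1 failed
print(recovered)
```

Losing two blocks at once cannot be repaired this way, which is the motivation for the stronger codes used at RAID 6.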

24 RAID levels
RAID 6: Reed-Solomon coding allows multiple disk losses
Other RAID levels (2, 3, 4) not in common usage

25 Storage Area Networks
Storage Area Networks: virtual disks
Disks attached to “headless” server
Easy to configure, low maintenance overhead
Many advantages to SANs:
Flexible configuration: hot-swap new disks in/out
Can be physically remote from other network elements
Provided on fast (fibre-based) network
Separate storage for server configuration, OS updates etc.

26 Outline
Part 2: Indexes
Indexes: primary and secondary
Multilevel indexes and B-trees
Chapter: “Indexing Structure for Files” in Elmasri and Navathe

27 Indexing for Files
Chapter: “Indexing Structure for Files” in Elmasri and Navathe
Move focus from how file is stored on disk to how file is accessed / indexed by the DBMS
Index: an auxiliary file that makes it faster to find certain records
An index is usually for one field of the record (e.g. index by name)
Can have multiple indexes, each for different fields
A basic form of an index is a sorted list of pointers
<field value, pointer to record>, ordered by field value
“An access path” for the indexed field

28 Indexes as access paths
Indexes usually take up much less space than the original file
Each index entry is much smaller than the full record
Just need a field value, and a pointer (few bytes)
Efficient to look up matching records
Binary search on the index, then follow pointer
The index may be dense or sparse
Dense index: contains an entry for every possible search value
Sparse index: contains entries only for some search values
Can have an index on the field that the file is sorted on!
Why? Can be faster to search via index than do binary search on file

29 Primary Index
A primary index applies when the file is ordered by a key field
A sparse index: one entry for each block of the data file
An index for the first record in the block (the block anchor)
Can be much fewer entries in index than in the data file
Straightforward to search for a record
Use the index to find the block that the record should be in
Retrieve the block and see if the record is there
Insertion and deletion of records in the main file is a pain!
Almost all the pointers change!
Some standard tricks to mitigate the pain
Buffer updates in an “overflow” file and check against this
Linked list of overflow records for each block as needed
Mark records as deleted, and only purge periodically
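
The two-step search through a sparse primary index (find the block via its anchor, then look inside the block) might be sketched like this. The data layout, function names, and sample records are assumptions for illustration; overflow handling is left out.

```python
import bisect


def build_primary_index(blocks):
    """One sparse entry per block: (key of the block anchor, block number)."""
    return [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


def lookup(index, blocks, key):
    anchors = [anchor for anchor, _ in index]
    pos = bisect.bisect_right(anchors, key) - 1   # last anchor <= key
    if pos < 0:
        return None                               # key is below the first anchor
    block = blocks[index[pos][1]]                 # the only block that can hold the key
    for k, record in block:
        if k == key:
            return record
    return None


# Hypothetical data file: three blocks of (key, record) pairs, ordered by key.
blocks = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e"), (60, "f")]]
index = build_primary_index(blocks)
print(lookup(index, blocks, 40), lookup(index, blocks, 35))
```

The index holds only three entries for six records, which is why a primary index can be much smaller than the data file.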

31 Indexing Example
Example: Given a data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
record size R = 100 bytes (fixed size)
block size B = 1024 bytes
file size r = 30000 records
Blocking factor Bfr = ⌊B / R⌋ = ⌊1024 / 100⌋ = 10 records/block
Number of file blocks b = ⌈r / Bfr⌉ = ⌈30000 / 10⌉ = 3000 blocks

32 Indexing Example
For an index on the SSN field, assume the field size V_SSN = 9 bytes and the record pointer size P_R = 6 bytes. Then:
index entry size R_I = (V_SSN + P_R) = (9 + 6) = 15 bytes
index blocking factor Bfr_I = ⌊B / R_I⌋ = ⌊1024 / 15⌋ = 68 entries/block
number of index blocks b_I = ⌈b / Bfr_I⌉ = ⌈3000 / 68⌉ = 45 blocks (one index entry per data block)
binary search needs ⌈log2(b_I)⌉ = ⌈log2(45)⌉ = 6 block accesses
[In practice, likely that these 45 blocks would end up in cache]
This is compared to an average linear search cost of: (b/2) = 3000/2 = 1500 block accesses
If the file records are ordered, the binary search cost would be: ⌈log2(b)⌉ = ⌈log2(3000)⌉ = 12 block accesses
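
The arithmetic of this worked example can be reproduced directly. The variable names below are chosen for the sketch; the input values are the ones stated on the slides, and a sparse primary index with one entry per data block is assumed.

```python
import math

B = 1024        # block size in bytes
R = 100         # record size in bytes
r = 30000       # number of records
V_SSN = 9       # indexed field size in bytes
P_R = 6         # record pointer size in bytes

bfr = B // R                                   # records per data block
b = math.ceil(r / bfr)                         # data blocks
R_I = V_SSN + P_R                              # index entry size
bfr_I = B // R_I                               # index entries per block
b_I = math.ceil(b / bfr_I)                     # index blocks (one entry per data block)

index_search = math.ceil(math.log2(b_I))       # binary search on the index
linear_search = b // 2                         # average linear scan of the data file
file_binary_search = math.ceil(math.log2(b))   # binary search on the ordered file

print(bfr, b, bfr_I, b_I, index_search, linear_search, file_binary_search)
```

The index search cost covers the index blocks only; fetching the data block that the index entry points to costs one further access.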

33 Clustering Index
Clustering index applies when data is ordered on a non-key field
The field on which data is ordered is called the clustering field
The data file is described as a clustered file
Clustering index is sorted list of <field value, pointer> pairs
Why make a distinction between clustering and primary index?
Field values can appear in many consecutive records
Only one entry in index for each distinct field value
No point having multiple entries
Index points to first data block containing the matching value
Same issues with insertion and deletion as for primary index

35 Cluster index where each distinct value is allocated a whole disk block
Linked list if more than one block is needed

36 Secondary indexes
Secondary indexes provide a secondary means of access to data
For when some primary access already exists (e.g. index on key)
A secondary index is on some other field(s)
Either other candidate key fields which are unique for every record
Or non-key field with duplicate values
Secondary index is an ordered file of <field value, pointer> pairs
Pointer can be to a file block, or record within a file
A dense index: must be one pointer per record
Many secondary indexes can be created for a file
Allowing access based on different fields
By contrast, there can be only one primary index

37 Secondary index with block pointers
Unique data values so structure is simple

38 Secondary index example
Same set up as previous example: r = 30000 records of size R = 100 bytes, block size B = 1024 bytes
File is stored in 3000 blocks as worked out before
Search for a record based on a field of V = 9 bytes
Linear search would read 1500 blocks on average
Secondary index on target attribute: (9 + 6) = 15 bytes/record
Blocking factor for index is ⌊1024 / 15⌋ = 68 entries per block
Need ⌈30000 / 68⌉ = 442 blocks to store the (dense) index
Binary search on index takes ⌈log2(442)⌉ = 9 block accesses
Slightly more than the primary index (why?)

39 Secondary index for non-key, non-ordering
Secondary index for a non-key non-ordering field
I.e. a field that has duplicate values in many records
Several possible approaches
Option 1: Include duplicate index entries for the same field value (dense)
Option 2: Have variable length entries in the index: a list of pointers to all blocks containing the target value
Option 3: Use an extra level of indirection: fixed length index entries point to list of pointers, arranged as list of disk blocks
Option 3 is most commonly used
All options are painful when data file is subject to insert/deletes

40 “Option 3” secondary index

41 Single Level Indexing Summary
Primary index: on the field that the data is sorted by
Allows faster access than searching the file directly
Secondary index: on any field(s) in the data
Can have multiple secondary indexes
Typically dense
All indexes require extra effort to maintain if the data is subject to frequent updates (insert/delete operations)

42 Multilevel Indexing
The indexes described so far miss a trick: they do binary search
But we can read a block of k index records at a time
Can do a k-way split instead of a 2-way split
Improves cost from log2(N) to log_k(N)
Another way to look at it: if index is large, build index on index…
Original index is first level index, then there is second level index
Can repeat, creating third level index, fourth level index…
Until top level of index fits into one disk block
For all realistic file sizes, a constant number of levels is needed
Apply this idea to any index type (primary, secondary, cluster)
Assume first level index has fixed length, distinct valued entries

43 Two-level index

44 Example
Convert previous example into a multilevel index
Blocking factor for indexes remains 68
442 blocks of first level index
Second level index: ⌈442 / 68⌉ = 7 blocks
Third level index fits in 1 block: stop here!
Hence, need three levels of index: three accesses to find (pointer to) target record
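
The level-by-level count in this example follows a simple loop: keep dividing by the index blocking factor until one block remains. A short sketch, with a made-up function name and the slide's figures (30000 dense index entries, 68 entries per block) as inputs:

```python
import math


def index_levels(num_entries, fan_out):
    """Blocks needed at each index level, first level first, until one block remains."""
    levels = []
    entries = num_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:
            return levels
        entries = blocks          # the next level has one entry per block of this level


levels = index_levels(30000, 68)
print(levels, len(levels))
```

The number of levels grows like log base 68 of the number of entries, so even a much larger file would add only one or two levels.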

45 Dynamic multilevel indexes
Can we modify our storage of indices to make handling inserts/deletes less painful?
Use tree-structure to directly access data
Keep some space in file blocks to reduce cost of updates
Use the language of trees to describe the structure:

46 Search trees
A search tree: a tree where each node contains at most p-1 search values and p pointers as <P_1, K_1, P_2, K_2, …, K_{q-1}, P_q>, q ≤ p
The values are in order: K_1 < K_2 < … < K_{q-1}
Each pointer P_i points to a subtree so that K_{i-1} < X ≤ K_i for all keys X in the subtree
Rules allow efficient search for any key value
Search within the only subtree it can be in at each level

47 Search tree example
Leaf-level entries have the full record
Insertion is easier: we can add a new block without having to rewrite the rest of the tree
If tree is unbalanced (some very deep paths), searches are long
Try to avoid by using rules to avoid tree getting unbalanced
Perform occasional rebalancing or “self-balancing” trees

48 B-trees and B+-trees
B-trees add the constraint that the tree should be balanced
The root to leaf path should be about the same length for all leaves
Avoid wasted space: each node should be between half full and full
B+-tree is a slight modification of B-tree that is now the standard
B-trees: allow pointers to data at all levels of the tree
B+-tree: pointers to data only at the leaf level
B+-tree slightly simpler (fewer cases to handle with updates)
The trees can be used for (primary, secondary) multi-level indexes
Updates to data can be reflected in tree easily
These trees are widely used in file systems and database systems
File systems: NTFS [Windows], NSS, XFS, JFS – for directory entries
DBMSs: IBM DB2, Informix, MS SQL Server, Oracle, SQLite

49 B+-tree
Internal nodes: <P_1, K_1, P_2, K_2, …, K_{q-1}, P_q>, where p/2 < q ≤ p
Leaf nodes: <<K_1, Pr_1>, <K_2, Pr_2>, …, <K_{q-1}, Pr_{q-1}>, P_next>, p/2 < q ≤ p
<K_i, Pr_i>: Pr_i points to record with value K_i
P_next points to the next leaf node in the tree (for linear access)

50 B+-tree: Search
Search on a B+-tree is fairly straightforward
Start at root block
While not at a leaf block
Determine between which values in the block the key falls
Follow the relevant pointer to the new block
Search current leaf block for desired value
If found, follow pointer to retrieve record
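
The search procedure above translates almost line for line into code. The following is a sketch over an in-memory tree: the `Node` class, its field names, and the tiny sample tree are assumptions. It follows the convention from the search-tree slide that the subtree between K_{i-1} and K_i holds keys X with K_{i-1} < X ≤ K_i.

```python
import bisect


class Node:
    """Hypothetical B+-tree node: internal nodes have children, leaves have records."""

    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys
        self.children = children     # internal node: len(keys) + 1 child pointers
        self.records = records       # leaf node: one record pointer per key
        self.next_leaf = next_leaf   # leaf node: P_next

    @property
    def is_leaf(self):
        return self.children is None


def search(root, key):
    node = root                                   # start at root block
    while not node.is_leaf:
        # First separator K_i with key <= K_i selects the child to follow.
        i = bisect.bisect_left(node.keys, key)
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)        # search within the leaf block
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]                    # follow pointer to the record
    return None


# Assumed sample tree: three leaves under a single root.
leaf3 = Node([25, 30], records=["r25", "r30"])
leaf2 = Node([15, 20], records=["r15", "r20"], next_leaf=leaf3)
leaf1 = Node([5, 10], records=["r5", "r10"], next_leaf=leaf2)
root = Node([10, 20], children=[leaf1, leaf2, leaf3])
print(search(root, 15), search(root, 12))
```

Each loop iteration corresponds to one block read, so the cost of a search is the height of the tree.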

51 B-tree: insertion
As with many tree algorithms, insertion is based on search
Start by searching for where the record should be
If room in the leaf block, insert a pointer to the new record
Else, split the leaf block into two, and insert the pointer
Now there are two leaf blocks: need to update parent
Similar process to update parent: may need to split parent
May propagate back to root
Note that we do not explicitly attempt to keep tree balanced
The condition p/2 < q ≤ p ensures that it can’t be too unbalanced
Algorithms fans: condition ensures height is O(log n) for n keys
Worst case time for {insert, delete, search} is O(log n)
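
The leaf-level step of insertion (insert if there is room, otherwise split and report a separator to the parent) could look like the sketch below. Only the leaf is handled: updating and possibly splitting the parent is left out, and the function name, the list-based leaf, and the choice of the largest left-hand key as the separator are assumptions made to match the K_{i-1} < X ≤ K_i convention.

```python
import bisect


def insert_into_leaf(keys, pointers, key, pointer, capacity):
    """Insert into a sorted leaf in place.

    Returns None if the leaf still fits, otherwise
    (separator, right_keys, right_pointers) describing the new right-hand leaf.
    """
    i = bisect.bisect_left(keys, key)
    keys.insert(i, key)
    pointers.insert(i, pointer)
    if len(keys) <= capacity:
        return None                        # room in the leaf block: done
    mid = (len(keys) + 1) // 2             # left leaf keeps the first half
    right_keys, right_pointers = keys[mid:], pointers[mid:]
    del keys[mid:]
    del pointers[mid:]
    # The parent needs a separator: the largest key that stays in the left leaf.
    return keys[-1], right_keys, right_pointers


keys, pointers = [10, 20, 30], ["p10", "p20", "p30"]
result = insert_into_leaf(keys, pointers, 25, "p25", capacity=3)
print(keys, result)
```

In a full implementation the returned separator and the new leaf would be inserted into the parent by the same insert-or-split logic, which is how a split can propagate back to the root.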

53 B+-tree: deletion
Essentially the inverse of insertion
Find the record to delete from the B+-tree
Remove the pointer and if block is still large enough, halt
Else, try to redistribute: move entries from sibling block
If can’t redistribute, merge the two siblings
Then delete one pointer from parent and recurse up tree

55 Summary
Disk properties and file storage
File organizations: ordered, unordered, and hashed
Storage topics: RAID and storage area networks
Indexes: primary and secondary
Multilevel indexes and B-trees
Chapter: “Disk Storage, Basic File Structures and Hashing” in Elmasri and Navathe
Chapter: “Indexing Structure for Files” in Elmasri and Navathe
