CS346: Advanced Databases Graham Cormode Storage, Files and Indexing.

1 CS346: Advanced Databases Graham Cormode Storage, Files and Indexing

2 Outline
Part 1:
- Disk properties and file storage
- File organizations: ordered, unordered, and hashed
- Storage topics: RAID and storage area networks
- Chapter: "Disk Storage, Basic File Structures and Hashing" in Elmasri and Navathe
Part 2: Indexes

3 Why?
- Important to understand how high-level abstractions (databases) map down to low-level concepts (disks, files)
  - Get a sense of the scale of the quantities involved (seek times, overhead of inefficient solutions)
  - Appreciate the difference that smart solutions can bring
  - Understand where the bottlenecks lie
- Gives a "bottom-up" perspective on data management
  - See the whole picture, starting from the low level
  - Demystify some aspects that can seem opaque (B-trees, hashing, file organization)
  - Applies to many areas of computer science (OS, algorithms, ...)

4 The Memory Hierarchy [figure-only slide, showing flash storage in the hierarchy]

5 Data on Disks
- Databases ultimately rely on non-volatile disk storage
  - Data typically does not fit in (volatile) memory
- Physical properties of disks affect the performance of the DBMS
  - Need to understand some basics of disks
- A few exceptions to disk-based databases:
  - Some real-time applications use "in-memory databases"
  - Some legacy/massive applications use tape storage as well
- Different tradeoffs with flash-based storage
  - Much faster to read, but limits on the number of times a block can be erased and rewritten
  - No major difference between random access and linear scan
  - "Flash databases" are a niche, but growing, area

6 Rotating Disk: 5000 - 10000 RPM
- Sector size: 0.5KB - 4KB, the basic unit of data transfer from disk
- Seek time: move the read head into position, currently ~4ms
  - Includes rotational delay: wait for the sector to come under the read head
  - Random access: (1/0.004s) * 4KB = 1MB/second: quite slow
- Track-to-track move, currently ~0.4ms: 10 times faster
  - Sustained read/write rate: 100MB/second (caching can improve this)
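A quick sanity check of the slide's arithmetic; a sketch in decimal units (4KB = 4000 bytes, 1MB = 1,000,000 bytes) so the numbers come out round:

```python
# Back-of-envelope disk throughput, using the slide's figures.
SEEK_TIME = 0.004            # seconds per random access (seek + rotation)
TRANSFER = 4000              # bytes moved per random access (one 4KB read)
SEQUENTIAL = 100_000_000     # sustained sequential read rate: 100MB/second

random_rate = TRANSFER / SEEK_TIME            # bytes/second via random access
print(random_rate, SEQUENTIAL / random_rate)  # 1000000.0 100.0
```

The 100x gap between the two access patterns is the fundamental contrast the next slide builds on.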

7 Disk properties: the fundamental contrast
- Random access is slow, sequential access is fast
  - By factors of up to 100s
  - Want to design data storage to avoid or minimize random access, and to make data access as fast as possible
- Buffering can help in multithreaded systems:
  - Work on other processes while waiting for data to arrive
  - Double buffering: maintain two buffers of data; work on the current buffer while the other fills from disk
  - Maximizes parallel utilization, but doesn't make any single thread faster

8 Records: the basic unit of the database
- Databases are fundamentally composed of records
  - Each record describes an object with a number of fields
- Fields have a type (integer, float, string, time, compound...)
  - Fixed or variable length
- Need to know when one field ends and the next begins
  - Field length codes
  - Field separators (special characters)
- This leads to variable-length records
  - How to search effectively through data with variable-length records?

9 [figure-only slide]

10 Records and Blocks
- Records stored on disk are organized into blocks
- Small records: pack an integer number of records into each block
  - Leaves some space left over in each block
  - Blocking factor: the (average) number of records per block
- Large records: may not be effective to leave the slack
  - Records may span multiple blocks (spanned organization)
  - May use a pointer at the end of a block to point to the next block
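The blocking factor calculation can be sketched directly; the numbers here (B = 1024, R = 100, r = 30000) are the ones used in the worked example later in the deck:

```python
import math

# Blocking factor for fixed-size records in an unspanned organization.
B, R, r = 1024, 100, 30000
bfr = B // R                 # whole records per block: 10
slack = B - bfr * R          # unused bytes left over in each block: 24
blocks = math.ceil(r / bfr)  # blocks needed for the whole file: 3000
print(bfr, slack, blocks)    # 10 24 3000
```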

11 Files
- A sequence of records is stored as a file
  - Either using OS file system support, or handled by the DBMS
- A database requires support for various file operations:
  - Open a file, returning a new file handle
  - Scan for the next record that satisfies a search condition
  - Read the next record from disk into memory
  - Delete the current record and (eventually) update the file on disk
  - Modify the current record and (eventually) update the file on disk
  - Insert a new record at the current location
  - Close the file, flushing any buffers and postponed operations
- Need a suitable file layout and indexes to allow fast scan operations

12 File organization: unordered
- Just dump the records on disk in no particular order
- Insert is very efficient: just add to the last block
- Scan is very inefficient: need to do a linear search
  - Read half the file on average
- Delete could be inefficient:
  - Read the whole file, write it back with the deleted record omitted
  - Instead, just "mark" the record as deleted
  - Periodically remove marked records

13 File organization: ordered
- Keep records ordered on some (key) attribute
- Can scan through records in that order very easily
- Can search for a value (or range of values) by binary search
  - Binary search: log2(b) seeks to find the desired record out of b blocks
  - Linear search: b/2 seeks on average to find a record
- Insertion is rather more expensive and complex to do well
  - Keep recent records in an "overflow buffer" for periodic merge
- If modifying the key field, treat it as a deletion followed by an insertion
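The two search costs above can be compared in a couple of lines, using b = 3000 blocks to match the later worked example:

```python
import math

# Cost of locating one record in a b-block file, measured in block reads.
b = 3000
linear_avg = b / 2                 # unordered file: scan half on average
binary = math.ceil(math.log2(b))   # ordered file: binary search over blocks
print(linear_avg, binary)          # 1500.0 12
```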

14 [figure-only slide]

15 File organization: hashed
- Use hashing to ensure records with the same key are grouped together
- Arrange file blocks into M equal-sized buckets
  - Often, 1 block = 1 bucket
- Apply a hash function to the key field to determine its bucket
- The usual hash table concerns emerge:
  - Need to deal with collisions, e.g. by open addressing or chaining
  - Deletions also get messy, depending on the collision method used
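A toy sketch of the idea, with in-memory lists standing in for blocks and chaining for collisions; the bucket count and keys are illustrative:

```python
# A hashed file in miniature: M buckets, each a list standing in for one
# block, with collisions handled by chaining inside the bucket.
M = 8
buckets = [[] for _ in range(M)]

def bucket_of(key):
    return key % M   # hash function applied to the key field

def insert(record_key):
    buckets[bucket_of(record_key)].append(record_key)

def find(record_key):
    # only one bucket needs to be searched, not the whole file
    return record_key in buckets[bucket_of(record_key)]

for ssn in (1234, 5678, 4321, 8765):
    insert(ssn)
print(find(5678), find(9999))  # True False
```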

16 External hashing
- Don't store records directly in buckets; store pointers to records
  - Pointers are small, so more fit in a block
  - "All problems in computer science can be solved by another level of indirection" (David Wheeler)

17 External hashing: issues
- Aim for 70-90% occupancy of the hash table
  - Not too much wasted space, not too many collisions
- The hash function should spread records evenly across buckets
  - With a very skewed distribution, we lose the benefits of hashing
- Still costly if access to records ordered by key is required
  - And it doesn't help with accessing records by any other field
- Main disadvantage: hard to adjust if the number of records grows
  - Need to resize the hash table
- What if too many records hash to the same bucket?
  - Can handle the extra records by "chaining" to overflow buckets

18 Hashing: Overflow buckets [figure]

19 Extendible hashing
- A hashing scheme that allows the hash table to grow and shrink
  - Avoids wasted space and avoids excessive collisions
- Makes use of a directory of bucket addresses
  - The directory size is a power of two, 2^d
  - So the directory can be doubled or halved in size as needed
  - The first d bits of the hash value are used to index into the directory
- Directory entries point to disk blocks storing records
  - Contiguous directory entries can point to the same disk block
  - Each disk block has a local depth d', with d' <= d
- Insertions into a block may cause it to overflow and split in two
  - The directory is then updated accordingly
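The scheme can be sketched in miniature. This toy version uses in-memory lists for buckets and the low-order d bits of the hash (rather than the first d bits) to index the directory; the bucket capacity and keys are illustrative:

```python
# Toy extendible hash table: a 2^d-entry directory of bucket references,
# each bucket with its own local depth d'.
BUCKET_CAP = 2  # records per bucket (1 bucket = 1 block on the slides)

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth d'
        self.keys = []

def h(key, bits):
    return hash(key) & ((1 << bits) - 1)   # low-order `bits` bits

class ExtendibleHash:
    def __init__(self):
        self.d = 1                  # global depth: directory has 2^d entries
        self.dir = [Bucket(1), Bucket(1)]

    def find(self, key):
        return key in self.dir[h(key, self.d)].keys

    def insert(self, key):
        b = self.dir[h(key, self.d)]
        if len(b.keys) < BUCKET_CAP:
            b.keys.append(key)
            return
        if b.depth == self.d:       # no spare bit left: double the directory
            self.dir = self.dir + self.dir
            self.d += 1
        b.depth += 1                # split the full bucket in two
        new = Bucket(b.depth)
        for i in range(len(self.dir)):
            # entries whose new distinguishing bit is 1 move to the new bucket
            if self.dir[i] is b and (i >> (b.depth - 1)) & 1:
                self.dir[i] = new
        pending, b.keys = b.keys + [key], []
        for k in pending:           # redistribute, retrying the new key too
            self.insert(k)

eh = ExtendibleHash()
for k in range(8):
    eh.insert(k)
print(eh.d, all(eh.find(k) for k in range(8)))  # 2 True
```

Note how only the overfull bucket is split; the rest of the directory simply gains extra entries pointing at existing buckets, which is what keeps growth cheap.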

20 Extendible hashing example [figure]
- Some values of d' are less than the global d

21 Extendible Hashing: Updating d
- If a bucket becomes full, may need to increase d to d + 1
  - Double the size of the directory
- Similarly, if all buckets have local d' < d, can decrease d to d - 1
  - Halve the size of the directory
- Other adaptive hashing variants exist
  - Dynamic hashing: binary tree directory

22 RAID disk technology
- RAID was originally a way to combine multiple cheap disks for reliability
  - "Redundant Array of Inexpensive Disks" (1980s)
- Now a general-purpose approach to providing reliability
  - "Redundant Array of Independent Disks"
  - A set of different levels of replication
- RAID 0: spread data over multiple disks (striping)
  - Increases throughput, but increases the risk of data loss

23 Important RAID levels
- RAID 1: duplication of data across multiple disks (mirroring)
  - Data copied to 2 (or more) disks
  - Disk reliability is measured in "mean time between failures" (MTBF)
  - Typical MTBF is 100K hours to 1M hours (~1 century)
  - The chance of both disks failing at the same time is small
  - So there is enough time to recover a copy after one failure
- RAID 5: block-level striping and parity coding spread over the disks
  - Parity coding allows recovery of 1 missing disk
  - Example: data bits 10110 with parity bit 1 (the XOR of the data bits)
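The parity idea is just XOR, which makes single-disk recovery easy to demonstrate; the stripe contents below are made up:

```python
from functools import reduce
from operator import xor

# The slide's example: data bits 1,0,1,1,0 -> parity bit 1 (their XOR).
data_bits = [1, 0, 1, 1, 0]
parity_bit = reduce(xor, data_bits)

# The same idea per block: one parity block protects a stripe of data blocks.
stripe = [0b10110100, 0b01101001, 0b11100010, 0b00011101]
parity = reduce(xor, stripe)

# Disk 2 fails: XOR the parity with the surviving blocks to rebuild it.
recovered = reduce(xor, stripe[:2] + stripe[3:], parity)
print(parity_bit, recovered == stripe[2])  # 1 True
```

XOR recovery works because each bit of the parity flips once per data block that has a 1 there, so removing all surviving blocks from the parity leaves exactly the missing block.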

24 RAID levels
- RAID 6: Reed-Solomon coding allows recovery from multiple disk losses
- Other RAID levels (2, 3, 4) are not in common usage

25 Storage Area Networks
- Storage Area Networks (SANs): virtual disks
  - Disks attached to a "headless" server
  - Easy to configure, low maintenance overhead
- Many advantages to SANs:
  - Flexible configuration: hot-swap new disks in and out
  - Can be physically remote from other network elements, connected by a fast (fibre-based) network
  - Separate storage for server configuration, OS updates, etc.

26 Outline Part 2
- Indexes: primary and secondary
- Multilevel indexes and B-trees
- Chapter: "Indexing Structures for Files" in Elmasri and Navathe

27 Indexing for Files
- Chapter: "Indexing Structures for Files" in Elmasri and Navathe
  - Move focus from how a file is stored on disk to how the file is accessed / indexed by the DBMS
- Index: an auxiliary file that makes it faster to find certain records
  - An index is usually on one field of the record (e.g. index by name)
  - Can have multiple indexes, each for a different field
- A basic form of index is a sorted list of <field value, pointer> pairs, ordered by field value
  - An "access path" for the indexed field

28 Indexes as access paths
- Indexes usually take up much less space than the original file
  - Each index entry is much smaller than the full record
  - Just need a field value and a pointer (a few bytes)
- Efficient to look up matching records
  - Binary search on the index, then follow the pointer
- An index may be dense or sparse
  - Dense index: contains an entry for every possible search value
  - Sparse index: contains entries only for some search values
- Can even have an index on the field that the file is sorted on! Why?
  - It can be faster to search via the index than to binary search the file directly

29 Primary Index
- A primary index applies when the file is ordered by a key field
- A sparse index: one entry for each block of the data file
  - The entry is for the first record in the block (the block anchor)
  - So there can be far fewer entries in the index than records in the data file
- Straightforward to search for a record:
  - Use the index to find the block that the record should be in
  - Retrieve that block and see if the record is there
- Insertion and deletion of records in the main file is a pain!
  - Almost all the pointers change!
- Some standard tricks mitigate the pain:
  - Buffer updates in an "overflow" file and check against this
  - Keep a linked list of overflow records for each block as needed
  - Mark records as deleted, and only purge periodically
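The search step is a binary search over block anchors; a sketch with hypothetical anchor values:

```python
from bisect import bisect_right

# Sparse primary index: one (block anchor key, block number) entry per block.
# The anchor is the key of the first record in that block.
index = [(1, 0), (40, 1), (82, 2), (130, 3)]   # hypothetical anchors
anchors = [key for key, _ in index]

def block_for(key):
    # the record, if present, lives in the block with the largest anchor <= key
    i = bisect_right(anchors, key) - 1
    return index[i][1] if i >= 0 else None

print(block_for(82), block_for(100), block_for(0))  # 2 2 None
```

After `block_for` identifies the block, one more disk read fetches it and a scan inside the block finds (or fails to find) the record.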

30 [figure-only slide]

31 Indexing Example
- Example: given a data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
- Suppose that:
  - record size R = 100 bytes (fixed size)
  - block size B = 1024 bytes
  - file size r = 30000 records
- Blocking factor Bfr = floor(B / R) = floor(1024 / 100) = 10 records/block
- Number of file blocks b = (r / Bfr) = (30000 / 10) = 3000 blocks

32 Indexing Example
- For an index on the SSN field, assume the field size V_SSN = 9 bytes and the record pointer size P_R = 6 bytes. Then:
  - index entry size R_I = (V_SSN + P_R) = (9 + 6) = 15 bytes
  - index blocking factor Bfr_I = floor(B / R_I) = floor(1024 / 15) = 68 entries/block
  - number of index blocks b_I = ceil(b / Bfr_I) = ceil(3000 / 68) = 45 blocks
  - binary search needs ceil(log2(b_I)) = ceil(log2(45)) = 6 block accesses
  - [In practice, it is likely that these 45 blocks would end up in cache]
- This compares to an average linear search cost of:
  - (b/2) = 3000/2 = 1500 block accesses
- If the file records are ordered, the binary search cost would be:
  - ceil(log2(b)) = ceil(log2(3000)) = 12 block accesses
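The slide's arithmetic can be reproduced mechanically:

```python
import math

# Primary-index cost model from the worked example.
B, R, r = 1024, 100, 30000
V_ssn, P_r = 9, 6

bfr = B // R                             # 10 records/block
b = r // bfr                             # 3000 file blocks
R_i = V_ssn + P_r                        # 15 bytes per index entry
bfr_i = B // R_i                         # 68 entries per index block
b_i = math.ceil(b / bfr_i)               # 45 index blocks (1 entry per anchor)
index_cost = math.ceil(math.log2(b_i))   # 6 accesses: binary search on index
file_cost = math.ceil(math.log2(b))      # 12 accesses: binary search on file
print(b_i, index_cost, b / 2, file_cost) # 45 6 1500.0 12
```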

33 Clustering Index
- A clustering index applies when data is ordered on a non-key field
  - The field on which the data is ordered is called the clustering field
  - The data file is described as a clustered file
  - The clustering index is a sorted list of <field value, block pointer> pairs
- Why make a distinction between clustering and primary index?
  - Field values can appear in many consecutive records
  - Only one entry in the index for each distinct field value: no point having multiple entries
  - The index points to the first data block containing the matching value
- Same issues with insertion and deletion as for the primary index

34 [figure-only slide]

35 [Figure]
- Cluster index where each distinct value is allocated a whole disk block
- Linked list if more than one block is needed

36 Secondary indexes
- Secondary indexes provide a secondary means of access to the data
  - For when some primary access already exists (e.g. an index on the key)
- A secondary index is on some other field(s):
  - Either another candidate key field, unique for every record
  - Or a non-key field with duplicate values
- A secondary index is an ordered file of <field value, pointer> pairs
  - The pointer can be to a file block, or to a record within a file
  - A dense index: there must be one pointer per record
- Many secondary indexes can be created for a file
  - Allowing access based on different fields
  - By contrast, there can be only one primary index

37 [Figure]
- Secondary index with block pointers
- Unique data values, so the structure is simple

38 Secondary index example
- Same setup as the previous example: r = 30000 records of size R = 100 bytes, block size B = 1024 bytes
- The file is stored in 3000 blocks, as worked out before
- Search for a record based on a field of size V = 9 bytes
  - Linear search would read 1500 blocks on average
- Secondary index on the target attribute: (9 + 6) = 15 bytes/entry
  - Blocking factor for the index is floor(1024/15) = 68 entries per block
  - Need ceil(30000/68) = 442 blocks to store the (dense) index
  - Binary search on the index takes ceil(log2(442)) = 9 block accesses
  - Slightly more than the primary index (why? a dense index has one entry per record, not one per block)

39 Secondary index for non-key, non-ordering field
- I.e. a field that has duplicate values in many records
- Several possible approaches:
  1. Include duplicate index entries for the same field value (dense)
  2. Have variable-length entries in the index: a list of pointers to all blocks containing the target value
  3. Use an extra level of indirection: fixed-length index entries point to a list of pointers, arranged as a list of disk blocks
- Option 3 is most commonly used
  - All options are painful when the data file is subject to inserts/deletes

40 "Option 3" secondary index [figure]

41 Single-Level Indexing Summary
- Primary index: on the field that the data is sorted by
  - Allows faster access than searching the file directly
- Secondary index: on any field(s) in the data
  - Can have multiple secondary indexes
  - Typically dense
- All indexes require extra effort to maintain if the data is subject to frequent updates (insert/delete operations)

42 Multilevel Indexing
- The indexes described so far miss a trick: they do binary search
  - But we can read a block of k index records at a time
  - Can do a k-way split instead of a 2-way split
  - Improves the cost from log2(N) to log_k(N)
- Another way to look at it: if the index is large, build an index on the index...
  - The original index is the first-level index; then there is a second-level index
  - Can repeat, creating third-level, fourth-level indexes...
  - ...until the top level of the index fits into one disk block
  - For all realistic file sizes, a constant number of levels is needed
- This idea applies to any index type (primary, secondary, clustering)
  - Assume the first-level index has fixed-length, distinct-valued entries

43 Two-level index [figure]

44 Example
- Convert the previous example into a multilevel index
  - Blocking factor for indexes remains 68
  - 442 blocks of first-level index
  - Second-level index: ceil(442/68) = 7 blocks
  - Third-level index fits in 1 block: stop here!
- Hence, three levels of index are needed: three block accesses to find (a pointer to) the target record
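The level-counting loop, using the fan-out of 68 and the dense 30000-entry first level from the earlier examples:

```python
import math

# Build index on index until the top level fits in one block.
entries, fan_out = 30000, 68
blocks = math.ceil(entries / fan_out)     # 442 first-level index blocks
levels = 1
while blocks > 1:
    blocks = math.ceil(blocks / fan_out)  # 442 -> 7 -> 1
    levels += 1
print(levels)  # 3
```

Because the block count shrinks by a factor of 68 per level, even files billions of records long need only a handful of levels.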

45 Dynamic multilevel indexes
- Can we modify our storage of indexes to make handling inserts/deletes less painful?
- Use a tree structure to directly access the data
  - Keep some spare space in file blocks to reduce the cost of updates
- Use the language of trees to describe the structure

46 Search trees
- A search tree: a tree where each node contains at most p-1 search values and p pointers, as <P_1, K_1, P_2, K_2, ..., K_{q-1}, P_q>, with q <= p
  - The values are in order: K_1 < K_2 < ... < K_{q-1}
  - Each pointer P_i points to a subtree such that K_{i-1} < X <= K_i for all keys X in that subtree
- These rules allow efficient search for any key value
  - At each level, search within the only subtree the key can be in

47 Search tree example
- Leaf-level entries hold the full record
- Insertion is easier: we can add a new block without having to rewrite the rest of the tree
- If the tree is unbalanced (some very deep paths), searches are long
  - Try to avoid this by using rules that prevent the tree from becoming unbalanced
  - Perform occasional rebalancing, or use "self-balancing" trees

48 B-trees and B+-trees
- B-trees add the constraint that the tree should be balanced
  - The root-to-leaf path should be about the same length for all leaves
  - Avoid wasted space: each node should be between half full and full
- The B+-tree is a slight modification of the B-tree that is now the standard
  - B-trees: allow pointers to data at all levels of the tree
  - B+-trees: pointers to data only at the leaf level
  - The B+-tree is slightly simpler (fewer cases to handle with updates)
- These trees can be used for (primary, secondary) multilevel indexes
  - Updates to the data can be reflected in the tree easily
- These trees are widely used in file systems and database systems
  - File systems: NTFS [Windows], NSS, XFS, JFS (for directory entries)
  - DBMSs: IBM DB2, Informix, MS SQL Server, Oracle, SQLite

49 B+-tree
- Internal nodes: <P_1, K_1, P_2, K_2, ..., K_{q-1}, P_q>, where p/2 < q <= p
- Leaf nodes: <<K_1, Pr_1>, <K_2, Pr_2>, ..., <K_{q-1}, Pr_{q-1}>, P_next>, with p/2 < q <= p
  - <K_i, Pr_i>: Pr_i points to the record with value K_i
  - P_next points to the next leaf node in the tree (for linear access)

50 B+-tree: Search
- Search on a B+-tree is fairly straightforward:
  - Start at the root block
  - While not at a leaf block:
    - Determine between which values in the block the key falls
    - Follow the relevant pointer to the next block
  - Search the current leaf block for the desired value
  - If found, follow the pointer to retrieve the record
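A minimal sketch of this search, following the slide's subtree rule (K_{i-1} < X <= K_i, so `bisect_left` picks the correct child); the node layout, keys, and record names are illustrative, and all balancing/update logic is omitted:

```python
from bisect import bisect_left

# Minimal B+-tree node: internal nodes carry keys + child pointers
# (len(children) == len(keys) + 1); leaves carry keys + record pointers.
class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children   # internal nodes only
        self.records = records     # leaf nodes only

def search(root, key):
    node = root
    while node.children is not None:              # descend to the leaf level
        node = node.children[bisect_left(node.keys, key)]
    if key in node.keys:                          # scan the leaf block
        return node.records[node.keys.index(key)]
    return None                                   # key not in the file

leaf1 = Node([5, 12, 17], records=["r5", "r12", "r17"])
leaf2 = Node([20, 30], records=["r20", "r30"])
root = Node([17], children=[leaf1, leaf2])
print(search(root, 17), search(root, 99))  # r17 None
```

Each loop iteration corresponds to one block read, so the total cost is the height of the tree plus one record fetch.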

51 B-tree: insertion
- As with many tree algorithms, insertion is based on search
  - Start by searching for where the record should be
  - If there is room in the leaf block, insert a pointer to the new record
  - Else, split the leaf block into two, and insert the pointer
- Now there are two leaf blocks: need to update the parent
  - Similar process to update the parent: may need to split the parent too
  - Splits may propagate all the way back to the root
- Note that we do not explicitly attempt to keep the tree balanced
  - The condition p/2 < q <= p ensures that it can't be too unbalanced
- For algorithms fans: the condition ensures the height is O(log n) for n keys
  - Worst-case time for {insert, delete, search} is O(log n)

52 [figure-only slide]

53 B+-tree: deletion
- Essentially the inverse of insertion:
  - Find the record to delete from the B+-tree
  - Remove the pointer; if the block is still large enough, halt
  - Else, try to redistribute: move entries from a sibling block
  - If redistribution is not possible, merge the two siblings
  - Then delete one pointer from the parent and recurse up the tree

54 [figure-only slide]

55 Summary
- Disk properties and file storage
- File organizations: ordered, unordered, and hashed
- Storage topics: RAID and storage area networks
- Indexes: primary and secondary
- Multilevel indexes and B-trees
- Chapters: "Disk Storage, Basic File Structures and Hashing" and "Indexing Structures for Files" in Elmasri and Navathe
