Indexing 4/11/2019
Index Concept Main idea: A separate data structure used to locate records Many, many, many, many flavors of index organization have been proposed and tried including structures which combine hashing and indexing Various buffering schemes are somewhat orthogonal We’ll focus on General concepts [chapter 4.3-4.4] ISAM (indexed sequential) [ch. 5.1] B-trees and B+ trees [ch.5.2-5.8] 4/11/2019
Index Terminology Most generally, index is a list of value/address pairs Each pair is an index “entry” Value is the index “key” Address will point to a data record, or to a data page There might be many records on a page The assumption is that the value/address pair will be much smaller in size than the full record If index is small, a copy can be maintained in memory! Permanent disk copy is still needed 4/11/2019
Key Terminology Index key field Primary index Secondary index Not necessarily the same as the primary DB key of the table! But called a “key” anyway Primary index Key is the primary (DB) key Only one index per file Secondary index Key is not the primary DB key Could be many indices per file (or none) 4/11/2019
More Indexing Terminology Dense index One index entry for each record (or page) Non-dense or sparse index Less than one index entry for each record Inverted file: File which has a dense secondary index Clustering index Preserves locality: close index entries refer to close data records Multilevel indexing each level is an index to the next level down 4/11/2019
Indexing Pitfalls Index itself is a file Occupies disk space Must worry about maintenance, consistency, recovery, etc. Large indices won't fit in memory May require multiple seeks to locate record entry 4/11/2019
Desiderata for Multilevel Indexes Should support efficient random access Should also support efficient sequential access, if possible Should have low height Should be efficiently updatable Should be storage-efficient Top level(s) should fit in memory 4/11/2019
ISAM = Indexed Sequential Access Method IBM terminology “Indexed Sequential” more general term (non-IBM) ISAM as described in textbook (5.1) is very close to B+ tree simpler versions exist Main idea: maintain sequential file but give it an index Sequentiality for efficient “batch” processing Index for random record access 4/11/2019
ISAM Technique Build a dense index of the pages (1st level index) Sparse from a record viewpoint Then build an index of the 1st level index (2nd level index) Continue recursively until top level index fits on 1 page Some implementations may stop after a fixed # of levels 4/11/2019
Updating an ISAM File Data set must be kept sequential So that it can be processed without the index May have to rewrite entire file to add records Could use overflow pages chained together or in fixed locations (overflow area) Index is usually NOT updated as records are added! Once in a while the whole thing is “reorganized” Data pages recopied to eliminate overflows Index recreated 4/11/2019
ISAM Pros, Cons Pro Cons Relatively simple Great for true sequential access Cons Not very dynamic Inefficient if lots of overflow pages Can only be one ISAM index per file 4/11/2019
B-Tree B-Tree is a type of multilevel index from another standpoint: it's a type of balanced tree Invented in 1972 by Boeing engineers R. Bayer and E. McCreight By 1979: "the standard organization for indexes in a database system" (Comer) 4/11/2019
B-Tree Overview Assume for now that keys are fixed-length and unique A B-tree can be thought of as a generalized binary search tree multiple branches rather than just L or R Trees are always perfectly balanced Some wasted space in the nodes is tolerated 4/11/2019
B-Tree Concepts Each node contains tree (index node) pointers, and key values (with record or page pointers) Given a key K and the two node pointers L and R around it All key values pointed to by L are < K All key values pointed to by R are > K “Order p” means (up to) p tree pointers, (up to) p-1 keys Terminology differs between authors 4/11/2019
B+ Tree vs. B-tree Textbook only discusses B+ trees So do we from now on Two big differences: Original B-trees had record pointers in all of the index nodes; B+ trees only in leaf nodes Given a key K and the two node pointers L and R around it All key values pointed to by L are < K All key values pointed to by R are >= K B+ tree data pages are linked together to form a sequential file Gives the advantages of ISAM In our book, it’s a doubly-linked list 4/11/2019
Alternate Views of the Leaf Nodes [cf. Chapter 4.3.1] Leaf nodes might be actual data pages Leaf nodes might contain pointers to the actual data records or pages For B+ trees, this implies the leaf node format is different from the non-leaf node format may hold different number of entries The leaf nodes can be chained together, regardless of whether the actual data pages are! 4/11/2019
B+Tree Growth and Change The big idea: When a node is full, it splits. middle value is propagated upward If we’re lucky, there’s room for it in the level above two new nodes are at same level as original node Height of tree increases only when the root splits A very nice property This is what keeps the tree perfectly balanced Recommended: split only “on the way down” On deletion: two adjacent nodes recombine if both are < half full 4/11/2019
Variations Could redistribute records between adjacent blocks esp. on deletion (B* tree) Variable order: accommodate varying key lengths Could store the whole record in the index block especially if records are few and small in a B+ tree, this would make sequential access especially efficient 4/11/2019
B+ Trees with Other Indices Suppose you have a B+ tree for the file Leaf nodes of the index are the actual pages of the file, doubly linked together for sequential access Suppose you have some secondary indices What happens when a B+ tree node splits or merges??? 4/11/2019
Other Forms of Indexing Bitmap indexes One index per value (property) of interest One bit per record TRUE if record has a particular property Indexed hash: hash function takes you to an entry in an index allows physical record locations to change Clever indexing schemes are useful in optimizing complex queries 4/11/2019