
1 Chapter 11 Indexing & Hashing

2
• Sophisticated database access methods
• Basic concerns: access/insertion/deletion time, space overhead
• Indexing
  - An index is specified on one or more fields of the record, called the search-key field(s); the search key is not necessarily unique
  - Different index structures are associated with different search keys
  - Allows fast random access to records
  - An index record, which forms an access path to the data record, is of the form <search-key value, pointer>

3 Indexing
• Dense Index: there is an index record for every unique search-key value
• Sparse Index: index records are created for only some search-key values
  - A sparse index is slower, but requires less space and overhead
• Primary Index:
  - Defined on an ordered data file, ordered on a search-key field that is usually the primary key
  - A sequentially ordered file with a primary index is called an index-sequential file
  - A binary search on the index yields a pointer to the record
  - The index value is the search-key value of the first data record in the block
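The difference between the two lookups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `blocks` list stands in for an ordered data file, the keys and payloads are made up, and the function names `dense_lookup` and `sparse_lookup` are hypothetical. Both indexes are searched with a binary search, as the slide describes.

```python
from bisect import bisect_left, bisect_right

# Hypothetical ordered data file: each block holds (search_key, payload) records.
blocks = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Dense index: one entry per search-key value, pointing at (block, slot).
dense_keys = [key for blk in blocks for key, _ in blk]
dense_ptrs = [(b, s) for b, blk in enumerate(blocks) for s in range(len(blk))]

# Sparse index: one entry per block, holding the key of its first record.
sparse_keys = [blk[0][0] for blk in blocks]

def dense_lookup(key):
    """Binary-search the dense index; the entry points straight at the record."""
    i = bisect_left(dense_keys, key)
    if i == len(dense_keys) or dense_keys[i] != key:
        return None
    b, s = dense_ptrs[i]
    return blocks[b][s][1]

def sparse_lookup(key):
    """Find the last index entry <= key, then scan that block sequentially."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, payload in blocks[b]:
        if k == key:
            return payload
    return None
```

The sparse index holds one entry per block instead of one per key, which is where the space saving comes from; the price is the extra scan inside the block.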

4 Figure. Dense index. Figure. Sparse index.

5 Figure. Primary index on the ordering key field of a file

6 Primary Index
• Index Deletion
  - (Dense Index) Delete the search-key value of the deleted record from the index file if the deleted record was the last record with that search-key value
  - (Sparse Index) If the deleted record was the last record with search-key value v, and v exists in the index file F, replace v by w, where w is the next search-key value in order; if both v and w are in F, simply delete v
• Index Insertion
  - (Dense Index) If the search-key value v of the new record does not exist in the index file, insert v
  - (Sparse Index) If a new data block B is created, the first search-key value of B is inserted into the index file

7 Multi-Level Indices
• First-level index: the original index file
• Second-level index: a primary index on the original index file
• (Rare) Third-level index: top-level index (fits in one disk block)
• The levels form a search tree, such as a B-tree or B+-tree structure
• Insertion and deletion of index entries are not trivial in indexed files

8 Figure. A two-level primary index

9 Secondary Indices
• Defined on an unordered data file, i.e., a file not ordered by the indexed field (can be defined on a candidate key or a non-key field)
• Each pointer often points to a bucket that consists of pointers to the records with the same search-key value. The bucket structure can be eliminated if the index is dense and the search-key values form a primary key, i.e., are unique
• Advantages:
  i. Improve the performance of queries that use candidate keys
  ii. Eliminate extra pointers within the records
  iii. Eliminate the need for scanning records sequentially
• Disadvantages: overhead on modification
• Types of Secondary Indices:
  - Dense: the pointers in a bucket point to records with the same search-key value
  - Sparse: a pointer in a bucket points to records with search-key values in the appropriate range

10 Figure. A secondary index on a key field of a file.

11 Figure. A secondary index on a non-key field implemented using a level of indirection

12 B+-Tree (Multi-level) Indices
• Frequently used index structure in DB
• Allows efficient insertion/deletion of new/existing search-key values
• A balanced tree structure: all leaf nodes are at the same level (the leaf level may form a dense index)
• Each node, corresponding to a disk block, has the format
    $P_1\ K_1\ P_2\ \dots\ P_{n-1}\ K_{n-1}\ P_n$
  where $P_i$, $1 \le i \le n$, is a pointer and $K_i$, $1 \le i \le n-1$, is a search-key value, with $K_i < K_j$ for $i < j$, i.e., the search-key values are in order
• Figure. Node $P_1\ K_1\ \dots\ K_{i-1}\ P_i\ K_i\ \dots\ K_{n-1}\ P_n$: the subtree under $P_1$ holds search-key values $X < K_1$, the subtree under $P_i$ holds $K_{i-1} \le X < K_i$, and the subtree under $P_n$ holds $K_{n-1} \le X$
• In each leaf node, $P_i$ points to either (i) a data record with search-key value $K_i$ or (ii) a bucket of pointers, each of which points to a data record with search-key value $K_i$

13 B+-Tree (Multi-level) Indices
• Each leaf node is kept between half full and completely full, i.e., it holds between $\lceil (n-1)/2 \rceil$ and $n-1$ search-key values
• Non-leaf nodes form a sparse index
• Each non-leaf node (except the root) must have between $\lceil n/2 \rceil$ and $n$ pointers
• The number of block accesses required to reach a search-key value at the leaf-node level is at most $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$, where K = number of unique search-key values and n = number of pointers per node
• Insertion into a full node causes a split into two nodes, which may propagate to higher tree levels
  Note: if there are n search-key values to be split, put the first $\lceil (n-1)/2 \rceil$ in the existing node and the remaining ones in a new node
• A node left less than half full by a deletion must be merged with neighboring nodes
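The occupancy bounds and the block-access estimate above are easy to check numerically. The following is a small sketch, with hypothetical helper names, that evaluates them for a given n (pointers per node) and K (number of unique search-key values). It counts levels with integer arithmetic instead of a floating-point logarithm, to avoid rounding surprises.

```python
import math

def leaf_key_bounds(n):
    """Minimum and maximum number of search-key values in a leaf node."""
    return math.ceil((n - 1) / 2), n - 1

def nonleaf_pointer_bounds(n):
    """Minimum and maximum number of pointers in a non-root, non-leaf node."""
    return math.ceil(n / 2), n

def max_block_accesses(n, num_keys):
    """Smallest L with ceil(n/2)**L >= num_keys, i.e. ceil(log_{ceil(n/2)}(K))."""
    fanout = math.ceil(n / 2)
    if fanout < 2:
        raise ValueError("n must be at least 3 for the tree to branch")
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        levels += 1
    return max(levels, 1)  # even a single-node tree costs one block read

# With 100 pointers per node and one million distinct keys,
# a lookup touches at most 4 index blocks.
print(max_block_accesses(100, 1_000_000))
```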

14 B+-Tree Algorithms
• Algorithm 1. Searching for a record with search-key value K, using a B+-Tree.

Begin
  n ← block containing root node of B+-Tree;
  read block n;
  while (n is not a leaf node of the B+-Tree) do
  begin
    q ← number of tree pointers in node n;
    if K < n.K_1              /* n.K_i refers to the i-th search-key value in node n */
      then n ← n.P_1          /* n.P_i refers to the i-th pointer in node n */
    else if K ≥ n.K_{q-1}
      then n ← n.P_q
    else begin
      search node n for an entry i such that n.K_{i-1} ≤ K < n.K_i;
      n ← n.P_i;
    end; /* ELSE */
    read block n;
  end; /* WHILE */
  search block n for entry K_i with K = K_i; /* search leaf node */
  if found
    then read data file block with address P_i and retrieve record
    else record with search-key value K is not in the data file;
end. /* Algorithm 1 */
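Algorithm 1 can be sketched in Python as follows. This is a minimal in-memory illustration, not the slide author's code: the `Node` class is hypothetical, nodes are Python objects instead of disk blocks, and the three-way branch on K collapses into a single `bisect_right` call, which returns the number of keys less than or equal to K and therefore the position of the child whose range contains K.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Node:
    keys: list            # K_1 .. K_{q-1}, in ascending order
    pointers: list        # children (internal node) or record pointers (leaf)
    is_leaf: bool = False

def search(root, key):
    """Descend from the root to a leaf, then look for the key in that leaf."""
    node = root
    while not node.is_leaf:
        # Index 0 when key < K_1, last index when key >= K_{q-1},
        # otherwise the child i with K_{i-1} <= key < K_i.
        node = node.pointers[bisect_right(node.keys, key)]
    for k, ptr in zip(node.keys, node.pointers):
        if k == key:
            return ptr      # pointer to the data record
    return None             # key is not in the data file

# A tiny two-level tree with made-up keys and record names.
leaf1 = Node([5, 10], ["r5", "r10"], is_leaf=True)
leaf2 = Node([20, 30], ["r20", "r30"], is_leaf=True)
leaf3 = Node([40, 50], ["r40", "r50"], is_leaf=True)
root = Node([20, 40], [leaf1, leaf2, leaf3])
```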

15 B+-Tree Algorithms
• Algorithm 2. Inserting a record with search-key value K in a B+-Tree of order p.
  /* A B+-Tree of order p contains at most p-1 values and p pointers */

Begin
  n ← block containing root node of B+-Tree;
  read block n;
  set stack S to empty;
  while (n is not a leaf node of the B+-Tree) do
  begin
    push address of n on stack S; /* S holds parent nodes that are needed in case of split */
    q ← number of tree pointers in node n;
    if K < n.K_1              /* n.K_i refers to the i-th search-key value in node n */
      then n ← n.P_1          /* n.P_i refers to the i-th pointer in node n */
    else if K ≥ n.K_{q-1}
      then n ← n.P_q
    else begin
      search node n for an entry i such that n.K_{i-1} ≤ K < n.K_i;
      n ← n.P_i;
    end; /* ELSE */
    read block n;
  end; /* WHILE */
  search block n for entry K_i with K = K_i; /* search leaf node */

16 Algorithm 2 (continued)

  if found
    then return /* record already in index file - no insertion is needed */
  else begin /* insert entry in B+-Tree to point to record */
    create entry (P, K), where P points to file block containing new record;
    if leaf node n is not full
      then insert entry (P, K) in correct position in leaf node n
    else begin /* leaf node n is full - split */
      copy n to temp; /* temp is an oversize leaf node to hold extra entry */
      insert entry (P, K) in temp in correct position;
        /* a full leaf holds p-1 entries, so temp now holds p entries of the form (P_i, K_i) */
      new ← a new empty leaf node for the tree;
      * j ← ⌈p/2⌉;
      n ← first j entries in temp (up to entry (P_j, K_j));
      n.P_next ← new; /* P_next points to the next leaf node */
      new ← remaining entries in temp;
      * K ← K_{j+1};
      /* Now we must move (K, new) and insert in parent internal node.
         However, if parent is full, split may propagate */
      finished ← false;

17 Algorithm 2 (continued)

      repeat
        if stack S is empty then /* no parent node */
        begin /* new root node is created for the B+-Tree */
          root ← a new empty internal node for the tree;
          * root ← <n, K, new>; /* set P_1 to n & P_2 to new */
          finished ← true;
        end
        else begin
          n ← pop stack S;
          if internal node n is not full then
          begin /* parent node not full - no split */
            insert (K, new) in correct position in internal node n;
            finished ← true
          end
          else

18 Algorithm 2 (continued)

          begin /* internal node n is full with p tree pointers - split */
            copy n to temp; /* temp is an oversize internal node */
            insert (K, new) in temp in correct position; /* temp has p+1 tree pointers */
            new ← a new empty internal node for the tree;
            * j ← ⌈(p+1)/2⌉;
            n ← entries up to tree pointer P_j in temp;
              /* n contains <P_1, K_1, ..., K_{j-1}, P_j> */
            new ← entries from tree pointer P_{j+1} in temp;
              /* new contains <P_{j+1}, K_{j+1}, ..., K_p, P_{p+1}> */
            * K ← K_j; /* now we must move (K, new) and insert in parent internal node */
          end
        end
      until finished
    end; /* leaf node n is full - split */
  end; /* ELSE */
end. /* Algorithm 2 */
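A compact Python sketch of Algorithm 2 is given below. It is an illustration under several assumptions rather than a reference implementation: nodes live in memory instead of disk blocks, the class and attribute names are hypothetical, the node itself plays the role of the oversize temp node, and the split points are taken as j = ⌈p/2⌉ for a leaf and j = ⌈(p+1)/2⌉ for an internal node. In a leaf split the first key of the new leaf is copied up; in an internal split the middle key is moved up.

```python
import math
from bisect import bisect_left, bisect_right

class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []
        self.ptrs = []     # leaf: record pointers; internal: child nodes
        self.next = None   # P_next: link to the next leaf

class BPlusTree:
    def __init__(self, p):
        self.p = p                         # order: at most p pointers, p-1 keys
        self.root = Node(is_leaf=True)

    def _find_leaf(self, key, stack=None):
        node = self.root
        while not node.is_leaf:
            if stack is not None:
                stack.append(node)         # remember parents in case of a split
            node = node.ptrs[bisect_right(node.keys, key)]
        return node

    def search(self, key):
        leaf = self._find_leaf(key)
        i = bisect_left(leaf.keys, key)
        return leaf.ptrs[i] if i < len(leaf.keys) and leaf.keys[i] == key else None

    def insert(self, key, rec):
        stack = []
        node = self._find_leaf(key, stack)
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return                         # already in the index
        node.keys.insert(i, key)
        node.ptrs.insert(i, rec)
        if len(node.keys) <= self.p - 1:
            return                         # leaf was not full
        # Leaf overflow: keep the first j entries, move the rest to a new leaf.
        j = math.ceil(self.p / 2)
        new = Node(is_leaf=True)
        new.keys, new.ptrs = node.keys[j:], node.ptrs[j:]
        node.keys, node.ptrs = node.keys[:j], node.ptrs[:j]
        new.next, node.next = node.next, new
        up_key, left = new.keys[0], node   # K_{j+1} is copied up
        while True:
            if not stack:                  # no parent: create a new root
                root = Node(is_leaf=False)
                root.keys, root.ptrs = [up_key], [left, new]
                self.root = root
                return
            parent = stack.pop()
            i = bisect_right(parent.keys, up_key)
            parent.keys.insert(i, up_key)
            parent.ptrs.insert(i + 1, new)
            if len(parent.ptrs) <= self.p:
                return                     # parent was not full
            # Internal overflow (p+1 pointers): K_j moves up to the next level.
            j = math.ceil((self.p + 1) / 2)
            new = Node(is_leaf=False)
            up_key = parent.keys[j - 1]
            new.keys, new.ptrs = parent.keys[j:], parent.ptrs[j:]
            parent.keys, parent.ptrs = parent.keys[:j - 1], parent.ptrs[:j]
            left = parent

    def leaf_keys(self):
        """Walk the leaf chain left to right and collect all keys."""
        node = self.root
        while not node.is_leaf:
            node = node.ptrs[0]
        out = []
        while node:
            out.extend(node.keys)
            node = node.next
        return out

tree = BPlusTree(p=3)
for k in [5, 8, 1, 7, 3, 12, 9, 6]:
    tree.insert(k, f"r{k}")
```

After these eight insertions with p = 3, the root has been split once, and walking the leaf chain yields the keys in sorted order.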

19 Hashing
• Uses dense index
• Avoids accessing an index structure to locate data
• Allocates search-key values to different buckets
• (Static Hash Function) Given a search-key value v, a hash function h computes (assigns) the address of the desired bucket (which contains a pointer to the record) for v
    $h: K \to B$, where K is the set of search-key values and B is the set of (fixed) bucket addresses
• A lookup hashes the search-key value to a bucket b and then performs a (linear) search of every record in b
• An ideal hash function gives a
  - Uniform distribution of search-key values, i.e., the same number of search-key values in each bucket
  - Random distribution of search-key values, i.e., each search-key value is equally likely to be assigned to any bucket
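A static hash file can be sketched as a fixed list of buckets. Everything here is illustrative: the bucket count, the toy hash function (sum of byte values modulo the bucket count) and the record values are assumptions chosen for the example, not the hash function used in the slides' figures.

```python
NUM_BUCKETS = 8   # fixed set B of bucket addresses (assumed size)

def h(key: str) -> int:
    """Toy hash function: maps a search-key value to a bucket address."""
    return sum(key.encode("utf-8")) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)].append((key, record))

def lookup(key):
    """Hash to one bucket, then scan that bucket linearly."""
    return [rec for k, rec in buckets[h(key)] if k == key]

insert("Round Hill", "A-305")
insert("Redwood", "A-222")
insert("Round Hill", "A-999")   # hypothetical second record with the same key
```

Because the bucket count is fixed, a growing file makes the buckets longer and the linear scan slower, which is the problem dynamic hashing addresses.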

20 Dynamic (Extendable) Hash Function (EHF)
• Resolves the problems of static hashing
  - Allows the hash function to be modified dynamically, accommodating changes in DB size (no buckets reserved for future growth)
  - Minimizes space overhead, i.e., the bucket address table (b-a-t) is small
• Allows buckets to be split or combined to maintain space efficiency
• Buckets are created on demand, as records are inserted
  - Result: low performance overhead (reorganization involves one bucket at a time)

21 Dynamic (Extendable) Hash Function (EHF)
• EHF uses i bits of the hash value as an offset into the b-a-t; i grows and shrinks with DB size
• i bits (a number that changes as the file grows) of h(K) are required to determine the correct bucket for K
• All entries of the i-bit b-a-t that point to the same bucket j have a common hash prefix (chp), and bucket j is associated with an integer $i_j$ that denotes the length of the chp
  Number of b-a-t entries that point to bucket j = $2^{(i - i_j)}$

22 Figure. General extendable hash structure

23 Dynamic (Extendable) Hash Function (EHF)
• Lookup K, a search-key value: locate the bucket pointed to by the b-a-t entry that is determined by the first i high-order bits of h(K)
• Insert a record r with search-key value K
  1. Lookup K and locate bucket j
  2. If j is not full, insert the info of K in j and r in the file
  3. If j is full, create a new bucket z. There are two cases to be considered:

24 Dynamic (Extendable) Hash Function (EHF)
(a) Case $i = i_j$ (only one entry in b-a-t points to j):
  1. Increase i by 1, i.e., double the size of the b-a-t. Each entry is replaced by 2 entries that contain the same pointer as the original entry
  2. (For the b-a-t entry that causes the split) Set the 2nd entry created from the entry for j to point to z
  3. Set $i_j = i$ (new) and $i_z = i$ (new)
  4. Rehash the records in j based on the new i and redistribute them between j and z
  5. Re-attempt to insert r, and repeat the whole process if r and all records in j have the same hash prefix

25 Figure. Sample deposit file. Figure. Hash function for branch-name. Figure. Initial extendable hash structure (each bucket can hold up to 2 records). Figure. Hash structure after 3 insertions (Downtown, Round Hill, Perryridge).

26 Figure. Sample deposit file. Figure. Hash function for branch-name. Figure. Hash structure after four insertions.

27 Hashing
• Insert a record r with search-key value K
(b) Case $i > i_j$ (more than one entry in b-a-t points to j):
  1. Increase $i_j$ by 1 and set $i_z$ to the new $i_j$
  2. Adjust the entries in b-a-t that point to j: leave the first half pointing to j and set the remaining ones to point to z
  3. Rehash the records in j and allocate them between j and z
  4. Re-attempt to insert r, and repeat the whole insertion process if r and all records in j have the same hash prefix
• Delete a record r with search-key value K:
  1. Lookup K and locate bucket j
  2. Remove K from j and r from the file. Remove j if j becomes empty
  3. Adjust b-a-t if necessary
• Disadvantages
  - Lookup involves an additional level of indirection (must access b-a-t)
  - Additional complexity in implementation
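The lookup and insertion steps for extendable hashing, covering both split cases, can be sketched as follows. The sketch makes several assumptions that are not in the slides: it uses `zlib.crc32` as a 32-bit hash function, stores one record per distinct key, keeps everything in memory, and uses hypothetical class names. Deletion and bucket merging are left out. Keys whose full 32-bit hashes collide would need overflow buckets, which this sketch does not provide.

```python
import zlib

HASH_BITS = 32   # width of the assumed hash function (crc32)

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # i_j: length of the common hash prefix
        self.items = {}      # key -> record

class ExtendableHash:
    def __init__(self, capacity=2):
        self.capacity = capacity     # records per bucket
        self.i = 0                   # number of hash bits used by the table
        self.table = [Bucket(0)]     # bucket address table with 2**i entries

    def _index(self, key):
        # The first i high-order bits of h(key) select the table entry.
        return zlib.crc32(key.encode("utf-8")) >> (HASH_BITS - self.i)

    def lookup(self, key):
        return self.table[self._index(key)].items.get(key)

    def insert(self, key, record):
        while True:
            j = self.table[self._index(key)]
            if key in j.items or len(j.items) < self.capacity:
                j.items[key] = record
                return
            if j.depth == HASH_BITS:
                raise OverflowError("bucket full and no hash bits left to split on")
            if j.depth == self.i:
                # Case (a): double the table; each entry becomes two copies.
                self.table = [b for b in self.table for _ in (0, 1)]
                self.i += 1
            # Case (b), also reached after doubling: split bucket j.
            j.depth += 1
            z = Bucket(j.depth)
            first = self.table.index(j)               # entries for j are contiguous
            count = 2 ** (self.i - j.depth + 1)       # entries that pointed to j
            for t in range(first + count // 2, first + count):
                self.table[t] = z                     # second half now points to z
            old, j.items = j.items, {}
            for k, r in old.items():                  # rehash records of j
                self.table[self._index(k)].items[k] = r
            # Loop again to re-attempt the insertion of the new record.

eh = ExtendableHash(capacity=2)
# Branch names; "Brighton" and "Mianus" are added here only as extra test keys.
names = ["Downtown", "Round Hill", "Perryridge", "Redwood", "Brighton", "Mianus"]
for name in names:
    eh.insert(name, name.upper())
```

After any sequence of insertions, the table has 2**i entries and each bucket j is referenced by exactly 2**(i - i_j) of them, which matches the count given for the bucket address table.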

28 Figure. Sample deposit file. Figure. Hash function for branch-name. Figure. Extendable hash structure for the deposit file.

29 Figure. Sample account file. Figure. Hash function for branch-name. Figure. Initial extendable hash structure.

30 Figure 11.28 Hash structure after four insertions

31 Figure 11.29 Hash structure after seven insertions. Figure 11.29 Hash structure after nine insertions.

32 Figure 11.30 Extendable hash structure for the account file

33 Indexing & Hashing
• The expected types of queries are critical to the choice between indexing and hashing
• Comparison
  - For a query with an equality comparison on an attribute, hashing is preferable
  - For a query that specifies a range of values, indexing is preferable
  - Most DB systems use indexing, because it is difficult to find a good hash function that preserves order to support range queries
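The trade-off can be seen in a small Python sketch. A `dict` stands in for a hash index and a sorted list searched with `bisect` stands in for an ordered index; both stand-ins, the function names and two of the four records are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# (balance, account-number) records; the last two rows are made up.
records = [(350, "A-305"), (700, "A-222"), (500, "A-101"), (900, "A-110")]

# Hash-style index: direct access by exact key, no useful ordering.
hash_index = {}
for balance, account in records:
    hash_index.setdefault(balance, []).append(account)

# Ordered index: keys kept sorted, so neighbouring keys are adjacent.
ordered = sorted(records)
ordered_keys = [balance for balance, _ in ordered]

def equality_query(balance):
    """One hash probe finds all records with this exact key."""
    return hash_index.get(balance, [])

def range_query(lo, hi):
    """Two binary searches bound a contiguous slice of the ordered index."""
    start = bisect_left(ordered_keys, lo)
    end = bisect_right(ordered_keys, hi)
    return [account for _, account in ordered[start:end]]
```

Answering the range query from the hash index alone would mean probing every possible key value or scanning every bucket, since hashing scatters adjacent keys.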

