Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.

Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search speed Hash Indices (as separate from file organization) Static vs. Dynamic Hashing Extendable Hashing –with a detailed example of how it works

Today HW#3 due today HW #4: due Thursday April 24 (next week) –Questions: 12.11, 12.12, 12.13, 12.16 Today: –indices on multiple attributes –grid files –bitmap indices

Index Definition in SQL Create an index create index on ( ) e.g., create index br-index on branch(branch-name) Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key –not really required if SQL unique integrity constraint is supported To drop an index drop index

Multiple-Key Access With some queries we can use multiple indices Example: select account-number from account where branch-name=“Perryridge” and balance=1000 Possible strategies for processing this query using indices on single attributes: –use index on balance to find accounts with balances =1000, then test them individually to see if branch-name=“Perryridge” –use index on branch-name to find accounts with branch- name=“Perryridge”, then test them individually to see if balances =1000 –use branch-name index to find pointers to all records of the Perryridge branch, and use balance index similarly, then take intersection of both sets of pointers

Multiple-Key Access (2) With some queries using a single-attribute index is unnecessarily expensive –with methods (1) and (2) from the earlier slide, we might have the index we use return a very large set, even though the final result is quite small –Even with method (3) (use both indices and then find the intersection) we might have both indices return a large set, which will make for a lot of unneeded work if the final result (the intersection) is small An alternative strategy is to create and use an index on more than one attribute -- in this example, an index on (branch-name, balance) (both attributes)

Indices on Multiple Attributes Suppose we have an ordered index on the combined search-key (branch-name, balance) Examining the earlier query with the clause where branch-name=“Perryridge” and balance=1000 Our new index will fetch only records that satisfy both conditions -- much more efficient than trying to answer the query with separate single-valued indices We can also handle range queries like: where branch-name=“Perryridge” and balance<1000 But our index will not work for where branch-name<“Perryridge” and balance=1000

Multi-Attribute Indexing Example: EMP(eno, ename, age, sal) lots of ways to handle this problem separate indices: lots of false hits combined index based on composite key –key= sal*100 + age –search for 30<sal<40 translates into 3000<key<4000 (easy) –search for 60<age<80 difficult Grid files (up next) R-tree (B-tree generalization) Quad-trees, K-d trees, etc...

Grid Files Structure used to speed processing of general multiple search-key queries involving one or more comparison operators The grid file has: –a single grid array –array has number of dimensions equal to the number of search-key attributes –one linear scale for each search-key attribute Multiple cells of the grid array can point to the same bucket To find the bucket for a search-key value, locate the row and column of the cell using the linear scales to get the grid location, then follow the pointer in that grid location to the bucket

Example Grid File for account

Queries on a Grid File A grid file on two attributes A and B can handle queries of all the following forms with reasonable efficiency: –(a1 < A < a2) –(b1 < B < b2) –(a1 < A < a2 and b1 < B < b2) For example, to answer (a1 < A < a2 and b1 < B < b2), use the linear scales to find corresponding candidate grid array cells, and look up all the buckets pointed-to from those cells

Grid Files (cont) During insertion, if a bucket becomes full, a new bucket can be created if more than one cell points to it –idea similar to extendable hashing, but in multiple dimensions –if only one cell points to the bucket, either an overflow bucket must be created or the grid array size must be increased Linear scales can be chosen to uniformly distribute records across buckets –not necessary to have scale uniform across the domain -- if records are distributed in some other pattern, the linear scale can mirror it (as shown in the example earlier)

Grid Files (end) Periodic re-organization to increase grid array size helps with good performance –reorganization can be very expensive, though Space overhead of the grid array can be high R-trees (chapter 23) are an alternative

Bitmap Indices Bitmap indices are a special type of index designed for efficient queries on multiple keys Records in a relation are assumed to be numbered sequentially from 0 –given a number n the objective is for it to be easy to retrieve record n –very easy to achieve if we’re looking at fixed-length records Bitmap indices are applicable on attributes that take on a relatively small number of distinct values –e.g. gender, country, state, hockey team –or an arbitrary mapping of a wider spectrum of values into a small number of categories (e.g. income level: divide income into a small number of levels such as 0-9999, 10K-19999, 20K-49999, 50K and greater) A bitmap is simply an array of bits

Bitmap Indices (cont) In the simplest form a bitmap index on an attribute has a bitmap for each value of the attribute –bitmap has as many bits as there are records –in a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise

Bitmap Indices (cont) Bitmap indices are useful for queries on multiple attributes –not particularly useful for single-attribute queries Queries are answered using bitmap operations –intersection (and) –union (or) –complementation (not) Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap –e.g. 100110 and 110011 = 100010 100110 or 110011 = 110111 not 100110 = 011001

Bitmap Indices (cont) Every 1 bit in the result marks a desired tuple –can compute its location (since records are numbered sequentially and are all of the same size) and retrieve the records –counting number of tuples in the result (SQL count aggregate) is even faster Bitmap indices are generally very small compared to relation size –e.g. if record is 100 bytes, space for a single bitmap on all tuples of the relation takes 1/800 of the size of the relation itself –if bitmap allows for 8 distinct values, bitmap is only 1% of the size of the whole relation

Bitmap Indices (cont) Deletion needs to be handled properly –can’t just store “zero” values at deleted locations (why?) –need existence bitmap to note if the record at location X is valid or not existence bitmap necessary for complementation not(A=v): (not bitmap-A-v) and ExistenceBitmap Should keep bitmaps for all values, even “null” –to correctly handle SQL null semantics for not (A=v) must compute: not(bitmap-A-Null) and (not bitmap-A-v) and ExistenceBitmap

Efficient Implementation of Bitmap Operations Bitmaps are packed into words; a single word and is a basic CPU instruction that computes the and of 32 or 64 bits at once –e.g. one-million-bit maps can be anded with just 31,250 instructions Counting number of 1s can be done fast by a trick: –use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation –add up the retrieved counts –can use pairs of bytes to speed up further in a similar way, but at a higher memory cost (64K values in precomputed array for 2 byte)

Bitmaps and B+-trees Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records –assuming a tuple-id is 64 bits, this becomes worthwhile if >1/64 of the records have a given value This technique merges benefits of bitmap and B+-tree indices –useful if some values are uncommon, and some are quite common –why not useful if there are only a small number of different values total (16, 20, something less than 64)? That covers all of chapter 12.

Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.

Similar presentations

Presentation on theme: "Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.

Similar presentations

Presentation on theme: "Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search."— Presentation transcript:

Similar presentations

About project

Feedback