Spring 2003 ECE569 Lecture 04-2.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang www.ece.rutgers.edu/~yyzhang


1 ECE 569 Database System Engineering, Spring 2003
Yanyong Zhang, www.ece.rutgers.edu/~yyzhang
Course URL: www.ece.rutgers.edu/~yyzhang/spring03

2 Access Paths
- Associative access can be realized via a scan.
- The class of algorithms and data structures designed for translating attribute values into TIDs (tuple identifiers), or into other types of internal addresses of tuples having those attribute values, is called access paths.
- Depending on what kind of selection predicate is to be supported, the techniques for associative access vary greatly.

3 Content addressability techniques
- Primary key access
  - A tuple of a relation must be retrieved efficiently via the value of its primary (unique) key(s), e.g., with key-sequenced files and hashed files.
  - Point query vs. range query
- Secondary key access
  - A set of tuples is produced.
- Multi-table access
  - Tuple access is often based on relationships between different tuples.

4 Operations on files
- Assumptions
  - n = number of records in the file
  - R = number of records that fit in a block
- Lookup: given a key, find the corresponding record
  - On average, n / (2R) block accesses.
- Insertion: add a record to the file (duplicates allowed)
  - Read the last block; a new block may need to be allocated. Requires approximately 2 accesses.
- Deletion: delete a record
  - Look up the record: n / (2R) accesses
  - Write the block back to disk: 1 access
  - Reorganize (unpinned records): move a tuple from the last page to reuse the freed space: 2 disk accesses
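The estimates on this slide can be tabulated with a few lines of Python. This is only a sketch: the function name `heap_file_costs` and the sample values of n and R are assumptions, and adding the three deletion steps into one total is one reading of the slide rather than something it states.

```python
def heap_file_costs(n: int, R: int) -> dict:
    """Average block accesses for an unordered file, using the slide's estimates.

    n: number of records in the file
    R: number of records that fit in one block
    """
    lookup = n / (2 * R)      # on average, half of the n/R blocks are scanned
    insert = 2                # read the last block, then write it back
    delete = lookup + 1 + 2   # find the record, write back, reorganize (assumed sum)
    return {"lookup": lookup, "insert": insert, "delete": delete}


# Hypothetical file: 100,000 records, 10 records per block.
costs = heap_file_costs(n=100_000, R=10)
print(costs)  # {'lookup': 5000.0, 'insert': 2, 'delete': 5003.0}
```

With these sample numbers a lookup touches about 5,000 blocks, which is the cost that hashed files are meant to reduce.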

5 Hashed Files
- The file is divided into B buckets.
- A hash function h maps elements of the key space to the range [0, B).
  - The key space is large and unevenly distributed.
    - Example: SSNs as character strings
    - Each character takes on at most 10 of the possible 256 values.
  - The hash function h must map key values evenly among a relatively small number of values.
- A bucket directory is an array of B pointers to the allocated buckets.
  - Small enough to fit entirely in memory
  - Buckets are allocated only as they are needed.

6 Hashed files

7 Hash-based associative access
[Figure: FOLDING maps the range of potential key values (the shaded areas denote used key values) to a range of positive integers; HASHING maps that range to the tuple address space.]

8 Folding
- Convert arbitrary data types to a positive integer that h can be applied to.
- Reduce the number of bits so that arithmetic is efficient.
- Example: the key is “Keefe” and 16803.
  - The key value is the concatenation of the byte representations of the individual fields:
    0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3
  - Partition the result into words and combine them using XOR to obtain the folded value:
    0x4b 0x65 0x65 0x66
    0x65 0x0  0x0  0x0
    0x41 0xa3 0x0  0x0
    XOR: 0x6f 0xc6 0x65 0x66 = 1875273062
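The folding example can be reproduced with a short Python sketch. The function name `fold`, the 4-byte word size, the zero padding of the name field to 8 bytes, and the big-endian 2-byte encoding of 16803 are assumptions read off the byte string on the slide.

```python
def fold(key_bytes: bytes, word_size: int = 4) -> int:
    """XOR-fold a byte string into a single word_size-byte unsigned integer."""
    # Pad with zero bytes so the key splits evenly into words.
    padding = (-len(key_bytes)) % word_size
    padded = key_bytes + b"\x00" * padding
    folded = 0
    for i in range(0, len(padded), word_size):
        folded ^= int.from_bytes(padded[i:i + word_size], "big")
    return folded


# Key fields from the slide: the name "Keefe" (zero-padded to 8 bytes) and 16803.
name = "Keefe".encode("ascii").ljust(8, b"\x00")
number = (16803).to_bytes(2, "big")   # 0x41 0xa3
folded = fold(name + number)
print(hex(folded), folded)  # 0x6fc66566 1875273062
```

The result matches the folded value given on the slide, which suggests the assumed byte layout is the one the example uses.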

9 Hashing
- Goal of hashing
- How to choose a hash function if all the key values are uniformly distributed?
- The critical issue is to produce a 1:1 mapping.
- Collision: different inputs are mapped to the same output.
- The criterion for a good hash function is to keep the number of collisions as small as possible.

10 Static Hashing
- Input: folded key values
- Output: byte offsets (relative to the beginning of the file) or blocks?
  - Bytes are not good because of the varying tuple size.
  - A block/page is called a bucket.
- $H: \{0, \dots, 2^{32} - 1\} \to \{0, \dots, B - 1\}$
  - Contiguous allocation
  - Fixed size: B pages are allocated at file creation time.
  - Insert
    - Determine the bucket.
    - Check the bucket (a collision may happen).

11 How to find a good hash function
- Division / remainder (congruential hashing): $H(k_b) = k_b \bmod B$, where $k_b$ is the folded key value and $B$ is the number of buckets.
- Nth power
  - Compute $k_b^N$, and from the resulting bit string ($N \times 31$ bits) take $\log_2 B$ bits from the middle.
- Base transformation
- Polynomial division
- Numerical analysis
- Encryption
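The first two techniques are simple enough to sketch in Python. The function names, the default N = 2, the restriction of B to a power of two for the Nth-power variant, and the exact choice of which bits count as "the middle" are assumptions made for illustration.

```python
def division_hash(k_b: int, B: int) -> int:
    """Division/remainder hashing: bucket number = folded key mod B."""
    return k_b % B


def nth_power_hash(k_b: int, B: int, N: int = 2, key_bits: int = 31) -> int:
    """Nth-power hashing: take log2(B) bits from the middle of k_b ** N.

    Assumes B is a power of two and k_b fits in key_bits bits, so that
    k_b ** N is treated as a bit string of N * key_bits bits.
    """
    bits = B.bit_length() - 1
    if B < 1 or (1 << bits) != B:
        raise ValueError("B must be a power of two in this sketch")
    if k_b < 0 or k_b.bit_length() > key_bits:
        raise ValueError("k_b does not fit in key_bits bits")
    total_bits = N * key_bits
    shift = (total_bits - bits) // 2      # drop the low-order half around the middle
    return ((k_b ** N) >> shift) & (B - 1)


k_b = 1875273062                  # folded value of the key ("Keefe", 16803)
print(division_hash(k_b, 8))      # 6
print(nth_power_hash(k_b, 8))     # some bucket number in 0..7
```

Division hashing is usually paired with a prime B in practice; a power of two is used here only so the same B works for both functions.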

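12 Performance
- Assumption
  - Perfect hash function (tuples are uniformly distributed over B buckets)
- Lookup
  - $\left\lceil \frac{1}{2} \left\lceil n/R \right\rceil \cdot \frac{1}{B} \right\rceil$ accesses to find the first match
  - $\left\lceil \left\lceil n/R \right\rceil \cdot \frac{1}{B} \right\rceil$ accesses if the tuple does not exist
- Insertion
  - $\left\lceil \left\lceil n/R \right\rceil \cdot \frac{1}{B} \right\rceil + 1$ accesses with a test for duplicates
  - 1 access otherwise
- Deletion
  - $\left\lceil \frac{1}{2} \left\lceil n/R \right\rceil \cdot \frac{1}{B} \right\rceil$ accesses to delete the first match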
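To get a feel for these formulas, the sketch below evaluates them for sample values. The function names and the values of n, R, and B are assumptions; the formulas themselves are the ones listed on this slide, computed with integer arithmetic to avoid rounding surprises.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def hashed_file_costs(n: int, R: int, B: int) -> dict:
    """Page accesses for a hashed file with a perfect hash function."""
    blocks = ceil_div(n, R)                # ceil(n / R) blocks in the file
    hit = ceil_div(blocks, 2 * B)          # ceil(1/2 * ceil(n/R) * 1/B)
    miss = ceil_div(blocks, B)             # ceil(ceil(n/R) * 1/B)
    return {
        "lookup_first_match": hit,
        "lookup_missing": miss,
        "insert_with_duplicate_test": miss + 1,
        "insert_without_test": 1,
        "delete_first_match": hit,
    }


# Hypothetical file: 100,000 tuples, 10 tuples per page, 100 buckets.
print(hashed_file_costs(n=100_000, R=10, B=100))
# lookup_first_match = 50, lookup_missing = 100, insert_with_duplicate_test = 101
```

For the same n and R, an unhashed file needs about n / (2R) = 5,000 accesses per lookup, so the hashed file is cheaper by roughly a factor of B.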

13 Collision
- Two keys collide if they hash to the same value.
- A bucket with room for R tuples can accommodate R - 1 collisions before it overflows.
  - Internal resolution: place overflow blocks in another bucket
    - (h(K) + 1) mod B: open addressing
    - h2(h1(K)): multiple hashing
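Open addressing at the bucket level can be sketched as follows. The class name `OpenAddressingFile`, the use of `key % B` as h, and the no-deletions assumption (which lets a lookup stop at the first bucket that is not full) are all illustrative choices, not part of the lecture.

```python
class OpenAddressingFile:
    """Static hash file that resolves overflow by probing (h(K) + 1) mod B."""

    def __init__(self, B: int, R: int):
        self.B = B                              # number of buckets
        self.R = R                              # tuples per bucket
        self.buckets = [[] for _ in range(B)]

    def insert(self, key: int) -> int:
        """Store key in the first non-full bucket; return that bucket number."""
        home = key % self.B
        for step in range(self.B):
            b = (home + step) % self.B
            if len(self.buckets[b]) < self.R:
                self.buckets[b].append(key)
                return b
        raise RuntimeError("file is full")

    def lookup(self, key: int):
        """Return (found, buckets_read). Assumes tuples are never deleted."""
        home = key % self.B
        for step in range(self.B):
            bucket = self.buckets[(home + step) % self.B]
            if key in bucket:
                return True, step + 1
            if len(bucket) < self.R:            # a non-full bucket ends the probe
                return False, step + 1
        return False, self.B


f = OpenAddressingFile(B=4, R=2)
for key in (0, 4, 8):          # all three keys hash to bucket 0
    f.insert(key)
print(f.lookup(8))             # (True, 2): found in the neighbouring bucket
print(f.lookup(12))            # (False, 2)
```

The second read in each lookup is the price of the overflow: tuples that spill into a neighbouring bucket lengthen the probe sequence for every key that hashes nearby.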

14 Collision - continued
  - External resolution: allocate an overflow block and link it to the bucket's overflow chain.
[Figure labels: buckets, overflow pages]
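External resolution can be sketched with a linked chain of overflow pages per bucket. The class names `Page` and `ChainedHashFile` and the use of `key % B` as the hash function are assumptions for illustration.

```python
class Page:
    """One page holding up to R tuples, plus a link to an overflow page."""

    def __init__(self):
        self.tuples = []
        self.overflow = None


class ChainedHashFile:
    """Static hash file with B primary buckets and overflow chains."""

    def __init__(self, B: int, R: int):
        self.B = B
        self.R = R
        self.buckets = [Page() for _ in range(B)]

    def insert(self, key: int) -> None:
        page = self.buckets[key % self.B]
        # Walk the chain to the first page with free space,
        # allocating a new overflow page at the end if necessary.
        while len(page.tuples) >= self.R:
            if page.overflow is None:
                page.overflow = Page()
            page = page.overflow
        page.tuples.append(key)

    def lookup(self, key: int):
        """Return (found, pages_read)."""
        page = self.buckets[key % self.B]
        reads = 0
        while page is not None:
            reads += 1
            if key in page.tuples:
                return True, reads
            page = page.overflow
        return False, reads


f = ChainedHashFile(B=4, R=2)
for key in (0, 4, 8, 12, 16):   # all hash to bucket 0
    f.insert(key)
print(f.lookup(16))             # (True, 3): primary page plus two overflow pages
print(f.lookup(1))              # (False, 1)
```

Unlike open addressing, an overflowing bucket here slows down only its own chain; lookups in other buckets still cost one page read.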

15 Discussion
- How do you limit the number of pages accessed when retrieving a tuple, for both external and internal resolution?

16 How to locate a tuple in a page?
- Sequential search
- Page directory
- Hash

17 Extendible Hashing
- The number of buckets can grow and shrink.
- An intermediate data structure translates the hash results into page addresses. This data structure needs to be as compact as possible.
  - The hash function hashes into an array of pointers to buckets (the directory).
  - The array is small enough to be kept in memory.

18 Directory Growth
- To adapt to the dynamically varying size of the hash file, modify the directory size.
- Assume a hash function $h(k_b)$ that produces a bit string s.
- The directory is of size $2^d$; d is called the global depth and is initially 0.
- Use the least significant d bits of s to determine which bucket to access.
- Each bucket has a corresponding local depth in the range [0, d].

19 Example
- Insert 0x13, 0x10, 0x07, 0x00, 0x1f
- Each page can contain no more than 2 tuples.

20 Example – insert 0x1f
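The example can be traced with a small extendible-hashing sketch. The class and method names are hypothetical, the inserted values are treated directly as the hash bit strings, and keys are assumed distinct (repeated identical keys beyond the page capacity would split forever). The slide's own figures are not part of this transcript, so the resulting layout is what this sketch produces under those assumptions, not a reproduction of the lecture's drawing.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    """Directory of 2**d bucket pointers, indexed by the d low-order bits."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity          # tuples per page
        self.global_depth = 0             # d, initially 0
        self.directory = [Bucket(0)]

    def _index(self, key: int) -> int:
        return key & ((1 << self.global_depth) - 1)

    def lookup(self, key: int) -> bool:
        return key in self.directory[self._index(key)].keys

    def insert(self, key: int) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)           # may need to repeat (see 0x1f below)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory: entry i + 2**d mirrors entry i.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << bucket.local_depth     # the newly significant bit
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & bit else bucket).keys.append(k)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling


table = ExtendibleHash(capacity=2)
for key in (0x13, 0x10, 0x07, 0x00, 0x1F):
    table.insert(key)

print("global depth:", table.global_depth)
for i, b in enumerate(table.directory):
    print(format(i, "03b"), "local depth", b.local_depth, [hex(k) for k in b.keys])
```

In this sketch, inserting 0x1f is the interesting step: 0x13, 0x07, and 0x1f share their two low-order bits (11), so one split is not enough and the directory doubles twice before the three keys separate.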

21 Performance
- Retrieving a tuple takes 2 steps.
- If we can keep the directory in memory, each retrieval is one page access.
- Assuming 4 bytes per entry, 4 KB pages, 1 GB hash files, and that we want to keep the entire directory in memory, what is the minimum buffer size?
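One way to approach the question is to compute a lower bound: the directory needs at least one entry per page of the file. The sketch below assumes binary units (1 KB = 2^10 bytes, 1 GB = 2^30 bytes) and the best case of exactly one directory entry per bucket; the variable names are illustrative.

```python
PAGE_SIZE = 4 * 1024        # 4 KB pages (assumed 2**12 bytes)
FILE_SIZE = 1024 ** 3       # 1 GB hash file (assumed 2**30 bytes)
ENTRY_SIZE = 4              # bytes per directory entry

pages = FILE_SIZE // PAGE_SIZE              # buckets in the file
directory_bytes = pages * ENTRY_SIZE        # one entry per bucket (best case)
directory_pages = directory_bytes // PAGE_SIZE

print(pages)             # 262144  (2**18 buckets, so global depth d >= 18)
print(directory_bytes)   # 1048576 (1 MB)
print(directory_pages)   # 256 buffer pages
```

This is only a lower bound. Whenever some bucket's local depth is smaller than the global depth, several directory entries share one bucket, and the directory is larger than one entry per page.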

22 Discussion
- How easy is it to keep the directory in memory?
- How do we reduce the structure when the file shrinks?

