1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.

1 CS143: Index

2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering index) –Tree-based vs. hash-based index Tree-based index –Indexed sequential file –B+-tree Hash-based index –Static hashing –Extendible hashing

3 Basic Problem SELECT * FROM Student WHERE sid = 40 How can we answer the query? sidnameGPA 20Elaine3.2 70Peter2.6 40Susan3.7

4 Random-Order File How do we find sid=40? sidnameGPA 20Susan3.5 60James1.7 70Peter2.6 40Elaine3.9 30Christy2.9

5 Sequential File Table sequenced by sid. Find sid=40? sidnameGPA 20Susan3.5 30James1.7 40Peter2.6 50Elaine3.9 60Christy2.9

6 Binary Search 100,000 records Q: How many blocks to read? Any better way? –In a library, how do we find a book?

7 Basic Idea Build an “index” on the table –An auxiliary structure to help us locate a record given a “key” 20 60 10 40 80 40

8 Dense, Primary Index Primary index (clustering index) –Index on the search key Dense index –(key, pointer) pair for every record Find the key from index and follow pointer –Maybe through binary search Q: Why dense index? –Isn’t binary search on the file the same? 20 10 40 30 60 50 80 70 100 90 Dense Index 10 20 30 40 50 60 70 80 90 100 110 120 Sequential File

9 Why Dense Index? Example –10,000,000 records (900-bytes/rec) –4-byte search key, 4-byte pointer –4096-byte block. Unspanned tuples Q: How many blocks for table (how big)? Q: How many blocks for index (how big)?

10 Sparse, Primary Index Sparse index –(key, pointer) pair per every “block” –(key, pointer) pair points to the first record in the block Q: How can we find 60? Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse Index 10 30 50 70 90 110 130 150

11 Multi-level index Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse 2nd level 10 30 50 70 90 110 130 150 170 190 210 230 10 90 170 250 330 410 490 570 1st level Q: Why multi-level index? Q: Does dense, 2nd level index make sense?

12 Secondary (non-clustering) Index Secondary (non-clustering) index –When tuples in the table are not ordered by the index search key Index on a non-search-key for sequential file Unordered file Q: What index? –Does sparse index make sense? Sequence field 50 30 70 20 40 80 10 100 60 90

13 Sparse and secondary index? 50 30 70 20 40 80 10 100 60 90 30 20 80 100 90...

14 Secondary index 50 30 70 20 40 80 10 100 60 90 10 20 30 40 50 60 70... 10 50 90... sparse High level First level is always dense Sparse from the second level

15 Important terms Dense index vs. sparse index Primary index vs. secondary index –Clustering index vs. non-clustering index Multi-level index Indexed sequential file –Sometimes called ISAM (indexed sequential access method) Search key (  primary key)

16 Insertion 20 10 30 50 40 60 10 30 40 60 Insert 35 Q: Do we need to update higher-level index?

17 Insertion 20 10 30 50 40 60 10 30 40 60 Insert 15 (overflow) Q: Do we need to update higher-level index?

18 Insertion 20 10 30 50 40 60 10 30 40 60 Insert 15 (redistribute) Q: Do we need to update higher-level index?

19 Potential performance problem After many insertions… 10 20 30 40 50 60 70 80 90 39 31 35 36 32 38 34 33 overflow pages (not sequential) Main index

20 Traditional Index (ISAM) Advantage –Simple –Sequential blocks Disadvantage –Not suitable for updates –Becomes ugly (loses sequentiality and balance) over time

21 B+Tree Most popular index structure in RDBMS Advantage –Suitable for dynamic updates –Balanced –Minimum space usage guarantee Disadvantage –Non-sequential index blocks

22 B+Tree Example (n=3) 20 30 50 80 90 70 50 80 70 Leaf Non leaf root 20Susan2.7 30James3.6 50Peter1.8 ………... Balanced: All leaf nodes are at the same level

23 n: max # of pointers in a node All pointers (except the last one) point to tuples At least half of the pointers are used. (more precisely,  (n+1)/2  pointers) Sample Leaf Node (n=3) 20 30 From a non-leaf node Last pointer: to the next leaf node 20Susan2.7 30James3.6 50Peter1.8 ……… points to tuple

24 Points to the nodes one-level below - No direct pointers to tuples At least half of the ptrs used (precisely,  n/2  ) - except root, where at least 2 ptrs used Sample Non-leaf Node (n=3) 23 56 To keys 23  k<56 To keys 56  k To keys k<23

25 Find a greater key and follow the link on the left (Algorithm: Figure 12.10 on textbook) Find 30, 60, 70? Search on B+tree 20 30 50 80 90 70 50 80 70

26 Nodes are never too empty Use at least Non-leaf:  n/2  pointers Leaf:  (n+1)/2  pointers full node min. node Non-leaf Leaf n=4 5 8 10 5 5 8

27 Non-leaf (non-root) nn-1  n/ 2   n/ 2  -1 Leaf (non-root) nn-1 Rootnn-121 Max Max Min Min Ptrs keys ptrs keys  (n+1)/2   (n-1)/2  Number of Ptrs/Keys for B+tree

28 (a) simple case (no overflow) (b) leaf overflow (c) non-leaf overflow (d) new root B+Tree Insertion

29 Insert 60 Insertion (Simple Case) 20 30 50 80 90 70 50 80 70

30 Insert 55 No space to store 55 Insertion (Leaf Overflow) 20 30 50 60 80 90 70 50 80 70

31 Insert 55 Insertion (Leaf Overflow) 20 30 50 55 80 90 70 50 60 80 70 60 Q: After split, leaf nodes always half full? No overflow. Stop

32 Insertion (Non-leaf Overflow) Insert 52 20 30 50 55 60 50 60 70 Leaf overflow. Split and copy the first key of the new node

33 50 60 Insertion (Non-leaf Overflow) Insert 52 20 30 50 52 60 55 70 55 No overflow. Stop Q: After split, non-leaf at least half full?

34 Insertion (New Root Node) Insert 25 20 30 50 55 60 50 60

35 60 30 50 Insertion (New Root Node) Insert 25 20 25 50 55 60 30 Q: At least 2 ptrs at root?

36 B+Tree Insertion Leaf node overflow –The first key of the new node is copied to the parent Non-leaf node overflow –The middle key is moved to the parent Detailed algorithm: Figure 12.13

37 B+Tree Deletion (a) Simple case (no underflow) (b) Leaf node, coalesce with neighbor (c) Leaf node, redistribute with neighbor (d) Non-leaf node, coalesce with neighbor (e) Non-leaf node, redistribute with neighbor In the examples, n = 4 –Underflow for non-leaf when fewer than  n/2  = 2 ptrs –Underflow for leaf when fewer than  (n+1)/2  = 3 ptrs –Nodes are labeled as a, b, c, d, …

38 (a) Simple case Delete 25 20 25 30 40 50 20 40 60 a bcde

39 (b) Coalesce with sibling (leaf) Delete 50 20 30 40 50 20 40 60 60 bcde a

40 (b) Coalesce with sibling (leaf) Delete 50 –Check underflow at a. Min 2 ptrs, currently 3 20 30 40 20 60 60 bce a

41 (c) Redistribute (leaf) Delete 50 20 25 30 40 50 20 40 60 60 bcde a

42 (c) Redistribute (leaf) Delete 50 –No underflow at a. Done. 20 25 30 40 20 40 60 60 bcde a 30

43 (d) Coalesce (non-leaf) Delete 20 –Underflow! Merge d with e. Move everything in the right to the left 10 20 30 40 50 90 50 60 70 30 70 a bc defg

44 (d) Coalesce (non-leaf) Delete 20 –Underflow at a ? Min 2 ptrs. Currently 2. Done. 10 30 40 90 50 60 70 50 70 a b dfg

45 (e) Redistribute (non-leaf) Delete 20 –Underflow! Merge d with e. 10 20 30 40 50 99 50 60 70 30 70 90 97 a bc defg

46 (e) Redistribute (non-leaf) Delete 20 –Underflow at a ? Min 2 ptrs, currently 3. Done 10 30 40 90 99 50 60 70 50 70 97 a bc dfg

47 Important Points Remember: –For leaf node merging, we delete the mid-key from the parent –For non-leaf node merging/redistribution, we pull down the mid-key from their parent. Exact algorithm: Figure 12.17 In practice –Coalescing is often not implemented Too hard and not worth it

48 Where does n come from? n determined by –Size of a node –Size of search key –Size of an index pointer Q: 1024B node, 10B key, 8B ptr  n?

49 Question on B+tree SELECT * FROM Student WHERE sid > 60? 20 30 50 60 80 90 70 50 80 70

50 Summary on tree index Issues to consider –Sparse vs. dense –Primary (clustering) vs. secondary (non-clustering) Indexed sequential file (ISAM) –Simple algorithm. Sequential blocks –Not suitable for dynamic environment B+trees –Balanced, minimum space guarantee –Insertion, deletion algorithms

51 Index Creation in SQL CREATE INDEX ON (,,…) Example –CREATE INDEX stidx ON Student(sid) Creates a B+tree on the attributes Speeds up lookup on sid

52 Primary (Clustering) Index MySQL: –Primary key becomes the clustering index DB2: –CREATE INDEX idx ON Student(sid) CLUSTER –Tuples in the table are sequenced by sid Oracle: Index-Organized Table (IOT) –CREATE TABLE T (... ) ORGANIZATION INDEX –B+tree on primary key –Tuples are stored at the leaf nodes of B+tree Periodic reorganization may still be necessary to improve range scan performance

53 Next topic Hash index –Static hashing –Extendible hashing

54 What is a Hash Table? Hash Table –Hash function h (k): key  integer [0…n] e.g., h (‘Susan’) = 7 –Array for keys: T[0…n] –Given a key k, store it in T[ h ( k )] 0 1Neil 2 3James 4Susan 5 h(Susan) = 4 h(James) = 3 h(Neil) = 1

55 Hashing for DBMS (Static Hashing) (key, record)...... Disk blocks (buckets) search key  h(key) 0 1 2 3 4

56 Overflow and Chaining Insert h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 h(e) = 1 Delete h(b) = 2 h(c) = 1 0 1 2 3 a b c d e

57 Major Problem of Static Hashing How to cope with growth? –Data tends to grow in size –Overflow blocks unavoidable 10 20 30 40 50 60 70 80 90 39 31 35 36 32 38 34 33 overflow blocks hash buckets

58 (a) Use i of b bits output by hash function Extendible Hashing (two ideas) 00110101 h(K)  b use i  grows over time

59 (b) Use directory that maintains pointers to hash buckets (indirection) Extendible Hashing (two ideas)............ c e hash bucket directory h(c)

60 Example h(k) is 4 bits; 2 keys/bucket Insert 0111 1 1 0001 1001 1100 1 0 1 i = 0111

61 Example Insert 1010 (overflow) 1 1 0001 1001 1100 0111 i = 1 0 1 Increase i of the bucket. Split it.

62 Example Insert 0000 1 2 0001 1001 1010 0111 2 1100 00 01 10 11 2 i = Split bucket and increase i

63 Insert 0011 (double directory) 2 2 0111 1001 1010 2 1100 00 01 10 11 2 i = 2 0000 0001 Split bucket, increase i, redistribute keys

64 Extendible Hashing: Deletion Two options a)No merging of buckets b)Merge buckets and shrink directory if possible

65 Delete 1010 (merge buckets and shrink directory) 1 0001 2 1001 2 1100 00 01 10 11 2 i = a b c 1010

66 Bucket Merge Condition Bucket merge condition –Bucket i’s are the same –First (i-1) bits of the hash key are the same Directory shrink condition –All bucket i’s are smaller than the directory i

67 Questions on Extendible Hashing Can we provide minimum space guarantee?

68 Space Waste 21 4 0001 4 0000 4 i = 3

69 Hash index summary Static hashing –Overflow and chaining Extendible hashing –Can handle growing files No periodic reorganizations –Indirection Up to 2 disk accesses to access a key –Directory doubles in size Not too bad if the data is not too large

70 Hashing vs. Tree Can an extendible-hash index support? SELECT FROM R WHERE R.A > 5 Which one is better, B+tree or Extendible hashing? SELECT FROM R WHERE R.A = 5

1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.

Similar presentations

Presentation on theme: "1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.

Similar presentations

Presentation on theme: "1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering."— Presentation transcript:

Similar presentations

About project

Feedback