Presentation is loading. Please wait.

Presentation is loading. Please wait.

Notions of clustering Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with.

Similar presentations


Presentation on theme: "Notions of clustering Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with."— Presentation transcript:

1 Notions of clustering Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with same search key are stored together.

2 Index-based algorithms: selection To evaluate  a= (R) use index on a, if it exists Cost: cost of index lookup (negligible) plus If index is clustering: B(R)/V(R,a) I/O’s (the fraction of the relation with some value for a) Otherwise, with a non-clustering index, an approximation is that each tuple we retrieve is in a different block, so we get: T(R)/V(R,a) I/O’s

3 Example of index-based selection  a= (R), and B(R) = 1000, T(R) =20,000 R is clustered, but no index on attribute a  1000 disk I/O’s R has a clustering index on a, V(R,a) = 100  10 I/O’s R has a non-clustering index on a, V(R,a) = 100  20,000/100 = 200 disk I/O’s V(R,a) = 20,000 (i.e. attribute a is key)  just 1 I/O

4 Index joins We want to compute R(X,Y)  S(Y,Z) Suppose there is a Y-index on S. Algorithm: For each tuple t of R, lookup all tuples in S with key-value t[Y] and output the join of t. Cost: B(R) to read R (clustered case) --- We can neglect this cost Each tuple of R joins with T(S)/V(S,Y) tuples of S, on average. If S has a non-clustered index on Y:  I/O cost is T(R)T(S)/V(S,Y) If S has a clustered index on Y:  I/O cost is T(R)B(S)/V(S,Y)

5 Example of index-join T(R) = 10,000, B(R) = 1000 T(S) = 5000, B(S) = 500, V(S,Y) = 100 To compute R(X,Y)  S(Y,Z) using a clustered Y-index on S: 1000 + 10,000*(500/100) = 51,000 I/O’s Bad!!

6 However, things are not so bad in practice Suppose we have the relations: StarsIn(title, year, starName) MovieStar(name, address, gender, birthdate) And there is an index on MovieStar.name Consider the SQL query: SELECT birthdate FROM StarsIn, MovieStar WHERE title = 'King Kong' AND starName = name;

7 We can first do the selection of those tuples in StarsIn relation with title=‘King Kong’. Suppose they are 10 such tuples. Now, we know that stars take care to not have the same name with some other star. So, name is a key for the relation MovieStar. (V(MovieStar, name) = ?) Hence, V(MovieStar, name) = T(MovieStar) Finally the number of I/O’s is: B(StarsIn) + T(  name=‘King Kong’ (StarsIn)) Practice (Cont’d)

8 Joins using sorted indexes We want to compute R(X,Y)  S(Y,Z) If S has a B-tree index on Y, Create sorted sublists of R only, and Do a sort join, extracting the S-tuples in order through the index Or, if both have B-tree index on Y, do a zigzag-join.

9 Example (B-Tree index on S[Y]) T(R) = 10,000, B(R) = 1000, T(S) = 5000, B(S) = 500, V(S,Y) = 100, S has a B-Tree index on Y Assume that both relations and the indexes are clustered. M = 101 buffers Create 10 sorted sublists of R. Cost: 2B(R) 10 buffers for sublists of R, 1 buffer for S (retrieved via index) Join tuples from input buffers Total cost: 2B(R) + B(R) + B(S) + index lookup = 2000 + 1000 + 500 + index lookup = 3500 + index lookup

10 Zigzag Join Suppose we have B-Tree indexes on both S[Y] and R[Y]. We can jump back and forth between the indexes finding Y- values that they share in common. Tuples from R with Y-value that doesn’t appear in S need never be retrieved, and similarly tuples of S whose Y-value doesn’t appear in R need never be retrieved. Example. Let the Y-values for R be: 1,3,4,4,4,5,6 Let the Y-values for S be: 2,2,4,4,6,7 Start with the 1 and 2. Since 1<2 skip 1 in R. Since 2<3 skip the 2’s in S. Since 3<4 skip 3 in R. Join 4’s. …

11 Example (Zigzag Join) T(R) = 10,000, B(R) = 1000, T(S) = 5000, B(S) = 500, S and R both have clustered B-Tree indexes on Y We use just 1000+500 disk I/O’s to read the blocks of R and S through their indexes. We can determine from the indexes alone that a large fraction of R or S cannot match tuples of the other relation, so the cost might be considerably less than 1500 I/O’s.


Download ppt "Notions of clustering Clustered relation: tuples are stored in blocks mostly devoted to that relation. Clustering index: tuples (of the relation) with."

Similar presentations


Ads by Google