Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.


1 Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department of Computer Science North Dakota State University

2 Outline
• Introduction
  • Review of Aggregate Functions
  • Review of Iceberg Queries
• Algorithms of Aggregate Function Computation Using P-trees
  • SUM, COUNT, and AVERAGE
  • MAX, MIN, MEDIAN, RANK, and TOP-K
• Iceberg Query Operation Using P-trees
  • An Iceberg Query Example
• Performance Analysis
• Conclusion

3 Introduction
• The commonly used aggregate functions include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K.
• There are three types of aggregate functions: distributive, algebraic, and holistic. In the definitions, T is a set of tuples and {S_i | i = 1...n} is a partition of T, that is, ∪_i S_i = T and S_i ∩ S_j = {} for i ≠ j.
Distributive
• An aggregate function F is distributive if there is a function G such that F(T) = G({F(S_i) | i = 1...n}). SUM, MIN, and MAX are distributive with G = F. COUNT is distributive with G = SUM.
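The distributive property can be checked on a small example. The sketch below assumes a hypothetical three-way partition of the seven # Product values from the Sales example used later in the presentation; the split and the variable names are illustrative only.

```python
# Hypothetical partition S_1..S_3 of the seven "# Product" values (assumed split).
partitions = [[10, 5], [6, 7, 11], [9, 3]]
whole = [v for part in partitions for v in part]

# SUM, MIN, MAX: apply F to each subset, then F again to the results (G = F).
sum_of_sums = sum(sum(p) for p in partitions)
min_of_mins = min(min(p) for p in partitions)
max_of_maxes = max(max(p) for p in partitions)

# COUNT: the per-subset counts are combined with SUM (G = SUM).
sum_of_counts = sum(len(p) for p in partitions)

print(sum_of_sums == sum(whole), min_of_mins == min(whole),
      max_of_maxes == max(whole), sum_of_counts == len(whole))
```

Any other partition of the same values should give the same four results, which is what makes these functions easy to compute piecewise.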

4 Review of Aggregate Functions (Cont.)
Algebraic
• An aggregate function F is algebraic if there is an M-tuple-valued function G and a function H such that F(T) = H({G(S_i) | i = 1...n}). AVERAGE, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic.
Holistic
• An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. MEDIAN, MostFrequent (also called the Mode), and RANK are common examples of holistic functions.
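As an illustration of the algebraic case, AVERAGE can be computed from a fixed-size sub-aggregate per subset. This is a minimal sketch: the partition is hypothetical, and `g` and `h` are illustrative names standing in for the functions G and H of the definition, with M = 2.

```python
# Hypothetical partition of the "# Product" values (assumed split).
partitions = [[10, 5], [6, 7, 11], [9, 3]]

def g(subset):
    """Sub-aggregate G: a fixed-size 2-tuple (sum, count) per subset."""
    return (sum(subset), len(subset))

def h(sub_aggregates):
    """H: combine the 2-tuples into the overall average."""
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count

average = h([g(p) for p in partitions])
print(round(average, 2))
```

A holistic function such as MEDIAN has no comparable fixed-size summary: in general each subset would have to pass along all of its values.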

5 Review of Iceberg Queries
• Iceberg queries perform aggregate functions across attributes and then eliminate aggregate values that are below some specified threshold.
• We use an example to review iceberg queries.
    SELECT Location, Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location, Product Type
    HAVING Sum (# Product) >= T

6 Review of Iceberg Queries (Cont.)
• We illustrate the calculation in three steps.
• Step one: Generate the Location-list.
    SELECT Location, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location
    HAVING Sum (# Product) >= T
• Step two: Generate the Product Type-list.
    SELECT Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Product Type
    HAVING Sum (# Product) >= T

7 Review of Iceberg Queries (Cont.)
• Step three: Generate Location and Product Type pair groups.
• Using the Location-list and the Product Type-list generated in the first two steps, we can eliminate many of the Location and Product Type pair groups: a pair can reach the threshold T only if both its Location and its Product Type reached T on their own.
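The three steps can be sketched with ordinary dictionaries before any P-trees are involved. The sketch assumes the seven Sales rows and the threshold T = 15 from the example later in the presentation; `passing_values` is a hypothetical helper name.

```python
from collections import defaultdict

# (Loc, Type, # Product) rows of relation Sales; T = 15 as in the later example.
sales = [
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6), ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]
T = 15

def passing_values(column):
    """Steps one and two: values of one attribute whose total reaches T."""
    totals = defaultdict(int)
    for row in sales:
        totals[row[column]] += row[2]
    return {value for value, total in totals.items() if total >= T}

loc_list = passing_values(0)
type_list = passing_values(1)

# Step three: only aggregate pairs whose Loc and Type both survived.
pair_totals = defaultdict(int)
for loc, typ, n in sales:
    if loc in loc_list and typ in type_list:
        pair_totals[(loc, typ)] += n
result = {pair: total for pair, total in pair_totals.items() if total >= T}
print(result)
```

The pruning is safe here because the summed quantities are non-negative, so a pair's total can never exceed the total of either of its attribute values.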

8 Algorithms of Aggregate Function Computation Using P-trees
• The dataset we used in our example.
• We use the data in relation Sales to illustrate the aggregate function algorithms.
Table 1. Relation Sales.
    Id  Mon  Loc          Type      On line  # Product
    1   Jan  New York     Notebook  Y        10
    2   Jan  Minneapolis  Desktop   N        5
    3   Feb  New York     Printer   Y        6
    4   Mar  New York     Notebook  Y        7
    5   Mar  Minneapolis  Notebook  Y        11
    6   Mar  Chicago      Desktop   Y        9
    7   Apr  Minneapolis  Fax       N        3

9 Algorithms of Aggregate Function Computation Using P-trees (Cont.)
• Table 2 shows the binary representation of the data in relation Sales. Each bit column P_{i,j} holds bit j of attribute i for all seven tuples.
Table 2. Binary Form of Sales.
    Id  Mon                   Loc                   Type                  On line   # Product
        P_{0,3}..P_{0,0}      P_{1,4}..P_{1,0}      P_{2,2}..P_{2,0}      P_{3,0}   P_{4,3}..P_{4,0}
    1   0001                  00001                 001                   1         1010
    2   0001                  00101                 010                   0         0101
    3   0010                  00001                 100                   1         0110
    4   0011                  00001                 001                   1         0111
    5   0011                  00101                 001                   1         1011
    6   0011                  00110                 010                   1         1001
    7   0100                  00101                 101                   0         0011
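Reading Table 2 column by column gives the vertical bit slices that the algorithms operate on. The sketch below builds the four slices of # Product as plain bit strings; it is an uncompressed stand-in for P-trees, and `bit_slices` is a hypothetical helper name.

```python
# "# Product" column of relation Sales, tuples 1..7.
product = [10, 5, 6, 7, 11, 9, 3]
BITS = 4  # P_{4,3} .. P_{4,0}

def bit_slices(values, bits):
    """Map bit position j to a string holding bit j of every value, in tuple order."""
    return {j: "".join(str((v >> j) & 1) for v in values)
            for j in range(bits - 1, -1, -1)}

slices = bit_slices(product, BITS)
for j, column in slices.items():
    print(f"P_4,{j} = {column}")
```

Each string has one character per tuple, so position 1 always refers to tuple 1, whichever slice is read.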

10 Algorithm of Aggregate Function COUNT
• COUNT function: It is not necessary to write a special function for COUNT because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree P_i, RootCount(P_i) returns the number of 1s in P_i.
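On an uncompressed bit vector, RootCount reduces to a population count. The sketch below assumes that simplification (a real P-tree would read the count from its root node) and stores each vector as a Python integer; `root_count` is an illustrative name.

```python
def root_count(ptree: int) -> int:
    """Number of 1 bits in the vector (stand-in for P-tree RootCount)."""
    return bin(ptree).count("1")

# P_{3,0}: the "On line" bit of tuples 1..7 from Table 2 (Y = 1, N = 0).
p_online = int("1011110", 2)

# COUNT of tuples sold on line.
online_count = root_count(p_online)
print(online_count)
```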

11 Algorithm of Aggregate Function SUM
• SUM function: The SUM function totals a field of numerical values.
Algorithm 4.1. Evaluating Sum () with P-trees.
    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount (P_i);
    }
    Return total

12 Algorithm of Aggregate Function SUM
• For example, to find the total number of products sold in relation Sales, we apply Algorithm 4.1 to the bit slices of # Product (values 10, 5, 6, 7, 11, 9, 3):
    P_{4,3} = 1000110    RootCount = 3
    P_{4,2} = 0111000    RootCount = 3
    P_{4,1} = 1011101    RootCount = 5
    P_{4,0} = 0101111    RootCount = 5
    Sum = 2^3 * 3 + 2^2 * 3 + 2^1 * 5 + 2^0 * 5 = 51
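The same computation can be written as a short sketch. It assumes uncompressed bit vectors stored as Python integers in place of P-trees, with `root_count` as a plain population count.

```python
# Bit slices of "# Product", keyed by bit position (leftmost bit = tuple 1).
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(ptree: int) -> int:
    """Number of 1 bits (stand-in for P-tree RootCount)."""
    return bin(ptree).count("1")

# Sum = sum over bit positions of 2^i * RootCount(P_i).
total = sum((1 << i) * root_count(p) for i, p in P4.items())
print(total)
```

The loop touches one slice per bit of the attribute, so its length depends on the bit width rather than on the number of tuples.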

13 Algorithm of Aggregate Function AVERAGE
• AVERAGE function: The AVERAGE function returns the average value in a field. It can be calculated from COUNT and SUM: Average () = Sum () / Count (). For # Product in relation Sales, this gives 51 / 7 ≈ 7.29.

14 Algorithm of Aggregate Function MAX
• MAX function: The MAX function returns the largest value in a field.
Algorithm 4.2. Evaluating Max () with P-trees.
    max = 0.00;
    c = 0;
    P_c is set to all 1s
    For i = n to 0 {
        c = RootCount (P_c AND P_i);
        If (c >= 1) {
            P_c = P_c AND P_i;
            max = max + 2^i;
        }
    }
    Return max;

15 Algorithm of Aggregate Function MAX
# Product values: 10, 5, 6, 7, 11, 9, 3
    P_{4,3} = 1000110    P_{4,2} = 0111000    P_{4,1} = 1011101    P_{4,0} = 0101111
    Step  Test                                      Update                    Bit
    1     RootCount (P_{4,3}) = 3 >= 1              P_c = P_{4,3}             1
    2     RootCount (P_c AND P_{4,2}) = 0 < 1       P_c = P_c AND P'_{4,2}    0
    3     RootCount (P_c AND P_{4,1}) = 2 >= 1      P_c = P_c AND P_{4,1}     1
    4     RootCount (P_c AND P_{4,0}) = 1 >= 1                                1
    Max = 2^3 * 1 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 11
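The walk from the high-order bit down can be sketched as follows. The sketch assumes uncompressed bit vectors as Python integers in place of P-trees; `p_max` is a hypothetical name, and it also returns the final P_c so the matching tuple can be read off.

```python
N_TUPLES = 7
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(ptree: int) -> int:
    return bin(ptree).count("1")

def p_max(slices, n_tuples):
    """Return (maximum value, bit vector of the tuples holding it)."""
    pc = (1 << n_tuples) - 1          # P_c starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        candidate = pc & slices[i]
        if root_count(candidate) >= 1:  # some remaining tuple has bit i set
            pc = candidate
            result += 1 << i
    return result, pc

max_value, holders = p_max(P4, N_TUPLES)
print(max_value, format(holders, "07b"))
```

When the count is 0, P_c is left unchanged, which is equivalent to ANDing it with the complement slice as in step 2 of the slide.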

16 Algorithm of Aggregate Function MIN
• MIN function: The MIN function returns the smallest value in a field.
Algorithm 4.3. Evaluating Min () with P-trees.
    min = 0.00;
    c = 0;
    P_c is set to all 1s
    For i = n to 0 {
        c = RootCount (P_c AND NOT (P_i));
        If (c >= 1) {
            P_c = P_c AND NOT (P_i);
        } Else {
            min = min + 2^i;
        }
    }
    Return min;

17 Algorithm of Aggregate Function MIN
# Product values: 10, 5, 6, 7, 11, 9, 3
    P_{4,3} = 1000110    P_{4,2} = 0111000    P_{4,1} = 1011101    P_{4,0} = 0101111
    Step  Test                                       Update                    Bit
    1     RootCount (P'_{4,3}) = 4 >= 1              P_c = P'_{4,3}            0
    2     RootCount (P_c AND P'_{4,2}) = 1 >= 1      P_c = P_c AND P'_{4,2}    0
    3     RootCount (P_c AND P'_{4,1}) = 0 < 1       P_c = P_c AND P_{4,1}     1
    4     RootCount (P_c AND P'_{4,0}) = 0 < 1                                 1
    Min = 2^3 * 0 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1 = 3
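The MIN walk mirrors MAX but prefers tuples whose current bit is 0. This is a minimal sketch under the same assumption of uncompressed bit vectors stored as Python integers; `p_min` is a hypothetical name.

```python
N_TUPLES = 7
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(ptree: int) -> int:
    return bin(ptree).count("1")

def p_min(slices, n_tuples):
    """Return (minimum value, bit vector of the tuples holding it)."""
    full = (1 << n_tuples) - 1
    pc = full                          # P_c starts as all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        candidate = pc & (~slices[i] & full)   # P_c AND NOT(P_i)
        if root_count(candidate) >= 1:  # some remaining tuple has bit i clear
            pc = candidate
        else:                           # every remaining tuple has bit i set
            result += 1 << i
    return result, pc

min_value, holders = p_min(P4, N_TUPLES)
print(min_value, format(holders, "07b"))
```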

18 Algorithms of Aggregate Function MEDIAN and RANK
• Median/Rank: The MEDIAN function returns the median value in a field.
• The Rank (K) function returns the Kth largest value in a field.
Algorithm 4.4. Evaluating Median () with P-trees.
    median = 0.00;
    pos = N/2;        (rounded up; for Rank (K), pos = K)
    c = 0;
    P_c is set to all 1s for a single attribute
    For i = n to 0 {
        c = RootCount (P_c AND P_i);
        If (c >= pos) {
            median = median + 2^i;
            P_c = P_c AND P_i;
        } Else {
            pos = pos - c;
            P_c = P_c AND NOT (P_i);
        }
    }
    Return median;

19 Algorithm of Aggregate Function MEDIAN
# Product values: 10, 5, 6, 7, 11, 9, 3 (N = 7, pos = 4)
    P_{4,3} = 1000110    P_{4,2} = 0111000    P_{4,1} = 1011101    P_{4,0} = 0101111
    Step  Test                                      Update                                  Bit
    1     RootCount (P_{4,3}) = 3 < 4               P_c = P'_{4,3}, pos = 4 - 3 = 1         0
    2     RootCount (P_c AND P_{4,2}) = 3 >= 1      P_c = P_c AND P_{4,2}                   1
    3     RootCount (P_c AND P_{4,1}) = 2 >= 1      P_c = P_c AND P_{4,1}                   1
    4     RootCount (P_c AND P_{4,0}) = 1 >= 1                                              1
    Median = 2^3 * 0 + 2^2 * 1 + 2^1 * 1 + 2^0 * 1 = 7
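The rank walk can be sketched in the same style. The sketch assumes uncompressed bit vectors as Python integers in place of P-trees, and it takes the median position as N/2 rounded up, which matches pos = 4 for N = 7 in the worked steps; `p_rank` is a hypothetical name.

```python
N_TUPLES = 7
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(ptree: int) -> int:
    return bin(ptree).count("1")

def p_rank(slices, n_tuples, k):
    """Return the k-th largest value (k = 1 is the maximum)."""
    full = (1 << n_tuples) - 1
    pc, pos, value = full, k, 0
    for i in sorted(slices, reverse=True):
        candidate = pc & slices[i]
        c = root_count(candidate)
        if c >= pos:                   # the k-th largest has bit i set
            value += 1 << i
            pc = candidate
        else:                          # skip the c larger tuples, keep bit i clear
            pos -= c
            pc &= ~slices[i] & full
    return value

median = p_rank(P4, N_TUPLES, (N_TUPLES + 1) // 2)
print(median)
```

Calling `p_rank` with k = 1 or k = 7 on this data returns the maximum and minimum respectively, so MAX and MIN can be seen as special cases of RANK.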

20 Algorithm of Aggregate Function TOP-K
• TOP-K function: To get the largest k values in a field, we first find the rank-k value V_k using the function Rank (K).
• Second, we find all the tuples whose values are greater than or equal to V_k, using the ENRING technology of P-trees.
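The two steps can be sketched as follows. The slide names ENRING for the second step but does not describe it, so this sketch swaps in a generic bit-sliced "greater than or equal" comparison instead; that substitution, the uncompressed integer bit vectors, and the names `rank` and `at_least` are all assumptions.

```python
N = 7
FULL = (1 << N) - 1
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(p):
    return bin(p).count("1")

def rank(slices, k):
    """Step one: value of the k-th largest entry."""
    pc, pos, value = FULL, k, 0
    for i in sorted(slices, reverse=True):
        candidate = pc & slices[i]
        c = root_count(candidate)
        if c >= pos:
            value += 1 << i
            pc = candidate
        else:
            pos -= c
            pc &= ~slices[i] & FULL
    return value

def at_least(slices, v):
    """Step two: bit vector of tuples whose value is >= v."""
    greater, equal = 0, FULL
    for i in sorted(slices, reverse=True):
        if (v >> i) & 1:
            equal &= slices[i]             # must match the 1 bit to stay equal
        else:
            greater |= equal & slices[i]   # a 1 where v has 0 means larger
            equal &= ~slices[i] & FULL
    return greater | equal

v_k = rank(P4, 3)
top_k = format(at_least(P4, v_k), "07b")
print(v_k, top_k)
```

For k = 3 the resulting vector marks tuples 1, 5, and 6, whose # Product values are 10, 11, and 9.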

21 Iceberg Query Operation Using P-trees
• We demonstrate the computation procedure of iceberg querying with the following example:
    SELECT Loc, Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Loc, Type
    HAVING Sum (# Product) >= 15

22 Iceberg Query Operation Using P-trees (Step One)
• Step one: We build value P-trees for the 3 values, {Loc | New York, Minneapolis, Chicago}, of attribute Loc.
Figure 4. Value P-trees of Attribute Loc (tree diagrams for P_NY, P_MN, and P_CH). Read from Table 2, the corresponding bit vectors are P_NY = 1011000, P_MN = 0100101, and P_CH = 0000010.

23 Iceberg Query Operation Using P-trees (Step One)
• Figure 5 illustrates the calculation of value P-tree P_NY. Because the binary value of New York is 00001, we get formula 1.
    P_NY = P'_{1,4} AND P'_{1,3} AND P'_{1,2} AND P'_{1,1} AND P_{1,0}    (1)
    Loc bit  Basic P-tree          Operand in formula 1
    0        P_{1,4} = 0000000     P'_{1,4} = 1111111
    0        P_{1,3} = 0000000     P'_{1,3} = 1111111
    0        P_{1,2} = 0100111     P'_{1,2} = 1011000
    0        P_{1,1} = 0000010     P'_{1,1} = 1111101
    1        P_{1,0} = 1111101     P_{1,0}  = 1111101
                                   P_NY     = 1011000
Figure 5. Procedure of Calculating P_NY.
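Formula 1 generalizes to any attribute value: take each bit slice as is where the value has a 1, and complemented where it has a 0. A minimal sketch, assuming uncompressed bit vectors as Python integers in place of P-trees; `value_ptree` is a hypothetical name.

```python
N_TUPLES = 7
# Bit slices P_{1,4} .. P_{1,0} of attribute Loc, from Table 2.
P1 = {4: int("0000000", 2), 3: int("0000000", 2), 2: int("0100111", 2),
      1: int("0000010", 2), 0: int("1111101", 2)}

def value_ptree(slices, code, n_tuples):
    """AND together P_i where bit i of code is 1, and P'_i where it is 0."""
    full = (1 << n_tuples) - 1
    result = full
    for i, p in slices.items():
        result &= p if (code >> i) & 1 else (~p & full)
    return result

p_ny = value_ptree(P1, 0b00001, N_TUPLES)   # New York
p_mn = value_ptree(P1, 0b00101, N_TUPLES)   # Minneapolis
p_ch = value_ptree(P1, 0b00110, N_TUPLES)   # Chicago
print(format(p_ny, "07b"), format(p_mn, "07b"), format(p_ch, "07b"))
```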

24 Iceberg Query Operation Using P-trees (Step One)
• After getting the value P-tree for each location, we calculate the total number of products sold in each place. We again use the value New York as our example.
    Sum (# Product | New York)
        = 2^3 * RootCount (P_{4,3} AND P_NY)
        + 2^2 * RootCount (P_{4,2} AND P_NY)
        + 2^1 * RootCount (P_{4,1} AND P_NY)
        + 2^0 * RootCount (P_{4,0} AND P_NY)
        = 8 * 1 + 4 * 2 + 2 * 3 + 1 * 1 = 23    (2)
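Formula 2 is the SUM algorithm with every slice first masked by the value P-tree. The sketch below applies it to all three locations, computed directly from the bit slices of Table 2; it assumes uncompressed bit vectors as Python integers, and `masked_sum` is a hypothetical name.

```python
# Bit slices of "# Product" and the value bit vectors of attribute Loc.
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}
P_LOC = {"New York": int("1011000", 2),
         "Minneapolis": int("0100101", 2),
         "Chicago": int("0000010", 2)}

def root_count(ptree: int) -> int:
    return bin(ptree).count("1")

def masked_sum(slices, mask):
    """Sum of the attribute over the tuples selected by mask."""
    return sum((1 << i) * root_count(p & mask) for i, p in slices.items())

totals = {loc: masked_sum(P4, mask) for loc, mask in P_LOC.items()}
print(totals)
```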

25 Iceberg Query Operation Using P-trees (Step One)
• Table 3 shows the total number of products sold in each of the three locations. Because our threshold T is 15, we eliminate the city Chicago.
Table 3. Summary Table of Attribute Loc.
    Loc Values     Sum (# Product)    Threshold
    New York       23                 Y
    Minneapolis    19                 Y
    Chicago        9                  N

26 Iceberg Query Operation Using P-trees (Step Two)
• Step two: Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values, {Type | Notebook, Desktop, Printer, Fax}. Figure 6 shows the value P-trees of the four values of attribute Type.
    P_Notebook = 1001100
    P_Desktop  = 0100010
    P_Printer  = 0010000
    P_Fax      = 0000001
Figure 6. Value P-trees of Attribute Type.

27 Iceberg Query Operation Using P-trees (Step Two)
• Similarly, we get the summary table for each value of attribute Type. With the threshold T equal to 15, only the value P-tree of Notebook will be used in the next step.
Table 4. Summary Table of Attribute Type.
    Type Values    Sum (# Product)    Threshold
    Notebook       28                 Y
    Desktop        14                 N
    Fax            3                  N
    Printer        6                  N

28 Iceberg Query Operation Using P-trees (Step Three)
• Step three: We only generate candidate Loc and Type pairs from the Loc values and Type values that passed the threshold T. By performing an AND operation on P_NY and P_Notebook, we obtain the value P-tree P_{NY AND Notebook}.
    P_NY                 = 1011000
    P_Notebook           = 1001100
    P_{NY AND Notebook}  = 1001000
Figure 7. Procedure of Calculating P_{NY AND Notebook}.

29 Iceberg Query Operation Using P-trees (Step Three)
• We calculate the total number of notebooks sold in New York by formula 3.
    Sum (# Product | New York AND Notebook)
        = 2^3 * RootCount (P_{4,3} AND P_{NY AND Notebook})
        + 2^2 * RootCount (P_{4,2} AND P_{NY AND Notebook})
        + 2^1 * RootCount (P_{4,1} AND P_{NY AND Notebook})
        + 2^0 * RootCount (P_{4,0} AND P_{NY AND Notebook})
        = 8 * 1 + 4 * 1 + 2 * 2 + 1 * 1 = 17    (3)

30 Iceberg Query Operation Using P-trees (Step Three)
• By performing an AND operation on P_MN and P_Notebook, we obtain the value P-tree P_{MN AND Notebook}.
    P_MN                 = 0100101
    P_Notebook           = 1001100
    P_{MN AND Notebook}  = 0000100
Figure 8. Procedure of Calculating P_{MN AND Notebook}.

31 Iceberg Query Operation Using P-trees (Step Three)
• We calculate the total number of notebooks sold in Minneapolis by formula 4.
    Sum (# Product | Minneapolis AND Notebook)
        = 2^3 * RootCount (P_{4,3} AND P_{MN AND Notebook})
        + 2^2 * RootCount (P_{4,2} AND P_{MN AND Notebook})
        + 2^1 * RootCount (P_{4,1} AND P_{MN AND Notebook})
        + 2^0 * RootCount (P_{4,0} AND P_{MN AND Notebook})
        = 8 * 1 + 4 * 0 + 2 * 1 + 1 * 1 = 11    (4)

32 Iceberg Query Operation Using P-trees (Step Three)
• Finally, we obtain summary Table 5. With the threshold T = 15, only the group pair "New York AND Notebook" passes. From the value P-tree P_{NY AND Notebook} = 1001000, we can see that tuples 1 and 4 are in the result of our iceberg query example.
Table 5. Summary Table of Our Example.
    Loc and Type Values         Sum (# Product)    Threshold
    New York AND Notebook       17                 Y
    Minneapolis AND Notebook    11                 N
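Step three as a whole can be sketched in a few lines. The sketch assumes uncompressed bit vectors as Python integers in place of P-trees, and starts from the Loc and Type vectors that survived steps one and two; `masked_sum` is a hypothetical name.

```python
T = 15
P4 = {3: int("1000110", 2), 2: int("0111000", 2),
      1: int("1011101", 2), 0: int("0101111", 2)}
# Value bit vectors that passed the threshold in steps one and two.
P_LOC = {"New York": int("1011000", 2), "Minneapolis": int("0100101", 2)}
P_NOTEBOOK = int("1001100", 2)

def root_count(ptree: int) -> int:
    return bin(ptree).count("1")

def masked_sum(mask):
    """Sum of "# Product" over the tuples selected by mask."""
    return sum((1 << i) * root_count(p & mask) for i, p in P4.items())

result = {}
for loc, p_loc in P_LOC.items():
    mask = p_loc & P_NOTEBOOK                 # P_{Loc AND Notebook}
    total = masked_sum(mask)
    if total >= T:
        # Leftmost bit is tuple 1, so 1-based positions are tuple ids.
        ids = [pos + 1 for pos, bit in enumerate(format(mask, "07b")) if bit == "1"]
        result[(loc, "Notebook")] = (total, ids)
print(result)
```

The surviving mask doubles as the answer set: its 1 bits identify the tuples that make up the group.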

33 Performance Analysis
Figure 15. Iceberg Query with Multi-attribute Aggregation: Performance Time Comparison.

34 Performance Analysis
• Our experiments are implemented in C++ on a 1 GHz Pentium PC with 1 GB of main memory running Red Hat Linux.
• In Figure 15, we compare the running time of the P-tree method and the bitmap method on a multi-attribute iceberg query. In this case, the P-tree method is substantially faster.

35 Conclusion
• We believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and for multi-attribute iceberg queries.
• It also shows the advantages of the basic P-tree representation of files. First, there is no need for redundant, auxiliary structures. Second, basic P-trees are good at calculating multi-attribute aggregations and numeric values, and are fair to all attributes.

36 Thank you !

