Aggregate Function Computation and Iceberg Querying in Vertical Databases
Yue (Jenny) Cui
Advisor: Dr. William Perrizo
Master Thesis Oral Defense
Department of Computer Science
North Dakota State University

Introduction
An aggregate on a set T is a functional on the power set $2^T$ (i.e., a map $F: 2^T \to \mathbb{R}$, where $\mathbb{R}$ is the set of real numbers). Common aggregates include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K.
There are three types of aggregate functions. Let T be a set, let G be a numeric aggregate (i.e., it aggregates a set of numbers into one number), and let $S = \{S_i\}_{i=1..n}$ be a partition of T (i.e., collectively exhaustive and mutually exclusive: $\bigcup_{i=1..n} S_i = T$ and $S_i \cap S_j = \emptyset$ for all $i \ne j$).
1. Distributive Aggregates: An aggregate F of T is G-distributive if, for every partition S of T, G-aggregating the F-aggregates of the parts of S is the same as F-aggregating T (i.e., $F(T) = G(\{F(S_i)\}_{i=1..n})$ for all $S = \{S_i\}$).
– SUM and COUNT are SUM-distributive (F = SUM or F = COUNT, G = SUM)
– MIN is MIN-distributive
– MAX is MAX-distributive
An aggregate F is self-distributive iff it is F-distributive.
– e.g., SUM, MIN, MAX, but not COUNT
– What about AVG, MEDIAN, RANK, TOP-K?
2. Algebraic Aggregates: An aggregate F of T is algebraic if there is an M-tuple-valued function K and a function H such that $F(T) = H(\{K(S_i)\}_{i=1..n})$. Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic.
3. Holistic Aggregates: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
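The difference between distributive and algebraic aggregates can be checked on a small partition. The sketch below is only an illustration: the three-way split of the values is an arbitrary assumption, and the lowercase names `k` and `h` stand in for the functions K and H of the definition.

```python
# Illustrative partition of the values 10, 5, 6, 7, 11, 9, 3 (the split is arbitrary).
partition = [[10, 5], [6, 7, 11], [9, 3]]
whole = [v for part in partition for v in part]

# SUM is SUM-distributive: summing the per-part sums gives the overall sum.
assert sum(sum(part) for part in partition) == sum(whole)

# COUNT is SUM-distributive (not COUNT-distributive): per-part counts are summed.
assert sum(len(part) for part in partition) == len(whole)

# MAX is MAX-distributive.
assert max(max(part) for part in partition) == max(whole)

def k(part):
    """Sub-aggregate K(S_i): a fixed-size tuple (sum, count)."""
    return (sum(part), len(part))

def h(sub_aggregates):
    """Combiner H: total sum divided by total count."""
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count

# AVERAGE is algebraic: it is recovered from the 2-tuples K(S_i).
average = h([k(part) for part in partition])
```

A holistic aggregate such as MEDIAN has no comparable fixed-size sub-aggregate, which is why it needs a different treatment.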

Review of Iceberg Queries
Iceberg queries compute aggregate functions over groups of attribute values and then eliminate the aggregate values that fall below a specified threshold. We use the following example:
    SELECT Location, Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location, Product Type
    HAVING Sum (# Product) >= T
We illustrate the calculation procedure in three steps.
Step one: Generate the Location-list.
    SELECT Location, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location
    HAVING Sum (# Product) >= T
Step two: Generate the Product Type-list.
    SELECT Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Product Type
    HAVING Sum (# Product) >= T
Step three: Generate the Location and Product Type pair groups. Using the Location-list and the Product Type-list generated in the first two steps, we can eliminate many of the Location and Product Type pair groups according to the threshold T: a pair can reach the threshold only if both its Location and its Product Type reach it on their own.
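The three steps can be sketched on ordinary rows before any P-trees are involved. This is a minimal illustration, not the thesis implementation: the rows are the (Loc, Type, # Product) columns of the Sales example used in this deck, the function names are hypothetical, and the pruning is only valid because the summed quantities are non-negative.

```python
from collections import defaultdict

# (location, product_type, quantity) rows of the Sales example.
SALES = [
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6), ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]

def group_sums(rows, key):
    """SUM(quantity) per group, where key(row) names the group."""
    sums = defaultdict(int)
    for row in rows:
        sums[key(row)] += row[2]
    return dict(sums)

def iceberg(rows, threshold):
    # Steps one and two: single-attribute lists that pass the threshold.
    locs = {g for g, s in group_sums(rows, lambda r: r[0]).items() if s >= threshold}
    types = {g for g, s in group_sums(rows, lambda r: r[1]).items() if s >= threshold}
    # Step three: aggregate only pairs whose two components both survived.
    candidates = [r for r in rows if r[0] in locs and r[1] in types]
    pairs = group_sums(candidates, lambda r: (r[0], r[1]))
    return {g: s for g, s in pairs.items() if s >= threshold}

result = iceberg(SALES, 15)
```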

Algorithms of Aggregate Function Computation Using P-trees
We use the data in relation Sales to illustrate the aggregate function algorithms.

Id  Mon  Loc          Type      On line  # Product
1   Jan  New York     Notebook  Y        10
2   Jan  Minneapolis  Desktop   N        5
3   Feb  New York     Printer   Y        6
4   Mar  New York     Notebook  Y        7
5   Mar  Minneapolis  Notebook  Y        11
6   Mar  Chicago      Desktop   Y        9
7   Apr  Minneapolis  Fax       N        3

Table 1. Relation Sales.

Algorithms of Aggregate Function Computation Using P-trees (Cont.)
Table 2 shows the binary representation of the data in relation Sales. Each attribute is split into bit columns, written from the most significant bit to the least significant bit.

Id  Mon           Loc           Type          On line  # Product
    P0,3 .. P0,0  P1,4 .. P1,0  P2,2 .. P2,0  P3,0     P4,3 .. P4,0
1   0001          00001         001           1        1010
2   0001          00101         010           0        0101
3   0010          00001         100           1        0110
4   0011          00001         001           1        0111
5   0011          00101         001           1        1011
6   0011          00110         010           1        1001
7   0100          00101         101           0        0011

Table 2. Binary Form of Sales.
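The vertical decomposition behind the binary form can be sketched as follows. The sketch only produces the uncompressed bit columns of # Product as strings; an actual P-tree compresses each column into a tree, which is not modeled here, and the function name `bit_slice` is hypothetical.

```python
PRODUCT = [10, 5, 6, 7, 11, 9, 3]  # '# Product' column, tuples 1..7

def bit_slice(values, i):
    """Bit i of every value, in tuple order, as a '0'/'1' string."""
    return "".join(str((v >> i) & 1) for v in values)

# One vertical bit column per bit position, most significant first.
slices = {f"P4,{i}": bit_slice(PRODUCT, i) for i in range(3, -1, -1)}
for name, bits in slices.items():
    print(name, bits)
```

Reading the four strings column-wise gives back the 4-bit codes of the # Product values (for tuple 1: 1, 0, 1, 0, i.e. 10).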

Algorithm of Aggregate Function COUNT
COUNT function: No special function needs to be written for COUNT, because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree $P_i$, RootCount($P_i$) returns the number of 1s in $P_i$. The bit columns referred to here are those of Table 2 (Binary Form of Sales).
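As a small illustration, RootCount can be modeled as a population count on an uncompressed bit vector. Representing a P-tree as a Python integer is an assumption of this sketch (a real P-tree keeps the count at its root node), and `root_count` is a hypothetical helper name.

```python
def root_count(p):
    """Number of 1 bits in a bit vector stored as a non-negative int."""
    return bin(p).count("1")

# 'On line' bit column P3,0 of the binary Sales table, tuple 1 leftmost.
P3_0 = int("1011110", 2)

# COUNT of tuples with On line = Y is just the root count of P3,0.
online_count = root_count(P3_0)
```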

Algorithm of Aggregate Function SUM
SUM function: The Sum function totals a field of numerical values.
Algorithm 4.1. Evaluating sum() with P-trees.
    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount(P_i);
    }
    Return total
Algorithm 4.1. Sum Aggregate.

Algorithm of Aggregate Function SUM
For example, to find the total number of products sold in relation Sales, we apply the procedure to the bit columns of # Product (values 10, 5, 6, 7, 11, 9, 3):

P4,3 = 1000110   RootCount = 3
P4,2 = 0111000   RootCount = 3
P4,1 = 1011101   RootCount = 5
P4,0 = 0101111   RootCount = 5

$2^3 \cdot 3 + 2^2 \cdot 3 + 2^1 \cdot 5 + 2^0 \cdot 5 = 51$
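The sum computation can be sketched in a few lines. As before, the bit columns are modeled as plain integers rather than compressed P-trees, and the names `SLICES`, `root_count` and `ptree_sum` are assumptions of this sketch.

```python
# Bit columns P4,3 .. P4,0 of '# Product', keyed by bit position, tuple 1 leftmost.
SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
          1: int("1011101", 2), 0: int("0101111", 2)}

def root_count(p):
    """Number of 1 bits in a bit vector stored as an int."""
    return bin(p).count("1")

def ptree_sum(slices):
    """Weight each bit column's root count by 2^i and add the results."""
    return sum((1 << i) * root_count(p) for i, p in slices.items())

total = ptree_sum(SLICES)
```

No individual tuple is reconstructed: the total comes from four root counts only.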

Algorithm of Aggregate Function AVERAGE
Average function: The Average function returns the average value of a field. It is calculated from the COUNT and SUM functions: Average() = Sum() / Count(). For the # Product field of relation Sales, Average() = 51 / 7 ≈ 7.29.

Algorithm of Aggregate Function MAX
Max function: The Max function returns the largest value in a field.
Algorithm 4.2. Evaluating max() with P-trees.
    max = 0.00;
    c = 0;
    P_c is set to all 1s
    For i = n to 0 {
        c = RootCount(P_c AND P_i);
        If (c >= 1) {
            P_c = P_c AND P_i;
            max = max + 2^i;
        }
    }
    Return max;
Algorithm 4.2. Max Aggregate.

Algorithm of Aggregate Function MAX
Bit columns of # Product (values 10, 5, 6, 7, 11, 9, 3): P4,3 = 1000110, P4,2 = 0111000, P4,1 = 1011101, P4,0 = 0101111.

Step  Test                                      Action                   Bit
1     P_c = P4,3; RootCount(P_c) = 3 >= 1                                1
2     RootCount(P_c AND P4,2) = 0 < 1           P_c = P_c AND P'4,2      0
3     RootCount(P_c AND P4,1) = 2 >= 1          P_c = P_c AND P4,1       1
4     RootCount(P_c AND P4,0) = 1 >= 1                                   1

$2^3 \cdot 1 + 2^2 \cdot 0 + 2^1 \cdot 1 + 2^0 \cdot 1 = 11$
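The max procedure can be sketched as follows, with bit columns modeled as integers (an assumption; no P-tree compression is modeled) and hypothetical helper names.

```python
# Bit columns P4,3 .. P4,0 of '# Product', tuple 1 leftmost.
SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
          1: int("1011101", 2), 0: int("0101111", 2)}
N_TUPLES = 7

def root_count(p):
    return bin(p).count("1")

def ptree_max(slices, n_tuples):
    pc = (1 << n_tuples) - 1          # candidate set: all tuples
    result = 0
    for i in sorted(slices, reverse=True):   # most significant bit first
        candidate = pc & slices[i]
        if root_count(candidate) >= 1:
            # Some candidate has bit i set, so the maximum has it too.
            pc = candidate
            result += 1 << i
    return result

largest = ptree_max(SLICES, N_TUPLES)
```

When no candidate has bit i set, the candidate set is left unchanged, which has the same effect as ANDing with the complemented column in step 2 of the worked example.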

Algorithm of Aggregate Function MIN
Min function: The Min function returns the smallest value in a field.
Algorithm 4.3. Evaluating min() with P-trees.
    min = 0.00;
    c = 0;
    P_c is set to all 1s
    For i = n to 0 {
        c = RootCount(P_c AND NOT(P_i));
        If (c >= 1)
            P_c = P_c AND NOT(P_i);
        Else
            min = min + 2^i;
    }
    Return min;
Algorithm 4.3. Min Aggregate.

Algorithm of Aggregate Function MIN
Bit columns of # Product (values 10, 5, 6, 7, 11, 9, 3): P4,3 = 1000110, P4,2 = 0111000, P4,1 = 1011101, P4,0 = 0101111.

Step  Test                                      Action                   Bit
1     P_c = P'4,3; RootCount(P_c) = 4 >= 1                               0
2     RootCount(P_c AND P'4,2) = 1 >= 1         P_c = P_c AND P'4,2      0
3     RootCount(P_c AND P'4,1) = 0 < 1          P_c = P_c AND P4,1       1
4     RootCount(P_c AND P'4,0) = 0 < 1                                   1

$2^3 \cdot 0 + 2^2 \cdot 0 + 2^1 \cdot 1 + 2^0 \cdot 1 = 3$
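The min procedure mirrors the max procedure with complemented bit columns. The sketch below again models bit columns as integers, which is an assumption; the complement is masked to the number of tuples so that it stays a 7-bit vector.

```python
# Bit columns P4,3 .. P4,0 of '# Product', tuple 1 leftmost.
SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
          1: int("1011101", 2), 0: int("0101111", 2)}
N_TUPLES = 7

def root_count(p):
    return bin(p).count("1")

def ptree_min(slices, n_tuples):
    full = (1 << n_tuples) - 1
    pc = full                          # candidate set: all tuples
    result = 0
    for i in sorted(slices, reverse=True):   # most significant bit first
        candidate = pc & ~slices[i] & full   # candidates with bit i = 0
        if root_count(candidate) >= 1:
            pc = candidate             # the minimum has bit i = 0
        else:
            result += 1 << i           # every candidate has bit i = 1
    return result

smallest = ptree_min(SLICES, N_TUPLES)
```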

Algorithms of Aggregate Function MEDIAN and RANK
Median/Rank: The Median function returns the median value in a field. The Rank(K) function returns the Kth largest value in a field.
Algorithm 4.4. Evaluating median() with P-trees.
    median = 0.00;
    pos = N/2;    (for Rank, pos = K)
    c = 0;
    P_c is set to all 1s (for a single attribute)
    For i = n to 0 {
        c = RootCount(P_c AND P_i);
        If (c >= pos) {
            median = median + 2^i;
            P_c = P_c AND P_i;
        }
        Else {
            pos = pos - c;
            P_c = P_c AND NOT(P_i);
        }
    }
    Return median;
Algorithm 4.4. Median Aggregate.

Algorithm of Aggregate Function MEDIAN
Bit columns of # Product (values 10, 5, 6, 7, 11, 9, 3): P4,3 = 1000110, P4,2 = 0111000, P4,1 = 1011101, P4,0 = 0101111. With N = 7 tuples, the example starts from pos = 4.

Step  Test                                      Action                              Bit
1     P_c = P4,3; RootCount(P_c) = 3 < 4        P_c = P'4,3; pos = 4 - 3 = 1        0
2     RootCount(P_c AND P4,2) = 3 >= 1          P_c = P_c AND P4,2                  1
3     RootCount(P_c AND P4,1) = 2 >= 1          P_c = P_c AND P4,1                  1
4     RootCount(P_c AND P4,0) = 1 >= 1                                              1

$2^3 \cdot 0 + 2^2 \cdot 1 + 2^1 \cdot 1 + 2^0 \cdot 1 = 7$
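Rank and median can be sketched with one function. Bit columns are modeled as integers (an assumption), and the median position is taken as (N + 1) // 2, a rounding choice made here so that N = 7 gives pos = 4 as in the worked example; the source states only pos = N/2.

```python
# Bit columns P4,3 .. P4,0 of '# Product', tuple 1 leftmost.
SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
          1: int("1011101", 2), 0: int("0101111", 2)}
N_TUPLES = 7

def root_count(p):
    return bin(p).count("1")

def ptree_rank(slices, n_tuples, k):
    """k-th largest value (k = 1 is the maximum)."""
    full = (1 << n_tuples) - 1
    pc, pos, result = full, k, 0
    for i in sorted(slices, reverse=True):   # most significant bit first
        ones = pc & slices[i]
        c = root_count(ones)
        if c >= pos:
            # The k-th largest candidate has bit i set.
            result += 1 << i
            pc = ones
        else:
            # Skip the c larger candidates and continue among bit i = 0.
            pos -= c
            pc = pc & ~slices[i] & full
    return result

def ptree_median(slices, n_tuples):
    return ptree_rank(slices, n_tuples, (n_tuples + 1) // 2)

median = ptree_median(SLICES, N_TUPLES)
```

With k = 1 the same function returns the maximum, and with k = N the minimum.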

Algorithm of Aggregate Function TOP-K
Top-k function: To get the largest k values in a field, we first find the rank-k value $V_k$ using the function Rank(K). Second, we find all the tuples whose values are greater than or equal to $V_k$, using the ENRING technology of P-trees.
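The two steps of Top-k can be sketched as below. The second step here is not the ENRING technique named on the slide, whose details the deck does not give; it is swapped for a generic bit-sliced "greater than or equal" evaluation using only AND, OR and NOT on the bit columns. All names are hypothetical and bit columns are modeled as integers.

```python
SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
          1: int("1011101", 2), 0: int("0101111", 2)}
N_TUPLES = 7

def root_count(p):
    return bin(p).count("1")

def rank_value(slices, n, k):
    """Step 1: k-th largest value, as in the rank procedure."""
    full = (1 << n) - 1
    pc, pos, result = full, k, 0
    for i in sorted(slices, reverse=True):
        ones = pc & slices[i]
        c = root_count(ones)
        if c >= pos:
            result, pc = result + (1 << i), ones
        else:
            pos, pc = pos - c, pc & ~slices[i] & full
    return result

def at_least(slices, n, c):
    """Step 2 (generic stand-in): bit vector of tuples with value >= c."""
    full = (1 << n) - 1
    gt, eq = 0, full      # strictly greater so far / equal so far
    for i in sorted(slices, reverse=True):
        if (c >> i) & 1:
            eq &= slices[i]
        else:
            gt |= eq & slices[i]
            eq &= ~slices[i] & full
    return gt | eq

def top_k(slices, n, k):
    vk = rank_value(slices, n, k)
    mask = at_least(slices, n, vk)
    ids = [t + 1 for t, b in enumerate(format(mask, f"0{n}b")) if b == "1"]
    return vk, ids

vk, tuple_ids = top_k(SLICES, N_TUPLES, 3)
```

If several tuples share the value $V_k$, the mask contains more than k tuples.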

Iceberg Query Operation Using P-trees
We demonstrate the computation procedure of iceberg querying with the following example:
    SELECT Loc, Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Loc, Type
    HAVING Sum (# Product) >= 15

Iceberg Query Operation Using P-trees (Step One)
Step one: We build value P-trees for the 3 values, {Loc | New York, Minneapolis, Chicago}, of attribute Loc. Written as uncompressed bit vectors over tuples 1 to 7:

P_NY = 1011000
P_MN = 0100101
P_CH = 0000010

Figure 4. Value P-trees of Attribute Loc.

Iceberg Query Operation Using P-trees (Step One)
Figure 5 illustrates the calculation procedure of the value P-tree P_NY. Because the binary value of New York is 00001, we get formula 1.

$P_{NY} = P'_{1,4} \text{ AND } P'_{1,3} \text{ AND } P'_{1,2} \text{ AND } P'_{1,1} \text{ AND } P_{1,0}$  (1)

Loc bit   Bit column           Operand used
0         P1,4 = 0000000       P'1,4 = 1111111
0         P1,3 = 0000000       P'1,3 = 1111111
0         P1,2 = 0100111       P'1,2 = 1011000
0         P1,1 = 0000010       P'1,1 = 1111101
1         P1,0 = 1111101       P1,0  = 1111101
                               P_NY  = 1011000

Figure 5. Procedure of Calculating P_NY.
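The construction of a value P-tree can be sketched as an AND over bit columns, complementing the columns where the value's code has a 0. Bit columns are modeled as integers (an assumption), and `value_ptree` is a hypothetical name.

```python
# Bit columns P1,4 .. P1,0 of attribute Loc, tuple 1 leftmost.
LOC_SLICES = {4: int("0000000", 2), 3: int("0000000", 2),
              2: int("0100111", 2), 1: int("0000010", 2),
              0: int("1111101", 2)}
N_TUPLES = 7

def value_ptree(slices, n_tuples, code):
    """AND of P_i where bit i of code is 1, and of NOT(P_i) where it is 0."""
    full = (1 << n_tuples) - 1
    result = full
    for i, p in slices.items():
        result &= p if (code >> i) & 1 else (~p & full)
    return result

P_NY = value_ptree(LOC_SLICES, N_TUPLES, 0b00001)  # New York
P_MN = value_ptree(LOC_SLICES, N_TUPLES, 0b00101)  # Minneapolis
P_CH = value_ptree(LOC_SLICES, N_TUPLES, 0b00110)  # Chicago

print(format(P_NY, "07b"), format(P_MN, "07b"), format(P_CH, "07b"))
```

The three resulting vectors are mutually exclusive and together cover all seven tuples, as expected for the distinct values of one attribute.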

Iceberg Query Operation Using P-trees (Step One)
After getting the value P-trees for every location, we calculate the total number of products sold in each place. We again use the value New York as our example.

$\mathrm{Sum}(\#\text{Product} \mid \text{New York}) = 2^3 \cdot \mathrm{RootCount}(P_{4,3} \text{ AND } P_{NY}) + 2^2 \cdot \mathrm{RootCount}(P_{4,2} \text{ AND } P_{NY}) + 2^1 \cdot \mathrm{RootCount}(P_{4,1} \text{ AND } P_{NY}) + 2^0 \cdot \mathrm{RootCount}(P_{4,0} \text{ AND } P_{NY}) = 8 \cdot 1 + 4 \cdot 2 + 2 \cdot 3 + 1 \cdot 1 = 23$  (2)

Iceberg Query Operation Using P-trees (Step One)
Table 3 shows the total number of products sold in each of the three locations. Because our threshold T is 15, we eliminate the city Chicago.

Loc Values   Sum (# Product)  Passes threshold
New York     23               Y
Minneapolis  19               Y
Chicago      9                N

Table 3. Summary Table of Attribute Loc.
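The per-location sums can be sketched by masking every # Product bit column with a location's value P-tree before taking the root count. Bit vectors are modeled as integers (an assumption), the helper names are hypothetical, and the Chicago vector is derived from the Loc column of relation Sales.

```python
# Bit columns P4,3 .. P4,0 of '# Product', tuple 1 leftmost.
PRODUCT_SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
                  1: int("1011101", 2), 0: int("0101111", 2)}
# Value P-trees of attribute Loc as uncompressed bit vectors.
LOC = {"New York": int("1011000", 2),
       "Minneapolis": int("0100101", 2),
       "Chicago": int("0000010", 2)}
THRESHOLD = 15

def root_count(p):
    return bin(p).count("1")

def masked_sum(slices, mask):
    """SUM over the tuples selected by mask, using root counts only."""
    return sum((1 << i) * root_count(p & mask) for i, p in slices.items())

loc_sums = {name: masked_sum(PRODUCT_SLICES, mask) for name, mask in LOC.items()}
passing = [name for name, s in loc_sums.items() if s >= THRESHOLD]
```

Because the location vectors partition the tuples, the per-location sums add up to the overall sum of 51.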

Iceberg Query Operation Using P-trees (Step Two)
Step two: Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values {Type | Notebook, Desktop, Printer, Fax}. Figure 6 shows the value P-trees of the four values of attribute Type.

P_Notebook = 1001100
P_Desktop  = 0100010
P_Printer  = 0010000
P_Fax      = 0000001

Figure 6. Value P-trees of Attribute Type.

Iceberg Query Operation Using P-trees (Step Two)
Similarly, we get the summary table for each value of attribute Type. With the threshold T equal to 15, only the value P-tree of Notebook will be used in the following step.

Type Values  Sum (# Product)  Passes threshold
Notebook     28               Y
Desktop      14               N
Fax          3                N
Printer      6                N

Table 4. Summary Table of Attribute Type.

Iceberg Query Operation Using P-trees (Step Three)
Step three: We generate candidate Loc and Type pairs only from the Loc values and Type values that passed the threshold T in the first two steps. By performing the AND operation on P_NY with P_Notebook, we obtain the value P-tree P_NY AND Notebook.

P_NY               = 1011000
P_Notebook         = 1001100
P_NY AND Notebook  = 1001000

Figure 7. Procedure of Calculating P_NY AND Notebook.

Iceberg Query Operation Using P-trees (Step Three)
We calculate the total number of notebooks sold in New York by formula 3.

$\mathrm{Sum}(\#\text{Product} \mid \text{New York AND Notebook}) = 2^3 \cdot \mathrm{RootCount}(P_{4,3} \text{ AND } P_{NY \text{ AND } Notebook}) + 2^2 \cdot \mathrm{RootCount}(P_{4,2} \text{ AND } P_{NY \text{ AND } Notebook}) + 2^1 \cdot \mathrm{RootCount}(P_{4,1} \text{ AND } P_{NY \text{ AND } Notebook}) + 2^0 \cdot \mathrm{RootCount}(P_{4,0} \text{ AND } P_{NY \text{ AND } Notebook}) = 8 \cdot 1 + 4 \cdot 1 + 2 \cdot 2 + 1 \cdot 1 = 17$  (3)

Iceberg Query Operation Using P-trees (Step Three)
By performing the AND operation on P_MN with P_Notebook, we obtain the value P-tree P_MN AND Notebook.

P_MN               = 0100101
P_Notebook         = 1001100
P_MN AND Notebook  = 0000100

Figure 8. Procedure of Calculating P_MN AND Notebook.

Iceberg Query Operation Using P-trees (Step Three)
We calculate the total number of notebooks sold in Minneapolis by formula 4.

$\mathrm{Sum}(\#\text{Product} \mid \text{Minneapolis AND Notebook}) = 2^3 \cdot \mathrm{RootCount}(P_{4,3} \text{ AND } P_{MN \text{ AND } Notebook}) + 2^2 \cdot \mathrm{RootCount}(P_{4,2} \text{ AND } P_{MN \text{ AND } Notebook}) + 2^1 \cdot \mathrm{RootCount}(P_{4,1} \text{ AND } P_{MN \text{ AND } Notebook}) + 2^0 \cdot \mathrm{RootCount}(P_{4,0} \text{ AND } P_{MN \text{ AND } Notebook}) = 8 \cdot 1 + 4 \cdot 0 + 2 \cdot 1 + 1 \cdot 1 = 11$  (4)

Iceberg Query Operation Using P-trees (Step Three)
Finally, we obtain the summary Table 5. With the threshold T = 15, only the group pair "New York AND Notebook" passes the threshold. From the value P-tree P_NY AND Notebook = 1001000, we can see that tuples 1 and 4 are in the result of our iceberg query example.

Group                     Sum (# Product)  Passes threshold
New York AND Notebook     17               Y
Minneapolis AND Notebook  11               N

Table 5. Summary Table of Our Example.
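The whole three-step procedure can be sketched end to end on the example. This is an illustration under stated assumptions, not the thesis implementation: P-trees are modeled as uncompressed integer bit vectors, the value P-trees are typed in directly rather than built, and the function names are hypothetical.

```python
PRODUCT_SLICES = {3: int("1000110", 2), 2: int("0111000", 2),
                  1: int("1011101", 2), 0: int("0101111", 2)}
LOC = {"New York": int("1011000", 2), "Minneapolis": int("0100101", 2),
       "Chicago": int("0000010", 2)}
TYPE = {"Notebook": int("1001100", 2), "Desktop": int("0100010", 2),
        "Printer": int("0010000", 2), "Fax": int("0000001", 2)}

def root_count(p):
    return bin(p).count("1")

def masked_sum(slices, mask):
    """SUM of '# Product' over the tuples selected by mask."""
    return sum((1 << i) * root_count(p & mask) for i, p in slices.items())

def iceberg(loc, typ, slices, threshold):
    # Steps one and two: keep single values whose sum reaches the threshold.
    locs = {k: m for k, m in loc.items() if masked_sum(slices, m) >= threshold}
    types = {k: m for k, m in typ.items() if masked_sum(slices, m) >= threshold}
    # Step three: AND the surviving value P-trees pairwise and re-check.
    result = {}
    for l_name, l_mask in locs.items():
        for t_name, t_mask in types.items():
            pair_mask = l_mask & t_mask
            s = masked_sum(slices, pair_mask)
            if s >= threshold:
                result[(l_name, t_name)] = (s, format(pair_mask, "07b"))
    return result

answer = iceberg(LOC, TYPE, PRODUCT_SLICES, 15)
```

Only 2 of the 12 possible (Loc, Type) pairs are ever ANDed in this run, which is the pruning effect the three steps are designed to achieve.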

Performance Analysis
Figure 15. Performance Time Comparison for Iceberg Queries with Multi-Attribute Aggregation.

Performance Analysis
Our experiments are implemented in C++ on a 1 GHz Pentium PC with 1 GB of main memory running Red Hat Linux. In Figure 15, we compare the running time of the P-tree method and the bitmap method when calculating a multi-attribute iceberg query. In this case, the P-tree method proved to be substantially faster.

Conclusion
We believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and for multi-attribute iceberg queries. It also shows the advantages of basic P-tree representations of files:
– First, there is no need for redundant, auxiliary structures.
– Second, basic P-trees are good at calculating multi-attribute aggregations, handle numeric values well, and are fair to all attributes.

Thank you !
