Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.


Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department of Computer Science North Dakota State University

Introduction

An aggregate on T is a functional on $2^T$ (i.e., a map $F: 2^T \to R$, where R is the set of real numbers). Common aggregates include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K.

Let T be a set, let G be a numeric aggregate (i.e., one that aggregates a set of numbers into one number), and let $S = \{S_i\}_{i=1 \ldots n}$ be a partition of T (i.e., collectively exhaustive and mutually exclusive: $\bigcup_{i=1 \ldots n} S_i = T$ and $S_i \cap S_j = \emptyset$ for all $i \neq j$). There are 3 types of aggregate functions:

1. Distributive Aggregates: An aggregate F of T is G-distributive if, for every partition S of T, G-aggregating the F-aggregates of S is the same as F-aggregating T (i.e., $F(T) = G(\{F(S_i)\}_{i=1 \ldots n})$ for every partition $S = \{S_i\}$).
– SUM and COUNT are SUM-distributive (F = SUM or F = COUNT, G = SUM)
– MIN is MIN-distributive
– MAX is MAX-distributive
An aggregate F is self-distributive iff it is F-distributive.
– e.g., SUM, MIN, MAX, but not COUNT
– What about AVG, MEDIAN, RANK, TOP-K?

2. Algebraic Aggregates: An aggregate F of T is algebraic if there is an M-tuple-valued function K and a function H such that $F(T) = H(\{K(S_i)\}_{i=1 \ldots n})$. Average, Standard Deviation, MaxN, MinN, and Center_of_Mass are all algebraic.

3. Holistic Aggregates: An aggregate function F is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. Median, MostFrequent (also called the Mode), and Rank are common examples of holistic functions.
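The definitions above can be checked on a small example. The following is a minimal sketch, assuming the seven "# Product" values from the Sales relation used later in the slides and one arbitrary partition chosen only for illustration; the names `K` and `H` mirror the symbols in the definition of algebraic aggregates.

```python
# Column "# Product" from relation Sales, with one arbitrary (hypothetical) partition.
T = [10, 5, 6, 7, 11, 9, 3]
S = [T[0:3], T[3:5], T[5:7]]

# Distributive: SUM of the partial SUMs equals SUM over T (F = G = SUM).
sum_of_sums = sum(sum(part) for part in S)

# COUNT is SUM-distributive (F = COUNT, G = SUM) ...
sum_of_counts = sum(len(part) for part in S)
# ... but not self-distributive: COUNT of the partial COUNTs is only the number of parts.
count_of_counts = len([len(part) for part in S])

# Algebraic: AVERAGE via K(S_i) = (sum, count) and H = total sum / total count.
def K(part):
    return (sum(part), len(part))

def H(pairs):
    total = sum(s for s, _ in pairs)
    n = sum(c for _, c in pairs)
    return total / n

average = H([K(part) for part in S])
```

Here K returns a 2-tuple per part, so the storage per sub-aggregate is bounded by a constant, which is what separates an algebraic aggregate such as AVERAGE from a holistic one such as MEDIAN.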

Review of Iceberg Queries

Iceberg queries compute aggregate functions across attributes and then eliminate aggregate values that are below some specified threshold. We use the following example:

    SELECT Location, Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location, Product Type
    HAVING Sum (# Product) >= T

We illustrate the calculation in three steps.

Step one: Generate the Location-list.

    SELECT Location, Sum (# Product)
    FROM Relation Sales
    GROUP BY Location
    HAVING Sum (# Product) >= T

Step two: Generate the Product Type-list.

    SELECT Product Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Product Type
    HAVING Sum (# Product) >= T

Step three: Generate the Location and Product Type pair groups. Using the Location-list and the Product Type-list generated in the first two steps, we can eliminate many of the Location and Product Type pair groups according to the threshold T.
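The three steps can be sketched in plain Python, independently of any P-tree machinery. This is an illustrative sketch only: the rows and the threshold value 15 are taken from the worked example later in the slides, and the helper names `group_sums` and `iceberg` are hypothetical. The pruning in step three is safe here because "# Product" is never negative, so a pair can only pass the threshold if both of its components pass on their own.

```python
from collections import defaultdict

# (Location, Product Type, # Product) rows of the Sales example.
sales = [
    ("New York", "Notebook", 10), ("Minneapolis", "Desktop", 5),
    ("New York", "Printer", 6), ("New York", "Notebook", 7),
    ("Minneapolis", "Notebook", 11), ("Chicago", "Desktop", 9),
    ("Minneapolis", "Fax", 3),
]
T = 15  # threshold, as in the later worked example

def group_sums(rows, key):
    """Sum '# Product' per group, where key(row) identifies the group."""
    sums = defaultdict(int)
    for row in rows:
        sums[key(row)] += row[2]
    return sums

def iceberg(rows, threshold):
    # Steps one and two: single-attribute lists that pass the threshold.
    locs = {k for k, v in group_sums(rows, lambda r: r[0]).items() if v >= threshold}
    types = {k for k, v in group_sums(rows, lambda r: r[1]).items() if v >= threshold}
    # Step three: aggregate only the pairs whose components both survived.
    candidates = [r for r in rows if r[0] in locs and r[1] in types]
    pairs = group_sums(candidates, lambda r: (r[0], r[1]))
    return {k: v for k, v in pairs.items() if v >= threshold}

result = iceberg(sales, T)
```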

Algorithms of Aggregate Function Computation Using P-trees

We use the data in relation Sales to illustrate the aggregate function algorithms.

    Id  Mon  Loc          Type      On line  # Product
    1   Jan  New York     Notebook  Y        10
    2   Jan  Minneapolis  Desktop   N        5
    3   Feb  New York     Printer   Y        6
    4   Mar  New York     Notebook  Y        7
    5   Mar  Minneapolis  Notebook  Y        11
    6   Mar  Chicago      Desktop   Y        9
    7   Apr  Minneapolis  Fax       N        3

Table 1. Relation Sales.

Algorithms of Aggregate Function Computation Using P-trees (Cont.)

Table 2 shows the binary representation of the data in relation Sales. Each attribute is represented by one basic P-tree per bit position:

    Attribute   Basic P-trees
    Mon         P0,3 P0,2 P0,1 P0,0
    Loc         P1,4 P1,3 P1,2 P1,1 P1,0
    Type        P2,2 P2,1 P2,0
    On line     P3,0
    # Product   P4,3 P4,2 P4,1 P4,0

Table 2. Binary Form of Sales.
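As a rough illustration of the vertical decomposition, the sketch below splits the "# Product" column into four bit slices. It assumes each basic P-tree can be modeled as a flat, uncompressed list of bits (actual P-trees are compressed tree structures), and the helper name `bit_slices` is hypothetical.

```python
# "# Product" column of relation Sales, in tuple order 1..7.
products = [10, 5, 6, 7, 11, 9, 3]

def bit_slices(values, width):
    """Return {bit position j: list holding bit j of every value}."""
    return {j: [(v >> j) & 1 for v in values] for j in range(width)}

# P4[3] plays the role of P4,3 (most significant bit), P4[0] of P4,0.
P4 = bit_slices(products, 4)

for j in (3, 2, 1, 0):
    print(f"P4,{j} =", P4[j])
```

Reading the four lists column by column recovers the binary form of each value; for example, tuple 1 has bits 1, 0, 1, 0, which is 10.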

Algorithm of Aggregate Function COUNT

COUNT function: No special function is needed for COUNT, because the P-tree RootCount function already provides the mechanism to implement it. Given a P-tree $P_i$, RootCount($P_i$) returns the number of 1s in $P_i$.
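A minimal sketch of COUNT through RootCount, assuming a P-tree is modeled as a flat list of bits and that the "On line" attribute is encoded as Y = 1 and N = 0 (the encoding is an assumption, since the slides do not state it). The name `root_count` is a stand-in for the RootCount operation.

```python
def root_count(ptree):
    """Number of 1s in a P-tree modeled as a flat bit list."""
    return sum(ptree)

# "On line" column of relation Sales; Y -> 1, N -> 0 is an assumed encoding.
online = ["Y", "N", "Y", "Y", "Y", "Y", "N"]
P3_0 = [1 if v == "Y" else 0 for v in online]

# COUNT of tuples sold on line, and of those not sold on line (complement P').
count_online = root_count(P3_0)
count_offline = root_count([1 - b for b in P3_0])
```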

Algorithm of Aggregate Function SUM

SUM function: The Sum function totals a field of numerical values.

Algorithm 4.1. Evaluating sum() with P-trees.

    total = 0.00;
    For i = 0 to n {
        total = total + 2^i * RootCount(P_i);
    }
    Return total

Algorithm Sum Aggregate.

Algorithm of Aggregate Function SUM

For example, to find the total number of products sold in relation Sales, we take the root counts of $P_{4,3}$, $P_{4,2}$, $P_{4,1}$, and $P_{4,0}$, which are 3, 3, 5, and 5, and weight each by its bit position:

$$2^3 \times 3 + 2^2 \times 3 + 2^1 \times 5 + 2^0 \times 5 = 51$$
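Algorithm 4.1 can be sketched as follows, again assuming flat bit lists in place of real P-trees. The function name `p_sum` and the dictionary layout (bit position mapped to bit list) are illustrative choices, not part of the original work.

```python
products = [10, 5, 6, 7, 11, 9, 3]

# Bit slices of "# Product": key i holds bit i of every value (stand-in for P4,i).
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

def root_count(ptree):
    return sum(ptree)

def p_sum(slices):
    """Sum a column from its bit slices: sum over i of 2^i * RootCount(P_i)."""
    total = 0
    for i, ptree in slices.items():
        total += 2 ** i * root_count(ptree)
    return total

counts = {i: root_count(P4[i]) for i in P4}
total = p_sum(P4)
```

The sum never touches individual values: it needs only one root count per bit position, which is why the cost depends on the bit width rather than on the number of tuples once the counts are available.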

Algorithm of Aggregate Function AVERAGE

Average function: The Average function returns the average value in a field. It can be calculated from the functions COUNT and SUM: Average() = Sum() / Count().

Algorithm of Aggregate Function MAX

Max function: The Max function returns the largest value in a field.

Algorithm 4.2. Evaluating max() with P-trees.

    max = 0.00;
    c = 0;
    P_c is set to all 1s;
    For i = n to 0 {
        c = RootCount(P_c AND P_i);
        If (c >= 1) {
            P_c = P_c AND P_i;
            max = max + 2^i;
        }
    }
    Return max;

Algorithm Max Aggregate.

Algorithm of Aggregate Function MAX

Applying the algorithm to $P_{4,3}$, $P_{4,2}$, $P_{4,1}$, and $P_{4,0}$:

    Step  Test                                     Action                   Bit
    1     RootCount(P_c AND P4,3) = 3 >= 1         P_c = P4,3               1
    2     RootCount(P_c AND P4,2) = 0 < 1          P_c = P_c AND P'4,2      0
    3     RootCount(P_c AND P4,1) = 2 >= 1         P_c = P_c AND P4,1       1
    4     RootCount(P_c AND P4,0) = 1 >= 1         P_c = P_c AND P4,0       1

$$2^3 \times 1 + 2^2 \times 0 + 2^1 \times 1 + 2^0 \times 1 = 11$$
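A small runnable sketch of the Max procedure, assuming flat bit lists stand in for P-trees; `p_max` is a hypothetical name. It also returns the final candidate mask, which marks the tuples that hold the maximum.

```python
products = [10, 5, 6, 7, 11, 9, 3]
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

def p_max(slices, n_rows):
    """Scan bit slices from the most significant bit down, keeping rows that
    still have a 1 whenever at least one candidate row does."""
    pc = [1] * n_rows          # candidate mask P_c, initially all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        with_bit = [a & b for a, b in zip(pc, slices[i])]
        if sum(with_bit) >= 1:  # RootCount(P_c AND P_i) >= 1
            pc = with_bit
            result += 2 ** i
        # otherwise every candidate has a 0 here, so P_c is left unchanged
    return result, pc

max_value, max_rows = p_max(P4, len(products))
```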

Algorithm of Aggregate Function MIN

Min function: The Min function returns the smallest value in a field.

Algorithm 4.3. Evaluating min() with P-trees.

    min = 0.00;
    c = 0;
    P_c is set to all 1s;
    For i = n to 0 {
        c = RootCount(P_c AND NOT(P_i));
        If (c >= 1)
            P_c = P_c AND NOT(P_i);
        Else
            min = min + 2^i;
    }
    Return min;

Algorithm Min Aggregate.

Algorithm of Aggregate Function MIN

Applying the algorithm to $P_{4,3}$, $P_{4,2}$, $P_{4,1}$, and $P_{4,0}$:

    Step  Test                                     Action                   Bit
    1     RootCount(P_c AND P'4,3) = 4 >= 1        P_c = P'4,3              0
    2     RootCount(P_c AND P'4,2) = 1 >= 1        P_c = P_c AND P'4,2      0
    3     RootCount(P_c AND P'4,1) = 0 < 1         P_c = P_c AND P4,1       1
    4     RootCount(P_c AND P'4,0) = 0 < 1         P_c = P_c AND P4,0       1

$$2^3 \times 0 + 2^2 \times 0 + 2^1 \times 1 + 2^0 \times 1 = 3$$
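The Min procedure mirrors Max with complemented slices. The sketch below assumes flat bit lists in place of P-trees, and `p_min` is a hypothetical name.

```python
products = [10, 5, 6, 7, 11, 9, 3]
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

def p_min(slices, n_rows):
    """Scan bit slices from the most significant bit down, keeping rows with a 0
    whenever at least one candidate row has a 0 at that position."""
    pc = [1] * n_rows          # candidate mask P_c, initially all 1s
    result = 0
    for i in sorted(slices, reverse=True):
        without_bit = [a & (1 - b) for a, b in zip(pc, slices[i])]
        if sum(without_bit) >= 1:   # RootCount(P_c AND NOT(P_i)) >= 1
            pc = without_bit
        else:
            result += 2 ** i        # every candidate has a 1 here
    return result, pc

min_value, min_rows = p_min(P4, len(products))
```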

Algorithms of Aggregate Function MEDIAN and RANK

Median/Rank: The Median function returns the median value in a field. The Rank(K) function returns the Kth largest value in a field.

Algorithm 4.4. Evaluating median() with P-trees.

    median = 0.00;
    pos = N/2, rounded up;    (for Rank, pos = K)
    c = 0;
    P_c is set to all 1s for a single attribute;
    For i = n to 0 {
        c = RootCount(P_c AND P_i);
        If (c >= pos) {
            median = median + 2^i;
            P_c = P_c AND P_i;
        } Else {
            pos = pos - c;
            P_c = P_c AND NOT(P_i);
        }
    }
    Return median;

Algorithm Median Aggregate.

Algorithm of Aggregate Function MEDIAN

With N = 7 tuples, pos starts at 4. Applying the algorithm to $P_{4,3}$, $P_{4,2}$, $P_{4,1}$, and $P_{4,0}$:

    Step  Test                                     Action                              Bit
    1     RootCount(P_c AND P4,3) = 3 < 4          P_c = P'4,3; pos = 4 - 3 = 1        0
    2     RootCount(P_c AND P4,2) = 3 >= 1         P_c = P_c AND P4,2                  1
    3     RootCount(P_c AND P4,1) = 2 >= 1         P_c = P_c AND P4,1                  1
    4     RootCount(P_c AND P4,0) = 1 >= 1         P_c = P_c AND P4,0                  1

$$2^3 \times 0 + 2^2 \times 1 + 2^1 \times 1 + 2^0 \times 1 = 7$$
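The Rank and Median procedure can be sketched as below, assuming flat bit lists in place of P-trees. The names `p_rank` and `p_median` are hypothetical, and taking the median position as N/2 rounded up is an assumption inferred from the worked example, where pos starts at 4 for 7 tuples.

```python
products = [10, 5, 6, 7, 11, 9, 3]
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

def p_rank(slices, n_rows, k):
    """Return the k-th largest value (k = 1 is the maximum)."""
    pc = [1] * n_rows
    pos = k
    value = 0
    for i in sorted(slices, reverse=True):
        with_bit = [a & b for a, b in zip(pc, slices[i])]
        c = sum(with_bit)                       # RootCount(P_c AND P_i)
        if c >= pos:
            value += 2 ** i                     # target has a 1 at bit i
            pc = with_bit
        else:
            pos -= c                            # skip the c larger candidates
            pc = [a & (1 - b) for a, b in zip(pc, slices[i])]
    return value

def p_median(slices, n_rows):
    # Median position N/2 rounded up (assumed from the worked example).
    return p_rank(slices, n_rows, (n_rows + 1) // 2)

median = p_median(P4, len(products))
ranks = [p_rank(P4, len(products), k) for k in range(1, len(products) + 1)]
```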

Algorithm of Aggregate Function TOP-K

Top-k function: To get the largest k values in a field, we first find the rank-k value $V_k$ using the function Rank(K). We then find all tuples whose values are greater than or equal to $V_k$, using the ENRING technology of P-trees.
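The two steps can be illustrated with a small sketch. Note that this does not implement the ENRING technique named on the slide; it swaps in a simpler idea, collecting during the rank descent every row that is set aside as larger than the target, and it models P-trees as flat bit lists. The name `p_top_k` is hypothetical.

```python
products = [10, 5, 6, 7, 11, 9, 3]
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

def p_top_k(slices, n_rows, k):
    """Return (V_k, mask of rows whose value is >= V_k)."""
    pc = [1] * n_rows
    top = [0] * n_rows           # rows already known to be larger than V_k
    pos = k
    value = 0
    for i in sorted(slices, reverse=True):
        with_bit = [a & b for a, b in zip(pc, slices[i])]
        c = sum(with_bit)
        if c >= pos:
            value += 2 ** i
            pc = with_bit
        else:
            # These c rows exceed V_k, so they belong to the answer.
            top = [t | w for t, w in zip(top, with_bit)]
            pos -= c
            pc = [a & (1 - b) for a, b in zip(pc, slices[i])]
    # Rows left in pc hold exactly V_k.
    top = [t | p for t, p in zip(top, pc)]
    return value, top

v3, top3_mask = p_top_k(P4, len(products), 3)
top3_values = sorted((v for v, m in zip(products, top3_mask) if m), reverse=True)
```

If several tuples tie at $V_k$, the mask contains all of them, so it can select more than k tuples; this matches the "greater than or equal to" wording of the slide.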

Iceberg Query Operation Using P-trees

We demonstrate the computation procedure of iceberg querying with the following example:

    SELECT Loc, Type, Sum (# Product)
    FROM Relation Sales
    GROUP BY Loc, Type
    HAVING Sum (# Product) >= 15

Iceberg Query Operation Using P-trees (Step One)

Step one: We build value P-trees for the three values, {Loc | New York, Minneapolis, Chicago}, of attribute Loc.

Figure 4. Value P-trees of Attribute Loc ($P_{NY}$, $P_{MN}$, $P_{CH}$).

Iceberg Query Operation Using P-trees (Step One)

Figure 5 illustrates the calculation procedure of the value P-tree $P_{NY}$. Because the binary value of New York is 00001, we get formula 1:

$$P_{NY} = P'_{1,4} \text{ AND } P'_{1,3} \text{ AND } P'_{1,2} \text{ AND } P'_{1,1} \text{ AND } P_{1,0} \tag{1}$$

Figure 5. Procedure of Calculating $P_{NY}$.
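Formula 1 generalizes to any value: AND together each basic P-tree where the code has a 1 and its complement where the code has a 0. The sketch below assumes flat bit lists in place of P-trees. Only the code 00001 for New York comes from the slides; the codes for Minneapolis and Chicago are made up for illustration, and `value_ptree` is a hypothetical name.

```python
# Loc codes: New York = 00001 is from the slides; the other two are assumed.
LOC_CODE = {"New York": 0b00001, "Minneapolis": 0b00010, "Chicago": 0b00011}
WIDTH = 5

locs = ["New York", "Minneapolis", "New York", "New York",
        "Minneapolis", "Chicago", "Minneapolis"]

# Basic P-trees of Loc: P1[j] holds bit j of every tuple's Loc code (P1,j).
P1 = {j: [(LOC_CODE[v] >> j) & 1 for v in locs] for j in range(WIDTH)}

def value_ptree(slices, code, width):
    """AND P_j where bit j of code is 1, and the complement P'_j where it is 0."""
    result = [1] * len(slices[0])
    for j in range(width):
        bit = (code >> j) & 1
        operand = slices[j] if bit else [1 - b for b in slices[j]]
        result = [r & o for r, o in zip(result, operand)]
    return result

P_NY = value_ptree(P1, LOC_CODE["New York"], WIDTH)
P_MN = value_ptree(P1, LOC_CODE["Minneapolis"], WIDTH)
P_CH = value_ptree(P1, LOC_CODE["Chicago"], WIDTH)
```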

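Iceberg Query Operation Using P-trees (Step One)

After getting the value P-trees for each location, we calculate the total number of products sold in each place. We again use the value New York as our example:

$$\begin{aligned}
\mathrm{Sum}(\#\,\mathrm{Product} \mid \text{New York}) &= 2^3 \times \mathrm{RootCount}(P_{4,3} \text{ AND } P_{NY}) + 2^2 \times \mathrm{RootCount}(P_{4,2} \text{ AND } P_{NY}) \\
&\quad + 2^1 \times \mathrm{RootCount}(P_{4,1} \text{ AND } P_{NY}) + 2^0 \times \mathrm{RootCount}(P_{4,0} \text{ AND } P_{NY}) \\
&= 8 \times 1 + 4 \times 2 + 2 \times 3 + 1 \times 1 = 23
\end{aligned} \tag{2}$$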
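Formula 2 amounts to a Sum restricted by a mask. A minimal sketch, assuming flat bit lists in place of P-trees and taking the New York mask directly from Table 1 (tuples 1, 3, and 4); `masked_sum` is a hypothetical name.

```python
products = [10, 5, 6, 7, 11, 9, 3]
P4 = {i: [(v >> i) & 1 for v in products] for i in range(4)}

# Value P-tree of New York: tuples 1, 3 and 4 in Table 1.
P_NY = [1, 0, 1, 1, 0, 0, 0]

def masked_sum(slices, mask):
    """Sum over i of 2^i * RootCount(P_i AND mask)."""
    total = 0
    for i, ptree in slices.items():
        total += 2 ** i * sum(a & b for a, b in zip(ptree, mask))
    return total

# Root counts per bit position, most significant first, then the total.
ny_counts = [sum(a & b for a, b in zip(P4[i], P_NY)) for i in (3, 2, 1, 0)]
ny_total = masked_sum(P4, P_NY)
```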

Iceberg Query Operation Using P-trees (Step One)

    Loc Values    Sum (# Product)  Threshold
    New York      23               Y
    Minneapolis   19               Y
    Chicago       9                N

Table 3. Summary Table of Attribute Loc.

Table 3 shows the total number of products sold in each of the three locations. Because our threshold T is 15, we eliminate the city Chicago.

Iceberg Query Operation Using P-trees (Step Two)

Step two: Similarly, we build value P-trees for every value of attribute Type. Attribute Type has four values, {Type | Notebook, Desktop, Printer, Fax}. Figure 6 shows the value P-trees of these four values.

Figure 6. Value P-trees of Attribute Type ($P_{Notebook}$, $P_{Desktop}$, $P_{Printer}$, $P_{Fax}$).

Iceberg Query Operation Using P-trees (Step Two)

    Type Values  Sum (# Product)  Threshold
    Notebook     28               Y
    Desktop      14               N
    Fax          3                N
    Printer      6                N

Table 4. Summary Table of Attribute Type.

Similarly, we get the summary table for each value of attribute Type. With the threshold T equal to 15, only the value P-tree of Notebook is used in the next step.

Iceberg Query Operation Using P-trees (Step Three)

Step three: We generate candidate Loc and Type pairs only from the locations and product types that passed the threshold T. By performing an AND operation on $P_{NY}$ and $P_{Notebook}$, we obtain the value P-tree $P_{NY \text{ AND } Notebook}$.

Figure 7. Procedure of Calculating $P_{NY \text{ AND } Notebook}$.

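Iceberg Query Operation Using P-trees (Step Three)

We calculate the total number of notebooks sold in New York by formula 3:

$$\begin{aligned}
\mathrm{Sum}(\#\,\mathrm{Product} \mid \text{New York AND Notebook}) &= 2^3 \times \mathrm{RootCount}(P_{4,3} \text{ AND } P_{NY \text{ AND } Notebook}) + 2^2 \times \mathrm{RootCount}(P_{4,2} \text{ AND } P_{NY \text{ AND } Notebook}) \\
&\quad + 2^1 \times \mathrm{RootCount}(P_{4,1} \text{ AND } P_{NY \text{ AND } Notebook}) + 2^0 \times \mathrm{RootCount}(P_{4,0} \text{ AND } P_{NY \text{ AND } Notebook}) \\
&= 8 \times 1 + 4 \times 1 + 2 \times 2 + 1 \times 1 = 17
\end{aligned} \tag{3}$$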

Iceberg Query Operation Using P-trees (Step Three)

By performing an AND operation on $P_{MN}$ and $P_{Notebook}$, we obtain the value P-tree $P_{MN \text{ AND } Notebook}$.

Figure 8. Procedure of Calculating $P_{MN \text{ AND } Notebook}$.

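Iceberg Query Operation Using P-trees (Step Three)

We calculate the total number of notebooks sold in Minneapolis by formula 4:

$$\begin{aligned}
\mathrm{Sum}(\#\,\mathrm{Product} \mid \text{Minneapolis AND Notebook}) &= 2^3 \times \mathrm{RootCount}(P_{4,3} \text{ AND } P_{MN \text{ AND } Notebook}) + 2^2 \times \mathrm{RootCount}(P_{4,2} \text{ AND } P_{MN \text{ AND } Notebook}) \\
&\quad + 2^1 \times \mathrm{RootCount}(P_{4,1} \text{ AND } P_{MN \text{ AND } Notebook}) + 2^0 \times \mathrm{RootCount}(P_{4,0} \text{ AND } P_{MN \text{ AND } Notebook}) \\
&= 8 \times 1 + 4 \times 0 + 2 \times 1 + 1 \times 1 = 11
\end{aligned} \tag{4}$$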

Iceberg Query Operation Using P-trees (Step Three)

Finally, we obtain summary Table 5. With the threshold T = 15, only the group pair "New York AND Notebook" passes. From the value P-tree $P_{NY \text{ AND } Notebook}$, we can see that tuples 1 and 4 are in the result of our iceberg query example.

    Loc and Type Values        Sum (# Product)  Threshold
    New York AND Notebook      17               Y
    Minneapolis AND Notebook   11               N

Table 5. Summary Table of Our Example.
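The whole three-step procedure can be put together in one short sketch. It assumes flat bit lists in place of P-trees, and for brevity it builds each value P-tree by direct comparison instead of ANDing basic P-trees and their complements; all function names are hypothetical.

```python
products = [10, 5, 6, 7, 11, 9, 3]
locs = ["New York", "Minneapolis", "New York", "New York",
        "Minneapolis", "Chicago", "Minneapolis"]
types = ["Notebook", "Desktop", "Printer", "Notebook",
         "Notebook", "Desktop", "Fax"]
T = 15

def value_ptree(column, value):
    # Stand-in for the AND of basic P-trees and complements: 1 where the value matches.
    return [1 if v == value else 0 for v in column]

def p_and(a, b):
    return [x & y for x, y in zip(a, b)]

def masked_sum(mask, values, width=4):
    """Sum over i of 2^i * RootCount(P_i AND mask) for the given value column."""
    total = 0
    for i in range(width):
        slice_i = [(v >> i) & 1 for v in values]
        total += 2 ** i * sum(p_and(slice_i, mask))
    return total

def survivors(column):
    """Steps one and two: value P-trees whose masked sum passes the threshold."""
    out = {}
    for value in dict.fromkeys(column):      # distinct values, first-seen order
        mask = value_ptree(column, value)
        if masked_sum(mask, products) >= T:
            out[value] = mask
    return out

loc_masks = survivors(locs)
type_masks = survivors(types)

# Step three: AND the surviving masks pairwise and test each pair.
result = {}
for loc, lm in loc_masks.items():
    for typ, tm in type_masks.items():
        pair = p_and(lm, tm)
        total = masked_sum(pair, products)
        if total >= T:
            tuple_ids = [i + 1 for i, bit in enumerate(pair) if bit]
            result[(loc, typ)] = (total, tuple_ids)
```

Only two candidate pairs are evaluated in step three instead of all twelve Loc and Type combinations, which is the saving the pruning steps are meant to provide.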

Performance Analysis

Figure 15. Iceberg Query with Multi-Attribute Aggregation: Performance Time Comparison.

Performance Analysis

Our experiments were implemented in C++ on a 1 GHz Pentium PC with 1 GB of main memory running Red Hat Linux. In Figure 15, we compare the running time of the P-tree method and the bitmap method on a multi-attribute iceberg query. In this case, the P-tree method proved to be substantially faster.

Conclusion

We believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and for multi-attribute iceberg queries. It also shows the advantages of the basic P-tree representation of files:
– First, there is no need for redundant, auxiliary structures.
– Second, basic P-trees are good at calculating multi-attribute aggregations and numeric values, and they are fair to all attributes.

Thank you!