On Computing the Data Cube. Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996. Parallel and Distributed Computing Laboratory, first-semester master's student, 송지숙.

Presentation transcript:

On Computing the Data Cube. Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996. Parallel and Distributed Computing Laboratory, first-semester master's student, 송지숙

Contents
- Introduction
- PipeSort Algorithm
- PipeHash Algorithm
- Comparing PipeSort and PipeHash
- Conclusion

Optimization (1/2)
- Smallest-parent: compute each group-by from the smallest previously computed group-by.
- Cache-results: compute further group-bys from a group-by whose result is still cached in memory, reducing disk I/O.
- Amortize-scans: reduce disk reads by computing as many group-bys as possible in a single scan.
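The smallest-parent idea is just a roll-up from an already-aggregated result. Below is a minimal sketch, assuming hash (dictionary) sum aggregation and a hypothetical `keep` index list naming the attributes retained in the child group-by.

```python
# Roll up a computed parent group-by (e.g. ABC) into a child (e.g. AB)
# instead of rescanning raw data; hash aggregation is illustrative only.
from collections import defaultdict

def rollup(parent_rows, keep):
    """parent_rows: iterable of (key_tuple, aggregate);
    keep: indices of the attributes kept in the child group-by."""
    out = defaultdict(int)
    for key, agg in parent_rows:
        out[tuple(key[i] for i in keep)] += agg
    return sorted(out.items())

abc = [(("a1", "b1", "c1"), 5), (("a1", "b1", "c2"), 3)]
print(rollup(abc, keep=(0, 1)))   # [(('a1', 'b1'), 8)]
```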

Optimization (2/2)
- Share-sorts: specific to sort-based algorithms; share the sorting cost across multiple group-bys.
- Share-partitions: specific to hash-based algorithms; when the hash table is too large for memory, partition the data into memory-sized pieces and aggregate each partition, sharing the partitioning cost across multiple group-bys.

Sort-based methods
- PipeSort algorithm
  - Combines the share-sorts and smallest-parent optimizations: since the two can conflict, it plans globally over the group-bys to obtain the minimum total cost.
  - Also incorporates cache-results and amortize-scans: executing several group-bys in a pipeline reduces the disk-scan cost.

Share-sorts and smallest-parent
[Figure: the search lattice of group-bys arranged by level (all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD), illustrating the tension between sharing a sort order and choosing the smallest parent, e.g. on the path ABC → AB → A.]

Cache-results and amortize-scans
[Figure: the same lattice, with a pipeline ABCD → ABC → AB → A evaluated together in one pass.]

Algorithm PipeSort (1/2)
- Input: the search lattice
  - vertex: a group-by of the cube
  - edge: i is connected to j when j can be generated from i; j has one fewer attribute than i, and i is called the parent of j
  - cost: S(i, j) is the cost of computing j from i when i is not sorted; A(i, j) is the cost of computing j from i when i is sorted
- Output: a subgraph of the search lattice
  - each group-by is annotated with the sort order of its attributes and is connected to the single parent used to compute it
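A minimal sketch of building this input structure, assuming hypothetical caller-supplied cost estimators `est_S` and `est_A` (the paper derives these from size estimates; the names here are illustrative):

```python
# Build the search lattice: vertices are all attribute subsets, and an
# edge (parent, child) exists when child drops exactly one attribute.
from itertools import combinations

def build_lattice(attrs, est_S, est_A):
    """Return {child: [(parent, S_cost, A_cost), ...]} for every
    group-by below the full one."""
    edges = {}
    for k in range(len(attrs)):
        for child in combinations(attrs, k):
            edges[child] = [
                (parent, est_S(parent, child), est_A(parent, child))
                for parent in combinations(attrs, k + 1)
                if set(child) <= set(parent)
            ]
    return edges
```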

Algorithm PipeSort (2/2)
[Figure: the level-by-level step; the level-2 group-bys AB, AC, and BC are matched to sort orders of ABC using the A (sorted) and S (unsorted) edge costs, via a minimum cost matching.]
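The per-level step can be cast as weighted bipartite matching. Below is a sketch, assuming hypothetical cost tables `A[(parent, child)]` and `S[(parent, child)]`; giving each parent one A-cost slot plus extra S-cost slots is a simplification of the paper's vertex-replication transformation.

```python
# Match each level-j group-by to a level-(j+1) parent at minimum cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_level(children, parents, A, S):
    BIG = 1e18                                     # forbids invalid pairs
    cols = []                                      # (parent, needs_sort)
    for p in parents:
        cols.append((p, False))                    # one pipelined slot
        cols += [(p, True)] * (len(children) - 1)  # re-sort slots
    cost = np.full((len(children), len(cols)), BIG)
    for i, c in enumerate(children):
        for j, (p, resort) in enumerate(cols):
            if set(c) <= set(p):                   # c computable from p
                cost[i, j] = S[(p, c)] if resort else A[(p, c)]
    rows, picks = linear_sum_assignment(cost)
    return {children[i]: cols[j] for i, j in zip(rows, picks)}

# e.g. children = ["AB", "AC", "BC"], parents = ["ABC"] at the top level
```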

Minimum cost sort plan
[Figure: the resulting plan over raw data; pipeline edges (cost A) chain prefix group-bys, e.g. CBAD → CBA → CB → C → all, while sort edges (cost S) start new sort orders such as BAD, ACD, and DBC.]
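Each pipeline in the plan is evaluated in a single pass: every prefix group-by emits a result tuple whenever its grouping attributes change. A minimal sketch, assuming sum aggregation over (key, measure) rows already sorted in the pipeline's attribute order:

```python
# One scan of data sorted by, e.g., (C, B, A, D) yields the group-bys
# CBAD, CBA, CB, C, and all simultaneously.
def pipeline_aggregate(sorted_rows, n_attrs):
    """sorted_rows: iterable of (key_tuple, measure), sorted
    lexicographically. Yields (prefix_length, key_prefix, total)."""
    current = [None] * (n_attrs + 1)   # open group per prefix length
    totals = [0] * (n_attrs + 1)
    for key, measure in sorted_rows:
        for k in range(n_attrs + 1):
            prefix = key[:k]
            if current[k] is not None and prefix != current[k]:
                yield k, current[k], totals[k]
                totals[k] = 0
            current[k] = prefix
            totals[k] += measure
    for k in range(n_attrs + 1):       # flush the last open groups
        if current[k] is not None:
            yield k, current[k], totals[k]
```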

Hash-based methods
- PipeHash algorithm
  - Combines the cache-results and amortize-scans optimizations: requires careful memory allocation across multiple hash tables.
  - Also incorporates smallest-parent.
  - Includes share-partitions: when the data to aggregate is too large for its hash table to fit in memory, the data is partitioned on one or more attributes, and the partitioning cost is shared among all group-bys that contain the partitioning attribute.

Cache-results and amortize-scans
[Figure: the lattice again; several group-bys (e.g. AB and AC) are computed together from a parent result cached in memory.]

Algorithm PipeHash
- Input: the search lattice
- First step: for each group-by, select the parent group-by with the smallest estimated total size; the result is a minimum spanning tree (MST). (A sketch of this step follows.)
- Next step: usually there is not enough memory to compute all the group-bys in the MST together. When memory must be given up to other hash tables, decide which group-bys to compute together and which attributes to partition the data on. For the cache-results and amortize-scans optimizations, choose the largest subtrees of the MST.
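A minimal sketch of the first step, assuming a hypothetical size-estimate table `est_size` keyed by attribute tuples:

```python
# Pick, for every group-by, the smallest-estimated parent one level up;
# the chosen edges form the minimum spanning tree PipeHash works on.
from itertools import combinations

def pipehash_mst(attrs, est_size):
    """Return {child: parent}; the full group-by (root) has no parent."""
    tree = {}
    for k in range(len(attrs)):
        for child in combinations(attrs, k):
            parents = [p for p in combinations(attrs, k + 1)
                       if set(child) <= set(p)]
            tree[child] = min(parents, key=lambda p: est_size[p])
    return tree
```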

Minimum spanning tree
[Figure: the MST over the lattice (all; A, B, C, D; AB, AC, BC, AD, CD, BD; ABC, ABD, ACD, BCD; ABCD), rooted at ABCD, which is computed from the raw data.]

First subtree partitioned on A; remaining subtrees
[Figure: the subtree containing A (A, AB, AC, AD, ABC, ABD, ACD, ABCD) is computed first from raw data partitioned on A; the remaining subtrees of the MST (covering all, B, BC, BD, CD, BCD, etc.) are processed afterwards.]

Comparing PipeSort and PipeHash (1/5)
- Datasets
- Performance results
  - Both algorithms are faster than the naive methods.
  - The performance of PipeHash is very close to the lower bound for hash-based algorithms.
  - PipeHash is inferior to the PipeSort algorithm.

Comparing PipeSort and PipeHash (2/5)
[Figure: performance results on the datasets.]

Comparing PipeSort and PipeHash (3/5)
- When each group-by greatly reduces the number of tuples, the hash-based method should outperform the sort-based method.
- Synthetic datasets
  - number of tuples, T
  - number of grouping attributes, N
  - ratio among the numbers of distinct values of the attributes, d_1 : d_2 : ... : d_N
  - ratio of T to the total number of possible attribute-value combinations, p, used to vary the degree of data sparsity (a generator sketch follows)
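These parameters pin down a generator. A sketch in that spirit, where uniform independent sampling is an assumption rather than the paper's stated procedure:

```python
# Generate T rows over attributes with cardinalities cards = (d_1, ..., d_N);
# the sparsity parameter is then p = T / (d_1 * ... * d_N).
import random

def synth(T, cards):
    """Yield (key_tuple, measure) rows; measure fixed at 1 for brevity."""
    for _ in range(T):
        yield tuple(random.randrange(d) for d in cards), 1
```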

Comparing PipeSort and PipeHash (4/5)
[Figure: effect of sparseness on the relative performance of the hash- and sort-based algorithms for a 5-attribute synthetic dataset.]

Comparing PipeSort and PipeHash (5/5)
- Results
  - The x-axis denotes decreasing levels of sparsity.
  - The y-axis denotes the ratio between the total running times of PipeHash and PipeSort.
  - As the data becomes less sparse, the hash-based method outperforms the sort-based method.
  - Sparsity is thus a predictor of the relative performance of the PipeHash and PipeSort algorithms.

Conclusion
- Presented five optimizations: smallest-parent, cache-results, amortize-scans, share-sorts, and share-partitions.
- The PipeHash and PipeSort algorithms combine them so as to reduce the total cost.
- PipeHash does better on low-sparsity data, whereas PipeSort does better on high-sparsity data.