Parallel Multi-Dimensional ROLAP Indexing Andrew Rau-Chaplin Faculty of Computer Science Dalhousie University Joint work with Frank Dehne, Carleton Univ.

Presentation transcript:

Parallel Multi-Dimensional ROLAP Indexing
Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University
Joint work with Frank Dehne, Carleton Univ., and Todd Eavis, Dalhousie Univ.

Data Warehousing for Decision Support
- Operational data collected into DW
- DW used to support multi-dimensional views
- Views form the basis of OLAP processing
- Our focus: the OLAP server

Multi-dimensional views
- Collection of feature attributes
- Aggregate along one or more measure attributes
- Reduce the granularity by collapsing dimensions
- Points generated by:
  - distributive functions (e.g., sum)
  - algebraic functions (e.g., average)
  - holistic functions (e.g., median)
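
The three classes of aggregate functions differ in how partial results can be combined. The sketch below is illustrative only: the two partitions and their values are made up, and it simply shows that sums combine directly, averages combine through a fixed-size (sum, count) summary, and medians cannot in general be rebuilt from per-partition medians.

```python
from statistics import median

# Hypothetical measure values of one view, split across two partitions.
part_a = [4, 8, 15]
part_b = [16, 23, 42, 100]

# Distributive: partial sums combine directly into the global sum.
total = sum([sum(part_a), sum(part_b)])

# Algebraic: the average is derived from a fixed-size summary (sum, count)
# kept for each partition.
summaries = [(sum(p), len(p)) for p in (part_a, part_b)]
average = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Holistic: the median needs all values; combining per-partition medians
# gives a different (wrong) answer for this data.
true_median = median(part_a + part_b)
median_of_medians = median([median(part_a), median(part_b)])

print(total, average, true_median, median_of_medians)
```

This is why distributive and algebraic measures are cheap to roll up from finer-grained views, while holistic measures usually have to go back to the detailed data.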

Data Cube Generation
- Proposed by Gray et al. in 1995
- Can be generated manually from a relational DB, but this is very inefficient
- Exploit the relationship between cuboids to compute all 2^d cuboids
- In OLAP environments, we typically pre-compute these views to improve query response time
[Figure: data cube lattice for dimensions A, B, C, with levels ABC; AB, AC, BC; A, B, C; ALL]
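
The relationship between cuboids can be sketched as follows: every view with k dimensions is computed from an already-computed view with k+1 dimensions rather than from the raw data. This is a minimal in-memory illustration, assuming a distributive measure (sum) and a base view stored as a dictionary; the function names and the sample rows are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

def aggregate(rows, dims, keep):
    """Sum the measure of `rows` (keyed by `dims`), keeping only `keep`."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, measure in rows.items():
        out[tuple(key[i] for i in idx)] += measure
    return dict(out)

def full_cube(base, dims):
    """Compute all 2^d cuboids, each from a parent with one more dimension."""
    cube = {tuple(dims): base}
    for k in range(len(dims) - 1, -1, -1):
        for view in combinations(dims, k):
            # Any computed view with k+1 dimensions that contains `view`
            # can serve as its parent; here we take the first one found.
            parent = next(p for p in cube
                          if len(p) == k + 1 and set(view) <= set(p))
            cube[view] = aggregate(cube[parent], parent, view)
    return cube

# Hypothetical base view over dimensions A, B, C with a summed measure.
base = {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 3, ("a2", "b1", "c2"): 7}
cube = full_cube(base, ("A", "B", "C"))
print(len(cube), cube[("A",)], cube[()])
```

For d = 3 this produces the 8 views of the lattice, from ABC down to the empty group-by (ALL).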

Existing Parallel Results
- Goil & Choudhary
- MOLAP solution
  - in-memory structures
  - global partition + d communication rounds
  - distributed views
- Limitations
  - memory for multi-dimensional arrays
  - expensive communication for larger d
Reference: J. of Data Mining & Knowledge Discovery 1(4), 1997

Our Approach
- ROLAP solution
  - Construct and cost the data cube lattice
  - Find a least cost spanning tree
  - Partition the spanning tree over the processors equally, construct views and distribute
  - Can handle partial cubes
- Limitations
  - What about indexing?
[Figure: data cube lattice for dimensions A, B, C, D, with levels ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]
Reference: CCGrid01 + J. Dist. & Parallel Databases 11(2), 2001
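
The first two steps (costing the lattice and choosing a least cost tree) can be illustrated with a toy cost model. The slides do not state the cost model, so the one below is an assumption: a view's cost is its estimated row count, bounded by the product of its dimension cardinalities and by the number of input rows, and each view picks the cheapest view with one extra dimension as its parent. Partitioning the resulting tree across processors is not shown.

```python
from itertools import combinations
from math import prod

def estimated_size(view, cardinality, n_rows):
    # Assumed cost model: a view has at most as many rows as the product of
    # its dimension cardinalities, and never more than the input data.
    return min(prod(cardinality[d] for d in view), n_rows)

def cheapest_parent_tree(dims, cardinality, n_rows):
    """Map every non-root view to its cheapest parent in the lattice."""
    tree = {}
    for k in range(len(dims)):
        for view in combinations(dims, k):
            parents = [tuple(sorted(set(view) | {d}, key=dims.index))
                       for d in dims if d not in view]
            tree[view] = min(
                parents,
                key=lambda p: estimated_size(p, cardinality, n_rows))
    return tree

# Hypothetical cardinalities for a three-dimensional cube of 1,000 rows.
dims = ("A", "B", "C")
cardinality = {"A": 100, "B": 10, "C": 2}
tree = cheapest_parent_tree(dims, cardinality, n_rows=1000)
for view, parent in tree.items():
    print(view, "<-", parent)
```

Because every edge in the lattice goes from a view to one with exactly one fewer dimension, picking the cheapest parent per view independently yields a minimum cost tree under this simple model. Cost models that distinguish sorting from scanning would need a more careful selection.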

Parallel Multi-dimensional Indexing
- Query specifies a range on multiple dimensions
- Forms a hypercube in the point space

General Approach
- No multidimensional index is universally successful
- Exploit domain specific information and the features of a particular index
- OLAP
  - Data is provided up front
  - Updates are batch oriented

Design Goals
- A framework for distributed high-performance indexing of ROLAP cubes
  - Practical to implement
  - Low communication volume
  - Fully adapted to external memory (disks)
  - No shared disk required
  - Incrementally maintainable
  - Efficient for high-D spatial searches
  - Scalable in terms of data size, dimensions, processors

Challenge
- How to order and partition data such that
  - the number of records retrieved per node is as balanced as possible
  - the number of disk seeks required to answer a query is minimized
[Figure: view ABC partitioned across processors P1, P2, P3, P4]

Indexing the Data Cube
- Combine the strengths of a space-filling curve and an R-tree index
- Use Hilbert curve to load buckets
- Index buckets with R-tree
- Update indexes with merge/sort
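
A minimal sketch of the bucket-loading idea, restricted to two dimensions for readability: points are sorted by their position along a Hilbert curve, cut into fixed-size buckets, and each bucket gets a minimum bounding rectangle as it would in the leaf level of a packed R-tree. The function names, the bucket capacity, and the grid are assumptions made for illustration; the upper R-tree levels and the merge/sort update path are not shown.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along a Hilbert curve on a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def pack_buckets(points, order, capacity):
    """Sort points along the curve and cut them into buckets with MBRs."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    buckets = []
    for i in range(0, len(ordered), capacity):
        chunk = ordered[i:i + capacity]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        mbr = (min(xs), min(ys), max(xs), max(ys))  # (x_lo, y_lo, x_hi, y_hi)
        buckets.append((mbr, chunk))
    return buckets

# Every cell of a 4 x 4 grid, packed four points to a bucket.
grid = [(x, y) for x in range(4) for y in range(4)]
buckets = pack_buckets(grid, order=2, capacity=4)
for mbr, chunk in buckets:
    print(mbr, chunk)
```

Because consecutive positions on the curve are spatially adjacent, each bucket covers a compact region, which keeps the bounding rectangles small.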

Space Filling Curves & Striping

Query Retrieval
[Figure: query on view ABC retrieved from processors P1, P2, P3, P4]

Example
[Figure: original space and its partitions on Processor 1 and Processor 2]
8 points to be reported
Reports: 2 consecutive blocks & 4 points

The Parallel Framework
- A single view is partitioned across p processors
- Partial Hilbert/R-tree indexes are computed locally
- Queries are answered concurrently
- Queries answered individually or piggy-backed
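
The partitioning step can be sketched as round-robin striping of curve-ordered data across p processors, after which each processor filters its own share against the query range. The code below is a sequential simulation under stated assumptions: the records are already sorted along a Hilbert curve (the hard-coded list is the order-2 curve over a 4 x 4 grid), striping is done per record rather than per disk block, and the function names are hypothetical.

```python
def stripe(ordered_records, p):
    """Deal records, already sorted along the curve, round-robin to p nodes."""
    return [ordered_records[i::p] for i in range(p)]

def in_range(point, low, high):
    """True if the point lies inside the query hyper-rectangle."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, low, high))

def answer(partitions, low, high):
    # In the real framework each partition is searched by its own processor
    # through a local index; here the partitions are scanned one after another.
    return [[pt for pt in part if in_range(pt, low, high)]
            for part in partitions]

# Cells of a 4 x 4 grid in Hilbert order (assumed precomputed).
hilbert_ordered = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (0, 3), (1, 3), (1, 2),
                   (2, 2), (2, 3), (3, 3), (3, 2), (3, 1), (2, 1), (2, 0), (3, 0)]

partitions = stripe(hilbert_ordered, p=4)
results = answer(partitions, low=(0, 0), high=(1, 3))
print([len(r) for r in results])
```

For this range query, which covers the left half of the grid, each of the four simulated processors reports the same number of points, which is the balance the striping is meant to achieve.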

The Virtual Data Cube
- Problem: Full cube often too large to materialize
- Solution: Use surrogate views
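
One way to read the surrogate idea: a query against a view that was never materialized is answered from a materialized view that contains all of the query's dimensions, aggregating away the extra ones at query time. The sketch below assumes a sum measure and picks the candidate with the fewest rows; that selection rule, the function names, and the sample data are illustrative assumptions, not the method described in the talk.

```python
from collections import defaultdict

def pick_surrogate(query_dims, materialized):
    """Choose the smallest materialized view covering the query dimensions."""
    candidates = [v for v in materialized if set(query_dims) <= set(v)]
    return min(candidates, key=lambda v: len(materialized[v]))

def answer_from_surrogate(query_dims, materialized):
    """Aggregate the surrogate view down to the requested dimensions."""
    view = pick_surrogate(query_dims, materialized)
    idx = [view.index(d) for d in query_dims]
    out = defaultdict(int)
    for key, measure in materialized[view].items():
        out[tuple(key[i] for i in idx)] += measure
    return view, dict(out)

# Hypothetical partial cube: only ABC and AB are materialized.
materialized = {
    ("A", "B", "C"): {("a1", "b1", "c1"): 5, ("a1", "b1", "c2"): 2,
                      ("a1", "b2", "c1"): 3, ("a2", "b1", "c2"): 7},
    ("A", "B"): {("a1", "b1"): 7, ("a1", "b2"): 3, ("a2", "b1"): 7},
}

view_a, result_a = answer_from_surrogate(("A",), materialized)
view_c, result_c = answer_from_surrogate(("C",), materialized)
print(view_a, result_a)
print(view_c, result_c)
```

The query on A is served from the smaller AB view, while the query on C has to fall back to ABC because no other materialized view contains C.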

Surrogate Processing

Other issues…
- Dimension ordering
- Query piggybacking
- Batch updating
- Managing hierarchies of views

Experimental Results
- Machine
  - 17-node cluster
  - Node = 1.8 GHz Xeon, 1 GB RAM, 2 × 40 GB IDE drives, running Linux
  - Interconnect = Intel Fast Ethernet switch
- Test Data
  - 10 dimensions and 1,000,000 records

RCUBE Index Construction
Output: ~640 million rows, 16 gigabytes

Distributed Query Resolution
Test: Random queries returning ~15% of points (10 experiments per point)

Disk Blocks Retrieved vs. Disk Seeks
Test: Random queries returning 5-15% of points (15 experiments per point)

Distributed Query Resolution in Surrogate Group-bys

Thank You
Questions?