Parallel Multi-Dimensional ROLAP Indexing
Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University
Joint work with Frank Dehne (Carleton University) and Todd Eavis (Dalhousie University)
Data Warehousing for Decision Support
- Operational data is collected into the data warehouse (DW)
- The DW is used to support multi-dimensional views
- Views form the basis of OLAP processing
- Our focus: the OLAP server
Multi-dimensional Views
- A collection of feature attributes
- Aggregate along one or more measure attributes
- Reduce granularity by collapsing dimensions
- Points are generated by:
  - distributive functions (e.g., sum)
  - algebraic functions (e.g., average)
  - holistic functions (e.g., median)
Data Cube Generation
- Proposed by Gray et al. in 1995
- Can be generated manually from a relational DB, but this is very inefficient
- Exploit the relationships between cuboids to compute all 2^d cuboids
- In OLAP environments, we typically pre-compute these views to improve query response time
[Figure: cuboid lattice — ABC; AB, AC, BC; A, B, C; ALL]
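As a rough illustration of what the 2^d cuboids are (a naive in-memory sketch, not the paper's method — the function name and the toy fact table are invented for this example), each cuboid is just a group-by over a subset of the dimensions:

```python
from itertools import combinations
from collections import defaultdict

def data_cube(rows, dims, measure):
    """Naively compute all 2^d cuboids of a fact table, aggregating
    the measure with sum (a distributive function)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            agg = defaultdict(float)
            for row in rows:
                agg[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(agg)
    return cube

# Tiny 2-dimensional fact table: 2^2 = 4 cuboids (AB, A, B, ALL).
rows = [{"A": 1, "B": 1, "m": 2.0},
        {"A": 1, "B": 2, "m": 3.0},
        {"A": 2, "B": 1, "m": 4.0}]
cube = data_cube(rows, ("A", "B"), "m")
```

Here `cube[()]` is the ALL cuboid (the grand total), and `cube[("A",)]` is the view with dimension B collapsed. The naive scan over the fact table for every cuboid is exactly the inefficiency the lattice-based methods avoid.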
Existing Parallel Results
- Goil & Choudhary (J. of Data Mining & Knowledge Discovery 1(4), 1997)
- A MOLAP solution:
  - in-memory structures
  - global partition + d communication rounds
  - distributed views
- Limitations:
  - memory for multi-dimensional arrays
  - expensive communication for larger d
Our Approach
- A ROLAP solution (CCGrid'01 + J. of Distributed & Parallel Databases 11(2), 2001):
  - Construct and cost the data cube lattice
  - Find a least-cost spanning tree
  - Partition the spanning tree equally over the processors, then construct and distribute the views
  - Can handle partial cubes
- Limitation: what about indexing?
[Figure: cuboid lattice — ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]
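The "least-cost spanning tree" step can be sketched with a simple greedy rule: build each cuboid from its cheapest parent one dimension larger. This is an illustration only (the cost model here — cost of a cuboid = its parent's estimated row count — and the function name are assumptions, not the paper's exact costing):

```python
def spanning_tree(sizes):
    """Greedy least-cost spanning tree of the cuboid lattice: each cuboid
    is computed from the cheapest parent one dimension larger, where the
    cost of using a parent is taken to be its row count."""
    tree = {}
    for view in sizes:
        parents = [p for p in sizes
                   if set(view) < set(p) and len(p) == len(view) + 1]
        if parents:
            tree[view] = min(parents, key=lambda p: sizes[p])
    return tree

# Estimated row counts for the cuboids of a 2-dimensional cube.
sizes = {("A", "B"): 100, ("A",): 10, ("B",): 50, (): 1}
tree = spanning_tree(sizes)
```

Because every cuboid picks exactly one parent, the chosen edges form a spanning tree of the lattice rooted at the top cuboid; that tree is then cut into p pieces of roughly equal cost, one per processor.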
Parallel Multi-dimensional Indexing
- A query specifies a range on multiple dimensions
- It forms a hypercube in the point space
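Concretely, such a query is a per-dimension interval test: a point is reported iff it lies inside the axis-aligned box given by the lower and upper bound on every dimension. A minimal sketch (names invented for illustration):

```python
def range_query(points, lo, hi):
    """Return the points inside the axis-aligned hypercube [lo, hi],
    i.e. satisfying lo[i] <= x[i] <= hi[i] on every dimension i."""
    return [p for p in points
            if all(l <= x <= h for x, l, h in zip(p, lo, hi))]

pts = [(1, 5), (3, 3), (8, 2)]
hits = range_query(pts, lo=(0, 0), hi=(4, 4))  # the 2-D box [0,4] x [0,4]
```

The indexing problem is to answer exactly this predicate without scanning every point, and with the work balanced across processors.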
General Approach
- No multi-dimensional index is universally successful
- Exploit domain-specific information and the features of a particular index
- In OLAP:
  - data is provided up front
  - updates are batch oriented
Design Goals
- A framework for distributed high-performance indexing of ROLAP cubes:
  - practical to implement
  - low communication volume
  - fully adapted to external memory (disks)
  - no shared disk required
  - incrementally maintainable
  - efficient for high-dimensional spatial searches
  - scalable in terms of data size, dimensions, and processors
Challenge
- How to order and partition the data such that:
  - the number of records retrieved per node is as balanced as possible
  - the number of disk seeks required to answer a query is minimized
[Figure: view ABC striped across processors P1–P4]
Indexing the Data Cube
- Combine the strengths of a space-filling curve and an R-tree index
- Use a Hilbert curve to load the buckets
- Index the buckets with an R-tree
- Update the indexes with merge/sort
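To make the Hilbert loading step concrete, here is the standard 2-D coordinate-to-curve-distance conversion (the paper's index works in d dimensions; this 2-D sketch and the surrounding names are illustrative only). Sorting points by this key clusters spatially nearby points, so consecutive points can be packed into the same disk bucket:

```python
def xy2d(n, x, y):
    """Distance of cell (x, y) along the Hilbert curve over an n x n
    grid (n a power of two) -- the classic 2-D conversion."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:            # rotate the quadrant as needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sort points into Hilbert order; cutting this sequence into fixed-size
# buckets gives the leaf pages whose bounding boxes feed the R-tree.
points = [(0, 0), (3, 3), (0, 3), (3, 0)]
ordered = sorted(points, key=lambda p: xy2d(4, *p))
```

Because the Hilbert curve preserves locality better than row-major or Z-order, the buckets it produces have tight bounding boxes, which is what makes the R-tree built on top of them effective.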
Space-Filling Curves & Striping
Query Retrieval
[Figure: query box over view ABC striped across processors P1–P4]
Example
[Figure: original space vs. Processor 1 and Processor 2]
- 8 points to be reported
- Each processor reports 2 consecutive blocks & 4 points
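The benefit of Hilbert-packed buckets in this example is that the hit blocks are consecutive on disk: each maximal run of consecutive block numbers costs one seek, and the rest of the run is read sequentially. A small sketch of that accounting (names invented for illustration):

```python
def count_seeks(block_ids):
    """One disk seek per maximal run of consecutive block numbers;
    blocks inside a run are read sequentially after the seek."""
    ids = sorted(set(block_ids))
    return sum(1 for i, b in enumerate(ids) if i == 0 or b != ids[i - 1] + 1)

seeks = count_seeks([7, 3, 4, 5, 12, 6])  # blocks 3..7 form one run, 12 another
```

A poor ordering that scatters the same 8 hits over 8 non-adjacent blocks would cost 8 seeks; the Hilbert ordering aims to collapse them into a few runs.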
The Parallel Framework
- A single view is partitioned across p processors
- Partial Hilbert/R-tree indexes are computed locally
- Queries are answered concurrently
- Queries are answered individually or piggybacked
The Virtual Data Cube
- Problem: the full cube is often too large to materialize
- Solution: use surrogate views
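The surrogate idea can be illustrated as follows: when a requested group-by is not materialized, answer it from a materialized ancestor in the lattice by summing out the extra dimensions. A minimal sketch (the function name, dict-based view representation, and toy data are assumptions for this example):

```python
from collections import defaultdict

def rollup(parent, keep):
    """Answer a query on a non-materialized cuboid from a materialized
    surrogate: aggregate away every dimension whose column index is
    not listed in `keep` (sum is distributive, so this is valid)."""
    out = defaultdict(float)
    for key, total in parent.items():
        out[tuple(key[i] for i in keep)] += total
    return dict(out)

# Materialized view AB serves as the surrogate for queries on A.
view_ab = {(1, 1): 2.0, (1, 2): 3.0, (2, 1): 4.0}
view_a = rollup(view_ab, keep=[0])
```

The trade-off is extra aggregation work at query time against the space saved by not materializing the child view, which is why surrogate processing pairs naturally with partial-cube selection.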
Surrogate Processing
Other Issues
- Dimension ordering
- Query piggybacking
- Batch updating
- Managing hierarchies of views
Experimental Results
- Machine:
  - 17-node cluster
  - node = 1.8 GHz Xeon, 1 GB RAM, 2 × 40 GB IDE drives, running Linux
  - interconnect = Intel Fast Ethernet switch
- Test data:
  - 10 dimensions and 1,000,000 records
RCUBE Index Construction
Output: ~640 million rows, 16 GB
Distributed Query Resolution
Test: random queries returning ~15% of the points (10 experiments per data point)
Disk Blocks Retrieved vs. Disk Seeks
Test: random queries returning 5–15% of the points (15 experiments per data point)
Distributed Query Resolution in Surrogate Group-bys
Thank You. Questions?