Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehousing and OLAP (Anindya Datta, Debra VanderMeer, Krithi Ramamritham)


1 Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehousing and OLAP
Anindya Datta, Debra VanderMeer, Krithi Ramamritham
Presented by Ashutosh Joshi

2 Motivation
- OLAP involves efficient retrieval of data from data warehouses for decision-support purposes
- Data warehouses are extremely large, and queries against them are computationally expensive
- A DataIndex is a storage structure that serves as both index and data
- Parallel Star Join (PSJ) is an efficient algorithm for performing star joins in parallel

3 The Road Map
- A physical design principle for exploiting parallelism
- The Parallel Star Join algorithm
- Experimental results

4 The Star Schema
Dimension tables:
PART: PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others ,000
CUSTOMER: CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment ,000
SUPPLIER: SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment ,000
TIME: TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day ,557
Fact table:
SALES: PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment ,000,000

5 A Physical Design Principle
- DataIndexes
  - Serve as both index and data
  - Based on vertical partitioning of tables
  - Two types: Projection Index (PI) and Join Index (JI)

6 Projection Index
Base table:
CustKey  Qty  Discount  ExtPrice
CK1      Q1   D1        E1
CK2      Q2   D2        E2
CK3      Q3   D3        E3
CK4      Q4   D4        E4
Projection indexes (PIs) built from the base table:
PI on CustKey: CK1, CK2, CK3, CK4
PI on Qty: Q1, Q2, Q3, Q4
PI on Discount and ExtPrice: D1E1, D2E2, D3E3, D4E4
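To make the vertical partitioning concrete, here is a minimal Python sketch that builds the three PIs from the base table shown on this slide. The function names `build_projection_indexes` and `reconstruct_row`, and the use of list positions as row identifiers, are illustrative assumptions rather than anything specified in the talk.

```python
# Base table with the placeholder values used on the slide.
base_table = [
    {"CustKey": "CK1", "Qty": "Q1", "Discount": "D1", "ExtPrice": "E1"},
    {"CustKey": "CK2", "Qty": "Q2", "Discount": "D2", "ExtPrice": "E2"},
    {"CustKey": "CK3", "Qty": "Q3", "Discount": "D3", "ExtPrice": "E3"},
    {"CustKey": "CK4", "Qty": "Q4", "Discount": "D4", "ExtPrice": "E4"},
]


def build_projection_indexes(rows, column_groups):
    """Vertically partition rows: one positional list per column group.

    Position in each list plays the role of the row identifier (assumption).
    """
    return {
        group: [tuple(row[col] for col in group) for row in rows]
        for group in column_groups
    }


def reconstruct_row(pis, rid):
    """Rebuild a full row by reading position `rid` from every PI."""
    row = {}
    for group, values in pis.items():
        row.update(zip(group, values[rid]))
    return row


pis = build_projection_indexes(
    base_table, [("CustKey",), ("Qty",), ("Discount", "ExtPrice")]
)
print(pis[("Qty",)])
print(reconstruct_row(pis, 2))
```

Because every PI keeps the original row order, the set of PIs holds all the data of the base table, which is why a PI can act as both index and data.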

7 Join Index
[Figure: a base fact table (columns Tax, Discount, ExtPrice, CustKey) and a base dimension table (columns CustKey, Name, Address). The fact table's CustKey column is stored as a JI holding RIDs (RID1, RID2, RID3) of dimension rows; the other columns of both tables are stored as PIs.]

8 The Principle
- Each foreign key column in the fact table is stored as a Join Index (JI)
- The remaining columns (of both the dimension tables and the fact table) are stored as Projection Indexes (PIs)
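The principle can be sketched in a few lines of Python. The sample rows, the helper names `build_join_index` and `build_pi`, and the zero-based RIDs are assumptions made for illustration (the slides number RIDs from 1).

```python
# Hypothetical dimension and fact rows, using slide-style placeholder values.
dimension_customer = [
    {"CustKey": "CK1", "Name": "N1", "Address": "A1"},
    {"CustKey": "CK2", "Name": "N2", "Address": "A2"},
    {"CustKey": "CK3", "Name": "N3", "Address": "A3"},
]
fact_sales = [
    {"CustKey": "CK2", "Tax": "T1"},
    {"CustKey": "CK1", "Tax": "T2"},
    {"CustKey": "CK3", "Tax": "T3"},
    {"CustKey": "CK1", "Tax": "T4"},
]


def build_join_index(fact_rows, dim_rows, key):
    """Replace each foreign key value by the RID of the matching dimension row."""
    key_to_rid = {row[key]: rid for rid, row in enumerate(dim_rows)}
    return [key_to_rid[row[key]] for row in fact_rows]


def build_pi(rows, column):
    """Store a single non-key column as a positional projection index."""
    return [row[column] for row in rows]


ji_custkey = build_join_index(fact_sales, dimension_customer, "CustKey")
pi_tax = build_pi(fact_sales, "Tax")
pi_name = build_pi(dimension_customer, "Name")

# Joining fact row 0 to its customer is a positional lookup, not a key search.
print(ji_custkey)
print(pi_name[ji_custkey[0]])
```

With this layout the join between a fact row and its dimension row reduces to reading one JI entry and using it as a position in the dimension PI.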

9 Parallel Star Join
- Data placement strategy
  - Based on a shared-nothing architecture with N processors
  - Assume a d-dimensional data warehouse
  - Partition the N processors into d+1 groups
  - Assign to each group j the dimension table D_j and J_j, the corresponding fact table join index
  - Assign the metric PIs to group d+1

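10 Processor Group Partitioning
- The number of processors in a group is governed by the size of dimension table D_j
- Size of the j-th processor group: [formula not preserved in the transcript]
- Size of the metric group: [formula not preserved in the transcript]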
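The exact group-size formulas did not survive in this transcript, so the following Python sketch is only one plausible reading: it splits N processors among the d+1 groups in proportion to a per-group weight (for example, the data volume each group must hold). The function `partition_processors`, the proportional rule, and the largest-remainder rounding are all assumptions, not the authors' formulas.

```python
def partition_processors(n_processors, group_weights):
    """Split processors among groups in proportion to group_weights.

    ASSUMPTION: proportional allocation with largest-remainder rounding.
    Every group receives at least one processor.
    """
    n_groups = len(group_weights)
    if n_processors < n_groups:
        raise ValueError("need at least one processor per group")
    total = sum(group_weights)
    spare = n_processors - n_groups  # processors left after one per group
    ideal = [spare * w / total for w in group_weights]
    sizes = [1 + int(share) for share in ideal]
    leftover = n_processors - sum(sizes)
    # Hand the remaining processors to the groups with the largest remainders.
    by_remainder = sorted(
        range(n_groups), key=lambda i: ideal[i] - int(ideal[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Three dimension groups plus one metric group, with made-up weights.
print(partition_processors(10, [1, 1, 2, 4]))
```

The weights in the usage line are invented; in practice they would come from the sizes of the dimension tables, join indexes, and metric PIs.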

11 Physical Data Placement
- Horizontally partition JIs across all processors
- Replicate PIs on all processors
- Use a round-robin strategy for partitioning JIs
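Round-robin horizontal partitioning can be illustrated with a short Python sketch. The function name `round_robin_partition` and the choice to keep each entry's original RID alongside its value are assumptions made so that fragments can later be mapped back to fact rows.

```python
def round_robin_partition(values, n_processors):
    """Deal entries out to processors in turn, keeping each entry's RID."""
    fragments = [[] for _ in range(n_processors)]
    for rid, value in enumerate(values):
        fragments[rid % n_processors].append((rid, value))
    return fragments


# A hypothetical join index of five entries split across two processors.
ji = ["a", "b", "c", "d", "e"]
for processor, fragment in enumerate(round_robin_partition(ji, 2)):
    print(processor, fragment)
```

Round-robin placement gives every processor nearly the same number of entries regardless of the data values, which keeps scan work balanced.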

12 The Parallel Star Join Algorithm
- A general k-dimensional star join query:
  SELECT A^d_P, A^m_P FROM F, D_1, …, D_k WHERE P_join AND P_select
- The algorithm has three phases
  - Local rowset generation
  - Global rowset synthesis
  - Output preparation

13 Local Rowset Generation
- Load PI fragment
[Figure: processors P_1, P_2, …, P_c each load a PI fragment and apply a selection predicate (shown as "Qty >", value not preserved) to produce a rowset fragment.]

14 Local Rowset Generation (contd)
- Merge dimension rowset fragments
- Distribute dimension rowset
[Figure: the rowset fragments from processors P_1 to P_4 are combined with OR into the dimension rowset R_dim,i.]

15 Local Rowset Generation (contd)
- Load JI fragment
- Merge partial fact rowsets
[Figure: the dimension rowset R_dim,i is applied to the RIDs in JI_i to produce the local fact rowset R_fact,i.]
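The local rowset generation steps for one processor group can be sketched in Python as follows. This is a single-process simulation under stated assumptions: rowsets are modelled as Python sets of RIDs, each fragment is a list of (RID, value) pairs, and the name `local_rowset` and the sample data are hypothetical. In a real shared-nothing system each fragment would be scanned on its own processor.

```python
def local_rowset(pi_fragments, ji_fragments, predicate):
    """Simulate local rowset generation for one dimension group.

    pi_fragments: per-processor lists of (dim_rid, attribute_value)
    ji_fragments: per-processor lists of (fact_rid, dim_rid)
    """
    # Each processor scans its PI fragment and keeps the matching dimension RIDs.
    dim_fragments = [
        {rid for rid, value in fragment if predicate(value)}
        for fragment in pi_fragments
    ]
    # Merge the fragments with OR (set union) to obtain R_dim,i.
    r_dim = set().union(*dim_fragments)
    # Each processor scans its JI fragment against the distributed R_dim,i.
    partial_fact = [
        {fact_rid for fact_rid, dim_rid in fragment if dim_rid in r_dim}
        for fragment in ji_fragments
    ]
    # Merge the partial fact rowsets to obtain R_fact,i.
    r_fact = set().union(*partial_fact)
    return r_dim, r_fact


# Hypothetical data: a Nation PI and a join index, each split over two processors.
nation_pi = [[(0, "FRANCE"), (2, "INDIA")], [(1, "INDIA")]]
customer_ji = [[(0, 1), (2, 0)], [(1, 2), (3, 1)]]
r_dim, r_fact = local_rowset(nation_pi, customer_ji, lambda v: v == "INDIA")
print(r_dim, r_fact)
```

A production implementation would more likely use bitmaps than sets, since OR and AND over bitmaps are cheap and the rowsets are dense ranges of RIDs.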

16 Global Rowset Synthesis
- Merge local fact rowsets
- Distribute the global rowset to the groups participating in the output phase
[Figure: the local fact rowsets R_fact,1, R_fact,2, … from groups G_1 to G_4 are combined with AND into R_global.]
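Merging with AND corresponds to intersecting the local fact rowsets, since a fact row qualifies only if it satisfies the restriction on every participating dimension. A minimal Python sketch, assuming rowsets are modelled as sets of fact RIDs and using the hypothetical name `synthesize_global_rowset`:

```python
def synthesize_global_rowset(local_fact_rowsets):
    """AND together the local fact rowsets produced by each group."""
    if not local_fact_rowsets:
        raise ValueError("at least one local fact rowset is required")
    return set.intersection(*local_fact_rowsets)


# Hypothetical local fact rowsets from three dimension groups.
r_global = synthesize_global_rowset([{0, 1, 3}, {1, 2, 3}, {1, 3, 4}])
print(sorted(r_global))
```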

17 Output Preparation
- Distribute the global rowset to individual processors
- Load the PI columns necessary for output
- Merge output
[Figure: R_global is applied to JI_i (RIDs RID1, RID2, RID3) and PI_i (CustKey: CK1, CK2, CK3, CK4) to produce the output CK1, CK2.]
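Output preparation can be sketched in Python as a positional lookup: fact columns are read directly from their PIs, while dimension columns are reached through the JI. The function `prepare_output`, its argument layout, and the sample values are assumptions for illustration, and the per-processor distribution and merge are collapsed into one loop.

```python
def prepare_output(r_global, fact_pis, dim_outputs):
    """Materialise output rows for the fact RIDs in the global rowset.

    fact_pis:    {column: PI list indexed by fact RID}
    dim_outputs: {column: (JI list indexed by fact RID,
                           dimension PI list indexed by dimension RID)}
    """
    rows = []
    for rid in sorted(r_global):
        # Fact columns: read position `rid` of each PI.
        row = {column: pi[rid] for column, pi in fact_pis.items()}
        # Dimension columns: follow the JI entry to the dimension PI.
        for column, (ji, dim_pi) in dim_outputs.items():
            row[column] = dim_pi[ji[rid]]
        rows.append(row)
    return rows


# Hypothetical inputs: four fact rows, three customers.
output = prepare_output(
    r_global={1, 3},
    fact_pis={"Qty": [5, 7, 9, 11]},
    dim_outputs={"CustKey": ([1, 0, 2, 1], ["CK1", "CK2", "CK3"])},
)
print(output)
```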

18 Performance Comparison
- The PSJ algorithm was compared with the Bitmapped Join Index (BJI) algorithm and the pipelined hash join (HASH) algorithm
- Two performance metrics were used
  - Response time in block accesses (RTBA)
  - Aggregate Data Transmission (ADT)

19 Scalability Experiments
- The curves rise as the scale factor and number of processors increase
- PSJ cost is much lower than BJI and HASH costs
- At large memory sizes, PSJ approaches "near-perfect" scalability

20 Scalability Experiments (contd)
- Transmission costs for PSJ and BJI are the same
- Both curves exhibit imperfect scalability
- HASH has substantially higher transmission costs than PSJ

21 Conclusion
- DataIndexes are a physical design strategy that provides efficient partitioning of the schema
- The Parallel Star Join algorithm provides a means to perform star joins in parallel
- The PSJ algorithm performs better than the BJI and HASH algorithms in terms of I/O and transmission costs

