A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek.


1 A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek (Aster Data)

2 Highly Concurrent Data Warehouses
Data analytics is a core service of any DW.
High query concurrency is becoming important.
At the same time, customers need predictability.
– Requirement of an actual customer: increasing concurrency from one query to 40 should not increase latency by more than 6x.

3 Shortcoming of Existing Systems
DWs employ the query-at-a-time model.
– Each query executes as a separate physical plan.
Result: Concurrent plans contend for resources.
This creates a situation of “workload fear”.

4 Our Contribution: CJOIN
A novel physical operator for star queries.
– Star queries arise frequently in ad-hoc analytics.
Main ideas:
– A single physical plan for all concurrent queries.
– The plan is always “on”.
– Deep work sharing: I/O, join processing, storage.

5 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

6 Setting
We assume a star-schema DW.
We target the class of star queries.
Goal: execute concurrent star queries efficiently.
– Low latency.
– Graceful scale-up.

7 Further Assumptions
The fact table is too large to fit in main memory.
Dimension tables are “small”.
– Example from TPC-DS: 2.5GB of dimension data for a 1TB warehouse.
Indices and materialized views may exist.
The workload is volatile.

8 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

9 Design Overview
[Diagram labels: Query Stream; Star Queries; Other Queries; CJOIN (Preprocessor, Filter, Filter, Distributor); Optimizer; Conventional Query Processor]

10 Running Example
Schema: Fact Table F (measure m), Dimension X, Dimension Y.
Queries:
– Q1: select COUNT(*) from F join X join Y where φ1(X) and ψ1(Y)
– Q2: select SUM(F.m) from F join Y where ψ2(Y) (slide overlay: join X and TRUE(X))

11 The CJOIN Operator
[Diagram labels: Fact Table F, Continuous Scan, Preprocessor, Filter, Filter, Distributor, COUNT (Q1), SUM (Q2)]

12 The CJOIN Operator
[Diagram labels: Fact Table F, Continuous Scan, Preprocessor, Filter with Hash Table X (Dimension X), Filter with Hash Table Y (Dimension Y), Distributor, COUNT (Q1), SUM (Q2). Hash-table entries carry one bit per query (Q1 Q2): 11 and 01 in Hash Table X, 10 in Hash Table Y. Dimension regions: Q1 ∧ ¬Q2, ¬Q1 ∧ Q2, Q1 ∧ Q2. Query Start: Q1: a, Q2: b.]

13 Processing Fact Tuples
[Diagram labels: same pipeline as slide 12; fact tuples from the Continuous Scan are shown with Q1 Q2 bits as they pass through the Filters. Hash Table X entries: 11, 01. Hash Table Y entries: 10, 00. Query Start: Q1: a, Q2: b.]
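
The diagram on slides 12 and 13 suggests that each dimension hash table stores one bit per registered query, and that a fact tuple survives only while at least one query still accepts it. The following Python sketch illustrates that idea under stated assumptions; it is not the authors' implementation. The names `Filter`, `cjoin`, `default`, the column names `x`, `y`, `m`, and all data values are hypothetical. The `default` mask (used when a dimension key is not stored) is an assumption about how a query with a TRUE predicate on a dimension would be handled.

```python
from dataclasses import dataclass


@dataclass
class Filter:
    """Hash table for one dimension: join key -> bitmask of accepting queries."""
    fk: str           # foreign-key column in the fact tuple (hypothetical name)
    table: dict       # dimension key -> bitmask, one bit per query
    default: int = 0  # assumed mask for keys that are not stored in the table

    def apply(self, fact, mask):
        # AND the tuple's mask with the bits of the matching dimension tuple.
        return mask & self.table.get(fact[self.fk], self.default)


def cjoin(fact_rows, filters, n_queries):
    """Yield (fact, mask) for every fact tuple still relevant to some query."""
    all_queries = (1 << n_queries) - 1
    for fact in fact_rows:              # stands in for the continuous scan
        mask = all_queries              # Preprocessor: all query bits set
        for f in filters:
            mask = f.apply(fact, mask)
            if mask == 0:               # no query wants this tuple: drop it
                break
        if mask:
            yield fact, mask            # handed to the Distributor


Q1, Q2 = 0b01, 0b10  # one bit per query

# Q2 has no predicate on X (TRUE(X)), so its bit is set for every X key.
filter_x = Filter("x", {"x1": Q1 | Q2, "x2": Q2}, default=Q2)
filter_y = Filter("y", {"y1": Q1, "y2": Q2})

facts = [
    {"x": "x1", "y": "y1", "m": 5},
    {"x": "x2", "y": "y2", "m": 7},
    {"x": "x1", "y": "y2", "m": 3},
    {"x": "x2", "y": "y1", "m": 4},
    {"x": "x1", "y": "y3", "m": 9},
]

# Distributor: route each surviving tuple to the aggregates whose bit is set.
count_q1, sum_q2 = 0, 0
for fact, mask in cjoin(facts, [filter_x, filter_y], n_queries=2):
    if mask & Q1:
        count_q1 += 1
    if mask & Q2:
        sum_q2 += fact["m"]
```

The point of the design is that one pass over the fact table, with one hash probe per dimension, serves every registered query at once; adding a query widens the bitmasks rather than adding another plan.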

14 Registering New Queries
New query Q3: select AVG(F.m) from F join X where φ3(X) (slide overlay: join Y and TRUE(Y))
Dimension lookup for Q3: select * from X where φ3(X)
[Diagram labels: same pipeline as slide 12, with a Q3 bit shown next to Q1 Q2 in the hash tables.]

15 Registering New Queries
Q3: select AVG(F.m) from F join X where φ3(X) (slide overlay: join Y and TRUE(Y))
[Diagram labels: same pipeline as slide 12 with bits for Q1 Q2 Q3; Begin Q3; new output AVG (Q3); Query Start: Q1: a, Q2: b, Q3: c.]
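
Slides 14 and 15 show a new query Q3 being admitted while the scan is running: its dimension predicate is evaluated (`select * from X where φ3(X)`) and a Q3 bit appears in the hash tables. A minimal Python sketch of such a registration step follows, assuming bitmask hash tables. The function name `register_query`, the `defaults` map, the `region` attribute, and all data are hypothetical and chosen only for illustration.

```python
def register_query(hash_tables, defaults, dim_rows, predicates, bit):
    """Add one query's bit to every dimension hash table (illustrative sketch).

    hash_tables: dimension name -> {key: bitmask}
    defaults:    dimension name -> bitmask assumed for keys not in the table
    dim_rows:    dimension name -> {key: row dict}
    predicates:  dimension name -> callable(row) -> bool, only for the
                 dimensions the new query actually filters on
    bit:         index of the bit assigned to the new query
    """
    flag = 1 << bit
    for dim, table in hash_tables.items():
        pred = predicates.get(dim)
        if pred is None:
            # Implicit TRUE predicate: every tuple of this dimension qualifies.
            for key in table:
                table[key] |= flag
            defaults[dim] |= flag
        else:
            for key, row in dim_rows[dim].items():
                if pred(row):
                    # A newly stored key starts from the dimension's default mask.
                    table[key] = table.get(key, defaults[dim]) | flag


# State after Q1 (bit 0) and Q2 (bit 1) are registered; values are made up.
hash_tables = {
    "X": {"x1": 0b011, "x2": 0b010},
    "Y": {"y1": 0b001, "y2": 0b010},
}
defaults = {"X": 0b010, "Y": 0b000}
dim_rows = {
    "X": {
        "x1": {"region": "US"},
        "x2": {"region": "EU"},
        "x3": {"region": "EU"},
    },
    "Y": {"y1": {}, "y2": {}},
}

# Q3 filters only on X; the lambda stands in for φ3.
register_query(
    hash_tables,
    defaults,
    dim_rows,
    predicates={"X": lambda row: row["region"] == "EU"},
    bit=2,
)
```

In this sketch, registration touches only the small dimension tables; the fact-table scan itself is never restarted.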

16 Properties of CJOIN Processing
CJOIN enables a deep form of work sharing:
– Join computation.
– Tuple storage.
– I/O.
Computational cost per tuple is low.
– Hence, CJOIN can sustain a high I/O throughput.
Predictable query latency.
– Continuous scan can provide a progress indicator.
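
One way to read the progress-indicator point: if a query is registered when the circular scan is at some position, it has seen the whole fact table once the scan wraps around to that position again. The sketch below shows the arithmetic under that assumption; the function names, the page-based positions, and the constant scan rate are hypothetical simplifications.

```python
def scan_progress(start_page, current_page, total_pages):
    """Fraction of the fact table already seen by a query.

    The query was registered when the circular scan stood at start_page;
    the scan is now about to read current_page. The case
    current_page == start_page is treated as "just registered" (0.0).
    """
    return ((current_page - start_page) % total_pages) / total_pages


def remaining_seconds(start_page, current_page, total_pages, pages_per_second):
    """Estimated time to completion, assuming a constant scan rate."""
    done = (current_page - start_page) % total_pages
    return (total_pages - done) / pages_per_second


# A query that started at page 70 of 100, with the scan now at page 20,
# has seen pages 70..99 and 0..19.
progress = scan_progress(70, 20, 100)
eta = remaining_seconds(70, 20, 100, pages_per_second=25.0)
```

Under these assumptions, the estimate depends only on the scan position and rate, not on which other queries are running.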

17 Other Details (in the paper)
Run-time optimization of Filter ordering.
Updates.
Implementation on multi-core systems.
Extensions:
– Column stores.
– Fact table partitioning.
– Galaxy schemata.
[Diagram labels: Preprocessor, Filter × n, Distributor]

18 Outline
Preliminaries
The CJOIN operator
Experimental study
Conclusions

19 Experimental Methodology
Systems:
– CJOIN prototype on top of Postgres.
– Postgres with shared scans enabled.
– Commercial system X.
We use the Star Schema Benchmark (SSB).
– Scale factor = 100 (100GB of data).
– Workload comprises parameterized SSB queries.
Hardware:
– Quad-core Intel Xeon.
– 8GB of shared RAM.
– RAID-5 array of four 15K RPM SAS disks.

20 Effect of Concurrency
Throughput increases with more concurrent queries.

21 Response Time Predictability
Query latency is predictable; no more workload fear.

22 Influence of Data Scale
CJOIN is effective even for small data sets.
Concurrency level: 128

23 Related Work
Materialized views [R+95, HRU96].
Multiple-query optimization [T88].
Work sharing:
– Staged DBs [HSA05].
– Scan sharing [F94, Z+07, Q+08].
– Aggregation [CR07].
BLINK [R+08].
Streaming database systems [M+02, B+04].

24 Conclusions
High query concurrency is crucial for DWs.
Query-at-a-time leads to poor performance.
Our solution: CJOIN.
– Target: Class of star queries.
– Deep work sharing: I/O, join, tuple storage.
– Efficient realization on multi-core architectures.
Experiments show an order-of-magnitude improvement over a commercial system.

25 THANK YOU!

