A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees Shimin Chen* Phillip B. Gibbons* Suman Nath + *Intel Labs Pittsburgh.


1 PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees
Shimin Chen*, Phillip B. Gibbons*, Suman Nath+
*Intel Labs Pittsburgh, +Microsoft Research

2 Online Aggregation
Data warehouse and business intelligence
– Fast growing multi-billion dollar market
Interactive ad-hoc queries on big data
– Important for detecting new trends
– Fast response time hard to achieve
One promising approach: Online Aggregation (OLA) [Hellerstein et al. 97]
– Provides early representative results for aggregate queries (sum, count, avg, etc.)
– For example, “average is 123.4 ± 5.6 with 95% confidence”
Essential to OLA: non-blocking join algorithm
PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate, Shimin Chen, Phillip B. Gibbons, Suman Nath
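To make the “123.4 ± 5.6 with 95% confidence” style of answer concrete, here is a minimal sketch of a running average with a normal-approximation confidence interval. It covers only a single-table average over records seen in random order, not the join estimators used in the paper; the function name `running_estimate`, the synthetic data, and the choice of z = 1.96 are assumptions made for illustration.

```python
import math
import random

def running_estimate(values, z=1.96):
    """Return (mean, half_width) of a CLT-based confidence interval.

    Assumes `values` is a uniform random sample of the column; z=1.96
    corresponds to roughly 95% confidence under the normal approximation.
    """
    n = len(values)
    mean = sum(values) / n
    # Unbiased sample variance (needs at least two values).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, z * math.sqrt(var / n)

random.seed(0)
# Synthetic column, already in random order (the OLA assumption).
data = [random.gauss(123.0, 50.0) for _ in range(100_000)]
for seen in (100, 1_000, 10_000):
    mean, half = running_estimate(data[:seen])
    print(f"after {seen:>6} records: average is {mean:.1f} ± {half:.1f}")
```

The interval narrows as more records are seen, which is what lets a user stop the query once the estimate is accurate enough.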

3 Non-Blocking Join for OLA
OLA assumption: relations are in random order
[Diagram: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back; estimates are based on current results]

4 Design Goals of Non-Blocking Joins
User may find:
– Wrong query: stop early
– Accurate enough: stop early
– Slow convergence: wait longer (high variance, high selectivity, high group counts, data skews, …)
– Need the full, accurate result: finish query
Design goals:
– Fast, representative early results
– Good end-to-end performance

5 Two Metrics in Algorithm Analysis
Good end-to-end performance: Total I/Os
Fast early results: Result Rate = (Newly covered area × selectivity) / (I/Os for covering the new area)
Join: check all pairs of records from A and B
Early: before completely reading A and B
[Diagram: join space with records from A on one axis and records from B on the other; the newly covered region is labeled "new"]
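A small worked example of the result rate metric may help. The sketch below assumes the covered area is counted in record pairs and I/Os in pages; the function name `result_rate` and all numbers (records read, selectivity, page count) are hypothetical, chosen only to show the arithmetic.

```python
def result_rate(new_pairs, selectivity, ios):
    """Expected join results produced per I/O while covering a new area."""
    return new_pairs * selectivity / ios

# Hypothetical step: 1,000 records of each relation are already covered,
# and the next step reads 100 more records from each.
old, new = 1_000, 100
new_pairs = (old + new) ** 2 - old ** 2   # newly covered area, in record pairs
selectivity = 0.001                       # assumed fraction of pairs that match
ios = 40                                  # assumed page I/Os spent on this step

print(f"newly covered pairs: {new_pairs}")
print(f"result rate: {result_rate(new_pairs, selectivity, ios):.2f} results per I/O")
```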

6 Design Space
[Diagram: algorithms plotted by total I/O cost (low to high) against early representative result rate (low to high). Shown: Ripple [Haas & Hellerstein’99], Hash Ripple [Luo, et al’02], SMS [Jermaine, et al’05], DBO [Jermaine, et al’07], GRACE [Kitsuregawa, et al ’83], the ideal point, and the region PR-Join targets]

7 Performance Result Preview
– Near-optimal total I/O cost
– Higher early result rate

8 Outline
Introduction
PR-Join (Partitioned expanding Ripple Join) Algorithm
Evaluation
Conclusion

9 Background: Ripple Join [Haas & Hellerstein’99]
For each ripple:
– Read new records from A and B; check for matches
– Read spilled records; check for matches with new records
– Spill new to disk
[Diagram: join space of records from A and records from B, with "spilled" and "new" regions marked on each axis]
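The three per-ripple steps can be sketched in a few lines. This is an in-memory illustration only: Python lists stand in for disk, records are assumed to be (key, payload) tuples, the join is assumed to be an equi-join on the key, and the name `ripple_join` and its `width` parameter are hypothetical.

```python
def ripple_join(a, b, width):
    """Yield (ripple_number, matches) for an equi-join of record lists a and b.

    `spilled_a` and `spilled_b` model temporary storage; `width` is the
    number of new records taken from each relation per ripple.
    """
    spilled_a, spilled_b = [], []
    pos = 0
    ripple = 0
    while pos < max(len(a), len(b)):
        ripple += 1
        new_a = a[pos:pos + width]
        new_b = b[pos:pos + width]
        pos += width
        # Step 1: new records from A against new records from B.
        matches = [(x, y) for x in new_a for y in new_b if x[0] == y[0]]
        # Step 2: spilled records against the new records of the other relation.
        matches += [(x, y) for x in spilled_a for y in new_b if x[0] == y[0]]
        matches += [(x, y) for x in new_a for y in spilled_b if x[0] == y[0]]
        # Step 3: spill the new records.
        spilled_a += new_a
        spilled_b += new_b
        yield ripple, matches

a = [(i % 5, f"a{i}") for i in range(20)]
b = [(i % 5, f"b{i}") for i in range(12)]
seen = 0
for ripple, matches in ripple_join(a, b, width=4):
    seen += len(matches)
    print(f"ripple {ripple}: {len(matches)} new results, {seen} so far")
```

Every pair of records is checked exactly once: in the ripple where the later of the two records arrives.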

10 Observations of Ripple Join
Total I/Os: O(N^2)
– N = total # of input pages in A and B
– I/Os of ripples form an arithmetic series
Result rate of a ripple is higher if the ripple is wider
– Increase ripple width
But ripple width is limited by the memory size
Result Rate = (Newly covered area × selectivity) / (I/Os for covering the new area)
– Newly covered area: super linear growth
– I/Os for covering the new area: grows linearly
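The quadratic total can be checked with a short derivation under a simplified cost model that is an assumption here, not taken from the slides: each relation has n = N/2 pages, every ripple has a fixed width of w pages per relation, so there are K = n/w ripples, and ripple k reads 2w new pages, rereads the 2w(k-1) pages spilled so far, and writes 2w pages.

```latex
\[
\mathrm{IO}_k = 2w + 2w(k-1) + 2w = 2w(k+1),
\qquad
\sum_{k=1}^{K} 2w(k+1) = wK(K+3) = \frac{n^2}{w} + 3n .
\]
```

The per-ripple cost grows by a constant 2w each time (an arithmetic series), and for a fixed w the sum is O(n^2) = O(N^2).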

11 PR-Join Idea 1: Multiplicatively Expanding Ripples
Total I/Os: O(N), linear
– I/Os of ripples form a geometric series
Higher result rate:
– Wider ripple leads to higher result rate
– But must overcome memory size limitation
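The effect of multiplicative expansion on total I/O can be illustrated with a toy cost model. Everything in this sketch is an assumption for illustration: each ripple reads its new pages from both relations, rereads every spilled page, and spills the new pages; the memory limit is ignored; and the helper names (`total_ios`, `fixed_widths`, `expanding_widths`) and the doubling factor are hypothetical.

```python
def total_ios(n_pages, widths):
    """Total page I/Os for a sequence of ripples under a toy cost model.

    n_pages: pages per relation. widths: iterable of ripple widths in
    pages per relation.
    """
    spilled = 0  # pages per relation already on temporary storage
    ios = 0
    for w in widths:
        w = min(w, n_pages - spilled)
        if w <= 0:
            break
        ios += 2 * w        # read new pages from A and B
        ios += 2 * spilled  # read back spilled pages of A and B
        ios += 2 * w        # spill the new pages
        spilled += w
    return ios

def fixed_widths(w):
    """Constant ripple width, as in the basic ripple join."""
    while True:
        yield w

def expanding_widths(first, factor=2):
    """Widths chosen so the covered prefix grows by `factor` each ripple."""
    covered = first
    yield first
    while True:
        nxt = covered * factor
        yield nxt - covered
        covered = nxt

for n in (1_000, 2_000, 4_000):
    print(n, total_ios(n, fixed_widths(10)), total_ios(n, expanding_widths(10)))
```

Doubling the input roughly quadruples the fixed-width total but only doubles the expanding total, because the reread costs form a geometric series dominated by its last term.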

12 PR-Join Idea 2: Hash Partitioning
Each partition < memory
Every join invocation performs a ripple on a partition
– Estimation is updated after every join invocation
– Much faster user responses
– Statistically sound
[Diagram: join space partitioned on the join key, with regions labeled "empty"]
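A minimal sketch of partitioning on the join key follows. It is a simplification: each invocation joins a whole pair of co-partitions in memory rather than performing a ripple on a partition as PR-Join does, and the names `hash_partition` and `join_partition`, the partition count, and the sample data are assumptions. The point it shows is that matches can only occur between partitions with the same index, so results can be reported after every invocation.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Split (key, payload) records into buckets by hash of the join key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[0]) % num_partitions].append(rec)
    return parts

def join_partition(part_a, part_b):
    """Equi-join one pair of co-partitions using an in-memory hash table."""
    table = defaultdict(list)
    for rec in part_a:
        table[rec[0]].append(rec)
    return [(x, y) for y in part_b for x in table.get(y[0], [])]

a = [(i % 7, f"a{i}") for i in range(50)]
b = [(i % 5, f"b{i}") for i in range(30)]
parts_a, parts_b = hash_partition(a, 4), hash_partition(b, 4)

total = 0
for p, (pa, pb) in enumerate(zip(parts_a, parts_b)):
    out = join_partition(pa, pb)
    total += len(out)  # a running estimate could be refreshed here
    print(f"partition {p}: {len(out)} results, {total} so far")
```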

13 Statistical Guarantees
Idea: hash partitioning → disjoint sub-spaces
– Stratified sampling in statistics
Statistical estimate:
1) Ripple join formula for every partition
2) Stratified sampling formula to combine estimates from partitioned ripples
[Diagram: join space partitioned on the join key, with regions labeled "empty"]
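Step 2 can be sketched for a SUM or COUNT query using the textbook stratified estimator for a total. This is not the paper's exact formula: it assumes the per-partition estimates and their variances have already been produced by step 1 (not reproduced here), that the partitions are disjoint and sampled independently, and that a normal approximation is acceptable. The function name `combine_strata` and the input numbers are hypothetical.

```python
import math

def combine_strata(estimates, variances, z=1.96):
    """Combine per-partition estimates of a total into one interval.

    estimates[h] and variances[h] are the estimate of the partial total in
    partition h and the variance of that estimate. With disjoint,
    independently sampled partitions, totals and variances both add.
    """
    total = sum(estimates)
    half_width = z * math.sqrt(sum(variances))
    return total, half_width

# Hypothetical per-partition outputs of step 1.
total, half = combine_strata([120.0, 95.0, 140.0], [16.0, 9.0, 25.0])
print(f"estimated total: {total:.1f} ± {half:.1f} (about 95% confidence)")
```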

14 Comparing Analytical Performance
Algorithm        Early Result Rate
Symmetric Hash   1 (when data fit in memory)
Hash Ripple      0.5
SMS              0.6
Two-Way DBO      1.2
Ripple           1, 1.25, 1.40, 1.50, …, → 2
PR-Join          1, 1.7, 3.2, 6.2, 12.2, … …
(Parameter setting details in paper)

15 Outline
Introduction
PR-Join Algorithm
Evaluation
Conclusion

16 Non-Blocking Join for OLA
[Diagram: Relation A and Relation B (hard disks) are read into main memory; records are spilled to temporary storage (hard disk or SSD) and read back; estimates are based on current results]

17 Disk as Temp Storage
10GB joins 10GB; 500MB memory
PR-Join achieves much better end-to-end performance than Ripple Join

18 Marginal Result Rate
Disk as temp storage
PR-Join achieves an order of magnitude higher result rate than Ripple Join

19 SSD as Temp Storage
10GB joins 10GB; 500MB memory
Using SSD, PR-Join achieves near optimal I/O costs
Temp I/Os are almost completely overlapped with I/Os to read input

20 More Details in Paper
Joining finite data streams:
– PR-Join can be easily used for joining finite data streams
– Compared with state-of-the-art algorithm (RPJ [Tao et al.’05])
– PR-Join achieves better performance
Analysis of non-blocking join algorithms for OLA
PR-Join parameter choices
Handling skews
More experimental results
(see us at the plenary session)

21 Conclusions
In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)
PR-Join for Online Aggregation:
– Provides statistical guarantee
– An order of magnitude higher result rate than prior approach
– Near optimal total I/O cost
PR-Join for finite data streams:
– Better performance than state-of-the-art algorithm

22 Thank you!
shimin.chen@intel.com

