1 Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton

2 Overview This paper shows how to join several tables and compute SUM, COUNT, or AVG aggregates (optionally with GROUP BY clauses) while displaying approximate results immediately, along with a confidence interval computed from the first few tuples retrieved, and updating a GUI display with a tighter approximation as the join processes more tuples.

3 Ripple joins compared to our previous topics
 General research area: algorithms
 another approximation algorithm
 online processing
 does not maintain a sample set
 aggregate queries: joins and group-by
 requires random retrieval
 uses probabilistic calculations to determine the quality of the approximate result
 not optimizing
 implemented as middleware on the DBMS

4 Traditional Hash Join stores the smaller relation in memory. Two relations R and S share a common attribute: for each distinct value of that attribute, match up the tuples that have the same value.

Example:
select R.roomnumber, COUNT(S.homeroom)
from Rooms R join Student S on R.roomnumber = S.homeroom

For each tuple r in R:
    add r to the in-memory hash table under hash(roomnumber)
    if the hash table has filled memory:
        for each tuple s in S:
            if hash(homeroom) is found in the hash table:
                add each matching (r, s) pair to the output
        reset the hash table
Finally, scan S and add the remaining join tuples to the output.
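The build/probe structure above can be sketched in a few lines of Python. This is a minimal in-memory version of the slide's example query (table contents are made up for illustration, and the memory-overflow spilling step is omitted):

```python
from collections import defaultdict

# Example query from the slide:
#   select R.roomnumber, COUNT(S.homeroom)
#   from Rooms R join Student S on R.roomnumber = S.homeroom
rooms = [{"roomnumber": 101}, {"roomnumber": 102}, {"roomnumber": 103}]
students = [
    {"name": "Ann", "homeroom": 101},
    {"name": "Bob", "homeroom": 101},
    {"name": "Cho", "homeroom": 103},
]

# Build phase: hash the smaller relation (Rooms) on the join key.
table = {r["roomnumber"]: r for r in rooms}

# Probe phase: scan Students once, counting matches per room.
counts = defaultdict(int)
for s in students:
    if s["homeroom"] in table:
        counts[s["homeroom"]] += 1

print(dict(counts))  # {101: 2, 103: 1}
```

Note that the whole of Students must be scanned before any output is final, which is exactly the blocking behavior the next slide contrasts with ripple join.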

5 What's different about ripple join? Traditional hash join blocks until the entire query output is finished; in a nested-loops join, the entire inner table is scanned for each outer tuple. Ripple join instead expands the sample set incrementally, reports approximate results after each sampling step, and allows user intervention.

6 The most important difference The tuples are processed in random order.

7 Pipelining
 In pipelining join algorithms, more and more information is added to the result as the join progresses.
 In ripple joins, each new tuple is joined with all previously seen tuples of the other operand(s).
 The relative rates of the two (or more) operands are dynamically adjusted.

8 Worst-case scenario Ripple join reduces to a nested loop join.

9 The relations do not have to be of roughly equal size. Aspect ratio: how many tuples are retrieved from each base relation per sampling step, e.g. β1 = 1, β2 = 3, … Ripple join adjusts the aspect ratio according to the sizes of the base relations.

10 Rectangular version
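The rectangular picture can be sketched as a toy Python loop. The tables, key values, and aspect ratio below are made up, and the paper's version also maintains running confidence intervals; this only shows how each sampling step joins the newly retrieved tuples against everything seen so far:

```python
import random

# Toy rectangular ripple join estimating COUNT(*) for R join S on
# equal keys; tuples are assumed to arrive in random order.
random.seed(7)
R = [random.randrange(5) for _ in range(300)]   # join-key column of R
S = [random.randrange(5) for _ in range(300)]   # join-key column of S
beta1, beta2 = 1, 3      # aspect ratio: tuples fetched per sampling step

matches = 0
nr = ns = 0              # tuples of R and S seen so far
for step in range(100):
    new_r = R[nr:nr + beta1]
    new_s = S[ns:ns + beta2]
    # New R tuples join against all S tuples seen so far (including the
    # new ones); new S tuples join against previously seen R tuples.
    for r in new_r:
        matches += sum(r == s for s in S[:ns + beta2])
    for s in new_s:
        matches += sum(s == r for r in R[:nr])
    nr += beta1
    ns += beta2
    # Scale the sampled count up to an estimate of the full join size.
    estimate = matches * (len(R) * len(S)) / (nr * ns)

print(matches, round(estimate, 1))
```

After each step the sampled region grows from an nr-by-ns rectangle to an (nr+β1)-by-(ns+β2) rectangle, and `estimate` can be reported to the user immediately.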

11 What can the end user control?
 which groups continue to process: any one group can be stopped, and all remaining groups then process faster.
 the speed of the query selection process: what makes the process faster? More tuples are skipped in the aggregation, so the approximation is less accurate and the confidence interval is wider. The end user controls the trade-off between speed and accuracy.

12 GUI, 1999

13 Confidence interval A running confidence interval displays how close this answer is to the final result. This could be calculated in many ways. The authors present an example calculation built on extending the Central Limit Theorem.

14 Central Limit Theorem μ̂n is the estimator for the true mean μ: the average of the n values in the sample, itself a random quantity. CLT: for large n (e.g. after joining 30 tuples), μ̂n is approximately normally distributed with mean μ and variance σ²/n.
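As a quick sanity check (not from the paper), a short simulation with made-up μ and σ shows the sample mean clustering around μ with variance close to σ²/n:

```python
import random
import statistics

# Empirically check that the average of n i.i.d. values has mean mu
# and variance sigma^2 / n (values of mu, sigma, n are illustrative).
random.seed(42)
mu, sigma, n = 10.0, 2.0, 30

# Draw many independent samples of size n and record each sample mean.
means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(5000)
]

est_mean = statistics.fmean(means)
est_var = statistics.variance(means)
print(round(est_mean, 2))   # close to mu = 10.0
print(round(est_var, 3))    # close to sigma^2 / n = 4/30 ~ 0.133
```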

15 Random variable Z Shift and scale μ̂n to get a "standardized" random variable Z: Z = (μ̂n − μ) / (σ/√n). Z has a standard normal distribution. There are many ways to compute the z_p quantiles.

16 "Interval" column on the GUI The authors use σ̂n as an estimator for the true standard deviation: εn = (z_p σ̂n) / √n. This quantity is displayed as the half-width of the running confidence interval.
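The half-width formula can be computed directly with the Python standard library (`statistics.NormalDist` supplies the z_p quantile). The sample values below are made up, and the plain sample standard deviation stands in for the paper's join-specific estimator σ̂n:

```python
import math
import statistics
from statistics import NormalDist

def half_width(sample, p=0.95):
    """Half-width eps_n = z_p * sigma_hat_n / sqrt(n) of a running
    confidence interval at two-sided confidence level p."""
    n = len(sample)
    z_p = NormalDist().inv_cdf((1 + p) / 2)   # two-sided p-quantile
    sigma_hat = statistics.stdev(sample)      # stand-in for sigma_hat_n
    return z_p * sigma_hat / math.sqrt(n)

# Illustrative running-estimate values (made up):
sample = [9.8, 10.4, 10.1, 9.6, 10.3, 10.0, 9.9, 10.2]
mu_hat = statistics.fmean(sample)
eps = half_width(sample)
print(f"{mu_hat:.2f} +/- {eps:.2f}")
```

As more tuples are joined, n grows and εn shrinks like 1/√n, which is why the GUI's interval tightens over time.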

17 Why call this "Ripple Join"?
1. The algorithm seems to ripple out from a corner of the join.
2. Acronym: "Rectangles of Increasing Perimeter Length"

18 Variants of ripple join
 Block ripple join
 Index ripple join
 Hash ripple join

19 Performance

20 Further publications
 Eddies: Continuously Adaptive Query Processing, by Ron Avnur and Joseph M. Hellerstein, SIGMOD 2000, Dallas
 Confidence Bounds for Sampling-Based GROUP BY Estimates, by Fei Xu, Christopher Jermaine, and Alin Dobra, ACM Trans. Datab. Syst. 33, 3 (Aug. 2008)
 Wavelet synopsis for hierarchical range queries with workloads, by Sudipto Guha, Hyoungmin Park, and Kyuseok Shim, VLDB Journal (2008) 17:1079–1099

21 Questions?

