Online Aggregation Joe Hellerstein UC Berkeley Online Aggregation: Motivation Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG.



2 Online Aggregation Joe Hellerstein UC Berkeley

3 Online Aggregation: Motivation Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG 3.262574342

4 A Better Approach Don’t process in batch! Online aggregation:

5 Isn’t This Just Sampling? Yes –we need values to arrive in random order –“confidence intervals” similar to sampling work … and No! –stopping condition set on the fly! –statistical techniques are more sophisticated –can handle GROUP BY w/o a priori knowledge...

6 Grouping Select AVG(grade) from ENROLL GROUP BY major;

7 Requirements Usability –Continuous output (non-blocking query plans) –time/precision control –fairness/partiality Performance –time to accuracy –time to completion –pacing

8 Fairness/Partiality Speed controls!

9 Applications Large-scale Data Analysis –Online drill-down/roll-up/CUBE –Visualization tools –Data mining –Distributed requests Generally: –Any long-running activity should be visualized and controllable on line. –Accuracy rarely critical for non-lookups

10 A Naïve Approach Select onln_avg(grade) from ENROLL; Can do it in Illustra/Informix today! But... –No grouping –Can’t meet performance & usability needs: no guarantee of continuous output (optimized for completion time!); no guarantee of fairness (or control over partiality); no control over pacing

11 Random Access to Data Heap Scan –OK if clustering uncorrelated to agg & grouping attrs Index Scan – can scan an index on attrs uncorrelated to agg or grouping (or index on random()) Sampling: – could introduce new sampling access methods (e.g. Olken’s work)

12 Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can’t sort! Must use hash-based techniques: sorting blocks, and sorting is unfair –Hybrid hashing! especially for duplicate elimination. –“Hybrid Cache” even better.
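A minimal sketch of why hashing suits a non-blocking Group By: each arriving tuple updates one hash-table entry, so per-group running averages can be reported at any point, whereas a sort would have to consume all input first. The function name `online_group_avg` and the `report_every` parameter are hypothetical, and an in-memory Python dict stands in for the hybrid hash structure the slide mentions.

```python
def online_group_avg(rows, report_every=1):
    """Yield a snapshot of per-group running averages as tuples arrive.

    rows: iterable of (group, value) pairs, assumed to arrive in random order.
    report_every: emit a snapshot after every this many input tuples.
    """
    state = {}  # group -> [running sum, running count]
    for i, (group, value) in enumerate(rows, start=1):
        acc = state.setdefault(group, [0.0, 0])
        acc[0] += value
        acc[1] += 1
        if i % report_every == 0:
            # No sort and no full pass needed: the table is always usable.
            yield {g: s / n for g, (s, n) in state.items()}


if __name__ == "__main__":
    enroll = [("cs", 4.0), ("ee", 3.0), ("cs", 2.0)]
    for snapshot in online_group_avg(enroll):
        print(snapshot)
```

Groups appear in the output as soon as their first tuple arrives, which is what lets Group By work without a priori knowledge of the groups.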

13 Index Striding For fair Group By: – want random tuple from Group 1, random tuple from Group 2,... –Idea: Index gives tuples from a single group. –Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule –Gives fairness/partiality, info/speed match! (Next step: “heap stride”)
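The round-robin fetch described above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: a dict of lists stands in for an index on the grouping column, shuffling each list stands in for "random tuple from the group", and the names `index_stride` and `weights` are hypothetical.

```python
import random

_DONE = object()  # sentinel marking an exhausted cursor


def index_stride(index, weights=None, seed=0):
    """Yield (group, row) pairs, one cursor per group, weighted round-robin.

    index: dict mapping group key -> list of rows (stand-in for a real index).
    weights: dict mapping group key -> rows fetched per round (default 1);
             a weight of 0 pauses the group.
    """
    rng = random.Random(seed)
    cursors = {}
    for group, rows in index.items():
        shuffled = list(rows)
        rng.shuffle(shuffled)  # assumption: simulates random order within a group
        cursors[group] = iter(shuffled)
    if weights is None:
        weights = {}
    while cursors:
        progressed = False
        for group in list(cursors):
            for _ in range(weights.get(group, 1)):
                row = next(cursors[group], _DONE)
                if row is _DONE:
                    del cursors[group]  # this group's cursor is finished
                    break
                progressed = True
                yield group, row
        if not progressed:
            break  # every remaining group is paused or exhausted


if __name__ == "__main__":
    idx = {"Math": [3.1, 3.5, 2.9, 3.8], "History": [3.0, 3.3]}
    for group, grade in index_stride(idx, weights={"Math": 2}):
        print(group, grade)
```

Because the weights are read again at each group's turn, a caller that mutates the `weights` dict between fetches changes the schedule mid-query, which is one way to model the speed controls.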

14 Join Algorithms Non-Blocking Joins –no sorting! –merge join OK, but watch “interesting orders”! –hybrid hash not great –symmetric “pipeline” hash [Wilschut/Apers 91] –nested loops always good, can be too slow –optimization...?
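A rough sketch of the symmetric "pipeline" hash join idea: both inputs get a hash table, and each arriving tuple is inserted into its own side's table and immediately probed against the other side's, so matches are emitted without waiting for either input to finish. The tagged-stream input format and the names `symmetric_hash_join`, `key_left`, and `key_right` are assumptions made for this illustration.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_left, key_right):
    """Yield (left_row, right_row) matches as soon as both sides have arrived.

    stream: iterable of ("L", row) or ("R", row) pairs in arrival order.
    key_left / key_right: functions extracting the join key from a row.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in stream:
        if side == "L":
            k = key_left(row)
            left_table[k].append(row)           # build on own side
            for match in right_table.get(k, ()):  # probe the other side
                yield row, match
        else:
            k = key_right(row)
            right_table[k].append(row)
            for match in left_table.get(k, ()):
                yield match, row


if __name__ == "__main__":
    arrivals = [("L", (1, "alice")), ("R", (1, "cs101")), ("R", (2, "ee20")),
                ("L", (2, "bob"))]
    first = lambda r: r[0]
    for pair in symmetric_hash_join(arrivals, first, first):
        print(pair)
```

The price for non-blocking output is memory: both inputs are kept in hash tables for the life of the join.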

15 Query Optimization 2 components in cost function: –dead time (t_d): time spent doing “invisible” work -- tax this at a high rate! –output time (t_o): time spent producing output –this will do a lot automagically! User control vs. performance? –can use competition (e.g., a la Rdb) “interesting” (i.e. bad!) ordering

16 Extended Aggregate Functions Aggs need to return tuples in the middle of processing –add a 4th “current estimate” function open, iterate, estimate, close –sometimes like close function (AVG), sometimes not (COUNT) –aggregation code needs to spit this out to the front-end
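One way to picture the four-function interface (open, iterate, estimate, close) is sketched below. The class names and method signatures are hypothetical, not the prototype's. The COUNT example reflects one reading of "sometimes not": a running count has to be scaled up by the fraction of input seen so far, which assumes the total input cardinality is known.

```python
class OnlineAvg:
    """AVG: the running estimate is computed the same way as the final answer."""

    def open(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        self.total += value
        self.count += 1

    def estimate(self):
        return self.total / self.count if self.count else None

    def close(self):
        return self.estimate()


class OnlineCount:
    """COUNT of qualifying rows: estimate differs from close."""

    def open(self, expected_rows):
        # Assumption: the number of input rows is known (or estimated) up front.
        self.expected_rows = expected_rows
        self.seen = 0
        self.matched = 0

    def iterate(self, qualifies):
        self.seen += 1
        if qualifies:
            self.matched += 1

    def estimate(self):
        if self.seen == 0:
            return None
        # Scale the observed fraction up to the full input size.
        return self.matched / self.seen * self.expected_rows

    def close(self):
        return self.matched  # exact once all input has been consumed


if __name__ == "__main__":
    avg = OnlineAvg()
    avg.open()
    for grade in [3.0, 4.0, 5.0]:
        avg.iterate(grade)
        print("running AVG:", avg.estimate())
    print("final AVG:", avg.close())
```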

17 API Current API uses built-in methods –button-press = query invocation e.g., select StopGroup( val ); High overhead, need to know internals. Very flexible. Easy to code! –Estimates returned as tuples no distinction between estimates and final answer OK? –A general API for running status?

18 Pacing Inter-tuple speed is critical!!

19 Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we’re within ε of the right answer 3 types of estimates: –Conservative (Hoeffding’s inequality) –Large-Sample (Central Limit Theorems) –Deterministic Previous work + new results from Peter Haas
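To make the first two interval types concrete, here is a sketch using the textbook forms for the mean of n independent samples: the Hoeffding bound for values known to lie in [lo, hi], and the normal approximation from the Central Limit Theorem. These are the standard formulas, not necessarily the exact ones used in the prototype, and the function names are hypothetical. The deterministic type is not covered.

```python
import math
from statistics import NormalDist, stdev


def hoeffding_half_width(n, lo, hi, p):
    """Conservative half-width: with probability >= p the sample mean of n
    values bounded in [lo, hi] is within this distance of the true mean."""
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


def clt_half_width(sample, p):
    """Large-sample half-width: z * s / sqrt(n), using the sample std dev."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * stdev(sample) / math.sqrt(len(sample))


if __name__ == "__main__":
    grades = [2.0, 3.0, 4.0] * 50  # toy data on a 0-4 grade scale
    n = len(grades)
    print("Hoeffding:", hoeffding_half_width(n, 0.0, 4.0, 0.99))
    print("CLT:      ", clt_half_width(grades, 0.99))
```

The conservative bound needs only the value range and holds for any n, but is wider. The large-sample interval is tighter but only approximately valid, and relies on n being large.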

20 Initial Implementation prototype in PostgreSQL (a/k/a Postgres95, a/k/a PG-Lite) –aggs with running output –hash-based grouping, dup elimination –index striding –optimizer tweaks –“API” hacks and a simple UI

21 Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Normal plan

22 Access Methods, Big Group Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; College L: 925K/1.5M tuples

23 Access Methods, Small Group Surprise!!! Cost of API call? Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; College S: 15K/1.5M tuples

24 Future Work Better UI –simple example: histograms w/error bars –online data visualization (Tioga DataSplash) data viz = “graphical” aggregate sampled points, or sampled wavelet coefficients? great for panning, zooming Nested Queries –Tie-ins with AI’s “anytime algorithms”

25 Future Work II Control w/o Indices –Heap Striding = do multiple scans optimization 1: fast scans can pass hints to slow optimization 2: “piggyback” scans via buffering big payoff for complex queries Checkpointing/continuation –also continuous data streams

26 Future Work III Sample Tracking –important for financial auditing, statistical quality control –cached views, RID-lists, logical views More stats: –non-standard aggs –simultaneous confidence intervals –allow for slop in cardinality estimation

27 Summary DSS/OLAP users very demanding! –“speed of thought” performance required –only choices: precomputation or estimation –for estimation, need to provide control and steady, useful feedback puts user in the driver’s seat takes some performance burden away from system –Requires significant backend support but well worth the effort runs the “impossible” queries

