Presentation is loading. Please wait.

Presentation is loading. Please wait.

Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813) By:

Similar presentations


Presentation on theme: "Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813) By:"— Presentation transcript:

1 Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813) By:

2 AGENDA Motivation Online Aggregation Basic Approach Goals Building an Online Aggregation system Optimization Running confidence intervals Conclusion Future work

3 Motivation Aggregation in traditional databases Long delay in query execution and user is forced to wait without feedback till query completes execution. Users want to see the aggregation information right away. Aggregation queries are typically used to get a ‘rough picture” but they are computed with painstaking precision. This paper suggests the following changes: Perform aggregation online so that: Progress can be observed. execution of the queries can be controlled on the fly.

4 An Example Consider the following example: SELECT AVG(final_grade) FROM grades WHERE course_name = ‘CS186’ If there is no index on the course_name attribute, then this query scans the entire grades table before returning the result. AVG -------------- | 2.631046 | --------------

5 An alternative approach Running aggregate An estimate of the final result based on the records retrieved so far Running confidence interval 2.6336 +/- 0.0652539 with 95% probability Progress Bar

6 Online Aggregation Interface with Groups If the records are retrieved in the random order, a good approximate result can be obtained We can stop sampling once the length of the confidence interval becomes sufficiently small. Consider a GROUP BY query with 6 groups in the output The user is presented with 6 outputs and 6 “Stops-sign” buttons Stopping condition can be set on the fly Easy to understand for non-statistical user Stop Button

7 Usability goals Continuous observation: Users can observe the processing in the GUI and get a sense of the current level of precision. Control of Time/Precision: Users can terminate processing at any time at a fine granularity(trade-off between time and precision) Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.

8 Performance goals Minimum time to accuracy: Minimize time required to produce a useful estimate of the final answer. Minimum time to completion: Minimize time required to produce the final answer. Pacing: The running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.

9 Building an Online Aggregation System There are two approaches that can be taken: 1. A Naive approach: Trivial implementation without modification to POSTGRES. User defined functions can be written in C. Cannot be used with GROUP BY clause. 2. Modifying the DBMS: Difficult to implement online aggregation as user level addition. Modifying the database engine to support Online Aggregation. SELECT running_avg(final_grade) running_confidence(final_grade) running_interval(final_grade) FROM grades

10 Estimates of the running aggregates is accurate when records are retrieved randomly. 1. Heap Scans Simple heap scans can be effective in traditional heap file access methods where records are stored in unspecified order. Need to choose different method for the aggregate attributes, which are correlated to the logical order of formation of heap. 2. Index Scans Can be used if aggregate attributes are not used for indexing. 3. Sampling from Indices Techniques for pseudo random sampling from various index structures can be used. [Olken’s work] Random Access to Data

11 Non-blocking/Fair access GROUP BY and DISTINCT Groups should receive updates in fair manner Solution: Sorting ?? No, because sorting blocks Must use hash based techniques  Pros: Non-blocking  Cons: Does not perform well as the number of groups grow. Solution: Hybrid hashing. Optimized version: Hybrid cache For DISTINCT columns, a similar hashing technique can be used.

12 Index Striding Updates for the groups with few members will be very infrequent. For fair group by Read tuples in round robin fashion (a tuple from group 1, a tuple from group 2, …) Supported by technique index striding What is Index Striding ? Additional advantages Group updating rate can be controlled Particular group processing can be stopped

13 POSTGRES with index striding Speed control

14 Non-blocking Join Algorithms For interactive display of online aggregation, avoid algorithms that block. Sort-merge join Unacceptable as sorting is blocking operation Merge Join OK but produces sorted output Hybrid hash join Not good if inner relation is large Nested loops join is always good, In case of large un-indexed inner relation its too slow An optimizer must be used to choose between these strategies.

15 Optimization Avoid sorting unless explicitly requested by the user. Blocking sub-operations have costs and appropriate costs should be considered. Cost function = f(t o ) + g(t d ) There are 2 components in cost function: dead time (t d ): time spent doing “invisible” work output time (t o ): time spent producing output Preferences to the plans that maximize user control (index striding)

16 Extended aggregate functions Standard set of aggregate functions must be extended Aggregate functions must be written that provides running estimates Running computation SUM, COUNT, AVG – straight forward VAR, STD DEV – can be implemented using algorithms Aggregate functions returning running confidence must be defined.

17 API Current API uses built-in methods e.g., StopGroup(cursor,groupval) speedUpGroup(cursor,groupval) slowDownGroup(cursor,groupval) setSkipFactor(cursor_name,integer) Skip Factor

18 Statistical Issues Running confidence interval Given an estimate, probability p that we’re within  of the right answer Mu A large value of  means that the records seen so far may not be sufficiently representative of the entire database and the current estimate of the result may be far from the final result. Types of running confidence interval s: Conservative confidence interval For n (no of tuples retrieved) >= 1 Answer guaranteed to be >= probability p [based on Hoeffding’s inequality] Large-sample confidence intervals Deterministic confidence intervals Running confidence interval can be dynamically adjusted depending on the value of n.

19 Performance Issues

20 Conclusion An interactive, intuitive and user-controllable approach to aggregation is needed. This can be achieved by significant extensions to the database engine. These extensions satisfy the usability and performance goals. Ability to produce statistical confidence intervals for running aggregates.

21 Future work Better UI Nested Queries Control without Indices Checkpointing / Continuation

22 TIME TO ASK QUESTIONS 


Download ppt "Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha (1000578539) Deepak Anand (1000603813) By:"

Similar presentations


Ads by Google