By: Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang. Presented by: Calvin R Noronha (1000578539), Deepak Anand (1000603813).

Presentation transcript:

Agenda
Motivation; Online Aggregation Basic Approach; Goals; Building an Online Aggregation System; Optimization; Running Confidence Intervals; Conclusion; Future Work.

Motivation
Aggregation in traditional databases: queries take a long time to execute, and the user is forced to wait without feedback until the query completes. Users want to see aggregate information right away. Aggregation queries are typically used to get a "rough picture", yet they are computed with painstaking precision.
This paper suggests performing aggregation online, so that progress can be observed and query execution can be controlled on the fly.

An Example
Consider the following query:
    SELECT AVG(final_grade) FROM grades WHERE course_name = 'CS186';
If there is no index on the course_name attribute, this query scans the entire grades table before returning the result.
[Figure: result table with a single AVG column.]

An alternative approach
Running aggregate: an estimate of the final result based on the records retrieved so far.
Running confidence interval: the estimate is within a +/- margin of the final result with 95% probability.
Progress bar: shows how far the query has progressed.
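The running aggregate and progress bar described above can be sketched in a few lines of Python. This is only an illustration, not the POSTGRES implementation: the function name `running_average`, the `report_every` parameter, and the synthetic grades are all assumptions made for the example.

```python
import random

def running_average(values, report_every=100):
    """Yield (records_seen, fraction_done, running_avg) as the scan proceeds."""
    total = 0.0
    n_total = len(values)
    for seen, v in enumerate(values, start=1):
        total += v
        # Report periodically, and always once more at the very end.
        if seen % report_every == 0 or seen == n_total:
            yield seen, seen / n_total, total / seen

# Hypothetical data: 10,000 synthetic grades, shuffled so the scan order is random.
rng = random.Random(0)
grades = [rng.uniform(0.0, 4.0) for _ in range(10_000)]
rng.shuffle(grades)

for seen, progress, avg in running_average(grades, report_every=2_500):
    print(f"{progress:4.0%} scanned, running AVG = {avg:.3f}")
```

Because the function is a generator, a caller can stop consuming it at any point, which mirrors the user pressing a stop button once the estimate looks good enough.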

Online Aggregation Interface with Groups
If the records are retrieved in random order, a good approximate result can be obtained, and sampling can stop once the confidence interval becomes sufficiently narrow. Consider a GROUP BY query with 6 groups in the output: the user is presented with 6 running outputs and 6 "stop-sign" buttons, one per group. The stopping condition can be set on the fly, and the interface is easy to understand for users without a statistics background.

Usability goals
Continuous observation: users can observe the processing in the GUI and get a sense of the current level of precision.
Control of time/precision: users can terminate processing at any time, at a fine granularity, trading off time against precision.
Control of fairness/partiality: users can control the relative rate at which different running aggregates are updated.

Performance goals
Minimum time to accuracy: minimize the time required to produce a useful estimate of the final answer.
Minimum time to completion: minimize the time required to produce the final answer.
Pacing: the running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.

Building an Online Aggregation System
There are two approaches that can be taken:
1. A naive approach: a trivial implementation that requires no modification to POSTGRES. Running aggregates are written as user-defined functions in C, for example:
    SELECT running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) FROM grades;
This approach cannot be used with a GROUP BY clause.
2. Modifying the DBMS: because online aggregation is difficult to implement as a user-level addition, the database engine itself is modified to support it.

Random Access to Data
Estimates of the running aggregates are accurate when records are retrieved in random order.
1. Heap scans: a simple heap scan can be effective with traditional heap-file access methods, where records are stored in no particular order. A different access method is needed when the aggregate attributes are correlated with the order in which the heap was filled.
2. Index scans: can be used if the aggregate attributes are not the attributes being indexed.
3. Sampling from indices: techniques for pseudo-random sampling from various index structures can be used [Olken's work].

Non-blocking/Fair access: GROUP BY and DISTINCT
Groups should receive updates in a fair manner. Sorting is not a solution, because sorting blocks; hash-based techniques must be used instead.
Pros: non-blocking.
Cons: does not perform well as the number of groups grows.
Solution: hybrid hashing, or its optimized version, Hybrid Cache. For DISTINCT columns, a similar hashing technique can be used.
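To illustrate why hashing is non-blocking where sorting is not, here is a minimal Python sketch of hash-based grouping that emits an updated running average after every input row. It keeps all groups in an in-memory dictionary, so it does not show the hybrid hashing or Hybrid Cache techniques the slide mentions for large numbers of groups; the function name `running_group_avg` is an assumption made for the example.

```python
from collections import defaultdict

def running_group_avg(rows):
    """Consume (group, value) rows one at a time; after each row, yield the
    group that changed together with its updated running average."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in rows:
        sums[group] += value
        counts[group] += 1
        # No sorting and no waiting for end of input: output is immediate.
        yield group, sums[group] / counts[group]

# Hypothetical input rows: (course_name, final_grade).
rows = [("CS186", 3.0), ("CS162", 4.0), ("CS186", 2.0)]
for group, avg in running_group_avg(rows):
    print(group, avg)
```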

Index Striding
Problem: updates for groups with few members are very infrequent. For a fair GROUP BY, tuples are read in round-robin fashion (a tuple from group 1, a tuple from group 2, ...).
What is index striding? It is the technique that supports this round-robin access: an index on the grouping columns is used to fetch the next tuple of each group in turn.
Additional advantages: the updating rate of each group can be controlled, and processing of a particular group can be stopped.
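The round-robin access pattern can be sketched as follows. This is a simplified model, not the POSTGRES code: a plain dictionary stands in for the index on the grouping column, and the `weights` and `stopped` parameters are hypothetical stand-ins for the speed and stop controls.

```python
def index_stride(index, weights=None, stopped=()):
    """Yield (group, value) by visiting groups round-robin.

    index   -- dict mapping group key -> list of values; a stand-in for an
               index on the grouping column (hypothetical structure)
    weights -- optional dict: group -> tuples fetched per round (default 1)
    stopped -- groups the user has stopped; they are never visited
    """
    weights = weights or {}
    cursors = {g: iter(vals) for g, vals in index.items() if g not in stopped}
    while cursors:
        for group in list(cursors):
            # Fetch at least one tuple per round so every group makes progress.
            for _ in range(max(1, weights.get(group, 1))):
                try:
                    yield group, next(cursors[group])
                except StopIteration:
                    del cursors[group]  # group exhausted, drop it
                    break

index = {"A": [1, 2, 3, 4], "B": [10, 20]}
print(list(index_stride(index)))
print(list(index_stride(index, weights={"A": 2})))
print(list(index_stride(index, stopped={"B"})))
```

Giving a group a larger weight makes its running aggregate converge faster at the expense of the others, which is the fairness/partiality trade-off listed under the usability goals.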

[Figure: POSTGRES with index striding; speed control.]

Non-blocking Join Algorithms
For interactive display of online aggregation, avoid algorithms that block.
Sort-merge join: unacceptable, because sorting is a blocking operation.
Merge join: acceptable, but it produces sorted output.
Hybrid hash join: not good if the inner relation is large.
Nested-loops join: never blocks, but it is too slow when the inner relation is large and unindexed.
An optimizer must be used to choose between these strategies.
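As a rough illustration of what "non-blocking" means for a join, the sketch below writes a nested-loops join as a Python generator: each matching pair is handed to the consumer as soon as it is found, with no sorting or hash-table build phase first. The relation layouts and the function name `nested_loops_join` are assumptions made for the example, and the quadratic cost shows why the slide calls this too slow for a large unindexed inner relation.

```python
def nested_loops_join(outer, inner, key_outer, key_inner):
    """Yield joined (outer_row, inner_row) pairs as soon as they are found."""
    for o in outer:
        for i in inner:  # inner must be re-iterable (a list here)
            if key_outer(o) == key_inner(i):
                yield o, i

# Hypothetical relations: students(id, name) and grades(student_id, grade).
students = [(1, "ann"), (2, "bob")]
grades = [(1, 3.0), (2, 4.0), (1, 2.0)]

total, count = 0.0, 0
for student, grade in nested_loops_join(students, grades,
                                        key_outer=lambda s: s[0],
                                        key_inner=lambda g: g[0]):
    total += grade[1]
    count += 1
    print(f"after {count} joined tuples: running AVG = {total / count:.2f}")
```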

Optimization
Avoid sorting unless explicitly requested by the user. Blocking sub-operations have costs, and those costs should be reflected in the cost model.
Cost function = f(t_o) + g(t_d), with two components:
dead time (t_d): time spent doing "invisible" work;
output time (t_o): time spent producing output.
Preference is given to plans that maximize user control (for example, index striding).

Extended aggregate functions
The standard set of aggregate functions must be extended with functions that provide running estimates.
Running computation: SUM, COUNT and AVG are straightforward; VAR and STD DEV can be implemented using incremental (one-pass) algorithms.
Aggregate functions that return running confidence intervals must also be defined.
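The slide does not name the algorithm used for running VAR and STD DEV. As one well-known possibility, the sketch below uses Welford's one-pass update, which maintains the count, mean and sum of squared deviations incrementally. The class name `RunningStats` and its methods are assumptions made for the example.

```python
import math

class RunningStats:
    """Running COUNT, SUM, AVG, VAR and STD DEV over a stream of values."""

    def __init__(self):
        self.n = 0        # COUNT
        self.total = 0.0  # SUM
        self.mean = 0.0   # AVG
        self.m2 = 0.0     # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # Welford's update step

    def variance(self):
        """Sample variance of the values seen so far (None until n >= 2)."""
        return self.m2 / (self.n - 1) if self.n > 1 else None

    def std_dev(self):
        var = self.variance()
        return math.sqrt(var) if var is not None else None

stats = RunningStats()
for grade in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(grade)
    print(stats.n, stats.mean, stats.variance())
```

Each call to `update` costs constant time and memory, so the estimates can be refreshed after every tuple without rescanning earlier data.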

API
The current API uses built-in methods, e.g.:
StopGroup(cursor, groupval)
speedUpGroup(cursor, groupval)
slowDownGroup(cursor, groupval)
setSkipFactor(cursor_name, integer)
[Figure: skip factor.]

Statistical Issues
Running confidence interval: given an estimate, there is probability p that we are within ε of the right answer μ. A large value of ε means that the records seen so far may not be sufficiently representative of the entire database, and the current estimate may be far from the final result.
Types of running confidence intervals:
Conservative confidence intervals: valid for any n >= 1 (n = number of tuples retrieved); the interval contains the final answer with probability at least p [based on Hoeffding's inequality].
Large-sample confidence intervals.
Deterministic confidence intervals.
The type of running confidence interval can be adjusted dynamically depending on the value of n.
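For a running AVG, a conservative interval of this kind can be derived directly from Hoeffding's inequality. The sketch below assumes that every value is known in advance to lie in a range [lo, hi]; that bound, the function name `hoeffding_half_width`, and the example numbers are assumptions made for illustration. Setting the Hoeffding tail bound 2·exp(-2nε²/(hi-lo)²) equal to 1-p and solving for ε gives the half-width computed here.

```python
import math

def hoeffding_half_width(n, p, lo, hi):
    """Half-width eps such that the running AVG of n values in [lo, hi] is
    within eps of the true mean with probability at least p."""
    if n < 1 or not 0.0 < p < 1.0:
        raise ValueError("need n >= 1 and 0 < p < 1")
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))

# Hypothetical setting: grades bounded by [0, 4], 95% confidence.
for n in (100, 400, 1600):
    print(n, round(hoeffding_half_width(n, 0.95, 0.0, 4.0), 4))
```

The half-width shrinks in proportion to 1/sqrt(n), so quadrupling the number of retrieved tuples halves the interval. This is the basis for stopping a query once the interval is narrow enough.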

Performance Issues

Conclusion
An interactive, intuitive and user-controllable approach to aggregation is needed. This can be achieved by significant extensions to the database engine. These extensions satisfy the usability and performance goals and make it possible to produce statistical confidence intervals for running aggregates.

Future work
Better UI; nested queries; control without indices; checkpointing / continuation.

Time to ask questions.