Online Aggregation. Joe Hellerstein, UC Berkeley.

Presentation transcript:

Online Aggregation Joe Hellerstein UC Berkeley

Online Aggregation: Motivation
Select AVG(grade) from ENROLL;
A “fancy” interface: [figure: Query Results window showing AVG]

A Better Approach
Don’t process in batch!
Online aggregation:

Isn’t This Just Sampling?
Yes
– we need values to arrive in random order
– “confidence intervals” similar to sampling work
… and No!
– stopping condition set on the fly!
– statistical techniques are more sophisticated
– can handle GROUP BY w/o a priori knowledge...

Grouping
Select AVG(grade) from ENROLL GROUP BY major;

Requirements
Usability
– Continuous output
  – non-blocking query plans
– time/precision control
– fairness/partiality
Performance
– time to accuracy
– time to completion
– pacing

Fairness/Partiality
Speed controls!

Applications
Large-scale Data Analysis
– Online drill-down/roll-up/CUBE
– Visualization tools
– Data mining
– Distributed requests
Generally:
– Any long-running activity should be visualized and controllable on line.
– Accuracy rarely critical for non-lookups

A Naïve Approach
Select onln_avg(grade) from ENROLL;
Can do it in Illustra/Informix today!
But...
– No grouping
– Can’t meet performance & usability needs:
  – no guarantee of continuous output (optimized for completion time!)
  – no guarantee of fairness (or control over partiality)
  – no control over pacing

Random Access to Data
Heap Scan
– OK if clustering uncorrelated to agg & grouping attrs
Index Scan
– can scan an index on attrs uncorrelated to agg or grouping (or index on random())
Sampling:
– could introduce new sampling access methods (e.g. Olken’s work)

Group By & Distinct
Fair, Non-Blocking Group By/Distinct
– Can’t sort! Must use hash-based techniques
  – sorting blocks
  – sorting is unfair
– Hybrid hashing! especially for duplicate elimination.
– “Hybrid Cache” even better.
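
To make the "can't sort, must hash" point concrete, here is a minimal sketch of non-blocking duplicate elimination. It assumes the hash set fits in memory, so it does not model the overflow handling that hybrid hashing provides; the function name `streaming_distinct` and the sample data are hypothetical.

```python
from typing import Hashable, Iterable, Iterator


def streaming_distinct(rows: Iterable[Hashable]) -> Iterator[Hashable]:
    """Emit each distinct value the first time it is seen.

    Unlike sort-based DISTINCT, nothing has to wait for the end of the
    input: a value is output as soon as the hash set shows it is new.
    Assumption: the set of distinct values fits in memory.
    """
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


majors = ["CS", "Math", "CS", "Physics", "Math", "CS"]
results = streaming_distinct(majors)
first = next(results)   # available after reading a single input row
rest = list(results)    # remaining distinct values, in arrival order
```

A sort-based plan would have to consume every row before emitting the first one, and it would emit groups in key order rather than arrival order, which is the blocking and unfairness the slide refers to.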

Index Striding
For fair Group By:
– want random tuple from Group 1, random tuple from Group 2, ...
– Idea: Index gives tuples from a single group.
– Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin.
– Can control speed by weighting the schedule
– Gives fairness/partiality, info/speed match!
(Next step: “heap stride”)
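
The round-robin schedule can be sketched as follows. This is an illustration under stated assumptions, not the prototype's access method: each group's index cursor is modeled as a plain Python iterator, tuples within a group are assumed to arrive in an order uncorrelated with the aggregate, and weights are assumed to be positive integers. The name `index_stride` and the sample data are hypothetical.

```python
def index_stride(groups, weights=None):
    """Fetch tuples round-robin from one cursor per group.

    groups:  dict mapping group key -> iterable of that group's tuples
             (stands in for one index cursor per group).
    weights: dict mapping group key -> positive int, the number of tuples
             fetched from that group per round (default 1 for every group).
    Yields (group_key, tuple) pairs until every cursor is exhausted.
    """
    weights = weights or {g: 1 for g in groups}
    cursors = {g: iter(rows) for g, rows in groups.items()}
    while cursors:
        for g in list(cursors):              # snapshot: we delete while looping
            for _ in range(weights.get(g, 1)):
                try:
                    yield g, next(cursors[g])
                except StopIteration:
                    del cursors[g]           # this group is finished
                    break


grades = {"CS": [3.5, 3.0, 4.0], "Math": [2.5, 3.5], "Art": [3.8]}

fair = list(index_stride(grades))
favor_cs = list(index_stride(grades, {"CS": 2, "Math": 1, "Art": 1}))
```

With equal weights every group advances at the same rate regardless of its size; raising a group's weight is one way to model the speed controls, since that group then receives more tuples per round.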

Join Algorithms
Non-Blocking Joins
– no sorting!
– merge join OK, but watch “interesting orders”!
– hybrid hash not great
– symmetric “pipeline” hash [Wilschut/Apers 91]
– nested loops always good, can be too slow
– optimization...?
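
For readers unfamiliar with the symmetric "pipeline" hash join, the sketch below shows the general idea: both inputs are read alternately, each arriving tuple is inserted into its own side's hash table and immediately probes the other side's table. It assumes both hash tables fit in memory and that inputs are alternated one tuple at a time; the function name and sample relations are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input


def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (left_row, right_row) matches as soon as both rows have arrived."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for l_row, r_row in zip_longest(left, right, fillvalue=_END):
        if l_row is not _END:
            k = left_key(l_row)
            left_table[k].append(l_row)           # build on the left side
            for match in right_table.get(k, ()):  # probe the right side
                yield l_row, match
        if r_row is not _END:
            k = right_key(r_row)
            right_table[k].append(r_row)          # build on the right side
            for match in left_table.get(k, ()):   # probe the left side
                yield match, r_row


enroll = [("alice", "db"), ("bob", "os"), ("carol", "db")]
courses = [("os", "Smith"), ("db", "Jones")]

joined = list(symmetric_hash_join(enroll, courses,
                                  left_key=lambda e: e[1],
                                  right_key=lambda c: c[0]))
```

Each matching pair is produced exactly once, at the moment its later-arriving member is read, so output starts flowing without waiting for either input to be fully consumed.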

Query Optimization
2 components in cost function:
– dead time (t_d): time spent doing “invisible” work -- tax this at a high rate!
– output time (t_o): time spent producing output
– this will do a lot automagically!
User control vs. performance?
– can use competition (e.g., a la Rdb)
“interesting” (i.e. bad!) ordering
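
One way to read the two-component cost function is as a weighted sum in which dead time carries a large multiplier. The sketch below assumes that linear form; the tax factor of 10 and the per-plan timings are invented purely for illustration and do not come from the talk.

```python
def plan_cost(dead_time, output_time, dead_time_tax=10.0):
    """Cost = tax * t_d + t_o (assumed linear form; the tax value is hypothetical)."""
    return dead_time_tax * dead_time + output_time


def choose_plan(plans, dead_time_tax):
    """Return the name of the cheapest plan under the given dead-time tax."""
    return min(plans, key=lambda name: plan_cost(dead_time_tax=dead_time_tax,
                                                 **plans[name]))


# Hypothetical timings in seconds: the sort-based plan finishes sooner overall
# but shows nothing while it sorts; the hash-based plan streams output.
plans = {
    "sort-based (blocking)": {"dead_time": 40.0, "output_time": 20.0},
    "hash-based (pipelined)": {"dead_time": 2.0, "output_time": 75.0},
}

traditional_choice = choose_plan(plans, dead_time_tax=1.0)
online_choice = choose_plan(plans, dead_time_tax=10.0)
```

With no extra tax the optimizer behaves like a completion-time optimizer and picks the blocking plan; taxing dead time tips the choice toward the plan that keeps producing output.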

Extended Aggregate Functions
Aggs need to return tuples in the middle of processing
– add a 4th “current estimate” function
  – open, iterate, estimate, close
– sometimes like close function (AVG), sometimes not (COUNT)
– aggregation code needs to spit this out to the front-end
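
The four-function interface might look like the sketch below. The class names, the `total_rows` parameter, and the scale-up rule used for COUNT are assumptions made for illustration; the slide only states that the estimate function sometimes matches close (AVG) and sometimes does not (COUNT).

```python
class OnlineAvg:
    """AVG: the running estimate is computed exactly like the final answer."""

    def open(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        self.total += value
        self.count += 1

    def estimate(self):
        return self.total / self.count if self.count else None

    def close(self):
        return self.estimate()


class OnlineCount:
    """COUNT of qualifying rows: the estimate must be scaled up, close must not.

    Assumption: the total number of rows in the table is known in advance.
    """

    def __init__(self, total_rows):
        self.total_rows = total_rows

    def open(self):
        self.scanned = 0
        self.matches = 0

    def iterate(self, qualifies):
        self.scanned += 1
        self.matches += bool(qualifies)

    def estimate(self):
        if self.scanned == 0:
            return None
        # Scale the fraction seen so far up to the whole table.
        return self.matches / self.scanned * self.total_rows

    def close(self):
        return self.matches  # exact once every row has been scanned


avg = OnlineAvg()
avg.open()
avg.iterate(3.0)
avg.iterate(4.0)
avg_midway = avg.estimate()
avg.iterate(2.0)
avg_final = avg.close()

flags = [True, False, True, True, False, False, True, False]
cnt = OnlineCount(total_rows=len(flags))
cnt.open()
for f in flags[:4]:
    cnt.iterate(f)
count_midway = cnt.estimate()
for f in flags[4:]:
    cnt.iterate(f)
count_final = cnt.close()
```

Here the midway COUNT estimate overshoots the final count because the first half of the scan happened to contain more qualifying rows, which is exactly why an estimate needs an accompanying confidence interval.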

API
Current API uses built-in methods
– button-press = query invocation
  – e.g., select StopGroup(val);
  – High overhead, need to know internals.
  – Very flexible. Easy to code!
– Estimates returned as tuples
  – no distinction between estimates and final answer
  – OK?
– A general API for running status?

Pacing
Inter-tuple speed is critical!!

Statistical Issues
Confidence Intervals for SQL aggs
– given an estimate, probability p that we’re within ε of the right answer
3 types of estimates
– Conservative (Hoeffding’s inequality)
– Large-Sample (Central Limit Theorems)
– Deterministic
Previous work + new results from Peter Haas
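
As a rough illustration of the first two interval types for AVG, the sketch below uses the textbook forms of Hoeffding's inequality and the central limit theorem. It assumes values are bounded in a known range and sampled independently; the estimators actually used in the talk may differ in detail, and the deterministic type is not shown. Function names and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, stdev


def hoeffding_halfwidth(n, p, lo, hi):
    """Conservative half-width eps with P(|mean - mu| <= eps) >= p.

    Derived from 2 * exp(-2 * n * eps**2 / (hi - lo)**2) = 1 - p,
    for n independent values known to lie in [lo, hi].
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


def clt_halfwidth(sample, p):
    """Large-sample half-width eps = z_p * s / sqrt(n), using the sample stdev."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of 1000 grades on a 0..4 scale.
sample = [2.0, 3.0, 4.0, 3.0] * 250

conservative = hoeffding_halfwidth(len(sample), 0.99, lo=0.0, hi=4.0)
large_sample = clt_halfwidth(sample, 0.99)
```

The conservative bound holds for any sample size but is wider; the large-sample interval is tighter because it uses the observed variance, at the price of being only approximately valid for small n.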

Initial Implementation
prototype in PostgreSQL (a/k/a Postgres95, a/k/a PG-Lite)
– aggs with running output
– hash-based grouping, dup elimination
– index striding
– optimizer tweaks
– “API” hacks and a simple UI

Pacing study
Select AVG(grade), Interval(0.99) From ENROLL;
Normal plan

Access Methods, Big Group
Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college;
College L: 925K/1.5M tuples

Access Methods, Small Group
Surprise!!! Cost of API call?
Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college;
College S: 15K/1.5M tuples

Future Work
Better UI
– simple example: histograms w/error bars
– online data visualization (Tioga DataSplash)
  – data viz = “graphical” aggregate
  – sampled points, or sampled wavelet coefficients?
  – great for panning, zooming
Nested Queries
– Tie-ins with AI’s “anytime algorithms”

Future Work II
Control w/o Indices
– Heap Striding = do multiple scans
  – optimization 1: fast scans can pass hints to slow
  – optimization 2: “piggyback” scans via buffering
  – big payoff for complex queries
Checkpointing/continuation
– also continuous data streams

Future Work III
Sample Tracking
– important for financial auditing, statistical quality control
– cached views, RID-lists, logical views
More stats:
– non-standard aggs
– simultaneous confidence intervals
– allow for slop in cardinality estimation

Summary
DSS/OLAP users very demanding!
– “speed of thought” performance required
– only choices: precomputation or estimation
– for estimation, need to provide control and steady, useful feedback
  – puts user in the driver’s seat
  – takes some performance burden away from system
– Requires significant backend support
  – but well worth the effort
  – runs the “impossible” queries