A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.

Presentation transcript:

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth Sivaram Vallath

Why Approximate Query Answering?
Most applications that analyze large databases are OLAP and data mining applications.
Most of their queries are aggregation queries over these large databases, and are therefore expensive and resource intensive.
Approximate answers that are delivered accurately and efficiently benefit the scalability of the application.

Approaches to Approximation/Data Reduction
Use pre-computed samples of the data instead of the complete data.
Sampling:
  Weighted sampling
  Congressional sampling
  On-the-fly sampling (error prone when selections, GROUP BYs and joins are used)
Workload (deterministic solution for identical workloads)

Approaches to Approximation/Data Reduction (contd.)
“Similar” workloads are treated as an optimization problem (minimizing the error).
Histograms
Wavelets

Attacking “similar” workloads
Stratified sampling
Minimize the error in the estimation of aggregates

Drawbacks of previous studies
Lack of rigorous problem formulations leads to solutions that are difficult to evaluate theoretically.
They do not deal with uncertainty in incoming queries that are “similar” but not identical to the workload queries.
They ignore the variance in the data distribution of the aggregated columns.

Architecture for AQP
Queries with selections, foreign-key joins and GROUP BY, containing aggregation functions such as COUNT, SUM and AVG.

Architecture for AQP (contd.)
Offline component: builds the sample.
Online component:
1. Rewrites an incoming query to use the sample to answer the query approximately
2. Reports the answer with error estimates

Pre-computing Samples for Fixed Workload
Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions R_1, R_2, …, R_r such that for any region R_j, each query in W selects either all records in R_j or none.
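The definition of fundamental regions can be illustrated with a minimal sketch. It assumes each workload query is available as a Python predicate over a record, and the names `fundamental_regions`, `records` and `queries` are hypothetical, chosen only for this illustration. Two records fall in the same region exactly when the same set of queries selects them, which is what yields the minimum number of regions.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by the set of workload queries that select them.

    Records with the same "signature" (set of selecting query indexes) can
    never be told apart by the workload, so they form one region.
    """
    regions = defaultdict(list)
    for rec in records:
        signature = frozenset(i for i, q in enumerate(queries) if q(rec))
        regions[signature].append(rec)
    return dict(regions)

# Hypothetical toy relation and workload (not taken from the slides).
records = [{"id": i, "revenue": r} for i, r in enumerate([10, 20, 30, 40, 50])]
queries = [
    lambda t: t["revenue"] >= 30,  # stands in for a selection such as revenue >= 30
    lambda t: t["revenue"] <= 40,  # stands in for a selection such as revenue <= 40
]
regions = fundamental_regions(records, queries)
```

In this toy case the records split into three regions: those selected only by the second query, those selected by both, and those selected only by the first. Scanning every record against every query is the simplest possible approach and would likely be too slow for large relations.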

Fixed Workload
1. Identify all fundamental regions; let r be their number. After step 1 there are two cases, where k is the sample size: Case A (r ≤ k) and Case B (r > k).
Case A (r ≤ k):
2. Pick sample records: pick exactly one record from each important fundamental region.
3. Assign appropriate values to the additional columns in the sample records.
Case B (r > k):
2. Pick sample records: select k regions and then pick one record from each of the selected regions. The heuristic is to select the top k.
3. Assign values to the additional columns. This is an optimization problem; it is solved by taking partial derivatives and solving the resulting linear equations with the Gauss-Seidel method.
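The Gauss-Seidel method mentioned in step 3 of Case B is a standard iterative solver for linear systems. The sketch below is a generic version of that method, not the specific system that FIXED sets up; the function name `gauss_seidel`, the matrix `A` and the vector `b` are hypothetical examples. It assumes nonzero diagonal entries and a matrix for which the iteration converges (for example, a strictly diagonally dominant one).

```python
def gauss_seidel(A, b, iterations=100, tol=1e-10):
    """Solve A x = b iteratively, reusing updated components immediately."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        max_change = 0.0
        for i in range(n):
            # Sum over the other unknowns, using the newest values available.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            break  # successive iterates barely changed; treat as converged
    return x

# Hypothetical 2x2 system: 4x + y = 9 and 2x + 3y = 13.
A = [[4.0, 1.0], [2.0, 3.0]]
b = [9.0, 13.0]
solution = gauss_seidel(A, b)
```

For this small system the iterates approach x = 1.4 and y = 3.4.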

Disadvantage
A per-query (probabilistic) error guarantee is not possible.
If an incoming query is not identical to a query in the given workload, FIXED can result in unpredictable errors.

Lifting Workload to Query Distributions
The solution should be resilient to situations where incoming queries are “similar” but not identical to the workload queries.
Similarity is not based on syntax: two queries are similar if the records they return overlap significantly.
Each record has an associated probability that an incoming query will select it.

Rationale for Stratified Sampling
An effective stratification scheme should keep the expected variance, over all queries, small within each stratum, and should allocate more samples to strata with larger expected variances.
Minimize the MSE of the lifted workload.
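The idea of giving more samples to strata with larger variance can be sketched with classical Neyman allocation, where the share of stratum j is proportional to N_j times its standard deviation. This is a textbook stand-in, not the workload-based allocation derived in the paper; the function name `neyman_allocation` and the rounding by largest remainder are assumptions made for this illustration.

```python
import statistics

def neyman_allocation(strata, k):
    """Split k samples across strata proportionally to N_j * stddev_j."""
    weights = [len(s) * statistics.pstdev(s) for s in strata]
    total = sum(weights)
    if total == 0:
        # All strata are constant: fall back to proportional allocation.
        weights = [float(len(s)) for s in strata]
        total = sum(weights)
    exact = [k * w / total for w in weights]
    alloc = [int(e) for e in exact]  # round down first
    # Hand the remaining samples to the largest fractional parts.
    order = sorted(range(len(strata)), key=lambda j: exact[j] - alloc[j], reverse=True)
    for j in order[: k - sum(alloc)]:
        alloc[j] += 1
    return alloc

# Hypothetical strata of aggregate-column values: one constant, one highly variable.
strata = [[10, 10, 10, 10], [1, 100, 1, 100]]
allocation = neyman_allocation(strata, 4)
```

Here the constant stratum receives no samples and the variable one receives all four. The sketch does not cap an allocation at the stratum size, and a stratum with zero samples cannot contribute to an estimate, so a practical version would need to handle both cases.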

Example (table columns: ProductID, Revenue)

Solution for single-table selection queries with aggregation
1. Stratification
   1. How many strata are required in the partition?
   2. How many records should each stratum have?
2. Allocation
   1. Determine the number of samples required from each stratum
3. Sampling
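The sampling step, and how a sample built this way answers a SUM query, can be sketched as follows. This is a minimal illustration under stated assumptions: strata are plain Python lists of values, the per-stratum sample sizes in `alloc` are already decided, and the names `stratified_sum_estimate`, `predicate` and `value` are hypothetical. Each sampled record is scaled by N_j / k_j, the inverse of its sampling fraction.

```python
import random

def stratified_sum_estimate(strata, alloc, predicate, value, seed=0):
    """Estimate SUM(value) over records satisfying predicate from a stratified sample."""
    rng = random.Random(seed)
    estimate = 0.0
    for stratum, k_j in zip(strata, alloc):
        if k_j == 0:
            continue  # an unsampled stratum contributes nothing (this biases the estimate)
        sample = rng.sample(stratum, k_j)  # uniform, without replacement
        scale = len(stratum) / k_j         # each sampled record stands for N_j / k_j records
        estimate += scale * sum(value(t) for t in sample if predicate(t))
    return estimate

# Hypothetical data: sampling every record must reproduce the exact answer.
strata = [[1, 2, 3], [100, 200]]
exact_when_fully_sampled = stratified_sum_estimate(
    strata, [3, 2], predicate=lambda v: v >= 2, value=lambda v: v
)
```

With every record sampled the scale factor is 1 and the estimate equals the exact sum of 305; with smaller allocations the result depends on which records are drawn.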

Pragmatic Issues
Identifying fundamental regions
Handling a large number of fundamental regions
Obtaining integer solutions
Obtaining an unbiased estimator

Extensions for More General Queries
GROUP BY queries
JOIN queries
Other extensions
Mix of COUNT and SUM queries

Conclusion
The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.