DAQ: A New Paradigm for Approximate Query Processing. Navneet Potti, Jignesh Patel. VLDB 2015.

Presentation transcript:

DAQ: A New Paradigm for Approximate Query Processing. Navneet Potti, Jignesh Patel. VLDB 2015.

Outline: Approximate Query Processing; SAQ; DAQ; Bitwise DAQ Scheme; Evaluation; Conclusion.

Approximate Query Processing. Data volume is growing exponentially. Queries are interactive to support real-time decisions. Decisions are resilient to small errors. A quick approximate answer is better than a slow exact answer. Exploratory analysis demands responsiveness. E.g., an average revenue estimate of $12M is about as good as $12,345,678.

SAQ: Sampling-based Approximate Querying. Run the query on a small random subset of the data. The error in the estimate is presented as a confidence interval, e.g., Avg revenue = $12.3 ± 0.1 million with 95% confidence. Can be “online”: the error eventually shrinks to zero, yielding the exact answer.
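The SAQ idea can be illustrated with a minimal sketch. It assumes uniform sampling without replacement and a normal-approximation confidence interval (z = 1.96 for roughly 95%); the function name `saq_average` and the synthetic revenue data are hypothetical and not taken from the talk.

```python
import math
import random
import statistics


def saq_average(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width) of an approximate confidence interval
    based on the normal approximation. The finite-population correction
    is ignored to keep the sketch short.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic, skewed "revenue" data (an assumption for illustration only).
data_rng = random.Random(42)
revenue = [data_rng.lognormvariate(2.0, 1.0) for _ in range(100_000)]

for k in (100, 1_000, 10_000):
    est, half_width = saq_average(revenue, k)
    print(f"sample size {k:>6}: avg = {est:.3f} ± {half_width:.3f}")
```

The interval is a probabilistic statement only: nothing in this computation bounds how far the estimate can be from the true average in the unlucky cases.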

Confidence Intervals. Avg revenue = $12.3 ± 0.1 million with 95% confidence. What does this mean? [Figure: probability distribution of average revenue.] With 95% probability, the true average revenue lies within 12.3 ± 0.1 million. With 5% probability, the true average revenue lies outside 12.3 ± 0.1 million.

Confidence Intervals. E.g., Avg revenue = $12.3 ± 0.1 million with 95% confidence. How should we interpret the tails? [Figure: probability distribution of average revenue.] Tails occur 5% of the time: when estimating average regional revenue for 100 states, about 5 of the estimates are expected to fall in the tails. There is no bound on the error in the tail: the average revenue could be as small as $1M or as large as $100M.

SAQ: Shortcomings.
Semantics of the tails of confidence intervals are hard to interpret: 95% confidence interval bounds are unintuitive.
Intervals are very broad for outlier aggregates like MAX or Top 100: need to see more of the data to find outliers, so convergence is slow.
Intervals are hard to manipulate, with no closed algebra: is 100 ± 10 “greater than” 90 ± 20? How do we add these intervals?

DAQ: Deterministic Approximate Querying. Pop quiz! Estimate the sum: 5,321, ,151, ,362,296 = ? a. Approximately 2.4 million? b. Approximately 9.7 million? c. Approximately 13.8 million? d. Approximately 17.0 million? Successive refinements of the answer: = 9,???,??? then = 9,7??,??? then = 9,83?,???
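A small sketch of the digit-wise estimate behind the quiz: only the leading digits of each addend are read, and the unread digits are assumed to be all 0s (lower bound) or all 9s (upper bound). The addends below are made up for illustration, since the slide's numbers are truncated in this transcript, and the function name is hypothetical.

```python
def digitwise_sum_bounds(numbers, digits_seen, width):
    """Bound sum(numbers) after reading only the `digits_seen` most
    significant decimal digits of each `width`-digit number."""
    unit = 10 ** (width - digits_seen)        # place value of the last digit read
    lower = sum((n // unit) * unit for n in numbers)  # unread digits all 0
    upper = lower + len(numbers) * (unit - 1)         # unread digits all 9
    return lower, upper


# Hypothetical 7-digit addends (not the values from the slide).
addends = [5_321_478, 1_151_902, 3_362_296]

for k in range(1, 8):
    lo, hi = digitwise_sum_bounds(addends, k, width=7)
    print(f"{k} digit(s) read: {lo:>10,} <= sum <= {hi:>10,}")
```

Each extra digit shrinks the guaranteed interval by about a factor of ten, and once all digits are read the interval collapses to the exact sum.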

DAQ: Deterministic Approximate Querying. Use deterministic intervals instead of probabilistic (confidence) intervals. Guaranteed upper and lower bounds, e.g., Avg revenue = $12.3 ± 0.2 million. Can be “online”: the error interval eventually becomes degenerate, yielding the exact answer. [Figure: DAQ UI mockup.]

SAQ vs DAQ (at a glance)
SAQ: Complex semantics using confidence intervals due to the “tail”. DAQ: Simple semantics using deterministic intervals as there is no “tail”.
SAQ: Slow for outlier aggregates like MAX or Top 100 and heavy-tailed data. DAQ: Fast for outlier aggregates like MAX or Top 100 and heavy-tailed data.
SAQ: No closed algebra. No clear semantics for predicates and arithmetic operations on estimates. DAQ: Closed relational algebra. Clear semantics for predicates and arithmetic operations using interval algebra.

Conceptual DAQ Scheme. Hierarchically partition the attribute’s domain. Estimates are represented as intervals [a,b]. E.g., a count B-tree.
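One way to picture the conceptual scheme is the sketch below. It assumes a hypothetical structure that splits an integer domain into 2^depth equal-width buckets and keeps only a row count per bucket (loosely in the spirit of a count B-tree level); it is an illustration, not the authors' data structure.

```python
def sum_bounds(values, domain_lo, domain_hi, depth):
    """Bound sum(values) using only per-bucket counts at a given depth.

    The integer domain [domain_lo, domain_hi) is split into 2**depth
    equal-width buckets; the domain size is assumed to be divisible
    by 2**depth.
    """
    buckets = 2 ** depth
    width = (domain_hi - domain_lo) // buckets
    counts = [0] * buckets
    for v in values:
        counts[(v - domain_lo) // width] += 1
    # A value in bucket i lies in [start_i, start_i + width - 1].
    lower = sum(c * (domain_lo + i * width) for i, c in enumerate(counts))
    upper = sum(c * (domain_lo + (i + 1) * width - 1) for i, c in enumerate(counts))
    return lower, upper


values = [5, 130, 700, 1023]  # made-up attribute values in [0, 1024)
for depth in (0, 1, 4, 10):
    lo, hi = sum_bounds(values, 0, 1024, depth)
    print(f"depth {depth:>2}: sum in [{lo}, {hi}]")
```

Because the buckets nest, descending one level can only tighten the interval [a,b], and at the finest level it degenerates to the exact sum.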

Interval Algebra. Predicate evaluation and an interval representation for relations. Example: which cities have population > 100? The result separates the cities that are certainly > 100 from those that are potentially > 100.
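A minimal sketch of interval algebra on deterministic bounds, assuming closed intervals [lo, hi]. The class, method names, and city data are hypothetical. It also revisits the earlier question of whether 100 ± 10 is greater than 90 ± 20, which has a clear answer once the bounds are guaranteed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # The sum of two bounded quantities is bounded by the sums of the bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def certainly_gt(self, other):
        return self.lo > other.hi

    def possibly_gt(self, other):
        return self.hi > other.lo


def select_greater_than(estimates, threshold):
    """Split keys into those certainly and those potentially above threshold."""
    certain = {k for k, iv in estimates.items() if iv.lo > threshold}
    possible = {k for k, iv in estimates.items() if iv.hi > threshold}
    return certain, possible


a = Interval(90, 110)   # 100 ± 10
b = Interval(70, 110)   # 90 ± 20
print(a + b)                                 # Interval(lo=160, hi=220)
print(a.certainly_gt(b), a.possibly_gt(b))   # False True

# Made-up population estimates.
cities = {"A": Interval(120, 150), "B": Interval(80, 130), "C": Interval(40, 90)}
print(select_greater_than(cities, 100))
```

The "certain" set is always a subset of the "possible" set, so a relation can itself be represented as an interval between the two.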

Formal Definition. Interval estimates for attributes and relations. Operators consume and produce intervals. Closed relational algebra. Online DAQ scheme: monotonically shrinking intervals that converge to the exact answer.

Bitwise DAQ Scheme. Similar to the decimal digit-wise sum example. Uses a bit-sliced index representation.

Bitwise DAQ Scheme. Use the most significant m bits for evaluation. The remaining n-m bits are set to all-0 and all-1 to obtain bounds. The error bound decreases exponentially with m: 2^(n-m).
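The bounding rule for a single n-bit value can be sketched as follows; the function name is hypothetical and values are assumed to be unsigned integers below 2^n.

```python
def bitwise_bounds(value, n, m):
    """Bounds on an n-bit unsigned value when only its top m bits are read."""
    shift = n - m
    lower = (value >> shift) << shift     # unread low bits set to all-0
    upper = lower | ((1 << shift) - 1)    # unread low bits set to all-1
    return lower, upper


value, n = 0b10110110, 8  # 182, a made-up example value
for m in range(n + 1):
    lo, hi = bitwise_bounds(value, n, m)
    print(f"m={m}: [{lo:3d}, {hi:3d}]  width={hi - lo + 1}")
```

The interval contains exactly 2^(n-m) candidate values, so each additional bit halves the uncertainty.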

Bitwise DAQ Scheme Algorithms. Average aggregation up to m bits.
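The algorithm itself is not reproduced in this transcript, so the following is only a sketch of how an average could be bounded from the top m slices of a bit-sliced index. It assumes the usual bit-sliced sum (each slice contributes its population count times its place value) and models bitvectors as Python integers; all names are hypothetical.

```python
def build_bitslices(values, n):
    """slices[j] is a bitvector (int) whose bit `row` equals bit j of values[row]."""
    slices = [0] * n
    for row, v in enumerate(values):
        for j in range(n):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def avg_bounds(slices, num_rows, m):
    """Bounds on the average using only the m most significant slices."""
    n = len(slices)
    partial_sum = 0
    for j in range(n - 1, n - m - 1, -1):
        partial_sum += (1 << j) * bin(slices[j]).count("1")  # popcount * place value
    slack = (1 << (n - m)) - 1  # most that the unread bits can add to any one row
    lower = partial_sum / num_rows
    return lower, lower + slack


values = [3, 12, 7, 9]  # made-up 4-bit column; exact average is 7.75
slices = build_bitslices(values, n=4)
for m in range(5):
    print(m, avg_bounds(slices, len(values), m))
```

Only m slices are scanned per estimate, which is where the savings over reading the full column would come from in this sketch.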

Bitwise DAQ vs. Baseline. Exponentially decreasing error bounds in estimating Avg.

Bitwise DAQ Scheme Algorithms. Less Than predicate up to m bits.
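As with the average, the slide's algorithm is not in the transcript. The sketch below assumes the standard most-significant-bit-first comparison on a bit-sliced index, stopped after m slices, and returns the rows that certainly satisfy value < c together with the rows that potentially do. Bitvectors are modeled as Python integers and all names are hypothetical.

```python
def build_bitslices(values, n):
    """slices[j] is a bitvector (int) whose bit `row` equals bit j of values[row]."""
    slices = [0] * n
    for row, v in enumerate(values):
        for j in range(n):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def less_than(slices, num_rows, c, m):
    """Evaluate `value < c` using only the m most significant slices.

    Returns (certain_rows, possible_rows) as sets of row ids; assumes
    0 <= c < 2**n where n = len(slices).
    """
    n = len(slices)
    all_rows = (1 << num_rows) - 1
    lt = 0          # rows already known to be < c
    eq = all_rows   # rows whose bits read so far match c's bits
    for j in range(n - 1, n - m - 1, -1):
        if (c >> j) & 1:
            lt |= eq & ~slices[j]   # c has 1, row has 0: row is smaller
            eq &= slices[j]
        else:
            eq &= ~slices[j]        # c has 0, row has 1: row is larger
    # Rows still tied can only be smaller if c has a 1 among its unread bits.
    undecided = eq if c & ((1 << (n - m)) - 1) else 0

    def rows(bitvector):
        return {r for r in range(num_rows) if (bitvector >> r) & 1}

    return rows(lt), rows(lt | undecided)


values = [3, 12, 7, 9]  # made-up 4-bit column
slices = build_bitslices(values, n=4)
for m in range(5):
    print(m, less_than(slices, len(values), c=9, m=m))
```

The certain set grows and the possible set shrinks as m increases, and the two coincide once all n slices have been read.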

Bitwise DAQ vs. Baseline. Predicate evaluation: 6x speedup using 8 bits for < 1% error.

Bitwise DAQ vs. Baseline. Top 100: 3.5x speedup for < 1% error on Uniform and Zipf data.

Bitwise DAQ vs. SAQ. Top 100: DAQ performs better for heavy-tailed data (Zipf).

Bitwise DAQ Summary.
Excels at: predicates, outlier aggregates, heavy-tailed data.
Suffers for: simpler aggregates (sum, avg), uniform data.
Compression improves performance further: can operate directly on compressed bitvectors; > 8x speedup for Top 100 on Zipf data.
Embarrassingly parallelizable: NUMA-aware parallelization.

Future Work. B-tree-like indices. Extension to other operators (Group By, Join). Extension to other data types. Query optimization. Combining SAQ and DAQ.