A Black-Box Approach to Query Cardinality Estimation

A Black-Box Approach to Query Cardinality Estimation
Tanu Malik, Randal Burns The Johns Hopkins University Nitesh V. Chawla Notre Dame University

The Black Box Approach Estimate query result sizes without knowledge of Underlying data distributions Query execution plan Machine learning techniques Group queries into syntactic families (templates) Learn in a high-dimension, complex input space Attributes, operators, function arguments, aggregates Partition input space Learn regression functions in each partition Self-tuning, self-correcting models When compared with bottom-up estimation Produces accurate, highly compact, and fast models Lose ability to evaluate sub-plans Independent in estimation process as well Note in contrast to a bottom up approach

Are new techniques needed?
Working with federated and remote data sources No access to data (privacy and performance concerns) Many data sources (can’t keep estimates for all) Our motivation: caching in federations Ask the DB optimizer? Other applications Replica maintenance Grid workflow Distributed query schedulers Economic caching framework, caching decisions rely on query result size

Astronomy Example Typical query Sample bottom-up plan
User-defined functions Mathematical expressions Sample bottom-up plan Many sub-estimates Toug time w/ UDFs and with transformed variables

The Spatial Function Executed at the backend database
Data distribution and queries in attribute domains Function computes a range query Showing the access pattern at execution

Workload Observed at Cache
Point queries in 3-dimensional space 2-d projection on attributes shown Query result-size (log cardinality) Cache does not know domains, only witnesses query paramters and their yields

Learning Query yields are k-means clustered into classes
Two-shown, typically 4-8 Two classes shown, typically 4-8

Learning Query yields are k-means clustered into classes
Class boundaries and regression functions Learning techniques: model trees, classification and regression, and locally-weighted regression Two classes shown, typically 4-8

Virtues of the Black Box
No errors from modeling assumptions, because it makes no assumptions Conditional independence Join distributions Accurate estimates for complex queries User-defined functions High-dimensional queries Multi-way joins Point queries Performance (later)

Drawbacks of the Black Box
Semantic losses Does not use indexes, uniqueness, constraints When available, treat as exceptions Not integrated with query execution plans No sub-plan estimates No what-if scenarios can be explored Parallel execution Operator re-ordering Not naturally suited to the database optimization It’s a middleware technique

Overview of Results How many trees? How accurate?

Space and Time How big? How fast?

A Black-Box Approach to Query Cardinality Estimation
Tanu Malik, Randal Burns The Johns Hopkins University Nitesh V. Chawla Notre Dame University

Quick Comparison Self-tuning histograms, e.g. STHoles, STGrid, others
Machine learning, self-tuning, based on observed workload Produce an estimated data distribution Histograms limited to range queries Costing User-Defined Functions [He et al. 2005] Estimate based on weighted nearest k-neighbors Restricted to function arguments Does not build a model The Black Box approach Data independent in both inputs and estimation Rich input space: enumerated domains, operators, and aggregates Compact models, summary data structures

A Black-Box Approach to Query Cardinality Estimation

Similar presentations

Presentation on theme: "A Black-Box Approach to Query Cardinality Estimation"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Black-Box Approach to Query Cardinality Estimation

Similar presentations

Presentation on theme: "A Black-Box Approach to Query Cardinality Estimation"— Presentation transcript:

Similar presentations

About project

Feedback