A Black-Box Approach to Query Cardinality Estimation

Presentation transcript:

A Black-Box Approach to Query Cardinality Estimation
Tanu Malik, Randal Burns (The Johns Hopkins University)
Nitesh V. Chawla (Notre Dame University)

The Black Box Approach
- Estimate query result sizes without knowledge of:
  - Underlying data distributions
  - Query execution plan
- Machine learning techniques
  - Group queries into syntactic families (templates)
  - Learn in a high-dimensional, complex input space: attributes, operators, function arguments, aggregates
  - Partition the input space
  - Learn regression functions in each partition
  - Self-tuning, self-correcting models
- When compared with bottom-up estimation
  - Produces accurate, highly compact, and fast models
  - Loses the ability to evaluate sub-plans
Note: the approach is also data-independent in the estimation process, in contrast to a bottom-up approach.
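
The grouping of queries into syntactic families can be illustrated with a minimal sketch. It assumes that a template is simply the query text with its numeric constants replaced by placeholders, and that those constants form the input vector for learning; the slides do not specify how templates are actually derived. The function names, table names, and row counts below are hypothetical.

```python
import re
from collections import defaultdict

# Numeric literal: optional sign, digits, optional fraction and exponent.
# The lookbehind keeps digits inside identifiers (e.g. "t1") untouched.
_NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def to_template(sql):
    """Return (template, params): the query shape and its numeric constants."""
    params = [float(tok) for tok in _NUMBER.findall(sql)]
    shape = _NUMBER.sub("?", sql)
    # Normalize whitespace and case so equivalent shapes compare equal.
    # (Assumption: queries contain no case-sensitive string literals.)
    return " ".join(shape.split()).lower(), params


def group_by_template(workload):
    """workload: iterable of (sql, observed_row_count) pairs."""
    families = defaultdict(list)
    for sql, rows in workload:
        template, params = to_template(sql)
        families[template].append((params, rows))
    return families


# Hypothetical workload; table names, columns, and yields are made up.
workload = [
    ("SELECT * FROM PhotoObj WHERE ra BETWEEN 180.0 AND 181.5 AND dec > -1.25", 420),
    ("select * from PhotoObj where ra between 10 and 12.5 and dec > 3", 97),
    ("SELECT COUNT(*) FROM SpecObj WHERE z < 0.1", 1),
]
families = group_by_template(workload)
```

Each family then holds (parameter vector, observed yield) pairs, which is the kind of training data a per-template model could be learned from.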

Are new techniques needed?
- Working with federated and remote data sources
  - No access to data (privacy and performance concerns)
  - Many data sources (can’t keep estimates for all)
- Our motivation: caching in federations
  - Ask the DB optimizer?
- Other applications
  - Replica maintenance
  - Grid workflow
  - Distributed query schedulers
Note: in an economic caching framework, caching decisions rely on query result size.

Astronomy Example
- Typical query
  - User-defined functions
  - Mathematical expressions
- Sample bottom-up plan
  - Many sub-estimates
  - Tough time with UDFs and with transformed variables

The Spatial Function
- Executed at the backend database
- Data distribution and queries in attribute domains
- Function computes a range query
- Showing the access pattern at execution

Workload Observed at Cache
- Point queries in 3-dimensional space
  - 2-d projection on attributes shown
- Query result size (log cardinality)
- Cache does not know domains; it only witnesses query parameters and their yields

Learning
- Query yields are k-means clustered into classes
- Two classes shown, typically 4-8
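
As a rough illustration of clustering yields into classes, the sketch below runs a plain one-dimensional k-means (Lloyd's algorithm) on log-scaled result sizes. The use of log10, the deterministic initialization, and the sample yields are assumptions made for the example, not details taken from the slides.

```python
import math


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical observed yields (row counts) for one query template.
yields = [3, 5, 4, 8, 12000, 9500, 15000, 11000]
log_yields = [math.log10(y) for y in yields]
centroids, labels = kmeans_1d(log_yields, k=2)
```

With these made-up yields the small and large result sizes fall into two separate classes; a real workload would use more classes, in line with the 4-8 mentioned on the slide.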

Learning
- Query yields are k-means clustered into classes
- Class boundaries and regression functions
- Learning techniques: model trees, classification and regression, and locally-weighted regression
- Two classes shown, typically 4-8
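
The two-stage idea of class boundaries plus per-class regression functions can be sketched as follows. This is a simplified stand-in, not the model trees or locally-weighted regression named on the slide: it assumes a single numeric query parameter, assigns a class by the nearest class center in parameter space, and fits an ordinary least-squares line per class. All names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; constant fit if x never varies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def train(samples):
    """samples: list of (param, log_yield, class_label) triples."""
    model = {}
    for label in {s[2] for s in samples}:
        xs = [p for p, _, lab in samples if lab == label]
        ys = [y for _, y, lab in samples if lab == label]
        model[label] = {"center": sum(xs) / len(xs), "line": fit_line(xs, ys)}
    return model


def predict(model, param):
    """Pick the class with the nearest center, then apply its regression line."""
    label = min(model, key=lambda lab: abs(param - model[lab]["center"]))
    a, b = model[label]["line"]
    return label, a + b * param


# Hypothetical training data: (query parameter, log10 yield, yield class).
samples = [
    (1.0, 1.0, 0), (2.0, 1.5, 0), (3.0, 2.0, 0),
    (10.0, 4.0, 1), (11.0, 4.2, 1), (12.0, 4.4, 1),
]
model = train(samples)
low_class, low_estimate = predict(model, 2.5)
high_class, high_estimate = predict(model, 11.5)
```

Splitting the problem this way lets each regression function stay simple, since it only has to fit the queries within one yield class.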

Virtues of the Black Box
- No errors from modeling assumptions, because it makes no assumptions
  - Conditional independence
  - Join distributions
- Accurate estimates for complex queries
  - User-defined functions
  - High-dimensional queries
  - Multi-way joins
  - Point queries
- Performance (later)

Drawbacks of the Black Box
- Semantic losses
  - Does not use indexes, uniqueness, constraints
  - When available, treat as exceptions
- Not integrated with query execution plans
  - No sub-plan estimates
- No what-if scenarios can be explored
  - Parallel execution
  - Operator re-ordering
- Not naturally suited to database optimization
  - It’s a middleware technique

Overview of Results
- How many trees?
- How accurate?

Space and Time
- How big?
- How fast?

Quick Comparison
- Self-tuning histograms, e.g. STHoles, STGrid, others
  - Machine learning, self-tuning, based on observed workload
  - Produce an estimated data distribution
  - Histograms limited to range queries
- Costing User-Defined Functions [He et al. 2005]
  - Estimate based on weighted k-nearest neighbors
  - Restricted to function arguments
  - Does not build a model
- The Black Box approach
  - Data independent in both inputs and estimation
  - Rich input space: enumerated domains, operators, and aggregates
  - Compact models, summary data structures
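
For contrast with a learned model, a weighted k-nearest-neighbor estimate can be sketched generically as below. This is an inverse-distance-weighted average over stored observations, shown only to illustrate the "does not build a model" point; it is not claimed to be the method of He et al., and the function name, history, and costs are hypothetical.

```python
import math


def knn_estimate(history, query, k=3, eps=1e-9):
    """Inverse-distance-weighted mean of the k nearest past observations.

    history: list of (argument_tuple, observed_value) pairs.
    query:   argument tuple of the call to estimate.
    """
    nearest = sorted(history, key=lambda h: math.dist(h[0], query))[:k]
    # eps avoids division by zero when the query matches a stored point.
    weights = [1.0 / (math.dist(args, query) + eps) for args, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Hypothetical history of function arguments and observed values.
history = [
    ((0.0, 0.0), 10.0),
    ((2.0, 0.0), 20.0),
    ((10.0, 10.0), 500.0),
]
estimate = knn_estimate(history, (1.0, 0.0), k=2)
```

Every estimate scans the stored observations, so the history itself is the summary structure, whereas the black-box approach keeps only compact learned models.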