Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.


Overview
- Issues and Questions
- Approximate Pre-Aggregation (APA)
- Maximum Likelihood Estimation
- Performing the Estimation
- Experiments
- Observations

Issues
- Assumes numerical or discretized attributes, but many data warehouse (DSS) attributes are categorical (e.g. country = 'France').
- Cannot handle too many dimensions.
- Stratified sampling (used to handle variance in the values over which aggregation is performed) assumes the workload is known; targeting particular queries means other queries suffer.

Questions
- Can we decrease the effect of random sampling variance on categorical (or mixed) data?
- Can we still require only one pass through the data?
- Can we increase the accuracy of the estimation for all aggregate queries?
- Can we do this without prior knowledge of the query workload?

Approximate Pre-Aggregation
- Uses true random sampling that is not biased towards any query or workload.
- The sample is combined with a small set of statistics about the data, gathered at the same time the sampling is performed.

Approximate Pre-Aggregation (example)
Using a sample of 50% of the data (unshaded) to answer the query

  select SUM(complaints) from sample where prof = 'Smith'

we get 36, and since the sample constitutes 50% of the database tuples, we estimate that 72 students complained about Professor Smith. The actual number is 121 – a relative error of 40.5%! The problem is the variance in the number of students who complained each semester. APA uses some additional information to decrease this error.
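A minimal sketch of the scale-up estimator used above, with the numbers taken from the slide:

```python
# Naive sample scale-up estimate for SUM, using the slide's numbers.
sample_sum = 36            # SUM(complaints) over the 50% sample
sampling_fraction = 0.5
estimate = sample_sum / sampling_fraction      # 72.0
actual = 121                                   # true answer from the full table
relative_error = abs(actual - estimate) / actual
print(estimate, round(relative_error, 3))      # 72.0 0.405
```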

Approximate Pre-Aggregation (cont.)
Suppose we also know that the total number of complaints in the entire table is 148, and the total number of database tuples is 16.
1. Use the selection predicate (i.e. the "where" clause) to divide the data space into 2^n quadrants. In this example n = 1 (prof = 'Smith'), so there are two quadrants: prof = 'Smith' and prof <> 'Smith'.
2. For each quadrant, estimate the probability density function (pdf).
3. Use the additional information to create constraints on the distributions. In this example, we know that the number of complaints is 148 and the number of tuples is 16, which gives a mean of 148/16 = 9.25.
4. Resolve any errors in the mean. In this example, the mean of the prof = 'Smith' quadrant is 4.5 and the mean of the prof <> 'Smith' quadrant is 1.25, for a total mean of 5.75 – significantly different from 9.25.
5. Use Maximum Likelihood Estimation (MLE) to re-estimate the mean. In this example, the mean of complaints for Professor Smith becomes 8.52 (instead of 4.5), which gives an estimated total of about 136 (8.52 * 16) – much closer to the actual 121 than 72.
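A small sketch of the bookkeeping above, under the reading (an assumption, but consistent with the slide's numbers, e.g. 4.5 = 72/16) that a quadrant "mean" is its sample-scaled SUM divided by the total tuple count:

```python
known_total = 148                      # total complaints in the whole table (known fact)
num_tuples = 16                        # total tuples in the table (known fact)
known_mean = known_total / num_tuples              # 9.25

# Quadrant "means": sample-scaled SUM for each quadrant divided by num_tuples,
# so the quadrant means should add up to the known overall mean.
smith_mean = 72 / num_tuples                        # 4.5  (scaled SUM for prof = 'Smith')
other_mean = 1.25                                   # as given on the slide for prof <> 'Smith'
implied_mean = smith_mean + other_mean              # 5.75, well short of 9.25
# The MLE step on the next slides redistributes this gap; on the slide the
# Smith quadrant mean moves from 4.5 to 8.52, giving roughly 8.52 * 16 = 136.
print(known_mean, implied_mean)
```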

Maximum Likelihood Estimation
- Let x be an observable outcome from a given experiment. In our example, x might be the fact that our random sample predicts that the value of our relational aggregate SUM query is 72.
- Let the pdf with respect to observing outcome x be the following function, where the parameters are the hidden parameters that we wish to estimate (they describe the model we want to discover):
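The formula on this slide was an image that is missing from the transcript; in standard notation (symbol names assumed here) the density would be written as:

```latex
f(x \mid \Theta), \qquad \Theta = (\theta_1, \theta_2, \ldots)
```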

Maximum Likelihood Estimation (cont.)
In order to "fit" the hidden model parameters to the observation (or observations), we maximize the likelihood (in practice, the log-likelihood) that our particular model produced the data:
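The maximization itself was also shown as an image on the slide; the standard form, using the notation assumed above, is:

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta} \; \sum_{i} \log f(x_i \mid \Theta)
```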

Maximum Likelihood Estimation (in APA)
- The basic idea of APA is to find the best (most likely) explanation for the sample that does not violate any of the known facts about the database.
- We need to pose the problem of approximate aggregation over categorical data as an MLE problem.
- Three specific components of maximum likelihood must be described in the context of APA:
  - the experimental outcomes x_1, x_2, …, x_n
  - the model parameters
  - the pdf
- Once these three components have been defined, we have transformed the estimation of aggregate functions over categorical data into an MLE problem, and we can begin developing a method to solve it.

Maximum Likelihood Estimation (outcomes)
First, we describe how we obtain the "outcomes" x_1, x_2, …, x_n needed to pose APA as an MLE problem. In APA, those outcomes are a set of predictions made by our sample. Suppose we have the following aggregate query:

  select SUM(salary) from Employee where sex='M' and dept='accounting' and job_type='supervisor'

We number each of the clauses in the relational selection predicate from 1 to m. In this case: b_1 = (sex = 'M'), b_2 = (dept = 'accounting') and b_3 = (job_type = 'supervisor'), along with the negation of each clause. Conceptually, the result is a data cube (which we will call 2^m), and each conjunction of the boolean conditions and their negations corresponds to a single cell in this multidimensional data cube.

Maximum Likelihood Estimation (outcomes, cont.)
The outcomes x_1, x_2, …, x_{2^m} for the MLE in APA are the results of the aggregate function in question with respect to each of the relational selection predicates in 2^m, estimated using our sample. Examples:
- Our sample-based estimate for SUM(salary) over b_1 ^ b_2 ^ b_3 is $1.5M. Since b_1 ^ b_2 ^ b_3 is the first entry in 2^m, $1.5M is used as the value of x_1.
- Our sample-based estimate for SUM(salary) over b_1 ^ ~b_2 ^ ~b_3 is $1.1M. This is the fourth entry in 2^m, so x_4 is $1.1M.
In this way, the random sample is used to estimate the value for each cell in the cube, and these values become the outcomes x_1, x_2, …, x_{2^m}.
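A hedged sketch of how the outcomes could be produced from a sample (the row schema, helper names, and the enumeration order of the cells are assumptions for illustration; the slide uses its own 1-based cell numbering):

```python
from itertools import product

def cell_outcomes(sample_rows, clauses, sampling_fraction, value_of):
    """Return one scaled-up SUM estimate per cell of the 2^m cube.

    sample_rows: list of dicts (one per sampled tuple)
    clauses: list of boolean predicates b_j(row)
    value_of: function returning the attribute being aggregated (e.g. salary)
    """
    outcomes = []
    for signs in product([True, False], repeat=len(clauses)):
        # A cell is a conjunction of each clause or its negation.
        cell_sum = sum(value_of(r) for r in sample_rows
                       if all(b(r) == s for b, s in zip(clauses, signs)))
        outcomes.append(cell_sum / sampling_fraction)   # scale up to the full table
    return outcomes

# Clauses matching the slide's query (hypothetical row schema):
clauses = [lambda r: r["sex"] == "M",
           lambda r: r["dept"] == "accounting",
           lambda r: r["job_type"] == "supervisor"]
```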

Maximum Likelihood Estimation (parameters)
Second, we need to describe the set of model parameters we will attempt to estimate. In APA, these parameters are the APA guess as to the real value of the aggregate function in question, with respect to each of the cells in the multidimensional data cube. If x_i is the value of cell i predicted by the sample, then the model parameter θ_i is the APA maximum likelihood estimate of the correct value of the aggregate function applied to cell i.
Example: Assume we know the fact (SUM(salary) where job_type != 'supervisor') = $2.3M. The relational selection predicate in this fact is (b_1 ^ b_2 ^ ~b_3) v (b_1 ^ ~b_2 ^ ~b_3) v (~b_1 ^ b_2 ^ ~b_3) v (~b_1 ^ ~b_2 ^ ~b_3). This is a disjunction of the third, fourth, seventh and eighth predicates present in 2^m. Thus, the fact is equivalent to the constraint θ_3 + θ_4 + θ_7 + θ_8 = $2.3M.
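A short sketch of how such a fact turns into a linear constraint on the θ vector (cell indices taken from the slide, shifted to 0-based; the numpy representation is only illustrative):

```python
import numpy as np

num_cells = 2 ** 3                      # 8 cells for m = 3 predicate clauses

# Fact: SUM(salary) where job_type != 'supervisor' equals $2.3M, which covers
# the 3rd, 4th, 7th and 8th cells of the cube (0-based: 2, 3, 6, 7).
fact_cells = [2, 3, 6, 7]
constraint_row = np.zeros(num_cells)
constraint_row[fact_cells] = 1.0        # theta_3 + theta_4 + theta_7 + theta_8 ...
constraint_rhs = 2.3e6                  # ... must equal $2.3M
```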

Maximum Likelihood Estimation (pdf)
Finally, we need to define the probability density function f, which gives the likelihood of seeing the experimental observations x_1, x_2, …, x_{2^m} given model parameters θ_1, θ_2, …, θ_{2^m}.
Example: Assume we attempt to estimate the value of an aggregate of the form

  select AGG(expression) from TABLE where (predicate)

and that we have a sample of size n from a database of size db, with z the estimated value of the query based on the sample (following Hellerstein and Haas). To make a long story short, the pdf is defined as follows:
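The pdf formula itself is an image missing from this transcript. Given the CLT-based estimators of Hellerstein and Haas cited here, one plausible form, shown only as an assumption and not necessarily the paper's exact definition, treats each cell estimate x_i as approximately normal around θ_i with standard error σ_i:

```latex
f(x_1, \ldots, x_{2^m} \mid \theta_1, \ldots, \theta_{2^m}) \;\approx\;
\prod_{i=1}^{2^m} \frac{1}{\sqrt{2\pi}\,\sigma_i}
\exp\!\left( -\frac{(x_i - \theta_i)^2}{2\sigma_i^2} \right)
```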

Performing the Estimation
- There are many ways to attempt to solve a maximum likelihood problem. Approximation is usually necessary because of the inherent intractability of discovering the most likely model in the general case.
- The best-known method is the Expectation Maximization (EM) algorithm, which begins with an initial guess at the solution and repeatedly refines it until it reaches a locally optimal solution.
- EM simply seeks to maximize the log-likelihood value, while APA needs to maximize it within the constraints on the model parameters. Instead, APA uses a quadratic programming formulation.

Performing the Estimation (quadratic programming)
- Quadratic programming is an extension of linear programming in which the objective function to maximize may contain products of two variables, not just linear terms.
- A key advantage is that many algorithms have been developed that efficiently solve problems posed as quadratic programs.
- The ability of quadratic programming to incorporate constraints makes it ideal for APA: the constraints for this formulation are simply linear sums of the values θ_1, θ_2, …, θ_{2^m}.
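A minimal sketch of such a constrained formulation, assuming a Gaussian-style objective so that maximizing the log-likelihood reduces to a weighted least-squares problem with linear equality constraints (solved here via the KKT system; the weights, names, and toy numbers are assumptions):

```python
import numpy as np

def constrained_least_squares(x, w, A, b):
    """Minimize sum_i w_i * (theta_i - x_i)^2  subject to  A @ theta = b.

    x: sample-based cell estimates, w: per-cell weights (e.g. inverse variances),
    A, b: linear equality constraints derived from the known facts.
    Solved exactly through the KKT linear system of this quadratic program.
    """
    n, k = len(x), A.shape[0]
    W = np.diag(w)
    kkt = np.block([[2 * W, A.T],
                    [A, np.zeros((k, k))]])
    rhs = np.concatenate([2 * W @ x, b])
    return np.linalg.solve(kkt, rhs)[:n]     # theta; the tail holds the multipliers

# Toy usage: four cells, one known fact saying the first two cells sum to 10.
x = np.array([3.0, 4.0, 1.0, 2.0])
w = np.ones(4)
A = np.array([[1.0, 1.0, 0.0, 0.0]])
b = np.array([10.0])
print(constrained_least_squares(x, w, A, b))  # first two entries now sum to 10
```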

Experiments
The following approximation options were compared over eight real, high-dimensional data sets:
- Random sampling
- Stratified sampling
- APA0: store and use all "0-dimensional" facts as constraints in the quadratic program.
- APA1: same as APA0, plus all "1-dimensional" facts.
- APA2: same as APA1, plus all "2-dimensional" facts.
- APA3: same as APA2, plus all "3-dimensional" facts.
- Wavelets
Results from AVG aggregation (chart on the original slide):

Observations
- Wavelets are unsuitable in this domain.
- Random sampling is better for COUNT queries.
- The additional accuracy of APA2 and APA3 is probably not worth the overhead; they are also impractical for numerical data, as that would require joint distributions of numerical and categorical attributes (difficult).
- APA0 and APA1 can easily be extended to handle numerical attributes.
- It should be possible to extend APA to work across foreign key joins, using a technique for sampling from the results of joins (e.g. join synopses).
- The paper largely sidesteps issues associated with computational efficiency.