OLAP Over Uncertain and Imprecise Data Adapted from a talk by T.S. Jayram (IBM Almaden) with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM) Adapted by S. Sudarshan

Dimensions in OLAP. [Figure: two dimension hierarchies. Location: ALL covers East {MA, NY} and West {TX, CA}. Automobile: ALL covers Truck {F150, Sierra} and Sedan {Camry, Civic}.]

Measures, Facts, and Queries. [Figure: the Automobile × Location grid with precise facts p1-p8 placed in cells. Each fact occupies a single cell, e.g., ⟨Auto = F150, Loc = NY, Repair = $200⟩; a query such as ⟨Auto = Truck, Loc = East⟩ asks SUM(Repair) = ? over a region of cells.]

Restriction on Imprecision We restrict the set of values in an imprecise fact to be either: 1. a singleton set consisting of a leaf-level member of the hierarchy, or 2. the set of all leaf-level members under some non-leaf member of the hierarchy.

Cells and Regions A region is a vector of attribute values, one from the imprecise domain of each dimension of the cube. A cell is a region in which all values are leaf-level members. Let reg(R) denote the set of cells in a region R.
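As a concrete illustration, here is a minimal Python sketch of reg(R) over the two hierarchies of the running example. The dictionary encoding and function names are hypothetical, not from the paper.

    # Two toy dimension hierarchies from the running example (hypothetical encoding).
    LOC = {"East": ["MA", "NY"], "West": ["TX", "CA"]}
    AUTO = {"Truck": ["F150", "Sierra"], "Sedan": ["Camry", "Civic"]}

    def leaves(member, hierarchy):
        """Leaf-level members under a member; 'ALL' covers every leaf."""
        if member == "ALL":
            return [leaf for kids in hierarchy.values() for leaf in kids]
        if member in hierarchy:          # non-leaf member
            return hierarchy[member]
        return [member]                  # already a leaf

    def reg(region):
        """reg(R): the set of cells (pairs of leaf members) inside region R."""
        loc, auto = region
        return {(l, a) for l in leaves(loc, LOC) for a in leaves(auto, AUTO)}

    print(reg(("East", "Truck")))
    # {('MA', 'F150'), ('MA', 'Sierra'), ('NY', 'F150'), ('NY', 'Sierra')}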

Queries on precise data A query Q = (R, M, A) specifies a region R, a measure M, and an aggregation function A, e.g., Q = (⟨Truck, East⟩, Repair, SUM). The result of the query on a precise database is obtained by applying A to the measure M of all cells in R. For the example above, the result is p1 + p2.
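A minimal sketch of this evaluation, assuming facts are stored as plain dictionaries and reusing the hypothetical reg() from the sketch above; the measure values are made up:

    # Evaluate Q = (R, M, A) over precise facts (hypothetical representation).
    facts = [
        {"loc": "MA", "auto": "F150", "repair": 200},   # p1
        {"loc": "NY", "auto": "F150", "repair": 100},   # p2
    ]

    def answer(region, measure, agg, facts):
        cells = reg(region)   # set of leaf-level cells covered by R
        return agg(f[measure] for f in facts if (f["loc"], f["auto"]) in cells)

    print(answer(("East", "Truck"), "repair", sum, facts))   # 300, i.e. p1 + p2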

Extend the OLAP model to handle data ambiguity: imprecision and uncertainty.

Imprecision. [Figure: the same Automobile × Location grid, now also containing imprecise facts p9-p11 that span regions rather than single cells; e.g., ⟨Auto = F150, Loc = East, Repair = $200⟩ covers both the (F150, MA) and (F150, NY) cells.]

Representing Imprecision using Dimension Hierarchies Dimension hierarchies lead to a natural space of "partially specified" objects. Sources of imprecision: incomplete data, multiple sources of data.

Motivating Example. [Figure: a 2×2 grid (F150, Sierra) × (MA, NY) with precise facts p1-p4 in individual cells and the imprecise fact p5 spanning the whole Truck × East region.] We propose desiderata that enable an appropriate definition of query semantics for imprecise data. Query: COUNT.

Queries on imprecise data Consider the query region in the figure; it overlaps the imprecise facts p4 and p5. Three (naive) options for including an imprecise fact in a query: Contains: count the fact only if its region is contained in the query region. Overlaps: count the fact if its region overlaps the query region. None: ignore all imprecise facts.
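A minimal sketch of the three options for a COUNT query, assuming each fact is represented by the set of cells it may occupy (a one-element set for a precise fact); all names are hypothetical:

    # COUNT under the three naive options for imprecise facts.
    def count(query_cells, facts, option):
        total = 0
        for fact_cells in facts:                      # each fact: a set of possible cells
            if len(fact_cells) == 1 or option == "contains":
                total += fact_cells <= query_cells    # fully contained in the query
            elif option == "overlaps":
                total += bool(fact_cells & query_cells)
            # option == "none": imprecise facts are simply ignored
        return total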

Desideratum I: Consistency Consistency specifies the relationship between answers to related queries on a fixed data set. [Figure: the motivating-example grid with facts p1-p5.]

Notions of Consistency Generic idea: if the query region is partitioned and the aggregate is applied to each partition, then the aggregate q on the whole region must be consistent in some way with the aggregates qi on the partitions. General idea: alpha-consistency for a property alpha. Specific forms of consistency are discussed in detail in the paper: sum-consistency (for COUNT and SUM) and boundedness-consistency (for AVERAGE), made precise below.
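Concretely (paraphrasing the paper's definitions): if regions R1, ..., Rk partition R, sum-consistency requires q(R) = q(R1) + ... + q(Rk), while boundedness-consistency requires min_i q(Ri) <= q(R) <= max_i q(Ri).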

Contains option: Consistency Intuitively, consistency means that the answer to a query should agree with the aggregates over the individual partitions of the query. Using the Contains option can produce inconsistent results. For example, compare the SUM over the whole query region with the SUMs over its individual cells: an imprecise fact such as p5 is contained in the whole region but in none of the single cells, so it contributes to the collective answer but to none of the per-cell answers, and the individual results do not add up to the collective one.

Desideratum II: Faithfulness Faithfulness specifies the relationship between answers to a fixed query on related data sets: a notion of result quality relative to the quality of the data input to the query. For example, the answer computed for Q = ⟨F150, MA⟩ should be of higher quality if p3 were precisely known. [Figure: three data sets over the (F150, Sierra) × (MA, NY) grid (Data Set 1, Data Set 2, Data Set 3) that differ in how precisely the facts p1-p5 are known.]

Formal definitions of both consistency and faithfulness depend on the underlying aggregation operator. Can we define query semantics that satisfy these desiderata?

Query Semantics: Possible Worlds [Kripke63, ...]. [Figure: the imprecise facts are completed in every possible way, yielding four possible worlds w1-w4, each a fully precise placement of p1-p5 on the (F150, Sierra) × (MA, NY) grid.]

Possible Worlds Query Semantics Given all possible worlds together with their probabilities, queries are easily answered (using expected values). But the number of possible worlds is exponential in the number of imprecise facts!
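The following sketch makes the semantics concrete by brute force: it enumerates every world for a handful of imprecise facts and returns the expected COUNT. All names and probabilities are illustrative, and the exponential loop is exactly what allocation avoids.

    # Expected COUNT by explicit possible-worlds enumeration (illustrative only).
    from itertools import product

    def expected_count(query_cells, precise, imprecise):
        """precise: list of cells; imprecise: list of (cells, probs) alternatives."""
        base = sum(cell in query_cells for cell in precise)
        expectation = 0.0
        choices = [list(zip(cells, probs)) for cells, probs in imprecise]
        for world in product(*choices):              # exponential in len(imprecise)
            prob = 1.0
            hits = base
            for cell, p in world:
                prob *= p
                hits += cell in query_cells
            expectation += prob * hits
        return expectation

    # p5 may complete to any cell of Truck x East, here with equal probability:
    p5 = ([("MA", "F150"), ("NY", "F150"), ("MA", "Sierra"), ("NY", "Sierra")],
          [0.25, 0.25, 0.25, 0.25])
    print(expected_count({("MA", "F150"), ("NY", "F150")}, [("MA", "F150")], [p5]))
    # 1.5 = one precise fact plus p5's 0.5 probability of landing in the region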

Allocation Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data. The size increase is linear in the number of (completions of) imprecise facts. Queries operate over this extended version. Key contributions: an appropriate characterization of the large space of allocation policies, and the design of efficient allocation policies that take the correlations in the data into account.

Storing Allocations using the Extended Data Model. [Figure: the example grid; each imprecise fact (e.g., p5 spanning Truck × East) is replaced by one weighted row per cell of its region.]

Advantages of the EDM No extra infrastructure is required for representing imprecision. Efficient algorithms for aggregate queries: SUM and COUNT take linear time; AVERAGE needs a slightly more complicated algorithm running in O(m + n^3) time for m precise facts and n imprecise facts.
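A minimal sketch of the extended data model and the linear-time SUM/COUNT pass over it; the row layout and values are hypothetical:

    # Extended data model: one weighted row per (fact, cell); weights per fact sum to 1.
    edm = [
        # (fact id, cell, allocation weight, repair measure)
        ("p1", ("MA", "F150"), 1.0, 100),   # precise fact: a single row of weight 1
        ("p5", ("MA", "F150"), 0.5, 200),   # imprecise fact allocated over two cells
        ("p5", ("NY", "F150"), 0.5, 200),
    ]

    def e_sum(query_cells, edm):
        """Expected SUM: one linear pass over the extended rows."""
        return sum(w * m for _, cell, w, m in edm if cell in query_cells)

    def e_count(query_cells, edm):
        """Expected COUNT: total allocation weight inside the query region."""
        return sum(w for _, cell, w, _ in edm if cell in query_cells)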

Aggregating Uncertain Measures Opinion pooling: produce a consensus opinion from a set of opinions Θ. The opinions in Θ, as well as the consensus opinion, are represented as pdfs over a discrete domain O. The linear operator LinOp(Θ) produces a consensus pdf P that is a weighted linear combination of the pdfs in Θ: P(o) = Σ_i w_i P_i(o), with non-negative weights w_i summing to 1.
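A minimal sketch of LinOp over dict-encoded pdfs (the encoding is an assumption, not from the paper):

    # LinOp: weighted linear combination of discrete pdfs over the same domain.
    def linop(pdfs, weights):
        domain = pdfs[0].keys()
        return {o: sum(w * p[o] for p, w in zip(pdfs, weights)) for o in domain}

    opinions = [{"Yes": 0.8, "No": 0.2}, {"Yes": 0.4, "No": 0.6}]
    print(linop(opinions, [0.5, 0.5]))   # {'Yes': 0.6, 'No': 0.4}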

Allocation Policies For every region r in the database, we want to assign an allocation p_{c,r} to each cell c in reg(r), such that Σ_{c ∈ reg(r)} p_{c,r} = 1. Three ways of doing so: 1. Uniform: assign each cell c in a region r equal probability: p_{c,r} = 1 / |reg(r)|.

Allocation Policies (contd.) However, we can do better: some cells are naturally more likely than others (e.g., Mumbai will clearly have more repairs than Bhopal). We can capture this automatically by giving more probability to cells containing more precise facts. 2. Count-based: p_{c,r} = N_c / Σ_{c' ∈ reg(r)} N_{c'}, where N_c is the number of precise facts in cell c.
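A sketch of the first two policies; cell_counts is a hypothetical map from cell to its number of precise facts, and the zero-count fallback to uniform is an assumption, not from the paper:

    # Uniform and count-based allocation over the cells of a region.
    def uniform_alloc(cells):
        return {c: 1.0 / len(cells) for c in cells}

    def count_alloc(cells, cell_counts):
        total = sum(cell_counts.get(c, 0) for c in cells)
        if total == 0:                    # no precise facts: fall back to uniform
            return uniform_alloc(cells)
        return {c: cell_counts.get(c, 0) / total for c in cells}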

Allocation Policies (contd.) Again, we can arguably get a better result by looking not just at the count but at the actual values of the measure in question. 3. Measure-based: see the next slide.

Measure-Based Allocation Assumes the following model: the given database D with imprecise facts was generated by randomly injecting imprecision into a precise database D'. D' assigns value o to a cell c according to some unknown pdf P(o, c). If we could determine this pdf, the allocation would simply be p_{c,r} = P(c) / Σ_{c' ∈ reg(r)} P(c').
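Once an estimate of P over cells is in hand (e.g., from the EM procedure sketched below), the allocation is just a normalization; a two-line sketch with hypothetical names:

    # Measure-based allocation: normalize the estimated pdf over the fact's region.
    def measure_alloc(cells, P):
        z = sum(P[c] for c in cells)
        return {c: P[c] / z for c in cells}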

Classifying Allocation Policies Policies are classified by whether they ignore or use the dimension and measure correlations in the data: Uniform ignores both; Count uses the dimension correlation but ignores the measure correlation; EM (measure-based) also uses the measure correlation.

Results on Query Semantics Evaluating queries over the extended version of the data yields the expected value of the aggregation operator over all possible worlds, intuitively the correct value to compute. Efficient query evaluation algorithms exist for SUM and COUNT, and consistency and faithfulness for SUM and COUNT are satisfied under appropriate conditions. There is a dynamic programming algorithm for AVERAGE; unfortunately, consistency does not hold for AVERAGE.

Alternative Semantics for AVERAGE Approximate average: E[SUM] / E[COUNT] instead of E[SUM / COUNT]. Simpler and more efficient, satisfies consistency, and extends to aggregation operators for uncertain measures.
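In the extended data model this semantics amounts to two linear passes; a sketch reusing the hypothetical e_sum and e_count from above:

    # Approximate AVERAGE: ratio of the expected SUM and the expected COUNT.
    def approx_average(query_cells, edm):
        n = e_count(query_cells, edm)
        return e_sum(query_cells, edm) / n if n else None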

Maximum Likelihood Principle A reasonable estimate for this function P is the one that maximizes the probability of generating the given imprecise data set D. Example: suppose the pdf depends only on the cells and is independent of the measure values, so it is a mapping Δ : C → ℝ, where C is the set of cells. This pdf can be found by maximizing the likelihood function L(Δ) = Π_{r ∈ D} Σ_{c ∈ reg(r)} Δ(c).

EM Algorithm The Expectation-Maximization algorithm provides a standard way of maximizing the likelihood when some variables in the observation set are unknown. Expectation step (complete the data): calculate the expected values of the unknown variables, given the current estimate of the parameters. Maximization step (re-estimate the generator): calculate the distribution that maximizes the probability of the currently estimated data set.

EM Algorithm: Example
Data: [4, 10, ?, ?]
Initialization step: initial mean value 0, new data [4, 10, 0, 0]
Step 1: new mean 3.5, new data [4, 10, 3.5, 3.5]
Step 2: new mean 5.25, new data [4, 10, 5.25, 5.25]
Step 3: new mean 6.125, new data [4, 10, 6.125, 6.125]
Step 4: new mean 6.5625, new data [4, 10, 6.5625, 6.5625]
Step 5: new mean 6.78125, new data [4, 10, 6.78125, 6.78125]
Result: the mean converges to 7, the fixed point of m = (4 + 10 + 2m) / 4.
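A few lines of Python reproduce the iteration, assuming (as the slide does) that the generator is just the sample mean and the E-step imputes the missing values with it:

    # EM-style mean imputation for the toy example [4, 10, ?, ?].
    data = [4, 10, None, None]
    mean = 0.0                                                 # initialization
    for step in range(1, 6):
        filled = [x if x is not None else mean for x in data]  # E-step: impute
        mean = sum(filled) / len(filled)                       # M-step: re-estimate
        print("Step", step, "new mean:", mean)
    # Prints 3.5, 5.25, 6.125, 6.5625, 6.78125; the fixed point is 7.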

EM Algorithm: Application [chart not transcribed]

Experiments: Allocation run time [chart not transcribed]

Experiments: Query run time [chart not transcribed]

Experiments: Accuracy [chart not transcribed]

Uncertainty A measure value is modeled as a probability distribution function over some base domain; e.g., the measure Brake is a pdf over the values {Yes, No}. Sources of uncertainty: measures extracted from text using classifiers. We adapt well-known concepts from statistics to derive appropriate aggregation operators. Our framework and solutions for dealing with imprecision also extend to uncertain measures.

Summary Consistency and faithfulness desiderata for designing query semantics for imprecise data. Allocation is the key to our framework. Efficient algorithms for aggregation operators, with appropriate guarantees of consistency and faithfulness. Iterative algorithms for allocation policies.

Correlation-based Allocation Involves defining an objective function to capture some underlying correlation structure (a more stringent requirement on the allocations); solving the resulting optimization problem yields the allocations. An EM-based iterative allocation policy; interesting highlight: allocations are rescaled iteratively by computing appropriate aggregations.