Published by Arturo Bocook. Modified over 2 years ago.

1
OLAP Over Uncertain and Imprecise Data
T.S. Jayram (IBM Almaden)
with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)

2
Dimensions in OLAP
Location: All > {East, West}; East > {MA, NY}; West > {TX, CA}
Automobile: All > {Sedan, Truck}; Sedan > {Civic, Camry}; Truck > {F150, Sierra}

3
Measures, Facts, and Queries
[Figure: grid of cells over Location (MA, NY, TX, CA; West, East; ALL) and Automobile (Civic, Sierra, F150, Camry; Truck, Sedan; ALL), with facts p1 to p8 placed in cells.]
Fact (a cell): Auto = F150, Loc = NY, Repair = $200
Query: Auto = Truck, Loc = East, SUM(Repair) = ?

4
Extend the OLAP model to handle data ambiguity
  Imprecision
  Uncertainty

5
Imprecision
[Figure: the Location by Automobile grid with facts p1 to p8 and additional facts p9, p10, p11.]
Example imprecise fact: Auto = F150, Loc = East, Repair = $200

6
Representing Imprecision using Dimension Hierarchies
Dimension hierarchies lead to a natural space of partially specified objects
Sources of imprecision: incomplete data, multiple sources of data

7
Motivating Example
[Figure: Truck (Sierra, F150) by East (MA, NY) grid with facts p1 to p5.]
Query: COUNT
We propose desiderata that enable appropriate definition of query semantics for imprecise data

8
Desideratum I: Consistency
Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: Truck (Sierra, F150) by East (MA, NY) grid with facts p1 to p5.]

9
Desideratum II: Faithfulness
Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: three panels, Data Set 1, Data Set 2 and Data Set 3, each placing facts p1 to p5 on the Sierra/F150 by MA/NY grid.]

10
Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator
Can we define query semantics that satisfy these desiderata?

11
Query Semantics
Possible Worlds [Kripke63,…]
[Figure: facts p1 to p5 on the Sierra/F150 by MA/NY grid, followed by four possible worlds w1, w2, w3, w4 over the same grid.]

12
Possible Worlds Query Semantics
Given all possible worlds together with their probabilities, queries are easily answered (using expected values)
But the number of possible worlds is exponential!
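
The expected-value semantics can be sketched by brute force. This is a minimal illustration, not the authors' algorithm: it assumes each fact picks one completion independently with a given probability, and the helper names `possible_worlds` and `expected_count` are hypothetical. The fact data mirrors the five-fact running example.

```python
from itertools import product

# Illustrative facts: each maps to {completion cell: probability}.
# Precise facts have a single completion with probability 1.0.
facts = {
    "p1": {("F150", "NY"): 1.0},
    "p2": {("Sierra", "NY"): 1.0},
    "p3": {("F150", "MA"): 0.6, ("F150", "NY"): 0.4},
    "p4": {("Sierra", "MA"): 1.0},
    "p5": {("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5},
}

def possible_worlds(facts):
    """Yield (world, probability); a world assigns every fact one cell.

    Assumption: facts choose their completions independently.
    """
    names = list(facts)
    for choice in product(*(facts[n].items() for n in names)):
        prob = 1.0
        world = {}
        for name, (cell, p) in zip(names, choice):
            world[name] = cell
            prob *= p
        yield world, prob

def expected_count(facts, query_cell):
    """E[COUNT] of facts in query_cell, by enumerating every world."""
    return sum(
        prob * sum(1 for cell in world.values() if cell == query_cell)
        for world, prob in possible_worlds(facts)
    )

worlds = list(possible_worlds(facts))
print(len(worlds))
print(expected_count(facts, ("F150", "MA")))
```

The number of worlds is the product of the number of completions per fact, so every additional imprecise fact multiplies the enumeration cost. That is the blow-up the slide warns about.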

13
Allocation
Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
  Size increase is linear in number of (completions of) imprecise facts
  Queries operate over this extended version
Key contributions:
  Appropriate characterization of the large space of allocation policies
  Designing efficient allocation policies that take into account the correlations in the data
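
As a rough sketch of what an allocation produces, the snippet below expands each fact into one weighted row per completion. It assumes the simplest policy, equal weight for every completion. The `LEAVES` mapping and the function names are hypothetical, chosen only for this illustration.

```python
# Hypothetical two-level hierarchies; leaves are the fully specified values.
LEAVES = {
    "Truck": ["F150", "Sierra"],
    "East": ["MA", "NY"],
}

def completions(auto, loc):
    """All leaf-level (auto, loc) cells consistent with a fact's region."""
    autos = LEAVES.get(auto, [auto])
    locs = LEAVES.get(loc, [loc])
    return [(a, l) for a in autos for l in locs]

def allocate_uniform(facts):
    """Build the extended data: one weighted row per (fact, completion).

    Assumption: uniform allocation, so each completion gets equal weight.
    """
    rows = []
    for fact_id, auto, loc, repair in facts:
        cells = completions(auto, loc)
        for a, l in cells:
            rows.append((fact_id, a, l, repair, 1.0 / len(cells)))
    return rows

facts = [
    (1, "F150", "NY", 100),    # precise
    (3, "F150", "East", 150),  # imprecise in Location
    (5, "Truck", "MA", 100),   # imprecise in Automobile
]
extended = allocate_uniform(facts)
for row in extended:
    print(row)
```

Each imprecise fact contributes one row per completion, so the extended data grows linearly in the number of completions rather than exponentially in the number of facts.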

14
Storing Allocations using Extended Data Model
[Figure: facts p1 to p5 on the Truck (Sierra, F150) by East (MA, NY) grid.]

ID  FactID  Auto    Loc  Repair  Weight
1   1       F150    NY   100     1.0
2   2       Sierra  NY   500     1.0
3   3       F150    MA   150     0.6
4   3       F150    NY   150     0.4
5   4       Sierra  MA   200     1.0
6   5       F150    MA   100     0.5
7   5       Sierra  MA   100     0.5
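
A query over the extended data can be sketched as a weighted aggregate over the rows of this table. The function names below are hypothetical, and the weighting rule (multiply each row's contribution by its Weight) is the natural reading of the extended model rather than code from the talk.

```python
# Rows of the extended data: (ID, FactID, Auto, Loc, Repair, Weight).
EXTENDED = [
    (1, 1, "F150", "NY", 100, 1.0),
    (2, 2, "Sierra", "NY", 500, 1.0),
    (3, 3, "F150", "MA", 150, 0.6),
    (4, 3, "F150", "NY", 150, 0.4),
    (5, 4, "Sierra", "MA", 200, 1.0),
    (6, 5, "F150", "MA", 100, 0.5),
    (7, 5, "Sierra", "MA", 100, 0.5),
]

TRUCK = {"F150", "Sierra"}
EAST = {"MA", "NY"}

def weighted_sum(rows, autos, locs):
    """SUM(Repair) over the query region, each row scaled by its weight."""
    return sum(r[4] * r[5] for r in rows if r[2] in autos and r[3] in locs)

def weighted_count(rows, autos, locs):
    """COUNT over the query region: the total weight of matching rows."""
    return sum(r[5] for r in rows if r[2] in autos and r[3] in locs)

# Leaf-level query: Auto = F150, Loc = MA.
print(weighted_sum(EXTENDED, {"F150"}, {"MA"}))
print(weighted_count(EXTENDED, {"F150"}, {"MA"}))

# Roll-up query: Auto = Truck, Loc = East.
print(weighted_sum(EXTENDED, TRUCK, EAST))
print(weighted_count(EXTENDED, TRUCK, EAST))
```

For the roll-up query every completion of every fact lies inside the region, so the weights of each fact add back up to 1 and the result matches a plain SUM and COUNT over the five original facts.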

15
Classifying Allocation Policies
[Figure: grid classifying allocation policies by whether Measure Correlation and Dimension Correlation are Ignored or Used. Policies shown: Uniform, EM, Count.]

16
Results on Query Semantics
Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds
  intuitively, the correct value to compute
Efficient query evaluation algorithms for SUM, COUNT
  consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions
Dynamic programming algorithm for AVERAGE
  Unfortunately, consistency does not hold for AVERAGE

17
Alternative Semantics for AVERAGE
APPROXIMATE AVERAGE
  E[SUM] / E[COUNT] instead of E[SUM/COUNT]
  simpler and more efficient
  satisfies consistency
  extends to aggregation operators for uncertain measures
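
The difference between the two quantities can be seen on a tiny example. The sketch below is illustrative only: the two facts, the query cell and the function names are assumptions, and the exact version simply enumerates worlds instead of using the dynamic programming algorithm mentioned in the talk.

```python
from itertools import product

# Illustrative facts: (Repair, {completion cell: probability}).
facts = [
    (200, {("Sierra", "MA"): 1.0}),
    (100, {("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5}),
]
QUERY = ("Sierra", "MA")

def exact_average(facts, query):
    """E[SUM/COUNT] by enumerating worlds.

    Assumption: COUNT > 0 in every world (true for this data), and facts
    choose completions independently.
    """
    total = 0.0
    for choice in product(*(comps.items() for _, comps in facts)):
        prob, s, n = 1.0, 0.0, 0
        for (repair, _), (cell, p) in zip(facts, choice):
            prob *= p
            if cell == query:
                s += repair
                n += 1
        total += prob * s / n
    return total

def approximate_average(facts, query):
    """E[SUM] / E[COUNT], computed directly from the allocation weights."""
    e_sum = sum(repair * comps.get(query, 0.0) for repair, comps in facts)
    e_count = sum(comps.get(query, 0.0) for _, comps in facts)
    return e_sum / e_count

print(exact_average(facts, QUERY))
print(approximate_average(facts, QUERY))
```

The approximate version needs only two weighted aggregates over the extended data, which is why it is cheaper, while the two results differ whenever COUNT varies across worlds.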

18
Uncertainty
Measure value is modeled as a probability distribution function over some base domain
  e.g., measure Brake is a pdf over values {Yes, No}
  sources of uncertainty: measures extracted from text using classifiers
Adapt well-known concepts from statistics to derive appropriate aggregation operators
Our framework and solutions for dealing with imprecision also extend to uncertain measures
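
The slide does not name the aggregation operator. One well-known option from the statistics literature is the linear opinion pool, which takes a weighted average of the input distributions. The sketch below assumes that choice; the function name, the example pdfs and the weights are all hypothetical.

```python
def linear_opinion_pool(pdfs, weights=None):
    """Aggregate pdfs over a shared discrete domain by weighted averaging.

    Assumption: this pooling rule stands in for the unnamed operator;
    weights could be, for example, allocation weights of the facts.
    """
    if weights is None:
        weights = [1.0] * len(pdfs)
    total = sum(weights)
    pooled = {}
    for pdf, w in zip(pdfs, weights):
        for value, p in pdf.items():
            pooled[value] = pooled.get(value, 0.0) + w * p / total
    return pooled

# Hypothetical Brake measures, e.g. classifier outputs for three facts.
brake = [
    {"Yes": 0.9, "No": 0.1},
    {"Yes": 0.4, "No": 0.6},
    {"Yes": 0.5, "No": 0.5},
]
pooled = linear_opinion_pool(brake, weights=[1.0, 0.5, 1.0])
print(pooled)
```

The pooled result is again a pdf over {Yes, No}, so it can be stored and aggregated further like any other uncertain measure.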

19
Summary
Consistency and faithfulness desiderata for designing query semantics for imprecise data
Allocation is the key to our framework
Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness
Iterative algorithms for allocation policies

20
Correlation-based Allocation
Involves defining an objective function to capture some underlying correlation structure
  a more stringent requirement on the allocations
  solving the resulting optimization problem yields the allocations
EM-based iterative allocation policy
  interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
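
One plausible reading of "re-scaled iteratively by computing appropriate aggregations" is a count-based loop: aggregate the expected number of facts per cell, then re-scale each imprecise fact's weights in proportion to those counts. The update rule, the function name and the data below are assumptions made for illustration; the slides do not give the actual objective function or update.

```python
def iterative_count_allocation(precise_counts, imprecise, iterations=20):
    """Sketch of an EM-style, count-based allocation loop.

    precise_counts: {cell: number of precise facts in that cell}
    imprecise:      {fact_id: list of candidate cells}
    Returns {fact_id: {cell: weight}} with weights summing to 1 per fact.
    """
    # Start from a uniform allocation.
    alloc = {
        f: {c: 1.0 / len(cells) for c in cells}
        for f, cells in imprecise.items()
    }
    for _ in range(iterations):
        # Aggregation step: expected count per cell under current weights.
        mass = dict(precise_counts)
        for weights in alloc.values():
            for cell, w in weights.items():
                mass[cell] = mass.get(cell, 0.0) + w
        # Re-scaling step: weights proportional to the aggregated counts.
        alloc = {
            f: {c: mass[c] / sum(mass[d] for d in cells) for c in cells}
            for f, cells in imprecise.items()
        }
    return alloc

precise_counts = {("F150", "NY"): 1, ("Sierra", "NY"): 1, ("Sierra", "MA"): 1}
imprecise = {
    3: [("F150", "MA"), ("F150", "NY")],
    5: [("F150", "MA"), ("Sierra", "MA")],
}
alloc = iterative_count_allocation(precise_counts, imprecise)
for fact_id, weights in alloc.items():
    print(fact_id, weights)
```

In this sketch, cells that already hold precise facts attract more of each imprecise fact's weight, which is the sense in which the allocation follows the correlations in the data.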
