Presentation is loading. Please wait.

Presentation is loading. Please wait.

11th SSDBM, Cleveland, Ohio, July 28-30, 1999 Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen 1,2, C. S. Jensen.

Similar presentations


Presentation on theme: "11th SSDBM, Cleveland, Ohio, July 28-30, 1999 Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen 1,2, C. S. Jensen."— Presentation transcript:

1 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen 1,2, C. S. Jensen 2, and C. E. Dyreson 2 1 Center for Health Information Services, Kommunedata, www.kmd.dk 2 Nykredit Center for Database Research, Department of Computer Science, Aalborg University, www.cs.auc.dk/NDB

2 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 2 Talk Overview Motivation Data model and query language context Handling imprecision Alternative queries Imprecision in grouping Imprecision in computations Presenting imprecise results Using pre-aggregated data Conclusion

3 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 3 Motivation Online Analytical Processing (OLAP) tools are increasingly used in many different areas: Business applications Medical applications Other scientific applications Data “imperfection” is a problem for medical and other applications. Some data is missing. Some data has varying degrees of imprecision. Current OLAP tools assume that data imperfections are handled during the “data cleansing” process. Not realistic for most cases Introduces mapping errors Hides the “true quality” of the data

4 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 4 Previous Work Imprecision versus uncertainty Imprecision is a property of the content of an attribute. Uncertainty concerns the degree of truth of a statement. We handle only imprecision. Our focus is on imprecision in aggregate queries. Most work on imprecision deals with relational databases Fuzzy sets - specifies a degree of set membership for a value Partial values - one of a set of values is the true value Multiple imputation - substitute multiple values for a missing value High computational complexity Only “incomplete datacubes” have treated imprecision in multidimensional databases. Imprecision fixed at schema level Imprecision in computations not handled

5 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 5 A Diabetes Case Study E/R schema of case study Patients have a diagnosis. Diagnoses may be missing. Diagnoses may be specified at the Low-level Diagnosis or Diagnosis Group level. HbA1c% (long-term blood sugar level) measured with several methods of varying precision.

6 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 6 Data Model - Schema Fact type: Patient Dimension types: Diagnosis and HbA1c% There are no “measures”, all data are dimensions. Category types: Low-level Diagnosis, Diagnosis Group, Precise, Imprecise Top category types: corresponds to ALL of the dimension. Bottom category types: the lowest level in each dimension The category types of a dimension type form a lattice. Category types ~ granularities Patient Diagnosis Group LL Diagnosis Diagnosis Dimension Type HbA1c% Dimension Type T Diagnosis T HbA1c% Precise Imprecise

7 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 7 Data Model - Instances Categories: instances of category types, consist of dimension values. Top categories contain only one “T” value. Dimensions = categories + partial order on category values Facts: instances of fact type with separate identity. Fact-dimension relations: links facts to dimensions, may map to values of any granularity. Multidimensional object (MO) = schema + dimension + facts + fact-dimension relations Diagnosis Dimension HbA1c% Dimension T Diabetes ID Diabetes NID Diabetes T JimJohn Jane 5 6 7 5.4 5.5.. 7.0

8 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 8 Algebraic Query Language Close to relational algebra with aggregation functions Selection, projection, rename, union, difference, identity- based join operators Aggregation operator: takes “grouping categories”, aggregation function and result dimension as arguments. Groups together facts characterized by the same dimension values, applies aggregation function to groups, and places result in result dimension. Example: COUNT by Diagnosis Family and T HbA1c% Diagnosis Dimension Result Dimension T T Diabetes {Jim,John,Jane} 0-2 >2 0 1 2 3..

9 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 9 Overview Motivation Data model and query language context Handling imprecision Alternative queries Imprecision in grouping Imprecision in computations Presenting imprecise results Using pre-aggregated data Conclusion

10 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 10 Handling Imprecision We use the granularity of the data to capture imprecision. The dimensions specify an imprecision hierarchy. Data is mapped to the appropriate granularity. Pseudo-code: Procedure EvalImprecise(Q,M) if PreciseEnough(Q,M) then Eval(Q,M) else Q’=Alternative(Q,M) if Q’ is ok then Eval(Q’,M) else Handle Imprecision in Grouping Handle Imprecision in Computation Return Imprecise Result end if end if Most Imprecise=T=Unknown More Imprecise More precise Most precise=bottom category......

11 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 11 Alternative Queries We use a “Precision MO” (PMO) to capture the data granularities. The precision MO has a granularity dimension for every MO dimension D. A granularity dimension has two categories: T and Granularity D. Granularity D contains a value for every category type in D. The set of facts stays the same. Facts are mapped to the appropriate Granularity D value. Gran Diagnosis Dimension Gran HbA1c% Dimension T T DG LLD TDTD P I T H JimJohn Jane

12 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 12 Alternative Queries Queries are rewritten into “testing queries” on the PMO that counts the number of facts mapped to different granularity combinations (details in paper). The result of the testing queries can be used to suggest alternative queries. Example: COUNT by Low-level Diagnosis (and T HbA1c%) ) on MO Testing query: COUNT by Granularity Diagnosis on PMO Result shows that data is not precise enough to group on Low-level Diagnosis Alternative query: COUNT by Diagnosis Group Gran Diagnosis Dimension Result Dimension T LLD DG TDTD 0 1 2 T {Jim}{John,Jane}

13 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 13 Overview Motivation Data model and query language context Handling imprecision Alternative queries Imprecision in grouping Imprecision in computations Presenting imprecise results Using pre-aggregated data Conclusion

14 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 14 Imprecision in Grouping If facts are mapped to a dimension value of coarser granularity than the grouping category, with which dimension value should they be grouped ? - we do not know ! So, we return several answers, based on different groupings of facts: Conservative grouping: only include in a group what is known to belong to it - discard imprecise data. Liberal grouping: include in a group everything that might belong to it - facts may be in several groups. Weighted grouping: include everything that might belong, but give more weight to likely members.

15 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 15 Conservative Grouping Corresponds to the standard aggregation operator. Imprecise data are simply discarded. Example: COUNT by Low-level Diagnosis (and T HbA1c%) ) on MO Jim does not show in any group. The count for both groups is 1. The result is “too conservative” as not all data is accounted for in the result. Diagnosis Dimension Result Dimension T T Diabetes ID Diabetes NID Diabetes 0 1 2 {John} {Jane}

16 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 16 Liberal Grouping We modify the aggregation operator from the query language to compute the liberal grouping (formal semantics in the paper). The computed groups may now overlap, i.e., the same fact may be in several groups. Example: COUNT by Low-level Diagnosis (and T HbA1c%) ) on MO Jim ends up up both groups. The count for both groups is 2. The result is “too liberal” as the same data may be counted several times in the result. Diagnosis Dimension Result Dimension T 0 1 2 Diabetes T ID Diabetes NID Diabetes {Jim,John} {Jim,Jane}

17 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 17 Weighted Grouping Compromise between the conservative and liberal groupings: Weights assigned to the partial order on dimension values and to fact membership in groups (groups become fuzzy sets). Aggregation operator modified to compute the weighted grouping (see paper). Example: COUNT by Low-level Diagnosis (and T HbA1c%) ) on MO 80% of Diabetes patient have ID Diabetes, 20% NID. Jim ends up up both groups, but weighted differently. Result is a weighted COUNT (details on next slide). Diagnosis Dimension T Diabetes ID Diabetes NID Diabetes 1.0.8.2 {Jim.8,John 1.0 } {Jim.2,Jane 1.0 } Result Dimension T 0 1.2 1.8

18 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 18 Imprecision in Computation We also need to handle imprecision in the aggregate computation itself, e.g., the varying precision for HbA1c%. In computations, we impute the expected value for values of non-bottom granularity (generalized imputation). This allows normal computation of the result. A precision computation is performed along with the aggregate computation: A granularity computation measure (GCM) is used to capture the imprecision of a dimension value during computation. A measure combination function (MCF) is used to combine GCMs. The MCF must be distributive. A final granularity measure (FGM) represent the “true” imprecision of a result value. A final granularity function (FGF) maps from a GCM to a FGM.

19 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 19 Imprecision in Computation Example: WAVG(HbA1c%) by Low-Level Diagnosis We use FGM=WAVG(Level) as a measure of precision (Precise values have level 0, etc.) The GCM for a value e with weight w is: (w * Level(e),w) The MCF is: h((n 1, n 2 ),(n 3,n 4 )) = (n 1 +n 3,n 2 +n 4 ) The FGF is: f(n 1,n 2 ) = n 1 /n 2 6.0 & 7.0 imputed for T & 7 in WAVG(HbA1c%) computation Jim, John, and Jane have levels 2, 0, and 1, respectively Results of query: {(IDD,5.7,.9), (NIDD,5.6,1.2)} Diagnosis Dimension HbA1c% Dimension T T Diabetes ID Diabetes NID Diabetes JimJohn Jane 5 6 7 5.4 5.5.. 7.0 6.0 7.0.8.2 1.0 1 2 0

20 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 20 Presenting Imprecise Results Several ways to present the imprecise results to the user: Show result values along with the corresponding final granularity measure values. Very precise estimate of result precision, but hard to grasp. Resulting (value,FGM) = {(IDD,5.7,.9),(NIDD,5.6,1.2)} Map result values into different granularities using a value coarsening function (VCF). More intuitive result, but less precise estimate of precision. Example: Weighted grouping with VCF: r(x) = v such that x  v and Level(v) = Ceiling(x). Diagnosis Dimension Result Dimension T Diabetes ID Diabetes NID Diabetes {Jim.8,John 1.0 } {Jim.2,Jane 1.0 } T 5.4 5.5.. 7.0 5 6 7

21 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 21 Using Pre-Aggregated Data The approach can utilize pre-aggregated data effectively. Aggregate results for the precision MO can usually be fully materialized due to the relatively small multidimensional space (~ 1.000.000 cells). The aggregate computation with expected values can use standard pre-aggregation techniques, e.g., partial pre- aggregation, for good response-time versus storage/update-time tradeoffs. The distributive MCF allows for partial pre-aggregation of precision results.

22 11th SSDBM, Cleveland, Ohio, July 28-30, 1999 22 Conclusion Current OLAP tools and models do not support the imprecision found in real-world data. We have shown an approach to handling imprecision in OLAP databases based on the common multidimensional concept of granularities. The approach can suggest alternative queries when data is not precise enough and handles imprecision both in the grouping of data and in the aggregate computation. The approach has a low computational overhead and is able to exploit pre-aggregated results effectively. The approach can be implemented using existing technology such as SQL and OLAP tools. Future work: presentation of results, MIN/MAX functions, etc.


Download ppt "11th SSDBM, Cleveland, Ohio, July 28-30, 1999 Supporting Imprecision in Multidimensional Databases Using Granularities T. B. Pedersen 1,2, C. S. Jensen."

Similar presentations


Ads by Google