Published by Jonas Timm. Modified over 2 years ago.

1
Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Atanu Roy, Chandrima Sarkar, Rafal A. Angryk
Presented by: Rafal A. Angryk
Date: 2010-12-14

2
Outline of the Presentation
Idea
Imprecision
Motivation
Limitations of Previous Work
Definitions
Approach
Experimental Setup & Results
Conclusion and Future Work

3
Idea of the Project
This paper provides a framework for answering queries over imprecise data found in common databases. We propose to solve this by classifying the data into taxonomic hierarchies and then capturing it in a weighted hierarchical hypergraph.

4
Imprecision in Databases: An Example
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent

5
Imprecise database (R3 holds the imprecise value "Fall"):
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   Fall               Above-sec-node
R4   July               Absent
Taxonomy of Germination Time:
Germination Time
  Summer: June, July
  Fall: August, September
Possible world with R3 = August:
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   August             Above-sec-node
R4   July               Absent
Possible world with R3 = September:
ID   Germination Time   Stem Cankers
R1   August             Above-sec-node
R2   September          Absent
R3   September          Above-sec-node
R4   July               Absent
Constraint: All soybean seeds with the same kind of stem canker should germinate in the same month of the season.
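The expansion of an imprecise record into possible worlds can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `TAXONOMY`, `leaves`, and `possible_worlds` are hypothetical, and the constraint from the slide is deliberately not applied here.

```python
from itertools import product

# Taxonomy from the slide: each internal node maps to its children.
TAXONOMY = {
    "Germination Time": ["Summer", "Fall"],
    "Summer": ["June", "July"],
    "Fall": ["August", "September"],
}

def leaves(value):
    """Return the precise (leaf) values covered by a possibly imprecise value."""
    children = TAXONOMY.get(value)
    if children is None:  # already a leaf, i.e. a precise value
        return [value]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

# Records from the example table: (ID, germination time, stem canker).
RECORDS = [
    ("R1", "August", "Above-sec-node"),
    ("R2", "September", "Absent"),
    ("R3", "Fall", "Above-sec-node"),  # imprecise: "Fall" is not a leaf
    ("R4", "July", "Absent"),
]

def possible_worlds(records):
    """Enumerate every database obtained by replacing imprecise values with leaves."""
    per_record = [
        [(rid, month, canker) for month in leaves(time)]
        for rid, time, canker in records
    ]
    return [list(world) for world in product(*per_record)]

worlds = possible_worlds(RECORDS)
for world in worlds:
    print(world)
```

Only R3 is imprecise, so the sketch yields one world per leaf under "Fall".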

6
Motivation
Several recent papers have focused on the retrieval of imprecise data, where every fact can be a region, instead of a point, in a multi-dimensional space. The most prominent one is [BDRV07]. Its authors solve the problem by constructing marginal databases (MDBs) from extended databases (EDBs) with the help of a constraint hypergraph.

7
Limitations of Previous Work
Creating marginal databases using a weighted hierarchical hypergraph employs a brute-force method for retrieving connected facts (tuples). This increases the overall time complexity and processing time of the queries.
[BDRV07] follows a data-specific technique, whereas we propose to rely on domain-specific knowledge.

8
Definitions
Background knowledge: knowledge required to generate taxonomies.
Expert knowledge: domain-specific human expertise.
Data-derived knowledge: knowledge derived from a historical precise database and used to generate mutually exclusive probabilities.
Possible worlds: all the possible combinations that an imprecise record can assume.
Valid worlds: all the possible worlds that satisfy a given set of constraints.

10
Assignment of Probabilities

11
EDB Creation
The probability of a possible world is the product of the unconditional occurrence probabilities of all imprecise attributes.
The sum of the probabilities of all possible worlds of an imprecise record is 1.
Probability assignment rule creates a set of tuples using
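The product rule for EDB creation can be sketched in Python as follows. The function name `expand_record` is hypothetical, and the per-attribute probabilities (0.8/0.2 and 0.6/0.4) are assumptions chosen so that their products match the R3 rows of the EDB table shown later in the presentation; the slides do not state them directly.

```python
from itertools import product

def expand_record(record_id, attribute_probs):
    """Return (record_id, values, probability) tuples, one per possible world.

    attribute_probs maps each attribute name to {precise value: probability};
    a precise attribute is simply a single value with probability 1.0.
    """
    names = list(attribute_probs)
    choices = [list(attribute_probs[name].items()) for name in names]
    tuples = []
    for combo in product(*choices):
        values = {name: value for name, (value, _) in zip(names, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p  # product of the per-attribute probabilities
        tuples.append((record_id, values, prob))
    return tuples

# Assumed per-attribute probabilities for a record with two imprecise attributes.
r3 = expand_record("R3", {
    "Germination Time": {"August": 0.8, "September": 0.2},
    "Stem Canker": {"Above-sec-node": 0.6, "Absent": 0.4},
})
for _, values, prob in r3:
    print(values, round(prob, 2))

total = sum(prob for _, _, prob in r3)
print("sum over possible worlds:", round(total, 10))
```

Because each attribute's probabilities sum to 1, the products over all combinations also sum to 1, which is the second property stated on the slide.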

12
Hyperedge Creation

13
MDB Creation
A weighted hierarchical hypergraph is defined as H(L, E), where L represents the nodes and E is the set of hyperedges between different taxonomies.
Each hyperedge signifies a distinct combination of attribute values.
The weight of a possible world assigned to a hyperedge [AC10] needs to preserve a few properties. All t-norms [AC10] (e.g. minimum, product) fulfill these requirements.
We choose the product for the purposes of our preliminary investigation.
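The two t-norms named on the slide can be compared with a short Python sketch. The axioms checked below (commutativity, associativity, monotonicity, and 1 as the neutral element) are the standard definition of a t-norm; whether they are exactly the properties the slide has in mind is an assumption. All function names are hypothetical.

```python
from functools import reduce
from itertools import product as cartesian

def t_min(a, b):
    """Minimum t-norm."""
    return min(a, b)

def t_product(a, b):
    """Product t-norm (the one chosen in the presentation)."""
    return a * b

def combine(weights, t_norm):
    """Fold a t-norm over a list of weights; 1.0 is the neutral element."""
    return reduce(t_norm, weights, 1.0)

def satisfies_t_norm_axioms(t_norm, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Spot-check the four t-norm axioms on a small grid of values in [0, 1]."""
    for a, b, c in cartesian(grid, repeat=3):
        commutative = t_norm(a, b) == t_norm(b, a)
        associative = abs(t_norm(t_norm(a, b), c) - t_norm(a, t_norm(b, c))) < 1e-12
        monotone = b > c or t_norm(a, b) <= t_norm(a, c)
        identity = t_norm(a, 1.0) == a
        if not (commutative and associative and monotone and identity):
            return False
    return True

print(combine([0.8, 0.6], t_product))  # weight under the product t-norm
print(combine([0.8, 0.6], t_min))      # weight under the minimum t-norm
print(satisfies_t_norm_axioms(t_product), satisfies_t_norm_axioms(t_min))
```

The grid check is only a spot test on a few values, not a proof; it is meant to make the axioms concrete.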

14
EDB and MDB
EDB:
ID   TempID   Germination Time   Stem Canker      Prob
R1   T1       August             Above-sec-node   0.80
R1   T2       September          Above-sec-node   0.20
R2   T3       August             Above-sec-node   0.60
R2   T4       August             Absent           0.40
R3   T5       August             Above-sec-node   0.48
R3   T6       August             Absent           0.32
R3   T7       September          Above-sec-node   0.12
R3   T8       September          Absent           0.08
R4   T9       September          Absent           1.00
MDB:
ID   TempID   Germination Time   Stem Canker      Prob
R1   T1       August             Above-sec-node   0.9057
R1   T2       September          Above-sec-node   0.0943
R2   T3       August             Above-sec-node   0.6429
R2   T4       August             Absent           0.3571
R3   T5       August             Above-sec-node   0.4983
R3   T6       August             Absent           0.2768
R3   T7       September          Above-sec-node   0.0519
R3   T8       September          Absent           0.1730
R4   T9       September          Absent           1.0000
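A quick consistency check on the probabilities listed on this slide: within both the EDB and the MDB, the tuples belonging to one record should sum to 1. The Python sketch below uses hypothetical names (`ROWS`, `totals_per_record`) and only verifies the listed numbers; it does not reproduce how the MDB values were derived.

```python
from collections import defaultdict

# (ID, TempID, EDB probability, MDB probability), copied from this slide.
ROWS = [
    ("R1", "T1", 0.80, 0.9057),
    ("R1", "T2", 0.20, 0.0943),
    ("R2", "T3", 0.60, 0.6429),
    ("R2", "T4", 0.40, 0.3571),
    ("R3", "T5", 0.48, 0.4983),
    ("R3", "T6", 0.32, 0.2768),
    ("R3", "T7", 0.12, 0.0519),
    ("R3", "T8", 0.08, 0.1730),
    ("R4", "T9", 1.00, 1.0000),
]

def totals_per_record(rows, column):
    """Sum the probabilities in the given column for each record ID."""
    sums = defaultdict(float)
    for row in rows:
        sums[row[0]] += row[column]
    return dict(sums)

edb_totals = totals_per_record(ROWS, 2)
mdb_totals = totals_per_record(ROWS, 3)
for rid in edb_totals:
    print(rid, round(edb_totals[rid], 4), round(mdb_totals[rid], 4))
```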

15
Aggregated Querying
We aggregate tuples for aggregated querying based on their uniqueness: two tuples are grouped only when all their attribute values and the corresponding probabilities are the same.
Example query: find the total number of plants grown in August that have a stem canker of type Above-sec-node.
(44*0.9057) + (25*0.6429) ≈ 56
GID   Germination Time   Stem Canker      Marginal probability   No. of plants
G1    August             Above-sec-node   0.9057                 44
G1    August             Above-sec-node   0.0943                 44
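The grouping rule and the example arithmetic can be sketched in Python as follows. The tuple list is hypothetical: it is constructed only so that the group sizes (44 and 25) and marginal probabilities (0.9057 and 0.6429) match the slide's arithmetic, and the third group is an invented non-matching filler.

```python
from collections import Counter

# Hypothetical MDB tuples: (germination time, stem canker, marginal probability).
mdb_tuples = (
    [("August", "Above-sec-node", 0.9057)] * 44
    + [("August", "Above-sec-node", 0.6429)] * 25
    + [("September", "Absent", 1.0)] * 10  # invented filler, does not match the query
)

# Group tuples only when all attribute values AND the probability are identical.
groups = Counter(mdb_tuples)

def expected_count(groups, time, canker):
    """Expected number of plants matching the query: sum of group size * probability."""
    return sum(
        n * prob
        for (g_time, g_canker, prob), n in groups.items()
        if g_time == time and g_canker == canker
    )

estimate = expected_count(groups, "August", "Above-sec-node")
print(round(estimate, 4), "which rounds to", round(estimate))
```

Grouping keeps the two August/Above-sec-node groups separate because their probabilities differ, which is why the query sums two products rather than one.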

16
Experimental Setup
We used the Census-Income dataset from the UCI Machine Learning Repository.
The final experiments used 7 dimensions.
The precise database has 191239 records.
The test dataset has 99762 records.
Imprecision was randomly inserted into the test dataset to make it imprecise.
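One way to insert imprecision at random is to replace a precise value with one of its ancestors in the taxonomy. The slide does not describe the exact procedure, so the Python sketch below is an assumption; the taxonomy reuses the germination-time example from earlier slides instead of the census attributes, and `PARENT`, `ancestors`, and `inject_imprecision` are hypothetical names.

```python
import random

# Hypothetical child -> parent taxonomy (germination-time example).
PARENT = {
    "June": "Summer", "July": "Summer",
    "August": "Fall", "September": "Fall",
    "Summer": "Germination Time", "Fall": "Germination Time",
}

def ancestors(value):
    """Return the chain of ancestors of a value, nearest first."""
    chain = []
    while value in PARENT:
        value = PARENT[value]
        chain.append(value)
    return chain

def inject_imprecision(values, rate, rng):
    """Replace roughly `rate` of the precise values with a randomly chosen ancestor."""
    result = []
    for value in values:
        chain = ancestors(value)
        if chain and rng.random() < rate:
            result.append(rng.choice(chain))  # value becomes imprecise
        else:
            result.append(value)              # value stays precise
    return result

rng = random.Random(0)  # fixed seed so the run is repeatable
precise = ["June", "July", "August", "September"] * 5
imprecise = inject_imprecision(precise, rate=0.3, rng=rng)
print(imprecise)
```

Choosing the ancestor's level with different probabilities per level would give a distribution of imprecision across taxonomy levels; the uniform choice here is only the simplest option.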

17
Distribution of Imprecision
Attributes       age          mic          edu          cbf/cbm/cbs   wwy
Level 1 (Root)   150 (1)      107 (1)      136 (1)      100 (1)       375 (1)
Level 2          450 (3)      1393 (16)    409 (3)      144 (2)       1125 (3)
Level 3          900 (6)      8500 (24)    955 (7)      289 (4)       8500 (9)
Level 4          8500 (12)    -            8500 (17)    867 (12)      -
Level 5          -            -            -            8500 (42)     -
Total            10000 (22)   10000 (41)   10000 (28)   10000 (61)    10000 (13)

18
Imprecision Characteristics

19
Scalability Test

20
Extended Database Analysis

21
Influence of Imprecision

22
Absolute Percentage Error
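Assuming the standard definition of absolute percentage error, |actual - estimated| / |actual| * 100, the metric can be sketched in Python as follows. The function name and the sample values are hypothetical and are not results from the presentation.

```python
def absolute_percentage_error(actual, estimated):
    """Absolute percentage error of an estimated aggregate against the true value."""
    if actual == 0:
        raise ValueError("absolute percentage error is undefined when actual is 0")
    return abs(actual - estimated) * 100.0 / abs(actual)

# Hypothetical values: a query whose true answer is 50 and whose estimate is 47.
ape = absolute_percentage_error(actual=50, estimated=47)
print(ape)
```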

23
Conclusion and Future Work
In this research we present a framework for efficient querying over imprecise data, with an average accuracy of 94%.
We intend to extend this research to use ontologies in place of taxonomies.
We also intend to use Associative Weight Mining to assign weights to hyperedges.

24
Questions?
References
[BDRV07] Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50.
[AC10] Rafal A. Angryk, Jacek Czerniak: Heuristic Algorithm for Interpretation of Multi-Valued Attributes in Similarity-based Fuzzy Relational Databases. International Journal of Approximate Reasoning 51: 895-911 (2010).
