Querying Tree-Structured Data Using Dimension Graphs Dimitri Theodoratos (New Jersey Institute of Technology, USA) Theodore Dalamagas (National Techn.

1 Querying Tree-Structured Data Using Dimension Graphs
Dimitri Theodoratos (New Jersey Institute of Technology, USA)
Theodore Dalamagas (National Technical University of Athens, Greece)

2 -2- Tree-structured Data Management
Tree structures are a means of organizing information on the Web. Examples: taxonomies, thematic categories, concept hierarchies, product catalogs, etc.
Organizing data in tree structures (tree-structured data) has become widespread due to the popularity of XML, the W3C standard data exchange format on the Web.
- Data is stored natively in tree structures, or
- data is made publicly available in tree structures to enable automatic processing by programs, scripts, and agents.

3 -3- Tree-structured Data Management
Querying tree-structured data is based on path expression queries. Popular query languages for tree-structured data are XPath and XQuery (W3C), e.g.:
FOR $i IN /brand/type[price<900] RETURN {$i/id, $i/condition, $i/price}
(find products cheaper than 900, and display their id, condition, and price)
[Figure: sample XML fragment with the values Sony, laptop, 1, used, 800.]
Querying tree-structured data runs into two major obstacles:
- the semistructured nature of the data,
- the lack of semantics.
This is the price one pays for the flexibility offered by XML technologies.

4 -4- Semistructured Nature of Tree-structured Data
Because of the first obstacle (the semistructured nature of the data), querying tree-structured data requires resolving structural differences and inconsistencies. The reason: the same information can be organized in tree structures in different ways.
- Structural differences: certain nodes (categories, elements, etc.) exist in one tree-structured data source but not in another.
- Structural inconsistencies: node sequences vary (even within a single tree-structured data source).

5 -5-
[Figure: Product Catalog A and Product Catalog B drawn as trees rooted at r, with nodes such as Notebooks, Desktops, PDAs, Servers, Multimedia, brand names (Mac, HP, Sony, IBM, Dell), and New/Used; Catalog A also has Custom, Ultralight, 10'', and 8'' nodes.]
Structural difference: Product Catalog A has a finer categorization of notebooks (Custom/Ultralight, and 10''/8'' for the ultralights) than Catalog B.

6 -6-
[Figure: Product Catalog A and Product Catalog B drawn as trees rooted at r, with New/Used nodes and brand nodes (Mac, HP, Sony, IBM, Dell) appearing in different orders below Notebooks.]
Structural inconsistency: Product Catalog A classifies notebooks by brand and then by condition, while Catalog B does it the other way around (Sony/Used vs. Used/Sony).

7 -7- Semistructured Nature of Tree-structured Data
[Figure: two XML fragments for the same data (Sony, laptop, used, 800), one with the element sequence brand, type, condition and the other with type, condition, brand.]
Structural inconsistency (cont.): one XML document contains the element sequence brand, type, condition, while another one (for the same data) contains type, condition, brand. Such inconsistencies are observed even within the tree-structured data of a single data source.

8 -8- Semistructured Nature of Tree-structured Data
How do structural differences and inconsistencies affect the querying of tree-structured data? The user has to specify them explicitly as part of the query, which is extremely cumbersome. E.g., the user must write an explicit disjunction of the possible alternative node sequences:
/brand/type[price<900] OR /type/condition[price<900] OR /condition/type[price<900] ...
[Figure: two XML fragments for the same data (Sony, laptop, used, 800) with differently ordered elements.]
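To illustrate why writing these disjunctions by hand does not scale, the Python sketch below enumerates every ordering of a set of element names and joins the resulting paths with OR. The helper name `all_orderings_query` and the choice to attach the predicate to the last step are assumptions made for illustration only; they do not come from the presentation.

```python
from itertools import permutations

def all_orderings_query(elements, predicate):
    """Build one path per ordering of `elements` and join them with OR."""
    paths = ["/" + "/".join(order) + predicate
             for order in permutations(elements)]
    return " OR ".join(paths)

# Hypothetical element names taken from the slide's example query.
query = all_orderings_query(["brand", "type", "condition"], "[price<900]")
alternatives = query.count(" OR ") + 1
print(alternatives)
print(query)
```

With n element names there are n! orderings, so even three elements already require six alternative paths if nothing is known about the structure.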

9 -9- Semistructured Nature of Tree-structured Data
However, alternative node sequences are not always specified in order to resolve structural differences and inconsistencies. Users should be able to pose queries even if they do not know (or do not care about) the exact structure of the tree-structured data sources, e.g.: "find products cheaper than 900, and display their id, condition, and price... but I do not know (or I do not care!) whether condition comes before brand and type!"
Currently, query formulation on tree-structured data depends strictly on the structure of the data. Only the ancestor/descendant relationship yields relaxed path expressions (brand//type).

10 -10- Lack of Semantics in Tree-structured Data
Reminder: querying tree-structured data runs into two major obstacles: the semistructured nature of the data (just explained) and the lack of semantics.
Tree-structured data provides mainly syntactic, not semantic, information. However, there are inherent semantics in tree-structured data: sets of nodes in a catalog are usually related under a semantic interpretation, e.g. Mac, HP, and Sony all refer to a brand name.
Such information can be exploited in query formulation and to support query optimization. Currently, query formulation on tree-structured data ignores it.

11 -11- Our Approach
- We introduce the notion of dimension graphs to capture semantic information in tree-structured data.
- We design a query language for tree-structured data. Queries are not cast on the structure of tree-structured data, and they handle structural differences and inconsistencies effectively.
- We discuss query evaluation issues.
- We show how dimension graphs can be used to query multiple tree-structured data sources.

12 -12- Data Model
We use value trees to represent tree-structured data. Values (i.e. nodes) in value trees are grouped to form dimensions.
A dimension is a set of semantically related nodes (i.e. values) in the value tree. The semantic interpretation is given by the user. Two nodes on the same path cannot belong to the same dimension.
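The rule that two nodes on the same path cannot share a dimension can be checked mechanically. The Python sketch below is one possible way to do it, not the authors' implementation: the nested-tuple tree encoding `(value, [subtrees])`, the `dimension_of` mapping, and the function name `find_violation` are all assumptions made for this example.

```python
def find_violation(tree, dimension_of, seen=None):
    """Return (ancestor_value, descendant_value) for the first pair of nodes
    on one root-to-leaf path that share a dimension, or None if there is none.

    `tree` is a hypothetical encoding: (value, [subtrees]).
    `dimension_of` maps a value to its dimension name (user-supplied).
    """
    value, children = tree
    seen = dict(seen or {})          # copy, so sibling branches stay independent
    dim = dimension_of.get(value)    # values without a dimension are skipped
    if dim is not None:
        if dim in seen:
            return (seen[dim], value)
        seen[dim] = value
    for child in children:
        hit = find_violation(child, dimension_of, seen)
        if hit is not None:
            return hit
    return None

dimension_of = {"Notebooks": "pc_type", "Used": "condition",
                "Sony": "brand", "HP": "brand"}

good_tree = ("r", [("Notebooks", [("Used", [("Sony", []), ("HP", [])])])])
bad_tree = ("r", [("Sony", [("Used", [("HP", [])])])])

print(find_violation(good_tree, dimension_of))
print(find_violation(bad_tree, dimension_of))
```

In `good_tree`, Sony and HP are siblings, so they sit on different paths and may share the dimension brand; in `bad_tree`, HP lies below Sony on the same path, which the rule forbids.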

13 -13- Data Model
[Figure: value tree T rooted at r, with its nodes labeled by dimension (R, pc_type, pc_category, brand, condition).]
E.g. dimensions: pc_type = {Notebooks, Desktops, PDAs}, pc_category = {Servers, Multimedia}, brand = {Mac, Sony, HP, IBM, Dell}, etc.

14 -14- Data Model
We use dimension graphs to capture relationships between dimensions. The nodes of a dimension graph represent dimensions. There is an edge from dimension D1 to dimension D2 if a value of D1 is the parent of some value of D2.
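The edge rule above translates directly into a tree traversal. The following Python sketch shows one way to extract the edges, under stated assumptions: the value tree is encoded as nested tuples `(value, [subtrees])`, every value (including the root r, mapped to R) has an entry in `dimension_of`, and the small tree is made up for illustration rather than copied from the slides.

```python
def dimension_graph(tree, dimension_of):
    """Collect the edges (D1, D2) such that some value of D1 is the parent
    of some value of D2 in the value tree."""
    edges = set()

    def visit(node):
        value, children = node
        for child in children:
            edges.add((dimension_of[value], dimension_of[child[0]]))
            visit(child)

    visit(tree)
    return edges

# Hypothetical value tree with a structural inconsistency:
# Notebooks/Used/Sony on one branch, Desktops/Sony/Used on the other.
tree = ("r", [
    ("Notebooks", [("Used", [("Sony", [])])]),
    ("Desktops", [("Sony", [("Used", [])])]),
])
dimension_of = {"r": "R", "Notebooks": "pc_type", "Desktops": "pc_type",
                "Used": "condition", "Sony": "brand"}

edges = dimension_graph(tree, dimension_of)
for edge in sorted(edges):
    print(edge)
```

Because the two branches order condition and brand differently, the resulting graph contains both the edge condition → brand and the edge brand → condition.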

15 -15- Data Model
[Figure: value tree T, with its nodes labeled by dimension, shown next to the dimension graph of T over the dimensions R, pc_type, pc_category, brand, and condition.]

17 -17- Data Model
A dimension graph
- can be automatically extracted from a value tree, given the dimensions,
- provides an abstraction of the structural information of value trees,
- provides semantic guidance for posing queries on tree-structured data in the presence of structural differences and inconsistencies,
- supports query evaluation and optimization (to be explained shortly).

18 -18- Querying Tree-structured Data
Queries are defined on dimension graphs, not directly on value trees. The user annotates some dimensions and may leave the parent-child and ancestor-descendant relationships between the annotated dimensions unspecified or only partially specified.
Our system identifies the possible 'valid' orderings of dimensions by exploiting the dimension graph. These orderings are used as patterns for constructing a set of path expressions that are evaluated directly on the value trees.

19 -19- Querying
[Figure: value tree T and a query on the dimension graph of T, with the annotations condition = {used}, pc_type = ?, brand = {Sony, IBM}; R and pc_category are not annotated.]
Annotated dimensions:
- = ? : the dimension can have any value
- = {...} : the dimension should have specific values

20 -20- Querying
[Figure: value tree T and the query on the dimension graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
'Find all used Sony or IBM products', i.e. find paths in T from r to a leaf node that contain
- any of the values of dimension pc_type,
- the value 'used' of dimension condition,
- either value 'Sony' or 'IBM' of dimension brand.

22 -22- Querying
[Figure: value tree T and the query on the dimension graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Notice how the query handles the structural inconsistencies!

23 -23- Querying
[Figure: part of value tree T and the query on the dimension graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
'Find all used Sony or IBM products, where the nodes for the brand name come after the node Used', i.e. find paths in T from r to a leaf node that contain
- any of the values of dimension pc_type,
- the value 'used' of dimension condition,
- either value 'Sony' or 'IBM' of dimension brand.
However, values of condition should be parents of values of brand.

24 -24- Querying
[Figure: value tree T and the query on the dimension graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Find paths in T from r to a leaf node that contain
- any of the values of dimension pc_type,
- the value 'used' of dimension condition,
- either value 'Sony' or 'IBM' of dimension brand.
However, values of condition should be parents of values of brand.

25 -25- Query Evaluation
Query evaluation exploits dimension graphs to detect answer paths. An answer path is a path in a dimension graph that starts from R, includes all annotated dimensions, and ends on an annotated dimension.
[Figure: query on the dimension graph of T, which here also contains the dimension mobile_type: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Examples of answer paths: /R/pc_type/condition/brand, /R/pc_type/pc_category/brand/condition, ...
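The definition of an answer path can be turned into a small depth-first search. The Python sketch below is an illustration, not the authors' algorithm: the adjacency-dict encoding, the function name `answer_paths`, and the edge set (chosen so that the two example paths above exist) are assumptions, and paths are restricted to simple paths because no dimension may repeat on a path.

```python
def answer_paths(graph, annotated, root="R"):
    """Enumerate simple paths that start at `root`, contain every annotated
    dimension, and end on an annotated dimension."""
    results = []

    def extend(path):
        last = path[-1]
        if last in annotated and annotated.issubset(path):
            results.append(list(path))
        for nxt in graph.get(last, ()):
            if nxt not in path:          # a dimension appears at most once
                extend(path + [nxt])

    extend([root])
    return results

# Hypothetical dimension graph (successor lists per dimension).
graph = {
    "R": ["pc_type"],
    "pc_type": ["condition", "pc_category"],
    "pc_category": ["brand"],
    "condition": ["brand"],
    "brand": ["condition"],
}
annotated = {"pc_type", "condition", "brand"}

paths = sorted(answer_paths(graph, annotated))
for p in paths:
    print("/" + "/".join(p))
```

The search runs on the dimension graph only, so its cost depends on the number of dimensions rather than on the size of the value tree.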

26 -26- Query Evaluation
[Figure: value tree T and the query on the dimension graph of T: condition = {used}, pc_type = ?, brand = {Sony, IBM}.]
Answer paths are used to generate path expressions, which can be evaluated by e.g. an XQuery engine to retrieve the answers from a value tree.
E.g. /R/pc_type/condition/brand gives /r/(Notebooks|Desktops)/Used/(Sony|IBM)
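A rough Python sketch of this translation step follows. It mirrors the slide's `(a|b)` notation for alternatives; the function name `path_expression`, the dictionaries, and the rule that `?` (encoded as `None`) or a missing annotation expands to every value of the dimension are assumptions for illustration. Under that rule the sketch lists PDAs as well, which the slide's expression omits.

```python
def path_expression(answer_path, values, annotations, root_value="r"):
    """Turn an answer path (a list of dimensions starting at R) into a path
    expression over values, one step per dimension."""
    steps = [root_value]
    for dim in answer_path[1:]:
        # Specific values if annotated with a set, otherwise all values.
        chosen = annotations.get(dim) or values[dim]
        if len(chosen) == 1:
            steps.append(chosen[0])
        else:
            steps.append("(" + "|".join(chosen) + ")")
    return "/" + "/".join(steps)

# Dimension values as listed on the Data Model slide (condition assumed).
values = {
    "pc_type": ["Notebooks", "Desktops", "PDAs"],
    "condition": ["New", "Used"],
    "brand": ["Mac", "Sony", "HP", "IBM", "Dell"],
}
annotations = {"pc_type": None, "condition": ["Used"], "brand": ["Sony", "IBM"]}

expr = path_expression(["R", "pc_type", "condition", "brand"], values, annotations)
print(expr)
```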

27 -27- Query Evaluation
Answer paths help detect the orderings of values that can possibly exist in a value tree. Only these value orderings are used to compute the answer of a query on the value tree, and they are determined before query evaluation reaches the value tree.
Detecting answer paths in a dimension graph is not costly, since dimension graphs are much smaller than value trees.

28 -28- Query Evaluation
Query evaluation exploits dimension graphs to detect unsatisfiable queries (i.e. queries with empty answers in the value tree).
[Figure: three example queries on a dimension graph over R, pc_type, pc_category, brand, condition, and mobile_type, annotating (pc_type, brand), (pc_type, mobile_type, brand), and (mobile_type, condition, pc_category) with = ?.]
Reasons shown for the examples being unsatisfiable:
- No answer paths!
- Two children have the same parent!
- No path from condition to mobile_type!
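One simple way to detect such queries is to test whether any answer path exists that also respects the parent-child relationships the user specified. The Python sketch below does this with an early-exit search; the function name `is_satisfiable`, the `parent_child` parameter, and the example graph are hypothetical and are not taken from the presentation.

```python
def is_satisfiable(graph, annotated, parent_child=(), root="R"):
    """Return True if some simple path from `root` contains all annotated
    dimensions, ends on an annotated one, and has every (parent, child)
    pair in `parent_child` as consecutive steps."""
    def is_answer(path):
        if path[-1] not in annotated or not annotated.issubset(path):
            return False
        steps = set(zip(path, path[1:]))
        return all(pair in steps for pair in parent_child)

    def search(path):
        if is_answer(path):
            return True
        return any(search(path + [nxt])
                   for nxt in graph.get(path[-1], ())
                   if nxt not in path)

    return search([root])

# Hypothetical dimension graph that includes mobile_type.
graph = {
    "R": ["pc_type", "mobile_type"],
    "pc_type": ["condition", "brand"],
    "condition": ["brand"],
    "brand": ["condition"],
    "mobile_type": ["brand"],
}

print(is_satisfiable(graph, {"condition", "brand"}))
print(is_satisfiable(graph, {"pc_type", "mobile_type"}))
print(is_satisfiable(graph, {"condition", "mobile_type"},
                     parent_child=[("condition", "mobile_type")]))
```

In this graph, pc_type and mobile_type never occur on a common path, and there is no edge from condition to mobile_type, so the last two queries can be rejected without touching any value tree.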

29 -29- Query Evaluation
Dimension graphs can be used to query multiple value trees.
- Consider value trees T1, T2, ..., Tn over a dimension set D, and let G1, G2, ..., Gn be their dimension graphs.
- Construct a global dimension graph G by merging G1, G2, ..., Gn.
- Queries are formed on G.
- The annotations are transferred to G1, G2, ..., Gn.
- Query evaluation is performed as described before.
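The slide does not spell out how the graphs are merged. The Python sketch below assumes the simplest reading, namely that G is the union of the nodes and edges of G1, ..., Gn; the function name `merge_graphs` and the two example graphs are made up for illustration.

```python
def merge_graphs(graphs):
    """Merge dimension graphs (dicts: dimension -> set of successors) by
    taking the union of their nodes and edges. This union rule is an
    assumption, not something stated in the presentation."""
    merged = {}
    for g in graphs:
        for dim, successors in g.items():
            merged.setdefault(dim, set()).update(successors)
            for s in successors:
                merged.setdefault(s, set())   # keep sink dimensions as nodes
    return merged

# Two hypothetical source graphs that order brand and condition differently.
g1 = {"R": {"pc_type"}, "pc_type": {"brand"}, "brand": {"condition"}}
g2 = {"R": {"pc_type"}, "pc_type": {"condition"}, "condition": {"brand"}}

g = merge_graphs([g1, g2])
for dim in sorted(g):
    print(dim, sorted(g[dim]))
```

A query posed on the merged graph can then be evaluated against each source graph separately, which is where source-specific orderings are resolved.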

30 -30- Conclusions
Querying tree-structured data using dimension graphs:
- Dimension graphs capture semantic information in tree-structured data and are used for query formulation and evaluation.
- Queries are cast not on the structure of tree-structured data but on dimension graphs.
- Queries can handle structural differences and inconsistencies in value trees.
- Query evaluation exploits dimension graphs to generate appropriate path expressions to be evaluated on the value trees.
- Dimension graphs can also be used to query multiple value trees.

