Published by Bret Litton. Modified over 2 years ago.

1
Computing Structural Similarity of Source XML Schemas against Domain XML Schema
Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Jixue Liu (3), Guoren Wang (4), Chi Yang (1)
(1) Swinburne University of Technology
(2) Chinese University of Hong Kong
(3) University of South Australia
(4) Northeastern University of China

2
Outline
- Motivation
- Related Work
- Problem Statement
- Structural Similarity Model
- Algorithms
- Experiments
- Conclusions and Future Work

3
Motivation
- XML has become the standard for representing, exchanging and integrating data on the web.
- Different source providers may define different schemas for their data, depending on their applications.
- When exact results do not exist, approximate results are still expected to be returned.
Fig. 1: Schema of the first source, S1. Fig. 2: Schema of the second source, S2.

4
Motivation
Users may issue queries based on their common understanding of the domain, i.e., a domain schema. For example, see Fig. 3 (domain schema T).
Brief XPath queries:
Q1: uni[swin]/dept[ICT]/prof
Q2: uni[swin]/lib[./cname[Hawthorn]]/book
…
The domain schema matches neither of the two source schemas. To return approximate results efficiently, the system should determine which source schema is more similar to the domain schema.
How do we compute the similarity between the domain schema and the source schemas?

5
Related Work
Measuring the similarity between XML documents, used to cluster XML documents:
- Edit distance: detects the changes required to turn one XML document into another, such as relabeling, deletion, and insertion.
- Binary tree: similar to edit distance. XML documents are represented as tree-structured data, and the similarity is obtained by comparing the corresponding binary trees.
- Time series: each occurrence of a tag corresponds to an impulse. Analyzing the frequencies gives the degree of similarity between documents.
Measuring the similarity between XML schemas, used to derive schema matching, schema mapping or schema integration:
- Cupid, XClust and Similarity Flooding: structural match algorithms that emphasize only the name and data type similarities present at the leaf level.
- COMA: the similarity between elements is computed recursively from the similarity between their respective children, using a leaf-level matcher.
In summary, all of the above methods compute similarity symmetrically.

6
Related Work
Example of the binary tree model BiBranch (Fig. 4: Example of the BiBranch model), where the smaller the BiB value, the more similar the corresponding pair of trees.
According to this computation, $T_2$ is more similar to $T_0$ than the other trees, which gives the sorted list $T_2 > T_1 = T_3 = T_4$. However, this ranking is not correct for query applications.
A symmetric similarity model cannot satisfy query needs.

7
Problem Statement
Given a domain schema tree $T_0 = (V_0, E_0, v_{r_0}, Card)$ and a source schema tree $T = (V, E, v_r, Card)$, we need to compute their structural similarity distance $SSD(T_0, T)$.
An XML schema tree is defined as $T = (V, E, v_r, Card)$, where:
- $V$ is a finite set of nodes, representing the elements and attributes of the schema;
- $E$ is a set of directed edges;
- $v_r \in V$ is the root node of tree $T$;
- $Card: V \to \{\texttt{"1"}, \texttt{"*"}\}$.
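The tuple T = (V, E, v_r, Card) maps naturally onto a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `SchemaTree`, the choice to identify nodes by their labels, the assumed root cardinality of "1", and the example element names (borrowed from the uni/dept/prof queries) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaTree:
    """Schema tree T = (V, E, v_r, Card); nodes are identified by label (assumption)."""
    root: str                                      # v_r
    children: dict = field(default_factory=dict)   # E, stored as parent -> [children]
    card: dict = field(default_factory=dict)       # Card: V -> {"1", "*"}

    def __post_init__(self):
        # The root always belongs to V; its cardinality is assumed to be "1".
        self.children.setdefault(self.root, [])
        self.card.setdefault(self.root, "1")

    def add(self, parent, child, cardinality="1"):
        """Add the directed edge (parent, child) and record Card(child)."""
        if parent not in self.children:
            raise KeyError(f"unknown parent node: {parent}")
        if cardinality not in ("1", "*"):
            raise ValueError("cardinality must be '1' or '*'")
        self.children[parent].append(child)
        self.children.setdefault(child, [])
        self.card[child] = cardinality

    def nodes(self):
        """V: every node that has been registered in the tree."""
        return set(self.children)

    def edges(self):
        """E: the set of directed (parent, child) edges."""
        return {(p, c) for p, cs in self.children.items() for c in cs}


# Hypothetical domain schema T0, loosely modelled on the example queries.
t0 = SchemaTree(root="uni")
t0.add("uni", "dept", "*")
t0.add("dept", "prof", "*")
t0.add("uni", "lib", "*")
t0.add("lib", "book", "*")
```

Storing E as an adjacency mapping keeps both V and E recoverable while making tree traversals straightforward.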

8
Problem Statement
This work focuses on several distinguishing aspects:
- The purpose of the similarity computation is to choose a similar data source for queries.
- The similarity computation is asymmetric: the schema that users' queries conform to is taken as the domain schema.
- We consider the parent-child (PC) and ancestor-descendant (AD) relationships rather than sibling order, because these relationships are important in formulating a query.
- We take the cardinality of schema elements into account.
- An index based on an encoding scheme is provided to improve the efficiency of the computation.
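To make the PC/AD relationships and the asymmetry concrete, here is a small sketch that enumerates both kinds of node pairs and measures how many of one tree's pairs survive in the other. The function `preserved_fraction` is a hypothetical illustration only; it is not the SNP or SSD measure from the slides, and the two example trees are invented.

```python
def pc_pairs(children):
    """Parent-child (PC) pairs of a tree given as parent -> list of children."""
    return {(p, c) for p, cs in children.items() for c in cs}


def ad_pairs(children, root):
    """Ancestor-descendant (AD) pairs, collected by a depth-first walk.

    A parent counts as an ancestor, so every PC pair is also an AD pair.
    """
    pairs = set()

    def walk(node, ancestors):
        for a in ancestors:
            pairs.add((a, node))
        for c in children.get(node, []):
            walk(c, ancestors + [node])

    walk(root, [])
    return pairs


def preserved_fraction(reference_pairs, other_pairs):
    """Illustrative only: share of reference pairs that also occur in the other tree."""
    if not reference_pairs:
        return 1.0
    return len(reference_pairs & other_pairs) / len(reference_pairs)


# Hypothetical domain and source schemas (labels only, cardinality ignored here).
domain = {"uni": ["dept"], "dept": ["prof"]}
source = {"uni": ["faculty"], "faculty": ["dept"], "dept": ["prof", "staff"]}

d_ad, s_ad = ad_pairs(domain, "uni"), ad_pairs(source, "uni")
d_pc, s_pc = pc_pairs(domain), pc_pairs(source)

domain_in_source = preserved_fraction(d_ad, s_ad)   # every domain AD pair is kept
source_in_domain = preserved_fraction(s_ad, d_ad)   # only 3 of 9 source AD pairs
pc_kept = preserved_fraction(d_pc, s_pc)            # uni/dept is no longer PC
```

In this example the source preserves all AD pairs of the domain, while the domain covers only a third of the source's AD pairs, which is why the direction of comparison matters when the goal is to answer queries written against the domain schema.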

9
Structural Similarity Model
The model takes into account three factors: element coverage, consistency of element-pair relationships, and the difference in element cardinality.
Ratio of interesting objects (formula not preserved), where $V' = V \cap V_0$ is the set of interesting nodes in $V$.
Cardinality similarity of node pairs (formula not preserved).
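The set of interesting nodes can be computed directly as a set intersection. This sketch assumes that nodes of the source and domain schemas are matched by identical labels, which the slides do not state explicitly; the example labels are invented, and the ratio formula itself is not reproduced because it is not present in the text.

```python
def interesting_nodes(source_nodes, domain_nodes):
    """V' = V ∩ V0: source nodes that also appear in the domain schema.

    Assumption: two nodes correspond when their labels are equal.
    """
    return set(source_nodes) & set(domain_nodes)


# Hypothetical node sets for a source schema (V) and a domain schema (V0).
V = {"uni", "faculty", "dept", "prof", "staff"}
V0 = {"uni", "dept", "prof", "lib", "book"}

V_prime = interesting_nodes(V, V0)
```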

10
Structural Similarity Model
Similarity of node pairs: $SNP(v_1, v_2, v_{01}, v_{02})$
Similarity of the source schema w.r.t. the domain schema: $SSD(T_0, T)$

11
Structural Similarity Model
Comparison of the SSD and BiBranch models (Fig. 5: Example of the SSD model):
BiBranch model: $T_2 > T_1 = T_3 = T_4$
SSD model: $T_1 = T_4 > T_3 > T_2$
The SSD results match our expectation.

12
Algorithms
Techniques:
- Trimming rules: root node, leaf node, internal node.
- Numbering scheme used as an index: pre (preorder), post (postorder), C (cardinality), P (parent), RD (rightmost descendant's preorder).
Algorithms:
- Basic Algorithm (BA): conducts pairwise comparisons.
- Improved Algorithm (IA): reduces the number of similarity comparisons.
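A numbering scheme of this kind lets PC and AD relationships be tested in constant time instead of by walking the tree. The sketch below builds the five fields named on the slide with one depth-first traversal; the function names, the dictionary layout, and the example tree are assumptions, and the slides do not say exactly how the authors use the index inside BA or IA.

```python
def build_index(children, root, card):
    """Return {node: {"pre", "post", "C", "P", "RD"}} for a tree.

    children: parent -> list of children; card: node -> "1" or "*".
    """
    index = {}
    counter = {"pre": 0, "post": 0}

    def visit(node, parent):
        counter["pre"] += 1
        entry = {"pre": counter["pre"], "C": card.get(node, "1"), "P": parent}
        index[node] = entry
        for child in children.get(node, []):
            visit(child, node)
        counter["post"] += 1
        entry["post"] = counter["post"]
        # After the subtree is done, the preorder counter holds the number of
        # the last node visited in it, i.e. the rightmost descendant.
        entry["RD"] = counter["pre"]

    visit(root, None)
    return index


def is_parent(index, a, d):
    """PC test: a is the parent of d."""
    return index[d]["P"] == a


def is_ancestor(index, a, d):
    """AD test: d lies in the preorder interval (pre(a), RD(a)]."""
    return index[a]["pre"] < index[d]["pre"] <= index[a]["RD"]


# Hypothetical schema tree: uni -> (dept -> prof, lib -> book).
tree = {"uni": ["dept", "lib"], "dept": ["prof"], "lib": ["book"]}
cards = {"uni": "1", "dept": "*", "prof": "*", "lib": "*", "book": "*"}
idx = build_index(tree, "uni", cards)
```

With the index in place, checking whether a node pair keeps its relationship needs only integer comparisons, which is one plausible way such an index could reduce the cost of pairwise comparisons.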

13
Experiments
Response Time vs. Similarity Degree
Fig. 6: The schema size is set to 20, 40, 60 and 80 nodes in turn, and the similarity degree is adjusted to 25%, 50%, 75% and 100%. Panels: (a) schema size = 20 nodes, (b) schema size = 40 nodes, (c) schema size = 60 nodes, (d) schema size = 80 nodes.

14
Experiments
Response Time vs. Nested Level
Fig. 7: The schema size is 128 nodes and the nesting level varies from 4 to 16.
Speedup vs. Fanout
Fig. 8: The schema size is set to 128 nodes and the fanout varies from 2 to 5.

15
Experiments
Response Time vs. Schema Size
Fig. 9: The schema size is set to 20, 40, 60 and 80 nodes.
Fig. 10: The three public datasets: TPC-H-nested.xsd (17), genexml.xsd (85) and mondial-3.0.xsd (120).

16
Conclusions and Future Work
Contributions:
- Proposed the structural similarity problem for query applications.
- Designed a brief structural similarity model and discussed its effectiveness.
- Implemented the relevant algorithms and demonstrated their efficiency with synthetic and real data sets.
Future work:
- Improve the similarity model to make it more accurate.
- Apply the similarity model to improve query evaluation.

17
Thanks & Questions
