Clustering XML Documents for Query Performance Enhancement Wang Lian.

1 Clustering XML Documents for Query Performance Enhancement Wang Lian

2 Outline: Related Work; Motivation; Our Approach; S-Graph; Distance Function; Clustering Algorithm; Experimental Results

3 Related Work Besides storing XML documents in their native format, storing them in an RDBMS is an established trend. There are two main approaches to storing XML documents in an RDBMS: schema mapping and structure mapping.

4 Related Work(cont) In schema mapping, a database schema is derived from the DTD of the XML documents, so different DTDs generate different database schemas. In structure mapping, the database schema is fixed by defining a set of generic tables.

5 Motivation With either schema mapping or structure mapping, documents must be cut into pieces and inserted into tables. To answer a query, these tables must be joined. As the tables grow larger, the join cost may become very high. Our observation: if a collection contains documents of different structures, then clustering the documents by structure may reduce the join cost.

6 Motivation(cont) An example DTD is:

7 Motivation(cont) Documents in 3 clusters

8 Motivation(cont) Unpartitioned schema

9 Motivation(cont) Partitioned schema

10 Motivation(cont) Suppose we want to answer the XPath query /conference/author/text(). Using the unpartitioned schema, tables conference (2 tuples) and author (9 tuples) must be joined. Using the partitioned schema, tables conference_1 (2 tuples) and author_1 (3 tuples) must be joined.

11 Our Approach An XML document is a mixture of structure information and data values. In our context, only structure information is used for clustering. We need a proper distance function before we can apply any clustering algorithm.

12 S-Graph Given a set of XML documents $C$, the structure graph (s-graph) of $C$, $sg(C) = (N, E)$, is a directed graph such that $N$ is the set of all the elements and attributes in the documents in $C$, and $(a, b) \in E$ if and only if $a$ is a parent element of $b$ in some document in $C$ ($b$ can be an element or an attribute). The s-graph does not capture all the structure information of the documents, but it does capture the parent-child relationship, which is valuable for evaluating path expressions.
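The definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `s_graph`, the use of `xml.etree.ElementTree`, and the `@` prefix that marks attribute nodes are all assumptions made for this example. Only the edge set E is returned, since the distance function on the next slide uses edges alone.

```python
import xml.etree.ElementTree as ET

def s_graph(documents):
    """Return the edge set of the s-graph of a collection of XML strings.

    Each edge is a (parent, child) pair of names. Attribute nodes are
    written with an "@" prefix; that naming is an assumption of this sketch.
    """
    edges = set()
    for text in documents:
        root = ET.fromstring(text)
        for parent in root.iter():          # every element, in document order
            for child in parent:            # direct child elements
                edges.add((parent.tag, child.tag))
            for attr in parent.attrib:      # attributes count as children too
                edges.add((parent.tag, "@" + attr))
    return edges

# Hypothetical sample document, not taken from the slides.
doc = ("<conference id='1'><title>VLDB</title>"
       "<author><name>A</name></author></conference>")
print(sorted(s_graph([doc])))
```

Because the result is a set, repeated elements (for example, several author children) contribute a single edge, which matches the "if and only if a is a parent of b in some document" wording.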

13 Distance Function For two sets of XML documents, $C_1$ and $C_2$, the distance between them is defined as $$dist(C_1, C_2) = 1 - \frac{|sg(C_1) \cap sg(C_2)|}{\max\{|sg(C_1)|, |sg(C_2)|\}}$$ where $|sg(C_i)|$ is the number of edges in $sg(C_i)$, $i = 1, 2$, and $sg(C_1) \cap sg(C_2)$ is the set of common edges of $sg(C_1)$ and $sg(C_2)$.
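As a rough illustration, the sketch below assumes the distance is one minus the number of common edges divided by the edge count of the larger s-graph. That form is consistent with the quantities named on this slide and with the values 1 and 0.25 reported on the next one. The function name `dist`, the handling of two empty graphs, and the sample edge sets are assumptions of this example.

```python
def dist(sg1, sg2):
    """Distance between two s-graphs given as sets of (parent, child) edges.

    Assumed form: 1 - |common edges| / max(|sg1|, |sg2|).
    """
    larger = max(len(sg1), len(sg2))
    if larger == 0:
        # Assumption: two edgeless graphs are treated as identical.
        return 0.0
    return 1.0 - len(sg1 & sg2) / larger

# Hypothetical edge sets, not the documents shown on the slides.
sg_a = {("conference", "title"), ("conference", "author"),
        ("author", "name"), ("author", "email")}
sg_b = {("conference", "title"), ("conference", "author"),
        ("author", "name")}
sg_c = {("journal", "editor")}

print(dist(sg_a, sg_b))  # 3 common edges out of max(4, 3) = 4
print(dist(sg_a, sg_c))  # no common edges
```

Under this assumed form the distance is 0 for identical s-graphs and 1 for s-graphs that share no edge, and it stays small when one s-graph is nearly a subgraph of the other.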

14 Distance Function(cont) Dist({doc1}, {doc2}) = 1 and Dist({doc2}, {doc3}) = 0.25; Tree-dist({doc1}, {doc2}) = Tree-dist({doc2}, {doc3}) = 1

15 Clustering Algorithm
Input: X, the set of XML documents
Input: k, the number of clusters specified by the user
1. SG = pre-clustering(X)
2. While (number of remaining clusters > k): merge the clusters C_i and C_j that maximize a predefined goodness function
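The two steps can be sketched as follows. Several points are assumptions, since the slides do not spell them out: pre-clustering is taken to group documents with identical s-graphs, the goodness function is taken to be the negated s-graph distance (so the closest pair is merged), and the distance uses the form one minus common edges over the larger edge count. The names `pre_cluster`, `cluster`, and `dist` are hypothetical. The quadratic pair search is for clarity only and does not reproduce the stated complexity.

```python
from itertools import combinations

def dist(sg1, sg2):
    """Assumed distance: 1 - |common edges| / max(|sg1|, |sg2|)."""
    larger = max(len(sg1), len(sg2))
    return 0.0 if larger == 0 else 1.0 - len(sg1 & sg2) / larger

def pre_cluster(s_graphs):
    """Group document indices by identical s-graph (assumed pre-clustering)."""
    groups = {}
    for index, sg in enumerate(s_graphs):
        groups.setdefault(frozenset(sg), []).append(index)
    return groups

def cluster(s_graphs, k):
    """Merge clusters until k remain (k >= 1); returns (s-graph, doc indices) pairs."""
    clusters = [(set(sg), docs) for sg, docs in pre_cluster(s_graphs).items()]
    while len(clusters) > k:
        # Assumed goodness function: the pair with the smallest distance wins.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]][0], clusters[p[1]][0]))
        sg_i, docs_i = clusters[i]
        sg_j, docs_j = clusters[j]
        # The s-graph of a merged cluster is the union of both edge sets.
        merged = (sg_i | sg_j, docs_i + docs_j)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical input: one s-graph (edge set) per document.
graphs = [
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "c"), ("c", "d")},
    {("x", "y")},
]
result = cluster(graphs, 2)
print(sorted(sorted(docs) for _, docs in result))
```

Merging works on the distinct s-graphs rather than on individual documents, which is why the iterative step depends on m = |SG| and not on the number of documents.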

16 Clustering Algorithm(cont) Complexity, with $n = |X|$ and $m = |SG|$. Time complexity: the upper bound of pre-clustering is $O(nm)$; in general, it can be reduced to $O(n)$. Iterative merging: $O(m^2 \log m)$. Space complexity: $O(m^2)$.

17 Experimental Results The clustering algorithm is tested on real data, the DBLP XML records, which contain about 200,000 documents composed of 36 elements. Pre-clustering is effective: after scanning the 200,000 documents, only 233 distinct s-graphs remain, so the subsequent clustering takes less than 2 seconds.

18 Experimental Results(cont) After setting the number of clusters to 3, we get three clusters containing about 193,000 documents: one for article and the other two for inproceedings. Interestingly, in the two inproceedings clusters, one s-graph is a subgraph of the other.

19 Experimental Results(cont) We use Oracle 8.1.5 to store all the documents in 4 versions. Version 1: unpartitioned schema mapping. Version 2: partitioned schema mapping. Version 3: unpartitioned structure mapping. Version 4: partitioned structure mapping.

20 Experimental Results Query types. Q1: /A_1/A_2/…/A_k; all possible absolute XPaths in the documents. Q2: /A_1/A_2/…/A_k[text()="value"]/text(); absolute XPaths in which A_i, i = 1, …, k, are randomly picked and "value" is the value of A_k in some document. Q3: /A_1/A_2/…/A_k[contains(., "substring")]/text(); same as Q2 except that the condition tested is that A_k contains "substring".
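To make the Q1 and Q2 workloads concrete, the sketch below enumerates the absolute element paths of one document and formats a Q2-style query string from a path and a value. It is an assumption about how such queries could be generated, not the procedure used in the experiments. The names `absolute_paths` and `q2` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def absolute_paths(xml_text):
    """Return every absolute element path /A_1/.../A_k found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths

def q2(path, value):
    """Format a Q2-style query; assumes value contains no double quote."""
    return f'{path}[text()="{value}"]/text()'

# Hypothetical sample record, loosely shaped like a bibliography entry.
doc = ("<article><title>T</title>"
       "<author>A</author><author>B</author></article>")
print(sorted(absolute_paths(doc)))
print(q2("/article/title", "T"))
```

Each distinct path appears once regardless of how many elements share it, so the Q1 set is bounded by the structure of the collection rather than by its size.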

21 Experimental Results

22 Question?

