Clustering XML Documents for Query Performance Enhancement Wang Lian.

Outline Related Work; Motivation; Our Approach; S-Graph; Distance Function; Clustering Algorithm; Experimental Results

Related Work Besides storing XML documents in their native format, storing them in an RDBMS is an established trend. There are two main approaches to storing XML documents in an RDBMS: schema mapping and structure mapping.

Related Work(cont) In schema mapping, a database schema is derived from the DTD of the XML documents, so different DTDs generate different database schemas. In structure mapping, the database schema is fixed by defining a set of generic tables.

Motivation With either schema mapping or structure mapping, documents must be cut into pieces and inserted into tables. To answer a query, these tables must be joined. As the tables grow larger, the join cost may become very high. Our observation: if a collection contains documents of different structures, then clustering the documents by structure may reduce the join cost.

Motivation(cont) An example DTD is :

Motivation(cont) Documents in 3 clusters

Motivation(cont) Unpartitioned schema

Motivation(cont) Partitioned schema

Motivation(cont) Suppose we want to answer the XPath query /conference/author/text(). Using the unpartitioned schema, tables conference (2 tuples) and author (9 tuples) must be joined. Using the partitioned schema, tables conference 1 (2 tuples) and author 1 (3 tuples) must be joined.
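To put the tuple counts above in perspective, here is a rough worked calculation. It assumes a naive nested-loop join with no index, which is only an illustration; a real RDBMS would normally pick a cheaper join strategy. The function name `nested_loop_comparisons` is hypothetical.

```python
def nested_loop_comparisons(outer_rows: int, inner_rows: int) -> int:
    """Tuple pairs examined by a naive nested-loop join (assumption: no index)."""
    return outer_rows * inner_rows

# Tuple counts taken from the slide: conference/author vs. conference 1/author 1.
unpartitioned = nested_loop_comparisons(2, 9)
partitioned = nested_loop_comparisons(2, 3)

print(unpartitioned, partitioned)
```

Under this simplified cost model the partitioned schema examines a third as many tuple pairs for the same query.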

Our Approach An XML document is a mixture of structural information and data values. In our context, only structural information is used for clustering. We need a suitable distance function before any clustering algorithm can be applied.

S-Graph Given a set of XML documents C, the structure graph (s-graph) of C, sg(C) = (N, E), is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document in C (b can be an element or an attribute). The s-graph does not capture all the structural information of the documents, but it does capture the parent-child relationship, which is valuable for evaluating path expressions.
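The s-graph definition can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` parser. This is a minimal illustration, not the authors' implementation: the function name `s_graph`, the `@` prefix used to mark attribute nodes, and the sample documents are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def s_graph(documents):
    """Return the edge set E of the s-graph for an iterable of XML strings.

    Nodes are identified by name, so an edge is a (parent_name, child_name)
    pair. Attribute nodes are written as '@name' (a convention assumed here).
    """
    edges = set()
    for text in documents:
        root = ET.fromstring(text)
        for parent in root.iter():          # every element, root included
            for attr in parent.attrib:      # parent -> attribute edges
                edges.add((parent.tag, "@" + attr))
            for child in parent:            # parent -> child element edges
                edges.add((parent.tag, child.tag))
    return edges

# Hypothetical sample documents.
docs = [
    "<conference><title/><author><name/></author></conference>",
    '<conference year="2001"><author><name/><email/></author></conference>',
]
print(sorted(s_graph(docs)))
```

Because edges are stored in a set, repeated parent-child pairs within or across documents contribute only once, which matches the "if and only if" wording of the definition.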

Distance Function For two sets of XML documents, C_1 and C_2, the distance between them is dist(C_1, C_2) = 1 - |sg(C_1) ∩ sg(C_2)| / max{|sg(C_1)|, |sg(C_2)|}, where |sg(C_i)| is the number of edges in sg(C_i), i = 1, 2, and sg(C_1) ∩ sg(C_2) is the set of common edges of sg(C_1) and sg(C_2).
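A small sketch of how such a distance could be computed over edge sets follows. It assumes the distance is one minus the number of common edges divided by the larger of the two edge counts, which is one way to combine the quantities named above; the function name `dist`, the handling of two empty graphs, and the sample edge sets are assumptions for illustration.

```python
def dist(sg1: frozenset, sg2: frozenset) -> float:
    """Distance between two s-graphs given as sets of (parent, child) edges.

    Assumed form: 1 - |common edges| / max(|sg1|, |sg2|).
    Two empty graphs are treated as identical (distance 0), an assumption.
    """
    larger = max(len(sg1), len(sg2))
    if larger == 0:
        return 0.0
    return 1.0 - len(sg1 & sg2) / larger

# Hypothetical edge sets.
a = frozenset({("conference", "title"), ("conference", "author"),
               ("author", "name"), ("author", "email")})
b = frozenset({("conference", "title"), ("conference", "author"),
               ("author", "name")})
c = frozenset({("book", "chapter")})

print(dist(a, b))  # three of four edges are shared
print(dist(a, c))  # no edges are shared
```

With this form the distance is 0 for identical s-graphs and 1 for s-graphs with no common edge.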

Distance Function(cont) Dist({doc1}, {doc2}) = 1 and Dist({doc2}, {doc3}) = 0.25. Tree-dist({doc1}, {doc2}) = Tree-dist({doc2}, {doc3}) = 1.

Clustering Algorithm Input: X, the set of XML documents; k, the number of clusters specified by the user. Step 1: SG = pre-clustering(X). Step 2: while the number of remaining clusters is greater than k, merge the pair of clusters C_i and C_j that maximizes a predefined goodness function.
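The two steps can be sketched as follows. Several things here are assumptions and not taken from the slides: pre-clustering is modeled as grouping documents with identical s-graphs, the goodness function is simply the negated distance (the slides only say "predefined"), the distance uses the max-normalized form, and a merged cluster takes the union of its members' edges. The pair search is brute force for readability and does not reproduce the complexity bounds of the original algorithm.

```python
from itertools import combinations

def dist(sg1, sg2):
    """Assumed distance: 1 - |common edges| / max(|sg1|, |sg2|)."""
    larger = max(len(sg1), len(sg2))
    return 0.0 if larger == 0 else 1.0 - len(sg1 & sg2) / larger

def pre_cluster(sgraphs):
    """Group document indices whose s-graphs are identical (assumed step 1)."""
    groups = {}
    for i, sg in enumerate(sgraphs):
        groups.setdefault(frozenset(sg), []).append(i)
    return groups

def cluster(sgraphs, k, goodness=lambda a, b: -dist(a, b)):
    """Merge the best-scoring pair of clusters until only k clusters remain."""
    clusters = list(pre_cluster(sgraphs).items())  # (s-graph, member indices)
    while len(clusters) > max(k, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: goodness(clusters[p[0]][0], clusters[p[1]][0]),
        )
        merged = (clusters[i][0] | clusters[j][0],   # union of edges
                  clusters[i][1] + clusters[j][1])   # union of members
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical s-graphs for five documents.
sgraphs = [
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("a", "c"), ("a", "d")},
    {("x", "y")},
    {("x", "y"), ("x", "z")},
]
result = cluster(sgraphs, k=2)
print(sorted(sorted(members) for _, members in result))
```

Any other goodness function can be passed through the `goodness` parameter, as long as it takes two edge sets and returns a comparable score.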

Clustering Algorithm(cont) Complexity, with n = |X| and m = |SG|. Time complexity: pre-clustering has an upper bound of O(nm), which in general can be reduced to O(n); iterative merging takes O(m^2 log m). Space complexity: O(m^2).

Experimental Results The clustering algorithm is tested on real data, the DBLP XML records, which contain about 200,000 documents composed of 36 elements. Pre-clustering is effective: after scanning the 200,000 documents, only 233 distinct s-graphs remain, so the subsequent clustering takes less than 2 seconds.

Experimental Results(cont) After setting the number of clusters to 3, we get three clusters containing about 193,000 documents: one for article and the other two for inproceedings. Interestingly, in the two inproceedings clusters, one s-graph is a subgraph of the other.

Experimental Results(cont) We use Oracle 8.1.5 to store all the documents in 4 versions. Version 1: unpartitioned schema mapping. Version 2: partitioned schema mapping. Version 3: unpartitioned structure mapping. Version 4: partitioned structure mapping.

Experimental Results Query types. Q1: /A_1/A_2/…/A_k; all possible absolute XPaths in the documents. Q2: /A_1/A_2/…/A_k[text()="value"]/text(); absolute XPaths in which the A_i, i = 1,…,k, are randomly picked and "value" is the value of A_k in some document. Q3: /A_1/A_2/…/A_k[contains(., "substring")]/text(); same as Q2 except that the condition tested is "A_k contains the substring".
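As an illustration of query type Q1, the sketch below enumerates the distinct absolute element paths of a single document with the standard `xml.etree.ElementTree` parser. The function name `absolute_paths` and the sample document are hypothetical, and this is only one plausible way to generate such a query workload, not the procedure used in the experiments.

```python
import xml.etree.ElementTree as ET

def absolute_paths(xml_text):
    """Return the set of distinct absolute element paths /A_1/A_2/.../A_k."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths

# Hypothetical sample document.
doc = ("<conference><title/>"
       "<author><name/></author><author><name/></author>"
       "</conference>")
print(sorted(absolute_paths(doc)))
```

Repeated elements such as the two `author` entries yield a single path, since Q1 is defined over paths and not over element instances.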

Experimental Results

Question?
