Efficient Top-k Search across Heterogeneous XML Data Sources Jianxin Li 1 Chengfei Liu 1 Jeffrey Xu Yu 2 Rui Zhou 1 1 Swinburne University of Technology.

1 Efficient Top-k Search across Heterogeneous XML Data Sources
Jianxin Li (1), Chengfei Liu (1), Jeffrey Xu Yu (2), Rui Zhou (1)
(1) Swinburne University of Technology; (2) Chinese University of Hong Kong

2 Outline
- Motivation
- Related Work
- Preliminary and Problem Statement
- BT-based Scheduling Strategy
- Case Study
- Experiments
- Conclusions

3 Motivation
Top-k queries
- Approximate answers are required when exact results cannot be found.
- Returning a large number of results is not desirable.
Multiple XML data sources
- In applications of XML data, users are sometimes interested in results retrieved from several data sources at the same time.
- Answering top-k queries over multiple XML data sources is still an open problem.

4 Related Work
Top-k queries in XML
- Amelie Marian et al. Adaptive processing of top-k queries in XML. ICDE 2005.
- Martin Theobald et al. An efficient and versatile query engine for TopX search. VLDB 2005.
- Raghav Kaushik et al. On the integration of structure indexes and inverted lists. SIGMOD 2004.
Top-k queries in relational databases
- Upper, MPro, TPUT, etc.
We focus on top-k queries over multiple XML data sources.

5 Preliminary – XML Query Relaxation
XML data and relevant schemas
[Figures: Fig. 1, bookshop S1; Fig. 2, schema d1 of S1; Fig. 3, bookshop S2; Fig. 4, schema d2 of S2]

6 Preliminary – XML Query Relaxation
Relaxed results
[Figures: Fig. 5, an original query q; Fig. 6, a relaxed query to d1; Fig. 7, a relaxed query to d2; the figures are annotated with RankScore = 2.28 and RankScore = 4.88]
We keep the changed weight for each edge in relaxed queries.

7 Problem Statement
Given a weighted query q and a number of data sources {S1, S2, …, Sn} conforming to DTDs {d1, d2, …, dn}, let {q1, q2, …, qn} be the set of weighted relaxed query templates of q w.r.t. the set of DTDs. Our aim is to efficiently find the top-k results by scheduling the evaluation of {q1, q2, …, qn} over {S1, S2, …, Sn}.

8 BT-based Scheduling Strategy
- Data source determination and switching
- Result determination
- Edge selection

9 Data source determination and switching
- Compute the ranking scores {U(1), …, U(n)} of the relaxed queries {q1, q2, …, qn} w.r.t. the data sources {S1, S2, …, Sn}.
- Sort the ranking scores, highest first, as U = {U(k_1), …, U(k_n)}.
- Take the data source S_{k_1} to be evaluated and U(k_2) as the current threshold σ.
Example: for the relaxed query q1 w.r.t. d1, U(1) = 2.28; for the relaxed query q2 w.r.t. d2, U(2) = 4.88. The threshold is therefore σ = 2.28.
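The selection step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_source` and the dictionary layout are hypothetical, and the sketch assumes at least two data sources so that a second-best score exists to serve as the threshold.

```python
def choose_source(scores):
    """Pick the data source to evaluate first and the threshold sigma.

    scores maps a data-source index i to its ranking score U(i).
    Hypothetical helper: assumes at least two data sources.
    """
    # Sort source indices by ranking score, highest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Evaluate the best-scored source; the runner-up score is the threshold.
    return ranked[0], scores[ranked[1]]


# Scores from the slide: U(1) = 2.28, U(2) = 4.88.
source, sigma = choose_source({1: 2.28, 2: 4.88})
print(source, sigma)
```

With the slide's numbers, the sketch selects data source 2 and uses U(1) = 2.28 as the threshold, which matches σ = 2.28 on the slide.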

10 Result determination
We adjust the lower bound L and the upper bound U during query evaluation. When L becomes equal to or larger than the current threshold, we process the current candidates as follows:
- The number of candidates is equal to k: stop.
- The number of candidates is less than k: continue to search.
- The number of candidates is larger than k: refine the candidates.
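The three cases above amount to a small decision rule, sketched here in Python. The function name and the returned labels are hypothetical. The branch for L below the threshold is an assumption, since the slide only describes what happens once L reaches it.

```python
def next_action(num_candidates, k, lower_bound, sigma):
    """Decide what to do with the current candidates (illustrative only)."""
    if lower_bound < sigma:
        # Assumption: below the threshold, evaluation simply goes on.
        return "keep evaluating"
    if num_candidates == k:
        return "stop"
    if num_candidates < k:
        return "continue to search"
    # More candidates than needed: they must be narrowed down.
    return "refine candidates"


print(next_action(num_candidates=3, k=3, lower_bound=2.5, sigma=2.28))
```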

11 Edge selection
- Random
- Min_weight
- Max_weight
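The slide names the three strategies without defining them. One plausible reading is that each picks the next query edge to evaluate from the edges not yet processed: at random, by smallest weight, or by largest weight. The sketch below follows that reading. The function name, the edge names, and the weights are hypothetical and not taken from the slides.

```python
import random


def select_edge(edges, strategy, rng=random):
    """Pick the next edge to evaluate.

    edges maps an edge name to its weight (hypothetical representation).
    """
    if not edges:
        raise ValueError("no edges left to select")
    if strategy == "random":
        return rng.choice(sorted(edges))
    if strategy == "min_weight":
        return min(edges, key=edges.get)
    if strategy == "max_weight":
        return max(edges, key=edges.get)
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative weights only.
pending = {"title": 0.5, "price": 1.2, "year": 0.8}
print(select_edge(pending, "min_weight"), select_edge(pending, "max_weight"))
```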

12 Case Study
[Figure: step-by-step evaluation with U(2) = 4.88 and threshold σ = 2.28]
- Query nodes book, title, info: L(2) = 1.70 < σ; candidates B1, B2, B4.
- Query nodes book, title, info, price; candidates B1 and B2, B4: L(2)(G1) = 3.5 > σ; L(2)(G2) = 1.70 < σ and U(2)(G2) = 3.08 > σ. Top-1 result found.
- Query nodes book, title, info, price, year; candidates B2 and B4: L(2)(G3) = 4.4 > σ; L(2)(G4) = 1.70 < σ and U(2)(G4) = 2.18 < σ. Top-2 result found.
- Switch data source to search for the top-3 result.
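The bound comparisons in the case study can be replayed with a short sketch. It encodes one reading of the slide, which is an assumption: a group whose lower bound reaches σ yields a result, a group whose upper bound falls below σ is set aside in favour of another data source, and anything in between needs more evaluation. The function name and the labels are hypothetical.

```python
def classify_group(lower, sigma, upper=None):
    """Classify a candidate group by its score bounds (illustrative)."""
    if lower >= sigma:
        return "result"
    if upper is not None and upper < sigma:
        # Cannot beat the threshold here: look at another data source.
        return "defer"
    return "undecided"


SIGMA = 2.28
groups = {
    "G1": classify_group(3.5, SIGMA),
    "G2": classify_group(1.70, SIGMA, upper=3.08),
    "G3": classify_group(4.4, SIGMA),
    "G4": classify_group(1.70, SIGMA, upper=2.18),
}
print(groups)
```

Under this reading, G1 and G3 give the top-1 and top-2 results, G2 is split further, and G4 triggers the switch to another data source.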

13 Experiments
Experimental setup
- We ran all algorithms in Java on an Intel P4 3 GHz PC with 512 MB of memory. The Wutka DTD parser was used to analyze the structures of the DTDs.
Dataset and selected queries
- We used the XMark XML data generator to produce the dataset.
- Three queries were designed:
  q1: //item[./description/parlist]
  q2: //item[./description/parlist/mailbox/mail[./text]]
  q3: //item[./mailbox/mail/text[./keyword and ./xxx] and ./name and ./xxx]
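To make the shape of q1 concrete, the sketch below checks its condition against a tiny hand-written document using Python's standard `xml.etree.ElementTree`. The sample document and its `id` attributes are made up for illustration and are not XMark output. The original experiments were run in Java, so this is only a reading aid.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, loosely shaped like XMark items.
DOC = """
<site>
  <item id="i1"><description><parlist/></description></item>
  <item id="i2"><description><text>plain</text></description></item>
  <item id="i3"><name>lamp</name></item>
</site>
""".strip()

root = ET.fromstring(DOC)

# q1: //item[./description/parlist]
# ElementTree predicates only test direct children, so the nested
# path condition is checked explicitly with find().
matches = [
    item.get("id")
    for item in root.iter("item")
    if item.find("./description/parlist") is not None
]
print(matches)
```

Only the first item has a `description/parlist` path, so it is the only exact match. The other two items are the kind of near-misses that query relaxation is meant to score rather than discard.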

14 Experiments
- Static sort vs. dynamic sort
- No schedule vs. BT schedule
- Varying top-k size

15 Conclusions
Contributions:
- Proposed a BT-based scheduling strategy for evaluating top-k queries over multiple XML data sources;
- Output results immediately, without waiting for the end of query evaluation;
- Implemented the relevant algorithms and demonstrated their effectiveness and efficiency on XMark data sets.

16 Thanks & Questions
