SPARK: Top-k Keyword Query in Relational Databases Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou Univ. of New South Wales, Univ. of Queensland SIGMOD 2007.

2009. 02. 05. Summarized and presented by Jaehui Park, IDS Lab., Seoul National University

Copyright © 2009 by CEBT

Introduction
- Demand for RDBs to support effective and efficient IR-style keyword queries
  - Features
    - Assembling data collectively
    - Supporting casual users
    - Revealing unexpected relationships among entities
    - More flexible search for back-end databases than pre-built template querying
- Issues
  - Search results contradictory to human perception (in previous work)
  - Technical challenges
    - Aggregating the final score of an answer
      - Relying on monotonicity of the rank aggregation function
- Contributions
  - New ranking function
    - Non-monotonic nature of ranking methods
  - Techniques for avoiding unnecessary DB accesses
    - Skyline sweeping algorithm
    - Block pipeline algorithm

Preliminaries
- Keyword queries on a set of relations
- Joined Tuple Tree (JTT)
  - Tree of tuples
    - Top-k results
  - Foreign key to primary key relationships
  - Candidate Network (CN)
  - Relevance score
    - How relevant the JTT is to the query
- Example query: “maxtor netvista”
[Figure: top-3 JTTs for the example query; JTTs shown: c3, c3->p2, c1->p1, c2->p2, c2->p2<-c3]

Preliminaries: existing solutions (DISCOVER 2002, 2003)
- Enumerating (union) all possible CNs
  - C^Q -> P^Q: valid
  - C^Q -> U: not valid
  - C^Q -> U <- C^Q: may be valid
- Example (cont.)
  - Rules
    - Prune duplicate CNs
    - Prune non-minimal CNs
    - Prune CNs of type: R^Q R^Q
  - DISCOVER (2003)

Preliminaries: existing solutions (DISCOVER 2002, 2003)
- Upper bounding functions
  - Bound the scores of potential answers from each CN
    - Stop query execution earlier
    - Ex) Sparse algorithm, Global pipeline algorithm
- Focus of this paper
  - How to score a JTT: ranking function
  - How to generate and order the SQL queries for the CNs: top-k join query
    - Minimal DB accesses are required before top-k results are returned.

[Figure: two score-sorted lists combined by an aggregate]
  id   score        id   score
  t1   50           I1   70
  t2   40           I2   60
  t3   30           I3   40
  t4   20           I4   20
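The early-stopping idea behind upper bounding can be sketched as follows. This is a simplified reading of the Sparse idea, not the DISCOVER implementation: each CN is assumed to come with a precomputed upper bound and a hypothetical `evaluate` callable that stands in for running the CN's SQL query and returning real scores. The names `sparse_top_k`, `example_cns`, and the numbers are illustrative assumptions.

```python
import heapq


def sparse_top_k(candidate_networks, k):
    """Return (top-k real scores, number of CNs actually evaluated).

    candidate_networks: list of (upper_bound, evaluate) pairs, where
    evaluate() returns the real scores of that CN's results and every
    real score is assumed to be <= upper_bound.
    """
    top = []        # min-heap holding the best k real scores seen so far
    evaluated = 0
    # Visit CNs in descending order of their upper bounds.
    for bound, evaluate in sorted(candidate_networks,
                                  key=lambda cn: cn[0], reverse=True):
        # If the k-th best real score already reaches the bound of the next
        # CN, no remaining CN can improve the answer, so stop early.
        if len(top) == k and top[0] >= bound:
            break
        evaluated += 1
        for score in evaluate():
            if len(top) < k:
                heapq.heappush(top, score)
            elif score > top[0]:
                heapq.heapreplace(top, score)
    return sorted(top, reverse=True), evaluated


# Hypothetical CNs: (upper bound, callable returning real result scores).
example_cns = [
    (90, lambda: [80, 75]),
    (70, lambda: [65]),
    (60, lambda: [10]),
]
top2, cns_run = sparse_top_k(example_cns, k=2)
```

With these toy values only the first CN has to be evaluated for k = 2, because its second-best result already scores at least as high as the next CN's bound.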

Ranking Function
- Problems with existing ranking functions
  - Only monotonic aggregation functions have been considered.
    - SUM
- Discordance with human perception
  - Side effect: overly rewarding contributions of the same keyword in different tuples in the same JTT
[Figure: C^Q -> P^Q]

Ranking Function
- Modeling a JTT as a virtual document
- Attenuating: same keyword in different relations
- Technical issues
  - Expensive cost to compute
- Completeness score and size normalization score
[Figure: tuples C(t1) and P(t1) with keywords K1 and K2, shown twice]
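A small sketch can show why treating the JTT as one virtual document attenuates repeated keywords. The slide does not give the scoring formula, so the dampening `1 + ln(1 + ln(tf))` below is an assumption borrowed from standard pivoted-normalization IR weighting; idf, completeness, and size normalization are left out, and all function names are hypothetical.

```python
import math
from collections import Counter


def dampened_tf(tf):
    """Sublinear term-frequency weight (assumed stand-in for the real formula)."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0


def sum_of_tuple_scores(jtt, query):
    """Monotonic SUM: score every tuple separately, then add the scores."""
    return sum(dampened_tf(t.get(w, 0)) for t in jtt for w in query)


def virtual_document_score(jtt, query):
    """Merge all tuples of the JTT into one document, then score it once."""
    merged = Counter()
    for t in jtt:
        merged.update(t)          # adds term frequencies across tuples
    return sum(dampened_tf(merged[w]) for w in query)


query = ["maxtor", "netvista"]
# JTT A matches both keywords, one per tuple.
jtt_a = [{"maxtor": 1}, {"netvista": 1}]
# JTT B matches the same keyword twice, in two different tuples.
jtt_b = [{"maxtor": 1}, {"maxtor": 1}]

sum_a, sum_b = sum_of_tuple_scores(jtt_a, query), sum_of_tuple_scores(jtt_b, query)
vd_a, vd_b = virtual_document_score(jtt_a, query), virtual_document_score(jtt_b, query)
```

Under SUM the two JTTs tie, while the virtual-document score ranks the JTT covering both keywords higher, because the second occurrence of the same keyword is dampened. The merged score is no longer a monotonic function of the per-tuple scores, which is what motivates the new top-k algorithms.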

Top-k Join algorithm
- None of the existing top-k query processing methods deals with non-monotonic scoring functions
  - c[i]->p[i]
  - max(score(p[1], c[i+1]), score(p[j+1], c[1]))
- Monotonic upper bounding function for the actual scoring function
  - Lemma 1. score(T,Q) can be bounded by a function uscore(T,Q) = 1/(1-s) * min(A, B)
  - max(uscore(c[i+1], p[1]), uscore(c[1], p[j+1]))
[Figure: C(t1) and P(t2) with keywords K1 and K2; one combination marked X]

Top-k Join algorithm
- Skyline Sweeping Algorithm
  - Avoid unnecessary join checking -> minimal number of accesses to the database
  - Dominance relationship among candidates
    - Checking the candidate with the higher upper bound first
    - Priority queue
      - Descending order of the upper bound scores
  - Technical point
    - Duplicate checking
[Figure: candidates ordered by uscore]
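The sweep can be sketched for a two-relation CN as below. This is a minimal illustration under stated assumptions, not the paper's code: the upper bound is taken to be the sum of the two component scores, `joins` is a hypothetical stand-in for the database join check, and `real_score` is any function that never exceeds the bound. A candidate (i, j) is only enqueued after one of the candidates that dominates it has been popped, and a `seen` set does the duplicate checking.

```python
import heapq


def skyline_sweep(left, right, joins, real_score, k):
    """Return (top-k results, number of join checks performed).

    left, right: non-empty lists of (id, score), sorted by descending score.
    joins(a, b): True if tuples a and b join (stands in for a DB probe).
    real_score(sa, sb): actual score, assumed <= sa + sb.
    """
    def uscore(i, j):
        return left[i][1] + right[j][1]

    queue = [(-uscore(0, 0), 0, 0)]   # max-heap via negated upper bounds
    seen = {(0, 0)}                   # duplicate checking
    results = []                      # min-heap of (real score, left id, right id)
    probes = 0
    while queue:
        neg_bound, i, j = heapq.heappop(queue)
        # Stop once the k-th real score reaches the best remaining bound.
        if len(results) == k and results[0][0] >= -neg_bound:
            break
        probes += 1
        (a, sa), (b, sb) = left[i], right[j]
        if joins(a, b):
            item = (real_score(sa, sb), a, b)
            if len(results) < k:
                heapq.heappush(results, item)
            elif item[0] > results[0][0]:
                heapq.heapreplace(results, item)
        # Enqueue the two candidates directly dominated by (i, j).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(queue, (-uscore(ni, nj), ni, nj))
    return sorted(results, reverse=True), probes


# Toy data: score-sorted tuple lists and a hypothetical set of joining pairs.
left = [("t1", 50), ("t2", 40), ("t3", 30), ("t4", 20)]
right = [("I1", 70), ("I2", 60), ("I3", 40), ("I4", 20)]
joining = {("t1", "I2"), ("t3", "I1"), ("t2", "I3")}

top, probes = skyline_sweep(
    left, right,
    joins=lambda a, b: (a, b) in joining,
    real_score=lambda sa, sb: sa + sb - 10,   # assumed to stay below the bound
    k=1,
)
```

In this toy run the top-1 answer is found after three join checks instead of probing all sixteen combinations.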

Top-k Join algorithm
- Large gaps between the upper bound scores and the corresponding real scores
  - Harder to stop early
    - Upper bound of unprocessed candidates >> real score
- Block Pipeline Algorithm
  - Employing a local non-monotonic upper bounding function that bounds the real score of JTTs more accurately
  - Tighter upper bounding: bscore < uscore
  - Signature
    - An ordered sequence of term frequencies for all the query keywords
    - Signature of the block
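The notion of a signature can be illustrated with a short sketch. It only shows how tuples might be grouped into blocks that share a signature; how bscore is then computed per block is not stated on the slide and is not attempted here. The function names and the sample tuples are hypothetical.

```python
from collections import defaultdict


def signature(term_freqs, query):
    """Ordered sequence of term frequencies, one entry per query keyword."""
    return tuple(term_freqs.get(w, 0) for w in query)


def partition_into_blocks(tuples, query):
    """Group tuple ids so that all tuples in a block share one signature."""
    blocks = defaultdict(list)
    for tuple_id, term_freqs in tuples.items():
        blocks[signature(term_freqs, query)].append(tuple_id)
    return dict(blocks)


query = ["maxtor", "netvista"]
# Hypothetical tuples with their keyword frequencies.
tuples = {
    "c1": {"maxtor": 1},
    "c2": {"maxtor": 1, "netvista": 1},
    "c3": {"maxtor": 2, "netvista": 1},
    "c4": {"maxtor": 1},
}
blocks = partition_into_blocks(tuples, query)
```

Because every tuple in a block has the same term frequencies for the query keywords, a bound computed once per block applies to all of its tuples.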

Experiments
- Datasets: IMDB, DBLP and Mondial
- Oracle 10g, MySQL 5.00.18, JDK 1.5
- Implementations: Sparse, Global pipeline (GP), Skyline sweep (SS), Block pipeline (BP)
- Metrics
  - Number of top-1 answers that are relevant (#Rel)
  - Reciprocal rank (R-Rank)
- Relevant answer
  - It must match all the search keywords
  - Its size must be the smallest
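The two metrics can be computed as in the sketch below, which assumes the usual definitions: the reciprocal rank is 1 divided by the rank of the first relevant answer (0 if none is returned), and #Rel counts the queries whose top-1 answer is relevant. The function names and sample rankings are hypothetical.

```python
def reciprocal_rank(ranked_answers, is_relevant):
    """1 / rank of the first relevant answer, or 0.0 if there is none."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if is_relevant(answer):
            return 1.0 / rank
    return 0.0


def count_relevant_top1(runs, is_relevant):
    """Number of queries whose first returned answer is relevant (#Rel)."""
    return sum(1 for ranked in runs if ranked and is_relevant(ranked[0]))


# Hypothetical result lists for three queries; only answer "a" is relevant.
runs = [["a", "b"], ["x", "a"], ["y", "z"]]
relevant = {"a"}

rr_values = [reciprocal_rank(r, relevant.__contains__) for r in runs]
num_rel = count_relevant_top1(runs, relevant.__contains__)
```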

Experiments
- Effectiveness
- Efficiency
  - Observations
    - Fastest: BP
    - SS outperforms Sparse and GP
    - Sparse == GP (GP > Sparse for small k or easy queries)
    - All algorithms are more responsive for smaller k values

Conclusion
- New ranking method
  - Adapts the state-of-the-art IR ranking function and principles
- Query processing method
  - Tailored for our non-monotonic ranking functions
- Extensive experiments on large-scale real databases
  - High precision with high efficiency

Reviews
- Good
  - Detailed explanation of background and existing approach
  - Good paper organization and good examples
- Short of rationale for new algorithms
- Non-monotonicity of Block pipeline algorithm
