DivQ: Diversification for Keyword Search over Structured Databases Elena Demidova, Peter Fankhauser, Xuan Zhou and Wolfgang Nejdl L3S Research Center,


1 DivQ: Diversification for Keyword Search over Structured Databases
Elena Demidova, Peter Fankhauser, Xuan Zhou and Wolfgang Nejdl
L3S Research Center, Hannover, Germany; Fraunhofer IPSI, Darmstadt, Germany; CSIRO ICT Centre, Australia
SIGIR 2010
2010. 12. 17. Jaehui Park

2 INTRODUCTION
• Keyword search over structured data
  – No single interpretation of a keyword query can satisfy all users.
  – Multiple interpretations may yield overlapping results.
• Diversification
  – Minimizing the risk of user dissatisfaction by balancing relevance and novelty of search results
• An example
  – Query: "London"
    – location: the capital of the UK
    – name: a book written by Jack London
  – The occurrences can be viewed as keyword interpretations with different semantics, offering complementary results.
Copyright © 2010 by CEBT

3 INTRODUCTION
• Motivation
  – Taking advantage of the structure of the databases
    – Query interpretation in terms of the underlying database
    – To deliver more diverse and orthogonal representations of query results (e.g., attributes)
• Contributions
  – DivQ
    – A probabilistic query disambiguation model
    – A diversification scheme for generating top-k query interpretations
  – Evaluation metrics for structured data
    – α-nDCG-W
    – WS-recall

4 The Diversification Scheme
• Query interpretations
  – A keyword query -> a set of structured queries
• Ranking the query interpretations
  – Providing a quick overview of the available classes of results
  – Faceted search: navigate and choose

Q: CONSIDERATION CHRISTOPHER GUEST

Top-3 interpretations: ranking
  Relevance  Interpretation
  0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
  0.5        A director CHRISTOPHER GUEST
  0.8        An actor CHRISTOPHER GUEST in a movie CONSIDERATION

Top-3 interpretations: diversification (increasing novelty)
  Relevance  Interpretation
  0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
  0.4        An actor CHRISTOPHER GUEST
  0.2        A plot containing CHRISTOPHER GUEST of a movie

5 The Diversification Scheme
• Bringing Keywords into Structure
  – Keyword interpretations A_i:k_i
    – Mapping each keyword k_i to an element A_i of an algebraic expression
  – (Predefined) query template T joining the keyword interpretations
    – A structural pattern that is frequently used to query the database
  – An example
    – Keyword query (K): CONSIDERATION CHRISTOPHER GUEST
    – Keyword interpretations: director:CHRISTOPHER, director:GUEST, movie:CONSIDERATION
    – T: A director X of a movie Y
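The mapping from a keyword query to a set of candidate interpretations can be sketched as follows. This is only an illustration: the `CANDIDATE_ATTRIBUTES` index and the function name are hypothetical, and the slides do not say how candidate attributes are looked up or how templates are matched.

```python
from itertools import product

# Hypothetical index: attributes in which each keyword occurs (illustrative values only).
CANDIDATE_ATTRIBUTES = {
    "CONSIDERATION": ["movie", "plot"],
    "CHRISTOPHER": ["director", "actor"],
    "GUEST": ["director", "actor"],
}

def enumerate_interpretations(keywords, candidates):
    """Yield every assignment of one attribute per keyword.

    Each assignment is a frozenset of (attribute, keyword) pairs, i.e. a set I
    of keyword interpretations A_i:k_i.
    """
    options = [[(attr, kw) for attr in candidates[kw]] for kw in keywords]
    for combo in product(*options):
        yield frozenset(combo)

interpretations = list(
    enumerate_interpretations(["CONSIDERATION", "CHRISTOPHER", "GUEST"], CANDIDATE_ATTRIBUTES)
)
```

A template such as "A director X of a movie Y" would then keep only those sets whose attributes it can join; that filtering step is left out here.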

6 The Diversification Scheme
• Estimating Query Relevance
  – Relevance of a query interpretation Q to the informational need K
    – P(Q|K) = P(I,T|K), where T is a query template and I is a set of keyword interpretations
  – Assumptions
    – Each keyword has one particular interpretation.
    – The probability of a keyword interpretation is independent of the part of the query interpretation the keyword is not interpreted to.
  – Attribute-specific term frequency (e.g., the average number of co-occurrences)
    – e.g., rank higher: a first name and a last name of a person both mapped to the attribute "name"
  – Formula annotations
    – The probability that, given that A_j is a part of a query interpretation, the keyword interpretation A_j:k_j is also a part of the query interpretation
    – Smoothing factor
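Under the two independence assumptions above, one plausible way to score an interpretation is to multiply a template probability by one probability per keyword interpretation. The sketch below assumes that form; the additive smoothing, the function names and all parameters are illustrative assumptions, since the slide's own formula is not part of this transcript.

```python
from math import prod

def keyword_interpretation_prob(term_freq, attr_total, vocab_size, alpha=1.0):
    """Smoothed estimate of P(A:k | A) from attribute-specific term frequency.

    term_freq:  occurrences of keyword k in attribute A
    attr_total: total term occurrences in attribute A
    vocab_size: number of distinct terms in attribute A
    alpha:      additive smoothing factor (assumed form, not taken from the slides)
    """
    return (term_freq + alpha) / (attr_total + alpha * vocab_size)

def relevance(template_prob, interpretation_probs):
    """Unnormalized score: P(T) times the product of the keyword-interpretation probabilities."""
    return template_prob * prod(interpretation_probs)
```

The smoothing keeps a keyword that never occurs in an attribute from forcing the whole product to zero.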

7 The Diversification Scheme
• Estimating Query Similarity
  – The Jaccard coefficient between the sets of keyword interpretations I_1 and I_2 contained in Q_1 and Q_2: Sim(Q_1, Q_2) = |I_1 ∩ I_2| / |I_1 ∪ I_2|
• Combining Relevance and Similarity
  1. Select the most relevant interpretation as the first interpretation presented to the user.
  2. Each of the following interpretations is selected based on both its relevance and its novelty with respect to the selected query interpretation set.
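The Jaccard similarity is standard; how it is combined with relevance is not recoverable from this transcript. The sketch below therefore uses a generic trade-off (relevance minus average similarity to the already selected interpretations, weighted by a hypothetical parameter `lam`), which should be read as an assumption rather than the authors' exact formula.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of keyword interpretations."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diversified_score(relevance, interpretation, selected, lam=0.5):
    """Trade relevance against average similarity to already selected interpretations.

    lam = 1.0 ranks purely by relevance; smaller values reward novelty more.
    """
    if not selected:
        return lam * relevance
    avg_sim = sum(jaccard(interpretation, s) for s in selected) / len(selected)
    return lam * relevance - (1 - lam) * avg_sim
```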

8 The Diversification Scheme
• The diversification algorithm
  – Materializing the top-k relevant query interpretations
  – Worst case: O(l*r)
    – l: the number of query interpretations in L
    – r: the number of query interpretations in the result list R
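A naive greedy selection loop in the spirit of this slide might look like the sketch below. It is not the authors' algorithm: it rescans all remaining candidates in every round and so does not achieve the O(l*r) bound quoted above, the scoring formula and `lam` are assumptions, and the interpretation sets attached to each candidate are illustrative guesses (only the relevance values echo the earlier example).

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diversify(candidates, k, lam=0.5):
    """Greedily pick k interpretations, balancing relevance and novelty.

    candidates: list of (label, relevance, frozenset of keyword interpretations).
    """
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = []

    def score(c):
        # Assumed trade-off: relevance minus average similarity to what is already chosen.
        if not selected:
            return lam * c[1]
        avg_sim = sum(jaccard(c[2], s[2]) for s in selected) / len(selected)
        return lam * c[1] - (1 - lam) * avg_sim

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [label for label, _, _ in selected]

# Illustrative candidates; the interpretation sets are assumptions.
CANDIDATES = [
    ("director of movie", 0.9, frozenset({"director:CHRISTOPHER", "director:GUEST", "movie:CONSIDERATION"})),
    ("actor in movie", 0.8, frozenset({"actor:CHRISTOPHER", "actor:GUEST", "movie:CONSIDERATION"})),
    ("director", 0.5, frozenset({"director:CHRISTOPHER", "director:GUEST"})),
    ("actor", 0.4, frozenset({"actor:CHRISTOPHER", "actor:GUEST"})),
    ("plot", 0.2, frozenset({"plot:CHRISTOPHER", "plot:GUEST"})),
]
```

With these inputs, `diversify(CANDIDATES, 3, lam=1.0)` reduces to plain relevance ranking, while a small value such as `lam=0.1` prefers interpretations that share no keyword interpretations with those already chosen.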

9 EVALUATION METRICS
• α-nDCG-W
  – Example gains for documents D1 to D6: 3, 2, 3, 0, 1, 2
  – CG_n (Cumulative Gain)
    – e.g., 3+2+3+0+1+2 = 11
  – DCG_i (Discounted Cumulative Gain)
    – e.g., DCG_1 = 3, DCG_2 = 3 + 2/log_2 2 = 5, DCG_3 = 3 + (2/log_2 2 + 3/log_2 3) ≈ 6.893
  – nDCG_i = DCG_i / ideal DCG_i
  – α-nDCG
    – Views a document as a set of information nuggets n
    – Counts how many documents containing n were seen before and discounts the gain of this document accordingly
    – If α = 0, it is the standard nDCG
    – With increasing α, novelty is rewarded with more credit
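The worked DCG example can be checked with a few lines of code. The sketch follows the convention used on the slide (rank 1 undiscounted, rank j >= 2 divided by log2(j)); the function names are illustrative.

```python
from math import log2

def dcg(gains, i):
    """DCG at rank i (1-based): rank 1 is undiscounted, rank j >= 2 is divided by log2(j)."""
    return sum(g if j == 1 else g / log2(j) for j, g in enumerate(gains[:i], start=1))

def ndcg(gains, i):
    """DCG at rank i divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), i)
    return dcg(gains, i) / ideal if ideal else 0.0

gains = [3, 2, 3, 0, 1, 2]   # D1..D6 from the slide
cg = sum(gains)              # cumulative gain over all six documents
```

Here `dcg(gains, 2)` gives 5 and `dcg(gains, 3)` gives roughly 6.893.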

10 EVALUATION METRICS
• α-nDCG-W
  – In databases
    – An information nugget n corresponds to a primary key pk_i
  – The gain
  – The overlap (overlap factor)
    – For each primary key pk_i in the result of Q_k, count how many query interpretations with pk_i were seen before, and aggregate the counts
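The novelty discount with primary keys as nuggets can be illustrated with the standard α-nDCG gain, where each nugget contributes (1 - α) raised to the number of earlier results that already contained it. The sketch below shows only that unweighted gain; the relevance weighting implied by the "-W" variant is not reproduced, because its formula is not in this transcript.

```python
from collections import Counter

def novelty_gains(results, alpha):
    """Per-rank gains for a ranked list of query interpretations.

    results: list of sets of primary keys, one set per interpretation, in rank order.
    alpha:   novelty parameter; 0 gives no discount for previously seen keys.
    """
    seen = Counter()          # how many earlier interpretations returned each key
    gains = []
    for keys in results:
        gains.append(sum((1 - alpha) ** seen[pk] for pk in keys))
        seen.update(keys)
    return gains
```

For example, with results `[{1, 2}, {2, 3}]` and `alpha=0.5`, the second interpretation earns 1.5 instead of 2 because key 2 was already returned once.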

11 EVALUATION METRICS
• Weighted S-Recall
  – S-recall
    – Instance recall at rank k when search results are related to several subtopics
    – The number of unique subtopics covered by the first k results, divided by the total number of subtopics
  – A primary key corresponds to a subtopic in S-recall
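Plain S-recall, as defined above, is straightforward to compute when each result is represented by the set of primary keys it returns. The sketch below covers only the unweighted measure; how the weighted variant (WS-recall) assigns weights is not stated in this transcript, so it is omitted. The function name is illustrative.

```python
def s_recall(results, k, all_subtopics):
    """Fraction of all subtopics (primary keys) covered by the first k results.

    results:       ranked list of sets of primary keys
    all_subtopics: set of every primary key relevant to the query
    """
    all_subtopics = set(all_subtopics)
    if not all_subtopics:
        return 0.0
    covered = set().union(*results[:k]) & all_subtopics
    return len(covered) / len(all_subtopics)
```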

12 EXPERIMENTS
• IMDB: 10,000,000 records
• Lyrics: 400,000 records
• Query logs: MSN, AOL
  – 200 most frequent queries (single query)
  – 100 queries (complex queries)

13 EXPERIMENTS
• User Study
  – 16 participants were asked to assess relevance on a two-point Likert scale
    – top-25 interpretations

14 EXPERIMENTS
• α-nDCG-W
  – α = 0, 0.5, and 0.99

15 EXPERIMENTS
• WS-recall
• Balancing Relevance and Novelty

16 CONCLUSION
• We present an approach to search result diversification over structured data.
  – A probabilistic query disambiguation model
  – A query similarity measure
  – A greedy algorithm
• Adaptations of established evaluation metrics are proposed.
  – α-nDCG-W and WS-recall
• Evaluation results demonstrate the quality of the proposed model and show that our algorithms can substantially improve the novelty of keyword search results over structured data.

