Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology, Australia
Outline Motivation of Keyword Search in XML Brief Review of Related Work Existing Problems Construct Structured Query Templates Ranking Function Processing Algorithms Conclusions
Motivation of XML Keyword Search Keyword search is easy to use: users don’t need to know the structure of the XML data or a specific query language. XML data with different structures can be searched in the same way by a keyword query, because the query doesn’t specify the structure of the retrieved results.
Brief Review of Related Work We focus on 4 references using label and term as keyword query format: [YunyaoLi2004VLDB] Schema-Free XQuery. [DanielaFlorescu2002ComputerNetworks] Integrating keyword search into XML query processing. [SaraCohen2003VLDB] XSEarch: A semantic search engine for XML. [WeidongYang2007CIT] Schema-aware keyword search over xml streams. Other relevant work can be found in our paper.
Brief Review of Related Work All four works use label and term as the keyword query format. The difference: the first three share a similar basic strategy that first retrieves the relevant keyword lists and then merges them into results, while the last one first generates a large template that covers all kinds of results w.r.t. the XML schema and then caches the possible results over XML streams. The template-based strategy can obtain better performance [WeidongYang2007CIT]!
Existing Problems [WeidongYang2007CIT] was designed for querying over XML streams, which is not enough for an XML data repository because of the following challenges: Different templates may exist in one XML data repository. Users prefer to see only part of the results, e.g., the top k results. Domain knowledge can help to process labels with the same meaning. Therefore, it is necessary to study how to apply the template-based keyword search strategy to an XML data repository.
Construct Structured Query Templates Example: There are two data sources that conform to schemas t1 and t2, respectively. [Figures: Schema t1 and Schema t2] Keyword query – (year:2006, title:xml, author:philip)
Construct Structured Query Templates Identifying context of keywords Determine the master entities using the labels in the keyword query and the XML schema. Generate a FOR clause for each master entity. Check how often every label occurs under each master entity. Exactly once: generate WHERE clauses. More than once: first cluster, then generate WHERE clauses.
Schema t1 Keyword query – (year:2006, title:xml, author:philip) Step 1: determine each master entity and its corresponding label set. Q1 = “ For $b in bibliography/books/book ” Q2 = “ For $a in bibliography/articles/article ” Step 2: each label occurs only once in each master entity. Q1 += “ Where $b/year=‘2006’ and $b/title.contains(xml) and $b/author.contains(philip)” Q2 += “ Where $a/year=‘2006’ and $a/title.contains(xml) and $a/author.contains(philip)”
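The two steps for schema t1 can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function `build_query`, the variable names, and the rule that all-digit terms become equality tests while other terms become containment tests are assumptions inferred from the example queries on this slide.

```python
def build_query(var, path, conditions):
    """Assemble a FOR clause and a WHERE clause for one master entity.

    var: variable name bound in the FOR clause (hypothetical naming).
    path: path from the schema root to the master entity.
    conditions: (label, term) pairs taken from the keyword query.
    """
    clauses = []
    for label, term in conditions:
        if term.isdigit():
            # Assumption: numeric terms become equality tests, as with year.
            clauses.append(f"${var}/{label}='{term}'")
        else:
            # Assumption: all other terms become containment tests.
            clauses.append(f"${var}/{label}.contains({term})")
    return f"For ${var} in {path} Where " + " and ".join(clauses)


keyword_query = [("year", "2006"), ("title", "xml"), ("author", "philip")]
master_entities = {
    "b": "bibliography/books/book",
    "a": "bibliography/articles/article",
}
# One structured query per master entity, as in Q1 and Q2 for schema t1.
queries = [build_query(v, p, keyword_query) for v, p in master_entities.items()]
```

Each master entity yields one query string, so adding a third entity to `master_entities` would produce a third candidate query without changing the builder.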
Schema t2 Keyword query – (year:2006, title:xml, author:philip) Step 1: determine the master entity and its corresponding label set. Q = “For $bi in bibliography/bib” Step 2: title and author each occur twice in the master entity, so cluster title and author using book and article respectively. Q1 = Q + “For $bo in $bi/book” Q2 = Q + “For $a in $bi/article” Step 3: each label occurs only once in each cluster. Q1 += “ Where $bi/year=‘2006’ and $bo/title.contains(xml) and $bo/author.contains(philip)” Q2 …
Construct Structured Query Templates Identifying returned nodes Step 1: If the cardinality of a master entity is “*” and no cluster operation is activated, we take the master entity as the return node in the constructed queries; Step 2: If the cardinality of a master entity is “*” and clusters are generated, we check the root node of each cluster recursively (back to Step 1); Step 3: If the cardinality of a master entity is not “*”, we probe its ancestor nodes one by one until a node with cardinality “*” is found or the root of the XML schema is reached.
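The three steps for choosing return nodes can be sketched as a small recursive procedure. This is a minimal sketch under stated assumptions: the `SchemaNode` class, its fields, and the cardinalities given to the example nodes are hypothetical and only loosely mirror schemas t1 and t2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class SchemaNode:
    """Hypothetical schema-tree node; '*' marks a repeatable element."""
    name: str
    cardinality: str = "1"
    parent: Optional["SchemaNode"] = field(default=None, repr=False)
    clusters: List["SchemaNode"] = field(default_factory=list, repr=False)


def return_nodes(entity):
    """Return the schema nodes to use as return nodes for one master entity."""
    if entity.cardinality == "*":
        if not entity.clusters:
            return [entity]                      # Step 1: the entity itself
        found = []
        for root in entity.clusters:             # Step 2: recurse on cluster roots
            found.extend(return_nodes(root))
        return found
    # Step 3: climb until a '*' node is found or the schema root is reached.
    node = entity
    while node.parent is not None and node.cardinality != "*":
        node = node.parent
    return [node]


# Schema t2 (assumed cardinalities): bib is repeatable and has two clusters.
bib = SchemaNode("bib", "*")
book = SchemaNode("book", "*", parent=bib)
article = SchemaNode("article", "*", parent=bib)
bib.clusters = [book, article]
t2_names = [n.name for n in return_nodes(bib)]
```

Under these assumed cardinalities, the cluster roots `book` and `article` are returned for schema t2, which matches the return variables `$bo` and `$a` used in the example.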
Keyword query – (year:2006, title:xml, author:philip) Schema t1 Master entities are the returned nodes. Q1 += “$b ” Q2 += “$a ” Schema t2 Roots of clusters are the returned nodes. Q1 += “$bo ” Q2 += “$a ” The complete constructed queries can be found in our paper.
Ranking Function v_m is the master entity node; ω(v_i, t_i) is calculated using the tf*idf weight model. Feature of the function: Score() consists of two parts, ContextScore() and the tf*idf weight, and the former is an upper bound on the score of the results.
Processing Strategy Algorithm 1 generates the structured queries together with their corresponding context scores. Algorithm 2 schedules the query plan according to the following conditions: users’ requirements, e.g., the number of results; the context scores of all generated queries; and the intermediate results.
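One way such scheduling could work is sketched below. It is not the paper's Algorithm 2: it only illustrates how the three conditions (requested number of results, context scores, and intermediate results) can drive early termination when the context score is an upper bound on every result score of a query. The function `top_k` and the `evaluate` callback are hypothetical names.

```python
import heapq


def top_k(queries, evaluate, k):
    """Evaluate structured queries in descending context-score order.

    queries: list of (context_score, query) pairs.
    evaluate: hypothetical callable returning (score, result) pairs for a
        query, where every score is at most that query's context score.
    k: number of results the user asked for.
    """
    if k <= 0:
        return []
    best = []     # min-heap of (score, counter, result) with the current top k
    counter = 0   # tie-breaker so results themselves are never compared
    for context_score, query in sorted(queries, key=lambda q: q[0], reverse=True):
        # Stop early: no remaining query can beat the current k-th best score.
        if len(best) == k and best[0][0] >= context_score:
            break
        for score, result in evaluate(query):
            counter += 1
            item = (score, counter, result)
            if len(best) < k:
                heapq.heappush(best, item)
            elif score > best[0][0]:
                heapq.heapreplace(best, item)
    return [(s, r) for s, _, r in sorted(best, reverse=True)]
```

Because the queries are visited from the highest context score downward, a query is skipped entirely once the k-th best intermediate result already reaches its upper bound.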
Experiments Datasets: SIGMOD Record and three variants of DBLP. Keyword queries: q1 (author:David, title:XML), q2 (year:2002, title:XML)
Conclusions XBridge is proposed to process keyword queries over an XML data repository; it can efficiently find the top k results by evaluating the generated structured queries. A precise ranking function is provided to evaluate the relevance of the results. Limitations of this work: we treat XML schemas as tree patterns; we did not consider reference relationships in XML data.