
Efficient Keyword Search for Smallest LCAs in XML Database
Yu Xu, Department of Computer Science & Engineering, University of California, San Diego
Yannis Papakonstantinou, Department of Computer Science & Engineering, University of California, San Diego
Rodica Bozianu, XML and Database Systems

Abstract
- Keyword search is a proven, user-friendly way to query HTML documents on the World Wide Web.
- The paper presents efficient algorithms for keyword search in XML documents, modeled as labeled trees.
- The result of a query is the set of smallest trees containing all keywords.

Abstract (continued)
- Core contribution: the Indexed Lookup Eager algorithm. It exploits key properties of smallest trees and is used when the query contains keywords with significantly different frequencies.
- The Scan Eager algorithm is tuned for keywords with similar frequencies.
- The algorithms are evaluated analytically and experimentally.
- The paper presents the XKSearch system, which utilizes the Indexed Lookup Eager, Scan Eager and Stack algorithms.

Outline
1. Introduction
2. Notation
3. Algorithms for finding the SLCA of keyword lists
   1. The Indexed Lookup Eager Algorithm (IL)
   2. The Scan Eager Algorithm
   3. The Stack Algorithm
4. XKSearch System Implementation
5. Experiments
6. Conclusions

Introduction
According to the Smallest Lowest Common Ancestor (SLCA) semantics, the result of a keyword query is the set of nodes that:
- contain the keywords either in their labels or in the labels of their descendant nodes, and
- have no descendant node that also contains all keywords.

Introduction
- Example: a keyword query asking for the relation between "John" and "Ben" returns the node list [0.1.1, 0.1.2, 0.2.0.0].
- XQuery: expressing the same request is complex, and the resulting query is difficult to execute efficiently.

Notation
- Each node v of the tree corresponds to an XML element and is labeled with a tag λ(v).
- Each node v has a numerical id pre(v).
- The XKSearch implementation uses Dewey numbers as the ids. Dewey numbers:
  - provide a straightforward solution to locating the LCA of two nodes (their longest common prefix);
  - are compatible with preorder numbering, e.g. 0.1.0.0.0 < 0.1.1.1.
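A minimal Python sketch of how Dewey numbers support these two operations, assuming each id is held as a tuple of integers. The helper names `parse`, `lca` and `is_ancestor` are illustrative and are not taken from XKSearch.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse(text: str) -> Dewey:
    """Turn a dotted id such as '0.1.1' into the tuple (0, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def lca(u: Dewey, v: Dewey) -> Dewey:
    """The LCA of two nodes is the longest common prefix of their ids."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def is_ancestor(u: Dewey, v: Dewey) -> bool:
    """True when u is a proper ancestor of v (u is a strict prefix of v)."""
    return len(u) < len(v) and v[:len(u)] == u


# Python compares tuples lexicographically, with a prefix sorting first,
# which coincides with preorder on Dewey ids.
print(parse("0.1.0.0.0") < parse("0.1.1.1"))          # True
print(lca(parse("0.1.1.1.0"), parse("0.1.1.2.0")))    # (0, 1, 1)
```

Holding ids as tuples means no custom comparator is needed for sorting keyword lists in document order.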

Notation
For a list of k keywords $w_1, \ldots, w_k$ and an input XML tree T:
- an answer subtree is a subtree of T that contains at least one instance of each keyword;
- a smallest answer subtree is an answer subtree none of whose subtrees is an answer subtree;
- $slca(w_1, \ldots, w_k)$ = the set of the roots of all smallest answer subtrees of $w_1, \ldots, w_k$.

Notation
- $S_i$: the keyword list of $w_i$ (i.e. the list of nodes whose label directly contains $w_i$, sorted by id).
- $u \prec_a v$: the node u is an ancestor of node v.
- $u \preceq_a v$: $u \prec_a v$ or $u = v$.
- u is an ancestor node in a set S if there exists a node $v \in S$ such that $u \prec_a v$.
- If $u \prec_a v$ then $pre(u) < pre(v)$.
- $lca(v_1, \ldots, v_k)$: the lowest common ancestor of $v_1, \ldots, v_k$, e.g. lca(0.1.1.1.0, 0.1.1.2.0) = 0.1.1.

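Notation
- Given k sets of nodes $S_1, \ldots, S_k$, a node $v \in lca(S_1, \ldots, S_k)$ if there exist $v_1 \in S_1, \ldots, v_k \in S_k$ such that $v = lca(v_1, \ldots, v_k)$.
- v belongs to the smallest lowest common ancestor (SLCA) set $slca(S_1, \ldots, S_k)$ if $v \in lca(S_1, \ldots, S_k)$ and no $u \in lca(S_1, \ldots, S_k)$ satisfies $v \prec_a u$.
- The result is $slca(S_1, \ldots, S_k) = removeAncestor(lca(S_1, \ldots, S_k))$, where removeAncestor removes ancestor nodes from its input.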

Notation
- $rm(v, S)$ ($lm(v, S)$) = the right (left) match of v in the set S: the node of S that has the smallest (biggest) id that is greater (smaller) than or equal to pre(v).
  - Returns null when there is no right (left) match node.
  - Cost: $O(d \log |S|)$, i.e. $O(\log |S|)$ steps to find the right (left) match, each needing $O(d)$ to compare two Dewey numbers, where d is the depth of the tree.
- $descendant(v_1, v_2)$ returns the other argument when one argument is null, and the descendant node when $v_1$ and $v_2$ have an ancestor-descendant relationship. Cost: $O(d)$.
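The match functions can be sketched with binary search over a sorted keyword list. This is an illustrative in-memory version, assuming Dewey ids are integer tuples and using `None` for null; the list `BEN` is taken from the Stack Algorithm example in this deck.

```python
from bisect import bisect_left, bisect_right


def rm(v, s):
    """Right match: the node of s with the smallest id >= v, or None."""
    i = bisect_left(s, v)
    return s[i] if i < len(s) else None


def lm(v, s):
    """Left match: the node of s with the biggest id <= v, or None."""
    i = bisect_right(s, v)
    return s[i - 1] if i > 0 else None


def descendant(v1, v2):
    """Return the other argument if one is None, otherwise the deeper node.

    Assumes v1 and v2 lie on one root-to-leaf path, so the longer id is
    the descendant.
    """
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2


BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(lm((0, 1, 1), BEN))   # (0, 1, 0, 0, 0)
print(rm((0, 1, 1), BEN))   # (0, 1, 1, 2, 0)
```

Each `bisect` call performs a logarithmic number of tuple comparisons, and each comparison touches at most d components, which mirrors the stated cost.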

Algorithms for finding the SLCA
A brute-force solution to the SLCA problem:
- Computes the LCAs of all node combinations and then removes ancestor nodes: $slca(S_1, \ldots, S_k) = removeAncestor(lca(S_1, \ldots, S_k))$.
- Complexity: $O(k d |S_1| \cdots |S_k|)$.
- It is blocking: after it computes an LCA $v = lca(v_1, \ldots, v_k)$ for some $v_1 \in S_1, \ldots, v_k \in S_k$, it cannot report v as an answer, since there might be another set of k nodes $u_1, \ldots, u_k$ such that $v \prec_a lca(u_1, \ldots, u_k)$.
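The brute-force definition translates almost directly into code. The sketch below assumes Dewey ids as integer tuples and reuses the keyword lists of the Stack Algorithm example; all function names are illustrative.

```python
from itertools import product


def lca(*nodes):
    """Longest common prefix of any number of Dewey ids."""
    prefix = []
    for comps in zip(*nodes):
        if len(set(comps)) != 1:
            break
        prefix.append(comps[0])
    return tuple(prefix)


def is_ancestor(u, v):
    """True when u is a proper ancestor of v."""
    return len(u) < len(v) and v[:len(u)] == u


def remove_ancestors(nodes):
    """Keep only the nodes that have no descendant in the same set."""
    nodes = set(nodes)
    return sorted(v for v in nodes if not any(is_ancestor(v, u) for u in nodes))


def brute_force_slca(*keyword_lists):
    """LCA of every node combination, then one remove-ancestor pass."""
    return remove_ancestors(lca(*combo) for combo in product(*keyword_lists))


CLASS = [(0, 1, 0), (0, 1, 1)]
JOHN = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(brute_force_slca(CLASS, JOHN, BEN))   # [(0, 1, 1)]
```

Nothing can be reported until `product` has been exhausted, which is the blocking behaviour described above.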

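Indexed Lookup Eager Algorithm
- Preferred when the keyword search includes at least one low-frequency keyword.
- Based on four properties of SLCAs.
- Property (1): $slca(\{v\}, S) = \{descendant(lca(v, lm(v, S)), lca(v, rm(v, S)))\}$
- Observations:
  - for any two nodes $v_1, v_2$ to the right of a node v, if $pre(v) < pre(v_1) < pre(v_2)$ then $lca(v, v_2) \preceq_a lca(v, v_1)$;
  - for any two nodes $v_1, v_2$ to the left of a node v, if $pre(v_2) < pre(v_1) < pre(v)$ then $lca(v, v_2) \preceq_a lca(v, v_1)$.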

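Indexed Lookup Eager Algorithm
- Property (2): $slca(\{v\}, S_1, \ldots, S_k) = slca(slca(\{v\}, S_1, \ldots, S_{k-1}), S_k)$ for k > 2
- Property (3): $slca(S_1, \ldots, S_k) = removeAncestor(\bigcup_{v_1 \in S_1} slca(\{v_1\}, S_2, \ldots, S_k))$
- This leads to an algorithm to compute $slca(S_1, \ldots, S_k)$:
  - it computes $\{x_i\} = slca(\{v_i\}, S_2, \ldots, S_k)$ for each $v_i \in S_1$ (1≤i≤n);
  - the answer is $removeAncestor(\{x_1, \ldots, x_n\})$.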

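Indexed Lookup Eager Algorithm
- Benefit over brute force: for each node $v_1$ in $S_1$, the algorithm does not compute $lca(v_1, v_2, \ldots, v_k)$ for all $v_2 \in S_2, \ldots, v_k \in S_k$.
- It computes a single $lca(v_1, v_2, \ldots, v_k)$, where each $v_i$ (2≤i≤k) is found by the match functions (lm and rm).
- Complexity: $O(|S_1| \sum_{i=2}^{k} d \log |S_i|)$ or $O(|S_1| k d \log |S|)$, where $|S_1|$ and $|S|$ are the sizes of the smallest and the largest keyword list and d is the depth of the tree.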
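A sketch of the lookup-based computation built from Properties (1) to (3), assuming sorted in-memory lists of Dewey tuples. It still ends with a remove-ancestor pass, so it is the non-eager form; the function names are illustrative rather than the paper's.

```python
from bisect import bisect_left, bisect_right


def lca(u, v):
    """Longest common prefix of two Dewey ids; None if either is None."""
    if u is None or v is None:
        return None
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def descendant(v1, v2):
    """The non-None argument, or the deeper of two nodes on one path."""
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2


def slca_single(v, s):
    """Property (1): slca({v}, S) from the left and right match of v in S."""
    i = bisect_left(s, v)
    right = s[i] if i < len(s) else None
    j = bisect_right(s, v)
    left = s[j - 1] if j > 0 else None
    return descendant(lca(v, left), lca(v, right))


def is_ancestor(u, v):
    return len(u) < len(v) and v[:len(u)] == u


def indexed_lookup_slca(keyword_lists):
    """Properties (2) and (3); assumes every keyword list is non-empty."""
    lists = sorted(keyword_lists, key=len)     # smallest list plays S1
    s1, rest = lists[0], lists[1:]
    candidates = set()
    for v in s1:
        x = v
        for s in rest:                          # Property (2): fold list by list
            x = slca_single(x, s)
        candidates.add(x)
    # Property (3): drop every candidate that has a descendant candidate
    return sorted(x for x in candidates
                  if not any(is_ancestor(x, y) for y in candidates))


CLASS = [(0, 1, 0), (0, 1, 1)]
JOHN = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(indexed_lookup_slca([CLASS, JOHN, BEN]))   # [(0, 1, 1)]
```

The outer loop runs once per node of the smallest list, and each inner step costs two binary searches, which is where the $|S_1| k d \log |S|$ shape of the bound comes from.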

Indexed Lookup Eager Algorithm
[Figure: two steps of the algorithm on nodes v, x and u. After the first step, Result = {} ∪ {0.1.1}; after the second step, Result = {0.1.1} ∪ {1.2.0} = {0.1.1, 1.2.0}.]

Indexed Lookup Eager Algorithm
- The subroutine: derivation of an algorithm to compute $slca(S_1, \ldots, S_k)$.
- Property (4): $slca(S_1, \ldots, S_k) = slca(slca(S_1, \ldots, S_{k-1}), S_k)$ for k > 2
- Applied directly, this gives a blocking algorithm: it only processes the last keyword list after it completely processes the first k-1 keyword lists.

Indexed Lookup Eager Algorithm
- All nodes in $x_i$ except the last one are guaranteed to be SLCAs.
- The last node is carried over to the next operation.
- The operation is repeated for all groups of P nodes of $S_1$.
- The smaller P is, the faster the algorithm produces the first SLCA.
- No operations are needed to remove ancestor nodes from a set.
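The eager behaviour can be sketched as a Python generator that reads the smallest list in groups of P nodes, reports every SLCA as soon as it is confirmed, and carries only the last candidate forward. This is a reconstruction from the properties on these slides, not the paper's pseudocode; `get_slca`, `indexed_lookup_eager` and the parameter `p` are illustrative names.

```python
from bisect import bisect_left, bisect_right


def lca(u, v):
    if u is None or v is None:
        return None
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def slca_single(v, s):
    """slca({v}, S) via the left and right match of v in the sorted list s."""
    i, j = bisect_left(s, v), bisect_right(s, v)
    right = lca(v, s[i] if i < len(s) else None)
    left = lca(v, s[j - 1] if j > 0 else None)
    found = [c for c in (left, right) if c is not None]
    return max(found, key=len)            # the deeper of the two candidates


def is_ancestor(u, v):
    return len(u) < len(v) and v[:len(u)] == u


def get_slca(s1, s2):
    """SLCAs of two sorted lists without a separate remove-ancestor pass."""
    result, u = [], None
    for v in s1:
        x = slca_single(v, s2)
        if u is None or u < x:
            if u is not None and not is_ancestor(u, x):
                result.append(u)          # u can no longer gain a descendant
            u = x
        # otherwise x equals u or is an ancestor of u, so it is discarded
    if u is not None:
        result.append(u)
    return result


def indexed_lookup_eager(keyword_lists, p=1):
    """Yield SLCAs in document order, reading S1 in groups of p nodes."""
    lists = sorted(keyword_lists, key=len)
    if any(not s for s in lists):
        return
    s1, rest = lists[0], lists[1:]
    v = None                              # last candidate, not yet confirmed
    for start in range(0, len(s1), p):
        block = s1[start:start + p]
        for s in rest:
            block = get_slca(block, s)
        for x in block:
            if v is None or v < x:
                if v is not None and not is_ancestor(v, x):
                    yield v
                v = x
    if v is not None:
        yield v


CLASS = [(0, 1, 0), (0, 1, 1)]
JOHN = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(list(indexed_lookup_eager([CLASS, JOHN, BEN], p=1)))   # [(0, 1, 1)]
```

A caller can stop consuming the generator after the first few answers, which is the practical benefit of a small P.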

Indexed Lookup Eager Algorithm
Example: "Class", "John" and "Ben" with P = 1
[Figure: trace of the algorithm. The buffer B first holds [0.1.0] and the output B is empty; B then holds [0.1.1], and the algorithm outputs v = 0.1.1 (line #13).]

Scan Eager Algorithm
- Used when the occurrences of keywords do not differ significantly.
- Its lm and rm implementations scan keyword lists to find matches, keeping a cursor for each keyword list.
- Observation: nodes from different lists may not be accessed in order.
- Indexed Lookup Eager finds matches by lookups; Scan Eager finds matches by scanning the keyword lists.
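A sketch of match-by-scanning for two sorted lists, assuming Dewey ids as integer tuples. A single forward-only cursor replaces the binary searches; lists are folded one after another, so each list is queried in ascending order. This is a simplified illustration (whole lists rather than groups of P nodes), and the names are not the paper's.

```python
def lca(u, v):
    """Longest common prefix of two Dewey ids; None if either is None."""
    if u is None or v is None:
        return None
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def is_ancestor(u, v):
    return len(u) < len(v) and v[:len(u)] == u


def scan_get_slca(s1, s2):
    """SLCAs of two non-empty sorted lists, matching by a forward scan of s2."""
    result, u, pos = [], None, 0
    for v in s1:
        # advance the cursor to the first node of s2 with id greater than v
        while pos < len(s2) and s2[pos] <= v:
            pos += 1
        left = s2[pos - 1] if pos > 0 else None       # biggest id <= v
        right = s2[pos] if pos < len(s2) else None    # smallest id > v
        found = [c for c in (lca(v, left), lca(v, right)) if c is not None]
        x = max(found, key=len)                       # deeper candidate
        if u is None or u < x:
            if u is not None and not is_ancestor(u, x):
                result.append(u)
            u = x
    if u is not None:
        result.append(u)
    return result


def scan_slca(keyword_lists):
    """Fold the lists pairwise; assumes at least two non-empty lists."""
    lists = sorted(keyword_lists, key=len)
    result = lists[0]
    for s in lists[1:]:
        result = scan_get_slca(result, s)
    return result


CLASS = [(0, 1, 0), (0, 1, 1)]
JOHN = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(scan_slca([CLASS, JOHN, BEN]))   # [(0, 1, 1)]
```

The cursor never moves backwards, so each list is read at most once per pass; this pays off when the lists have similar sizes and most of their nodes would be visited anyway.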

Scan Eager Algorithm
Complexity: $O(k d |S_1| + d \sum_{i=2}^{k} |S_i|)$ or $O(k d |S|)$

The Stack Algorithm
Each stack entry has a pair of components:
- Id: if the id components from the bottom entry to a stack entry en are $id_1, \ldots, id_m$, then en denotes the node with the Dewey number $id_1.id_2.\ldots.id_m$.
- Keywords: an array of length k of boolean values. keywords[i] = T means that the subtree rooted at the node denoted by the stack entry directly or indirectly contains the keyword $w_i$.

The Stack Algorithm
Example: "Class", "John" and "Ben"
Keyword lists: [0.1.0, 0.1.1], [0.0.0, 0.1.1.1.0, 0.2.0.0], [0.1.0.0.0, 0.1.1.2.0, 0.2.0.1]
Stacks are listed from top to bottom; the flags are in the order J, B, C.
- Initially: the stack is empty.
- First iteration: v = 0.0.0, p = NULL. Add the non-matching components to the stack: 0 TFF, 0 FFF, 0 FFF.
- Second iteration: v = 0.1.0 (the next smallest node), p = lca(stack, v) = 0.
  - Pop out the top 2 entries of the stack (the important information is carried): 0 TFF.
  - Add the non-matching components: 0 FFT, 1 FFF, 0 TFF.

The Stack Algorithm
…
Stacks are listed from top to bottom; the flags are in the order J, B, C.
- Seventh iteration: v = 0.2.0.0, p = lca(0.1.1.2.0, 0.2.0.0) = 0.
  - The initial stack: 0 FTF, 2 FFF, 1 TFT, 1 FTT, 0 TFF.
  - Pop out the top 4 entries of the stack.
  - The first popped entry is not an SLCA, so its keyword witness information is passed to the new top entry: 2 FTF, 1 TFT, 1 FTT, 0 TFF.
  - When popping out the third component we find an SLCA (1 TTT, 1 FTT, 0 TFF): the algorithm outputs 0.1.1 as an SLCA.
- Complexity: the number of lca operations and the number of Dewey number comparisons both grow with the total size of the keyword lists, $\sum_{i=1}^{k} |S_i|$, since every node of every list is processed.
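A compact sketch of a stack-based SLCA computation in the spirit of this example, assuming Dewey ids as integer tuples. Each stack entry holds one id component and the keyword flags; the third field, `blocked`, is an addition of this sketch (not something shown on the slides) that keeps ancestors of a reported SLCA from being reported themselves.

```python
import heapq


def stack_slca(keyword_lists):
    """Return SLCAs by merging all keyword lists in document order."""
    k = len(keyword_lists)
    # one stream of (node, keyword index) pairs, sorted by Dewey id
    stream = heapq.merge(*[[(v, i) for v in s]
                           for i, s in enumerate(keyword_lists)])
    stack = []      # entries: [id component, keyword flags, blocked]
    found = []

    def pop():
        comp, flags, blocked = stack.pop()
        path = tuple(e[0] for e in stack) + (comp,)
        if not blocked and all(flags):
            found.append(path)            # smallest subtree with all keywords
            blocked = True
        if stack:
            parent = stack[-1]
            if blocked:
                parent[2] = True          # an SLCA lies below the parent
            else:
                # pass keyword witness information to the new top entry
                parent[1] = [a or b for a, b in zip(parent[1], flags)]

    for v, i in stream:
        # p = number of leading components shared by the stack and v
        p = 0
        while p < len(stack) and p < len(v) and stack[p][0] == v[p]:
            p += 1
        while len(stack) > p:
            pop()
        for comp in v[p:]:                # add the non-matching components
            stack.append([comp, [False] * k, False])
        stack[-1][1][i] = True            # v directly contains keyword i
    while stack:
        pop()
    return found


CLASS = [(0, 1, 0), (0, 1, 1)]
JOHN = [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)]
BEN = [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)]
print(stack_slca([CLASS, JOHN, BEN]))   # [(0, 1, 1)]
```

Every node of every list passes through the loop once, so the work is driven by the total size of the keyword lists, whatever their relative frequencies.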

The Stack Algorithm
The Scan Eager algorithm has several advantages over the Stack algorithm:
- The Scan Eager algorithm starts from the smallest keyword list and does not have to scan to the end of every keyword list, so it may terminate much earlier than the Stack algorithm.
- The number of lca operations of the Scan Eager algorithm is usually much less than that of the Stack algorithm.
- The Stack algorithm operates on a stack whose depth is bounded by the depth of the input tree, while the Scan Eager algorithm with P = 1 only needs to keep three nodes in the whole process, and no push/pop operations are involved.

XKSearch System Implementation
The Indexed Lookup Eager, Scan Eager and Stack algorithms are implemented in Java using the Apache Xerces XML parser and Berkeley DB.

XKSearch System Implementation
The architecture: [figure not preserved in the transcript]
The B-tree structure allows an efficient implementation of the match operations.

XKSearch System Implementation
- The level table LT has d entries (d = depth of the input tree).
- LT(i) = the maximum number of bits needed to store the i-th component in a Dewey number: $LT(i) = \lceil \log_2 c \rceil$, where c is the number of children of the node at level i-1 that has the maximum number of children among all nodes at the same level.
- In general, $\lceil \sum_{j=1}^{i} LT(j) / 8 \rceil$ bytes are needed for the Dewey number of a node at level i.

XKSearch System Implementation
Indexed Lookup Eager algorithm:
- The keyword lists are stored in a single B+ tree where keywords are the primary key and Dewey numbers are the secondary key.
- For a keyword w and a Dewey number p, it takes a single scan operation to find the right and left match of p in the keyword list of w.
- The number of disk accesses is $O(k |S_1|)$, and it cannot be more than $\sum_{i=1}^{k} B_i$ ($B_i$ = the number of blocks of keyword list $S_i$).
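The composite-key layout can be imitated in memory with one sorted list of (keyword, Dewey number) pairs. The class below is a stand-in for the B+ tree and does not use the Berkeley DB API; `KeywordIndex` and its method names are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right


class KeywordIndex:
    """All keyword lists in one sorted sequence keyed by (keyword, Dewey id)."""

    def __init__(self, postings):
        # postings: dict mapping keyword -> list of Dewey tuples
        self.keys = sorted((w, d) for w, ids in postings.items() for d in ids)

    def right_match(self, w, p):
        """Smallest Dewey id >= p in the list of w, or None."""
        i = bisect_left(self.keys, (w, p))
        if i < len(self.keys) and self.keys[i][0] == w:
            return self.keys[i][1]
        return None

    def left_match(self, w, p):
        """Biggest Dewey id <= p in the list of w, or None."""
        i = bisect_right(self.keys, (w, p))
        if i > 0 and self.keys[i - 1][0] == w:
            return self.keys[i - 1][1]
        return None


index = KeywordIndex({
    "john": [(0, 0, 0), (0, 1, 1, 1, 0), (0, 2, 0, 0)],
    "ben": [(0, 1, 0, 0, 0), (0, 1, 1, 2, 0), (0, 2, 0, 1)],
})
print(index.left_match("ben", (0, 1, 1)))    # (0, 1, 0, 0, 0)
print(index.right_match("ben", (0, 1, 1)))   # (0, 1, 1, 2, 0)
```

Because the keyword is the leading part of the key, one positioned seek lands next to both matches, and a keyword mismatch at that position signals that no match exists.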

XKSearch System Implementation
Scan Eager algorithm:
- The keys in the B+ tree are simply keywords.
- The data associated with each key w is the list of Dewey numbers of the nodes directly containing the keyword w.
- The number of disk accesses: $O(\sum_{i=1}^{k} B_i)$, where $B_i$ is the number of blocks of keyword list $S_i$.

Experiments
- For cold cache, the relations among the Scan Eager, Indexed Lookup Eager and Stack algorithms are similar to those for hot cache. However, the differences between the performance of the algorithms for cold cache are not as significant as those in the hot cache experiments. The reason is that most keyword lists do not take many pages.
- The size of the keyword lists and the time to construct them are proportional to the size of the input XML documents.
- XKSearchB stores Dewey numbers without using a level table. On average:
  - the size of the indexes constructed by XKSearch is 65% of that of XKSearchB;
  - the construction time of XKSearch is 55% of that of XKSearchB;
  - the query response time of XKSearch for hot cache is 70% of that of XKSearchB.

Conclusions
- The XKSearch system takes a list of keywords as input and returns the set of Smallest Lowest Common Ancestor nodes.
- The complexity of the Indexed Lookup Eager algorithm is $O(k d |S_1| \log |S|)$.
- The Indexed Lookup Eager algorithm outperforms other algorithms, often by orders of magnitude, when the keywords have different frequencies.
- The Scan Eager algorithm is the best variant for the case where the keywords have similar frequencies.

Thank you!
