Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.

1 Efficient Keyword Search for Smallest LCAs in XML Database
Yu Xu, Department of Computer Science & Engineering, University of California, San Diego
Yannis Papakonstantinou, Department of Computer Science & Engineering, University of California, San Diego
Rodica Bozianu, XML and Database Systems

2 Abstract
- Keyword search is a proven, user-friendly way to query HTML documents in the World Wide Web.
- The paper proposes keyword search in XML documents, modeled as labeled trees, together with efficient algorithms.
- The search returns the set of smallest trees containing all keywords.

3 Abstract
- Core contribution: the Indexed Lookup Eager algorithm. It exploits key properties of smallest trees and is used when the query contains keywords with significantly different frequencies.
- The Scan Eager algorithm is tuned for keywords with similar frequencies.
- The algorithms are evaluated analytically and experimentally.
- The paper presents the XKSearch system, which utilizes the Indexed Lookup Eager, Scan Eager and Stack algorithms.

4 Outline
1. Introduction
2. Notation
3. Algorithms for finding the SLCA of keyword lists
   1. The Indexed Lookup Eager Algorithm (IL)
   2. Scan Eager Algorithm
   3. The Stack Algorithm
4. XKSearch System Implementation
5. Experiments
6. Conclusions

5 Introduction
According to the Smallest Lowest Common Ancestor (SLCA) semantics, the result of a keyword query is the set of nodes that:
- contain the keywords either in their labels or in the labels of their descendant nodes, and
- have no descendant node that also contains all keywords.

6 Introduction
- Example: a keyword query asking for the relation between "John" and "Ben" returns the node list [0.1.1, 0.1.2, …].
- Expressing the same search in XQuery is complex, and the resulting query is difficult to execute efficiently.

8 Notation
- Each node v of the tree corresponds to an XML element and is labeled with a tag λ(v).
- Each node is assigned a numerical id pre(v) that is compatible with preorder numbering.
- The XKSearch implementation uses Dewey numbers as the ids. They provide a straightforward solution to locating the LCA of two nodes, and the usual < ordering on Dewey numbers is compatible with preorder numbering.
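
To make the Dewey-number remark concrete, here is a minimal Python sketch. It assumes Dewey numbers are held as tuples of integers; the helper names `parse_dewey`, `format_dewey` and `lca` are chosen for this illustration and are not taken from XKSearch. Under that assumption the LCA of two nodes is the longest common prefix of their ids, and Python's built-in tuple ordering agrees with preorder.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse_dewey(text: str) -> Dewey:
    """Turn a dotted id such as '0.1.1' into the tuple (0, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def format_dewey(node: Dewey) -> str:
    """Inverse of parse_dewey."""
    return ".".join(str(component) for component in node)


def lca(a: Dewey, b: Dewey) -> Dewey:
    """Lowest common ancestor = longest common prefix of the two ids."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]
```

Comparing integer tuples rather than raw strings keeps the ordering correct when a component has more than one digit (component 10 sorts after component 9).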

9 Notation
For a list of k keywords w₁, …, wₖ and an input XML tree T:
- An answer subtree is a subtree of T that contains at least one instance of each keyword.
- A smallest answer subtree is an answer subtree none of whose subtrees is an answer subtree.
- slca(w₁, …, wₖ) = the set of the roots of all smallest answer subtrees of w₁, …, wₖ.

10 Notation
- Sᵢ: the keyword list of wᵢ (i.e. the list of nodes whose label directly contains wᵢ, sorted by id).
- v ≺ v′: the node v is an ancestor of node v′.
- v ⪯ v′: v ≺ v′ or v = v′.
- v is an ancestor node in a set S if there exists a node v′ in S such that v ≺ v′.
- If v ≺ v′ then v < v′, since ids are compatible with preorder numbering.
- lca(v₁, …, vₖ): the lowest common ancestor of the given nodes; in the example, lca(…, …) = 0.1.1.

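11 Notation
- Given sets of nodes S₁, …, Sₖ, a node v belongs to lca(S₁, …, Sₖ) if there exist v₁ ∈ S₁, …, vₖ ∈ Sₖ such that v = lca(v₁, …, vₖ).
- v belongs to the smallest lowest common ancestor set slca(S₁, …, Sₖ) if v ∈ lca(S₁, …, Sₖ) and no v′ ∈ lca(S₁, …, Sₖ) satisfies v ≺ v′.
- The result is slca(S₁, …, Sₖ) = removeAncestor(lca(S₁, …, Sₖ)), where removeAncestor removes ancestor nodes from its input.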

12 Notation
- rm(v, S) (lm(v, S)) = the right (left) match of v in the set S: the node of S that has the smallest (biggest) id that is greater (smaller) than or equal to pre(v).
- rm (lm) returns null when there is no right (left) match node.
- Cost: O(d log |S|), that is, O(log |S|) steps to find the right (left) match, each needing O(d) to compare two Dewey numbers (d = depth of the tree).
- descendant(v₁, v₂) returns the other argument when one argument is null, and the descendant node when v₁ and v₂ have an ancestor-descendant relationship. Cost: O(d).
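
The match functions described on this slide can be sketched with a binary search over a sorted keyword list. This is an illustrative sketch only: it assumes Dewey numbers are integer tuples kept in a sorted Python list, uses `None` in place of null, and the function names simply mirror the slide's `lm`, `rm` and `descendant`.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]


def rm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    """Right match: the smallest node of S that is >= v, or None."""
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def lm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    """Left match: the biggest node of S that is <= v, or None."""
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def descendant(v1: Optional[Dewey], v2: Optional[Dewey]) -> Optional[Dewey]:
    """Return the other argument if one is None; otherwise the deeper node.

    Only meaningful when v1 and v2 lie on the same root-to-leaf path,
    in which case the longer Dewey number is the descendant.
    """
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2
```

Each `bisect` call performs O(log |S|) tuple comparisons, and each comparison touches at most d components, which matches the cost stated above.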

14 Algorithms for finding the SLCA
A brute-force solution to the SLCA problem:
- It computes the LCAs of all node combinations and then removes ancestor nodes.
- Complexity: O(kd|S₁|⋯|Sₖ|).
- It is blocking: after it computes an LCA v = lca(v₁, …, vₖ) for some v₁ ∈ S₁, …, vₖ ∈ Sₖ, it cannot report v as an answer, since there might be another set of k nodes u₁, …, uₖ such that v ≺ lca(u₁, …, uₖ).
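
The brute-force approach can be written down in a few lines, which also serves as a reference for checking the faster algorithms. This is a sketch under the assumption that each keyword list is a list of Dewey numbers stored as integer tuples; the names `remove_ancestors` and `slca_brute_force` and the sample lists are made up for this example.

```python
from functools import reduce
from itertools import product
from typing import Iterable, List, Tuple

Dewey = Tuple[int, ...]


def lca(a: Dewey, b: Dewey) -> Dewey:
    """Longest common prefix of two Dewey numbers."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def remove_ancestors(nodes: Iterable[Dewey]) -> List[Dewey]:
    """Drop every node that has a proper descendant in the input.

    After sorting, the descendants of a node follow it directly, so it is
    enough to look at the immediate successor.
    """
    ordered = sorted(set(nodes))
    return [
        v for i, v in enumerate(ordered)
        if i + 1 == len(ordered) or ordered[i + 1][:len(v)] != v
    ]


def slca_brute_force(lists: List[List[Dewey]]) -> List[Dewey]:
    """LCA of every combination (one node per list), then remove ancestors."""
    candidates = {reduce(lca, combo) for combo in product(*lists)}
    return remove_ancestors(candidates)


# Hypothetical keyword lists, not the ones from the slides.
john = [(0, 1, 1, 0), (0, 2, 0)]
ben = [(0, 1, 1, 1), (0, 2, 1), (0, 3)]
result = slca_brute_force([john, ben])
```

Nothing can be reported before `product` is exhausted, which is exactly the blocking behavior noted on the slide.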

16 Indexed Lookup Eager Algorithm
Preferred when the keyword search includes at least one low-frequency keyword. Based on four properties of SLCAs.
Property (1): slca({v}, S) = {descendant(lca(v, lm(v, S)), lca(v, rm(v, S)))}
Observations:
- for any two nodes v₁, v₂ to the right of a node v, if pre(v) < pre(v₁) < pre(v₂) then lca(v, v₂) ⪯ lca(v, v₁);
- for any two nodes v₁, v₂ to the left of a node v, if pre(v₂) < pre(v₁) < pre(v) then lca(v, v₂) ⪯ lca(v, v₁).

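17 Indexed Lookup Eager Algorithm
Property (2): slca({v}, S₂, …, Sₖ) = slca(slca({v}, S₂, …, Sₖ₋₁), Sₖ) for k > 2
Property (3): slca(S₁, …, Sₖ) = removeAncestor(⋃ slca({v₁}, S₂, …, Sₖ)), with the union taken over all v₁ ∈ S₁
This leads to an algorithm to compute slca(S₁, …, Sₖ):
- compute slca({vᵢ}, S₂, …, Sₖ) for each vᵢ ∈ S₁ (1 ≤ i ≤ n, n = |S₁|);
- the answer is removeAncestor applied to the union of these results.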

18 Indexed Lookup Eager Algorithm
Benefit over brute force: for each node v₁ in S₁, the algorithm does not compute lca(v₁, v₂, …, vₖ) for all v₂ ∈ S₂, …, vₖ ∈ Sₖ. It computes a single lca(v₁, v₂, …, vₖ), where each vᵢ (2 ≤ i ≤ k) is computed by the match functions (lm and rm).
Complexity: O(|S₁| · Σ d log |Sᵢ|), summing over 2 ≤ i ≤ k, or O(|S₁| k d log |S|), where |S| is the maximum size of the keyword lists.
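
As a rough illustration of how the match functions replace the enumeration of all combinations, the sketch below computes one candidate per node of the smallest list and then removes ancestors. It is a simplified, non-streaming reading of the slides, not the paper's listing: Dewey numbers are assumed to be integer tuples in sorted lists, `bisect` stands in for the index lookup, and the names `slca_of_node` and `slca_indexed_lookup` are invented for this example.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]


def lca(a: Optional[Dewey], b: Optional[Dewey]) -> Optional[Dewey]:
    """Longest common prefix; None if either argument is None."""
    if a is None or b is None:
        return None
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def rm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def lm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def descendant(v1: Optional[Dewey], v2: Optional[Dewey]) -> Optional[Dewey]:
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2


def slca_of_node(v: Dewey, others: List[List[Dewey]]) -> Optional[Dewey]:
    """Fold the other lists in one at a time, two lookups per list."""
    x: Optional[Dewey] = v
    for S in others:
        x = descendant(lca(x, lm(x, S)), lca(x, rm(x, S)))
        if x is None:  # an empty keyword list: no answer at all
            return None
    return x


def slca_indexed_lookup(lists: List[List[Dewey]]) -> List[Dewey]:
    lists = sorted(lists, key=len)  # start from the smallest list
    first, others = lists[0], lists[1:]
    found = {slca_of_node(v, others) for v in first} - {None}
    ordered = sorted(found)
    # Remove ancestors: in sorted order a descendant directly follows.
    return [
        v for i, v in enumerate(ordered)
        if i + 1 == len(ordered) or ordered[i + 1][:len(v)] != v
    ]
```

Each node of the smallest list triggers only two lookups per remaining list, which is where the |S₁| factor in the complexity comes from.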

19 Indexed Lookup Eager Algorithm
Worked example (figure): Result = {} ∪ {0.1.1}, then Result = {0.1.1} ∪ {1.2.0} = {0.1.1, 1.2.0}.

20 Indexed Lookup Eager Algorithm
Subroutine and derivation of an algorithm to compute slca(S₁, …, Sₖ).
Property (4): slca(S₁, …, Sₖ) = slca(slca(S₁, …, Sₖ₋₁), Sₖ) for k > 2
Applied directly, this gives a blocking algorithm: it only processes the last keyword list after it completely processes the first k−1 keyword lists.

21 Indexed Lookup Eager Algorithm
- All nodes in xᵢ except the last one are guaranteed to be SLCAs.
- The last node is carried on to the next operations.
- Repeat the operation for all groups of P nodes of Sᵢ.
- The smaller P is, the faster the algorithm produces the first SLCA.
- No operations are needed to remove ancestor nodes from a set.
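
The "carry the last node" idea can be illustrated with a generator that emits answers as soon as they are safe, roughly corresponding to P = 1. This is a sketch under stated assumptions rather than the algorithm listing from the paper: Dewey numbers are integer tuples in sorted lists, `bisect` replaces the index, and `slca_eager` is a hypothetical name. It relies on the candidates being produced for nodes of the first list in ascending order, so a new candidate can only be an ancestor of the carried one, a descendant of it, or an unrelated later node.

```python
from bisect import bisect_left, bisect_right
from typing import Iterator, List, Optional, Tuple

Dewey = Tuple[int, ...]


def lca(a: Optional[Dewey], b: Optional[Dewey]) -> Optional[Dewey]:
    if a is None or b is None:
        return None
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def rm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None


def lm(v: Dewey, S: List[Dewey]) -> Optional[Dewey]:
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None


def descendant(v1: Optional[Dewey], v2: Optional[Dewey]) -> Optional[Dewey]:
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2


def slca_eager(lists: List[List[Dewey]]) -> Iterator[Dewey]:
    lists = sorted(lists, key=len)
    first, others = lists[0], lists[1:]
    carried: Optional[Dewey] = None  # last candidate, not yet confirmed
    for v in first:
        x: Optional[Dewey] = v
        for S in others:
            x = descendant(lca(x, lm(x, S)), lca(x, rm(x, S)))
            if x is None:
                return  # some keyword never occurs: no answers
        if carried is None:
            carried = x
        elif x <= carried:
            continue  # x is the carried node or one of its ancestors
        elif x[:len(carried)] == carried:
            carried = x  # carried node was an ancestor of x: replace it
        else:
            yield carried  # no later candidate can fall below it
            carried = x
    if carried is not None:
        yield carried
```

Because confirmed nodes are yielded immediately and only one node is carried, no separate ancestor-removal pass over a result set is needed.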

22 Indexed Lookup Eager Algorithm
Example (figure trace): keywords “Class”, “John” and “Ben”, with P = 1. B = [0.1.0]; B = ø; B = {}; B = [0.1.1]; output v = 0.1.1 (line #13).

24 Scan Eager Algorithm
- Used when the occurrences of the keywords do not differ significantly.
- Its lm and rm implementations scan the keyword lists to find matches, keeping a cursor for each keyword list.
- Observation: nodes from different lists may not be accessed in order.
- Indexed Lookup Eager finds matches by lookups; Scan Eager finds matches by scanning the keyword lists.

25 Scan Eager Algorithm
Complexity: O(kd|S₁| + d · Σ |Sᵢ|), summing over all keyword lists, or O(kd|S|), where |S| is the maximum size of the keyword lists.
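
A cursor-based variant can be sketched as follows. To keep every cursor moving strictly forward, this sketch takes the matches of each node v of the smallest list directly (rather than of an intermediate result) and keeps the shallowest of the per-list results; that is a simplification chosen for the example and not necessarily how the paper's listing is organized. Dewey numbers are assumed to be integer tuples in sorted lists, and `slca_scan_eager` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]


def lca(a: Optional[Dewey], b: Optional[Dewey]) -> Optional[Dewey]:
    if a is None or b is None:
        return None
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def descendant(v1: Optional[Dewey], v2: Optional[Dewey]) -> Optional[Dewey]:
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return v1 if len(v1) >= len(v2) else v2


def slca_scan_eager(lists: List[List[Dewey]]) -> List[Dewey]:
    lists = sorted(lists, key=len)
    first, others = lists[0], lists[1:]
    cursors = [0] * len(others)  # index of the first node >= current v
    candidates = []
    for v in first:  # ascending order, so cursors never move back
        x = v
        for j, S in enumerate(others):
            c = cursors[j]
            while c < len(S) and S[c] < v:
                c += 1
            cursors[j] = c
            right = S[c] if c < len(S) else None
            left = right if right == v else (S[c - 1] if c > 0 else None)
            u = descendant(lca(v, left), lca(v, right))
            if u is None:
                return []  # empty keyword list
            if len(u) < len(x):
                x = u  # keep the shallowest per-list ancestor of v
        candidates.append(x)
    ordered = sorted(set(candidates))
    return [
        w for i, w in enumerate(ordered)
        if i + 1 == len(ordered) or ordered[i + 1][:len(w)] != w
    ]
```

Each cursor advances at most |Sᵢ| positions over the whole run, so the scanning work is linear in the total list size instead of logarithmic per lookup.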

27 The Stack Algorithm
- Each stack entry has a pair of components: an id and a keywords array.
- Id: the id components from the bottom entry up to a stack entry en, read in order, give the Dewey number of the node denoted by en.
- Keywords: an array of length k of boolean values; keywords[i] = T means that the subtree rooted at the node denoted by the stack entry directly or indirectly contains the keyword wᵢ.

28 The Stack Algorithm

29 The Stack Algorithm
Example: “Class”, “John” and “Ben”
Keyword lists: [0.1.0, 0.1.1], [0.0.0, …, …], […, …, …]
- Initially the stack is empty; v = 0.0.0, p = NULL. Add the non-matching components to the stack.
- Second iteration: v = 0.1.0 (the next smallest node), p = lca(stack, v) = 0. Pop the top 2 entries of the stack (the important information is carried over), then add the non-matching components.
Stack snapshots from the figure, listed top to bottom, with keyword flags in the column order J B C:
- after the first iteration: 0 TFF, 0 FFF, 0 FFF
- after popping 2 entries: 0 TFF
- after adding the non-matching components: 0 FFT, 1 FFF, 0 TFF

30 The Stack Algorithm
…
- Seventh iteration: v = …, p = lca(…, …) = 0. Pop the top 4 entries of the stack; when popping the third component we find an SLCA, which is output.
- An entry that is not an SLCA passes its keyword witness information to the top entry.
- Complexity: …; the number of lca operations and the number of Dewey number comparisons are ….
Stack snapshots from the figure: 1TTT 1FTT 0TFF 0FTF 2FFF 1TFT 1FTT 0TFF 0TTT 0FTF 2FFF 1TFT 1FTT 0TFF 2FTF 1TFT 1FTT 0TFF
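
The stack mechanics described on these slides can be sketched as below. This is an illustrative reading, not the paper's listing: Dewey numbers are assumed to be integer tuples, `heapq.merge` provides the "next smallest node" across all keyword lists, and the third field of each stack entry (whether an SLCA was already found underneath) is bookkeeping introduced for this sketch so that ancestors of a reported node are never reported themselves. The name `slca_stack` is hypothetical.

```python
import heapq
from typing import List, Tuple

Dewey = Tuple[int, ...]


def slca_stack(lists: List[List[Dewey]]) -> List[Dewey]:
    k = len(lists)
    # Merge all keyword lists into one stream of (node, keyword index).
    stream = heapq.merge(*[[(v, i) for v in S] for i, S in enumerate(lists)])
    # Each entry: [id component, keyword flags, SLCA already found below?]
    stack: List[list] = []
    results: List[Dewey] = []

    def pop_entry() -> None:
        component, flags, done = stack.pop()
        if not done and all(flags):
            path = tuple(e[0] for e in stack) + (component,)
            results.append(path)  # contains every keyword: an SLCA
            done = True
        if stack:
            parent = stack[-1]
            if done:
                parent[2] = True  # ancestors can no longer be answers
            else:
                # Pass keyword witness information to the new top entry.
                parent[1] = [a or b for a, b in zip(parent[1], flags)]

    for v, i in stream:
        # p = number of leading components shared by the stack and v.
        p = 0
        while p < len(stack) and p < len(v) and stack[p][0] == v[p]:
            p += 1
        while len(stack) > p:
            pop_entry()
        for component in v[p:]:  # add the non-matching components
            stack.append([component, [False] * k, False])
        stack[-1][1][i] = True
    while stack:
        pop_entry()
    return results
```

The stack never holds more entries than the depth of the tree, and every node of every keyword list is visited exactly once.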

31 The Stack Algorithm
The Scan Eager algorithm has several advantages over the Stack algorithm:
- The Scan Eager algorithm starts from the smallest keyword list and does not have to scan to the end of every keyword list, so it may terminate much earlier than the Stack algorithm.
- The number of lca operations of the Scan Eager algorithm is usually much smaller than that of the Stack algorithm.
- The Stack algorithm operates on a stack whose depth is bounded by the depth of the input tree, while the Scan Eager algorithm with P = 1 only needs to keep three nodes in the whole process, and no push/pop operations are involved.

33 XKSearch System Implementation
The Indexed Lookup Eager, Scan Eager and Stack algorithms are implemented in Java using the Apache Xerces XML parser and Berkeley DB.

34 XKSearch System Implementation
The architecture: the B-tree structure allows an efficient implementation of the match operations.

35 XKSearch System Implementation
- The level table LT has d entries (d = depth of the input tree).
- LT(i) = the maximum number of bits needed to store the i-th component in a Dewey number: LT(i) = ⌈log₂ c⌉, where c is the number of children of the node at level i−1 that has the maximum number of children among all nodes at the same level.
- In general, ⌈(LT(1) + … + LT(i)) / 8⌉ bytes are needed for a Dewey number of a node at level i.
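
A small sketch of the level-table idea follows. It assumes that LT(i) is ⌈log₂ c⌉ bits, that child ordinals at each level run from 0 to c−1, and that the largest ordinal seen at a level is a fair stand-in for c; the formula is not preserved in this transcript, so treat it as an assumption. The names `level_table` and `dewey_size_bytes` are invented for the example.

```python
from typing import List, Sequence, Tuple

Dewey = Tuple[int, ...]


def level_table(nodes: Sequence[Dewey]) -> List[int]:
    """Bits needed per Dewey component position, taken over all nodes."""
    depth = max(len(v) for v in nodes)
    table = []
    for i in range(depth):
        # c = number of distinct ordinals that position i must represent.
        c = max(v[i] for v in nodes if len(v) > i) + 1
        # (c - 1).bit_length() equals ceil(log2(c)) for every c >= 1.
        table.append((c - 1).bit_length())
    return table


def dewey_size_bytes(level: int, table: Sequence[int]) -> int:
    """Bytes for a Dewey number with `level` components, rounded up."""
    bits = sum(table[:level])
    return (bits + 7) // 8


nodes = [(0, 1, 0), (0, 1, 1, 0), (0, 2, 5)]
table = level_table(nodes)
```

With this table a three-component id in the sample data packs into a single byte, whereas storing each component in a fixed-width integer would take several bytes.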

36 XKSearch System Implementation
Indexed Lookup Eager algorithm:
- The keyword lists are stored in a single B+ tree where keywords are the primary key and Dewey numbers are the secondary key.
- For a keyword w and a Dewey number p, it takes a single scan operation to find the right and left match of p in the keyword list of w.
- The number of disk accesses is O(k|S₁|), and it cannot be more than B₁ + … + Bₖ (Bᵢ = the number of blocks of keyword list Sᵢ).

37 XKSearch System Implementation
Scan Eager algorithm:
- The keys in the B+ tree are simply keywords.
- The data associated with each key w is the list of Dewey numbers of the nodes directly containing the keyword w.
- The number of disk accesses: O(B₁ + … + Bₖ).

39 Experiments
- Similarities among the Scan Eager, Indexed Lookup Eager and Stack algorithms. However, the differences between the performance of the algorithms for cold cache are not as significant as those in the hot-cache experiments. The reason is that most keyword lists do not take many pages.
- The size of the keyword lists and the time to construct them are proportional to the size of the input XML documents.
- XKSearchB stores Dewey numbers without using a level table. On average:
  - the size of the indexes constructed by XKSearch is 65% of that of XKSearchB;
  - the construction time of XKSearch is 55% of that of XKSearchB;
  - the query response time of XKSearch for hot cache is 70% of that of XKSearchB.

41 Conclusions
- The XKSearch system takes a list of keywords as input and returns the set of Smallest Lowest Common Ancestor nodes.
- The complexity of the Indexed Lookup Eager algorithm: O(kd|S₁| log |S|), where S₁ is the smallest keyword list and |S| is the size of the largest one.
- The Indexed Lookup Eager algorithm outperforms other algorithms, often by orders of magnitude, when the keywords have different frequencies.
- The Scan Eager algorithm is the best variant for the case where the keywords have similar frequencies.

42 Thank you!

