XSEarch XML Search Engine Jonathan MAMOU October 2002
Motivation
XML
- Getting popular
- Allows meta-data to be embedded into documents
- Data-centric view: exchange format for structured data, meta-data
- Document-centric view: content (text, meta-data)
- Querying data and meta-data
One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
Goodnight Moon by Margaret Brown. Costs Only: $10.55
Brown Bear by Bill Martin Jr. Costs Only: $6.00
Buy our Classic Children's books.
amazing.com
One Fish Two Fish John Meyer Peter Smith 7.95 Goodnight Moon Margaret Brown
A query
Find titles and prices of books by 'Meyer' or 'Smith'
IR Approach
How to deal with tags?
- Discard all tags
  - Simplicity
  - Loss of information (structure)
  - Lower retrieval performance
- Keep tags as keywords
  - How to write the query? "Title price book author Meyer Smith"
IR Approach (cont'd)
- Can't specify that Meyer and Smith are the authors
- Can't specify that title, price and author belong to the same book
- Can't specify the desired output (i.e., titles, prices)
Database approach
    FOR $b IN document("bib.xml")//book
    WHERE $b/author contains 'Meyer'
       OR $b/author contains 'Smith'
    RETURN $b/title $b/price
- Difficult for naive users
- Requires knowledge of document structure
- Dependent on document structure
Our Goal
- Combine IR and database techniques: tags + text
- Simple language
- Logical structure, not physical
- Require knowledge of tag names, not structure
- Queries should work even if the structure changes
- Rank results
Framework
Tree Representation
[Figure: XML tree rooted at bookinfo. Node labels: book, title, author, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95]
We need to find tuples of related title and price nodes.
Another Tree Representation
[Figure: XML tree rooted at bookinfo. Node labels: author, name, book, title, price. Values: Dr. Meyer, M. Brown, Goodnight Moon, One Fish Two Fish, $12.50, Cat in the Hat, $14.95]
Similar document, but with a different hierarchical structure from the previous one. We need to find tuples of related title, author and price nodes.
Interconnection
Consider a title node and a price node
[Figure: XML tree rooted at bookinfo. Node labels: book, title, name, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95. Caption: The lowest common ancestor of the circled nodes]
Intuition: The nodes belong to different book entities
Interconnection (cont'd)
[Figure: XML tree rooted at bookinfo. Node labels: book, title, name, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95. Caption: The lowest common ancestor of the circled nodes]
Intuition: The nodes belong to the same book entity
Interconnection (cont'd)
[Figure: XML tree rooted at bookinfo. Node labels: book, title, name, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95]
Intuition: The nodes belong to the same book entity
Relationship tree
- Nodes n1, n2
- n: their lowest common ancestor
- Tn: the subtree rooted at n
- The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2
Interconnection
We say that n1, n2 are interconnected if either
- the relationship tree does not contain two distinct nodes with the same label, or
- the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1 and n2
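The definition above can be checked directly on a small tree. What follows is a minimal sketch, not the XSEarch implementation: the `Node` class, the function names and the sample tree (one book with two authors, one book without) are assumptions made for illustration. It builds the relationship tree as the two paths from the lowest common ancestor down to n1 and n2, then counts repeated labels.

```python
from collections import Counter


class Node:
    """Hypothetical element node: a label plus a link to its parent."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def relationship_tree(n1, n2):
    """Nodes on the paths from the lowest common ancestor to n1 and n2.

    Assumes n1 and n2 belong to the same tree.
    """
    up1 = path_to_root(n1)
    ancestors2 = {id(a) for a in path_to_root(n2)}
    # First node on n1's upward path that also lies on n2's upward path.
    lca_pos = next(i for i, a in enumerate(up1) if id(a) in ancestors2)
    lca = up1[lca_pos]
    nodes = up1[:lca_pos + 1]
    m = n2
    while m is not lca:
        nodes.append(m)
        m = m.parent
    return nodes


def interconnected(n1, n2):
    counts = Counter(node.label for node in relationship_tree(n1, n2))
    repeated = [label for label, c in counts.items() if c > 1]
    if not repeated:
        return True  # no two distinct nodes share a label
    # Otherwise: exactly one same-label pair, and that pair is (n1, n2).
    return (len(repeated) == 1 and counts[repeated[0]] == 2
            and n1 is not n2 and n1.label == n2.label == repeated[0])


# Illustrative tree (an assumption, not taken from the slides' figures).
bookinfo = Node("bookinfo")
book1 = Node("book", bookinfo)
title1 = Node("title", book1)
author1 = Node("author", book1)
name1 = Node("name", author1)
author2 = Node("author", book1)
name2 = Node("name", author2)
price1 = Node("price", book1)
book2 = Node("book", bookinfo)
title2 = Node("title", book2)
price2 = Node("price", book2)
```

In this sample tree, `title1` and `price1` are interconnected (same book), `title1` and `price2` are not (the label book occurs twice on the connecting paths), and the two author nodes of the first book are interconnected through the second clause of the definition.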
All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected
Star interconnection
[Figure: XML tree rooted at bookinfo. Node labels: book, title, author, name, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95]
The 2 names are not interconnected
Star Interconnection (cont'd)
A set of nodes is star interconnected if all the nodes in the set are interconnected with the same node
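The two notions differ only in how many pairs must be checked. The sketch below assumes the pairwise relation is supplied as a function `connected(a, b)`, and that the star center is itself a member of the set; the slide does not say where the center comes from, so that choice is an assumption. The node names and the `pairs` relation are made up for the example.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set must be interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """Some node of the set (the center) is interconnected with all others."""
    return any(
        all(center == other or connected(center, other) for other in nodes)
        for center in nodes
    )


# Hypothetical pairwise relation: the title is interconnected with each
# name, but the two names are not interconnected with each other.
pairs = {frozenset(("title", "name1")), frozenset(("title", "name2"))}


def connected(a, b):
    return frozenset((a, b)) in pairs


nodes = ["title", "name1", "name2"]
```

Here `nodes` is star interconnected with the title as center, but not all-pairs interconnected, since the two names fail the pairwise test.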
Search terms, Search query
Search term (l,k)
- l: label (context)
- k: keyword
Search query AND:L1 OR:L2
- L1, L2: lists of search terms
- Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
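One possible in-memory form of such a query is sketched below. The class names, the use of an empty string for a missing keyword, and the word-level, case-insensitive matching rule in `term_matches` are all assumptions for illustration; the slides do not specify how a term is matched against a node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SearchTerm:
    label: str    # tag name giving the context; "" if unspecified
    keyword: str  # keyword to look for; "" if unspecified


@dataclass
class SearchQuery:
    and_terms: List[SearchTerm] = field(default_factory=list)  # L1
    or_terms: List[SearchTerm] = field(default_factory=list)   # L2


def term_matches(term, node_label, node_text):
    """Assumed rule: the label must be equal, the keyword must be a word of the text."""
    label_ok = term.label == "" or term.label == node_label
    words = node_text.lower().split()
    keyword_ok = term.keyword == "" or term.keyword.lower() in words
    return label_ok and keyword_ok


# AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
query = SearchQuery(
    and_terms=[SearchTerm("title", ""), SearchTerm("price", "")],
    or_terms=[SearchTerm("author", "Meyer"), SearchTerm("author", "Smith")],
)
```

With this rule, the term (author,Meyer) matches an author node with text "Gina Meyer", while (title,) matches any title node regardless of its text.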
Answer
AND:N1 OR:N2
- N1, N2 are lists of nodes
- Matching between N1, N2 and L1, L2
- N1 and N2 are interconnected
- All all-pairs answers are star answers
- Maximal answer
Example
(title,) (price,) (author,Meyer)
[Figure: XML tree rooted at bookinfo. Node labels: book, title, author, price. Values: Just Lost, Mercy Meyer, Gina Meyer, $5.75; Brown Bear, $13.95]
Find matchings of title, author and price to the nodes in the tree
[Table: columns title, author, price; one entry shown is null]
Computing answers
All-pairs
- Determining whether the set of answers is empty is NP-complete
- If L1 is empty, computing the set of answers is polynomial in the size of the input and output
Star
- Computing the set of answers is polynomial in the size of the input and output
Ranking results
Unstructured
- Keyword weight (tfilf)
- Tag weight
- Result size
Structured
- Node distance
- Ancestor-descendant
Keyword Weight
- Compute the weight of a keyword k within a given node n
- Variation of tfidf, one of the metrics of the Vector Space Model (a classical model in IR)
Keyword Weight (cont'd)
Term Frequency (tf): number of appearances of k within n
    tf(k, n) = occ(k, n) / max_{k'} occ(k', n)
Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus
    ilf(k) = log(1 + N / N_k)
    W(k, n) = tf(k, n) * ilf(k)
Normalized per leaf
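A small numeric sketch of the weight follows. It assumes that N is the number of leaves in the corpus, that N_k is the number of leaves containing k, and that the logarithm is natural; none of these is stated on the slide. The final per-leaf normalization is left out, and the toy corpus is invented for the example.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent word in the leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """log(1 + N / N_k), with N leaves in total and N_k leaves containing k."""
    n_k = sum(1 for words in leaves if keyword in words)
    if n_k == 0:
        return 0.0  # keyword absent from the corpus (a choice made for this sketch)
    return math.log(1 + len(leaves) / n_k)


def weight(keyword, leaf_words, leaves):
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Toy corpus: each leaf is the list of words of one text node.
leaves = [
    ["just", "lost"],
    ["mercy", "meyer"],
    ["gina", "meyer"],
    ["brown", "bear"],
]
```

In this corpus "meyer" occurs in 2 of 4 leaves, so its weight in the leaf ["mercy", "meyer"] is 1 * log(1 + 4/2) = log 3, while the rarer "bear" gets log 5 in its own leaf.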
Tag Weight Give weight to tags according to their importance E.g. give more weight to than to
Result Size Number of search terms appearing in the result (OR part)
Ranking-Structured
- Node distance: size of the relationship tree
- Ancestor-descendant relationship: "more" interconnected
System overview
XSEarch overview
[Figure: architecture diagram. Components: XML corpus with logical hierarchy, Indexer, Search query, Results. Phases: Offline, Online]
Document Location array
- Generate a unique id, did, for each document
- Associate each did with the physical location of the corresponding document
- Logical structure of the corpus
Node Encoding Array
- Generate for each interior node an id, nid
- Node encoding, defined recursively
  - Node encoding of its parent
  - Index of the node among its siblings
- Eg:
- Associate each nid with its node encoding
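The recursive encoding can be sketched as follows. The tuple representation, the empty tuple for the root, the zero-based sibling index and the sample tree are assumptions; for brevity the sketch encodes every node of the sample tree rather than interior nodes only. The `lowest_common_ancestor` helper is not from the slides. It only illustrates one property of such an encoding: the longest common prefix of two encodings is the encoding of the lowest common ancestor.

```python
def encode(tree, prefix=()):
    """Yield (encoding, label) pairs in document order.

    tree is (label, [children]); a node's encoding is its parent's
    encoding followed by the node's index among its siblings.
    """
    label, children = tree
    yield prefix, label
    for index, child in enumerate(children):
        yield from encode(child, prefix + (index,))


def lowest_common_ancestor(enc1, enc2):
    """Longest common prefix of two node encodings."""
    common = []
    for a, b in zip(enc1, enc2):
        if a != b:
            break
        common.append(a)
    return tuple(common)


# Illustrative tree (an assumption): two books, each with a title and a price.
tree = ("bookinfo", [
    ("book", [("title", []), ("price", [])]),
    ("book", [("title", []), ("price", [])]),
])

# The position in this list plays the role of the nid.
node_encoding_array = list(encode(tree))
```

For this tree, nid 3 is the first price node with encoding (0, 1); it shares the prefix (0,) with the first title, so their lowest common ancestor is the first book.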
Node Label Array Associate each nid with its label
Inverted Tag Index
For each tag, keep
- posting list: list of nodes labeled with this tag
- weight
tag -> Nid1, Nid2, Nid3
Inverted Keyword Index
For each keyword (kw), keep a posting list
- list of leaves containing this keyword
- weight of the kw within the leaf (tfilf)
kw -> (Nid1, w1), (Nid2, w2), (Nid3, w3)
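Building such posting lists takes one pass over the leaves. In the sketch below the weighting function is passed in as a parameter, and the stand-in `relative_frequency` is not the tfilf weight, just a placeholder to keep the example self-contained. The nids and leaf texts are invented.

```python
from collections import defaultdict


def build_keyword_index(leaves, weight):
    """Map each keyword to a posting list of (nid, weight) pairs.

    leaves: dict from nid to the list of words in that leaf.
    weight: function (keyword, words) -> float, supplied by the caller.
    """
    index = defaultdict(list)
    for nid, words in sorted(leaves.items()):
        for kw in sorted(set(words)):
            index[kw].append((nid, weight(kw, words)))
    return dict(index)


def relative_frequency(kw, words):
    """Placeholder weight: share of the leaf's words equal to kw."""
    return words.count(kw) / len(words)


# Hypothetical leaves keyed by nid.
leaves = {
    3: ["mercy", "meyer"],
    4: ["gina", "meyer"],
    7: ["brown", "bear"],
}
index = build_keyword_index(leaves, relative_frequency)
```

Looking up `index["meyer"]` then yields the two leaves containing the keyword, each with its weight, in nid order.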
Node Interconnection Matrix
Element ij contains
- 1, if ni and nj are interconnected
- 0, otherwise
n*n symmetric sparse matrix
Dynamic programming
Alternative
- Hash set: keep only interconnected nodes
- Key: pair (ni, nj)
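A hash set keyed by node pairs might look like the sketch below. Because the relation is symmetric, each pair is stored once with the smaller nid first; that normalization and the class name are assumptions, not something the slide specifies.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, keyed by (ni, nj)."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # Order the pair so that (ni, nj) and (nj, ni) share one key.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def connected(self, ni, nj):
        return self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)


# Hypothetical nids.
table = InterconnectionSet()
table.add(2, 5)
table.add(5, 2)  # same pair, stored only once
table.add(2, 3)
```

Compared with the n*n matrix, memory grows with the number of interconnected pairs rather than with n squared, at the cost of a hash lookup per query.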
Interconnection
- Let n be the number of nodes
- It is possible to determine whether n1 and n2 are interconnected in O(n) time
- It is possible to determine interconnection of all pairs in O(n^2)
- Offline/Online computation
Interconnection
    for (i = size-1; i >= 0; i--)
        for (j = i+1; j < size; j++)
            if i is an ancestor of j
                connected(i,j) = connected(iChild,j) AND connected(i,jFather)
                                 AND labelIChild != labelJ AND labelI != labelJFather
        for (j = i+1; j < size; j++)
            if i is not an ancestor of j
                connected(i,j) = connected(i,jFather) AND connected(iFather,j)
                                 AND labelI != labelJFather AND labelIFather != labelJ
Demo