XSEarch XML Search Engine Jonathan MAMOU October 2002.


1 XSEarch XML Search Engine Jonathan MAMOU October 2002

2 Motivation

3 XML. Getting popular. Allows metadata to be embedded into documents. Data-centric view: exchange format for structured data (metadata). Document-centric view: content (text, metadata). Querying data and metadata.

4 amazing.com: Buy our Classic Children’s books. One Fish Two Fish by John Meyer & Peter Smith, Costs Only: $7.95. Goodnight Moon by Margaret Brown, Costs Only: $10.55. Brown Bear by Bill Martin Jr., Costs Only: $6.00.

5 [XML markup not preserved in this transcript.] One Fish Two Fish, John Meyer, Peter Smith, 7.95; Goodnight Moon, Margaret Brown, 10.55; ....

6 A query: Find titles and prices of books by ‘Meyer’ or ‘Smith’.

7 IR Approach. How to deal with tags? Discard all tags: simplicity, but loss of information (structure) -> lower retrieval performance. Keep tags as keywords: how to write the query? “Title price book author Meyer Smith”

8 IR Approach (cont’d). Can’t specify that Meyer and Smith are the authors. Can’t specify that title, price and author belong to the same book. Can’t specify the desired output (i.e., titles, prices).

9 Database approach.
    FOR $b IN document("bib.xml")//book
    WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
    RETURN $b/title $b/price
Difficult for a naive user. Requires knowledge of the document structure. Dependent on the document structure.
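
For comparison, here is a minimal Python sketch of what the query above asks for, using the standard library's `xml.etree.ElementTree`. The inline document and its tag names (`bookinfo`, `book`, `title`, `author`, `price`) are assumptions, since the original bib.xml is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml; tag names are assumed, not taken from the talk.
BIB_XML = """
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>
"""

def titles_and_prices(xml_text, names=("Meyer", "Smith")):
    """Return (title, price) for each book having an author that contains one of `names`."""
    root = ET.fromstring(xml_text)
    results = []
    for book in root.iter("book"):  # any depth, like //book
        # Only direct <author> children are inspected, like $b/author.
        authors = [a.text or "" for a in book.findall("author")]
        if any(name in author for author in authors for name in names):
            results.append((book.findtext("title"), book.findtext("price")))
    return results

print(titles_and_prices(BIB_XML))  # [('One Fish Two Fish', '7.95')]
```

Like the query, this sketch depends on the document structure: if author names were nested one level deeper, `book.findall("author")` would still find the elements but their text would no longer contain the names.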

10 Our Goal. Combine IR and database techniques: tags + text. Simple language. Logical structure, not physical. Require knowledge of tag names, not structure. Queries should work even if the structure changes. Rank results.

11 Framework

12 Tree Representation. [Figure: tree rooted at bookinfo with two book nodes. First book: title "Just Lost", authors Mercy Meyer and Gina Meyer, price $5.75. Second book: title "Brown Bear", price $13.95.] We need to find tuples of related title and price nodes.

13 Another Tree Representation. [Figure: tree rooted at bookinfo with author, name, book, title and price nodes. Names: Dr. Meyer, M. Brown. Books: "Goodnight Moon"; "One Fish Two Fish", $12.50; "Cat in the Hat", $14.95.] Similar document, but with a different hierarchical structure from the previous one. We need to find tuples of related title, author and price nodes.

14 Interconnection. Consider a title node and a price node. [Figure: bookinfo tree with books "Just Lost" (names Mercy Meyer and Gina Meyer, price $5.75) and "Brown Bear" (price $13.95); two nodes are circled and their lowest common ancestor is marked.] Intuition: the nodes belong to different book entities.

15 Interconnection (cont’d). [Figure: bookinfo tree with books "Just Lost" (names Mercy Meyer and Gina Meyer, price $5.75) and "Brown Bear" (price $13.95); two nodes are circled and their lowest common ancestor is marked.] Intuition: the nodes belong to the same book entity.

16 Interconnection (cont’d). [Figure: bookinfo tree with books "Just Lost" (names Mercy Meyer and Gina Meyer, price $5.75) and "Brown Bear" (price $13.95).] Intuition: the nodes belong to the same book entity.

17 Relationship tree. Let n1, n2 be nodes, n their lowest common ancestor, and Tn the subtree rooted at n. The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2.

18 Interconnection. We say that n1, n2 are interconnected if either the relationship tree does not contain 2 distinct nodes with the same label, or the relationship tree contains exactly one pair of distinct nodes with the same label and this pair consists of n1 and n2.
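
The two definitions above can be sketched in Python as follows. This is an illustrative sketch, not the system's implementation: the tree is assumed to be given as a `parent` map and a `label` map, and the node ids and the toy tree are hypothetical. Since everything except n1, n2 and their ancestors is pruned, the relationship tree of two nodes is simply the path between them through their lowest common ancestor.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root]."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def relationship_tree(n1, n2, parent):
    """Nodes of the relationship tree, listed along the path n1 .. LCA .. n2."""
    up1 = path_to_root(n1, parent)
    up2 = path_to_root(n2, parent)
    ancestors1 = set(up1)
    lca = next(n for n in up2 if n in ancestors1)  # first shared ancestor
    left = up1[: up1.index(lca) + 1]               # n1 .. lca
    right = up2[: up2.index(lca)]                  # n2 .. child of lca
    return left + right[::-1]

def interconnected(n1, n2, parent, label):
    """True if no label repeats in the relationship tree, except possibly on n1 and n2."""
    seen = {}  # label -> node carrying it
    for n in relationship_tree(n1, n2, parent):
        if label[n] in seen and {seen[label[n]], n} != {n1, n2}:
            return False
        seen[label[n]] = n
    return True

# Toy tree (hypothetical ids): bookinfo(0) has two books (1, 4),
# each with a title (2, 5) and a price (3, 6).
PARENT = {0: None, 1: 0, 2: 1, 3: 1, 4: 0, 5: 4, 6: 4}
LABEL = {0: "bookinfo", 1: "book", 2: "title", 3: "price",
         4: "book", 5: "title", 6: "price"}

print(interconnected(2, 3, PARENT, LABEL))  # same book
print(interconnected(2, 6, PARENT, LABEL))  # different books
```

A title and a price under the same book are interconnected; a title and a price under different books are not, because the path between them contains two distinct book nodes.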

19 All-Pairs Interconnection. A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected.

20 Star interconnection. [Figure: bookinfo tree with books "Just Lost" (title, author, name nodes Mercy Meyer and Gina Meyer, price $5.75) and "Brown Bear" (title, price $13.95).] The 2 names are not interconnected.

21 Star Interconnection (cont’d). A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node.
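
A small Python sketch of the two set-level notions, written against an arbitrary pairwise predicate. The function names and the toy data are hypothetical, and the sketch assumes that the common node of a star ("the same node") is itself a member of the set, which the slide does not state explicitly.

```python
from itertools import combinations

def all_pairs_interconnected(nodes, interconnected):
    """True if every pair of nodes in the set is interconnected."""
    return all(interconnected(a, b) for a, b in combinations(nodes, 2))

def star_interconnected(nodes, interconnected):
    """True if some node of the set (assumed to be the star center)
    is interconnected to every other node of the set."""
    return any(
        all(interconnected(center, n) for n in nodes if n != center)
        for center in nodes
    )

# Toy predicate: a title interconnected with two names that are not
# interconnected with each other.
PAIRS = {frozenset({"title", "name1"}), frozenset({"title", "name2"})}

def related(a, b):
    return frozenset({a, b}) in PAIRS

NODES = ["title", "name1", "name2"]
print(star_interconnected(NODES, related))
print(all_pairs_interconnected(NODES, related))
```

With this toy predicate the set is star interconnected (around the title) but not all-pairs interconnected, since the two names are not interconnected with each other.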

22 Search terms, search query. Search term (l,k): l is a label (context), k is a keyword. Search query AND:L1 OR:L2, where L1, L2 are lists of search terms. Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
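
One possible in-memory representation of such a query, with a small parser for the textual form shown on the slide. The class and function names are hypothetical, and the parser is only a sketch that assumes the `AND:... OR:...` layout with `(label,keyword)` terms; either part of a term may be empty.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SearchTerm:
    label: str    # l: tag name (context); may be empty
    keyword: str  # k: keyword; may be empty

@dataclass
class SearchQuery:
    required: List[SearchTerm]  # the AND list, L1
    optional: List[SearchTerm]  # the OR list, L2

# One "(label,keyword)" term.
TERM = re.compile(r"\(\s*([^,()]*?)\s*,\s*([^,()]*?)\s*\)")

def parse_terms(text):
    return [SearchTerm(label, keyword) for label, keyword in TERM.findall(text)]

def parse_query(text):
    match = re.fullmatch(r"\s*AND:(.*?)\s*OR:(.*)", text)
    if match is None:
        raise ValueError("expected 'AND:<terms> OR:<terms>'")
    return SearchQuery(parse_terms(match.group(1)), parse_terms(match.group(2)))

query = parse_query("AND:(title,)(price,) OR:(author,Meyer)(author,Smith)")
print(query.required)
print(query.optional)
```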

23 Answer. AND:N1 OR:N2, where N1, N2 are lists of nodes. Matching between N1, N2 and L1, L2. N1 and N2 are interconnected. All all-pairs answers are star answers. Maximal answer.

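24 Example: (title,) (price,) (author,Meyer). [Figure: tree rooted at bookinfo with two book nodes. First book: title "Just Lost", authors Mercy Meyer and Gina Meyer, price $5.75. Second book: title "Brown Bear", price $13.95.] Find matchings of title, author and price to the nodes in the tree. [Table with columns title, author, price; only a null entry survives in the transcript.]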

25 Computing answers. All-pairs: determining whether the set of answers is empty is NP-complete; if L1 is empty, computing the set of answers is polynomial in the size of the input and output. Star: computing the set of answers is polynomial in the size of the input and output.

26 Ranking results. Unstructured: keyword weight (tfilf), tag weight, result size. Structured: node distance, ancestor-descendant relationship.

27 Keyword Weight. Compute the weight of a keyword k within a given node n. A variation of tfidf, one of the metrics of the Vector Space Model (a classical model in IR).

28 Keyword Weight (cont’d). Term frequency (tf): number of appearances of k within n, $\mathrm{tf}(k,n) = \mathrm{occ}(k,n) / \max_{k'} \mathrm{occ}(k',n)$. Inverse leaf frequency (ilf): inverse frequency of k among all the leaves in the corpus, $\mathrm{ilf}(k) = \log(1 + N/N_k)$. Weight: $W(k,n) = \mathrm{tf}(k,n) \cdot \mathrm{ilf}(k)$. Normalized per leaf.
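
The formulas above can be sketched in Python as follows. The sketch assumes that N is the number of leaves in the corpus and N_k the number of leaves containing k, which the slide does not spell out; it uses the natural logarithm (the slide gives no base), represents each leaf as a list of words, and omits the per-leaf normalization step. The function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent word in the leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())

def ilf(keyword, leaves):
    """log(1 + N / N_k); N and N_k are interpreted as leaf counts (an assumption)."""
    n_k = sum(1 for words in leaves if keyword in words)
    if n_k == 0:
        return 0.0  # keyword absent from the corpus
    return math.log(1 + len(leaves) / n_k)

def weight(keyword, leaf_words, leaves):
    """W(k, n) = tf(k, n) * ilf(k)."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)

LEAVES = [
    ["one", "fish", "two", "fish"],
    ["goodnight", "moon"],
    ["brown", "bear"],
]
print(weight("fish", LEAVES[0], LEAVES))  # 1.0 * log(1 + 3/1)
print(weight("one", LEAVES[0], LEAVES))   # 0.5 * log(1 + 3/1)
```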

29 Tag Weight. Give weight to tags according to their importance, e.g. give more weight to some tags than to others.

30 Result Size. Number of search terms appearing in the result (OR part).

31 Ranking: Structured. Node distance: size of the relationship tree. Ancestor-descendant relationship: “more” interconnected.

32 System overview

33 XSEarch overview. [Diagram components: XML corpus with logical hierarchy, Indexer, Search query, Results; the diagram is divided into an offline part and an online part.]

34 Document Location Array. Generate a unique id, did, for each document. Associate each did with the physical location of the corresponding document. Logical structure of the corpus.

35 Node Encoding Array. Generate an id, nid, for each interior node. Node encoding, defined recursively: the node encoding of its parent, followed by the index of the node among its siblings. E.g. 13.8.1.9. Associate each nid with its node encoding.
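
A short Python sketch of why such an encoding is convenient: ancestor tests and lowest common ancestors reduce to prefix operations. Encodings are represented here as tuples of integers, and the helper names are hypothetical; the slide only gives the dotted form.

```python
def parse_encoding(text):
    """'13.8.1.9' -> (13, 8, 1, 9)"""
    return tuple(int(part) for part in text.split("."))

def format_encoding(encoding):
    """(13, 8, 1, 9) -> '13.8.1.9'"""
    return ".".join(str(part) for part in encoding)

def child_encoding(parent_encoding, sibling_index):
    """Recursive definition: the parent's encoding followed by the sibling index."""
    return parent_encoding + (sibling_index,)

def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a proper prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a

def lowest_common_ancestor(a, b):
    """Longest common prefix of the two encodings (empty if they share none)."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)

node = parse_encoding("13.8.1.9")
print(format_encoding(node[:-1]))                                    # parent: 13.8.1
print(is_ancestor(parse_encoding("13.8"), node))                     # True
print(format_encoding(lowest_common_ancestor(node, (13, 8, 4))))     # 13.8
```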

36 Node Label Array. Associate each nid with its label.

37 Inverted Tag Index. For each tag, keep a posting list (the list of nodes labeled with this tag) and a weight. [Diagram: tag -> Nid1, Nid2, Nid3.]

38 Inverted Keyword Index. For each keyword (kw), keep a posting list: the list of leaves containing this keyword, each with the weight of the kw within the leaf (tfilf). [Diagram: kw -> (Nid1, w1), (Nid2, w2), (Nid3, w3).]
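
A minimal Python sketch of building such an index. The weighting function is passed in as a parameter; the one used in the example is a simple relative-frequency stand-in, not the tfilf weight of the talk, and all names and data are hypothetical.

```python
from collections import defaultdict

def build_keyword_index(leaves, weight):
    """Map each keyword to a posting list of (nid, weight) pairs.

    leaves: dict mapping a leaf nid to its list of words.
    weight: function (keyword, nid) -> float.
    """
    index = defaultdict(list)
    for nid in sorted(leaves):
        for keyword in sorted(set(leaves[nid])):
            index[keyword].append((nid, weight(keyword, nid)))
    return dict(index)

LEAF_WORDS = {
    1: ["one", "fish", "two", "fish"],
    2: ["goodnight", "moon"],
    3: ["red", "fish"],
}

def relative_frequency(keyword, nid):
    """Stand-in weight: share of the leaf's words equal to the keyword."""
    words = LEAF_WORDS[nid]
    return words.count(keyword) / len(words)

INDEX = build_keyword_index(LEAF_WORDS, relative_frequency)
print(INDEX["fish"])  # [(1, 0.5), (3, 0.5)]
```

The inverted tag index has the same shape, with tags as keys and posting lists of the nodes carrying each tag.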

39 Node Interconnection Matrix. Element ij contains 1 if ni and nj are interconnected, 0 otherwise. n*n symmetric sparse matrix. Computed by dynamic programming.

40 Alternative. Hash set: keep only interconnected nodes. Key: pair (ni, nj).
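
A sketch of the hash-set alternative in Python. Because interconnection is symmetric, each pair is stored once under a normalized key; the class name and the normalization are illustrative choices, not details given in the talk.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, keyed by the pair (ni, nj)."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # Symmetric relation: store each pair once, smaller id first.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def connected(self, ni, nj):
        return self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)

pairs = InterconnectionSet()
pairs.add(3, 1)
pairs.add(1, 3)  # same pair, stored once
print(pairs.connected(1, 3), pairs.connected(1, 2), len(pairs))
```

Memory then grows with the number of interconnected pairs rather than with n*n.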

41 Interconnection. Let n be the number of nodes. It is possible to determine whether n1 and n2 are interconnected in O(n) time. It is possible to determine interconnection of all pairs in O(n^2) time. Offline/online computation.

42 Interconnection
    // connected(i,j): whether nodes i and j are interconnected
    for (i = size-1; i >= 0; i--) {
        for (j = i+1; j < size; j++)
            if (i is an ancestor of j)
                connected(i,j) = connected(iChild,j) AND connected(i,jFather)
                                 AND labelIChild != labelJ AND labelI != labelJFather
        for (j = i+1; j < size; j++)
            if (i is not an ancestor of j)
                connected(i,j) = connected(i,jFather) AND connected(iFather,j)
                                 AND labelI != labelJFather AND labelIFather != labelJ
    }
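
The recurrence above shortens the path between two nodes by one node at each end: two nodes are interconnected if both shortened paths are, and the two label comparisons that neither shortened path covers also hold. The Python sketch below applies that recurrence with memoized recursion instead of the slide's loop order, so that each sub-result is available when needed. It assumes nodes are numbered so that every parent precedes its children; the helper names and toy tree are hypothetical, and because each step walks parent pointers, the sketch makes no claim to the O(n^2) bound.

```python
from functools import lru_cache

def all_pairs_interconnection(parent, label):
    """Return an n*n boolean matrix of pairwise interconnection.

    parent[i] is the index of i's parent (None for the root), with
    parent[i] < i assumed; label[i] is the tag of node i.
    """
    n = len(parent)
    depth = [0] * n
    for i in range(n):
        if parent[i] is not None:
            depth[i] = depth[parent[i]] + 1

    def climb(x, d):
        """Ancestor of x at depth d (x itself if already at depth d)."""
        while depth[x] > d:
            x = parent[x]
        return x

    def step(a, b):
        """Node following a on the path from a to b (a != b)."""
        if depth[b] > depth[a] and climb(b, depth[a]) == a:
            return climb(b, depth[a] + 1)  # a is an ancestor of b: go down
        return parent[a]                   # otherwise go up

    @lru_cache(maxsize=None)
    def connected(a, b):
        if a == b:
            return True
        a_next, b_prev = step(a, b), step(b, a)
        if a_next == b:                    # adjacent nodes: a two-node path
            return True
        return (connected(a_next, b) and connected(a, b_prev)
                and label[a_next] != label[b] and label[a] != label[b_prev])

    return [[connected(i, j) for j in range(n)] for i in range(n)]

# Toy tree: bookinfo(0) -> book(1) -> title(2), price(3); book(4) -> title(5), price(6)
PARENT_LIST = [None, 0, 1, 1, 0, 4, 4]
LABEL_LIST = ["bookinfo", "book", "title", "price", "book", "title", "price"]
MATRIX = all_pairs_interconnection(PARENT_LIST, LABEL_LIST)
print(MATRIX[2][3], MATRIX[2][6])
```

In the toy tree, the title and price of the same book are interconnected, while a title and the price of the other book are not.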

43 Demo

