
1 Keyword Search over XML

2 Inexact Querying
Until now, our queries have been complex patterns, represented by trees or graphs. Such query languages are not appropriate for the naive user:
–if XML “replaces” HTML as the web standard, users can’t be expected to write graph queries
Allow keyword search over XML!

3 Keyword Search
A keyword search is a list of search terms. There can be different ways to define legal search terms. Examples:
–label:keyword, e.g., author:Smith
–keyword only, e.g., :Smith
–label only, e.g., author:
–value (without distinguishing between keywords and labels)
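As a concrete illustration, here is a minimal sketch (ours, not from any paper) of how such search terms might be parsed; the function name and the exact syntax rules are assumptions:

```python
def parse_term(term: str):
    """Split a search term into (label, keyword); either part may be None.
    Assumed syntax, per the examples above: 'author:Smith' is
    label:keyword, ':Smith' is keyword only, 'author:' is label only,
    and a bare 'Smith' is a plain value."""
    if ":" not in term:
        return (None, term)                # plain value
    label, _, keyword = term.partition(":")
    return (label or None, keyword or None)

assert parse_term("author:Smith") == ("author", "Smith")
assert parse_term(":Smith") == (None, "Smith")
assert parse_term("author:") == ("author", None)
```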

4 Challenges (1)
Determining which part of the XML document corresponds to an answer:
–When searching HTML, the result units are usually documents
–When searching XML, a finer granularity should be returned, e.g., a subtree

5 What should be returned for the query :ACID, :Kempster?

6 Challenges (2)
Avoiding the return of non-meaningfully related elements:
–XML documents often contain many unrelated fragments of information. Can these information units be recognized?

7 What should be returned for the query :XML, author:?

8 What should be returned for the query :XML, :Kempster?

9 Challenges (3)
Ranking mechanisms:
–How should document fragments/XML elements be ranked? Ideas?

10 In what order should the answers be returned for :ACID, author:?

11 Defining a Search Semantics
When defining a search over XML, all of the previous challenges must be considered. We must decide:
–What portions of a document are a search result?
–Should any results be filtered out because they are not meaningful?
–How should ranking be performed?
Typically, research focuses on one of these problems and provides simple solutions for the others.

12 XRank: Ranked Keyword Search over XML Documents
Guo, Shao, Botev, Shanmugasundaram. SIGMOD 2003.

13 Queries and their Semantics
Queries are keywords k1,…,kn, as in a search engine. Query results are portions of XML documents that contain all the keywords. Formally:
–Let v be a node in the document. To determine whether v should be returned: first, “remove” any descendants of v that contain all the keywords k1,…,kn. If v still contains all of k1,…,kn, then v should be a result of the search.
–Intuition: only return v if no more specific element can be returned.
Note: containment is via child edges, not IDREF edges.
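To make this semantics concrete, here is a small sketch (ours, not the paper's) that computes the result set for an in-memory tree; the dict-based node shape is an assumption made for illustration:

```python
def xrank_results(root, keywords):
    """XRank result semantics: a node is a result if, after 'removing'
    every descendant subtree that already contains all keywords, it
    still contains all keywords. Nodes are assumed to be dicts:
    {'words': set of words directly contained, 'children': [...]}."""
    keywords = set(keywords)
    results = []

    def visit(node):
        # Keywords this subtree contributes to its ancestors.
        found = node['words'] & keywords
        for child in node['children']:
            found |= visit(child)
        if found >= keywords:
            results.append(node)
            return set()          # result subtrees are "removed"
        return found

    visit(root)
    return results
```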

14 [Figure: a sample XML document: the proceedings of “XML and Information Retrieval: A SIGIR 2000 Workshop” (David Carmel, Yoelle Maarek, Aya Soffer), containing the papers “XQL and Proximal Nodes” (Ricardo Baeza-Yates, Gonzalo Navarro) and “Querying XML in Xyleme”, with text fragments mentioning “XQL” and “language”.]
What should be returned for the query XQL language?

15 [Same figure as slide 14.]
What should be returned for the query XQL language?

16 Ranking Results: Intuition
Granularity of ranking:
–In HTML, there is a rank for each document
–In XML, we want a rank for each element. Different elements in the same document may have different ranks
Propose to extend ideas used for ranking HTML:
–PageRank: documents with more incoming links are more important (a recursive definition)
–Proximity: if the document contains the search terms close together, then the document is more important
Overall rank: a combination of PageRank and proximity.

17 [Same figure as slide 14.]
Should both papers be ranked the same?

18 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

19 Ranking Results
Take into consideration:
–hyperlinks
–proximity
Here we only discuss ranking by the linking structure. Ranking by proximity can easily be defined (ideas?).
What kinds of “links” are there in a graph of XML documents?

20 [Figure: the slide-14 document, highlighting child/parent “links”.]

21 [Figure: the slide-14 document, highlighting IDREF “links”.]

22 [Figure: the slide-14 document, highlighting XLink “links” (pointing out of the document).]

23 Remember: PageRank
$$p(v) = \frac{1-d}{N} + d \sum_{(u,v) \in E} \frac{p(u)}{N_h(u)}$$
where N is the number of documents, N_h(u) is the number of links leaving u, d is the probability of following a hyperlink, and 1-d is the probability of a random jump. (In the slide's figure, a page with three outgoing links passes d/3 of its rank along each.)
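A minimal power-iteration sketch of this formula (illustrative only; variable names are ours):

```python
def pagerank(out_links, d=0.85, iters=50):
    """Plain PageRank by power iteration.
    out_links: dict mapping each document to the documents it links to."""
    docs = list(out_links)
    n = len(docs)
    rank = {p: 1.0 / n for p in docs}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in docs}       # random-jump term
        for u, targets in out_links.items():
            for v in targets:                      # follow-a-hyperlink term
                new[v] += d * rank[u] / len(targets)
        rank = new
    return rank

# Example: page 'a' links to 'b' and 'c'; both link back to 'a'.
print(pagerank({'a': ['b', 'c'], 'b': ['a'], 'c': ['a']}))
```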

24 A Graph of XML Documents
Nodes N:
–each element in a document is a node
Edges E = CE ∪ CE⁻¹ ∪ HE:
–CE are “containment links”, i.e., there is an edge (u,v) in CE if u is a parent of v in the XML document
–HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if there is an IDREF link or XLink link from u to v
We want to define ElemRank, the parallel of PageRank, but for XML elements.

25 Attempt 1 at ElemRank
There are now four ways to reach an element: a random jump, a hyperlink edge, a parent-to-child containment edge, and a child-to-parent containment edge. Consider all of them in one formula (reconstructed here from the XRank paper):
$$e(v) = \frac{1-d}{N} + d \sum_{(u,v) \in E} \frac{e(u)}{N_e(u)}$$
where N is the total number of elements and N_e(u) is the total number of edges leaving u.

26 Attempt 1 at ElemRank: Problem
Consider a paper with a few sections and many references. The more references there are, the less important each section is. Why?

27 Attempt 2 at ElemRank
Consider hyperlinks and structural (containment) links separately (reconstructed from the XRank paper):
$$e(v) = \frac{1-d_1-d_2}{N} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE \cup CE^{-1}} \frac{e(u)}{N_c(u)}$$
where N_h(u) and N_c(u) count the hyperlink and containment edges leaving u.

28 Attempt 2 at ElemRank: Problem
In fact, it is better to treat parent-to-child links differently from child-to-parent links.

29 Actual ElemRank
Consider hyperlinks, parent links, and child links separately (reconstructed from the XRank paper):
$$e(v) = \frac{1-d_1-d_2-d_3}{N_d \cdot N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$$
where N_d is the number of documents, N_de(v) is the number of elements in the document containing v, N_h(u) is the number of hyperlinks leaving u, and N_c(u) is the number of children of u. The d_3 sum needs no divisor because every element has exactly one parent.

30 Interpretation in Terms of Random Walks
The ElemRank of e is the probability that e will be reached if we start at a random element and at each step choose one of the following options:
–with probability 1-d1-d2-d3, jump to a random element in a random page
–with probability d1, follow a random hyperlink from the current element
–with probability d2, follow a random edge to a child element
–with probability d3, follow the parent edge
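The following power-iteration sketch implements this random walk directly (our illustration, not the paper's code; for simplicity, the jump term here picks a random element uniformly rather than a random element of a random page):

```python
def elem_rank(hyper, children, parent, d1=0.3, d2=0.3, d3=0.3, iters=100):
    """ElemRank via the random-walk interpretation above.
    hyper[u]    -> elements u points to via IDREF/XLink
    children[u] -> child elements of u (an entry per element)
    parent[u]   -> parent of u, or None for a document root"""
    nodes = list(children)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d1 - d2 - d3) / n for v in nodes}   # random jump
        for u in nodes:
            for v in hyper.get(u, []):                     # hyperlink, prob. d1
                new[v] += d1 * rank[u] / len(hyper[u])
            for v in children[u]:                          # to a child, prob. d2
                new[v] += d2 * rank[u] / len(children[u])
            if parent.get(u) is not None:                  # to the parent, prob. d3
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank
```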

31 ElemRank Example
Suppose that d1 = d2 = d3 = 0.3.
[Figure: a four-node graph (nodes 1, 2, 3, 4) with hyperlink and containment edges.]
In what order will the nodes be ranked? What will be the formula for each node?

32 Think About It
A very nice definition of ElemRank, but does it make sense? Would ElemRank give good results in the following scenarios?
–IDREFs connect articles with the articles that they cite
–IDREFs connect managers with their departments
–IDREFs connect cleaning staff with the departments in which they work
–IDREFs connect countries with bordering countries (as in the CIA factbook)

33 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

34 Indexing
We now discuss the index structure. Recall that we will be ranking according to ElemRank, and that we want to return the “most specific elements”. How should the data be stored in an index?

35 Naive Method
Treat elements as documents and keep normal inverted lists:
Ricardo → 0; 4; 5; 8
XQL → 0; 4; 5; 7
Problem: space overhead (every ancestor of a node containing a keyword also contains it, so each occurrence is indexed many times). How much space is needed in storage?

36 Naive Method
Treat elements as documents and keep normal inverted lists:
Ricardo → 0; 4; 5; 8
XQL → 0; 4; 5; 7
Problem: spurious results. We can’t simply return the intersection of the lists, since if a node satisfies a query, so do all of its ancestors.

37 Dewey Encoding of IDs
Use path information to identify elements (the DeweyID). An ancestor’s ID is a prefix of its descendant’s ID.
[Figure: a document tree whose nodes carry DeweyIDs 0, 0.0, 0.1, 0.2, 0.3, 0.3.0, 0.3.1, 0.3.0.0, 0.3.0.1, …, with text such as “28 July”, “XML and …”, and “David Carmel …” at the leaves.]
Actually (not shown), all node IDs are prefixed by the document number.
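A one-function sketch of the property this encoding buys (the function name is ours):

```python
def is_ancestor(a: str, b: str) -> bool:
    """True iff the element with DeweyID a is a proper ancestor of the
    element with DeweyID b: a's components are a prefix of b's."""
    return b.startswith(a + ".")

assert is_ancestor("0.3", "0.3.0.1")
assert not is_ancestor("0.3", "0.30.1")  # prefix of components, not of characters
```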

38 Dewey Inverted List (DIL)
Store, for each keyword, a list containing:
–the id of the node containing the keyword
–the rank of the node containing the keyword
–the positions of the keyword in the node
Rank and positions are needed to compute the ranking. To simplify, in the following slides we only store lists of node ids.

39 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

40 Query Processing
Challenges:
–How do we find nodes that contain all keywords?
–How do we find only the most specific nodes that contain all keywords?
–Can this be done in a single scan of the inverted keyword lists?

41 Example: Document
The 47th document in the corpus (shown as a tree in the slide):
proceedings
  paper
    title: … XQL …
    abstract: … language …
    section: … XQL …
      subsection: … XQL language …
  paper: … XQL …

42 Example: Document with IDs
The same tree with DeweyIDs:
proceedings (47.0)
  paper (47.0.0)
    title (47.0.0.0): … XQL …
    abstract (47.0.0.1): … language …
    section (47.0.0.2): … XQL …
      subsection (47.0.0.2.0): … XQL language …
  paper (47.0.1): … XQL …

43 Example: Inverted Lists
Lists contain the ids of the nodes that directly contain the keyword, and are sorted:
XQL → 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language → 47.0.0.1; 47.0.0.2.0

44 Example: Inverted Lists
XQL → 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language → 47.0.0.1; 47.0.0.2.0
We want to find the nodes that should be returned. Which are they?

45 Algorithm: Data Structures
The algorithm maintains a Dewey stack, whose frames hold one DeweyID component plus per-keyword containment flags (Contains[1], Contains[2]) and a ContainsAll flag, and a result heap. The inverted lists for XQL and language are as in slide 43.

46 Algorithm: Pseudo Code
–Find the smallest next entry in the inverted lists
–Find the longest common prefix of the entry and the Dewey stack
–Pop all non-matching values from the Dewey stack. When popping:
–propagate containment information down, if ContainsAll is false
–if ContainsAll turns from false to true, add the result to the output
–Add the non-matching components of the entry to the Dewey stack. Mark containment for the entry’s keyword
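Below is a runnable sketch of this loop (our reconstruction of the slide's pseudocode, not the paper's exact algorithm). It merges the sorted lists, maintains the Dewey stack, and emits a node when it is popped with all keywords contained:

```python
import heapq

def dil_search(inverted, keywords):
    """Single-scan search over Dewey inverted lists.
    inverted: keyword -> sorted list of DeweyID strings for the nodes
    that directly contain it."""
    streams = [[(tuple(map(int, d.split('.'))), i) for d in inverted[w]]
               for i, w in enumerate(keywords)]
    stack, results = [], []          # stack frame: [component, flags]

    def pop():
        comp, flags = stack.pop()
        path = [c for c, _ in stack] + [comp]
        if all(flags):               # most specific node with all keywords
            results.append('.'.join(map(str, path)))
        elif stack:                  # propagate containment to the parent
            parent_flags = stack[-1][1]
            for i, f in enumerate(flags):
                parent_flags[i] = parent_flags[i] or f

    for dewey, kw in heapq.merge(*streams):
        lcp = 0                      # longest common prefix with the stack
        while (lcp < len(dewey) and lcp < len(stack)
               and stack[lcp][0] == dewey[lcp]):
            lcp += 1
        while len(stack) > lcp:      # pop non-matching frames
            pop()
        for comp in dewey[lcp:]:     # push the entry's remaining components
            stack.append([comp, [False] * len(keywords)])
        stack[-1][1][kw] = True      # mark containment for this keyword
    while stack:                     # flush the stack at end of input
        pop()
    return results

# The running example (lists from slide 43):
print(dil_search({'XQL': ['47.0.0.0', '47.0.0.2', '47.0.0.2.0', '47.0.1'],
                  'language': ['47.0.0.1', '47.0.0.2.0']},
                 ['XQL', 'language']))
# -> ['47.0.0.2.0', '47.0.0']
```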

47 Example: Algorithm
[Figure: initial state of the data structures: the two inverted lists, an empty Dewey stack, and an empty result heap.]

48 Example: Algorithm
Smallest entry is for keyword 1, XQL. lcp with Dewey stack = none. Pop (nothing). Add (all).

49 Example: Algorithm
[Figure: the Dewey stack now holds the components 47, 0, 0, 0 (bottom to top), with Contains[1] set on the top frame.]

50 Example: Algorithm
Next smallest entry is for keyword 2, language. lcp with Dewey stack = 47.0.0. Pop non-matching entries.

51 Example: Algorithm
Next smallest entry is for keyword 2, language. lcp with Dewey stack = 47.0.0. Add additional entries.

52 Example: Algorithm
[Figure: after processing the language entry, Contains[1] has been propagated down from the popped frame, and Contains[2] is set on the new top frame.]

53 Example: Algorithm
Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = 47.0.0. Pop non-matching entries.

54 Example: Algorithm
Next smallest entry is for keyword 1, XQL. lcp with Dewey stack = 47.0.0. Continue on the blackboard!

55 Try It!
Which nodes would be returned for the keyword search John Ben? Show the state of the data structures at the point at which the first answer is printed out.

56 Inexact Querying of XML

57 XML Data May Be Irregular
Relational data is regular and organized; XML may be very different:
–Data is incomplete: values of attributes are missing in some elements
–Data has structural variations: relationships between elements are represented differently in different parts of the document
–Data has ontology variations: different labels are used to describe nodes of the same type
(Note: in some of the upcoming slides, labels are on edges instead of on nodes.)

58 Incomplete Data
[Figure: a Movie Database tree containing Movie, Film, T.V. Series, and Actor nodes with Title, Name, and Year children (titles: Dune, Star Wars, Twin Peaks, Léon, Magnolia; actors: Kyle MacLachlan, Natalie Portman, Harrison Ford, Mark Hamill; years: 1977, 1984). One movie has a Year attribute; for another movie, the year is missing.]

59 Variations in Structure
[Figure: the same Movie Database; in one part a Movie element appears below an Actor, in another an Actor appears below a Movie.]

60 Ontology Variations
[Figure: the same Movie Database; nodes of the same type are labeled Movie in one place and Film in another.]

61
–The description of the schema is large (e.g., a DTD of XML) → it is difficult to use the schema when formulating queries
–Data is contributed by many users in a variety of designs → the query should deal with different structures of data
–The structure of the database is changed frequently → queries would have to be rewritten frequently
Need to allow the user to write an “approximate query” and have the query processor deal with it.

62 The Problem
In many different domains, we are given the option to query some source of information. Usually, the user only gets results if the query can be completely answered (satisfied). In many domains this is not appropriate, e.g.:
–The user is not familiar with the database
–The database does not contain complete information
–There is a mismatch between the ontology of the user and that of the database

63 What Do Users Need?
Users need a way to get interesting partial answers to their queries, especially when a complete answer does not exist. These partial answers should contain maximal information.
Problem:
–It is easy to define when an answer satisfies a query
–It is hard to say when an answer that does not satisfy a query is of interest
–It is hard to say which incomplete answers are better than others

64 Inexact Answers
Many different definitions have been given; for each definition, query processing algorithms have been defined. Examples:
–Allow some of the nodes of the query to be unmatched
–Allow edges in the query to be matched to paths in the database
–Allow nodes to be matched to nodes with labels that have a similar meaning
Be careful so that answers remain meaningful!

65 Tree Pattern Relaxation
Amer-Yahia, Cho, Srivastava. EDBT 2002.

66 Tree Patterns
Queries are tree patterns, as considered in previous lessons. The running example (a double line indicates a descendant edge):
Book
  Collection
  Editor
    Name
    Address  (descendant edge)

67 Relaxed Queries
Four types of “relaxations” are allowed on the trees.
Node Generalization: assume that we know a type/supertype relationship among labels; a label may be changed to its supertype. [Example: the Book root of the query is generalized to Document.]

68 Relaxed Queries
Leaf Node Deletion: delete a leaf node (and its incoming edge) from the tree. [Example: the Collection leaf is deleted.]

69 Relaxed Queries
Edge Generalization: change a parent-child edge to an ancestor-descendant edge. [Example: the Book–Collection edge becomes an ancestor-descendant edge.]

70 Relaxed Queries
Subtree Promotion: a query subtree can be promoted so that it is directly connected to its former grandparent by an ancestor-descendant edge. [Example: the subtree rooted at Address is promoted to hang off Book by a descendant edge.]
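To fix ideas, here is a tiny sketch (ours; the names are illustrative) of a tree-pattern representation and two of the four relaxations:

```python
from dataclasses import dataclass, field

@dataclass
class PNode:
    """A tree-pattern node; edge is how it hangs off its parent."""
    label: str
    edge: str = "child"            # "child" or "descendant"
    kids: list = field(default_factory=list)

def edge_generalization(n: PNode):
    """Relax n's parent-child edge to an ancestor-descendant edge."""
    n.edge = "descendant"

def leaf_deletion(parent: PNode, leaf: PNode):
    """Delete a leaf node and its incoming edge."""
    parent.kids.remove(leaf)

# The running query: Book with children Collection and Editor;
# Editor has a Name child and an Address descendant.
addr = PNode("Address", edge="descendant")
editor = PNode("Editor", kids=[PNode("Name"), addr])
query = PNode("Book", kids=[PNode("Collection"), editor])
```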

71 Composing Relaxations
Relaxations can be composed. Are the following relaxations of Q?
[Figure: the query Q from slide 66 alongside three candidate trees composed of Book/Document, Collection, Name, and Address nodes.]

72 Approximate Answers and Ranking
An approximate answer to Q is an exact answer to a relaxed query derived from Q. In order to give different answers different rankings, tree patterns are weighted: each node and edge has two weights, its value when exactly satisfied and its value when satisfied by a relaxation.
[Figure: the query tree with the weight pairs (7, 1), (4, 3), (2, 1), (6, 0), (5, 0), (8, 5), (6, 0), (4, 0), (3, 0) on its nodes and edges.]
A fragment of a document that exactly satisfies the query will have a score of 45 (the sum of the exact-match weights).
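A minimal scoring sketch under this weighting (our illustration; treating a relaxed-away part as contributing its relaxed weight is an assumption consistent with the slide):

```python
def score(weights, relaxed_parts):
    """Sum each query part's exact weight if matched exactly, else its
    relaxed weight. weights: part -> (exact, relaxed);
    relaxed_parts: the parts satisfied only via a relaxation."""
    return sum(r if part in relaxed_parts else e
               for part, (e, r) in weights.items())

# With the slide's nine weight pairs, an exact match scores
# 7 + 4 + 2 + 6 + 5 + 8 + 6 + 4 + 3 = 45.
```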

73 Example Ranking
[Figure: the weighted query next to a candidate answer built from Book, Person, Details, Name (“Sam”), and Address (“NY”) nodes.]
How much would this answer score?

74 Example Ranking
[Figure: the same weighted query and candidate answer as slide 73.]
How much would this answer score?

75 Problem Definition
Given an XML document D, a weighted tree pattern Q, and a threshold t, find all approximate answers of Q in D whose scores are ≥ t.
A naive strategy to solve the problem:
–Find all relaxations of Q
–For each relaxation, compute all exact answers
–Remove answers with scores below t
Is this a good strategy?

76 Problem Definition
Given an XML document D, a weighted tree pattern Q, and a threshold t, find all approximate answers of Q in D whose scores are ≥ t.
A better strategy to compute an answer to a relaxation of a query:
–Intuition: compute the query as a series of joins
–Can use stack-merge algorithms (studied before) for computing the joins
–Filter out intermediate results whose scores are too low

77 The Query Plan
We now show how to derive a plan for evaluating queries in this setting:
–First, we show how an exact plan is derived
–Then, we consider how each individual relaxation can be added in
–Finally, we show the complete relaxed plan

78 Query Plan: Exact Answers
The exact plan joins the lists of Book, Collection, Editor, Name, and Address nodes under the predicates:
–c(Book, Collection)
–c(Book, Editor)
–c(Editor, Name)
–d(Editor, Address)
where c(x, y) means y is a child of x and d(x, y) means y is a descendant of x.

79 Query Plan: Exact Answers
Remember: to compute a join, e.g., of Book and Collection, we actually fetch the list of Books and the list of Collections (from the index) and run the stack-merge algorithms.

80 Adding Relaxations into the Plan
Node generalization: Book is relaxed to Document. In the plan, c(Book, Collection) becomes c(Document, Collection) and c(Book, Editor) becomes c(Document, Editor).

81 Adding Relaxations into the Plan
Edge generalization: relax the Editor–Name edge. The predicate becomes:
c(Editor, Name) or (not exists c(Editor, Name) and d(Editor, Name))
written in short as: c(Editor, Name) or d(Editor, Name).
We only allow the relaxation when a direct child does not exist.

82 Adding Relaxations into the Plan
Subtree promotion: promote the subtree rooted at Name. The predicate becomes:
c(Editor, Name) or (not exists c(Editor, Name) and d(Book, Name))
written in short as: c(Editor, Name) or d(Book, Name).

83 Adding Relaxations into the Plan
Leaf node deletion: make Address optional by using an outer join. The outer join operator means: join if possible, but do not drop values that cannot be joined.

84 Combining All Possible Relaxations
All approximate answers can be derived from the following query plan over Document, Collection, Editor, Name, and Address:
–c(Book, Collection) OR d(Document, Collection)
–c(Document, Editor) OR d(Document, Editor)
–c(Editor, Name) OR d(Editor, Name) OR d(Document, Name)
–d(Editor, Address) OR d(Document, Address)
[Figure: the plan shown alongside the weighted query of slide 72.]

85 Creating “Best Answers”
We want to find the answers whose ranking is over the threshold t.
–Naive solution: create all answers, then delete the answers with low rankings
–Algorithm Thres: the goal of the algorithm is to prune intermediate answers that cannot possibly meet the specified threshold

86 Associating Nodes with Maximal Weight
The maximal weight of a node in the evaluation plan is the largest value by which the score of an intermediate answer computed at that node can still grow.
[Figure: the relaxed query plan of slide 84.]

87 [Figure: the relaxed query plan of slide 84 annotated with maximal weights (38), (39), (30), (40), (39), (41), (21), (7), (0), shown next to the weighted query of slide 72.]

88 Algorithm Thres
–The relaxed query evaluation plan is computed bottom-up. Note that the joins are computed for all matching intermediate results at the same time
–At each step, intermediate results are computed, along with their scores
–If the sum of an intermediate result’s score and the maximal weight of the current node is less than the threshold, prune the intermediate result
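The pruning test itself is one line; a sketch (the names are ours):

```python
def prune(intermediates, max_weight, threshold):
    """Keep only the intermediate results that can still reach the
    threshold: current score + maximal remaining weight >= threshold."""
    return [(ans, s) for ans, s in intermediates
            if s + max_weight >= threshold]

# E.g., with threshold 35, an intermediate scoring 7 at a plan node
# whose maximal weight is 27 is pruned, since 7 + 27 = 34 < 35.
```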

89 Example: Threshold = 35
[Figure: the annotated plan of slide 87 evaluated on a candidate answer built from Book, Editor, Details, Name (“Sam”), and Address (“NY”) nodes; the intermediate scores along the plan are 7, 7, 16, and 27.]
When will the answer be pruned?

90 Try It!
[Figure: the weighted query of slide 72 and a candidate answer built from Document, Collection, Name (“Sam”), and Address (“NY”) nodes.]
1) How much would this answer score?

91 Try It (cont.)
[Figure: a second weighted query: Book with Collection, Editor, and Name below it, and FName and LName below Name; the weight pairs shown are (8, 5), (7, 1), (4, 3), (2, 1), (5, 0), (6, 0), (2, 1), (2, 0), (1, 0).]
2) What will the exact plan look like?
3) What will the plan look like if all possible relaxations are added?
4) For each node, what is the maximal weight by which the score of an intermediate answer can grow?

