
1 Keyword Search over XML

2 Inexact Querying
Until now, our queries have been complex patterns, represented by trees or graphs.
Such query languages are not appropriate for the naive user:
–if XML “replaces” HTML as the web standard, users can’t be expected to write graph queries
Allow keyword search over XML!

3 Keyword Search
A keyword search is a list of search terms.
There can be different ways to define legal search terms. Examples:
–label:keyword, e.g., author:Smith
–keyword, e.g., :Smith
–label, e.g., author:
–value (without distinguishing between keywords and labels)
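The term forms listed on this slide can be told apart mechanically. The following is a minimal sketch, assuming the colon is the only separator; the names `SearchTerm` and `parse_term` are hypothetical and not taken from the slides.

```python
from typing import NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str] = None    # element name to match, e.g. "author"
    keyword: Optional[str] = None  # text content to match, e.g. "Smith"
    value: Optional[str] = None    # bare word: may match a label or a keyword


def parse_term(term: str) -> SearchTerm:
    """Classify one search term by where (and whether) the colon appears."""
    if ":" not in term:
        # No colon: a plain value, not distinguishing keywords from labels.
        return SearchTerm(value=term)
    label, _, keyword = term.partition(":")
    # An empty side of the colon means "unconstrained".
    return SearchTerm(label=label or None, keyword=keyword or None)


for raw in ["author:Smith", ":Smith", "author:", "Smith"]:
    print(raw, "->", parse_term(raw))
```

A query is then simply a list of such terms, for example `[parse_term(t) for t in (":ACID", ":Kempster")]`.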

4 Challenges (1)
Determining which part of the XML document corresponds to an answer:
–When searching HTML, the result units are usually whole documents
–When searching XML, a finer granularity should be returned, e.g., a subtree

5 What should be returned for the query :ACID, :Kempster ?

6 Challenges (2)
Avoiding the return of elements that are not meaningfully related:
–XML documents often contain many unrelated fragments of information. Can these information units be recognized?

7 What should be returned for the query :XML, author: ?

8 What should be returned for the query :XML, :Kempster ?

9 Challenges (3)
Ranking mechanisms:
–How should document fragments/XML elements be ranked? Ideas?

10 In what order should the answers be returned for :ACID, author: ?

11 Defining a Search Semantics
When defining a search over XML, all previous challenges must be considered. We must decide:
–what portions of a document are a search result?
–should any results be filtered out since they are not meaningful?
–how should ranking be performed?
Typically, research focuses on one of these problems and provides simple solutions for the other problems.

12 Topics Discussed
XRank: paper presents a variation of PageRank for ranking XML elements
–focus on ranking
Interconnection Semantics: methods to determine whether a set of nodes is meaningfully related
–focus on filtering out meaningless results

13 XRank: Ranked Keyword Search over XML Documents
Guo, Shao, Botev, Shanmugasundaram
SIGMOD 2003

14 Queries and their Semantics
Queries are keywords k1, …, kn, as in a search engine.
Query results are portions of XML documents that contain all the keywords. Formally:
–Let v be a node in the document. To determine whether v should be returned: first, “remove” any descendants of v that contain all the keywords k1, …, kn. If v still contains all of k1, …, kn, then v should be a result of the search.
–Intuition: only return v if no more specific element can be returned.
Note: containment is via child edges, not IDREF edges.
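The "remove, then re-check" rule can be evaluated in one bottom-up pass. The following is a minimal sketch under one reading of the slide: a child subtree that already contains all keywords passes no keywords up to its parent. The `Node` class and the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    name: str
    words: Set[str] = field(default_factory=set)      # keywords directly in this element
    children: List["Node"] = field(default_factory=list)


def most_specific_results(root: Node, keywords: Set[str]) -> List[str]:
    """Return names of nodes that contain all keywords after 'removing'
    child subtrees that already contain all of them."""
    results: List[str] = []

    def visit(node: Node) -> Tuple[Set[str], bool]:
        found = keywords & node.words
        has_all_below = False
        for child in node.children:
            child_found, child_has_all = visit(child)
            if child_has_all:
                has_all_below = True   # "removed": contributes no keywords
            else:
                found |= child_found
        if found == keywords:
            results.append(node.name)
            return found, True
        return found, has_all_below

    visit(root)
    return results


# Hypothetical tree (containment via child edges only).
subsection = Node("subsection", {"xql", "language"})
section = Node("section", {"xql"}, [subsection])
paper = Node("paper", set(), [Node("title", {"xql"}), Node("abstract", {"language"}), section])
root = Node("proceedings", set(), [paper])

print(most_specific_results(root, {"xql", "language"}))
```

In this tree the subsection is a result on its own. The section is not, because once the subsection is removed it only contains "xql". The paper is a result because its title and abstract together still cover both keywords.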

15 [Figure: XML tree of a workshop proceedings titled “XML and Information Retrieval: A SIGIR 2000 Workshop”, followed by the names David Carmel, Yoelle Maarek, Aya Soffer. It contains the paper “XQL and Proximal Nodes” by Ricardo Baeza-Yates and Gonzalo Navarro, with the text “We consider the language …” and “At first sight, the XQL language looks …”, and the paper “Querying XML in Xyleme”.]
What should be returned for the query XQL language?

17 Ranking Results: Intuition
Granularity of ranking:
–In HTML, there is a rank for each document
–In XML, we want a rank for each element. Different elements in the same document may have different ranks
Propose to extend ideas used for ranking HTML:
–PageRank: documents with more incoming links are more important (recursive definition)
–Proximity: if the document contains the search terms close together, then the document is more important
Overall rank: combination of PageRank and proximity

18 [Figure: the SIGIR 2000 workshop proceedings tree, with the papers “XQL and Proximal Nodes” and “Querying XML in Xyleme”.]
Should both papers be ranked the same?

19 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

20 Ranking Results
Take into consideration:
–hyperlinks
–proximity
We only discuss here ranking by the linking structure. Ranking by proximity can easily be defined (ideas?)
What kinds of “links” are there in a graph of XML documents?

21 [Figure: the SIGIR 2000 workshop proceedings tree.]
Child/Parent “links”

22 [Figure: the SIGIR 2000 workshop proceedings tree.]
IDREF “links”

23 [Figure: the SIGIR 2000 workshop proceedings tree.]
XLink “links” (out of the document)

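24 Remember: PageRank
$$p(v) = \frac{1-d}{N} + d \sum_{(u,v) \in E} \frac{p(u)}{N(u)}$$
–E: the set of hyperlink edges
–N: number of documents
–N(u): number of outgoing links of u
–d: probability of following a hyperlink
–1-d: probability of a random jump
[Figure: example graph around node v in which an outgoing hyperlink is followed with probability d/3.]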

25 A Graph of XML documents
Nodes: N
–each element in a document is a node
Edges: E = CE ∪ CE⁻¹ ∪ HE
–CE are “containment links”, i.e., there is an edge (u,v) in CE if u is a parent of v in the XML document
–HE are “hyperlinks”, i.e., there is an edge (u,v) in HE if there is an IDREF link or XLink link from u to v
Want to define ElemRank, the parallel to PageRank, but for XML elements

26 Attempt 1 at ElemRank
There are now 4 ways to get to an element. Consider all in the formula.
[Figure: node v with its hyperlink edges and containment edges; the formula is not preserved in this transcript.]

27 Attempt 1 at ElemRank: Problem
Consider a paper with few sections and many references. The more references there are, the less important each section is. Why?

28 Attempt 2 at ElemRank
Consider hyperlinks and structural links separately.
[Figure: node v with its hyperlink edges and containment edges; the formula is not preserved in this transcript.]

29 Attempt 2 at ElemRank: Problem
In fact, it is better to consider parent-child links differently from child-parent links.

30 Actual ElemRank
Consider hyperlinks, parent links and child links separately.
[Figure: node v with its hyperlink edges and containment edges; the formula is not preserved in this transcript.]
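The formula on this slide did not survive extraction. A reconstruction that is consistent with the random-walk description on the next slide is sketched here. The symbols N_d (number of documents), N_de(v) (number of elements in the document containing v), N_h(u) (number of outgoing hyperlinks of u) and N_c(u) (number of children of u) are notation assumed for this sketch, and the exact form should be checked against the XRank paper.

```latex
e(v) = \frac{1 - d_1 - d_2 - d_3}{N_d \cdot N_{de}(v)}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
     + d_3 \sum_{(u,v) \in CE^{-1}} e(u)
```

The four terms correspond to the four ways of reaching v: a random jump, a hyperlink into v, the edge from v's parent, and the edges from v's children. The last sum has no denominator because every element has at most one parent edge to follow.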

31 Interpretation in terms of Random Walks
The element rank of e is the probability that e will be reached if we start at a random element and at each point choose one of the following options:
–with probability 1 - d1 - d2 - d3, jump to a random element in a random page
–with probability d1, follow a random hyperlink from the current element
–with probability d2, follow a random edge to a child element
–with probability d3, follow the parent edge
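The random walk above can be approximated by repeated propagation of rank along the three edge types. The following is a minimal sketch, not the paper's implementation: the function name, the input format and the three-element example graph are hypothetical, and when a move is unavailable (for example, no outgoing hyperlink) its probability mass is simply dropped, which is a simplifying assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple


def elem_rank(parent: Dict[Hashable, Hashable],
              hyperlinks: List[Tuple[Hashable, Hashable]],
              doc_of: Dict[Hashable, Hashable],
              d1: float = 0.3, d2: float = 0.3, d3: float = 0.3,
              iterations: int = 100) -> Dict[Hashable, float]:
    """parent: child element -> parent element (containment edges).
    hyperlinks: (source, target) pairs (IDREF or XLink edges).
    doc_of: element -> id of the document that contains it."""
    elements = list(doc_of)
    n_docs = len(set(doc_of.values()))
    doc_size = Counter(doc_of.values())
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    out_links = defaultdict(list)
    for src, dst in hyperlinks:
        out_links[src].append(dst)

    rank = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iterations):
        # Random jump: pick a random document, then a random element in it.
        new = {v: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[v]]) for v in elements}
        for u in elements:
            for v in out_links[u]:                 # follow a random hyperlink
                new[v] += d1 * rank[u] / len(out_links[u])
            for v in children[u]:                  # follow a random child edge
                new[v] += d2 * rank[u] / len(children[u])
            if u in parent:                        # follow the parent edge
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank


# Hypothetical document: root "r" with children "a" and "b", and an IDREF from "a" to "b".
ranks = elem_rank(parent={"a": "r", "b": "r"},
                  hyperlinks=[("a", "b")],
                  doc_of={"r": 1, "a": 1, "b": 1})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this small graph "b" ends up above its sibling "a", since it receives everything "a" receives from the parent plus the rank flowing over the hyperlink.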

32 ElemRank Example
Suppose that d1 = d2 = d3 = 0.3.
In what order will the nodes be ranked? What will be the formula for each node?
[Figure: graph with nodes 1, 2, 3, 4 joined by hyperlink edges and containment edges.]

33 Think About It
Very nice definition of ElemRank. Does it make sense?
Would ElemRank give good results in the following scenarios:
–IDREFs connect articles with articles that they cite
–IDREFs connect managers with their departments
–IDREFs connect cleaning staff with the departments in which they work
–IDREFs connect countries with bordering countries (as in the CIA factbook)

34 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

35 Indexing
We now discuss the index structure.
Recall that we will be ranking according to ElemRank.
Recall that we want to return the “most specific elements”.
How should the data be stored in an index?

36 Naive Method
Treat elements as documents: normal inverted lists
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
Problem: space overhead. How much space is needed in storage?

37 Naive Method
Treat elements as documents: normal inverted lists
Ricardo: 0; 4; 5; 8
XQL: 0; 4; 5; 7
Problem: spurious results. We can’t simply return the intersection of the lists, since if a node satisfies a query, so do all its ancestors.
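Both problems of the naive method show up even on a tiny tree. The following is a minimal sketch, assuming that "treating an element as a document" means indexing it under every word in its whole subtree; the element numbering and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def naive_inverted_lists(parent: Dict[int, Optional[int]],
                         words: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """parent: element id -> parent id (None for the root).
    words: element id -> words that occur directly in that element."""
    lists = defaultdict(set)
    for elem, elem_words in words.items():
        node: Optional[int] = elem
        while node is not None:          # the element and every ancestor "contain" the word
            for word in elem_words:
                lists[word].add(node)
            node = parent[node]
    return {word: sorted(ids) for word, ids in lists.items()}


# Hypothetical tree: 0 is the root, 1 is its child, 2 and 3 are children of 1.
parent = {0: None, 1: 0, 2: 1, 3: 1}
words = {2: ["xql"], 3: ["language"]}

lists = naive_inverted_lists(parent, words)
print(lists)
print(sorted(set(lists["xql"]) & set(lists["language"])))
```

Two word occurrences produce six list entries here, because each occurrence is repeated once per ancestor. The intersection contains element 1 and also its ancestor 0, although only element 1 is the most specific answer.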

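38 Dewey Encoding of ID
Use path information to identify elements: the Dewey ID
An ancestor’s ID is a prefix of its descendant’s ID
Actually (not shown) all the node ids are prefixed by the document number
[Figure: tree labeled with Dewey IDs. The root is 0, with children 0.0, 0.1, 0.2, 0.3; node 0.3 has children 0.3.0, 0.3.1, …; node 0.3.0 has children 0.3.0.0, 0.3.0.1, … Node contents shown include “28 July …”, “XML and …” and “David Carmel …”.]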
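Dewey IDs can be assigned in a single traversal, and the ancestor test becomes a prefix comparison. The following is a minimal sketch, assuming a tree given as nested `(name, children)` tuples and IDs stored as integer tuples (so that component 10 sorts after component 2); the function names are hypothetical.

```python
from typing import Dict, List, Tuple

DeweyID = Tuple[int, ...]
Tree = Tuple[str, List["Tree"]]   # (element name, list of child trees)


def assign_dewey_ids(tree: Tree, doc_number: int) -> Dict[DeweyID, str]:
    """Label every element with its path of child positions, prefixed by the document number."""
    ids: Dict[DeweyID, str] = {}

    def visit(node: Tree, dewey: DeweyID) -> None:
        name, children = node
        ids[dewey] = name
        for position, child in enumerate(children):
            visit(child, dewey + (position,))

    visit(tree, (doc_number, 0))
    return ids


def is_ancestor(a: DeweyID, b: DeweyID) -> bool:
    """True if a is a proper prefix of b, i.e. a is an ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


doc: Tree = ("proceedings", [
    ("paper", [("title", []), ("abstract", []), ("section", [("subsection", [])])]),
    ("paper", []),
])
for dewey, name in sorted(assign_dewey_ids(doc, 47).items()):
    print(".".join(map(str, dewey)), name)
```

Sorting the tuple IDs lists every ancestor immediately before its descendants, which is the order the inverted lists in the later slides rely on.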

39 Dewey Inverted List (DIL)
Store, for each keyword, a list containing:
–the id of the node containing the keyword
–the rank of the node containing the keyword
–the positions of the keyword in the node
Rank and positions are needed to compute ranking.
To simplify, in the following slides, we only store lists of node ids.
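One possible in-memory layout for such a list is sketched below. It is an illustration only: the `Posting` class and `build_dil` are hypothetical names, positions are taken to be word offsets within the element's own text (an assumption), and the rank values are placeholders rather than computed ElemRanks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

DeweyID = Tuple[int, ...]


@dataclass
class Posting:
    dewey_id: DeweyID      # id of the node that directly contains the keyword
    elem_rank: float       # rank of that node
    positions: List[int]   # positions of the keyword inside the node


def build_dil(texts: Dict[DeweyID, str],
              ranks: Dict[DeweyID, float]) -> Dict[str, List[Posting]]:
    """texts: Dewey ID -> text directly contained in that element."""
    dil: Dict[str, List[Posting]] = defaultdict(list)
    for dewey_id in sorted(texts):                 # keeps every list sorted by Dewey ID
        positions: Dict[str, List[int]] = defaultdict(list)
        for offset, word in enumerate(texts[dewey_id].lower().split()):
            positions[word].append(offset)
        for word, offsets in positions.items():
            dil[word].append(Posting(dewey_id, ranks[dewey_id], offsets))
    return dict(dil)


texts = {(47, 0, 0, 0): "XQL and proximal nodes", (47, 0, 0, 1): "the XQL language"}
ranks = {(47, 0, 0, 0): 0.2, (47, 0, 0, 1): 0.1}   # placeholder rank values
for posting in build_dil(texts, ranks)["xql"]:
    print(posting)
```

Unlike the naive lists, a keyword is recorded only at the element that directly contains it, so no entry is repeated for ancestors.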

40 Topics
We discuss:
–Ranking
–The Index Structure
–Query Processing

41 Query Processing
Challenges:
–How do we find nodes that contain all keywords?
–How do we find only the most specific node that contains all keywords?
–Can this be done in a single scan of the inverted keyword lists?

42 Example: Document
[Figure: XML tree of the 47th document in the corpus. The root proceedings has two paper children. The first paper has a title (… XQL …), an abstract (… language …) and a section (… XQL …); the section has a subsection (… XQL language …). The second paper contains … XQL …]

43 Example: Document with IDs
47th document in corpus:
–proceedings: 47.0
–paper: 47.0.0
–title: 47.0.0.0 (… XQL …)
–abstract: 47.0.0.1 (… language …)
–section: 47.0.0.2 (… XQL …)
–subsection: 47.0.0.2.0 (… XQL language …)
–paper: 47.0.1 (… XQL …)

44 Example: Inverted Lists
XQL: 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language: 47.0.0.1; 47.0.0.2.0
Lists contain ids for nodes that directly contain the keyword. Lists are sorted.

45 Example: Inverted Lists
XQL: 47.0.0.0; 47.0.0.2; 47.0.0.2.0; 47.0.1
language: 47.0.0.1; 47.0.0.2.0
We want to find nodes that should be returned. Which? How will they be ranked?

46 Algorithm: Data Structures
–Inverted lists: XQL and language, as above
–Dewey stack, with columns DeweyID, Contains[1], Contains[2], ContainsAll
–Result heap

47 Algorithm: Pseudo Code
Find the smallest next entry in the inverted lists
Find the longest common prefix (lcp) of the entry and the Dewey stack
Pop all non-matching values from the Dewey stack. When popping:
–propagate containment information down the stack (to the parent entry), if containsAll is false
–if containsAll turns from false to true, add the result to the output
Add the non-matching values from the entry onto the Dewey stack
Mark containment for the entry’s keyword
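The steps above can be sketched as a single merge over the sorted lists. This is one reading of the slide rather than the paper's code: ranking and the result heap are left out, results are collected in a plain list, and the name `dil_query` and the stack entry layout are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

DeweyID = Tuple[int, ...]


def dil_query(lists: Sequence[Sequence[DeweyID]]) -> List[DeweyID]:
    """lists[k] is the sorted list of Dewey IDs that directly contain keyword k."""
    n = len(lists)
    stack: List[list] = []        # entries: [component, contains per keyword, contains_all]
    results: List[DeweyID] = []

    def pop_entry() -> None:
        component, contains, contains_all = stack.pop()
        if all(contains):
            # Every keyword was seen outside subtrees that already matched.
            results.append(tuple(e[0] for e in stack) + (component,))
            contains_all = True
        if stack:
            parent = stack[-1]
            if contains_all:
                parent[2] = True  # parent learns that a match exists below it
            else:
                parent[1] = [p or c for p, c in zip(parent[1], contains)]

    tagged = [[(dewey, k) for dewey in lst] for k, lst in enumerate(lists)]
    for dewey, k in heapq.merge(*tagged):          # smallest next entry over all lists
        lcp = 0                                    # longest common prefix with the stack
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:                    # pop non-matching components
            pop_entry()
        for component in dewey[lcp:]:              # push the remaining components
            stack.append([component, [False] * n, False])
        stack[-1][1][k] = True                     # mark containment for this keyword
    while stack:
        pop_entry()
    return results


xql = [(47, 0, 0, 0), (47, 0, 0, 2), (47, 0, 0, 2, 0), (47, 0, 1)]
language = [(47, 0, 0, 1), (47, 0, 0, 2, 0)]
print(dil_query([xql, language]))
```

On the lists from the example slides this reports the subsection 47.0.0.2.0 and the first paper 47.0.0. The section 47.0.0.2 is skipped because its only source of "language" is the subsection that already matched, and each list is read once.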

48 Example: Algorithm
[Figure: the inverted lists for XQL and language from slide 44, an empty Dewey stack with columns DeweyID, Contains[1], Contains[2], ContainsAll, and an empty result heap.]

49 Example: Algorithm
Smallest entry is for keyword 1, XQL (47.0.0.0). lcp with Dewey stack = none. Pop (nothing). Add (all).

50 Example: Algorithm
Dewey stack (bottom to top): 47, 0, 0, 0. Contains[1] is marked on the top entry.

51 Example: Algorithm
Next smallest entry is for keyword 2, language (47.0.0.1). lcp with Dewey stack = 47.0.0. Pop non-matching entries.

52 Example: Algorithm
Dewey stack (bottom to top): 47, 0, 0. Contains[1] is marked on the top entry, propagated from the popped entry. Add additional entries.

53 Example: Algorithm
Dewey stack (bottom to top): 47, 0, 0, 1. Contains[2] is marked on the top entry; Contains[1] is still marked on the entry below it.

54 Example: Algorithm
Next smallest entry is for keyword 1, XQL (47.0.0.2). lcp with Dewey stack = 47.0.0. Pop non-matching entries.

55 Example: Algorithm
Dewey stack (bottom to top): 47, 0, 0. After the pop, both Contains[1] and Contains[2] are marked on the top entry (47.0.0).
Continue on Blackboard!

