COMP630 Paper Presentation by Haomian(Eric) Wang.

The paper to be presented
XRank: Ranked Keyword Search over XML Documents
Authors: Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram
Published: SIGMOD 2003

What will be presented
• Keyword search in XML
• Contribution of XRank
• The details in XRank
• Experiment
• Conclusion

Keyword search in XML
Issues in keyword search
• How to find the objects that match the keywords? (The search process)
• How to rank the search results? (The ranking scheme)

Keyword search in XML (cont.)
Challenges of keyword search over XML
• The result of a keyword search query is not always an entire document; it can be a deeply nested XML element.
• Since XML keyword search can return nested elements, ranking has to be done at the granularity of elements.
• The notion of proximity among keywords is more complex for XML. It must consider both keyword distance (i.e., width in the XML tree) and ancestor distance (i.e., height in the XML tree).

Challenges of keyword search over XML (Example 1)
Keyword query: “XQL language”

Keyword search in XML (cont.)
Challenges of keyword search over XML
• The result of a keyword search query is not always an entire document; it can be a deeply nested XML element.
• Since XML keyword search can return nested elements, ranking has to be done at the granularity of elements, which is more complex than ranking at the granularity of documents.
• The notion of proximity among keywords is more complex for XML. It must consider both keyword distance (i.e., width in the XML tree) and ancestor distance (i.e., height in the XML tree).

Challenges of keyword search over XML (Example 2)
Different papers in the XML document in Figure 1 can have different rankings depending on the underlying hyperlinked structure.

Challenges of keyword search over XML (Example 3)
Keyword query: “Soffer XQL”
Although the distance between the keywords is small, the XML element that contains both keywords (in line 1) is not a direct parent of either keyword, and is not very proximal to either keyword.

Contribution of XRank
• The problem definition and system architecture for ranked keyword search over hierarchical and hyperlinked XML documents
• Computing a ranking of XML elements that takes into account both hyperlink and containment edges
• New inverted list index structures and associated query processing algorithms for evaluating XML keyword search queries

The details in XRank
• Data model and query semantics
• Computing ElemRanks
• Evaluating XML keyword search queries

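The details in XRank: Data model and Query semantics
XML Data Model
• A collection of hyperlinked XML documents is defined as a directed graph G = (N, CE, HE), where N is the set of nodes, CE is the set of containment edges, and HE is the set of hyperlink edges.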

The details in XRank: Data model and Query semantics (cont.)
[Figure: example XML graph; legend: element, value, intra-hyperlink, inter-hyperlink]

The details in XRank: Data model and Query semantics (cont.)
Keyword Query Results
• Only conjunctive keyword query semantics are considered (return the elements that contain all of the keywords).
• Consider a keyword search query:
• Define a set:

The details in XRank: Data model and Query semantics (cont.)
Keyword Query Results
• The result of the query Q is defined below:
• Result(Q) is the set of elements that contain at least one occurrence of every query keyword, after excluding the occurrences of the keywords in sub-elements that already contain all of the query keywords.

The details in XRank: Data model and Query semantics (cont.)
Keyword Query Results
• Only the most specific results are returned for a keyword search query.
Example query: “XQL language”

The details in XRank: Data model and Query semantics (cont.)
Keyword Query Results
• An element that has multiple independent occurrences of the query keywords is returned, even if a sub-element of that element already contains all of the query keywords.
Example query: “XQL language”

The details in XRank: Data model and Query semantics (cont.)
Ranking keyword query results
Desired properties of the ranking function:
• Result specificity: The ranking function should rank more specific results higher than less specific results. For example, a result in which the keywords occur in the same subsection should rank higher than one in which the keywords occur in different subsections (height in the XML tree).
• Keyword proximity: The ranking function should take the proximity of the query keywords into account (width in the XML tree). Note that a result can have high keyword proximity and low specificity, and vice versa.
• Hyperlink awareness: The ranking function should use the hyperlinked structure of XML documents.

The details in XRank: Data model and Query semantics (cont.)
Ranking keyword query results
• Consider the query:
• Ranking with respect to one keyword:

The details in XRank: Data model and Query semantics (cont.)
One of the keywords: “structured”

The details in XRank: Data model and Query semantics (cont.)
Ranking keyword query results
• Ranking with respect to one keyword: Intuitively, the rank of v1 with respect to a keyword ki is ElemRank(vt), scaled appropriately to account for the specificity of the result. This ensures that less specific results get lower ranks. ElemRank(vt) is in fact related to ElemRank(v1) due to certain properties of containment edges.

The details in XRank: Data model and Query semantics (cont.)
Ranking keyword query results
• Ranking with respect to one keyword: The discussion above implicitly assumes that there is only one relevant occurrence of the query keyword ki in v1. If there are multiple (say, m) relevant occurrences of ki, the rank of each occurrence is first computed using the above formula. Let the computed ranks be r1, r2, …, rm. The combined rank is f(r1, r2, …, rm), where f is some aggregation function. f = max is the default, but other choices (such as f = sum) are also supported.
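The per-keyword ranking described above can be sketched as follows. This is only an illustration: the slide's scaling formula is not shown in this transcript, so the geometric `decay` factor, the `levels_below` parameter, and both function names are assumptions introduced here. Only the aggregation step (f = max by default, f = sum as an alternative) comes directly from the slide.

```python
def occurrence_rank(elem_rank_vt, levels_below, decay=0.75):
    """Rank of one keyword occurrence, scaled down for less specific results.

    ASSUMPTION: the scaling is a geometric decay per containment level between
    the result element and the element that directly contains the keyword.
    The decay value 0.75 is arbitrary.
    """
    return elem_rank_vt * decay ** levels_below


def combined_rank(occurrence_ranks, f=max):
    """Combine the ranks r1..rm of multiple occurrences with an aggregation f."""
    return f(occurrence_ranks)


# Three occurrences of the same keyword at different depths below the result.
ranks = [occurrence_rank(0.8, 0), occurrence_rank(0.6, 1), occurrence_rank(0.9, 3)]
print(combined_rank(ranks))         # default aggregation: max
print(combined_rank(ranks, f=sum))  # alternative aggregation: sum
```

With f = max, a single strong occurrence determines the rank; with f = sum, many weak occurrences can add up to outrank it.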

The details in XRank: Data model and Query semantics (cont.)
Ranking keyword query results
• Ranking with respect to all query keywords:

The details in XRank: Computing ElemRanks
• ElemRank is a measure of the objective importance of an XML element, and is computed based on the hyperlinked structure of XML documents.
• ElemRank is similar to Google’s PageRank, but is computed at the granularity of an element and takes the nested structure of XML into account.
• The ElemRank algorithm is developed as a series of refinements to the PageRank algorithm.

The details in XRank: Computing ElemRanks (cont.)
PageRank: (1)
PageRank propagates along only one direction. This unidirectional PageRank propagation for HTML documents corresponds to the intuition that if an important page p1 points to a page p2, then p2 is likely to be important. However, if p1 points to an important page p3, that does not tell us anything about the importance of p1 (consider relatively obscure HTML pages that point to Yahoo). Therefore……

The details in XRank: Computing ElemRanks (cont.)
PageRank cannot be directly adapted to ElemRank
• Forward ElemRank propagation
• Reverse ElemRank propagation
• In general, containment implies a tighter relationship than hyperlinks, and hence argues for a bi-directional transfer of ElemRanks.
[Diagram: two Element / Sub-Element pairs, each with one high-ElemRank node, annotated “Correct in ElemRank, not correct in PageRank”]

The details in XRank: Computing ElemRanks (cont.) Add reverse containment edges. e(v) is used to denote the ElemRank of an element v (for notational convenience, we set e(v) of a value node v to be 0). (2)

The details in XRank: Computing ElemRanks (cont.)
The formula still has a shortcoming: it does not distinguish between containment and hyperlink edges when computing ElemRanks.
• As an illustration, consider a paper that has few sections and many references. As per the above formula, the ElemRank of the paper is uniformly distributed among all the sections and references. Thus, the larger the number of references in a paper, the less important each section of the paper is likely to be, which is not very intuitive.
• In general, the problem is that hyperlinks and containment edges are treated similarly, even though these two factors are usually independent. This argues for discriminating between containment and hyperlink edges.

The details in XRank: Computing ElemRanks (cont.)
Discrimination between containment and hyperlink edges (3)
• d1 and d2 are the probabilities of navigating through hyperlinks and containment links, respectively.
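Formula (3) itself is not shown in this transcript. As a rough sketch of the idea only, the iteration below assumes a PageRank-style update in which a share d1 of an element's rank flows along its outgoing hyperlinks and a share d2 flows along its containment edges in both directions, each divided evenly among the edges of that kind. The function name, the base term (1 - d1 - d2)/N, the default values of d1 and d2, and the fixed iteration count are all assumptions made for illustration, not the paper's definition.

```python
def elem_rank_sketch(nodes, hyperlinks, containment, d1=0.35, d2=0.25, iterations=50):
    """Illustrative rank propagation that separates hyperlink and containment edges.

    nodes:       list of element names
    hyperlinks:  list of (source, target) pairs
    containment: list of (parent, child) pairs, followed in both directions
    """
    n = len(nodes)
    hyper_out = {v: [] for v in nodes}
    for src, dst in hyperlinks:
        hyper_out[src].append(dst)
    contain_nbrs = {v: [] for v in nodes}
    for parent, child in containment:
        contain_nbrs[parent].append(child)  # forward containment edge
        contain_nbrs[child].append(parent)  # reverse containment edge

    e = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - d1 - d2) / n for v in nodes}  # assumed base term
        for u in nodes:
            for v in hyper_out[u]:
                nxt[v] += d1 * e[u] / len(hyper_out[u])
            for v in contain_nbrs[u]:
                nxt[v] += d2 * e[u] / len(contain_nbrs[u])
        e = nxt
    return e


# A paper with two sections; another paper cites (hyperlinks to) section 1 only.
ranks = elem_rank_sketch(
    nodes=["paper1", "sec1", "sec2", "paper2"],
    hyperlinks=[("paper2", "sec1")],
    containment=[("paper1", "sec1"), ("paper1", "sec2")],
)
print(ranks["sec1"] > ranks["sec2"])  # the cited section ends up ranked higher
```

Because hyperlink and containment shares are divided separately, adding more outgoing hyperlinks to an element in this sketch does not reduce what its sub-elements receive through containment.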

The details in XRank: Computing ElemRanks (cont.)
The above formula still has a problem: it weights forward and reverse containment relationships similarly.
• In general, the ElemRanks of sub-elements should be inversely proportional to the number of sibling sub-elements.
• However, the ElemRank of a parent element should be directly proportional to the aggregate of the ElemRanks of its sub-elements. For instance, a workshop that contains many important papers should have a higher ElemRank than a workshop that contains only one important paper.

The details in XRank: Computing ElemRanks (cont.)
[Figure: node labels u with v, v2, v3, and v with u1, u2, u3]

The details in XRank: Computing ElemRanks (cont.) Final version of ElemRanks (4)

The details in XRank: XRANK Architecture

The details in XRank: Evaluating XML keyword queries
Focus on two problems (space and efficiency):
• How to organize the inverted lists (files)
• How to evaluate the query (query processing)
Three solutions:
• Dewey Inverted List (DIL)
• Ranked Dewey Inverted List (RDIL)
• Hybrid Dewey Inverted List (HDIL)

The details in XRank: Evaluating XML keyword queries (cont.)
Dewey IDs jointly capture ancestor and descendant information.

The details in XRank: Evaluating XML keyword queries (cont.)
Dewey Inverted List (DIL)
DIL: Data structure
• The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
• To handle multiple documents, the first component of each Dewey ID is the document ID.
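A minimal sketch of this data structure, using Python's standard `xml.etree.ElementTree` parser. The numbering scheme is an assumption: the root element is given the ID `(doc_id,)` and children are numbered from 0 in document order. The function name, the whitespace tokenization, and the lowercasing are also illustrative choices, and ElemRank values are left out of the list entries.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dewey_inverted_lists(doc_id, xml_text):
    """Map each keyword to the sorted Dewey IDs of elements that directly contain it."""
    root = ET.fromstring(xml_text)
    lists = defaultdict(list)

    def visit(elem, dewey):
        # Text directly inside this element: its own text plus the tails of its children.
        direct_text = [elem.text or ""] + [child.tail or "" for child in elem]
        for word in " ".join(direct_text).lower().split():
            if dewey not in lists[word]:
                lists[word].append(dewey)
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))  # ASSUMED numbering: children from 0

    visit(root, (doc_id,))
    return {word: sorted(ids) for word, ids in lists.items()}


sample = (
    "<workshop><title>XML search</title>"
    "<paper><title>XQL language</title></paper></workshop>"
)
index = build_dewey_inverted_lists(5, sample)
print(index["xql"])  # Dewey IDs as tuples; the first component is the document ID
```

Storing IDs as tuples means Python's built-in tuple ordering matches Dewey ID order, so each list can simply be kept sorted.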

The details in XRank: Evaluating XML keyword queries (cont.)
Dewey Inverted List (DIL)
DIL: Query Processing
• The key idea is to merge the query keyword inverted lists, and simultaneously compute the longest common prefix of the Dewey IDs in the different lists.
• Since each prefix of a Dewey ID is the ID of an ancestor, computing the longest common prefix automatically yields the ID of the deepest ancestor that contains all the query keywords.
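The prefix property can be illustrated with a small helper. This is only the longest-common-prefix step, not the full merge over inverted lists; the function name and the tuple representation of Dewey IDs are assumptions made for the example.

```python
def longest_common_prefix(*dewey_ids):
    """Return the Dewey ID of the deepest common ancestor of the given IDs.

    Each Dewey ID is a tuple of integers. An empty result means the IDs do
    not even share a document ID.
    """
    prefix = []
    for components in zip(*dewey_ids):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return tuple(prefix)


# Two elements in document 5 whose deepest common ancestor is 5.0.3.0.
print(longest_common_prefix((5, 0, 3, 0, 0), (5, 0, 3, 0, 1)))
# Elements in different documents share no ancestor.
print(longest_common_prefix((5, 0, 3), (6, 0, 3)))
```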

The details in XRank: Evaluating XML keyword queries(cont.) Dewey Inverted List (DIL)  DIL: Query Processing

The details in XRank: Evaluating XML keyword queries (cont.)
Dewey Inverted List (DIL)
Drawback:
• If inverted lists are long (due to common keywords or large document collections), even the cost of a single scan of the inverted lists can be expensive, especially if users want only the top few results.
• One solution is to order the inverted lists by ElemRank instead of by Dewey ID. In this way, higher ranked results are likely to appear first in the inverted lists, and query processing can usually be terminated without scanning all of the inverted lists.

The details in XRank: Evaluating XML keyword queries (cont.)
Ranked Dewey Inverted List (RDIL)
RDIL: Data structure
• RDIL is similar to DIL, except that the inverted lists are ordered by ElemRank instead of Dewey ID.
• In addition, each inverted list has a B+-tree index on the Dewey ID field.

The details in XRank: Evaluating XML keyword queries (cont.)
Ranked Dewey Inverted List (RDIL)
RDIL: Query Processing
• Consider an entry retrieved from the inverted list of keyword ki. The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword ki.
• To determine a query result, however, we need the longest prefix of d that also contains the other query keywords. The B+-trees on the other keywords' inverted lists can be used to find this prefix efficiently.

The details in XRank: Evaluating XML keyword queries (cont.)
Ranked Dewey Inverted List (RDIL)
RDIL: Query Processing
• Consider the query ‘XQL Ricardo’.
• Assume the top-ranked Dewey ID that contains the keyword ‘XQL’ is 9.0.4.2.0.
• Assume the leaf nodes of the B+-tree for the ‘Ricardo’ inverted list hold the Dewey IDs “…, 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, …” (the leaf nodes of the B+-tree are ordered by Dewey ID).
• Determine the smallest Dewey ID in the ‘Ricardo’ B+-tree that is larger than 9.0.4.2.0, which in this example is 9.0.5.6. Its immediate predecessor, 9.0.4.1.2, shares the prefix 9.0.4 with 9.0.4.2.0, which is longer than the prefix 9.0 shared with 9.0.5.6, so the longest common prefix is 9.0.4.
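The lookup in this example can be imitated with a sorted Python list standing in for the B+-tree leaf level and `bisect` standing in for the index probe. This is a sketch under that substitution, not an actual B+-tree; the function names are hypothetical. It relies on the fact that, in a list sorted by Dewey ID, the entry sharing the longest prefix with a target is always one of the target's two neighbours in sort order.

```python
import bisect


def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def longest_prefix_in_list(target, sorted_ids):
    """Longest prefix of `target` shared with any Dewey ID in `sorted_ids`.

    `sorted_ids` plays the role of the B+-tree leaf level (ordered by Dewey ID).
    Only the successor and predecessor of `target` need to be inspected.
    """
    pos = bisect.bisect_right(sorted_ids, target)
    neighbours = []
    if pos < len(sorted_ids):
        neighbours.append(sorted_ids[pos])      # smallest ID larger than target
    if pos > 0:
        neighbours.append(sorted_ids[pos - 1])  # largest ID not larger than target
    return max((common_prefix(target, n) for n in neighbours), key=len, default=())


ricardo = [(8, 2, 1, 4, 2), (9, 0, 4, 1, 2), (9, 0, 5, 6), (10, 8, 3)]
print(longest_prefix_in_list((9, 0, 4, 2, 0), ricardo))  # → (9, 0, 4)
```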

The details in XRank: Evaluating XML keyword queries (cont.)
Ranked Dewey Inverted List (RDIL)
RDIL: Query Processing
• Given that the longest common prefix can potentially have a low overall rank, how can we determine when we have the top m results so that we can stop scanning the inverted lists?
• To derive a stopping condition that still guarantees output of the top-m results, we build upon the provably optimal Threshold Algorithm (TA).
• TA computes a threshold at every point during the scan of the inverted lists. If there are at least m elements in the output heap that have an overall rank greater than or equal to the current threshold, the algorithm can stop scanning the lists.
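The stopping test on its own can be sketched as below. How the threshold is derived from the scan positions is not covered here, so it is simply passed in; the function name and the plain list of ranks standing in for the output heap are assumptions.

```python
import heapq


def can_stop_scanning(result_ranks, threshold, m):
    """True if at least m results so far have overall rank >= threshold.

    result_ranks: overall ranks of the results found so far
    threshold:    current bound on the rank of any result not yet seen (given)
    m:            number of results the user asked for
    """
    top_m = heapq.nlargest(m, result_ranks)
    return len(top_m) == m and top_m[-1] >= threshold


print(can_stop_scanning([0.9, 0.8, 0.3], threshold=0.5, m=2))  # → True
print(can_stop_scanning([0.9, 0.3], threshold=0.5, m=2))       # → False
```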

The details in XRank: Evaluating XML keyword queries (cont.)
Ranked Dewey Inverted List (RDIL)
RDIL: Drawbacks
• Although RDIL performs well in many cases, there are certain cases where it performs much worse than DIL.
• For example, consider a query where the keywords are not very correlated, i.e., the individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document. Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output, incurring the cost of random index lookups along the way. In contrast, DIL sequentially scans the inverted lists, and is likely to be faster.

The details in XRank: Evaluating XML keyword queries (cont.)
Hybrid Dewey Inverted List (HDIL)
HDIL: Data structure
• RDIL is likely to outperform DIL only if it scans a small fraction of the full inverted list. Consequently, we can store the full inverted list sorted by Dewey ID (for DIL), and store only a small fraction of the inverted list sorted by rank (for RDIL).

The details in XRank: Evaluating XML keyword queries (cont.)
Hybrid Dewey Inverted List (HDIL)
HDIL: Query Processing
• Start evaluating the query using RDIL, and periodically monitor its performance to calculate (a) the time spent so far, denoted t, and (b) the number of results above the threshold so far, denoted r.
• Estimate the remaining time for RDIL as (m-r)*t/r, where m is the desired number of query results.
• If this estimated time is more than the expected time for DIL, switch to DIL.
• The expected time for DIL is easy to compute a priori for a given machine configuration, because it mainly depends on the number of query keywords and the size of each query keyword's inverted list (since DIL scans inverted lists fully in all cases).
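The switching rule can be written out directly from the estimate (m-r)*t/r. The function names are hypothetical, and one case is an assumption added here: when no result has been found yet (r = 0) the estimate is undefined, and this sketch treats it as infinite.

```python
import math


def estimate_rdil_remaining(t, r, m):
    """Estimated remaining RDIL time: (m - r) * t / r.

    t: time spent so far, r: results above the threshold so far,
    m: desired number of results.
    ASSUMPTION: with r == 0 the estimate is treated as infinite.
    """
    if r >= m:
        return 0.0
    if r == 0:
        return math.inf
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL's estimated remaining time exceeds DIL's expected time."""
    return estimate_rdil_remaining(t, r, m) > expected_dil_time


# After 2.0 time units RDIL has 4 of the 10 requested results.
print(estimate_rdil_remaining(2.0, 4, 10))                        # → 3.0
print(should_switch_to_dil(2.0, 4, 10, expected_dil_time=2.5))    # → True
```

Under the r = 0 assumption a caller would want to wait for the first monitoring interval before applying the rule, since otherwise the switch to DIL would happen immediately.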

Experiment
Naïve approach: treat each element as a document, and use regular document-oriented keyword search methods.
• Naïve-ID: the inverted list is ordered by ID
• Naïve-Rank: the inverted list is ordered by rank
Data sets: both DBLP and XMark were used for the experiments.

Experiment
Both the DBLP and XMark data sets were used for the experiments. The default number of query results is 10.

Conclusion
• The ranking function and the ElemRank function are well defined.
• The query result is not so well defined (e.g., compared to XSeek).
• The experiments are not sufficient.

The End Q&A
