XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.


1 XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database Systems - Semester Project

2 OUTLINE
- Introduction
- Ranking Idea
- Search Techniques
- Experimental Evaluations
- Conclusion

3 INTRODUCTION
- Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
- XML can have user-defined tags, which can be nested.
- HTML is a presentation language and hence cannot capture much semantics.
- HTML search techniques cannot be applied directly to XML search.
- XQuery is too complicated for end users.
- XRank provides a simple keyword search query interface.

4 INTRODUCTION
Challenges:
- The element containing the search keywords is returned.
- Ranking of the elements depends on several factors.
- Keyword proximity has to be considered in two dimensions: keyword distance and ancestor distance.

5 INTRODUCTION
XML Data Model:
- A collection of hyperlinked XML documents can be defined as a directed graph G = (N, CE, HE), where:
  - N: the set of nodes, N = NE ∪ NV
  - NE: the set of elements
  - NV: the set of values
  - CE: the set of containment edges relating nodes
  - HE: the set of hyperlink edges relating nodes
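The graph G = (N, CE, HE) can be sketched as a small data structure. The class and field names below (`XmlGraph`, `elements`, `values`, `containment`, `hyperlinks`) are hypothetical and chosen only for illustration; the slides do not prescribe a representation.

```python
from dataclasses import dataclass, field


@dataclass
class XmlGraph:
    """Illustrative container for G = (N, CE, HE); names are assumptions."""
    elements: set = field(default_factory=set)     # NE: element nodes
    values: set = field(default_factory=set)       # NV: value nodes
    containment: set = field(default_factory=set)  # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)   # HE: (source, target) pairs

    @property
    def nodes(self):
        # N = NE ∪ NV
        return self.elements | self.values


# A tiny example: a paper element containing a title value and citing another paper.
g = XmlGraph(
    elements={"paper1", "title1", "paper2"},
    values={"XRANK"},
    containment={("paper1", "title1"), ("title1", "XRANK")},
    hyperlinks={("paper1", "paper2")},
)
```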

6 RANKING IDEA
- ElemRank: for ranking a single element.
- Overall rank: for ranking an ancestor of an element by considering the ElemRank of the child element.

7 RANKING IDEA – ELEMRANK
- ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
- It is obtained by refining Google's PageRank algorithm.
- PageRank: the PageRank of a document v, p(v), is
  $$p(v) = \frac{1-d}{N_d} + d \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)}$$
  where
  - N_d is the total number of documents.
  - N_h(u) is the number of outgoing hyperlinks from document u.
  - d is a constant (typically 0.85).

8 RANKING IDEA – ELEMRANK
- But PageRank propagates rank in one direction only.
- ElemRank (denoted by the function e()) needs to be bidirectional, so reverse containment edges are added to the formula, which uses the following symbols:
  - v: the element for which the rank is being calculated.
  - N_e: the number of XML elements.
  - N_h(u): the number of outgoing hyperlinks from element u.
  - N_c(u): the number of sub-elements of u.
  - d: a constant (typically 0.85).
  - E = HE ∪ CE ∪ CE^-1, where CE^-1 is the set of reverse containment edges.
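The slide's formula is not reproduced here. One form consistent with the symbols on this slide, offered as a reconstruction rather than a quotation, is given below. The extra 1 in the denominator (for the single reverse containment edge from u to its parent) is an assumption.

```latex
e(v) = \frac{1-d}{N_e} + d \sum_{(u,v) \in E} \frac{e(u)}{N_h(u) + N_c(u) + 1},
\qquad E = HE \cup CE \cup CE^{-1}
```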

9 RANKING IDEA – ELEMRANK
- But containment edges and hyperlink edges need to be differentiated.
- After differentiating the hyperlink edges from the containment edges, the formula uses the following symbols:
  - v: the element for which the rank is being calculated.
  - N_e: the number of XML elements.
  - N_h(u): the number of outgoing hyperlinks from element u.
  - N_c(u): the number of sub-elements of u.
  - d1, d2: the probabilities of navigating through hyperlinks and through containment edges (forward or reverse), respectively.
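The slide's formula is not reproduced here. A reconstruction consistent with the symbols on this slide would split the single sum into a hyperlink term and a containment term. As an assumption, the containment term still treats forward and reverse edges together and keeps the extra 1 for the parent edge.

```latex
e(v) = \frac{1 - d_1 - d_2}{N_e}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE \cup CE^{-1}} \frac{e(u)}{N_c(u) + 1}
```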

10 RANKING IDEA – ELEMRANK
- But the previous formula weights forward and reverse containment relationships equally.
- After differentiating the hyperlink edges, containment edges, and reverse containment edges, the formula uses the following symbols:
  - v: the element for which the rank is being calculated.
  - N_e: the number of XML elements.
  - N_h(u): the number of outgoing hyperlinks from element u.
  - N_de(v): the number of elements in the XML document containing the element v.
  - N_c(u): the number of sub-elements of u.
  - d1, d2, and d3: the probabilities of navigating through hyperlinks, forward containment edges, and reverse containment edges, respectively.
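The slide's final formula is not reproduced here. The sketch below assumes it has the form e(v) = (1 - d1 - d2 - d3) / (N_d * N_de(v)) + d1 * Σ_HE e(u)/N_h(u) + d2 * Σ_CE e(u)/N_c(u) + d3 * Σ_CE^-1 e(u), with N_d the number of documents, and evaluates it by simple fixed-point iteration. The function name `elem_rank`, its input layout, the default values of d1, d2, d3, and the iteration count are all illustrative assumptions, not values taken from the slides.

```python
from collections import Counter


def elem_rank(parent, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Iterate the assumed ElemRank formula.

    parent:     dict element -> parent element (None for a document root)
    hyperlinks: list of (source, target) element pairs (HE)
    doc_of:     dict element -> id of the document containing it
    """
    elems = list(parent)
    children = {e: [] for e in elems}
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)
    out_links = {e: [] for e in elems}
    for u, v in hyperlinks:
        out_links[u].append(v)

    n_docs = len(set(doc_of.values()))   # N_d
    n_de = Counter(doc_of.values())      # N_de, counted per document

    rank = {e: 1.0 / len(elems) for e in elems}
    for _ in range(iters):
        # Random-jump term, scaled by document count and document size.
        new = {e: (1 - d1 - d2 - d3) / (n_docs * n_de[doc_of[e]]) for e in elems}
        for u in elems:
            for v in out_links[u]:       # hyperlink edges
                new[v] += d1 * rank[u] / len(out_links[u])
            for v in children[u]:        # forward containment edges
                new[v] += d2 * rank[u] / len(children[u])
            if parent[u] is not None:    # reverse containment edge
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank
```

With a single document whose root `r` has two children, the root ends up ranked above its children, because each child passes rank up through its reverse containment edge.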

11 RANKING IDEA – OVERALL RANK
- The rank of v_1 with respect to one occurrence of keyword k_i is calculated from the element v_t that directly contains k_i.
- decay is a parameter that can be set to a value in the range 0 to 1.
- For multiple occurrences of k_i in v_1, the combined rank is f(r_1, r_2, …, r_m), where the function f is the maximum of the ranks of v_1 with respect to the m occurrences of k_i.

12 RANKING IDEA – OVERALL RANK
- The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v_1, k_1, k_2, …, k_n).
- The function p(v_1, k_1, k_2, …, k_n) can be any function that ranges from 0 to 1.
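The two overall-rank slides can be sketched in a few lines. The per-occurrence rank is assumed here to be r(v_1, k_i) = ElemRank(v_t) * decay^(t-1), where t is the length of the containment path from v_1 down to v_t (t = 1 when v_1 itself directly contains the keyword). The function names, the input layout, and the default decay value are hypothetical, and the proximity measure is passed in as a plain number between 0 and 1.

```python
def keyword_rank(occurrences, decay=0.75):
    """Combined rank of v_1 for one keyword.

    occurrences: list of (elem_rank_of_v_t, t) pairs, one per occurrence
                 of the keyword below v_1.
    The combining function f is taken to be max, as on the slide.
    """
    return max(er * decay ** (t - 1) for er, t in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay=0.75):
    """Sum of the per-keyword ranks, multiplied by keyword proximity."""
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return total * proximity
```

For example, with decay 0.5, an occurrence directly in v_1 with ElemRank 0.8 outranks an occurrence two levels further down with ElemRank 1.0, since 1.0 * 0.5^2 = 0.25.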

13 SEARCH TECHNIQUES – NAÏVE APPROACH
- Main difference between XML and HTML keyword search: the granularity of query results
  - XML keyword search returns elements
  - HTML keyword search returns documents
- One way to do XML keyword search: treat each element as a document
- Problems:
  - Space overhead
  - Spurious query results
  - Inaccurate ranking of results

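14 SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
- Dewey IDs idea: each element's ID is its parent's ID followed by the element's position among its siblings, so the ID of an ancestor is a prefix of the IDs of its descendants.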

15 SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
- For each keyword, an inverted list of all the elements that contain the keyword is created.
- Each entry has three fields: the Dewey ID of the element, its ElemRank, and the positions in the element where the keyword occurs.
- The list is sorted by Dewey ID.
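A minimal sketch of such an inverted list follows. The names `Posting`, `parse_dewey`, and `build_inverted_list` are hypothetical. Dewey IDs are stored as tuples of integers so that sorting compares component by component rather than as strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One inverted-list entry; the field names are illustrative."""
    dewey_id: tuple    # e.g. (5, 0, 3, 0, 0)
    elem_rank: float   # ElemRank of the element
    positions: tuple   # positions of the keyword within the element


def parse_dewey(text):
    """Turn '5.0.3.0' into (5, 0, 3, 0)."""
    return tuple(int(part) for part in text.split("."))


def build_inverted_list(postings):
    """Return the postings for one keyword sorted by Dewey ID."""
    return sorted(postings, key=lambda p: p.dewey_id)
```

Using integer tuples matters: as strings, "10.8.3" would sort before "9.0.5.6", whereas (9, 0, 5, 6) correctly precedes (10, 8, 3).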

16 SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
- This algorithm works in a single pass.
- The key idea is to merge the keyword inverted lists while simultaneously computing the longest common prefix of the Dewey IDs in the different lists.

17 SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
Example Dewey IDs from the figure: 5.0.3.0, 5.0.3.0.0, 5.0.3.0.1
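The longest-common-prefix step at the heart of the merge can be sketched as follows. This shows only the prefix computation, not the full single-pass DIL merge, and the function name `longest_common_prefix` is an assumption. Dewey IDs are represented as tuples of integers.

```python
def longest_common_prefix(*dewey_ids):
    """Return the longest shared prefix of the given Dewey IDs.

    The prefix is the Dewey ID of the deepest element that contains
    all of the elements identified by dewey_ids.
    """
    prefix = []
    for components in zip(*dewey_ids):
        if len(set(components)) != 1:
            break  # first position where the IDs disagree
        prefix.append(components[0])
    return tuple(prefix)
```

For the IDs above, 5.0.3.0.0 and 5.0.3.0.1 share the prefix 5.0.3.0, which is their parent element.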

18 SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
- “If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results”
- We can instead start directly with the elements that are likely to have higher ranks.
- In this way, we compute only the top m results requested by the user rather than all of them.

19 SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
In RDIL:
- Inverted lists are ordered by ElemRank.
- Each inverted list has a B+-tree index on the Dewey ID field.

20 SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Working:
- Pick an arbitrary keyword k_i and take the Dewey ID of a top-ranked element containing k_i.
- Now pick another keyword k_j and, from its B+-tree (which is sorted by Dewey ID), pick the smallest Dewey ID that is greater than the Dewey ID chosen for k_i.
- The longest common prefix, which identifies the deepest element containing both keywords, will come from either the Dewey ID just picked or its predecessor in the B+-tree.

21 SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Example:
- Consider the query “XQL Ricardo”.
- Dewey ID 9.0.4.2.0 is a top-ranked Dewey ID that contains the keyword “XQL”.
- Pick the smallest Dewey ID greater than 9.0.4.2.0 from the leaf nodes of the B+-tree for the keyword “Ricardo”.
- Consider the IDs 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, … in the B+-tree of “Ricardo”.
- We pick the ID 9.0.5.6, as it is the first one greater than 9.0.4.2.0.
- The Dewey ID with the longest common prefix will be either 9.0.5.6 or its predecessor, 9.0.4.1.2.
- The element with Dewey ID 9.0.4 will contain both “XQL” and “Ricardo”.
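The lookup in this example can be sketched with a sorted Python list standing in for the leaf level of the B+-tree. That substitution is an assumption made to keep the sketch short, and the function names are hypothetical. `bisect_right` from the standard library finds the first ID greater than the target. The answer is then the longer of the two common prefixes, one with that ID and one with its predecessor.

```python
from bisect import bisect_right


def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as integer tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def deepest_common_ancestor(target, sorted_ids):
    """Deepest element containing `target` and some ID in `sorted_ids`.

    sorted_ids must be sorted; it stands in for the B+-tree leaves.
    """
    i = bisect_right(sorted_ids, target)
    candidates = []
    if i < len(sorted_ids):
        candidates.append(sorted_ids[i])      # first ID greater than target
    if i > 0:
        candidates.append(sorted_ids[i - 1])  # its predecessor
    return max((common_prefix(target, c) for c in candidates), key=len, default=())


xql_id = (9, 0, 4, 2, 0)
ricardo_ids = [(8, 2, 1, 4, 2), (9, 0, 4, 1, 2), (9, 0, 5, 6), (10, 8, 3)]
ancestor = deepest_common_ancestor(xql_id, ricardo_ids)
```

Here 9.0.5.6 shares only the prefix 9.0 with 9.0.4.2.0, while the predecessor 9.0.4.1.2 shares 9.0.4, so `ancestor` is `(9, 0, 4)`, matching the slide.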

22 SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
- Consider a query whose keywords each occur relatively frequently in the document collection but rarely occur together in the same document.
- RDIL has to scan most (or all) of the inverted lists to produce the output.
- The overhead of performing random index lookups in RDIL can sometimes outweigh the benefit of processing the inverted lists in rank order.

23 SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)
- The key idea here is to combine the benefits of both DIL and RDIL.
- We dynamically switch between RDIL and DIL depending on the query performance.
- So we need the inverted list sorted by ElemRank for RDIL and by Dewey ID for DIL.
- But RDIL is likely to outperform DIL only if it scans a small fraction of the full inverted list.
- So we store only a small fraction of the inverted list sorted by rank.

24 SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)

25 The dynamic switching between RDIL and DIL is based on the following factors:
- The time spent so far, t
- The number of results above the threshold so far, r
- Based on these, we estimate the remaining time for RDIL as (m - r) * t / r, where m is the number of results requested.
- Switch to DIL if this estimate is more than the expected time for DIL.
- We initially start with RDIL and then switch to DIL based on the above computation.
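The switching rule can be sketched directly from the estimate above. The function names are hypothetical. How the expected DIL time is obtained is not covered here, so it is passed in as a parameter. Treating r = 0 as an unbounded estimate is an assumption made only to avoid division by zero.

```python
def estimated_rdil_remaining(t, r, m):
    """Estimate the remaining RDIL time as (m - r) * t / r.

    t: time spent so far
    r: number of results above the threshold so far
    m: number of results requested
    """
    if r == 0:
        # Assumption: with no results yet, the estimate is unbounded.
        return float("inf")
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL's estimated remaining time exceeds DIL's expected time."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time
```

For instance, after 2 seconds with 4 of 10 requested results found, the estimate is (10 - 4) * 2 / 4 = 3 seconds, so the query switches to DIL only if DIL is expected to take less than that.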

26 EXPERIMENTAL EVALUATIONS
- Data sets used: DBLP and XMark.
- We measure the time taken by each of the search techniques as a function of the number of keywords and the correlation among them.

27 CONCLUSION
We have presented the design, implementation and evaluation of the XRANK system for ranked keyword search over XML documents, taking into account:
(a) the hierarchical and hyperlinked structure of XML documents
(b) a two-dimensional notion of keyword proximity, when computing the ranking for XML keyword search queries

28 THANK YOU.

