XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.

Presentation transcript:

XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database Systems - Semester Project

OUTLINE
Introduction
Ranking Idea
Search Techniques
Experimental Evaluations
Conclusion

INTRODUCTION
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
XML can have user-defined tags, which can be nested.
HTML is a presentation language and hence cannot capture much semantics.
HTML search techniques cannot be employed for XML searches.
XQuery is too complicated for end users.
XRANK provides a simple keyword search query interface.

INTRODUCTION
Challenges:
The element containing the search keywords is returned, not the entire document.
The ranking of elements depends on several factors.
Keyword proximity has to be considered in two dimensions: keyword distance and ancestor distance.

INTRODUCTION
XML Data Model:
A collection of hyperlinked XML documents can be defined as a directed graph G = (N, CE, HE), where
N: the set of nodes, N = NE ∪ NV
NE: the set of elements
NV: the set of values
CE: the set of containment edges relating nodes
HE: the set of hyperlink edges relating nodes
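The graph definition above can be sketched as a small data structure. This is only an illustration: the class name `XMLGraph`, its methods, and the example element names are assumptions made for this sketch, not part of XRANK.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """Illustrative container for G = (N, CE, HE) with N = NE ∪ NV."""
    elements: set = field(default_factory=set)      # NE
    values: set = field(default_factory=set)        # NV
    containment: set = field(default_factory=set)   # CE as (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE as (source, target) pairs

    @property
    def nodes(self):
        # N is the union of element nodes and value nodes.
        return self.elements | self.values

    def add_child(self, parent, child, is_value=False):
        # A containment edge goes from an element to a sub-element or a value.
        self.elements.add(parent)
        (self.values if is_value else self.elements).add(child)
        self.containment.add((parent, child))

    def add_link(self, source, target):
        # A hyperlink edge relates two elements, possibly in different documents.
        self.elements.update((source, target))
        self.hyperlinks.add((source, target))


# Hypothetical two-document collection.
g = XMLGraph()
g.add_child("paper", "title")
g.add_child("title", "XRANK", is_value=True)
g.add_child("paper", "cite")
g.add_link("cite", "other_paper")
```

Keeping CE and HE as separate edge sets mirrors the definition and makes it easy to treat the two kinds of edges differently when ranking.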

RANKING IDEA
ElemRank: for ranking a single element.
Overall rank: for ranking an ancestor of an element by considering the ElemRank value of its descendant element.

RANKING IDEA – ELEMRANK
ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
It is obtained by refining Google's PageRank algorithm.
PageRank: the PageRank of a document v, p(v), is
$$ p(v) = \frac{1-d}{N_d} + d \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)} $$
N_d is the total number of documents.
N_h(u) is the number of outgoing hyperlinks from document u.
d is a constant (typically 0.85).
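A minimal sketch of how PageRank can be computed by repeated substitution, assuming the standard recurrence p(v) = (1 - d)/N_d + d · Σ p(u)/N_h(u) over the documents u that link to v. The function name, the dictionary input format, and the fixed iteration count are assumptions of this sketch; documents without outgoing links simply pass on no rank here.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each document to the list of documents it links to."""
    docs = set(links) | {v for targets in links.values() for v in targets}
    n_d = len(docs)
    p = {v: 1.0 / n_d for v in docs}          # uniform starting point
    for _ in range(iterations):
        # Every document receives the random-jump share (1 - d) / N_d.
        nxt = {v: (1 - d) / n_d for v in docs}
        for u, targets in links.items():
            for v in targets:
                # u splits its rank evenly over its N_h(u) outgoing hyperlinks.
                nxt[v] += d * p[u] / len(targets)
        p = nxt
    return p


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this small example document c is linked from both a and b, so it ends up with a higher rank than b.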

RANKING IDEA – ELEMRANK
But PageRank is unidirectional.
ElemRank, denoted by the function e(), needs to be bidirectional, so reverse containment edges are added to the formula:
$$ e(v) = \frac{1-d}{N_e} + d \sum_{(u,v) \in E} \frac{e(u)}{N_h(u) + N_c(u) + 1} $$
v: the element for which the rank is being calculated.
N_e: the number of XML elements.
N_h(u): the number of outgoing hyperlinks from u.
N_c(u): the number of sub-elements of u.
d is a constant (typically 0.85).
E = HE ∪ CE ∪ CE^{-1}, where CE^{-1} is the set of reverse containment edges.

RANKING IDEA – ELEMRANK
But containment edges and hyperlink edges need to be differentiated.
After differentiating the hyperlink edges and containment edges we get
$$ e(v) = \frac{1-d_1-d_2}{N_e} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE \cup CE^{-1}} \frac{e(u)}{N_c(u) + 1} $$
v: the element for which the rank is being calculated.
N_e: the number of XML elements.
N_h(u): the number of outgoing hyperlinks from u.
N_c(u): the number of sub-elements of u.
d_1 and d_2 are the probabilities of navigating through hyperlinks and containment edges, respectively.

RANKING IDEA – ELEMRANK
But this formula weights forward and reverse containment relationships equally.
After differentiating the hyperlink edges, containment edges and reverse containment edges we get
$$ e(v) = \frac{1-d_1-d_2-d_3}{N_d \times N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u) $$
v: the element for which the rank is being calculated.
N_d: the total number of XML documents.
N_h(u): the number of outgoing hyperlinks from u.
N_de(v): the number of elements in the XML document containing the element v.
N_c(u): the number of sub-elements of u.
d_1, d_2, and d_3 are the probabilities of navigating through hyperlinks, forward containment edges, and reverse containment edges, respectively.
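A rough sketch of an iterative ElemRank computation. It assumes the final recurrence has the form e(v) = (1 - d1 - d2 - d3)/(N_d · N_de(v)) + d1 · Σ_HE e(u)/N_h(u) + d2 · Σ_CE e(u)/N_c(u) + d3 · Σ_CE⁻¹ e(u). The function name, the input encoding (a child-to-parent map), the default values of d1, d2, d3, and the fixed iteration count are all assumptions chosen for illustration.

```python
from collections import Counter


def elemrank(parent, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """
    parent:     dict child element -> parent element (the containment edges)
    hyperlinks: list of (source, target) element pairs
    doc_of:     dict element -> id of the document containing it
    """
    elements = list(doc_of)
    n_d = len(set(doc_of.values()))                  # number of documents
    n_de = Counter(doc_of.values())                  # elements per document
    n_h = Counter(src for src, _ in hyperlinks)      # outgoing hyperlinks per element
    n_c = Counter(parent.values())                   # sub-elements per element
    e = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iterations):
        # Random-jump term, scaled by the size of the containing document.
        nxt = {v: (1 - d1 - d2 - d3) / (n_d * n_de[doc_of[v]]) for v in elements}
        for u, v in hyperlinks:
            nxt[v] += d1 * e[u] / n_h[u]             # along hyperlink edges
        for child, par in parent.items():
            nxt[child] += d2 * e[par] / n_c[par]     # forward containment: parent to child
            nxt[par] += d3 * e[child]                # reverse containment: child to parent
        e = nxt
    return e


# Hypothetical single document: root r with two sub-elements a and b.
scores = elemrank(parent={"a": "r", "b": "r"}, hyperlinks=[],
                  doc_of={"r": "doc", "a": "doc", "b": "doc"})
```

In the reverse direction a parent aggregates the full rank of each sub-element, whereas in the forward direction a parent's rank is split among its sub-elements, so the root of this toy document scores higher than either child.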

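RANKING IDEA – OVERALL RANK
The rank of an element v_1 with respect to one occurrence of a keyword k_i is computed from the element v_t that directly contains k_i, where v_1, v_2, …, v_t is the containment path from v_1 down to v_t:
$$ r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times decay^{\,t-1} $$
decay is a parameter that can be set to a value in the range 0 to 1.
For multiple occurrences of k_i in v_1, the combined rank is
$$ \hat{r}(v_1, k_i) = f(r_1, r_2, \ldots, r_m) $$
where r_1, …, r_m are the ranks of v_1 with respect to the m occurrences of k_i and f is the maximum of these ranks.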

RANKING IDEA – OVERALL RANK
For a query Q = (k_1, k_2, …, k_n), the overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v_1, k_1, k_2, …, k_n):
$$ R(v_1, Q) = \left( \sum_{1 \le i \le n} \hat{r}(v_1, k_i) \right) \times p(v_1, k_1, k_2, \ldots, k_n) $$
where \hat{r}(v_1, k_i) is the rank of v_1 with respect to keyword k_i.
The function p can be any function that ranges from 0 to 1.
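The overall rank can be sketched in a few lines. The sketch assumes that the rank for a single keyword occurrence is the ElemRank of the element directly containing the keyword, scaled by decay once per containment level, and that multiple occurrences are combined with max. The function names, the input format, and the example numbers are hypothetical.

```python
def keyword_rank(occurrences, decay=0.75):
    """
    occurrences: list of (elemrank_of_v_t, t) pairs for one keyword, where t is the
    length of the containment path from the result element v_1 down to v_t
    (t = 1 when v_1 itself directly contains the keyword).
    """
    # Each extra level of nesting scales the rank by decay; combine with max.
    return max(er * decay ** (t - 1) for er, t in occurrences)


def overall_rank(per_keyword_occurrences, proximity, decay=0.75):
    """Sum of the per-keyword ranks, multiplied by a proximity value in [0, 1]."""
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return total * proximity


# Two query keywords: the first occurs directly in v_1, the second two levels below.
score = overall_rank([[(0.08, 1)], [(0.2, 3)]], proximity=0.5, decay=0.5)
```

Because the proximity measure is passed in as a plain number, any proximity function with values between 0 and 1 can be plugged in without changing the rest of the computation.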

SEARCH TECHNIQUES – NAÏVE APPROACH
Main difference between XML and HTML keyword search: the granularity of query results.
  XML keyword search returns elements.
  HTML keyword search returns documents.
One way to do XML keyword search: treat each element as a document.
Problems:
  Space overhead
  Spurious query results
  Inaccurate ranking of results

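SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
Dewey ID idea: the ID of an element is the ID of its parent followed by the element's position among its siblings, so the ID of an ancestor is a prefix of the IDs of all its descendants.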
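A small sketch of how Dewey IDs can be handled in code, assuming an ID is written as dot-separated integers and that an ancestor's ID is a proper prefix of each descendant's ID. The helper names and the tuple representation are assumptions of this sketch.

```python
def parse_dewey(text):
    """Turn a dotted Dewey ID string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_ancestor(a, b):
    """True if the element with Dewey ID a is a proper ancestor of the element with ID b."""
    return len(a) < len(b) and b[:len(a)] == a


def common_prefix(a, b):
    """Dewey ID of the deepest element that contains both a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


# Hypothetical IDs: two elements that share the ancestor 5.0.3.
first = parse_dewey("5.0.3.0.1")
second = parse_dewey("5.0.3.2")
shared = common_prefix(first, second)
```

Representing IDs as tuples means Python's built-in tuple ordering sorts a parent before its descendants and siblings by position, which is convenient for lists kept in Dewey ID order.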

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
For each keyword, an inverted list of all the elements that contain the keyword is created.
Each entry has three fields: the Dewey ID of the element, its ElemRank, and the positions in the element where the keyword occurs.
The list is sorted by Dewey ID.
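The index layout described above can be sketched as follows. The sketch assumes each element is given by its Dewey ID together with the text it directly contains, and that ElemRank values have already been computed. The function name, the whitespace tokenisation, and the example data are assumptions.

```python
from collections import defaultdict


def build_dil(element_text, elemrank):
    """
    element_text: dict Dewey ID (tuple) -> text directly contained in that element
    elemrank:     dict Dewey ID (tuple) -> precomputed ElemRank value
    Returns: dict keyword -> list of (dewey_id, elemrank, positions), sorted by Dewey ID.
    """
    index = defaultdict(list)
    # Visiting elements in Dewey order keeps every inverted list sorted by Dewey ID.
    for dewey_id in sorted(element_text):
        positions = defaultdict(list)
        for pos, word in enumerate(element_text[dewey_id].lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((dewey_id, elemrank[dewey_id], pos_list))
    return dict(index)


# Hypothetical three-element document.
texts = {(0,): "workshop", (0, 0): "XQL language", (0, 1): "Ricardo XQL xql"}
ranks = {(0,): 0.3, (0, 0): 0.2, (0, 1): 0.1}
dil = build_dil(texts, ranks)
```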

SEARCH TECHNIQUES – DEWEY INVERTED LIST (DIL)
The algorithm works in a single pass over the inverted lists.
The key idea is to merge the keyword inverted lists while simultaneously computing the longest common prefix of the Dewey IDs in the different lists.
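A simplified sketch of the single-pass merge. It keeps a stack that mirrors the Dewey ID currently being processed and reports an element as soon as everything below it has been seen. Ranks and keyword positions are left out, and the rule that a reported element does not pass its keywords on to its ancestors is an assumption of this sketch about the intended result semantics. The function name and data layout are hypothetical.

```python
import heapq


def dil_merge(lists):
    """
    lists: one list of Dewey IDs (tuples) per query keyword, each sorted by Dewey ID.
    Returns the Dewey IDs of the elements found to contain all keywords.
    """
    n = len(lists)
    # Tag every Dewey ID with its keyword index and merge in Dewey ID order.
    stream = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])
    path, seen, results = [], [], []   # seen[j][i]: keyword i found under path[:j + 1]

    def pop():
        flags = seen.pop()
        if all(flags):
            results.append(tuple(path))   # element contains every keyword
        elif seen:
            # Otherwise hand the keywords found so far to the parent element.
            seen[-1] = [a or b for a, b in zip(seen[-1], flags)]
        path.pop()

    for dewey, kw in stream:
        # Longest common prefix between the stack and the next Dewey ID.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:            # leave subtrees that are finished
            pop()
        for component in dewey[lcp:]:     # descend to the new element
            path.append(component)
            seen.append([False] * n)
        seen[-1][kw] = True
    while path:
        pop()
    return results


# Hypothetical lists for two keywords.
hits = dil_merge([[(0, 1, 0), (0, 2)], [(0, 1, 1), (0, 3)]])
```

Here element 0.1 is reported because its two sub-elements cover both keywords, and the root 0 is reported separately because of the remaining occurrences in 0.2 and 0.3.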

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
“If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”
Instead, we can start directly with the elements that are likely to have higher ranks.
In this way, we compute only the top m results requested by the user rather than all of them.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
In RDIL:
Inverted lists are ordered by ElemRank.
Each inverted list has a B+-tree index on the Dewey ID field.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Working:
Pick a keyword k_i; the head of its inverted list gives the Dewey ID of a top-ranked element containing k_i.
Then pick another keyword k_j and, from its B+-tree (which is sorted by Dewey ID), find the smallest Dewey ID that is greater than the Dewey ID picked for k_i.
The entry in the list of k_j that shares the longest common prefix with the Dewey ID of k_i is either the Dewey ID we just found or its immediate predecessor in the B+-tree.

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Example:
Consider the query “XQL Ricardo”.
Take the Dewey ID of a top-ranked element that contains the keyword “XQL”.
Consider the Dewey IDs stored in the leaf nodes of the B+-tree for the keyword “Ricardo”.
From these, pick the smallest Dewey ID that is greater than the XQL Dewey ID.
The Dewey ID with the longest common prefix will be either this ID or its predecessor.
The element whose Dewey ID is that longest common prefix contains both XQL and Ricardo.
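The lookup step can be sketched with a sorted Python list standing in for the leaf level of the B+-tree. The function names are assumptions and the Dewey IDs in the example are hypothetical. The sketch searches for the first ID greater than or equal to the given one, so that an element containing both keywords is also found.

```python
from bisect import bisect_left


def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def longest_prefix_with(dewey, sorted_ids):
    """
    dewey:      Dewey ID of a top-ranked element for one keyword
    sorted_ids: Dewey IDs for another keyword in sorted order (stand-in for the B+-tree)
    Returns the longest prefix of dewey shared with any ID in sorted_ids.
    """
    i = bisect_left(sorted_ids, dewey)                 # first ID >= dewey
    # Only the entry found and its immediate predecessor need to be checked.
    candidates = sorted_ids[max(i - 1, 0): i + 1]
    return max((common_prefix(dewey, c) for c in candidates), key=len, default=())


# Hypothetical IDs for the query "XQL Ricardo".
xql_id = (3, 0, 2, 1)
ricardo_ids = [(1, 4), (2, 5, 0), (3, 0, 4), (6, 1)]
both = longest_prefix_with(xql_id, ricardo_ids)
```

In this example the search lands on 3.0.4, whose predecessor is 2.5.0, so the deepest element containing both keywords is 3.0.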

SEARCH TECHNIQUES – RANKED DEWEY INVERTED LIST (RDIL)
Consider a query whose keywords each occur relatively frequently in the document collection but rarely occur together in the same document.
RDIL has to scan most (or all) of the inverted lists to produce the output.
The overhead of performing random index lookups in RDIL can sometimes outweigh the benefit of processing the inverted lists in rank order.

SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)
The key idea is to combine the benefits of both DIL and RDIL.
We dynamically switch between RDIL and DIL depending on query performance.
So we need the inverted list sorted by ElemRank for RDIL and by Dewey ID for DIL.
But RDIL is likely to outperform DIL only if it scans a small fraction of the full inverted list.
So we store only a small fraction of the inverted list sorted by rank.

SEARCH TECHNIQUES – HYBRID DEWEY INVERTED LIST (HDIL)

The dynamic switching between RDIL and DIL is based on the following factors:
The time spent so far, t.
The number of results above the threshold so far, r.
Based on these, we estimate the remaining time for RDIL as (m − r) × t / r, where m is the number of results requested.
Switch to DIL if this estimate is more than the expected time for DIL.
We start with RDIL and switch to DIL based on the above computation.
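The switching rule can be written down directly. In this sketch the function names are hypothetical, the expected DIL time is taken as a given input, and treating r = 0 as an unbounded estimate is an assumption, since the formula is undefined in that case.

```python
def estimate_remaining_rdil_time(t, r, m):
    """
    t: time spent so far
    r: number of results above the threshold so far
    m: number of results requested
    """
    if r >= m:
        return 0.0               # enough results already
    if r == 0:
        return float("inf")      # assumption: no progress yet, so no finite estimate
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when continuing with RDIL is expected to take longer than DIL."""
    return estimate_remaining_rdil_time(t, r, m) > expected_dil_time


# 4 of 10 requested results found after 2 time units: about 3 more units expected.
remaining = estimate_remaining_rdil_time(t=2.0, r=4, m=10)
```

A caller would presumably evaluate this check only after RDIL has run for a short while, because the r = 0 case otherwise triggers a switch immediately.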

EXPERIMENTAL EVALUATIONS
Data sets used: DBLP and XMark.
We measure the time taken by each of the search techniques as a function of the number of keywords and the correlation among them.

CONCLUSION
We have presented the design, implementation and evaluation of the XRANK system for ranked keyword search over XML documents, which takes into account, when computing the ranking for XML keyword search queries:
(a) the hierarchical and hyperlinked structure of XML documents
(b) a two-dimensional notion of keyword proximity

THANK YOU.