BNCOD07: Indexing and Searching XML Documents based on Content and Structure Synopses

Presentation transcript:

Indexing and Searching XML Documents based on Content and Structure Synopses
Weimin He, Leonidas Fegaras, David Levine
University of Texas at Arlington

Outline
Motivation
Key Contributions
Related Work
Data Synopses
Indexing
Query Processing
Experimental Results
Conclusion

Why not Google?
Need to query both structure and content
  – an opportunity for more precise search
Keyword queries are NOT adequate for XML search
An example query beyond Google:
  Find the price of the book whose author’s lastname is “Smith” and whose title contains “XML” and “SAX”
Semantic search using an XPath query:
  //book[author/lastname ~ "Smith"][title ~ "XML" and "SAX"]/price
Simpler query formats cannot express complex containment relationships:
  [ (lastname, Smith), (title, XML & SAX), price ]
Fully indexing XML data is neither efficient nor scalable

Key Contributions
A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents
Two novel data synopsis structures that can achieve higher query precision and scalability
A hash-based processing algorithm to speed up searching
A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision

Related Work
Extend keyword queries to XML
  – XRank
  – XKSearch
Integrate IR constructs and scoring into XQuery
  – TIX
  – TeXQuery
XML Summarization Techniques
  – XSketch
  – XCluster

System Architecture

Specification of Search Queries
XPath is extended with a simple IR syntax
Queries may contain predicates of the form: e ~ S
  – e is an XPath expression
  – S is a search predicate that takes the form: "term" | S1 and S2 | S1 or S2 | (S)
A running query example:
  //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
Query result: a list of document locations (path names) that satisfy the query
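The search-predicate grammar above can be illustrated with a small recursive-descent parser. This is only a sketch, not the prototype's parser: the slide does not state operator precedence, so the code assumes that "and" binds tighter than "or", and the names tokenize, parse and matches are hypothetical.

```python
import re

# A token is a quoted term, the keyword and/or, or a parenthesis.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(and|or)\b|([()]))')

def tokenize(text):
    """Split a search predicate into (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at {pos}: {text[pos:]!r}")
        term, op, paren = m.groups()
        if term is not None:
            tokens.append(("term", term.lower()))
        elif op:
            tokens.append(("op", op))
        else:
            tokens.append(("paren", paren))
        pos = m.end()
    return tokens

def parse(tokens):
    """Build a tree for S ::= "term" | S1 and S2 | S1 or S2 | (S).
    Assumption: 'and' has higher precedence than 'or'."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        pos += 1

    def atom():
        kind, value = peek()
        if kind == "term":
            advance()
            return ("term", value)
        if (kind, value) == ("paren", "("):
            advance()
            node = or_expr()
            if peek() != ("paren", ")"):
                raise ValueError("missing ')'")
            advance()
            return node
        raise ValueError(f"unexpected token {value!r}")

    def and_expr():
        node = atom()
        while peek() == ("op", "and"):
            advance()
            node = ("and", node, atom())
        return node

    def or_expr():
        node = and_expr()
        while peek() == ("op", "or"):
            advance()
            node = ("or", node, and_expr())
        return node

    tree = or_expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree

def matches(tree, words):
    """Evaluate a parsed predicate against the set of words of one element."""
    if tree[0] == "term":
        return tree[1] in words
    left, right = matches(tree[1], words), matches(tree[2], words)
    return (left and right) if tree[0] == "and" else (left or right)
```

For the running example, parse(tokenize('"mountain" and "bicycle"')) yields an "and" node over the two terms, which matches evaluates exactly against a word set. The synopses described on the following slides replace that exact test with an approximate, bit-matrix one.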

Data Indexing
Structural Summary (SS)
  – A tree that captures all unique paths in an XML document
  – It is constructed from XML data incrementally
  – Each SSnode# corresponds to a unique full label path, e.g. 9: /auction/sponsor/address
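A minimal sketch of how a structural summary might be built incrementally is given below. It assumes that SS node numbers are handed out in first-seen order, which need not match the numbering in the slide's figure, and it keeps the tree implicit in the path strings. The class name StructuralSummary and its methods are hypothetical.

```python
import xml.etree.ElementTree as ET

class StructuralSummary:
    """Collects every unique full label path and gives it an SS node number."""

    def __init__(self):
        self.node_ids = {}   # full label path -> SS node number
        self.paths = []      # SS node number -> full label path

    def _node(self, path):
        # Register the path only the first time it is seen.
        if path not in self.node_ids:
            self.node_ids[path] = len(self.paths)
            self.paths.append(path)
        return self.node_ids[path]

    def add_document(self, xml_text):
        """Extend the summary with the label paths of one more document."""
        root = ET.fromstring(xml_text)

        def visit(elem, parent_path):
            path = f"{parent_path}/{elem.tag}"
            self._node(path)
            for child in elem:
                visit(child, path)

        visit(root, "")

ss = StructuralSummary()
ss.add_document(
    "<auction><item><location>Dallas</location></item>"
    "<item><price>10</price></item></auction>"
)
ss.add_document("<auction><sponsor><address>x</address></sponsor></auction>")
for number, path in enumerate(ss.paths):
    print(number, path)
```

The two item elements of the first document collapse into the single path /auction/item, and the second document only adds the paths it introduces, which is the sense in which the summary grows incrementally.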

Data Indexing (cont.)
Content Synopsis (CS)
  – Summarizes the text associated with an SS node in an XML document
  – Approximated as a bit matrix of size W×L; L is fixed but W may depend on the document size
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in evaluating search predicates in the query
Positional Filter (PF)
  – Captures the position spans of all XML elements associated with an SS node in an XML document
  – Represented as a bit matrix of size M×L, where M ≥ 2
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in enforcing containment constraints among query predicates
Do we need positional dimension?
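The two bit matrices can be sketched as follows. The slide gives only their sizes, so several details here are assumptions rather than the authors' design: terms are hashed to one of W rows with crc32, a position is mapped to one of L columns in proportion to the document length, and the i-th element span is written into row i mod M of the positional filter. All function names are hypothetical.

```python
import zlib

def term_row(term, W):
    """Hash a term to a row; crc32 is used because it is deterministic."""
    return zlib.crc32(term.lower().encode("utf-8")) % W

def position_col(pos, doc_len, L):
    """Map a position in [0, doc_len) to one of L position ranges."""
    return min(L - 1, pos * L // doc_len)

def content_synopsis(term_positions, doc_len, W, L):
    """W rows, each an L-bit integer; bit (row, col) is set when some term
    hashing to that row occurs in position range col."""
    rows = [0] * W
    for term, pos in term_positions:
        rows[term_row(term, W)] |= 1 << position_col(pos, doc_len, L)
    return rows

def positional_filter(spans, doc_len, M, L):
    """M rows, each an L-bit integer; the i-th (begin, end) span sets the
    columns it covers in row i mod M (assumed assignment)."""
    rows = [0] * M
    for i, (begin, end) in enumerate(spans):
        first = position_col(begin, doc_len, L)
        last = position_col(end, doc_len, L)
        for col in range(first, last + 1):
            rows[i % M] |= 1 << col
    return rows

def may_contain(pf_row, cs, term):
    """True when the term may occur inside an element recorded in pf_row.
    Hash collisions can cause false positives, never false negatives."""
    return (pf_row & cs[term_row(term, len(cs))]) != 0

# Two item elements covering positions 0-49 and 50-99 of a 100-position document.
pf = positional_filter([(0, 49), (50, 99)], doc_len=100, M=2, L=10)
cs = content_synopsis([("Dallas", 12), ("mountain", 60)], doc_len=100, W=8, L=10)
print(may_contain(pf[0], cs, "Dallas"), may_contain(pf[1], cs, "mountain"))
```

Under these assumptions the positional dimension is what allows a filter to notice that "Dallas" and "mountain" fall under different item elements. A one-dimensional Bloom filter per path would only say that both terms occur somewhere in the document.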

Data Synopsis Example
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

Containment Filtering
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

Query Processing Overview
Query Footprint (QF) Extraction
  – Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
  – QF: //auction//item:0[location: 1][description: 2]/price
Structural Summary Matching
  – Retrieve all structural summaries that match the QF
  – We use the standard preorder numbering scheme to represent an SS
  – An SS is stored as a B+-tree that implements the mapping: tag → {(SS#, SSnode#, begin, end, level)}
  – We use containment joins to retrieve the qualified full label paths that match the entry points in the QF:
    [ /auction/item, /auction/item/location, /auction/item/description ]
Containment Filtering
Qualified document locations are collected and returned
  – The unit of query processing is a mapping from a doc# to a bit matrix of size M×L (positions)
  – An empty bit matrix means an unqualified document
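Query footprint extraction can be sketched with a regular expression that strips each search predicate and numbers the entry points. This is a simplified illustration, not the prototype's code: it assumes a single predicated step per query and no "]" inside quoted terms, and the function name query_footprint is hypothetical.

```python
import re

# Matches one predicate of the form [path ~ search-predicate].
PREDICATE = re.compile(r'\[\s*([^\]~]+?)\s*~\s*([^\]]*?)\s*\]')

def query_footprint(query):
    """Return (footprint, predicates): the footprint keeps only the structure,
    with entry points numbered 0 (the predicated step), 1, 2, ...
    predicates[i - 1] holds the (path, search predicate) of entry point i."""
    predicates = []

    def strip(match):
        predicates.append((match.group(1), match.group(2)))
        return f"[{match.group(1)}: {len(predicates)}]"

    footprint = PREDICATE.sub(strip, query)
    if predicates:
        # The step that carries the predicates becomes entry point 0.
        footprint = footprint.replace("[", ":0[", 1)
    return footprint, predicates

qf, preds = query_footprint(
    '//auction//item[location ~ "Dallas"]'
    '[description ~ "mountain" and "bicycle"]/price'
)
print(qf)
print(preds)
```

Separating the footprint from the predicates lets the structural part be matched against the structural summaries first. The search predicates are then evaluated only for the label paths that survive that match.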

Two-Phase Containment Filtering
Many sources of inefficiency:
  – A large number of full label paths may match a single generic XPath query
  – A long list of data synopses has to be retrieved for each label path in a QF
  – The retrieved lists of data synopses have to be correlated at each step during containment filtering
Solution:
  – Aggregate the data synopsis lists from multiple documents into a single bit matrix of size W×D, called the Document Synopsis, stored as the mapping path → bit-matrix, so that, given a term t and a full label path p, the document doc# is a candidate if the document synopsis for p is set at [hash(t), hash(doc#)]
  – Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering
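The document synopsis and the pruning phase can be sketched as below. The hash function (crc32), the class DocumentSynopsis and the helper prune are assumptions made for illustration. The sketch handles conjunctive (path, term) requirements only and is not the authors' two-phase algorithm.

```python
import zlib

def h(value, size):
    """Deterministic hash of a term or document number into [0, size)."""
    return zlib.crc32(str(value).encode("utf-8")) % size

class DocumentSynopsis:
    """A W×D bit matrix for one full label path: bit [hash(t), hash(doc#)]
    is set when term t occurs under that path in document doc#."""

    def __init__(self, W, D):
        self.W, self.D = W, D
        self.rows = [0] * W          # W rows, each a D-bit integer

    def add(self, term, doc_id):
        self.rows[h(term, self.W)] |= 1 << h(doc_id, self.D)

    def is_candidate(self, term, doc_id):
        return bool(self.rows[h(term, self.W)] >> h(doc_id, self.D) & 1)

def prune(synopses, requirements, doc_ids):
    """Phase one: keep only documents that pass every (path, term) test.
    Survivors may include false positives; no qualified document is lost."""
    return [
        d for d in doc_ids
        if all(path in synopses and synopses[path].is_candidate(term, d)
               for path, term in requirements)
    ]

synopses = {                          # full label path -> DocumentSynopsis
    "/auction/item/location": DocumentSynopsis(64, 64),
    "/auction/item/description": DocumentSynopsis(64, 64),
}
synopses["/auction/item/location"].add("Dallas", 1)
synopses["/auction/item/description"].add("mountain", 1)
synopses["/auction/item/location"].add("Dallas", 2)

required = [("/auction/item/location", "Dallas"),
            ("/auction/item/description", "mountain")]
print(prune(synopses, required, [1, 2, 3]))
```

Only the documents returned by prune need their content synopses and positional filters fetched for the second, more expensive containment filtering phase.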

Document Synopsis
The document synopsis for /biblio/book/paragraph

Experimental Setup
A prototype system is implemented in Java
Employed Berkeley DB Java Edition as a storage manager
Datasets
  – XMark
  – XBench
Table columns: Data Set | Data Size (MB) | Files | Avg. File Size (KB) | Avg. SS Size (Byte) | Avg. CS Size (Byte) | Avg. PF Size (Byte); rows: XBench, XMark (cell values not preserved)

Query Workload
Dataset  Query  Query Expression
XMark    Q1     /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark    Q2     //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark    Q3     /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark    Q4     //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark    Q5     /site//item[description//text ~ "gold"]/name
XMark    Q6     /regions//item[description//text ~ "character"]/payment
XMark    Q7     //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark    Q8     //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark    Q9     //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark    Q10    //closed_auction[annotation//text ~ "country"]/type
XBench   Q11    /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench   Q12    //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench   Q13    /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench   Q14    /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench   Q15    /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench   Q16    /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench   Q17    //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench   Q18    //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench   Q19    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench   Q20    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title

Indexing Scheme Comparison
ILI: using a standard XML indexing scheme based on full Inverted Lists
DSI: using our indexing scheme based on Data Synopses

Query Precision Measurement
ODBF: using one-dimensional Bloom Filters
TDBF: using two-dimensional Bloom Filters

Efficiency of Optimization Algorithm
OPCF: using one-phase containment filtering
TPCF: using two-phase containment filtering

Future Research Directions
Develop an effective ranking function
Adopt top-k algorithms to improve search efficiency
Apply our framework to structured P2P networks
Evaluate our framework over INEX data