BNCOD07: Indexing & Searching XML Documents based on Content and Structure Synopses

Slide 1: Indexing and Searching XML Documents based on Content and Structure Synopses
Weimin He, Leonidas Fegaras, David Levine
University of Texas at Arlington
http://lambda.uta.edu

Slide 2: Outline
Motivation
Key Contributions
Related Work
Data Synopses Indexing
Query Processing
Experimental Results
Conclusion

Slide 3: Why not Google?
Need to query both structure and content
  – an opportunity for more precise search
Keyword queries are NOT adequate for XML search
An example query beyond Google: Find the price of the book whose author’s lastname is “Smith” and whose title contains “XML” and “SAX”
Semantic search using an XPath query:
  //book[author/lastname ~ "Smith"][title ~ "XML" and "SAX"]/price
Simpler query formats cannot express complex containment relationships:
  [ (lastname, Smith), (title, XML & SAX), price ]
Fully indexing XML data is neither efficient nor scalable

Slide 4: Key Contributions
A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents
Two novel data synopsis structures that can achieve higher query precision and scalability
A hash-based processing algorithm to speed up searching
A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision

Slide 5: Related Work
Extend keyword queries to XML
  – XRank
  – XKSearch
Integrate IR constructs and scoring into XQuery
  – TIX
  – TeXQuery
XML Summarization Techniques
  – XSketch
  – XCluster

Slide 6: System Architecture

Slide 7: Specification of Search Queries
XPath is extended with a simple IR syntax
Queries may contain predicates of the form: e ~ S
  – e is an XPath expression
  – S is a search predicate that takes the form: "term" | S1 and S2 | S1 or S2 | (S)
A running query example:
  //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
Query result: a list of document locations (path names) that satisfy the query
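The search predicate grammar above can be illustrated with a small evaluator. This is only a sketch: the tuple encoding of predicates and the function name `matches` are assumptions made for illustration, and the check runs against an exact set of terms rather than against the synopses the talk describes.

```python
def matches(pred, terms):
    """Evaluate a search predicate S against a set of terms.

    pred is either a string (a single "term") or a tuple
    ("and" | "or", left, right). This encoding is hypothetical.
    """
    if isinstance(pred, str):
        return pred in terms
    op, left, right = pred
    if op == "and":
        return matches(left, terms) and matches(right, terms)
    if op == "or":
        return matches(left, terms) or matches(right, terms)
    raise ValueError(f"unknown operator: {op}")


# Terms of one hypothetical <description> element.
description_terms = {"red", "mountain", "bicycle"}

# description ~ "mountain" and "bicycle"
both = matches(("and", "mountain", "bicycle"), description_terms)
# description ~ "tandem" or "bicycle"
either = matches(("or", "tandem", "bicycle"), description_terms)
```

Parenthesised predicates such as (S) need no separate case here, because nesting is already expressed by nesting tuples.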

Slide 8: Data Indexing
Structural Summary (SS)
  – A tree that captures all unique paths in an XML document
  – It is constructed from XML data incrementally
  – Each SSnode# corresponds to a unique full label path, e.g. 9: /auction/sponsor/address
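A rough sketch of how a structural summary could be collected from a document is given below. It assumes a flat map from full label path to node number in place of the tree, and it numbers nodes in order of first appearance; both choices, and the name `build_structural_summary`, are illustrative rather than taken from the talk.

```python
import xml.etree.ElementTree as ET


def build_structural_summary(xml_text, summary=None):
    """Add the unique full label paths of one document to `summary`.

    `summary` maps a full label path to an SS node number. Passing an
    existing summary extends it, which mimics incremental construction.
    """
    if summary is None:
        summary = {}

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        if path not in summary:
            summary[path] = len(summary) + 1  # assumed numbering scheme
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return summary


doc = (
    "<auction><sponsor><address/></sponsor>"
    "<item><price/></item><item><price/></item></auction>"
)
ss = build_structural_summary(doc)
```

The two `item` elements share one entry, since the summary records each distinct path once regardless of how often it occurs.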

Slide 9: Data Indexing (cont.)
Content Synopsis (CS)
  – Summarizes the text associated with an SS node in an XML document
  – Approximated as a bit matrix of size W×L; L is fixed but W may depend on the document size
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in evaluating search predicates in the query
Positional Filter (PF)
  – Captures the position spans of all XML elements associated with an SS node in an XML document
  – Represented as a bit matrix of size M×L, where M ≥ 2
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in enforcing containment constraints among query predicates
Do we need the positional dimension?
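The idea of a W×L content synopsis can be sketched as follows. The slide does not say how terms and positions are mapped to rows and columns, so this sketch assumes that a term is hashed to one of W rows with `zlib.crc32` and that a position is scaled linearly into one of L columns; the class and method names are hypothetical.

```python
import zlib


class ContentSynopsis:
    """W x L bit matrix: rows are hashed terms, columns are position buckets."""

    def __init__(self, width, length):
        self.width = width
        self.length = length
        self.rows = [0] * width  # each row is an L-bit integer mask

    def _row(self, term):
        return zlib.crc32(term.encode("utf-8")) % self.width

    def add(self, term, position, doc_size):
        # Assumes 0 <= position < doc_size.
        col = position * self.length // doc_size
        self.rows[self._row(term)] |= 1 << col

    def positions(self, term):
        """Bit mask of the position buckets where `term` may occur."""
        return self.rows[self._row(term)]


cs = ContentSynopsis(width=8, length=4)
cs.add("mountain", 10, 100)  # falls into position bucket 0
cs.add("bicycle", 12, 100)   # falls into position bucket 0
cs.add("Dallas", 80, 100)    # falls into position bucket 3

# "mountain" and "bicycle": intersect the two position masks.
both = cs.positions("mountain") & cs.positions("bicycle")
```

Like a Bloom filter, such a matrix can report a term that is not present (two terms may hash to the same row) but never misses one that was added. The positional dimension is what allows the results of different predicates to be intersected by position rather than only per document.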

Slide 10: Data Synopsis Example
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

Slide 11: Containment Filtering
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

Slide 12: Query Processing Overview
Query Footprint (QF) Extraction
  – Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
  – QF: //auction//item:0[location: 1][description: 2]/price
Structural Summary Matching
  – Retrieve all structural summaries that match the QF
  – We use the standard preorder numbering scheme to represent an SS
  – An SS is stored as a B+-tree that implements the mapping: tag → {(SS#, SSnode#, begin, end, level)}
  – We use containment joins to retrieve the qualified full label paths that match the entry points in the QF:
    [ /auction/item, /auction/item/location, /auction/item/description ]
Containment Filtering
Qualified document locations are collected and returned
  – The unit of query processing is a mapping from a doc# to a bit matrix of size M×L (positions)
  – An empty bit matrix means an unqualified document
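The (begin, end, level) triples used for structural summary matching can be illustrated with a small numbering routine and the containment tests that a containment join relies on. This is a sketch of one common region-numbering variant, not necessarily the exact scheme used in the system; the `Node` tuple, the `(tag, children)` tree encoding, and the function names are assumptions.

```python
from collections import namedtuple

Node = namedtuple("Node", "tag begin end level")


def number_tree(tree, level=0, counter=None, out=None):
    """Assign (begin, end, level) to every node of a (tag, children) tree."""
    if counter is None:
        counter, out = [0], []
    counter[0] += 1
    begin = counter[0]
    slot = len(out)
    out.append(None)  # reserve the preorder slot for this node
    for child in tree[1]:
        number_tree(child, level + 1, counter, out)
    counter[0] += 1
    out[slot] = Node(tree[0], begin, counter[0], level)
    return out


def is_ancestor(a, d):
    """a contains d if d's interval lies strictly inside a's interval."""
    return a.begin < d.begin and d.end < a.end


def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1


tree = ("auction", [
    ("sponsor", [("address", [])]),
    ("item", [("location", []), ("description", []), ("price", [])]),
])
nodes = {n.tag: n for n in number_tree(tree)}
```

With this numbering, the `//` step in `//auction//item` corresponds to `is_ancestor`, while the `/` step in `item/price` corresponds to `is_parent`.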

Slide 13: Two-Phase Containment Filtering
Many sources of inefficiency:
  – A large number of full label paths may match a single generic XPath query
  – A long list of data synopses has to be retrieved for each label path in a QF
  – The retrieved lists of data synopses have to be correlated at each step during containment filtering
Solution:
  – Aggregate data synopses lists from multiple documents into a single bit matrix of size W×D, called the Document Synopsis, which implements the mapping path → bit-matrix, so that, given a term t and a full label path p, the document doc# is a candidate if the document synopsis for p is set at [hash(t), hash(doc#)]
  – Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering
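The candidate test at [hash(t), hash(doc#)] can be sketched as below. The hash function (`zlib.crc32`), the class name `DocumentSynopsis`, and the helper `phase_one` are assumptions for illustration; the slide only fixes the shape of the matrix and the lookup rule.

```python
import zlib


def bucket(value, modulus):
    """Deterministic hash of a term or doc# into [0, modulus)."""
    return zlib.crc32(str(value).encode("utf-8")) % modulus


class DocumentSynopsis:
    """W x D bit matrix kept for one full label path."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.rows = [0] * width  # each row is a D-bit integer mask

    def add(self, term, doc_id):
        self.rows[bucket(term, self.width)] |= 1 << bucket(doc_id, self.depth)

    def is_candidate(self, term, doc_id):
        row = self.rows[bucket(term, self.width)]
        return bool((row >> bucket(doc_id, self.depth)) & 1)


def phase_one(synopsis, terms, doc_ids):
    """Keep only documents that may contain every term under this path."""
    return [d for d in doc_ids
            if all(synopsis.is_candidate(t, d) for t in terms)]


# path -> bit-matrix, here with a single hypothetical path
synopses = {"/auction/item/description": DocumentSynopsis(64, 256)}
ds = synopses["/auction/item/description"]
ds.add("mountain", 17)
ds.add("bicycle", 17)
ds.add("mountain", 42)

candidates = phase_one(ds, ["mountain", "bicycle"], [17, 42, 99])
```

Document 17 is always kept because both terms were recorded for it. Other documents are dropped unless a hash collision makes them look like candidates, which is why the surviving documents still go through the actual containment filtering in the second phase.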

Slide 14: Document Synopsis
The document synopsis for /biblio/book/paragraph

Slide 15: Experimental Setup
Data Set | Data Size (MB) | Files | Avg. File Size (KB) | Avg. SS Size (Byte) | Avg. CS Size (Byte) | Avg. PF Size (Byte)
XBench | 1050 | 2666 | 394 | 43220564178
XMark | 55.8 | 11500 | 5 | 41730616
A prototype system is implemented in Java
Employed Berkeley DB Java Edition 3.2.13 as a storage manager
Datasets
  – XMark
  – XBench

Slide 16: Query Workload
Dataset | Query | Query Expression
XMark | Q1 | /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark | Q2 | //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark | Q3 | /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark | Q4 | //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark | Q5 | /site//item[description//text ~ "gold"]/name
XMark | Q6 | /regions//item[description//text ~ "character"]/payment
XMark | Q7 | //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark | Q8 | //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark | Q9 | //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark | Q10 | //closed_auction[annotation//text ~ "country"]/type
XBench | Q11 | /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench | Q12 | //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench | Q13 | /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench | Q14 | /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench | Q15 | /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench | Q16 | /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench | Q17 | //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench | Q18 | //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench | Q19 | //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench | Q20 | //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title

Slide 17: Indexing Scheme Comparison
ILI: using a standard XML indexing scheme based on full Inverted Lists
DSI: using our indexing scheme based on Data Synopses

Slide 18: Query Precision Measurement
ODBF: using one-dimensional Bloom Filters
TDBF: using two-dimensional Bloom Filters

Slide 19: Efficiency of Optimization Algorithm
OPCF: using one-phase containment filtering
TPCF: using two-phase containment filtering

Slide 20: Future Research Directions
Develop an effective ranking function
Adopt top-k algorithms to improve search efficiency
Apply our framework to structured P2P networks
Evaluate our framework over INEX data
