TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data
Martin Theobald, Max Planck Institute for Computer Science / Stanford University
Joint work with Ralf Schenkel and Gerhard Weikum


1 TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data. Martin Theobald, Max Planck Institute for Computer Science / Stanford University. Joint work with Ralf Schenkel and Gerhard Weikum.

2 Motivation (example figure): two XML documents with article, sec, par, title, bib, and item elements (sample texts on native XML databases, XML query languages, "The XML Files", and "The Ontology Game") are matched against the XPath full-text query
//article[.//bib[about(.//item, "W3C")]]//sec[about(.//, "XML retrieval")]//par[about(.//, "native XML databases")]
Answering such queries requires RANKING, VAGUENESS (approximate matching), and PRUNING.

3 Goal: efficiently retrieve the best (top-k) results of a similarity query.
- Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB '00; Fagin, PODS '01] to XML data and XPath-like full-text search.
- Handle non-schematic, heterogeneous data sources.
- Efficiently support IR-style vague search.
- Combined inverted index for content & structure.
- Avoid full index scans; postpone expensive random accesses to large disk-resident data structures.
- Exploit cheap disk space for redundant index structures.

4 XML-IR: History and Related Work (timeline, 1995 to 2005).
- IR on structured docs (SGML): HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU), OED etc. (U Waterloo).
- Web query languages: W3QS (Technion Haifa), WebSQL (U Toronto), Lorel (Stanford U), Araneus (U Roma).
- XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), NEXI (INEX benchmark), TeXQuery (AT&T Labs), XPath 2.0 & XQuery 1.0 Full-Text (W3C).
- IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD).
- Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...

5 TopX architecture (diagram). Indexing time: an indexer/crawler feeds a DBMS / inverted lists with a unified text & XML schema, index metadata (selectivities, histograms, correlations), and an ontology / large thesaurus (WordNet, OpenCyc, etc.). Query-processing time: the TopX query processor combines (1) top-k XPath processing with sequential-access scan threads, scheduled random accesses, a candidate queue, a candidate cache, auxiliary predicates, and a top-k queue; (2) probabilistic index access scheduling; (3) probabilistic candidate pruning; (4) dynamic query expansion. Frontends: web interface, web service, API.

6 Outline: (1) Top-k XPath Processing, (2) Probabilistic Index Access Scheduling, (3) Probabilistic Candidate Pruning, (4) Dynamic Query Expansion, (5) Experiments: TREC & INEX Benchmarks.

7 Data Model.
- XML trees (no XLinks or ID/IDref attributes).
- Pre-/postorder node labels.
- Redundant full-content text nodes (with stemming, no stopwords): each element stores the concatenated, stemmed text of its whole subtree. For the sample article (title "XML Data Management", abstract "XML management systems vary widely in their expressive power.", a sec with title "Native XML Data Bases." and par "Native XML data base systems can store schemaless data."), the article's full content is "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data".
- Full-content term frequencies: ftf("xml", article1) = 4; ftf("xml", sec4) = 2.
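The pre-/postorder labeling on this slide reduces the XPath descendant axis to two integer comparisons. A minimal sketch (nested tuples stand in for a real XML parser; names are illustrative):

```python
def label_tree(node, pre_counter, post_counter, labels):
    """node = (tag, [children]); preorder number assigned on entry,
    postorder number on exit, collected as labels[pre] = (tag, pre, post)."""
    pre_counter[0] += 1
    my_pre = pre_counter[0]
    for child in node[1]:
        label_tree(child, pre_counter, post_counter, labels)
    post_counter[0] += 1
    labels[my_pre] = (node[0], my_pre, post_counter[0])

def is_descendant(a, b):
    # a is a descendant of b iff pre(a) > pre(b) and post(a) < post(b)
    return a[1] > b[1] and a[2] < b[2]

# The sample document from the slide: article(title, abs, sec(title, par))
tree = ("article", [("title", []), ("abs", []),
                    ("sec", [("title", []), ("par", [])])])
labels = {}
label_tree(tree, [0], [0], labels)
```

With these labels, checking whether the par element lies below the sec element needs no tree traversal at query time.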

8 Scoring Model [INEX '06/'07].
- XML-specific extension of Okapi BM25 (originating from probabilistic IR on unstructured text):
  - ftf (full-content term frequency) instead of tf,
  - ef (element frequency) instead of df,
  - element-type-specific length normalization,
  - tunable parameters k1 and b.
- Statistics are kept per element type, so bib["transactions"] and par["transactions"] are scored against different distributions.
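As a hedged illustration of this idea, here is a BM25-style element score with ftf and ef substituted for tf and df; the exact TopX/INEX formula and parameter choices may differ from this sketch:

```python
import math

def bm25_element_score(ftf, ef, n_elems, elem_len, avg_elem_len,
                       k1=1.2, b=0.75):
    """BM25-style score of one (tag, term) condition for one element.
    ftf: full-content term frequency within the element's subtree,
    ef:  element frequency (elements of this tag containing the term),
    n_elems: number of elements of this tag in the corpus,
    elem_len / avg_elem_len: element-type specific length normalization."""
    idf = math.log((n_elems - ef + 0.5) / (ef + 0.5) + 1.0)
    tf_part = ftf * (k1 + 1) / (ftf + k1 * (1 - b + b * elem_len / avg_elem_len))
    return idf * tf_part
```

The score grows with ftf and shrinks as the term becomes frequent among elements of the same tag, mirroring classic BM25 behavior per element type.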

9 Fagin's NRA [PODS '01] at a Glance. Query q = (t1, t2, t3) over a corpus d1, ..., dn with an inverted index of scores s(ti, dj): find the top-k documents that maximize s(t1, dj) + s(t2, dj) + ... + s(tm, dj), with non-conjunctive ("andish") evaluation. The naive "merge-then-sort" approach needs between O(mn) and O(mn^2) runtime and O(mn) access cost. NRA instead scans the lists in parallel and maintains [worstscore, bestscore] intervals:

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel; for doc d at position pos_i:
    E(d) := E(d) ∪ {i}
    high_i := s(ti, d)
    worstscore(d) := Σ_{ν ∈ E(d)} s(tν, d)
    bestscore(d) := worstscore(d) + Σ_{ν ∉ E(d)} high_ν
    if worstscore(d) > min-k then
      add d to top-k
      min-k := min { worstscore(d') | d' ∈ top-k }
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    if max { bestscore(d') | d' ∈ candidates } ≤ min-k then
      return top-k  // STOP

Example (k = 1): at scan depth 3, d10 has been seen in all three lists with worstscore 2.1 = 0.8 + 0.6 + 0.7, no other candidate's bestscore exceeds it, and the scan stops.
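The NRA loop above can be sketched compactly as follows (a simplified round-robin version without candidate garbage collection; the list data follows the slide's example):

```python
def nra(lists, k):
    """lists[i]: (doc, score) pairs sorted by descending score."""
    seen = {}                                  # doc -> {list index: score}
    high = [lst[0][1] for lst in lists]        # per-list upper bounds
    for depth in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                high[i] = score
                seen.setdefault(doc, {})[i] = score
        # recompute [worstscore, bestscore] for every doc seen so far
        bounds = {}
        for doc, scores in seen.items():
            worst = sum(scores.values())
            best = worst + sum(h for i, h in enumerate(high) if i not in scores)
            bounds[doc] = (worst, best)
        topk = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        min_k = bounds[topk[-1]][0] if len(topk) == k else 0.0
        # stop when neither a seen candidate nor any unseen doc can beat min-k
        if (all(bounds[d][1] <= min_k for d in bounds if d not in topk)
                and sum(high) <= min_k):
            break
    return [(d, bounds[d][0]) for d in topk]

# Example lists from the slide (query terms t1, t2, t3), k = 1
t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
t2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]
top = nra([t1, t2, t3], 1)
```

On this data the scan stops at depth 3 with d10 as the top-1 result, matching the slide's trace.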

10 Inverted Block-Index for Content & Structure. Example query: //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")], with one inverted list per (tag, term) key: sec["xml"], title["native"], par["retrieval"].
- Mostly sorted (= sequential) access to large element blocks on disk.
- Elements grouped per document; blocks ordered by descending (maxscore, docid), and all elements of a doc for a given (tag, term) key are block-scanned.
- Stored as inverted files or database tables with columns (eid, docid, score, pre, post, maxscore).
- Two B+-tree indexes over the full range of attributes (index-organized tables in Oracle).
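The (maxscore, docid) block grouping can be sketched as follows (toy entries loosely modeled on the slide's sec["xml"] list; the field layout (eid, docid, score, pre, post) is assumed for illustration):

```python
from collections import defaultdict

def build_block_index(entries):
    """entries: (eid, docid, score, pre, post) rows for one (tag, term) key.
    Returns doc blocks ordered by descending maxscore, then ascending docid;
    within a block, elements are sorted by descending score."""
    by_doc = defaultdict(list)
    for e in entries:
        by_doc[e[1]].append(e)
    blocks = []
    for docid, elems in by_doc.items():
        maxscore = max(e[2] for e in elems)
        blocks.append((maxscore, docid, sorted(elems, key=lambda e: -e[2])))
    blocks.sort(key=lambda blk: (-blk[0], blk[1]))
    return blocks

# Toy entries loosely modeled on the slide's sec["xml"] list
entries = [(46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
           (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12)]
blocks = build_block_index(entries)
```

A sequential scan over these blocks then sees each document's best element first, which is what makes the (maxscore, docid) order useful for threshold algorithms.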

11 Navigational Element Index. Example query: //sec[about(.//title, "native")]//par[about(.//, "retrieval")], where the sec step carries no term condition and is resolved via a navigational index.
- Additional index for tag paths, with columns (eid, docid, pre, post).
- Random accesses via a B+-tree index with key (docid, tag).
- Few & judiciously scheduled "expensive predicate" probes.
- Schema-oblivious indexing & querying: non-schematic XML data, no DTD required.
- Supports full NEXI syntax & all 13 XPath axes (plus level).

12 TopX Query Processing Example (k = 2) for //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]. The engine block-scans the index lists sec["xml"], title["native"], and par["retrieval"] (scores 1.0, 0.9, 0.85, ... in descending order), maintaining a [worstscore, bestscore] interval per candidate element group (e.g. elements 46, 28, 51 of doc 2) and a pseudo-doc bound for not-yet-seen documents. As the scans proceed, the min-2 threshold rises (0.0 → 0.5 → 0.9 → 1.0 → 1.6), candidates whose bestscore drops below it are pruned from the candidate queue, and doc 2 ends up in the top-2 results with worstscore 2.2.

13 Outline (section divider): next up is (2) Probabilistic Index Access Scheduling.

14 Index Access Scheduling [VLDB '06].
- SA scheduling: look-ahead Δi through precomputed score histograms over the inverted block index (e.g. Δ1,3 = 0.8 vs. Δ3,3 = 0.2 in the diagram); knapsack-based optimization of the expected score reduction.
- RA scheduling: 2-phase probing; schedule RAs "late & last", i.e., clean up the candidate queue once a cost condition (given as a formula on the slide) holds.
- Extended probabilistic cost model for integrating SA & RA scheduling.
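The knapsack-style SA scheduling can be illustrated with a deliberately simplified brute-force allocator (the Δ values below are hypothetical; the real TopX scheduler derives them from score histograms and uses a proper knapsack optimization rather than enumeration):

```python
from itertools import product

def schedule_sa(delta, budget):
    """delta[i][j-1]: expected reduction of high_i after j further sorted
    accesses on list i (in TopX, precomputed from score histograms).
    Brute-force search over batch allocations; fine for a query's few lists."""
    best_red, best_alloc = 0.0, (0,) * len(delta)
    for alloc in product(*[range(len(d) + 1) for d in delta]):
        if sum(alloc) <= budget:
            red = sum(delta[i][j - 1] for i, j in enumerate(alloc) if j > 0)
            if red > best_red:
                best_red, best_alloc = red, alloc
    return best_red, best_alloc

# Hypothetical look-ahead reductions for three lists, SA budget of 3 steps
delta = [[0.3, 0.8], [0.1, 0.2], [0.2, 0.2]]
reduction, alloc = schedule_sa(delta, 3)
```

Here the allocator spends two steps on list 1 (reduction 0.8) and one on list 3 (0.2), because lowering the high_i bounds fastest is what lets NRA-style algorithms stop early.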

15 Outline (section divider): next up is (3) Probabilistic Candidate Pruning.

16 Probabilistic Candidate Pruning [VLDB '04].
- Indexing time: sample each inverted list's score distribution into a histogram (f1 over [0, high1], f2 over [0, high2], ...).
- Query-processing time: estimate P[d gets into the final top-k] by convolving the score distributions of d's still-unseen dimensions (assuming independence) and measuring the probability mass above the gap δ(d).
- Probabilistic candidate pruning: drop d from the candidate queue if P[d gets into the final top-k] < ε, with probabilistic guarantees for precision & recall.
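The convolution-based test can be sketched as follows (equal-width score buckets are an assumption of this sketch; the actual TopX histograms and the details of the ε test may differ):

```python
def convolve(h1, h2):
    """Convolve two probability histograms over equal-width score buckets."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def prob_exceeds(histograms, bucket_width, delta):
    """P[sum of the unseen dimensions' scores > delta]; one histogram per
    still-unseen query dimension of a candidate."""
    dist = histograms[0]
    for h in histograms[1:]:
        dist = convolve(dist, h)
    return sum(p for b, p in enumerate(dist) if b * bucket_width > delta)
```

A candidate would then be dropped when this probability, applied to its gap δ(d), falls below ε.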

17 Outline (section divider): next up is (4) Dynamic Query Expansion.

18 Dynamic Query Expansion [SIGIR '05]. Example (TREC Robust topic #363): Top-k(transport, tunnel, ~disaster), where ~disaster expands to {disaster, accident, fire, ...}.
- Incremental merging of the inverted lists for expansion terms ti,1 ... ti,m, in descending order of s(ti,j, d).
- Best-match score aggregation.
- Specialized expansion operators: Incremental Merge operator; Nested Top-k operator (efficient phrase matching).
- Boolean (but ranked) retrieval mode.
- Supports any sorted inverted index for text, structured records & XML.

19 Incremental Merge Operator. For ~t = {t1, t2, t3} with sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5 (obtained from large-corpus term correlations, thesaurus lookups, or relevance feedback), the operator lazily merges the three sorted inverted lists: every entry of list ti is rescaled by sim(t, ti), and entries are emitted in globally descending rescaled score, e.g. d78: 0.9 · 1.0 = 0.9; d64: 0.8 · 0.9 = 0.72; d11: 0.9 · 0.5 = 0.45. Index-list metadata (e.g., histograms) supplies the initial high-scores per list. Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning.
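A heap-based sketch of the Incremental Merge operator, using the slide's example lists and similarities (best-match aggregation here keeps only a document's first, i.e. highest, rescaled score):

```python
import heapq

def incremental_merge(lists, sims):
    """lists[i]: (doc, score) sorted by descending score; sims[i] = sim(t, t_i).
    Yields (doc, rescaled score) in globally descending order, pulling from
    each list only as needed."""
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            doc, s = lst[0]
            heapq.heappush(heap, (-s * sims[i], i, 0, doc))  # max-heap via negation
    emitted = set()
    while heap:
        neg, i, pos, doc = heapq.heappop(heap)
        if pos + 1 < len(lists[i]):                # advance list i lazily
            d2, s2 = lists[i][pos + 1]
            heapq.heappush(heap, (-s2 * sims[i], i, pos + 1, d2))
        if doc not in emitted:                     # best-match aggregation
            emitted.add(doc)
            yield doc, -neg

# Expansion lists and similarities from the slide: ~t = {t1, t2, t3}
t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
merged = list(incremental_merge([t1, t2, t3], [1.0, 0.9, 0.5]))
```

The merged stream starts d78 (0.9), d23 (0.8), d10 (0.8), d64 (0.72), d11 (0.45), matching the slide's merged order, without ever materializing the full expanded list.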

20 Outline (section divider): next up is (5) Experiments: TREC & INEX Benchmarks.

21 TREC Terabyte Benchmark '05/'06.
- Extensive crawl of the .gov domain (2004): 25 million documents, 426 GB of text data.
- 50 ad-hoc-style keyword queries, e.g., "reintroduction of gray wolves", "Massachusetts textile mills".
- Primary cost metrics: Cost = #SA + (cR/cS) · #RA, and wall-clock runtime.

22 TREC Terabyte: cost comparison of scheduling strategies [VLDB '06].

23 TREC Terabyte: wall-clock runtimes [VLDB '06 / TREC '06].

24 INEX Benchmark '06/'07.
- New XMLified Wikipedia corpus: 660,000 documents with 130,000,000 elements, 6.6 GB of XML data.
- 125 NEXI queries, each with a content-only (CO) and a content-and-structure (CAS) formulation:
  CO: +"state machine" figure Mealy Moore
  CAS: //article[about(., "state machine")]//figure[about(., Mealy) or about(., Moore)]
- Primary cost metric: Cost = #SA + (cR/cS) · #RA.

25 TopX vs. Full-Merge.
- Significant cost savings for large ranges of k.
- CAS queries are cheaper than CO queries!

26 Static vs. Dynamic Expansions.
- Query expansions with up to m = 292 keywords & phrases.
- Balanced amount of sorted vs. random disk accesses.
- Adaptive scheduling w.r.t. the cR/cS cost ratio.
- Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness.

27 Efficiency vs. Effectiveness.
- Very good precision/runtime ratio for probabilistic pruning.

28 Official INEX '06 Results: retrieval effectiveness (ranks 3 to 5 out of ~60 submitted runs).

29 Conclusions & Outlook.
- Scalable XML-IR and vague search.
- Mature system; reference engine for INEX topic development & interactive tracks.
- Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend).
- Very efficient prototype reimplementation for text data in C++ (over its own file structures); the C++ version for XML is currently in production at MPI.
- More features: graph top-k, proximity search, an XQuery subset, ...

