Martin Theobald Max-Planck-Institut Informatik Stanford University

TopX: Efficient & Versatile Top-k Query Processing for Text, Semistructured & Structured Data

Example query:
//article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]

[Figure: two example article trees (elements article, title, sec, par, bib, item, inproc, url with sample text), among them “Current Approaches to XML Data Management” and “The XML Files”, matched against the query and annotated with the three challenges RANKING, VAGUENESS and PRUNING.]

Unified Text & XML Schema
Frontends: Web Interface, Web Service, API
TopX Query Processor (Query Processing Time):
  Probabilistic Index Access Scheduling
  Probabilistic Candidate Pruning
  Dynamic Query Expansion
  Top-k Queue, Candidate Queue, Candidate Cache
  Sequential Access (SA) scan threads, Random Access (RA)
  Incremental XPath Engine, Auxiliary Predicates
Thesaurus: WordNet, OpenCyc, etc.
Index Metadata: Selectivities, Histograms, Correlations
DBMS / Inverted Lists: Unified Text & XML Schema
Indexer / Crawler (Indexing Time)

Data Model
XML trees (no XLinks or ID/IDref attributes)
Pre-/postorder node labels
Redundant full-content text nodes

Example document (article1):
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Full-content text per node (stemmed, stopwords removed):
article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
title: “xml data manage”
abs: “xml manage system vary wide expressive power”
sec: “native xml data base native xml data base system store schemaless data”
title (in sec): “native xml data base”
par: “native xml data base system store schemaless data”

Full-content term frequency: ftf(“xml”, article1) = 4
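The data model above can be illustrated with a minimal sketch that assigns pre-/postorder labels and counts full-content term frequencies. It assumes plain lowercase tokenization without the stemming and stopword removal visible on the slide, and the helper names `label_and_count` and `is_descendant` are hypothetical, not part of TopX.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""


def label_and_count(root):
    """Return {pre: (tag, post, Counter of full-content terms)} per element."""
    table = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]
        for child in elem:
            visit(child)
        counters["post"] += 1
        # Full content = text of the element plus all of its descendants.
        text = " ".join(elem.itertext()).lower()
        terms = Counter(re.findall(r"[a-z0-9]+", text))
        table[pre] = (elem.tag, counters["post"], terms)

    visit(root)
    return table


def is_descendant(table, anc_pre, desc_pre):
    """Ancestor test on labels: pre(a) < pre(d) and post(a) > post(d)."""
    return anc_pre < desc_pre and table[anc_pre][1] > table[desc_pre][1]


table = label_and_count(ET.fromstring(DOC))
ftf_article = table[1][2]["xml"]  # full-content term frequency at the root
```

Because every element stores the text of its whole subtree, a term such as “xml” is counted once per occurrence at every ancestor level, which is what makes the text nodes redundant.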

Scoring Model [INEX ’06/’07]
XML-specific extension to Okapi BM25 (originating from probabilistic text IR)
ftf instead of tf
ef instead of df
Element type-specific length normalization
Tunable parameters k1 and b
Example: bib[“transactions”] vs. par[“transactions”]
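As a rough sketch of what such a scoring function might look like, the following takes the standard Okapi BM25 form and applies the substitutions named above (ftf for tf, ef for df, length normalization per element type). The exact TopX formula is not given in this transcript; the function name, the defaults k1 = 1.2 and b = 0.75, and all numbers in the example are assumptions.

```python
import math


def element_score(ftf, ef, n_elems, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of one element for one tag-term condition.

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements with this tag that contain the term
    n_elems    -- total number of elements with this tag
    length     -- full-content length of the element
    avg_length -- average full-content length over elements with this tag
    """
    norm = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (norm + ftf)
    idf_part = math.log((n_elems - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# Hypothetical statistics: "transactions" is common in bib elements, rare in par.
score_bib = element_score(ftf=2, ef=400, n_elems=1000, length=30, avg_length=30)
score_par = element_score(ftf=2, ef=10, n_elems=1000, length=30, avg_length=30)
```

With tag-specific element frequencies, the same term frequency yields a much higher score in par than in bib, which is the point of the bib vs. par example.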

TopX Query Processing [VLDB ’05]
Query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
Index lists, one per tag-term condition: sec[“xml”], title[“native”], par[“retrieval”]
Index list columns: eid, docid, score, pre, post

[Figure: animated top-2 example. Sorted accesses over the three index lists fill the candidate queue; the figure tracks a worst score per candidate, the max-q bound, and the min-2 threshold (values shown across the animation steps include 0.0, 0.5, 0.9, 1.0 and 1.6).]
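The bookkeeping in this example (worst-score bounds, best-score bounds, and a min-k threshold) can be sketched with a simplified threshold algorithm in the style of the NRA family. This is not the TopX algorithm itself: it uses flat item ids with no structural joins over pre/post labels, round-robin sorted access only, and hypothetical scores.

```python
def top_k(lists, k):
    """lists: {condition: [(item, score), ...]} sorted by descending score.
    Returns (top-k items with worst scores, number of sorted accesses)."""
    names = list(lists)
    pos = {n: 0 for n in names}
    # high[n]: upper bound for the score of any item not yet seen in list n
    high = {n: (lists[n][0][1] if lists[n] else 0.0) for n in names}
    seen = {}  # item -> {condition: score}
    accesses = 0

    def worst(item):
        return sum(seen[item].values())

    def best(item):
        return worst(item) + sum(high[n] for n in names if n not in seen[item])

    while True:
        progressed = False
        for n in names:  # one round of sorted accesses
            if pos[n] < len(lists[n]):
                item, score = lists[n][pos[n]]
                pos[n] += 1
                accesses += 1
                seen.setdefault(item, {})[n] = score
                high[n] = score if pos[n] < len(lists[n]) else 0.0
                progressed = True
        ranked = sorted(seen, key=worst, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        min_k = worst(top[-1]) if len(top) == k else 0.0
        done = (len(top) == k
                and sum(high.values()) <= min_k          # unseen items cannot qualify
                and all(best(c) <= min_k for c in rest))  # queued items cannot qualify
        if done or not progressed:
            return [(c, worst(c)) for c in top], accesses


# Hypothetical index lists for the three conditions of the query.
LISTS = {
    'sec["xml"]': [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.1)],
    'title["native"]': [("b", 1.0), ("a", 0.9), ("d", 0.3), ("c", 0.2)],
    'par["retrieval"]': [("a", 1.0), ("c", 0.8), ("b", 0.75), ("d", 0.05)],
}
result, accesses = top_k(LISTS, 2)
```

On this input the loop stops after three rounds, before the last entry of each list is read, because no remaining candidate can still exceed the min-2 threshold.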

Index Access Scheduling [VLDB ’06]
Inverted Block Index
SA Scheduling
  Look-ahead Δi through precomputed score histograms
  Knapsack-based optimization of Score Reduction
RA Scheduling
  2-phase probing: Schedule RAs “late & last”
  Extended probabilistic cost model for integrating SA & RA scheduling

[Figure: three index lists scanned by SA in blocks, with example score reductions Δ1,3 = 0.8 and Δ3,3 = 0.2, followed by RA.]
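One way to read the knapsack formulation is as a choice of how many blocks to scan in each list so that the total reduction of the lists' high scores is maximal for a fixed batch of block reads. The sketch below solves that small allocation problem with dynamic programming. The function name and the per-block high scores are hypothetical; only the reductions Δ1,3 = 0.8 and Δ3,3 = 0.2 are taken from the slide, and the actual TopX scheduler may differ.

```python
def schedule_sorted_accesses(highs, budget):
    """highs[i][j]: upper bound for unread scores in list i after j blocks.
    Returns (total score reduction, blocks per list) for at most `budget` reads."""
    # best[b] = best (reduction, allocation) over the lists handled so far
    best = [(0.0, [])] * (budget + 1)
    for h in highs:
        nxt = []
        for b in range(budget + 1):
            options = []
            for x in range(min(b, len(h) - 1) + 1):
                reduction, alloc = best[b - x]
                # Reading x blocks lowers this list's high score by h[0] - h[x].
                options.append((reduction + h[0] - h[x], alloc + [x]))
            nxt.append(max(options, key=lambda o: o[0]))
        best = nxt
    return best[budget]


# Hypothetical look-ahead values, e.g. read off precomputed score histograms.
HIGHS = [
    [1.0, 0.9, 0.8, 0.2],  # list 1: reduction after 3 blocks = 0.8
    [1.0, 0.7, 0.6, 0.6],  # list 2
    [1.0, 0.9, 0.9, 0.8],  # list 3: reduction after 3 blocks = 0.2
]
reduction3, alloc3 = schedule_sorted_accesses(HIGHS, 3)
reduction4, alloc4 = schedule_sorted_accesses(HIGHS, 4)
```

With a budget of three blocks, all reads go to list 1, where the scores drop most steeply; a fourth block goes to list 2 rather than to the flat list 3.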

Probabilistic Pruning [VLDB ’04]
Convolutions of score distributions (assuming independence) to estimate P[d gets in the final top-k]
Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε
With probabilistic guarantees for precision & recall

[Figure: score densities f1 and f2 with upper bounds high1 and high2 for the lists title[“native”] and par[“retrieval”] (columns eid, …, max-score), obtained by sampling, and the threshold δ(d); the panels are labeled Indexing Time and Query Processing Time.]
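A minimal sketch of the pruning test, under stated assumptions: each remaining list is summarized by a discrete histogram in which bucket i stands for the score i × width, the lists are independent, and δ(d) is taken to be the gap between the current min-k threshold and the candidate's worst score. The function names and all numbers are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete score variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def prob_exceeds(histograms, delta, width):
    """P[sum of the remaining scores > delta]; bucket i means score i * width."""
    dist = [1.0]
    for h in histograms:
        dist = convolve(dist, h)
    return sum(p for i, p in enumerate(dist) if i * width > delta)


def should_prune(worst, min_k, histograms, width, eps):
    """Drop the candidate if it is unlikely to close the gap to min-k."""
    delta = min_k - worst
    return prob_exceeds(histograms, delta, width) < eps


# Hypothetical histograms over the scores 0.0, 0.5 and 1.0 for two unseen lists.
F1 = [0.5, 0.3, 0.2]
F2 = [0.6, 0.3, 0.1]
p_small_gap = prob_exceeds([F1, F2], delta=1.25, width=0.5)
p_large_gap = prob_exceeds([F1, F2], delta=1.75, width=0.5)
```

With ε = 0.05, a candidate that still needs more than 1.75 would be dropped, while one that needs more than 1.25 would stay in the queue.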

Dynamic Query Expansion [SIGIR ’05]
Example: TREC Robust Topic #363, Top-k (transport, tunnel, ~disaster)
Incrementally merge inverted lists for expansion terms ti,1 ... ti,m in descending order of s(tij, d)
Best-match score aggregation
Specialized expansion operators
  Incremental Merge operator
  Nested Top-k operator (efficient phrase matching)
Boolean (but ranked) retrieval mode
Supports any sorted inverted index for text, structured records & XML

[Figure: SA over the index lists for transport and tunnel, and an Incr. Merge over the lists for the expansions of ~disaster (disaster, accident, fire).]

Incremental Merge Operator
Sources shown in the figure: thesaurus lookups / relevance feedback, index list metadata (e.g., histograms), initial high-scores, large corpus term correlations
Expansion terms ~t = { t1, t2, t3 }
Expansion similarities: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
Merged output for ~t via SA, in descending order of similarity × score: d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, d88 0.3, ...
Meta histograms seamlessly integrate Incremental Merge into probabilistic scheduling and candidate pruning

[Figure: the three index lists for t1, t2 and t3 with their per-document scores.]
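The merge step can be sketched as a lazy k-way merge of similarity-weighted index lists, here built on `heapq.merge` from the Python standard library. The per-list postings below are illustrative values chosen to be consistent with the start of the merged stream on the slide, not a faithful copy of the figure, and this sketch reads the head of every list up front rather than opening lists on demand from their initial high-scores.

```python
import heapq
from itertools import islice


def incremental_merge(expansions):
    """expansions: [(similarity, [(doc, score), ...])], each list sorted by
    descending score. Yields (doc, similarity * score) in descending order."""
    def weighted(sim, postings):
        for doc, score in postings:
            yield (sim * score, doc)

    streams = [weighted(sim, postings) for sim, postings in expansions]
    for score, doc in heapq.merge(*streams, key=lambda e: e[0], reverse=True):
        yield doc, score


def best_match(stream):
    """Best-match aggregation: keep only the first (highest) score per doc."""
    emitted = set()
    for doc, score in stream:
        if doc not in emitted:
            emitted.add(doc)
            yield doc, score


# Illustrative lists for the expansion terms t1, t2, t3 of ~t.
EXPANSIONS = [
    (1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]),
    (0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]),
    (0.5, [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]),
]
first_nine = list(islice(incremental_merge(EXPANSIONS), 9))
first_docs = [doc for doc, _ in islice(best_match(incremental_merge(EXPANSIONS)), 5)]
```

Since the merged stream is itself sorted by descending score, it can be consumed by the top-k processor like any other index list.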

Some Experiments
New XML-ified Wikipedia corpus (INEX 2006)
660,000 documents w/ 130,000,000 elements
125 INEX queries, each as content-only (CO) and content-and-structure (CAS) formulation
CO: +“state machine” figure Mealy Moore
CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
Primary cost metric: Cost = #SA + (cR/cS) · #RA
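The cost metric weights each random access by the ratio cR/cS of random to sorted access cost. A small worked example, with a hypothetical function name and made-up access counts and ratio:

```python
def access_cost(num_sa, num_ra, cr_over_cs):
    """Cost = #SA + (cR/cS) * #RA, in units of one sorted access."""
    return num_sa + cr_over_cs * num_ra


# Hypothetical run: 1,000 sorted accesses, 10 random accesses, cR/cS = 100.
example_cost = access_cost(num_sa=1000, num_ra=10, cr_over_cs=100)
```

At this ratio the ten random accesses cost as much as all one thousand sorted accesses combined.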

TopX vs. Full-Merge
Significant cost savings for large ranges of k
CAS cheaper than CO!

Efficiency vs. Effectiveness
Very good precision/runtime ratio for probabilistic pruning

Static vs. Dynamic Expansions
Query expansions with up to m=292 keywords & phrases
Balanced amount of sorted vs. random disk access
Adaptive scheduling wrt. cR/cS cost ratio
Dynamic expansions superior to static expansions & full-merge in both efficiency & effectiveness

Thanks…
Gerhard Weikum, Ralf Schenkel
Norbert Fuhr, Michalis Vazirgiannis
Holger Bast, Debapriyo Majumdar
All the MPI & INEX folks

topx.sourceforge.net
See our SIGMOD ’07 demo!
