Approximate XML Query Answers Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas)

Motivation
- XML: de facto standard for data exchange
- Development of the “XML Warehouse”
- Conflict between “on-line” use and query execution cost
  - Increased query response times
  - Users might wait for uninteresting results
[Figure: query Q is evaluated over the XML data warehouse and returns result R]

Approximate Query Answers
- Evaluate the query over a concise data synopsis and obtain an approximation R’ of the true result R
- Use the approximate result as timely feedback
  - User can assess the “value” of the query
- Goal: reduce the number of evaluated queries
[Figure: query Q over the XML data warehouse returns R; the same query over the synopsis returns R’]

Contributions
- TreeSketch Synopses
  - Structural summaries for XML data
  - Approximate answers for complex twig queries
  - Summarization model
  - Structural clustering of elements
  - Efficient processing and construction
- Element Simulation Distance
  - Novel distance metric for XML data
  - Captures “approximate” similarity between two XML trees
- Experimental Results
  - Accurate approximate answers for low space budgets
  - Low-error selectivity estimates
  - Efficient construction algorithm

Outline
- Preliminaries
- TreeSketches
  - Synopsis model
  - Computing approximate answers
  - Summary construction
- Element Simulation Distance
- Experimental Study
- Conclusions

Data and Query Model
[Figure panels: XML Document (tree rooted at r); Twig Query (nodes q0, q1, q2, q3 with path steps //section, .//equation, ./figure); Nesting Tree (the document elements matched by the query, nested under r); Binding Tuples (one column per query node q0..q3)]

Problem Definition
- Process a twig query over a synopsis
- Compute an approximation of the nesting tree
[Figure: the twig query (q0..q3; //section, .//equation, ./figure) evaluated over the XML Data yields the True Nesting Tree; evaluated over the Synopsis it yields the Approximate Nesting Tree]

Graph Synopsis
- Synopsis node: set of elements of the same tag
- Synopsis edge: document edge(s)
[Figure: XML Document alongside its Graph Synopsis with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4)]
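
The node and edge definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy document `DOC`, the function name `graph_synopsis`, and the tag-based partition `by_tag` are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy document: element id -> (tag, parent id).
DOC = {
    "r": ("R", None), "p1": ("P", "r"),
    "s2": ("S", "p1"), "s3": ("S", "p1"),
    "f4": ("F", "s2"), "f5": ("F", "s2"), "f6": ("F", "s3"),
    "c7": ("C", "f4"), "c8": ("C", "f5"), "c9": ("C", "f6"),
}

def graph_synopsis(doc, node_of):
    """Build (extents, edges) for a partition given as element id -> synopsis node."""
    extent = defaultdict(set)
    edges = set()
    for elem, (tag, parent) in doc.items():
        extent[node_of[elem]].add(elem)
        if parent is not None:
            # A synopsis edge exists if at least one document edge connects the two nodes.
            edges.add((node_of[parent], node_of[elem]))
    # Every synopsis node must group elements of a single tag.
    for node, elems in extent.items():
        tags = {doc[e][0] for e in elems}
        if len(tags) != 1:
            raise ValueError(f"node {node} mixes tags {tags}")
    return dict(extent), edges

# Coarsest valid partition: one synopsis node per tag.
by_tag = {elem: tag for elem, (tag, _) in DOC.items()}
extent, edges = graph_synopsis(DOC, by_tag)
sizes = {node: len(elems) for node, elems in extent.items()}
```

A finer partition (for example, two separate F nodes as in the figure) is obtained simply by passing a different `node_of` mapping.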

TreeSketch Synopsis
- Augment the graph synopsis with edge counts
- count[u,v]: mean number of children in v per element in u
[Figure: XML Document alongside a TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4); each edge is labeled with its count]

TreeSketch Synopsis
- Augment the graph synopsis with edge counts
- count[u,v]: mean number of children in v per element in u
[Figure: XML Document alongside a coarser TreeSketch with nodes R(1), P(1), S(2), F(4), E(2), C(4); a single F(4) node replaces the two F(2) nodes, and the edge counts now include a fractional value, 0.5]
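
As a rough sketch of the count[u,v] definition, the snippet below averages child counts over the elements of each synopsis node. The toy document `DOC` and the helper name `edge_counts` are assumptions made for illustration; they do not reproduce the document in the figure.

```python
from collections import Counter

# Hypothetical toy document: element id -> (tag, parent id).
DOC = {
    "r": ("R", None), "p1": ("P", "r"),
    "s2": ("S", "p1"), "s3": ("S", "p1"),
    "f4": ("F", "s2"), "f5": ("F", "s2"), "f6": ("F", "s3"),
    "c7": ("C", "f4"), "c8": ("C", "f5"), "c9": ("C", "f6"),
}

def edge_counts(doc, node_of):
    """count[u, v] = (# document edges from elements of u to elements of v) / |u|."""
    size = Counter(node_of[e] for e in doc)
    total = Counter()
    for elem, (_, parent) in doc.items():
        if parent is not None:
            total[(node_of[parent], node_of[elem])] += 1
    return {(u, v): n / size[u] for (u, v), n in total.items()}

# One synopsis node per tag (an assumed, coarse partition).
by_tag = {elem: tag for elem, (tag, _) in DOC.items()}
counts = edge_counts(DOC, by_tag)
```

In this toy document s2 has two figures and s3 has one, so the S-to-F count is a fractional mean; fractional counts of this kind are what a merged node in a TreeSketch has to carry.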

TreeSketches and Clustering
- TreeSketch: clustering based on structure
  - All elements in a node are mapped to a “centroid”
  - Tight clusters imply an accurate synopsis
  - The perfect synopsis corresponds to a perfect clustering
- Synopsis quality is quantified by clustering error
  - Options: Manhattan distance, squared error, …
  - Quality can be measured independently of a workload
  - Key for effective construction
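
One way to read the clustering-error idea is sketched below, using the squared-error option. The exact definition is an assumption here: each element is represented by its vector of child counts per synopsis node, the centroid of a node is the mean vector, and the error sums squared deviations from the centroid. `DOC` and `squared_error` are hypothetical names.

```python
from collections import Counter, defaultdict

# Hypothetical toy document: element id -> (tag, parent id).
DOC = {
    "r": ("R", None), "p1": ("P", "r"),
    "s2": ("S", "p1"), "s3": ("S", "p1"),
    "f4": ("F", "s2"), "f5": ("F", "s2"), "f6": ("F", "s3"),
    "c7": ("C", "f4"), "c8": ("C", "f5"), "c9": ("C", "f6"),
}

def squared_error(doc, node_of):
    """Sum of squared distances between each element's child-count vector and its centroid."""
    # Child-count vector of every element, indexed by the child's synopsis node.
    vectors = {e: Counter() for e in doc}
    for elem, (_, parent) in doc.items():
        if parent is not None:
            vectors[parent][node_of[elem]] += 1
    members = defaultdict(list)
    for elem in doc:
        members[node_of[elem]].append(elem)
    error = 0.0
    for elems in members.values():
        dims = set().union(*(vectors[e].keys() for e in elems))
        for d in dims:
            mean = sum(vectors[e][d] for e in elems) / len(elems)  # centroid coordinate
            error += sum((vectors[e][d] - mean) ** 2 for e in elems)
    return error

by_tag = {elem: tag for elem, (tag, _) in DOC.items()}
# Finer partition: keep the two sections apart, group everything else by tag.
split_sections = dict(by_tag, s2="S_a", s3="S_b")
coarse_error = squared_error(DOC, by_tag)
fine_error = squared_error(DOC, split_sections)
```

Because the error depends only on the document and the partition, it can be evaluated without any query workload, which is the property the slide highlights.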

Computing Approximate Answers
- Compute the TreeSketch of the approximate answer
- Accuracy depends on the quality of the clustering
[Figure: the Query (q0..q3; //section, .//equation, .//caption) is matched against the TreeSketch with nodes R(1), P(1), S(2), F(2), F(2), E(2), C(4), producing an Approximate Nesting Tree over R, S, E, C; one of its edge counts is obtained as 1+1=2]
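
The following is a minimal sketch of how a descendant step might be estimated over a TreeSketch: edge counts are multiplied along each synopsis path and summed across alternative paths. The synopsis `SKETCH`, the map `TAG`, and the function `expected_descendants` are hypothetical, the synopsis is assumed to be acyclic, and this is not claimed to be the authors' evaluation algorithm.

```python
# Hypothetical TreeSketch: node -> {child node: mean children per element}.
SKETCH = {
    "R": {"P": 1.0},
    "P": {"S": 2.0},
    "S": {"F1": 1.0, "E": 1.0},
    "E": {"F2": 1.0},
    "F1": {"C": 1.0},
    "F2": {"C": 1.0},
    "C": {},
}
# Tag of each synopsis node (two distinct nodes carry the tag F).
TAG = {"R": "R", "P": "P", "S": "S", "E": "E", "F1": "F", "F2": "F", "C": "C"}

def expected_descendants(sketch, tag_of, u, tag):
    """Expected number of descendants with `tag` below one element of node u.

    Assumes the synopsis graph is acyclic, so the recursion terminates.
    """
    total = 0.0
    for v, count in sketch[u].items():
        hit = 1.0 if tag_of[v] == tag else 0.0
        # Each child in v contributes itself (if it matches) plus its own descendants.
        total += count * (hit + expected_descendants(sketch, tag_of, v, tag))
    return total

captions_per_section = expected_descendants(SKETCH, TAG, "S", "C")
captions_per_root = expected_descendants(SKETCH, TAG, "R", "C")
```

In this hypothetical synopsis a section reaches captions along two paths (through F1 and through E and F2), and the two contributions add up, in the same spirit as the sum shown in the figure.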

TreeSketch Construction
- Given an XML tree T, build a TreeSketch of size B
- Difficult clustering problem
  - Space dimensionality depends on the clustering itself
- Construction based on bottom-up clustering
  - Compress the perfect synopsis by merging clusters
  - Best merge determined by marginal gains
[Figure: sequence of merges leading from the Perfect synopsis down to the Space Budget]
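
The bottom-up construction can be illustrated with a simplified greedy loop. Several simplifications are assumptions of this sketch and not the authors' algorithm: the budget is counted in synopsis nodes, the starting point is one cluster per element (a stand-in for the perfect synopsis), and the "best" merge is the same-tag pair that yields the smallest squared clustering error. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy document: element id -> (tag, parent id).
DOC = {
    "r": ("R", None), "p1": ("P", "r"),
    "s2": ("S", "p1"), "s3": ("S", "p1"),
    "f4": ("F", "s2"), "f5": ("F", "s2"), "f6": ("F", "s3"),
    "c7": ("C", "f4"), "c8": ("C", "f5"), "c9": ("C", "f6"),
}

def squared_error(doc, node_of):
    """Squared deviation of each element's child-count vector from its cluster mean."""
    vectors = {e: Counter() for e in doc}
    for elem, (_, parent) in doc.items():
        if parent is not None:
            vectors[parent][node_of[elem]] += 1
    members = defaultdict(list)
    for elem in doc:
        members[node_of[elem]].append(elem)
    error = 0.0
    for elems in members.values():
        for d in set().union(*(vectors[e].keys() for e in elems)):
            mean = sum(vectors[e][d] for e in elems) / len(elems)
            error += sum((vectors[e][d] - mean) ** 2 for e in elems)
    return error

def greedy_merge(doc, budget):
    """Merge same-tag clusters until at most `budget` clusters remain."""
    node_of = {e: e for e in doc}  # each cluster is named after one of its members
    while len(set(node_of.values())) > budget:
        best = None
        for a, b in combinations(sorted(set(node_of.values())), 2):
            if doc[a][0] != doc[b][0]:
                continue  # only clusters with the same tag may be merged
            trial = {e: (a if n == b else n) for e, n in node_of.items()}
            err = squared_error(doc, trial)
            if best is None or err < best[0]:
                best = (err, trial)
        if best is None:
            break  # no legal merge is left
        node_of = best[1]
    return node_of

six = greedy_merge(DOC, 6)
five = greedy_merge(DOC, 5)
```

With a budget of six nodes the two sections stay separate and the sketch is still exact for this toy document; tightening the budget to five forces the lossy merge of the sections.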

Depth-Guided Merging
- Key observation: two elements have similar structure if their children have similar structure
  - Children clusters should be merged first
- Bottom-up merging, based on depth
  - Depth: distance from the leaves of the tree
  - Build a pool of candidate merges by increasing depth
  - Replenish the pool when it falls below a given threshold
- Improved construction time, good performance
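
A small sketch of the depth ordering follows. It assumes that "distance from the leaves" means the longest downward path to a leaf, and that candidate merges are same-tag pairs ordered by the larger depth of the two nodes; both choices, and the names `CHILDREN`, `TAG`, `depths`, and `candidate_merges`, are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical synopsis graph: node -> list of child nodes.
CHILDREN = {
    "R": ["P"], "P": ["S1", "S2"],
    "S1": ["F1"], "S2": ["F2", "E"], "E": ["F3"],
    "F1": [], "F2": [], "F3": [],
}
TAG = {"R": "R", "P": "P", "S1": "S", "S2": "S", "E": "E",
       "F1": "F", "F2": "F", "F3": "F"}

def depths(children):
    """Depth of a node = length of the longest path down to a leaf (leaves have depth 0)."""
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(c) for c in children[n]), default=-1)
        return memo[n]
    for n in children:
        depth(n)
    return memo

def candidate_merges(tag_of, depth):
    """Same-tag node pairs, shallowest first, so children are considered before parents."""
    pairs = [(a, b) for a, b in combinations(sorted(tag_of), 2) if tag_of[a] == tag_of[b]]
    return sorted(pairs, key=lambda p: (max(depth[p[0]], depth[p[1]]), p))

depth = depths(CHILDREN)
pool = candidate_merges(TAG, depth)
```

A construction loop would take merges from the front of `pool` and extend it with deeper pairs once it shrinks below a threshold, as described in the bullets above.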

Error of Approximation
- Error: distance between R’ and R
- Popular metric: tree-edit distance
  - Min-cost sequence of operations that transform R’ to R
  - Measures syntactic differences between R and R’
- Not intuitive for approximate answers!
[Figure: a tree T compared with T1 (different counts, similar trait) and T2 (same counts, opposite trait); each tree has root r with s children whose e and f children carry counts]

Element Simulation Distance
- Captures approximate similarity between R and R’
- u simulates v: u and v have identical structure
- ESD(u,v): “degree” of simulation between u and v
  - How well the structure of u matches the structure of v
- Modeled as the distance between multi-sets
- Efficient computation using perfect summaries
[Figure: the example trees T, T1, and T2, each with root r, s children, and e and f children carrying counts]

Experimental Methodology
- Data sets: XMark, DBLP, IMDB, SwissProt
- Workload: 1000 random twig queries
- Evaluation metrics
  - Average ESD for approximate answers
  - Mean absolute relative error for selectivity estimation
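
For reference, the selectivity metric named above can be sketched as follows. This is the plain form of the mean absolute relative error and assumes every true result size is positive; any smoothing or sanity bound the authors may have applied is not shown on the slide. The function name and the sample numbers are hypothetical.

```python
def mean_absolute_relative_error(true_sizes, estimates):
    """Average of |estimate - true| / true over a workload (all true sizes must be > 0)."""
    if len(true_sizes) != len(estimates) or not true_sizes:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(est - true) / true
               for true, est in zip(true_sizes, estimates)) / len(true_sizes)

# Two made-up queries: errors of 10% and 20%, so the mean is about 15%.
error = mean_absolute_relative_error([100, 50], [110, 40])
```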

Approximate Answers
[Plot: IMDB (~102K elements); avg. result size: 3,477 tuples]

Selectivity Estimation - SwissProt
[Plot: SwissProt (~182K elements); avg. result size: 104,592 tuples]

Selectivity Estimation

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot     473                  365
XMark      2,000                145

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Conclusions
- Approximate query answering for XML databases
- TreeSketch Synopses
  - Structural summaries for tree-structured XML
  - Approximate answers for twig queries
  - Model: graph synopsis + edge counts
  - Efficient processing and construction
- Element Simulation Distance
  - Captures approximate similarity between XML trees
- Experimental Results
  - High accuracy for low space budgets
  - Efficient construction

Questions?

TreeSketch Model (2/2)
- Edge count: average number of children
[Figure: XML Document alongside a TreeSketch with nodes R, P(1), S(2), F(2), F(2), E(2), C(4); edges are labeled with counts such as #E and #C]

XML
[Figure: XML Document tree with elements r, p1, s2, s3, f5, f7, f9, e11, c12, e13, c14, c17]
Tag legend: p: paper, s: section, c: caption, t: title, f: figure, e: equation
