Approximate XML Query Answers Alkis Polyzotis (UC Santa Cruz) Minos Garofalakis (Bell Labs) Yannis Ioannidis (U. of Athens, Hellas)


Motivation
- XML: de facto standard for data exchange
- Development of the “XML Warehouse”
- Conflict between “on-line” and query execution cost
  - Increased query response times
  - Users might wait for uninteresting results
[Figure: query Q is evaluated over the XML data warehouse and returns the XML result R]

Approximate Query Answers
- Evaluate query over a concise data synopsis and obtain an approximation R’ of the true result
- Use approximate result as timely feedback
  - User can assess the “value” of the query
- Goal: reduce number of evaluated queries
[Figure: query Q is evaluated over a synopsis of the XML data warehouse and returns R’ in place of the true XML result R]

Contributions
- TreeSketch Synopses
  - Structural summaries for XML data
  - Approximate answers for complex twig queries
  - Summarization model: structural clustering of elements
  - Efficient processing and construction
- Element Simulation Distance
  - Novel distance metric for XML data
  - Captures “approximate” similarity between two XML trees
- Experimental Results
  - Accurate approximate answers for low space budgets
  - Low-error selectivity estimates
  - Efficient construction algorithm

Outline
- Preliminaries
- TreeSketches
  - Synopsis model
  - Computing approximate answers
  - Summary construction
- Element Simulation Distance
- Experimental Study
- Conclusions

Data and Query Model
[Figure: an XML document tree; a twig query with nodes q0 to q3 and path expressions //section.//equation./figure; the binding tuples over (q0, q1, q2, q3); and the resulting nesting tree]

Problem Definition
- Process twig query over a synopsis
- Compute approximation of nesting tree
[Figure: the twig query (q0 to q3, //section.//equation./figure) evaluated over the XML data yields the true nesting tree; evaluated over the synopsis it yields the approximate nesting tree]

TreeSketch Model

Graph Synopsis
- Synopsis node: set of elements of the same tag
- Synopsis edge: document edge(s)
[Figure: XML document tree next to its graph synopsis with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]
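A minimal sketch of how a graph synopsis could be built from a partition of the elements. The toy document and the `graph_synopsis` helper are hypothetical, and the partition used here (one node per tag) is only the coarsest option; the figure on the slide splits the F elements into two nodes.

```python
from collections import defaultdict

# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "p1": ("P", "r"),
    "s2": ("S", "p1"),
    "s3": ("S", "p1"),
    "f4": ("F", "s2"),
    "f5": ("F", "s3"),
    "c6": ("C", "f4"),
}

def graph_synopsis(doc, node_of):
    """Return (extents, edges) for the partition described by node_of(elem)."""
    extents = defaultdict(set)
    for elem in doc:
        extents[node_of(elem)].add(elem)
    # A synopsis edge (u, v) exists if some document edge goes from u to v.
    edges = set()
    for elem, (_, parent) in doc.items():
        if parent is not None:
            edges.add((node_of(parent), node_of(elem)))
    return dict(extents), edges

# Assumption: one synopsis node per tag.
extents, edges = graph_synopsis(DOC, lambda e: DOC[e][0])
```

Any finer partition works with the same helper as long as every synopsis node contains elements of a single tag.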

TreeSketch Synopsis
- Augment graph synopsis with edge counts
- count[u,v]: mean number of children in v per element in u
[Figure: XML document tree next to its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]
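The edge-count definition can be illustrated with a short sketch. The toy document, the tag-based partition, and the name `edge_counts` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "p1": ("P", "r"),
    "s2": ("S", "p1"),
    "s3": ("S", "p1"),
    "f4": ("F", "s2"),
    "f5": ("F", "s2"),
    "f6": ("F", "s3"),
    "e7": ("E", "s2"),
}

def edge_counts(doc, node_of):
    """count[u, v] = mean number of children in v per element in u."""
    sizes = defaultdict(int)
    for elem in doc:
        sizes[node_of[elem]] += 1
    totals = defaultdict(int)
    for elem, (_, parent) in doc.items():
        if parent is not None:
            totals[(node_of[parent], node_of[elem])] += 1
    return {(u, v): n / sizes[u] for (u, v), n in totals.items()}

# Assumption: one synopsis node per tag.
by_tag = {elem: tag for elem, (tag, _) in DOC.items()}
counts = edge_counts(DOC, by_tag)
```

Here the two S elements have two and one F children, so count[S,F] is 1.5, a value that neither element actually has.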

TreeSketch Synopsis
- Is there a lossless synopsis?
- What is the quality of a lossy synopsis?
[Figure: XML document tree next to its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]

Count Stability
- (u,v) is count-stable: all elements in u have the same child count in v
[Figure: XML document tree next to its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1)]
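A small sketch of the count-stability test. The document, both partitions, and the function name `is_count_stable` are hypothetical.

```python
# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "s2": ("S", "r"),
    "s3": ("S", "r"),
    "f4": ("F", "s2"),
    "f5": ("F", "s2"),
    "f6": ("F", "s3"),
}

def is_count_stable(doc, node_of, u, v):
    """True if every element of node u has the same number of children in node v."""
    per_elem = {e: 0 for e in doc if node_of[e] == u}
    for elem, (_, parent) in doc.items():
        if parent is not None and node_of[parent] == u and node_of[elem] == v:
            per_elem[parent] += 1
    return len(set(per_elem.values())) <= 1

# One node per tag: s2 has two F children, s3 has one, so (S, F) is not stable.
by_tag = {e: tag for e, (tag, _) in DOC.items()}

# Splitting S into two nodes makes every edge out of them stable.
split = dict(by_tag, s2="S1", s3="S2")
```

Splitting nodes until every edge passes this test is what makes the synopsis lossless, at the price of more nodes.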

Count-Stable TreeSketch
- A count-stable synopsis can recover the input tree
- Efficient one-pass construction
- Stable summary can be too large for practical use!
[Figure: XML document tree next to its count-stable TreeSketch with nodes P(1), S(1), S(1), F(2), C(4), F(2), E(2), R(1)]

Lossy TreeSketch
[Figure: XML document tree next to a lossy TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1), annotated with #F child counts]

TreeSketches and Clustering
- A TreeSketch corresponds to a clustering of elements
  - All elements in a node are mapped to a “centroid”
  - Tight clusters yield an accurate synopsis
- Synopsis quality corresponds to clustering error
  - Options: Manhattan distance, squared error, …
  - Quality can be measured independent of a workload
  - Key for effective construction
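One way to read the clustering view, sketched under assumptions: each element is described by its vector of child counts per synopsis node, the centroid of a node is the mean vector (that is, the edge counts), and the error is the summed squared deviation. The exact error formula used by the authors is not given in the slides; `squared_error` and the toy document are hypothetical.

```python
# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "s2": ("S", "r"),
    "s3": ("S", "r"),
    "f4": ("F", "s2"),
    "f5": ("F", "s2"),
    "f6": ("F", "s3"),
}

def squared_error(doc, node_of):
    """Sum of squared deviations of per-element child counts from the node mean."""
    # Child-count vector of every element, keyed by the child's synopsis node.
    counts = {e: {} for e in doc}
    for elem, (_, parent) in doc.items():
        if parent is not None:
            v = node_of[elem]
            counts[parent][v] = counts[parent].get(v, 0) + 1
    members = {}
    for e in doc:
        members.setdefault(node_of[e], []).append(e)
    total = 0.0
    for elems in members.values():
        targets = {v for e in elems for v in counts[e]}
        for v in targets:
            mean = sum(counts[e].get(v, 0) for e in elems) / len(elems)
            total += sum((counts[e].get(v, 0) - mean) ** 2 for e in elems)
    return total

by_tag = {e: tag for e, (tag, _) in DOC.items()}
split = dict(by_tag, s2="S1", s3="S2")
```

The error depends only on the document and the partition, which matches the point that quality can be measured without a query workload.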

Computing Approximate Answers
- Compute TreeSketch of approximate answer
- Accuracy depends on quality of clustering
[Figure: a TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R(1); a twig query with nodes q0 to q3 and path expressions //section.//equation.//caption; and the approximate nesting tree over R, S, E, C, with edge counts that include the sum 1+1=2]
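The slide computes a whole TreeSketch for the answer. The sketch below covers only a simpler, related step and is an assumption rather than the authors' algorithm: estimating how many elements a child-axis path reaches by multiplying edge counts along the synopsis path. The sizes, counts, and `estimate_path` are hypothetical.

```python
# Hypothetical synopsis: node sizes and edge counts (mean children per element).
SIZES = {"R": 1, "P": 1, "S": 2, "F": 3}
COUNTS = {("R", "P"): 1.0, ("P", "S"): 2.0, ("S", "F"): 1.5}

def estimate_path(sizes, counts, path):
    """Estimated number of elements reached by following child edges along path."""
    estimate = float(sizes[path[0]])
    for u, v in zip(path, path[1:]):
        # A missing synopsis edge means no document edge exists between the nodes.
        estimate *= counts.get((u, v), 0.0)
    return estimate

figures = estimate_path(SIZES, COUNTS, ["R", "P", "S", "F"])
nothing = estimate_path(SIZES, COUNTS, ["R", "F"])
```

A descendant step would need to sum such products over every synopsis path between the two nodes, which is presumably where sums like the 1+1=2 in the figure come from.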

TreeSketch Construction
- Given an XML tree T, build a TreeSketch of size B
- Difficult clustering problem
  - Space dimensionality depends on the clustering itself
- Construction based on bottom-up clustering
  - Compress perfect synopsis by merging clusters
  - Best merge determined by marginal gains
  - Heuristic to reduce number of candidate merges
[Figure: a sequence of merges leading from the perfect synopsis down to the space budget]
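A simplified sketch of bottom-up merging. Several assumptions are made here: the budget is counted in synopsis nodes, the best merge is the same-tag pair with the lowest resulting squared error (the slides speak of marginal gains without giving the formula), and every same-tag pair is a candidate, so the candidate-pruning heuristic is left out. All names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "s1": ("S", "r"), "s2": ("S", "r"), "s3": ("S", "r"),
    "f1": ("F", "s1"), "f2": ("F", "s1"),
    "f3": ("F", "s2"), "f4": ("F", "s2"),
    "f5": ("F", "s3"),
}

def squared_error(doc, node_of):
    """Sum of squared deviations of per-element child counts from the node mean."""
    counts = {e: {} for e in doc}
    for elem, (_, parent) in doc.items():
        if parent is not None:
            v = node_of[elem]
            counts[parent][v] = counts[parent].get(v, 0) + 1
    members = {}
    for e in doc:
        members.setdefault(node_of[e], []).append(e)
    total = 0.0
    for elems in members.values():
        for v in {v for e in elems for v in counts[e]}:
            mean = sum(counts[e].get(v, 0) for e in elems) / len(elems)
            total += sum((counts[e].get(v, 0) - mean) ** 2 for e in elems)
    return total

def greedy_merge(doc, node_of, budget):
    """Merge same-tag synopsis nodes until at most `budget` nodes remain."""
    node_of = dict(node_of)
    while len(set(node_of.values())) > budget:
        tag_of = {node_of[e]: doc[e][0] for e in doc}
        best = None
        for a, b in combinations(sorted(tag_of), 2):
            if tag_of[a] != tag_of[b]:
                continue
            trial = {e: (a if n == b else n) for e, n in node_of.items()}
            err = squared_error(doc, trial)
            if best is None or err < best[0]:
                best = (err, trial)
        if best is None:
            break  # no same-tag pair is left to merge
        node_of = best[1]
    return node_of

# Start from a partition in which every S element has its own node.
START = {e: tag for e, (tag, _) in DOC.items()}
START.update({"s1": "S:a", "s2": "S:b", "s3": "S:c"})
merged = greedy_merge(DOC, START, budget=4)
```

With a budget of four nodes, the sketch merges s1 and s2, which both have two F children, and leaves s3 alone, so no error is introduced.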

Element Simulation Distance

Error of Approximation
- Error: distance between R’ and R
- Popular metric: tree-edit distance
  - Min-cost sequence of operations that transform R’ to R
  - Measures syntactic differences between R and R’
- Not intuitive for approximate answers!
[Figure: three trees, each with root r and two s children that have e and f children. T has child counts 1,4 and 4,1. T1 has 4,4 and 1,1 (same counts, opposite trait). T2 has 2,6 and 6,2 (different counts, similar trait)]

Element Simulation Distance
- Capture approximate similarity between R and R’
- u simulates v: u and v have identical structure
- ESD(u,v): “degree” of simulation between u and v
  - How well the structure of u matches the structure of v
- Modeled as the distance between multi-sets
- Efficient computation using perfect summaries
[Figure: trees T (child counts 1,4 and 4,1) and T2 (child counts 2,6 and 6,2), whose e and f children are compared as multi-sets through recursive application of ESD]

Experimental Results

Methodology
- Data sets: XMark, DBLP, IMDB, SwissProt
- Workload: 1000 random twig queries
- Evaluation metrics
  - Average ESD for approximate answers
  - Mean absolute relative error for selectivity estimation
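For reference, mean absolute relative error over a workload can be computed as in the sketch below. Skipping queries whose true count is zero is an assumption made here to avoid division by zero; the slides do not say how such queries were handled, and the numbers are made up.

```python
def mean_absolute_relative_error(true_counts, estimates):
    """Average of |estimate - true| / true over queries with a nonzero true count."""
    pairs = [(t, e) for t, e in zip(true_counts, estimates) if t != 0]
    if not pairs:
        raise ValueError("no query with a nonzero true count")
    return sum(abs(e - t) / t for t, e in pairs) / len(pairs)

# Hypothetical workload of two queries: relative errors 0.10 and 0.25.
error = mean_absolute_relative_error([100, 200], [110, 150])
```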

Approximate Answers - IMDB
- IMDB (~102K elements)
- Avg. result size: 3,477 tuples

Selectivity Estimation - SwissProt
- SwissProt (~182K elements)
- Avg. result size: 104,592 tuples

Selectivity Estimation - ALL

Data Set   #Elements (x 10^3)   #Tuples (x 10^3)
DBLP       1,500                78
IMDB       236                  13
S-Prot
XMark      2,

Data Set   Construction Time (min)
DBLP       11
IMDB       2.5
S-Prot     38
XMark      240

Conclusions
- Approximate query answering for XML databases
- TreeSketch Synopses
  - Structural summaries for tree-structured XML
  - Approximate answers for twig queries
  - Model: graph synopsis + edge counts
  - Efficient processing and construction
- Element Simulation Distance
  - Capture approximate similarity between XML trees
- Experimental Results
  - High accuracy for low space budgets
  - Efficient construction

Questions?

TreeSketch Model (2/2)
- Edge count: average number of children
[Figure: XML document tree next to its TreeSketch with nodes P(1), S(2), F(2), C(4), F(2), E(2), R, with edges annotated by #E and #C counts]

XML
[Figure: XML document tree. Tag legend: p = paper, s = section, c = caption, t = title, f = figure, e = equation]

TreeSketch Synopsis
- Augment graph synopsis with edge counts
- count[u,v]: mean number of children in v per element in u
[Figure: XML document tree next to a TreeSketch with nodes P(1), S(2), C(4), F(4), E(2), R(1), annotated with #F child counts]

Depth-Guided Merging
- Key observation: two elements have similar structure if their children have similar structure
- Bottom-up merging, based on depth
  - Depth: distance from the leaves of the tree
  - Build a pool of candidate merges by increasing depth
  - Replenish the pool when it falls below a given threshold
- Reduced construction time, accurate synopses
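A sketch of the depth computation that such a bottom-up order could rely on. The slide defines depth only as the distance from the leaves; taking the longest downward path is an assumption, and the document and `depths_from_leaves` are hypothetical.

```python
# Hypothetical toy document: element id -> (tag, parent id or None).
DOC = {
    "r": ("R", None),
    "s2": ("S", "r"),
    "s3": ("S", "r"),
    "f4": ("F", "s2"),
}

def depths_from_leaves(doc):
    """Depth of each element: 0 for leaves, else 1 + the deepest child's depth."""
    children = {e: [] for e in doc}
    for elem, (_, parent) in doc.items():
        if parent is not None:
            children[parent].append(elem)
    depth = {}

    def visit(e):
        if e not in depth:
            kids = children[e]
            depth[e] = 0 if not kids else 1 + max(visit(c) for c in kids)
        return depth[e]

    for e in doc:
        visit(e)
    return depth

depth = depths_from_leaves(DOC)
# Elements ordered by increasing depth, the order in which a pool of
# candidate merges could be filled.
order = sorted(DOC, key=lambda e: (depth[e], e))
```

Processing elements in this order means that by the time two parents are considered for merging, merges among their children have already been evaluated.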

Depth-Guided Merging
- Observation: two elements have similar structure if their children have similar structure
- Heuristic: if a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
- Bottom-up merging strategy
- Savings in construction time, accurate synopses