SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing.

Contents Motivation Problem Definition Perfect Binary SF-Tree Variations of SF-Tree Experiments Conclusion

Motivation

XML is becoming more and more popular, e.g., for product catalogs, invoices, documentation, etc. We need to extract useful information from these documents, so queries, usually expressed in XQuery or XPath, are important

Motivation

e.g.,
FOR $i IN document("*")//buyer/name,
    $j IN document("*")//seller/name
WHERE $i/text() = $j/text()
RETURN $i

Motivation To evaluate this query, different evaluation plans are available:
FOR $i IN document("*")//buyer/name,
    $j IN document("*")//seller/name
WHERE $i/text() = $j/text()
RETURN $i

Motivation To select the best query plan, we need the selectivity of simple path expressions (SPEs) To retrieve selectivity counts, we need a structure to store suitable statistics of the original XML document collection

Motivation We may use path indices to store this SPE-to-selectivity mapping, e.g., DataGuides, path trees, 1-indexes, etc. Disadvantage: they are designed for absolute path expressions (APEs) and are inefficient for simple path expressions (SPEs)

Motivation

To increase efficiency, we may use a suffix trie to store the mapping. Advantage: very fast, no need to scan the whole index tree. Disadvantage: very large, especially when the data tree is complex

Motivation

These structures are not flexible: they do not adapt to frequently asked queries and cannot make trade-offs between space and speed. We can tolerate some errors in selectivity retrieval, and it is good to know a bound on the error

Motivation Our goal: to build a new structure for estimating the selectivity of SPEs which is faster than the path indices, requires less space than suffix tries, provides a flexible trade-off between space and time, and provides an accuracy guarantee on the estimated selectivity

Problem Definition

An XML document is a text document that logically represents a directed, edge-labeled graph called the data graph; it consists of elements, attributes and other information items

Problem Definition

Data Paths: a sequence of information items (e_1, e_2, …, e_n) where each e_i is directly nested within e_{i-1} for 2 ≤ i ≤ n, i.e., a path in a data graph (not necessarily starting from the root)

Problem Definition Simple Path Expression (SPE): asks about the structure of XML documents; has the form //t_1/t_2/…/t_n, where the t_i are tags; returns all information items e_n that end a data path (e_1, e_2, …, e_n) satisfying the SPE

Problem Definition Satisfy: a data path (e_1, e_2, …, e_n) satisfies an SPE //t_1/t_2/…/t_n if every e_i has tag t_i. Selectivity (also counts, or selectivity counts): the selectivity of an SPE //t_1/t_2/…/t_n is the number of information items e_n returned by the SPE

Problem Definition e.g., //buyer/name, //name and //invoice/buyer/name are all SPEs; in the example document, the selectivity of //name is 2, of //products is 1, and of //invoice is 1
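The definitions above can be made concrete in code. Below is a minimal sketch, under the assumption of a hypothetical dict-based data tree (tag names taken from the slide's example): an SPE //t_1/…/t_n selects every node e_n that ends a data path whose tags are t_1, …, t_n, starting anywhere in the tree.

```python
# Hypothetical dict-based data tree; selectivity of an SPE is the number
# of nodes that end a data path matching the SPE's tag sequence.

def make(tag, *children):
    node = {"tag": tag, "children": list(children), "parent": None}
    for c in children:
        c["parent"] = node
    return node

def selectivity(root, spe):
    """Count the nodes returned by the SPE (given as a list of tags)."""
    count, stack = 0, [root]
    while stack:
        node = stack.pop()
        stack.extend(node["children"])
        cur, ok = node, True
        for tag in reversed(spe):          # match t_n at node, t_{n-1} at its parent, ...
            if cur is None or cur["tag"] != tag:
                ok = False
                break
            cur = cur["parent"]
        count += ok
    return count

doc = make("invoice",
           make("buyer", make("name")),
           make("seller", make("name")))
print(selectivity(doc, ["name"]))           # 2, as in the slide's example
print(selectivity(doc, ["buyer", "name"]))  # 1
```

The upward walk from each candidate node is what makes SPEs (which may start anywhere) more expensive than absolute path expressions, which motivates the specialized structures below.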

Problem Definition

The Problem: given a simple path expression p and a collection D of XML documents, find an estimated selectivity s_1 of p in D such that s_1 is as close as possible to the actual selectivity s of p in D

SF-Tree

Basic Idea: divide all possible SPEs in an XML document collection into distinct groups such that each SPE in a group has the same (or similar) selectivity. The task of finding the selectivity of an SPE then becomes the task of finding which selectivity group the SPE belongs to

SF-Tree e.g.,

SF-Tree In each group, we use a signature file to store all the paths that belong to the group. This reduces the storage requirement and improves access speed, but introduces errors into the structure

Signature Files A traditional technique for improving the efficiency of keyword-based information retrieval: each keyword is hashed to m bit positions of a signature file F, and for each keyword in the document, the corresponding m bits in F are set

Signature Files e.g., F has 10 bits, m = 3 //name is hashed to bits 2, 3, 8 //buyer is hashed to bits 2, 4, 9

Signature Files To check whether a query keyword k exists in F, k is hashed and we check whether all m bits are set. If any of the m bits is unset, k is not in F; if all m bits are set, k is very likely to be in F

Signature Files e.g. (cont'd), if a query keyword //seller is hashed to bits 2, 8, 9, F will say //seller is in it, but that is not true: bits 2, 8, 9 were in fact set by other keywords, so a false drop has occurred

Signature Files Previous analysis showed that in the optimal case the false drop rate is (1/2)^m, achieved when |F| = m|D|/ln 2, where D is the set of all keywords in the document. In our case, we treat an SPE in a group as a keyword and a group as a document
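The scheme above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the hash function, m = 3 and the 64-bit signature length are arbitrary choices (a real deployment would size |F| near m|D|/ln 2 to hit the optimal false drop rate).

```python
import hashlib

M = 3        # bits set per keyword (the m above)
F_BITS = 64  # signature length |F|; ideally m|D|/ln 2 for the optimal case

def positions(keyword, m=M, size=F_BITS):
    """Hash a keyword (here, an SPE string) to m bit positions of F."""
    digest = hashlib.sha256(keyword.encode()).digest()
    return [digest[i] % size for i in range(m)]

def insert(sig, keyword):
    """Set the keyword's m bits in the signature (an int used as a bit array)."""
    for p in positions(keyword):
        sig |= 1 << p
    return sig

def maybe_contains(sig, keyword):
    """False: definitely absent.  True: present, or occasionally a false drop."""
    return all(sig >> p & 1 for p in positions(keyword))

sig = 0
for spe in ["//buyer/name", "//seller/name"]:
    sig = insert(sig, spe)

assert maybe_contains(sig, "//buyer/name")  # inserted keywords are always found
```

Note the asymmetry that the SF-Tree exploits: a negative answer from a signature file is always correct, while a positive answer may rarely be a false drop.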

SF-Tree There are many groups: for 16-bit selectivity counts, there are 2^16 groups if each group has a distinct selectivity count. To get the selectivity, we need to find which group an SPE belongs to, and a linear scan of all groups is not feasible: it is slow and error-prone

SF-Tree Organize the groups into a tree, called an SF-Tree (Signature File Tree). Rationale: create representative groups over several groups; if an SPE is not in a representative group, it can never be in any of the descendant groups of that representative group

SF-Tree

Basic selectivity retrieval algorithm: 1. start from the root; 2. check which child contains the SPE p; 3. descend to that child and repeat steps 2 and 3 until a leaf is reached; 4. return the selectivity associated with that leaf
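The retrieval loop above can be sketched as follows. The node layout is hypothetical, and a plain Python set of SPEs stands in for each node's signature file, so this version cannot false-drop, whereas a real SF-Tree occasionally can.

```python
# Sketch of the basic SF-Tree retrieval loop.  Each node stores the set of
# SPEs in the groups below it (a stand-in for the node's signature file).

def lookup(node, spe):
    """Descend from the root toward the leaf whose group contains the SPE."""
    while node["children"]:
        for child in node["children"]:
            if spe in child["group"]:   # a signature-file test in the real structure
                node = child
                break
        else:
            return 0                    # no child contains the SPE: selectivity 0
    return node["selectivity"]

# Two selectivity groups (counts 1 and 2) under a root holding their union.
leaf1 = {"group": {"//buyer/name", "//seller/name"}, "selectivity": 1, "children": []}
leaf2 = {"group": {"//name"}, "selectivity": 2, "children": []}
root = {"group": leaf1["group"] | leaf2["group"], "children": [leaf1, leaf2]}

print(lookup(root, "//name"))   # 2
print(lookup(root, "//price"))  # 0
```

Because a negative signature test is always correct, a query that fails at the root (or at every child of some node) can safely report selectivity 0 without visiting the rest of the tree.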

Perfect Binary SF-Tree A very straightforward way to build the representative nodes is to group according to the most significant bits of the selectivity count, i.e., all descendants of a node have the same most significant bits in their selectivity counts. A perfect binary SF-Tree is thus formed
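Under this construction, the root-to-leaf path of a selectivity group is simply the binary expansion of its count, most significant bit first. A one-function sketch (the depth d = 4 is an arbitrary choice for illustration):

```python
def path_to_leaf(selectivity, d=4):
    """In a perfect binary SF-Tree of depth d, branch i of the root-to-leaf
    path is the i-th most significant bit of the selectivity count."""
    return [(selectivity >> (d - 1 - i)) & 1 for i in range(d)]

print(path_to_leaf(5))   # 5 = 0101 in 4 bits -> [0, 1, 0, 1]
print(path_to_leaf(12))  # 12 = 1100 in 4 bits -> [1, 1, 0, 0]
```

This is what makes the structure easy to analyze: the leaf reached determines the count directly, and every group sits at exactly depth d.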

Perfect Binary SF-Tree Advantages: easy to build, easy to analyze. Disadvantages: less flexible, cannot exploit frequently asked queries, not that space-efficient

Perfect Binary SF-Tree Properties (proofs are skipped): low error rate. Let d be the depth of the SF-Tree and a be the depth difference between the false-drop leaf (equivalently, the correct leaf) and their least common ancestor. For negative queries, the false drop rate is R_neg(d) < 1/2^(d(m-1)); for positive queries, the maximum error size is 2^a and R_pos(d, a) < 1/2^(ma-a+1)

Perfect Binary SF-Tree High accessing speed: assuming no false drops, T_neg(d) = 2 signature checks and T_pos(d) = 2d signature checks; this does not depend on the complexity of the XML documents

Perfect Binary SF-Tree Each false drop induces at most 2 more signature checks; the expected number of signatures checked for a positive query is C < 2d + (2d+2)(d+2)/2^m

Perfect Binary SF-Tree Storage: each SPE must appear at one leaf and at all ancestors of that leaf, and each SPE contributes m/ln 2 bits to each signature that contains it. Space requirement = (m/ln 2) Σ_v (|v| d_v), where v ranges over the leaf nodes

Perfect Binary SF-Tree Properties summary: negligible error rate (if m = 10 and d = 16, the error rate for a negative query is 1/2^144!); for positive queries, the error rate decreases as the error size increases; error rate and accessing time are independent of the complexity of the XML documents

Variations of SF-Tree

Drawbacks of a perfect binary SF-Tree: every SPE has the same depth in the SF-Tree, so we cannot optimize for frequently asked queries; there are too many layers, and we may tolerate a higher error rate in exchange for a smaller space requirement

Variations of SF-Tree Huffman SF-Tree Shannon-Fano SF-Tree Cropped SF-Tree

Huffman SF-Tree Build a Huffman tree on the selectivity groups. The weight of each group can be: (A) the number of SPEs in that group, or (B) the probability that a random query falls into that group
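A sketch of this construction using the standard Huffman merge on a heap; the group weights are hypothetical, and for brevity the function reports only the depth each group ends up at rather than the tree itself.

```python
import heapq

def huffman_depths(weights):
    """Huffman tree over selectivity groups; returns each group's depth.
    weights: {group: weight}, where the weight is (A) the number of SPEs
    in the group or (B) the probability a random query falls into it."""
    heap = [(w, i, (g,)) for i, (g, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    depth = dict.fromkeys(weights, 0)
    tick = len(heap)                      # unique tie-breaker for merged nodes
    while len(heap) > 1:
        w1, _, g1 = heapq.heappop(heap)   # merge the two lightest subtrees
        w2, _, g2 = heapq.heappop(heap)
        for g in g1 + g2:                 # everything under the merge sinks one level
            depth[g] += 1
        heapq.heappush(heap, (w1 + w2, tick, g1 + g2))
        tick += 1
    return depth

# Skewed weights, as observed for SPE selectivities: heavy groups stay shallow.
print(huffman_depths({1: 50, 2: 30, 3: 12, 4: 5, 5: 3}))
# {1: 1, 2: 2, 3: 3, 4: 4, 5: 4}
```

With weights of type (A) this minimizes Σ_v |v| d_v, the quantity the storage formula depends on; with type (B) it minimizes the expected depth of a query's leaf, and hence the expected number of signature checks.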

Huffman SF-Tree

Advantages: optimal in space required over all binary SF-Trees if the weight is (A); optimal in expected number of signatures checked if the weight is (B). Disadvantage: the error rate is independent of the error size

Shannon-Fano SF-Tree Build a Shannon-Fano tree over the selectivity groups; the order of the leaves is not changed; the depth is similar to that of a Huffman tree; the error rate now decreases as the error size increases

Shannon-Fano SF-Tree We observed that the distribution of SPEs over selectivities is very skewed: most SPEs have a very small selectivity, and the distribution is similar to a geometric progression with ratio smaller than 1. The use of a Huffman or Shannon-Fano tree greatly reduces the depth of the SF-Tree

Shannon-Fano SF-Tree

Cropped SF-Tree A cropped SF-Tree is an easy trade-off between access time, space and accuracy. There are two types: the head-cropped SF-Tree (the top h levels of the SF-Tree are removed) and the tail-cropped SF-Tree (the bottom t levels are removed)

Cropped SF-Tree original SF-Tree

Head-cropped SF-Tree The height of the SF-Tree decreases, so less space is required. Accessing time increases because we have to check all the children of the removed root in the first step. Accuracy decreases a little because the maximum possible value of a decreases

Tail-cropped SF-Tree The height of the SF-Tree decreases, so less space is required and accessing time is shorter (fewer levels to traverse). Accuracy is lowered: a leaf now represents a range of selectivities, i.e., the quantization step of the selectivity is larger
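The quantization effect can be illustrated numerically. The midpoint estimator below is an assumption for illustration, not taken from the paper: after cropping t bottom levels, a leaf covers 2^t consecutive counts, and one plausible estimate is the midpoint of that bucket.

```python
def cropped_estimate(selectivity, t):
    """After tail-cropping t levels, a leaf covers 2**t consecutive counts;
    return the midpoint of the bucket containing the true count (a
    hypothetical estimator chosen for illustration)."""
    lo = (selectivity >> t) << t          # lower end of the leaf's bucket
    return lo + (1 << t) // 2

print(cropped_estimate(5, 2))  # bucket [4, 8) -> 6
print(cropped_estimate(5, 0))  # t = 0: no cropping, buckets of size 1 -> 5
```

The maximum quantization error is thus bounded by 2^(t-1), which is the accuracy given up in exchange for the t levels of signatures saved.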

Experiments

Data Sets

Experiments Query Workloads Positive SPEs: randomly pick 2000 SPEs from all the SPEs in the XML document collection, each SPE having an equal probability of being picked. Negative SPEs: randomly concatenate tags from the XML document collection, then remove those with non-zero selectivity, until 2000 SPEs are generated

Experiments

Comparison to Competitors

Performance of SF-Tree

Conclusions

A new structure, the SF-Tree, is introduced. The SF-Tree is much quicker than path indices like path trees, and uses much less space than suffix tries. The SF-Tree provides the means to make a trade-off between space, time and accuracy

Conclusions Although the SF-Tree returns only an estimated selectivity, its accuracy is guaranteed. Negative queries: false drop rate < 1/2^(d(m-1)). Positive queries: false drop rate < 1/2^(ma-a+1), with max error size 2^a

Discussions The SF-Tree is a new approach to approximating "string-to-number" structures. It may be applicable in other settings, such as approximating a multi-dimensional histogram

Questions?

References
A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In VLDB.
C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM TOIS, 2(4).
R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB.
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr. XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation. In VLDB.
T. Milo and D. Suciu. Index structures for path expressions. In ICDT.
N. Polyzotis and M. Garofalakis. Statistical synopses for graph-structured XML databases. In SIGMOD, 2002.

References
N. Polyzotis and M. Garofalakis. Structure and value synopses for XML data graphs. In VLDB.
W3C. Extensible Markup Language (XML).