Bloom Based Filters for Hierarchical Data Georgia Koloniari and Evaggelia Pitoura University of Ioannina, Greece
2 Outline Motivation Problem Description Related Work Our approach: Multi-Level Bloom Filters Performance Evaluation Hierarchical Distribution of Filters Experimental Results Conclusions Future Work
3 Motivation Evolution of peer-to-peer systems as an effective way of sharing data Wide use of XML for data representation and exchange in the Internet Service Descriptions in XML-based languages Growing interest in content-based routing of data Challenge: How to efficiently discover the appropriate data based on their content?
4 The Problem A peer-to-peer system where each node stores a set of XML documents A query issued at a node may need results from multiple nodes in the system Use data summaries at each node to assist query routing A B C SumB SumC
5 Summaries Requirements Scalability: summaries should be able to scale to a large number of users and shared documents. Distribution: should be distributed across the nodes of the peer-to-peer system without requiring any central point of control. Dynamic: should support updates, since in a peer-to-peer system, users join and leave the system at will.
6 Related Work XML Indices –The Index Fabric [Cooper & Shadmon, RightOrder Inc 2001] –XSKETCH Synopsis [Polyzotis & Garofalakis, VLDB 2002] –APEX [Chung, Min & Shim, ACM SIGMOD 2002] –Path Tree [Aboulnaga, Alameldeen & Naughton, VLDB 2001] –Signature-based Indices [Park & Kim, DASFAA 2001] Routing in P2P –Secure Service Discovery [Hodes et al, Mobicom ’99] –Routing indices [Crespo & Garcia-Molina, ICDCS 2002]
7 Data Model [Figure: example XML tree. Root: device. Children of device: printer, camera. Children of printer: color, postscript. Child of camera: digital.]
8 Querying XML-based data or service descriptions Find the documents that satisfy a given query Queries that exploit content and structure of the data Membership Queries: “Is element X in set Y?” Path Queries: consisting of regular path expressions, e.g. device/*/camera
9 Bloom Filters Compact data structures for a probabilistic representation of a set Appropriate to answer membership queries
10 Bloom Filters (cont’d) Query for b: check the bits at positions H_1(b), H_2(b), ..., H_4(b).
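The look-up described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `BloomFilter`, the default size `m=1024`, and the way the hash index is salted into the MD5 input are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Set summary with m bits and k hash positions per element."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def positions(self, item):
        # H_1(item) .. H_k(item): one MD5 digest per hash index, reduced mod m.
        # Salting the input with the index is an illustrative assumption.
        return [
            int.from_bytes(hashlib.md5(f"{i}:{item}".encode()).digest(), "big") % self.m
            for i in range(1, self.k + 1)
        ]

    def add(self, item):
        for p in self.positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # "Possibly present" only if every one of the k bits is set.
        return all(self.bits[p] for p in self.positions(item))


summary = BloomFilter()
for name in ["device", "printer", "camera"]:
    summary.add(name)
print("printer" in summary)  # True: an inserted element is always found
```

A query can return a false positive when all k bits happen to be set by other elements, but it never misses an element that was inserted.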
11 Bloom Filters (cont’d) Appearance of false positives. False positive: the filter recognizes an element as belonging to the set although it does not. The probability of a false positive is P = (1 - e^{-kn/m})^k, for k hash functions, n inserted elements and a filter of m bits. Ease of updates with the use of an array of counters Unable to represent relationships between elements
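As a quick numeric check of the false-positive estimate, the sketch below evaluates the formula for a few filter sizes. The function name `false_positive_rate` and the sizes m are hypothetical; k = 4 and n = 50 mirror the hash-function count and elements per document reported in the experimental setup.

```python
import math


def false_positive_rate(m, n, k):
    """P = (1 - e^(-k*n/m))^k for m bits, n elements, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


# k = 4 hash functions and n = 50 elements; the sizes m are hypothetical.
for m in (200, 400, 800):
    print(m, round(false_positive_rate(m, n=50, k=4), 4))
```

For m = 400 the estimate is about 0.024, and it shrinks as the filter grows.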
12 Our approach: Bloom filters suitable for distributed environments Main drawback: Unable to represent hierarchies Extend to multi-level Bloom Filters in order to support path queries Two approaches: –Breadth Bloom Filters –Depth Bloom Filters
13 Breadth Bloom Filters One Bloom Filter BBF_i for each level i of the tree In each filter BBF_i we insert the elements of all the nodes of level i. An additional BBF_0 with all the elements to improve performance Each filter may have a different size Look-up: –check BBF_0 for all elements of the path –check each element a_i of the path against the filter of the corresponding level
14 Breadth Bloom Filters [Figure: example tree and its filters.] BBF_0: (device printer camera color postscript digital); BBF_1: (device); BBF_2: (printer camera); BBF_3: (color postscript digital). Queries: /device/printer/color, /printer/postscript
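A minimal sketch of Breadth Bloom Filter construction and look-up for the example tree follows. The names `Bloom`, `build_bbf`, `bbf_match` and the `anchored` flag are hypothetical, as is the tuple encoding of the tree. The sketch assumes that a query not anchored at the root may start at any level, and it does not handle `*`.

```python
import hashlib


class Bloom:
    """Small Bloom filter; bits are kept in a Python int."""

    def __init__(self, m=256, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _pos(self, item):
        return [int.from_bytes(hashlib.md5(f"{i}:{item}".encode()).digest(), "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._pos(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._pos(item))


def build_bbf(tree):
    """tree = (name, [children]). Returns [BBF_0, BBF_1, ..., BBF_depth]."""
    filters = [Bloom()]                      # BBF_0: every element name
    level = [tree]
    while level:
        bbf = Bloom()
        for name, _children in level:
            bbf.add(name)
            filters[0].add(name)
        filters.append(bbf)
        level = [child for _name, children in level for child in children]
    return filters


def bbf_match(filters, path, anchored=True):
    """path: list of element names (no '*' support in this sketch)."""
    if not all(name in filters[0] for name in path):
        return False                         # some element occurs nowhere
    depth = len(filters) - 1
    starts = range(1, depth - len(path) + 2)  # levels the path may start at
    if anchored:
        starts = starts[:1]                  # root-anchored: level 1 only
    return any(all(name in filters[s + j] for j, name in enumerate(path))
               for s in starts)


TREE = ("device", [("printer", [("color", []), ("postscript", [])]),
                   ("camera", [("digital", [])])])
bbf = build_bbf(TREE)
print(bbf_match(bbf, ["device", "printer", "color"]))             # True
print(bbf_match(bbf, ["printer", "postscript"], anchored=False))  # True
```

Because each BBF_i records only which elements occur at level i, not their parent-child links, `bbf_match(bbf, ["printer", "digital"], anchored=False)` also returns True even though the tree contains no such path.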
15 Depth Bloom Filters One Bloom Filter DBF_i for each path length i, where a path of length i has i+1 nodes In each DBF_i we insert all paths of the tree with length i. Look-up for a path of length p: –Check all elements of the query in DBF_0 –Check for every sub-path of length 2 to p –For * split the path at the position of * and check each sub-path separately
16 Depth Bloom Filters [Figure: example tree and its filters.] DBF_0 (paths of length 0): (device printer camera color postscript digital); DBF_1 (paths of length 1): (device/printer device/camera camera/digital printer/color printer/postscript); DBF_2 (paths of length 2): (device/camera/digital device/printer/color device/printer/postscript). Queries: /device/printer/color, /device/*/postscript
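The Depth Bloom Filter look-up can be sketched in the same spirit. All names here (`Bloom`, `paths_from`, `nodes`, `build_dbf`, `segment_ok`, `dbf_match`) are hypothetical, and the sketch makes one simplifying assumption: after splitting the query at each `*`, it checks every contiguous sub-path of each piece, from single elements up to the whole piece.

```python
import hashlib


class Bloom:
    """Small Bloom filter; bits are kept in a Python int."""

    def __init__(self, m=256, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _pos(self, item):
        return [int.from_bytes(hashlib.md5(f"{i}:{item}".encode()).digest(), "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._pos(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._pos(item))


def paths_from(node):
    """Downward paths (tuples of names) starting exactly at `node`."""
    name, children = node
    yield (name,)
    for child in children:
        for rest in paths_from(child):
            yield (name,) + rest


def nodes(node):
    """Every node of the tree, root first."""
    yield node
    for child in node[1]:
        yield from nodes(child)


def build_dbf(tree):
    """Returns [DBF_0, DBF_1, ...]; DBF_i holds the paths with i+1 nodes."""
    filters = []
    for start in nodes(tree):
        for path in paths_from(start):
            while len(filters) < len(path):
                filters.append(Bloom())
            filters[len(path) - 1].add("/".join(path))
    return filters


def segment_ok(filters, seg):
    """Check every contiguous sub-path of a '*'-free segment."""
    if len(seg) > len(filters):
        return False                      # longer than any stored path
    for size in range(1, len(seg) + 1):
        for start in range(len(seg) - size + 1):
            if "/".join(seg[start:start + size]) not in filters[size - 1]:
                return False
    return True


def dbf_match(filters, query):
    segments, current = [], []
    for step in query.strip("/").split("/"):
        if step == "*":                   # split the path at each '*'
            segments.append(current)
            current = []
        else:
            current.append(step)
    segments.append(current)
    return all(segment_ok(filters, seg) for seg in segments if seg)


TREE = ("device", [("printer", [("color", []), ("postscript", [])]),
                   ("camera", [("digital", [])])])
dbf = build_dbf(TREE)
print(dbf_match(dbf, "/device/printer/color"))   # True
print(dbf_match(dbf, "/device/*/postscript"))    # True
```

Storing whole paths is what lets this variant reject combinations such as printer/digital that a per-level filter cannot distinguish, at the cost of inserting more items.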
17 Experimental Evaluation 200 XML documents produced by the Niagara Generator (www.cs.wisc.edu/niagara) 4 hash functions using the MD5 message digest algorithm (RFC1321) Size of the filter: bits, about 2% of the size of the documents Levels of the documents: 4 Elements per document: 50 No repetition between element names Length of queries: 3 (e.g. /device/camera/digital) 90% of the elements forming the queries were contained in the documents Metric: Percentage of false positives
18 Influence of filter size
19 Influence of the number of elements per document
20 Influence of the levels of the document
21 Influence of the length of the queries
22 Varying the query workload Workload type: /printer/digital
23 Summary of Results Multi-level Bloom filters outperform Simple Bloom filters in evaluating path queries. With filters sized at 2% of the total size of the data, multi-level Bloom filters evaluate path queries with a false-positive ratio below 3%, while Simple Blooms fail to recognize the correct paths, no matter how much the filter size increases. Breadth Blooms work better than Depth Blooms. Depth Blooms require more space but are suitable for handling queries for which Breadth Blooms present a high ratio of false positives (exp. 5)
24 Distribution Each node stores: –local summary –merged summary of neighbours –merged summary constructed by applying the bit-wise OR per level Nodes organized according to topological proximity Two organizations of nodes: –hierarchical –horizons
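The per-level merge can be sketched as below. This is an assumption-laden illustration: a summary is modelled as a list of per-level bit vectors stored as Python ints, the function name `merge_levels` is hypothetical, and levels missing from a shallower summary are treated as all-zero.

```python
from itertools import zip_longest


def merge_levels(*summaries):
    """Bit-wise OR of corresponding levels across several summaries."""
    merged = []
    for levels in zip_longest(*summaries, fillvalue=0):
        acc = 0
        for bits in levels:
            acc |= bits
        merged.append(acc)
    return merged


# Two hypothetical neighbours: one with two levels, one with a single level.
neighbour_e = [0b0011, 0b0100]
neighbour_f = [0b0101]
print([bin(level) for level in merge_levels(neighbour_e, neighbour_f)])
```

The merged summary answers "possibly present" for anything either neighbour would, which is what allows a node to decide whether a query is worth forwarding.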
25 Distribution: Hierarchical Organization Node C: Local filter Merged filter :E F G H Root filters: A, B, D
26 Bloom Filter Similarity Nodes organized according to Bloom Filter Similarity Measure: similarity measure based on the Manhattan distance metric. Let B and C be two filters of size m: d(B, C) = |B[1] - C[1]| + |B[2] - C[2]| + … + |B[m] - C[m]|. similarity(B, C) = m - d(B, C).
27 Bloom Filter Similarity (cont’d) C B similarity(B, C) =8 - ( ) = 4 For multi-level Bloom filters, similarity is defined as the sum of the similarities of each pair of corresponding levels
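The similarity measure translates directly into code. In the sketch below the function names and the two 8-bit vectors are made up for illustration; the vectors are not the ones from the slide's figure, they were simply chosen to differ in four positions.

```python
def manhattan(b, c):
    """d(B, C): number of positions where two equal-size bit vectors differ."""
    return sum(abs(x - y) for x, y in zip(b, c))


def similarity(b, c):
    """similarity(B, C) = m - d(B, C) for filters of size m."""
    return len(b) - manhattan(b, c)


def multilevel_similarity(levels_b, levels_c):
    """Sum of the similarities of corresponding levels."""
    return sum(similarity(b, c) for b, c in zip(levels_b, levels_c))


# Hypothetical filters of size m = 8 that differ in four positions.
B = [1, 0, 1, 1, 0, 0, 1, 0]
C = [1, 1, 0, 1, 0, 1, 1, 1]
print(similarity(B, C))                        # 8 - 4 = 4
print(multilevel_similarity([B, B], [C, B]))   # 4 + 8 = 12
```

For bit vectors the Manhattan distance is the same as the Hamming distance, so the similarity is simply the number of positions in which the two filters agree.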
28 Content-Based Organization When a node joins the system: –it broadcasts its local summary and attaches to the most «similar» node available
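The join step can be sketched as picking the candidate whose summary is most similar to the newcomer's local summary. Everything named here (`similarity`, `choose_attachment`, the node ids and their bit vectors) is hypothetical, and the broadcast itself is reduced to a dictionary of the summaries received in reply.

```python
def similarity(levels_b, levels_c):
    """Multi-level similarity: per level, size minus Manhattan distance."""
    return sum(len(b) - sum(abs(x - y) for x, y in zip(b, c))
               for b, c in zip(levels_b, levels_c))


def choose_attachment(local, candidates):
    """candidates: node id -> summary. Returns the id of the most similar node."""
    return max(candidates, key=lambda node: similarity(local, candidates[node]))


# A joining node with a one-level summary and two nodes that answered.
local = [[1, 0, 1, 0]]
candidates = {
    "A": [[0, 1, 0, 1]],   # differs in all four bits: similarity 0
    "B": [[1, 0, 1, 1]],   # differs in one bit: similarity 3
}
print(choose_attachment(local, candidates))   # B
```

Ties are broken in favour of the first candidate encountered, which is an arbitrary choice of this sketch.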
29 Performance in Distributed Setting Hierarchical organization of nodes Metric: Number of hops Parameters: –Variable number of nodes –Number of hierarchies: 5 –Maximum out-degree: 5 –Every 10% of all docs 70% similar –Length of queries: 2 –10% of the documents have results –70% of the documents contain the elements of the path query –One document per node
30 Finding the first result with respect to the nodes
31 Finding all the results with respect to the nodes
32 Finding the first result with varying number of results
33 Finding the first result with respect to the nodes
34 Finding all the results with respect to the nodes
35 Summary of Results The content-based organization is much more efficient than the proximity organization in finding all the results for a query. They both perform similarly in discovering the first result. The content-based organization outperforms the proximity one when the nodes that satisfy a given query are limited. Both Simple and multi-level Blooms can be efficiently used as distributed filters. For path queries, multi-level Blooms outperform Simple ones.
36 Conclusions We introduced two novel data structures: Breadth and Depth Bloom Filters that exploit both the content and structure of the XML documents given a small space overhead. The new data structures outperform simple Bloom Filters with respect to false positives when addressing regular path expression queries. Distributed in large-scale systems to support efficient service discovery Extended the use of Bloom filters to organize the nodes according to their content.
37 Future Work Explore different policies for the filters distribution. Explore different types of data summaries (e.g. Signatures) Extend the data model to XML graphs and incorporate values into the indexes