
1 Statistical structures for Internet-scale data management. Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum. Presented by Fateme Shirazi, Spring 2010.

2 Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Implementation; Results; Conclusion.

3 Peer-to-Peer (P2P): File sharing in overlay networks. Millions of users (peers) provide the storage and bandwidth for searching and fetching files.

4 Motivation: In P2P file sharing, the total number of (unique) documents shared by the users is often needed. Distributed P2P search engines need to evaluate the significance of keywords: the ratio of indexed documents containing each keyword to the total number of indexed documents.

5 Motivation (continued): Internet-scale information retrieval systems need a method to deduce the rank/score of data items. Sensor networks need methods to compute aggregates. Traditionally, query optimizers rely on histograms over stored data to estimate the sizes of intermediate results.

6 Overview: A large number of nodes form the system's infrastructure. They contribute and/or store data items and are involved in operations such as computing synopses and building histograms. In general, queries do not affect all nodes: aggregation functions are computed over data sets defined dynamically by a filter predicate of the query.

7 Problem Formulation: Relevant data items are stored, in unpredictable ways, at a subset of all nodes. A large number of different data sets is expected to exist, stored at (perhaps overlapping) subsets of the network, and queries and synopses may be built and used over any of these data sets.

8 Computational Model: Data stored in the P2P network is structured in relations. Each relation R consists of (k+l) attributes (columns): R(a1, …, ak, b1, …, bl). A tuple's identifier is either one of its attributes or is calculated otherwise (e.g. from a combination of its attributes). [Slide figure: an example relation with columns attr1, attr2, attr3.]

9 Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion.

10 Distributed Hash Tables: A family of structured P2P network overlays exposing a hash-table-like interface (a lookup service). Examples of DHTs include Chord, Kademlia, Pastry, CAN… Any node can efficiently retrieve the value associated with a given key.

11 Chord: Nodes are assigned identifiers from a circular ID space, computed as the hash of the node's IP address. The ID space is partitioned among the nodes, so that each node is responsible for a well-defined set (arc) of identifiers. Each item is also assigned a unique identifier from the same ID space and is stored at the node whose ID is closest to the item's ID.
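
A minimal sketch of this mapping, not the paper's implementation: it assumes SHA-1 hashing and a toy 16-bit ring, and the helper names chord_id and successor are hypothetical. "Closest" is modeled Chord-style as the first node at or after the item's ID going clockwise.

```python
import hashlib

RING_BITS = 16  # toy ring; real Chord uses 160-bit SHA-1 identifiers

def chord_id(key: str) -> int:
    """Hash a key (a node's IP address or an item's key) onto the ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

def successor(item_id: int, node_ids: list[int]) -> int:
    """Node responsible for item_id: first node at or after it, clockwise."""
    candidates = [n for n in sorted(node_ids) if n >= item_id]
    return candidates[0] if candidates else min(node_ids)  # wrap around

nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
print("item stored at node", successor(chord_id("my item key"), nodes))
```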

12 Hash Sketches: Estimate the number of distinct items in a data set D (e.g. the data in a database). Useful in application domains that need to count distinct elements: approximate query answering in very large databases, data mining on the Internet graph, stream processing.

13 Hash Sketches: A hash sketch consists of a bit vector B[·] of length L. To estimate the number n of distinct elements in D, ρ(h(d)) is applied to all d ∈ D and the results are recorded in the bitmap vector B[0 … L−1]. [Slide figure: items d1–d4 hashed into a bit vector, LSB to MSB.] Partially copied from slides of the author.

14 Hash Sketches: Insertions. [Slide figure: data items d1 … dn are hashed by h() into L-bit pseudo-random numbers PRN 1 … PRN n; ρ() maps each to a bit position, which is set to 1 in the bit vector B = b0 … bL−1.] Copied from slides of the author.

15 Hash Sketches: Since h() distributes values uniformly over [0, 2^L), P(ρ(h(d)) = k) = 2^(−k−1). With R = position of the least-significant 0-bit in B, 2^R ~ n. [Slide figure: for items d1–d4, B = 000011, so R = 2 and |D| ~ 2^2 = 4.] Partially copied from slides of the author.
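
Slides 13–15 can be made concrete with a minimal single-machine hash sketch in the Flajolet-Martin style. One assumption beyond the slides: the correction factor φ ≈ 0.77351 from the original Flajolet-Martin analysis, which the slide's rough "2^R ~ n" omits.

```python
import hashlib

L = 32          # bit-vector length
PHI = 0.77351   # Flajolet-Martin correction factor; slide uses plain 2^R ~ n

def h(item: str) -> int:
    """Uniform hash into [0, 2^L)."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest(), "big") % (2 ** L)

def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x."""
    return (x & -x).bit_length() - 1 if x else L - 1

def insert(B: list[int], item: str) -> None:
    B[rho(h(item))] = 1

def estimate(B: list[int]) -> float:
    """R = position of the least-significant 0-bit in B; n ~ 2^R / PHI."""
    R = next((i for i, b in enumerate(B) if b == 0), L)
    return 2 ** R / PHI

B = [0] * L
for d in ("d1", "d2", "d3", "d4"):
    insert(B, d)
print(round(estimate(B)))  # a rough estimate of the 4 distinct items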

16 Distributing Data Synopses: Two alternatives: (1) the "conservative" but popular rendezvous-based approach; (2) the decentralized DHS (Distributed Hash Sketches) approach, in which no node has any special functionality. Partially copied from slides of the author.

17 Mapping DHS bits to DHT Nodes. [Slide figure: a Chord ring of nodes N1–N56; each sketch bit (Bit 0, Bit 1, Bit 2, Bit 3, …) is mapped to an arc of the ring.] Copied from slides of the author.

18 DHS: Counting. [Slide figure: a counting node probes the ring; bits >3 not set; bit 2 not set, retrying; bit 1 set!] Copied from slides of the author.

19 Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion.

20 Computing Aggregates: COUNT-DISTINCT: estimate the number of distinct items in a multi-set. COUNT: add the tuple IDs to the corresponding synopsis, instead of the values of the column in question. SUM: each node locally computes the sum of the values of the column tuples it stores and populates a local hash sketch accordingly. AVG: estimate the SUM and COUNT of the column and take their ratio.
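
As a hedged illustration of the SUM bullet: assuming each node converts its local sum s into s unique tokens and inserts them into a local hash sketch (the "node_id:i" token scheme is an illustrative choice, not necessarily the paper's exact encoding), the OR-merge of all local sketches yields a COUNT-DISTINCT estimate of the global sum. This reuses L, insert(), and estimate() from the hash-sketch example above.

```python
def local_sum_sketch(node_id: str, local_values: list[int]) -> list[int]:
    """Turn the node's local sum into that many unique tokens and sketch them."""
    B = [0] * L
    for i in range(sum(local_values)):
        insert(B, f"{node_id}:{i}")  # tokens are unique across nodes
    return B

sketches = [local_sum_sketch("n1", [3, 5]), local_sum_sketch("n2", [7])]
merged = [max(bits) for bits in zip(*sketches)]  # bitwise OR
print(round(estimate(merged)))  # a rough estimate of SUM = 15
```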

21 COUNT-DISTINCT: Both rendezvous-based hash sketches and the DHS are applicable to estimating the number of (distinct) items in a multi-set. Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system.

22 Counting with the Rendezvous Approach: Nodes first compute a rendezvous ID (e.g. attr1 → h() → 47). They then compute the synopsis locally and send it to the node whose ID is closest to that ID (the "rendezvous node"). The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis. Interested nodes can then acquire the global synopsis by querying the rendezvous node.
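
A minimal sketch of this flow, reusing chord_id()/successor() from the Chord example above; the "relation.column" naming convention for the rendezvous key is an assumption, not taken from the paper.

```python
def rendezvous_node(relation: str, column: str, node_ids: list[int]) -> int:
    """Hash the (relation, column) pair to a rendezvous ID and route to it."""
    return successor(chord_id(f"{relation}.{column}"), node_ids)

def or_merge(local_sketches: list[list[int]]) -> list[int]:
    """What the rendezvous node does: bitwise-OR the local synopses."""
    return [1 if any(bits) else 0 for bits in zip(*local_sketches)]
```

Note the design consequence: the global synopsis lives on one node, which is exactly why the rendezvous approach concentrates load, as the results slides show.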

23 Step 1 [figure-only slide]

24 Step 2 [figure-only slide]

25 Step 3 [figure-only slide]

26 Counting with DHS: In the DHS-based case, nodes storing tuples of R insert them into the DHS: (1) nodes hash their tuples and compute ρ(hash) for each tuple; (2) for each tuple, nodes send a "set-to-1" message to a random ID in the corresponding arc; (3) counting consists of probing random nodes in the arcs corresponding to increasing bit positions until a 0-bit is found.
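
A sketch of step (3) with the network abstracted away: probe_bit(j) stands in for asking a random node in the arc owning bit j whether the bit is set, and retries approximates the "retrying" behavior in the earlier figure; both names are hypothetical simplifications of the paper's protocol.

```python
def dhs_count(probe_bit, num_bits: int, retries: int = 3) -> float:
    """Probe arcs for increasing bit positions until a 0-bit is found."""
    for j in range(num_bits):
        if not any(probe_bit(j) for _ in range(retries)):
            return 2 ** j / PHI  # same estimator as a local hash sketch
    return 2 ** num_bits / PHI   # all probed bits were set

# Toy usage against the in-memory sketch B from the earlier example:
print(round(dhs_count(lambda j: B[j] == 1, num_bits=L)))
```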

27 Step 1 [figure-only slide]

28 Step 2 [figure-only slide]

29 Step 3 [figure-only slide]

30 Histograms: The most common statistical summary technique used by commercial databases. An approximation of the distribution of values in base relations: for a given attribute/column, a histogram is a grouping of attribute values into "buckets". [Slide figure: example histogram over Salary and Age.]

31 Constructing histogram types: Equi-Width histograms. The most basic histogram variant: it partitions the attribute value domain into cells (buckets) of equal spread and assigns to each bucket the number of tuples whose attribute value falls in it.
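
A minimal local equi-width construction; the bucket count and domain bounds are parameters, and the distributed-synopsis machinery the paper builds this on is deliberately omitted here.

```python
def equi_width_histogram(values: list[float], lo: float, hi: float,
                         nbuckets: int) -> list[int]:
    """Equal-spread buckets over [lo, hi); each counts the tuples it holds."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        counts[min(int((v - lo) / width), nbuckets - 1)] += 1  # clamp v == hi
    return counts

print(equi_width_histogram([1, 2, 2, 7, 9], lo=0, hi=10, nbuckets=5))
# -> [1, 2, 0, 1, 1]
```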

32 Other histogram types: Average-Shifted Equi-Width Histograms (ASH) consist of several equi-width histograms with different starting positions in value space; the frequency of each value in a bucket is computed as the average of the estimations given by the individual histograms. Equi-Depth histograms: all buckets have equal frequencies but not (necessarily) equal spreads.
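
For contrast with the equi-width sketch above, a minimal centralized equi-depth construction: bucket boundaries are chosen so that buckets hold (roughly) equal numbers of tuples.

```python
def equi_depth_boundaries(values: list[float], nbuckets: int) -> list[float]:
    """Bucket boundaries giving equal frequencies, unequal spreads."""
    s = sorted(values)
    step = len(s) / nbuckets
    return [s[int(i * step)] for i in range(1, nbuckets)]

print(equi_depth_boundaries([1, 2, 2, 3, 7, 8, 9, 20], nbuckets=4))
# -> [2, 7, 9]: each of the four buckets holds two values
```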

33 Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Implementation; Results; Conclusion.

34 Implementation: 1. Generate the workload. 2. Populate the network with peers. 3. Randomly assign data tuples from the base data to nodes in the overlay. 4. Insert all tuples into the P2P data structures. 5. Select random nodes, reconstruct histograms, and compute aggregates.

35 Measures of Interest: (1) the fairness of the load distribution across nodes in the network; (2) the accuracy of the estimation itself; (3) the number of hops needed to do the estimation. Together these show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches.

36 Fairness: To compute fairness, the load on a given node is measured as the insertion/query/probe "hits" on the node, i.e. the number of times the node is the target of an insertion/query/probe operation. Several metrics are used: the Gini Coefficient, the Fairness Index, and the maximum and total loads for the DHS- and rendezvous-based approaches.

37 The Gini Coefficient: the mean of the absolute differences of every possible pair (normalized by twice the mean). It takes values in the interval [0, 1), where 0.0 is the best possible state (perfect balance) and values approaching 1.0 are the worst. The Gini Coefficient roughly represents the amount of imbalance in the system. [Slide figure: Lorenz-curve illustration with areas A and B, Gini = A/(A+B).]
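
Both fairness metrics from slides 36–37 are short in code; this assumes the slide's "Fairness Index" refers to Jain's fairness index, the usual metric by that name.

```python
def gini(loads: list[float]) -> float:
    """Mean absolute difference over all pairs, normalized by twice the mean;
    0.0 = perfectly balanced load, values near 1.0 = maximally unbalanced."""
    n, mean = len(loads), sum(loads) / len(loads)
    mad = sum(abs(x - y) for x in loads for y in loads) / (n * n)
    return mad / (2 * mean)

def jain_fairness(loads: list[float]) -> float:
    """Jain's fairness index: 1.0 = perfectly fair, 1/n = one node has it all."""
    n = len(loads)
    return sum(loads) ** 2 / (n * sum(x * x for x in loads))

print(gini([10, 10, 10, 10]), jain_fairness([10, 10, 10, 10]))  # 0.0 1.0
print(gini([40, 0, 0, 0]), jain_fairness([40, 0, 0, 0]))        # 0.75 0.25
```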

38 Estimation error: The mean error of the estimation is reported, computed as the percentage by which the distributed estimate differs from the aggregate computed in a centralized manner (i.e. as if all data were stored on a single host).

39 Hop-count Costs: The per-node average hop count for inserting all tuples into the distributed synopsis is measured and shown. The per-node hop-count costs are higher for the DHS-based approach.

40 Outline: Introduction; Background; Computing aggregates and building histograms; Implementation; Results; Conclusion.

41 Results: The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured. Initially, single-attribute relations are created, with integer values in the interval [0, 1000) following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2.

42 Total query load (node hits) over time [figure-only slide]

43 Load distribution: The extra hop-count cost of the DHS-based approach pays off when it comes to load-distribution fairness. The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing.

44 Gini Coefficient [slide figure: rendezvous approach vs. DHS approach]

45 Evolution of the Gini coefficient: In the rendezvous-based approach, a single node carries all the query load. The DHS-based approaches stay at GC ≈ 0.5, which equals the GC of the distribution of distances between consecutive nodes in the ID space, and is thus the best value achievable by any algorithm using randomized assignment of items to nodes.

46 Evolution of the Gini coefficient [figure-only slide]

47 Error for Computing the COUNT Aggregate: [slide figure: rendezvous approach vs. DHS approach.] In both cases, the error is due to the use of hash sketches, and both approaches exhibit the same average error. As expected, the higher the number of bitmaps in the synopsis, the better the accuracy.

48 Insertion hop count: [slide figure: rendezvous approach vs. DHS approach.] The insertion hop-count cost is shown for all aggregates. Hop-count costs are higher for the DHS-based approach, by approximately 8× for both the insertion and query cases.

49 Outline: Introduction; Background: hash sketches; Computing aggregates and building histograms; Experimental setup; Results; Conclusion.

50 A framework for distributed statistical synopses for Internet-scale networks such as P2P systems: extending techniques from centralized settings towards distributed settings, and developing DHT-based higher-level synopses such as Equi-Width, ASH, and Equi-Depth histograms.

51 Conclusion: A fully distributed cardinality estimator providing scalability, efficiency, and accuracy. It is constructed efficiently and scales well with growing network size while maintaining high accuracy, provides a trade-off between accuracy and construction/maintenance costs, and yields a balanced (access and maintenance) load on nodes.

52 Future research: Examining auto-tuning capabilities for the histogram inference engine; integrating it with Internet-scale query processing systems; implementing other types of synopses, aggregates, and histogram variants; and, finally, using these tools for approximate query answering.

53 Thank you

