Statistical structures for Internet-scale data management. Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum. Presented by Fateme Shirazi, Spring 2010.


Outline 2
- Introduction
- Background: Hash sketches
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Peer-to-Peer (P2P) 3
- File sharing in overlay networks
- Millions of users (peers) provide storage and bandwidth for searching and fetching files

Motivation 4
- In P2P file sharing, the total number of (unique) documents shared by the users is often needed
- Distributed P2P search engines need to evaluate the significance of keywords: the ratio of indexed documents containing each keyword to the total number of indexed documents

Motivation 5
- Internet-scale information retrieval systems need a method to deduce the rank/score of data items
- Sensor networks need methods to compute aggregates
- Traditionally, query optimizers rely on histograms over stored data to estimate the size of intermediate results

Overview Sketch 6
- A large number of nodes form the system's infrastructure
- Nodes contribute and/or store data items and are involved in operations such as computing synopses and building histograms
- In general, queries do not affect all nodes
- Aggregation functions are computed over data sets defined dynamically by a filter predicate of the query

Problem Formulation 7
- Relevant data items are stored in unpredictable ways in a subset of all nodes
- A large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network
- Relevant queries and synopses may be built and used over any of these data sets

Computational Model 8
- Data stored in the P2P network is structured in relations
- Each relation R consists of (k+l) attributes, or columns: R(a1, …, ak, b1, …, bl)
- A tuple's identifier is either one of the attributes of the tuple, or calculated otherwise (e.g. a combination of its attributes)

Outline 9
- Introduction
- Background: Hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

Distributed Hash Tables 10
- A family of structured P2P network overlays exposing a hash-table-like interface (lookup service)
- Examples of DHTs include Chord, Kademlia, Pastry, CAN…
- Any node can efficiently retrieve the value associated with a given key

Chord 11
- Nodes are assigned identifiers from a circular ID space, computed as the hash of the node's IP address
- The node-ID space is partitioned among the nodes, so that each node is responsible for a well-defined set (arc) of identifiers
- Each item is also assigned a unique identifier from the same ID space
- An item is stored at the node whose ID is closest to the item's ID
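The ring assignment above can be sketched with a few lines of consistent hashing; in Chord proper the responsible node is the item's successor on the ring. The IP addresses and the small ID-space size below are illustrative assumptions, not from the paper.

```python
import hashlib
from bisect import bisect_left

M = 2 ** 16  # size of the circular ID space (small, for readability)

def chord_id(name: str) -> int:
    """Map a node address or item key into the circular ID space."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % M

# Node IDs: hash of each node's IP address, kept sorted around the ring
ring = sorted(chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"))

def successor(item_id: int) -> int:
    """The node responsible for an item: the first node ID >= the item's ID,
    wrapping around the ring (Chord's successor rule)."""
    i = bisect_left(ring, item_id)
    return ring[i % len(ring)]

owner = successor(chord_id("my-file.txt"))
```

Because both node and item IDs come from the same hashed space, adding or removing a node only moves the items of one arc, which is what makes the partitioning well defined.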

Hash Sketches 12
- Estimate the number of distinct items in a multiset D of data in a database
- Useful for application domains which need to count distinct elements:
  - Approximate query answering in very large databases
  - Data mining on the Internet graph
  - Stream processing

Hash Sketches 13
- A hash sketch consists of a bit vector B[·] of length L
- To estimate the number n of distinct elements in D, ρ(h(d)) is applied to every d ∈ D and the results are recorded in the bit vector B[0 … L−1]
(Figure: items d1, d2, d3, … mapped into the bit vector, LSB to MSB. Partially copied from slides of the author)

Hash sketches: Insertions
(Figure: data items d1 … dn are hashed by h() into L-bit pseudo-random numbers PRN1 … PRNn; ρ() maps each to a position in the sketch bit vector B = b0 … bL−1, LSB to MSB. Copied from slides of the author)

Hash Sketches 15
- Since h() distributes values uniformly over [0, 2^L): P(ρ(h(d)) = k) = 2^(−k−1)
- Let R = position of the least-significant 0-bit in B; then 2^R ≈ n
- Example (figure): items d1, d2, d3, … set bits 0 and 1 but not bit 2, so R = 2 and |D| ≈ 2^2 = 4
(Partially copied from slides of the author)
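A minimal single-node sketch of the estimator described above. The choice of hash function and the vector length are illustrative; the constant 0.77351 is the standard Flajolet-Martin bias factor (E[2^R] ≈ 0.77351·n).

```python
import hashlib

L = 32  # length of the bit vector B

def h(d: str) -> int:
    """Uniform L-bit hash of an item (SHA-1 prefix, an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(d.encode()).digest()[:4], "big")

def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x."""
    return (x & -x).bit_length() - 1 if x else L - 1

def build_sketch(items):
    B = [0] * L
    for d in items:
        B[rho(h(d))] = 1
    return B

def estimate(B) -> float:
    R = B.index(0)              # least-significant 0-bit in B
    return 2 ** R / 0.77351     # Flajolet-Martin bias correction

docs = [f"doc-{i}" for i in range(1000)]
B = build_sketch(docs + docs)   # duplicates leave the sketch unchanged
```

A single bit vector only estimates to within roughly a factor of two; practical systems, including the ones discussed here, average over many independent bitmaps to reduce variance.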

Distributing Data Synopses 16
Two approaches:
(1) the "conservative" but popular rendezvous-based approach
(2) the decentralized way of DHS, in which no node has any sort of special functionality
(Partially copied from slides of the author)

Mapping DHS bits to DHT Nodes 17
(Figure: bits 0, 1, 2, 3, … of the sketch mapped to arcs of the ring of nodes N1 … N56. Copied from slides of the author)

DHS: Counting 18
(Figure: a counting node probes the ring: bits >3 not set; bit 2 not set, retrying…; bit 2 not set; bit 1 not set, retrying…; bit 1 set! Copied from slides of the author)

Outline 19
- Introduction
- Background: Hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

Computing Aggregates 20
- COUNT-DISTINCT: estimate the number of (distinct) items in a multiset
- COUNT: add the tuple IDs to the corresponding synopsis, instead of the values of the column in question
- SUM: each node locally computes the sum of the column values over the tuples it stores and populates a local hash sketch accordingly
- AVG: estimate the SUM and COUNT of the column and then take their ratio

COUNT-DISTINCT 21
- Both rendezvous-based hash sketches and DHS are applicable to estimating the number of (distinct) items in a multiset
- Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system

Counting with the Rendezvous Approach 22
- Nodes first compute a rendezvous ID (e.g. attr1 → h() → 47)
- Each node then computes its synopsis locally and sends it to the node whose ID is closest to that ID (the "rendezvous node")
- The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis
- Interested nodes can then acquire the global synopsis by querying the rendezvous node

Step 1 23

Step 2 24

Step 3 25
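The OR-combination step at the rendezvous node can be illustrated as follows; the sketch is packed into an integer bitmask, and the per-node key sets are made up for the example.

```python
import hashlib

L = 32  # sketch length in bits

def local_sketch(items) -> int:
    """Each node summarizes its local tuples into an L-bit hash sketch,
    here packed into an integer bitmask."""
    bits = 0
    for d in items:
        x = int.from_bytes(hashlib.sha1(d.encode()).digest()[:4], "big")
        bits |= 1 << ((x & -x).bit_length() - 1 if x else L - 1)
    return bits

def combine(sketches) -> int:
    """The rendezvous node merges the per-node sketches with bitwise OR."""
    merged = 0
    for s in sketches:
        merged |= s
    return merged

# Three nodes store overlapping subsets of the same 500 distinct keys
keys = [f"key-{i}" for i in range(500)]
parts = [keys[:300], keys[200:], keys[100:400]]
merged = combine(local_sketch(p) for p in parts)
```

Because a bit is set iff some item hashes to it, the OR of the local sketches equals the sketch of the union of the data sets: items duplicated across nodes are counted once, which is why this combination is lossless for COUNT-DISTINCT.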

Counting with DHS 26
In the DHS-based case, nodes storing tuples of R insert them into the DHS by:
(1) Nodes hash their tuples and compute ρ(hash) for each tuple
(2) For each tuple, nodes send a "set-to-1" message to a random ID in the corresponding arc
(3) Counting consists of probing random nodes in the arcs corresponding to increasing bit positions until a 0-bit is found

Step 1 27

Step 2 28

Step 3 29
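A toy, centralized simulation of steps (1)-(3). The arc size and the number of probe retries are arbitrary assumptions for the sketch, and a real DHS routes these messages over the DHT rather than indexing arrays.

```python
import random

L = 16        # bit positions in the distributed sketch
ARC = 8       # nodes responsible for each bit position (assumed arc size)
PROBES = 3    # probe retries per bit position when counting (assumed)

# arcs[b][j] models node j's local copy of bit b of the sketch
arcs = [[False] * ARC for _ in range(L)]

def set_to_one(b: int) -> None:
    """Step (2): a 'set-to-1' message lands on a random node in bit b's arc."""
    arcs[b][random.randrange(ARC)] = True

def count_probe() -> int:
    """Step (3): probe random nodes in arcs of increasing bit position until
    every probe at some position sees a 0-bit; the estimate is then 2^b."""
    for b in range(L):
        if not any(arcs[b][random.randrange(ARC)] for _ in range(PROBES)):
            return b
    return L

# Step (1): hash each tuple and take rho(hash); modeled by random L-bit numbers
for _ in range(200):
    x = random.getrandbits(L)
    set_to_one((x & -x).bit_length() - 1 if x else L - 1)

R = count_probe()
estimate = 2 ** R
```

Probing a random node in the arc may miss the one replica that holds the set bit, which is why the protocol retries; this is the accuracy-versus-probe-cost knob of the DHS approach.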

Histograms 30
- The most common statistical summary technique used by commercial databases
- An approximation of the distribution of values in base relations
- For a given attribute/column, a histogram is a grouping of attribute values into "buckets"
(Figure: example histogram over Salary and Age)

Constructing histogram types 31
- Equi-Width histograms: the most basic histogram variant
  - Partitions the attribute value domain into cells (buckets) of equal spread
  - Assigns to each bucket the number of tuples whose attribute value falls in it
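For instance, a minimal local equi-width construction; the domain bounds and bucket count below are illustrative choices.

```python
def equi_width(values, lo, hi, buckets):
    """Partition [lo, hi) into `buckets` cells of equal spread and count
    the tuples whose attribute value falls in each cell."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        b = min(int((v - lo) / width), buckets - 1)  # clamp the v == hi edge
        counts[b] += 1
    return counts

# 6 tuples over domain [0, 10) with 5 buckets of spread 2
counts = equi_width([1, 3, 5, 7, 9, 9], lo=0, hi=10, buckets=5)
# counts == [1, 1, 1, 1, 2]; an optimizer would estimate the selectivity
# of "value < 4" as (counts[0] + counts[1]) / 6
```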

Other histogram types 32
- Average-Shifted Equi-Width histograms (ASH)
  - Consist of several equi-width histograms with different starting positions in the value space
  - The frequency of each value in a bucket is computed as the average of the estimations given by the individual histograms
- Equi-Depth histograms
  - All buckets have equal frequencies but not (necessarily) equal spreads
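Equi-depth boundaries are rank-based rather than value-based; a sketch of one common boundary rule (boundary i at rank i·n/buckets of the sorted data, an illustrative choice):

```python
def equi_depth_boundaries(values, buckets):
    """Pick bucket boundaries so each bucket holds roughly the same number
    of tuples: boundary i sits at rank i*n/buckets of the sorted data."""
    s = sorted(values)
    n = len(s)
    return [s[min(i * n // buckets, n - 1)] for i in range(1, buckets)]

# 12 uniformly spread values, 4 buckets of depth 3 each
bounds = equi_depth_boundaries(list(range(12)), 4)   # [3, 6, 9]
```

On skewed data the boundaries crowd into the dense region, which is exactly why equi-depth histograms estimate skewed distributions better than equi-width ones.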

Outline 33
- Introduction
- Background: Hash sketches
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Implementation 34
1. Generating the workload
2. Populating the network with peers
3. Randomly assigning data tuples from the base data to nodes in the overlay
4. Inserting all nodes into the P2P network
5. Selecting random nodes, reconstructing histograms, and computing aggregates

Measures of Interest 35
Considered:
(1) The fairness of the load distribution across nodes in the network
(2) The accuracy of the estimation itself
(3) The number of hops needed to do the estimation
Goal: to show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches

Fairness 36
- To compute fairness, the load on any given node is measured as the insertion/query/probe "hits" on the node, i.e. the number of times this node is the target of an insertion/query/probe operation
- A multitude of metrics are used, more specifically:
  - The Gini Coefficient
  - The Fairness Index
  - Maximum and total loads for the DHS- and rendezvous-based approaches

The Gini Coefficient 37
- Based on the mean of the absolute difference of every possible pair of loads
- Takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state (perfect balance) and values approaching 1.0 are the worst
- Roughly represents the amount of imbalance in the system
- Gini = A/(A+B) (figure: A is the area between the line of equality and the Lorenz curve, B the area under the Lorenz curve)
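The pairwise definition above fits in a few lines; this is an O(n²) toy version over per-node load counts (the load vectors are made up).

```python
def gini(loads):
    """Gini coefficient of a load vector: half the mean absolute difference
    of every possible pair, divided by the mean load."""
    n = len(loads)
    mean = sum(loads) / n
    mad = sum(abs(x - y) for x in loads for y in loads) / (n * n)
    return mad / (2 * mean)

balanced = gini([10, 10, 10, 10])   # 0.0: every node carries equal load
skewed = gini([40, 0, 0, 0])        # 0.75: one node carries everything
```

With n nodes and a single node carrying all the load the value is (n − 1)/n, approaching but never reaching 1.0, which matches the [0, 1) interval quoted above.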

Estimation error 38
- The mean error of the estimation is reported
- Computed as the percentage by which the distributed estimate differs from the aggregate computed in a centralized manner (i.e. as if all data were stored on a single host)

Hop-count Costs 39
- The per-node average hop count for inserting all tuples into the distributed synopsis is measured and shown
- The per-node hop-count costs are higher for the DHS-based approach

Outline 40
- Introduction
- Background
- Computing aggregates and building histograms
- Implementation
- Results
- Conclusion

Results 41
- The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured
- Initially, single-attribute relations are created, with integer values in the interval [0, 1000) following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, or 1.2

Total query load (node hits) over time 42

Load distribution 43
- The extra hop-count cost of the DHS-based approach pays off when it comes to load-distribution fairness
- The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing

Gini Coefficient 44
(Figures: rendezvous approach, DHS approach)

Evolution of the Gini coefficient 45
- In the rendezvous-based approach, a single node bears all the query load
- The DHS-based approaches reach GC values of ≈0.5, which equal the GC values of the distribution of the distances between consecutive nodes in the ID space
- These are thus the best respective values attainable by any algorithm using randomized assignment of items to nodes

Evolution of the Gini coefficient 46

Error for Computing the COUNT Aggregate 47
(Figures: rendezvous approach, DHS approach)
- In both cases, the error is due to the use of hash sketches
- Both approaches exhibit the same average error
- As expected, the higher the number of bitmaps in the synopsis, the better the accuracy

Insertion hop count 48
(Figures: rendezvous approach, DHS approach)
- The insertion hop-count cost for all aggregates is shown
- Hop-count costs are higher for the DHS-based approach by approximately 8× for both the insertion and query cases

Outline 49
- Introduction
- Background: Hash sketches
- Computing aggregates and building histograms
- Experimental setup
- Results
- Conclusion

50
- A framework for distributed statistical synopses for Internet-scale networks such as P2P systems
- Extending techniques from centralized settings towards distributed settings
- Developing DHT-based higher-level synopses like Equi-Width, ASH, and Equi-Depth histograms

Conclusion 51
- A fully distributed cardinality estimator, providing scalability, efficiency, and accuracy
- Constructed efficiently and scaling well with growing network size, while maintaining high accuracy
- Providing a trade-off between accuracy and construction/maintenance costs
- Totally balanced (access and maintenance) load on nodes

Future research 52
- Examining auto-tuning capabilities for the histogram inference engine
- Integrating it with Internet-scale query processing systems
- Looking into implementations for other types of synopses, aggregates, and histogram variants
- Finally, using these tools for approximate query answering

Thank you 53