Roger Zimmermann, Wei-Shinn Ku, and Wei-Cheng Chu
Computer Science Department, University of Southern California
Presenter: Xunfei Jiang

Introduction
Problem
- Current architectures rely on a centralized data repository.
- Applications need to utilize and integrate data sets that are remotely accessible and under different administrative control.
Solution: combine spatial databases with Web services
- data is maintained by specific entities or organizations
- the correct data set for a specific calculation can be downloaded automatically, without manual user intervention
- updates and changes to the data are instantly available to remote applications
Proposal: a middleware design based on distributed R-tree and Quad-tree index structures
- access to the data is public and available through a Web services interface
- requests are sent only to the specific repositories that most likely have relevant data

Motivation
- Autonomy
- Cooperative and efficient query processing: the overall system must cooperatively execute a request and return all relevant data; nodes cooperate to decide which other nodes contain potentially relevant data and which do not.
- Decentralization: since the data archives are dispersed, the query access method and routing mechanism are expected to be fully decentralized.
Figure 1: The proposed distributed spatial database infrastructure with middleware utilizing replicated spatial index structures (either R-trees or Quad-trees).

Baseline Algorithm: EQR (Exhaustive Query Routing)
- When a query q arrives at a specific node, it is forwarded to all other nodes.
- Messages generated by queries: M = 2 × Q × (N − 1)
  - Q: the total number of queries
  - N: the number of nodes
  - the factor of 2 accounts for the result message returned for each forwarded query
Shortcomings of EQR
- generates a lot of message traffic
- poor scalability
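As a worked illustration of the estimate above, a tiny Python sketch; the query and node counts are hypothetical, not values from the paper.

```python
def eqr_messages(num_queries: int, num_nodes: int) -> int:
    """Messages under exhaustive query routing: every query is forwarded to
    all N-1 other nodes, and each forwarded copy produces a result message."""
    return 2 * num_queries * (num_nodes - 1)

# Hypothetical example: 1,000 queries over 10 servers.
print(eqr_messages(1000, 10))  # 2 * 1000 * 9 = 18,000 messages
```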

Query Routing with Spatial Indexing
How to reduce the query space
- recursively partition the key space into a set of equivalence classes
- R-tree and Quad-tree algorithms generate tree-structured indices that partition the overall space into successively smaller areas at lower levels of the index hierarchy
- both have been used successfully in the core engines of spatial database systems
Comparison of Quad-tree and R-tree
- Quad-tree structure: each internal node has exactly four children.
- Quad-tree characteristics: decomposes space into adaptable cells; each cell (or bucket) has a maximum capacity, and when that capacity is reached the bucket splits; the tree directory follows the spatial decomposition of the Quad-tree.
- R-tree structure: an extension of the B-tree; the upper and lower bounds on the number of entries per internal node are pre-specified (usually 2 and 4).
- R-tree characteristics: splits space into rectangles; when a rectangle contains more than the maximum number of children, an adjustment (split) is made; object nodes are at the lowest level.

A Novel Method
- Use R-trees and Quad-trees as index structures across multiple spatial databases:
  - insert the MBR of the data set of each archive into a global R-tree or Quad-tree
  - distribute copies of the global index structure to each archive (avoiding a centralized index server)
  - an archive can intersect each query rectangle with the archive MBRs stored in the global index; the query is then forwarded only to archives whose MBR overlaps with the query rectangle
- Cons: additional cost for synchronizing the global index structures.
- Pros: reduces the query overhead; because the global index structures manage only bounding rectangles, changes to the data set of an individual archive result in index updates only if the MBR changes, and this is very infrequent.
- Example: an archive manages 1,000 two-dimensional spatial data objects; its MBR is defined by at most four of them. Insertions or deletions confined within the MBR do not affect the global index; only changes that stretch or shrink the MBR need to be propagated.
- Estimated number of messages when a global index is used:
  M_T = M_Q + M_U = [2 × Q × (N − 1) × S_Q] + [U × (N − 1) × S_U]
  where U is the number of updates, S_Q is the fraction of servers a query actually needs to be forwarded to, and S_U is the fraction of updates that change an MBR. S_U values from one of our experiments ranged from … to …
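A minimal sketch of the routing decision, assuming the replicated global index is reduced to a flat list of archive MBRs (a real deployment would traverse the distributed R-tree or Quad-tree); the data layout and helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def intersects(a: Rect, b: Rect) -> bool:
    """Two axis-aligned rectangles overlap unless one lies entirely
    to the left/right/above/below the other."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def route_query(query_rect: Rect, archive_mbrs: Dict[str, Rect]) -> List[str]:
    """Return only the archives whose MBR overlaps the query rectangle."""
    return [srv for srv, mbr in archive_mbrs.items() if intersects(query_rect, mbr)]

# Hypothetical example with three archives.
archives = {"A": (0, 0, 10, 10), "B": (20, 20, 30, 30), "C": (5, 5, 15, 15)}
print(route_query((8, 8, 12, 12), archives))  # ['A', 'C'] -- B is never contacted
```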

Assumptions
- Every archive in the distributed environment:
  - hosts a database engine (storing, retrieving, querying)
  - holds a directory, termed the server list, with entries denoting the network location (e.g., IP address) and the minimum bounding rectangle (MBR) of every spatial database
- From the directory information, each server computes the corresponding R-tree (or Quad-tree) global index data structure.
- Updates:
  - the local data structure is updated after receiving MBR update messages
  - local MBR changes due to data insertions or deletions initiate update messages to all the other database servers
- In the R-tree based design, the upper and lower bounds on the number of entries that fit in one internal tree node are pre-specified: M is the maximum number of entries that will fit in one node, and m ≤ M/2 is the parameter specifying the minimum number of entries in a node.

The R-Tree Based Design: Index Initialization and Topology Maintenance
- A new server A joins the distributed spatial database:
  1) A sends its information (IP address and MBR) as an update message to an existing server B
  2) B updates its local R-tree index and replies with the current system information
  3) A constructs its own R-tree index
  4) A broadcasts an update message with its hostname and MBR to all the other servers except B
- Archive A departs:
  - it broadcasts an update message announcing that it is removing itself from the topology
  - the other servers delete the leaving server's MBR from their R-trees
- A server fails:
  - the node that first detects the unresponsive server broadcasts a removal message to everyone
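A compact sketch of the membership bookkeeping behind the join/leave/failure handling above; message transport, the Web-service layer, and the actual R-tree rebuild are stubbed out, and all names are hypothetical.

```python
class ServerNode:
    """Skeleton of the per-archive membership state (sketch only)."""

    def __init__(self, host, mbr):
        self.host = host
        self.mbr = mbr
        self.server_list = {host: mbr}   # host -> MBR of that archive

    def rebuild_index(self):
        # Placeholder: the global R-tree (or Quad-tree) would be rebuilt or
        # incrementally updated from the server list here.
        pass

    def send(self, target, message):
        # Placeholder for the Web-service / network call.
        pass

    def broadcast(self, message, exclude=()):
        for other in self.server_list:
            if other != self.host and other not in exclude:
                self.send(other, message)

    # --- join protocol (steps 1-4 above) ---
    def join_via(self, bootstrap_host, bootstrap_reply):
        """bootstrap_reply stands in for B's answer: its current server list."""
        self.send(bootstrap_host, ("join", self.host, self.mbr))   # step 1
        self.server_list.update(bootstrap_reply)                   # steps 2-3
        self.rebuild_index()
        self.broadcast(("join", self.host, self.mbr),
                       exclude=(bootstrap_host,))                  # step 4

    # --- graceful departure or detected failure of another server ---
    def handle_removal(self, departed_host):
        self.server_list.pop(departed_host, None)
        self.rebuild_index()
```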

The R-Tree Based Design: Query Routing
- Clients do not need to contact all the servers to obtain comprehensive query results; queries sent to any server are automatically forwarded and yield accurate spatial results over the complete data set.
- The queried server determines through its local R-tree whether any of the other archives in the collective potentially have relevant data, i.e., whether the query rectangle and an archive's MBR intersect.
- Forwarded queries are flagged to show that they originated from a server rather than a client, which avoids query loops.
- The results of forwarded queries are returned to the initially contacted server, which aggregates them and returns the set to the client.
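A compact sketch of this forwarding logic with the loop-avoidance flag; query_local_database, route_query, and remote_query are hypothetical helpers, not the paper's API.

```python
def handle_query(local_server, query_rect, forwarded=False):
    """One-hop forwarding: a client-originated query is sent to every archive
    whose MBR intersects the query rectangle; forwarded copies carry a flag so
    they are answered locally only (no further forwarding, hence no loops)."""
    results = local_server.query_local_database(query_rect)
    if forwarded:
        return results                      # server-to-server query: answer only

    for other in local_server.route_query(query_rect):   # MBR-intersection test
        # Remote call is marked as forwarded; the remote side will not re-forward.
        results.extend(other.remote_query(query_rect, forwarded=True))
    return results                          # aggregated set goes back to the client
```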

The R-Tree Based Design: R-Tree Index Update
- Each spatial database server must process data object update requests from local users:
  - data insertions
  - data deletions
- If an update changes the local MBR:
  - the local R-tree index is updated
  - the new MBR is broadcast to all the other servers in the system for tree index synchronization
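A minimal sketch of the broadcast-only-on-MBR-change rule; the server object and its fields (local_points, mbr, broadcast) are hypothetical.

```python
def apply_local_update(server, point, insert=True):
    """After a local insertion or deletion, recompute the archive's MBR and
    broadcast it only if it actually changed (stretched or shrank)."""
    if insert:
        server.local_points.append(point)
    else:
        server.local_points.remove(point)

    xs = [p[0] for p in server.local_points]
    ys = [p[1] for p in server.local_points]
    new_mbr = (min(xs), min(ys), max(xs), max(ys))

    if new_mbr != server.mbr:
        server.mbr = new_mbr
        server.broadcast(("mbr_update", server.host, new_mbr))  # sync replicas
```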

The Quad-tree Based Design: Quad-tree Index Update
- Slight differences from the R-tree based design arise for tree index updates:
  - if an object insertion or deletion changes the MBR boundary, the Quad-tree model checks whether the MBR variation affects the Quad-tree structure
  - changed: the update is propagated to all the other servers as usual for tree index synchronization
  - unchanged: the MBR update is not broadcast
- As the number of servers increases, the update load per server stays constant:
  - Table 2 illustrates that approximately 4.2% to 5.8% of all insert or delete operations result in an MBR change
  - when both the server and update numbers increase linearly, the activity per server remains relatively constant (Table 3 shows the experimental results)
- Conclusion: the update message traffic needed to synchronize distributed Quad-trees is much lower than for R-trees.
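One plausible reading of the "structure unchanged" test, sketched under the assumption of an MX-CIF-style assignment where each archive MBR is associated with the smallest quad cell that fully contains it: the update is broadcast only when that containing cell changes. This is an interpretation, not the paper's exact rule.

```python
def smallest_containing_cell(mbr, world, max_depth=16):
    """Descend the quad decomposition of `world` while one child cell still
    fully contains the MBR; the last such cell is where the MBR would live."""
    x0, y0, x1, y1 = world
    for _ in range(max_depth):
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
        child = next((q for q in quadrants
                      if q[0] <= mbr[0] and q[1] <= mbr[1]
                      and mbr[2] <= q[2] and mbr[3] <= q[3]), None)
        if child is None:
            break                      # MBR straddles the split lines: stop here
        x0, y0, x1, y1 = child
    return (x0, y0, x1, y1)

def needs_broadcast(old_mbr, new_mbr, world):
    """Broadcast only if the MBR change moves it to a different quad cell,
    i.e., the replicated Quad-tree structure would actually change."""
    return smallest_containing_cell(old_mbr, world) != smallest_containing_cell(new_mbr, world)
```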

Nearest Neighbor Queries
- Recent research applied to our system: a branch-and-bound R-tree traversal algorithm that efficiently answers both NN and k-NN queries.
- Two metrics:
  - minimum distance (MINDIST): the minimum Euclidean distance between the query point and the nearest edge of an MBR; the optimistic choice
  - minimum of the maximum possible distances (MINMAXDIST): the minimum of all the maximum distances between the query point and points on each of the axes of an MBR; the pessimistic choice
Figure: a query point and an MBR, illustrating MINDIST, MINMAXDIST, and the MAXDIST values on edges x1, x2, y1, and y2.
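A small sketch of the two metrics following their standard definitions for 2D points and MBRs; squared distances are returned, which preserves the ordering used for pruning.

```python
def mindist_sq(p, mbr):
    """Squared MINDIST: distance from point p=(px, py) to the nearest point of
    mbr=(xmin, ymin, xmax, ymax); 0 if p lies inside the MBR."""
    d = 0.0
    for pi, lo, hi in zip(p, mbr[:2], mbr[2:]):
        if pi < lo:
            d += (lo - pi) ** 2
        elif pi > hi:
            d += (pi - hi) ** 2
    return d

def minmaxdist_sq(p, mbr):
    """Squared MINMAXDIST: for each axis k, assume the object touches the
    nearer face along k and the farther face along every other axis; take the
    minimum over k. An object within this distance is guaranteed to exist."""
    lo, hi = mbr[:2], mbr[2:]
    rm = [lo[i] if p[i] <= (lo[i] + hi[i]) / 2 else hi[i] for i in range(2)]  # nearer face
    rM = [lo[i] if p[i] >= (lo[i] + hi[i]) / 2 else hi[i] for i in range(2)]  # farther face
    best = float("inf")
    for k in range(2):
        d = (p[k] - rm[k]) ** 2
        for i in range(2):
            if i != k:
                d += (p[i] - rM[i]) ** 2
        best = min(best, d)
    return best
```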

Nearest Neighbor Queries (continued)
- The NN search algorithm implements an ordered depth-first traversal based on the values of MINDIST and MINMAXDIST:
  - it begins at the R-tree root node and proceeds down the tree hierarchy
  - at a leaf node, a distance computation function determines the actual distance between the query point and the candidate database objects
  - the algorithm iterates with three search-pruning strategies until it finds the NN object
- Distributed design:
  - every server maintains a local R-tree; the NN search algorithm is executed on the local R-tree to compute both the MINDIST and MINMAXDIST values
  - a Web service interface is created at each node to access these distance values across multiple archives and to remotely obtain the distance between the search point and a candidate nearest data point
  - to answer an NN query, a server needs to send several distance query messages to other servers in the system during the branch-and-bound process
- With the three search-pruning strategies proposed in "Nearest Neighbor Queries" and a slightly modified search algorithm, NN queries can be executed efficiently.
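A simplified, single-machine sketch of the ordered depth-first traversal; it reuses mindist_sq from the sketch above, assumes a hypothetical node layout (is_leaf plus entries), and applies only the MINDIST-based pruning, omitting the MINMAXDIST strategies and the remote distance queries over the Web-service interface.

```python
import math

def nn_search(node, query_pt, best=(math.inf, None)):
    """Ordered depth-first NN search over an in-memory R-tree. A leaf entry is
    assumed to be a data point (x, y); an internal entry is (mbr, child_node)."""
    best_d, best_obj = best
    if node.is_leaf:
        for obj in node.entries:
            d = (obj[0] - query_pt[0]) ** 2 + (obj[1] - query_pt[1]) ** 2
            if d < best_d:
                best_d, best_obj = d, obj
        return best_d, best_obj

    # Visit children in MINDIST order (most promising subtree first) ...
    branches = sorted(node.entries, key=lambda e: mindist_sq(query_pt, e[0]))
    for mbr, child in branches:
        # ... and prune every subtree that cannot contain a closer object.
        if mindist_sq(query_pt, mbr) >= best_d:
            break
        best_d, best_obj = nn_search(child, query_pt, (best_d, best_obj))
    return best_d, best_obj
```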

Experimental Validation
- Implemented the tree-based design in a simulator to evaluate the performance of our approach.
- Index structures:
  - R-tree algorithm
  - MX-CIF Quad-tree algorithm
  - the index tree search complexity is the same as for these algorithms; in a distributed environment, however, the search complexity is dominated by the communication overhead between servers
- Focus of the simulation: quantifying the routing traffic generated by queries and updates.
- Data sets: a synthetic spatial data set and a real-world spatial data set.

Simulator Implementation
- The leaf nodes represent specific server MBRs and contain forwarding pointers (i.e., the host names and IP addresses) to the remote servers.
- The leaf node for the MBR of the local data set points directly to the local database.
- If a query window intersects several server MBRs, the query is forwarded to each of them.
- The simulator counts all the messages generated through the query forwarding mechanism and all the return messages containing result data sets. Additionally, tree update information must be broadcast to all servers.

Simulator Implementation: Event Generation
- Two types of events: queries and updates.
- Data updates can be either insertion or deletion requests.
- Both event types were generated according to a Poisson process, with the inter-arrival rates λ_Q and λ_U specified independently.
- Simulated time: ten hours.
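A small sketch of the event stream, assuming exponential inter-arrival times (the standard way to realize a Poisson arrival process); the rate values and the merge-by-timestamp step are illustrative assumptions.

```python
import random

def generate_events(rate_q, rate_u, sim_hours=10.0, seed=42):
    """Generate query and update events as two independent Poisson processes
    (rates given per hour), merged into one timestamp-ordered event list."""
    rng = random.Random(seed)
    events = []
    for kind, rate in (("query", rate_q), ("update", rate_u)):
        t = rng.expovariate(rate)
        while t < sim_hours:
            events.append((t, kind))
            t += rng.expovariate(rate)
    events.sort()                      # interleave the two streams by time
    return events

# Hypothetical rates: 50 queries/hour and 20 updates/hour over a 10-hour run.
print(len(generate_events(50, 20)))
```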

Simulator Implementation: Query Parameter Generation
- Query windows are created dynamically based on two parameters (see Table 4):
  - mean query window size (QWS-μ): the mean percentage of the global geographical area used for the query window, drawn from a normal distribution
  - deviation (QWS-σ): provides a variation range, bounded by one QWS-σ deviation, so that the query window area differs for each query event
- Example: given QWS-μ and QWS-σ, the simulator first chooses the query window size, then randomly selects a point (x_1, y_1) as one corner coordinate and a value x_2 inside the global boundary as the x-coordinate of the opposite corner across the diagonal; based on the window size, y_2 is calculated.
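A sketch of the query-window construction described above; the exact clamping of the normal draw to one QWS-σ and the handling of windows that extend past the global boundary are assumptions.

```python
import random

def make_query_window(qws_mu, qws_sigma, world, rng=random.Random(0)):
    """Pick a window area as a fraction of the world area drawn from
    N(qws_mu, qws_sigma), choose one corner (x1, y1) and the opposite
    x-coordinate x2 at random, and derive y2 from the chosen area."""
    wx0, wy0, wx1, wy1 = world
    world_area = (wx1 - wx0) * (wy1 - wy0)

    frac = rng.gauss(qws_mu, qws_sigma)
    frac = max(min(frac, qws_mu + qws_sigma), qws_mu - qws_sigma)   # bound by one sigma
    area = max(frac, 0.001) * world_area                            # keep area positive

    x1 = rng.uniform(wx0, wx1)
    y1 = rng.uniform(wy0, wy1)
    x2 = rng.uniform(wx0, wx1)
    while abs(x2 - x1) < 1e-9:                  # avoid a degenerate zero-width window
        x2 = rng.uniform(wx0, wx1)
    y2 = y1 + area / abs(x2 - x1)               # height chosen so width * height = area
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

# Hypothetical example: ~10% +/- 2% of a unit-square world.
print(make_query_window(0.10, 0.02, (0.0, 0.0, 1.0, 1.0)))
```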

Simulator Implementation: Synthetic Data Generation
- Each borehole data item has a spatial location attribute (longitude, latitude).
- N data center points C_0, C_1, ..., C_{N−1} are generated randomly inside a global boundary. C_i = (x_i, y_i) is the geographical center of all the borehole data managed by an individual spatial database server.
- For each C_i, B associated boreholes p_j are generated according to a normal distribution:
  - the borehole points are dense near the center point and become sparse as the distance from the center increases
  - the generator limits the maximal distance of a borehole from its center to two standard deviations
- After all the borehole points are created, the MBR of each database server is computed.
- Figure 3 illustrates the boreholes managed by ten servers and their respective MBRs.
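A sketch of this synthetic generator, assuming a per-axis Gaussian spread sigma (hypothetical value) and rejecting points farther than two standard deviations from the center.

```python
import random

def generate_synthetic_archives(num_servers, boreholes_per_server, world,
                                sigma=0.03, seed=1):
    """Random center points, normally distributed boreholes around each center
    (rejected beyond 2*sigma), and the per-server MBR computed from the points."""
    rng = random.Random(seed)
    wx0, wy0, wx1, wy1 = world
    archives = []
    for _ in range(num_servers):
        cx, cy = rng.uniform(wx0, wx1), rng.uniform(wy0, wy1)
        points = []
        while len(points) < boreholes_per_server:
            dx, dy = rng.gauss(0, sigma), rng.gauss(0, sigma)
            if dx * dx + dy * dy <= (2 * sigma) ** 2:       # cap at two std devs
                points.append((cx + dx, cy + dy))
        xs, ys = [p[0] for p in points], [p[1] for p in points]
        mbr = (min(xs), min(ys), max(xs), max(ys))
        archives.append({"center": (cx, cy), "points": points, "mbr": mbr})
    return archives

# Hypothetical run matching the comparison setup: 10 servers, 400 boreholes each.
data = generate_synthetic_archives(10, 400, (0.0, 0.0, 1.0, 1.0))
print(data[0]["mbr"])
```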

Experiments: Synthetic Data
- The accumulated traffic of queries and updates for the tree-based designs and the exhaustive query routing mechanism.

Experiments: Kobe Data Experiment
- Data set:
  - geotechnical data provided by Kobe University, Japan: boreholes of Kobe county
  - the K-means algorithm [7] was used to cluster the Kobe data points in Euclidean space and assign them to database servers; the data set was divided into ten clusters (see Figure 5)
- Experiment parameters:
  - both the R-tree and Quad-tree index structures were used, with different query window sizes (ranging from 1% to 50%)
  - for comparison, we also generated a synthetic data set with the same parameters (10 servers, 400 boreholes per server)
[7] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
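A sketch of the clustering step, assuming scikit-learn is available; the cluster count follows the slide (ten), while the function name and data layout are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

def assign_boreholes_to_servers(points, num_servers=10, seed=0):
    """Cluster borehole coordinates with K-means [7] and treat each cluster as
    one database server's data set; the per-cluster MBR is what would then be
    inserted into the global index."""
    pts = np.asarray(points, dtype=float)            # shape (n, 2): (lon, lat)
    labels = KMeans(n_clusters=num_servers, n_init=10,
                    random_state=seed).fit_predict(pts)
    servers = []
    for k in range(num_servers):
        cluster = pts[labels == k]
        mbr = (cluster[:, 0].min(), cluster[:, 1].min(),
               cluster[:, 0].max(), cluster[:, 1].max())
        servers.append({"points": cluster, "mbr": mbr})
    return servers
```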

Performance
- Performance improved with both the synthetic and real-world data sets.
- EQR: the worst case. OQR (Optimal Query Routing): defined as the best case.
- The tree-based designs achieve a reduction of 60% to 70% in inter-server message traffic compared with exhaustive query routing (for query window sizes of 10% to 20%).
- Figure 7 shows the relationship between EQR (upper bound), the two tree-based designs, and OQR (lower bound) for different query window sizes.
- Normalized y-scale: the accumulated message counts of EQR and the tree-based designs are divided by the message count of OQR.

- Best condition: no overlap between any server MBRs; the tree-based designs can reduce inter-server traffic by up to 90%.
- Worst condition: significant MBR overlap; performance decreases to the same level as EQR or slightly worse (because of the update costs).
- A system designer needs to consider the characteristics of the data set before opting for the tree-based query routing algorithms.

Related Work
- Large-scale distributed data management systems: P2P (peer-to-peer) systems.
  - Key characteristics: dynamic topology, heterogeneity, self-organization.
- The query processing and routing approaches of some of the initial P2P systems relied on a centralized index server (e.g., Napster) or a flooding mechanism (e.g., Gnutella); these are either not very scalable or inefficient.
- Distributed hash tables (DHTs) achieve massive scalability and efficient query forwarding:
  - Pastry [11], Chord [16], and CAN [9] provide a mechanism to perform object location within a potentially very large overlay network of nodes connected to the Internet
  - however, DHTs are unsuitable for range queries
- Techniques that adapt DHT mechanisms for range queries:
  - Harwood and Tanin [5] introduce a method to hash spatial content over P2P networks: space is divided in a Quadtree-like manner and the central points of each square, denoted control points, are hashed to a Chord ring; spatial objects and queries are resolved to spatial regions whose control points are then hashed onto the DHT ring.
  - A distributed catalog service that can locate XML path data: range queries are supported via wildcards in XML strings (i.e., "*"), but may require a scan of some of the data.

Conclusion
- Presented an architecture that efficiently routes and executes spatial queries based on globally distributed and replicated index structures (R-tree and Quad-tree).
- Performed extensive simulations with both synthetic and real data sets and observed that:
  - the update message traffic needed to keep the replicated indices synchronized is negligible
  - the overall query message traffic is significantly reduced and is only slightly higher than what an optimal distribution algorithm with global knowledge could achieve

Future Work
- The current metric, the number of messages, does not capture the parallelism that is achieved within the system.
- Future work: measure the response time and the query throughput.

Thank you!