Querying the Internet with PIER Article by: Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica, 2003 EECS Computer.

Slides:



Advertisements
Similar presentations
Internet Indirection Infrastructure (i3 ) Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana UC Berkeley SIGCOMM 2002 Presented by:
Advertisements

Implementing Declarative Overlays From two talks by: Boon Thau Loo 1 Tyson Condie 1, Joseph M. Hellerstein 1,2, Petros Maniatis 2, Timothy Roscoe 2, Ion.
P2P data retrieval DHT (Distributed Hash Tables) Partially based on Hellerstein’s presentation at VLDB2004.
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Peer to Peer and Distributed Hash Tables
Scalable Content-Addressable Network Lintao Liu
Chord: A scalable peer-to- peer lookup service for Internet applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashock, Hari Balakrishnan.
Somdas Bandyopadhyay Anirban Basumallik
Distributed Databases John Ortiz. Lecture 24Distributed Databases2  Distributed Database (DDB) is a collection of interrelated databases interconnected.
Massively Distributed Database Systems Distributed Hash Spring 2014 Ki-Joune Li Pusan National University.
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Schenker Presented by Greg Nims.
Presented by Elisavet Kozyri. A distributed application architecture that partitions tasks or work loads between peers Main actions: Find the owner of.
A Scalable Content Addressable Network (CAN)
Peer to Peer File Sharing Huseyin Ozgur TAN. What is Peer-to-Peer?  Every node is designed to(but may not by user choice) provide some service that helps.
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker A Scalable, Content- Addressable Network (CAN) ACIRI U.C.Berkeley Tahoe Networks.
P2p, Fall 05 1 Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) VLDB 2003 Ryan Huebsch, Joe Hellerstein, Nick Lanham,
Internet Indirection Infrastructure Ion Stoica UC Berkeley.
Introducing: Cooperative Library Presented August 19, 2002.
A Scalable Content-Addressable Network Authors: S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker University of California, Berkeley Presenter:
Distributed Lookup Systems
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari alakrishnan.
1 A Scalable Content- Addressable Network S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker Proceedings of ACM SIGCOMM ’01 Sections: 3.5 & 3.7.
Content Addressable Networks. CAN Associate with each node and item a unique id in a d-dimensional space Goals –Scales to hundreds of thousands of nodes.
1 CS 194: Distributed Systems Distributed Hash Tables Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer.
P2p, Fall 06 1 Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) VLDB 2003 Ryan Huebsch, Joe Hellerstein, Nick Lanham,
Peer-to-Peer Networks Slides largely adopted from Ion Stoica’s lecture at UCB.
ICDE A Peer-to-peer Framework for Caching Range Queries Ozgur D. Sahin Abhishek Gupta Divyakant Agrawal Amr El Abbadi Department of Computer Science.
Geographic Routing Without Location Information A. Rao, C. Papadimitriou, S. Shenker, and I. Stoica In Proceedings of the 9th Annual international Conference.
Internet Indirection Infrastructure (i3) Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana UC Berkeley SIGCOMM 2002.
Structured P2P Network Group14: Qiwei Zhang; Shi Yan; Dawei Ouyang; Boyu Sun.
1 A scalable Content- Addressable Network Sylvia Rathnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker Pirammanayagam Manickavasagam.
Roger ZimmermannCOMPSAC 2004, September 30 Spatial Data Query Support in Peer-to-Peer Systems Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang Computer.
CONTENT ADDRESSABLE NETWORK Sylvia Ratsanamy, Mark Handley Paul Francis, Richard Karp Scott Shenker.
Chord & CFS Presenter: Gang ZhouNov. 11th, University of Virginia.
PIER & PHI Overview of Challenges & Opportunities Ryan Huebsch † Joe Hellerstein † °, Boon Thau Loo †, Sam Mardanbeigi †, Scott Shenker †‡, Ion Stoica.
Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) Ryan Huebsch † Joe Hellerstein †, Nick Lanham †, Boon Thau Loo.
Information-Centric Networks07a-1 Week 7 / Paper 1 Internet Indirection Infrastructure –Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh.
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications Xiaozhou Li COS 461: Computer Networks (precept 04/06/12) Princeton University.
Titanium/Java Performance Analysis Ryan Huebsch Group: Boon Thau Loo, Matt Harren Joe Hellerstein, Ion Stoica, Scott Shenker P I E R Peer-to-Peer.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 2 ARCHITECTURES.
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Content Addressable Network CAN. The CAN is essentially a distributed Internet-scale hash table that maps file names to their location in the network.
A Scalable Content-Addressable Network (CAN) Seminar “Peer-to-peer Information Systems” Speaker Vladimir Eske Advisor Dr. Ralf Schenkel November 2003.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan Presented.
SIGCOMM 2001 Lecture slides by Dr. Yingwu Zhu Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.
Content Addressable Networks CAN is a distributed infrastructure, that provides hash table-like functionality on Internet-like scales. Keys hashed into.
Scalable Content- Addressable Networks Prepared by Kuhan Paramsothy March 5, 2007.
MobiQuitous 2007 Towards Scalable and Robust Service Discovery in Ubiquitous Computing Environments via Multi-hop Clustering Wei Gao.
P2P Group Meeting (ICS/FORTH) Monday, 28 March, 2005 A Scalable Content-Addressable Network Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp,
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
1 Distributed Hash Table CS780-3 Lecture Notes In courtesy of Heng Yin.
PIER: Peer-to-Peer Information Exchange and Retrieval Ryan Huebsch Joe Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica
Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 2: Distributed Hash.
Plethora: Infrastructure and System Design. Introduction Peer-to-Peer (P2P) networks: –Self-organizing distributed systems –Nodes receive and provide.
Querying The Internet With PIER Nitin Khandelwal.
1. Efficient Peer-to-Peer Lookup Based on a Distributed Trie 2. Complex Queries in DHT-based Peer-to-Peer Networks Lintao Liu 5/21/2002.
PIER ( Peer-to-Peer Information Exchange and Retrieval ) 30 March 07 Neha Singh.
Algorithms and Techniques in Structured Scalable Peer-to-Peer Networks
Tapestry : An Infrastructure for Fault-tolerant Wide-area Location and Routing Presenter : Lee Youn Do Oct 5, 2005 Ben Y.Zhao, John Kubiatowicz, and Anthony.
LOOKING UP DATA IN P2P SYSTEMS Hari Balakrishnan M. Frans Kaashoek David Karger Robert Morris Ion Stoica MIT LCS.
Two Peer-to-Peer Networking Approaches Ken Calvert Net Seminar, 23 October 2001 Note: Many slides “borrowed” from S. Ratnasamy’s Qualifying Exam talk.
Querying the Internet with PIER CS294-4 Paul Burstein 11/10/2003.
Ryan Huebsch, Joseph M. Hellerstein, Ion Stoica, Nick Lanham, Boon Thau Loo, Scott Shenker Querying the Internet with PIER Speaker: Natalia KozlovaTutor:
CS Spring 2010 CS 414 – Multimedia Systems Design Lecture 24 – Introduction to Peer-to-Peer (P2P) Systems Klara Nahrstedt (presented by Long Vu)
Internet Indirection Infrastructure (i3)
CHAPTER 3 Architectures for Distributed Systems
Providing Secure Storage on the Internet
DHT Routing Geometries and Chord
A Scalable, Content-Addressable Network
Presentation transcript:

Querying the Internet with PIER Article by: Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica, 2003 EECS Computer Science Division, UC Berkeley International Computer Science Institute Presentation by: Laura Tolosi Supervisor: Prof. Gerhard Weikum

Overview Motivation Querying the Internet CAN - A particular DHT PIER Validation and performance evaluation 2

Motivation Scalability of Database Systems Querying the Internet CAN - A particular DHT PIER Validation and performance evaluation 3

Scalability of Database Systems Internet - hundreds million nodes The largest database systems - only few hundred nodes Scalability - ‘the ability to grow your system smoothly and economically as your requirements increase’ Parameters: data size, speed, workload, transaction cost Goal: huge number of concurrent users, continuous availability, large-stored data volume Measuring for scalability success: Workload Hardware config Query response time DB size size-up speed-up scale-up cost - workload increase should not increase transaction cost - X(data size) implies  X (cost)  X X X  1/X X  X 4

Motivation Querying the Internet Applications Design principles CAN - A particular DHT PIER Validation and performance evaluation 5

Querying the Internet Application example - Network Intrusion Detection attack ‘fingerprints’: sequences of port accesses (port scanners), port numbers numbers and packet contents (buffer-overrun attacks, web robots), application level information on content ( spam) … one can detect similar fingerprints, frequently reported SELECT I.fingerprint, count(*)*sum(R.weight) AS wcnt FROM intrusions I, reputation R WHERE R.address = I.address GROUP BY I.fingerprint HAVING wcnt > 1.5 IP address fingerprint attribute1 1 F F F F F1... Table 1: intrusions IP address weight attribute Table 2: reputation fingerprint wcnt F1 4.5 F2 2.0 Table 3: join result 6

Querying the Internet Relaxed principles for scaling Relaxed consistency Organic Scaling Natural Habitats for Data Standard Schemas via Grassroots Software 7

Motivation Querying the Internet CAN - A particular DHT Design, construction, joining, routing… (quick overview)PIER Validation and performance evaluation 8

Content-addressable Network (CAN) CAN is a particular DHT virtual d-dimensional Cartesian coordinate space partitioned among the nodes in the system CDE AB (0.0)(1.0) (0.1)(1.1) ( , ) ( , ) ( , ) Node B’s coordinate zone IP_addr(A) IP_addr(D) IP_addr(E) B’s routing table store (key, value) pairs using a hash function: key  (f_1(key), …, f_d(key)) key is mapped to a point in the coordinate space (key, value) will be stored by the node responsible for the zone in which the point lies. each node maintains as its set of neighbors the IP addresses of node with adjoining zones 9

Content-addressable Network (CAN) - continued Routing in CAN hash(key) = (x, y) ‘simulate’ a straight line from source to destination nodes Joining a new node randomly chooses a point in the space (say (x,y)) sends a join request to an existing node in CAN (say 6) the message is routed until the zone in which the point lies is reached the zone is split For a d-dimensional space partitioned in n equal zones: More on CAN: Node departure, Recovery... (x,y) 7 10

Motivation Querying the Internet CAN - A particular DHT PIER Architecture Routing layer Storage manager Provider DHT distributed joins Symmetric hash join Validation and performance evaluation 11

Physical Network Network IP Network Overlay Network Storage Manager Provider Overlay Routing DHT PIER Architecture Query Plan ►◄ ►◄ get scan(I) scan(R) PIER Core Relational Execution Engine Query Optimizer Catalog Manager Applications Network Monitoring Other User Apps Declarative Queries SELECT I.fingerprint, count(*)*sum(R.weight) AS wcnt FROM intrusions I, reputation R WHERE I.address = R.address GROUP BY I.fingerprint HAVING wcnt >

DHT Design Routing Layer Mapping for keys Dynamic as nodes join and leave Storage Manager DHT based data Provider Storage access interface for higher levels Any particular DHT can be used (CAN, Chord...) 13

Routing Layer Maps a key into the IP address of the node currently responsible for that key. Provides exact lookups, callbacks higher levels when the set of keys has changed. Routing Layer API lookup (key)  ipaddr join (landmark) leave () locationMapChange() 14

Storage Manager Stores and retrieves records, which consist of (key, value) pairs. Keys are used to locate items and can be any data type or structure supported Storage Manager API store (key, item) retrieve (key)  item remove (key) 15

Provider Ties routing and storage manager layers and provides an interface to applications Each object in the DHT has a namespace, resourceID and instanceID namespace: application or group of objects, table or relation resourceID: primary key or any attribute instanceID: integer to separate items with the same namespace and the same resourceID DHT key : hash (namespace, resourceID) Provider API get (namespace, resourceID)  item put (namespace, resourceID, instanceID, item, lifetime) renew (namespace, resourceID, instanceID, item, lifetime)  bool multicast (namespace, resourceID, item) lscan (namespace)  iterator newData (namespace)  item 16

Query Processor Performs selection, projection, join, grouping, aggregation Items are inserted, updated, deleted via DHT interface Operators produce results as quick as possible (push) Enqueue the data for the next operator (pull) reachable snapshot: the set of data published by reachable nodes at the time the query is sent from the client node dilated reachable snapshot: union of local snapshots of data published by reachable nodes local snapshot: data published at the time the query message arrived at the node 17

DHT Distributed Joins relational database join performed to some degree in parallel by a umber of processors (machines) containing different parts of the relations on which join is performed DHT distributed join example: SELECT I.fingerprint, count(*)*sum(R.weight) AS wcnt FROM intrusions I, reputation R WHERE R.address = S.address GROUP BY I.fingerprint HAVING wcnt > 1.5 IR I I R R I match records on different machines 18

DHT Distributed Joins - continued SELECT I.fingerprint, count(*)*sum(R.weight) AS wcnt FROM intrusions I, reputation R WHERE R.address = S.address GROUP BY I.fingerprint HAVING wcnt > 1.5 Entire DHT Network Namespace N I = multicast group(I - group) Table: intrusions Query initiator 19

DHT Distributed Joins - continued Entire DHT Network Namespace N R = multicast group(B - group) SELECT I.fingerprint, count(*)*sum(R.weight) AS wcnt FROM intrusions I, reputation R WHERE R.address = S.address GROUP BY I.fingerprint HAVING wcnt > 1.5 Table: reputation 20

DHT Distributed Joins - continued Entire DHT Network [1] multicast(I-group,’oper-j’,’SELECT I.fingerprint,...’) [2] multicast(R-group,’oper-j’,’SELECT I.fingerprint,...’) [3] lscan: For each tuple in I: extract I.address, I.fingerprint item = (I.address, I.fingerprint, ‘I’) instanceID = random() put(‘N Q ’, I.address, instanceID, item, ‘60 min’) [3] lscan: For each tuple in R: extract R.address, R.weight item = (R.address, R.weight, ‘S’) instanceID = random() put(‘N Q ’, R.address, instanceID, item, ‘60 min’) N Q namespace PUSH phase 21

DHT Distributed Joins - continued Entire DHT Network N Q namespace PULL phase [5] receive(‘oper-x’, (I.fingerprint, R.weight)) store received item in result table [4]newData(‘N Q ’, item) for each call of newData() get(‘N Q ’, item.resourceID) if get returns both an I-derived tuple and an R-derived tuple, then send(‘oper-j’, hash(initiator), (I.fingerprint, R.weight)) 22

Motivation Querying the Internet CAN - A particular DHT PIER Validation and performance evaluation 23

Simulation setup Workload: SELECT R.pkey, S.pkey, R.pad FROM R, S WHERE R.num1 = S.pkey AND R.num2 > constat1 AND S.num2 > constant2 AND f(R.num3, S.num3) > constant3 R has 10 times more tuples than S attributes in S, R are uniformly distributed constants chosen as to produce 50% selectivity R.pad attribute assures that all tuples are 1KB in size nodes network cluster of 64 PCs topology : fully connected network, 100ms latency, 10Mbps assumption: data changes at a rate higher than the rate of incoming queries all measures: performed after CAN routing stabilizes, tables R, S are in the DHT 24

Scalability evaluation evaluate it by proportionally increasing the load with the number of nodes each node - 1MB data measure the response time for the 30-th tuple avoid to measure the first response time Average time to receive the 30th result tuple when both the size of the network and the load are scaled up. 25

Effects of soft state evaluate the robustness of the system in the face of node failures assume it takes 15s to detect a node failure all data is lost when a node fails periodically renew all tuples Average recall for different refresh rates Failure rate (failures/min) 26

Summary PIER - a distributed query engine based on widely distributed environments It is internally using a relational database format PIER aims to overcome the traditional database lack of scalability by using relaxed principles of consistency and a peer-to-peer like technology PIER is a read only system PIER is build on top of a DHT CAN - particular DHT, based on hashing data onto a d-dimensional space Implements DHT- distributed joins algorithms - e.g. symmetric hash join Operators are not enqueued: results are produced as quickly as possible (push), and data is transferred for the next operator (pull) Queries are optimized in order to minimize the response time 27

THANK YOU! 28