PIER: Peer-to-Peer Information Exchange and Retrieval Ryan Huebsch Joe Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica


PIER: Peer-to-Peer Information Exchange and Retrieval Ryan Huebsch Joe Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, Ion Stoica UC Berkeley, CS Division Berkeley P2P 2/24/03

2 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

3 P2P DHTs are Cool, but…
Lots of effort has gone into making DHTs:
- Scalable (thousands → millions of nodes)
- Reliable (every imaginable failure)
- Secure (anonymity, encryption, etc.)
- Efficient (fast access with minimal state)
- Load-balanced, and more
Still, the result is only a hash table interface: put and get. It is hard (but not impossible) to build real applications using only these basic primitives.

4 Databases are Cool, but…
Relational databases bring a declarative interface to the user/application: ask for what you want, not how to get it. The database community is not new to parallel and distributed systems:
- Parallel: centralized, one administrator, one point of failure
- Distributed: did not catch on; complicated, and never really scaled above 100s of machines

5 Databases + P2P DHTs: Marriage Made in Heaven?
Well, databases carry a lot of other baggage:
- ACID transactions
- Consistency above all else
So we just want to unite the query processor with DHTs:
DHTs + Relational Query Processing = PIER
Bring complex queries to DHTs → a foundation for real applications

6 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

7 Architecture
The DHT is divided into 3 modules. We've chosen one way to do this, but it may change with time and experience; the goal is to make each module simple and replaceable. PIER has one primary module; add-ons can make it look more database-like.

8 Architecture: DHT: Routing
A very simple interface; plug in any routing algorithm here: CAN, Chord, Pastry, Tapestry, etc.
- lookup(key) → ipaddr
- join(landmarkNode)
- leave()
- CALLBACK: locationMapChange()
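The routing interface above can be sketched as a pluggable base class. This is a Python sketch (PIER itself is Java), and the class names and the degenerate one-node implementation are illustrative, not from the PIER codebase:

```python
from abc import ABC, abstractmethod

class Routing(ABC):
    """Pluggable routing layer (CAN, Chord, Pastry, Tapestry, ...)."""

    @abstractmethod
    def lookup(self, key):
        """Return the IP address of the node responsible for key."""

    @abstractmethod
    def join(self, landmark_node):
        """Enter the overlay via a known landmark node."""

    @abstractmethod
    def leave(self):
        """Depart gracefully, handing off state."""

    def location_map_change(self):
        """Callback fired when the set of keys this node owns changes."""

class SingleNodeRouting(Routing):
    """Degenerate one-node overlay: every key maps to the local address."""
    def __init__(self, addr):
        self.addr = addr
    def lookup(self, key):
        return self.addr
    def join(self, landmark_node):
        pass
    def leave(self):
        pass
```

Any real algorithm (Chord, CAN, …) would subclass `Routing` the same way, which is what makes the layer replaceable.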

9 Architecture: DHT: Storage
Currently we use a simple in-memory storage system; there is no reason a more complex one couldn't be used.
- store(key, item)
- retrieve(key) → item
- remove(key)
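A minimal in-memory version of this storage interface is little more than a dictionary. A Python sketch, purely illustrative:

```python
class Storage:
    """Simple in-memory storage layer: store / retrieve / remove."""

    def __init__(self):
        self._items = {}

    def store(self, key, item):
        self._items[key] = item

    def retrieve(self, key):
        # Return None for a missing key rather than raising.
        return self._items.get(key)

    def remove(self, key):
        self._items.pop(key, None)
```

A disk-backed or replicated implementation could expose the same three calls without the layers above noticing.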

10 Architecture: DHT: Provider
Connects the pieces, and provides the 'DHT' interface:
- get(ns, rid) → item
- put(ns, rid, iid, item, lifetime)
- renew(ns, rid, iid, lifetime) → success?
- multicast(ns, item)
- lscan(ns) → items
- CALLBACK: newData(ns, item)
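The provider's soft-state flavor (items keyed by namespace and resource id, held by an instance id, expiring unless renewed) can be sketched as below. This is an illustrative Python reading of the interface, with time passed in explicitly as `now` so expiry is easy to show; the real semantics may differ:

```python
class Provider:
    """Sketch of the provider interface: soft-state items keyed by
    (namespace, resource id); each is owned by an instance id (iid)
    and expires unless renewed within its lifetime."""

    def __init__(self):
        self._data = {}  # (ns, rid) -> (iid, item, expires_at)

    def put(self, ns, rid, iid, item, lifetime, now=0):
        self._data[(ns, rid)] = (iid, item, now + lifetime)

    def get(self, ns, rid, now=0):
        entry = self._data.get((ns, rid))
        if entry is None or entry[2] <= now:
            return None  # missing or expired
        return entry[1]

    def renew(self, ns, rid, iid, lifetime, now=0):
        entry = self._data.get((ns, rid))
        if entry is None or entry[0] != iid:
            return False  # only the owning instance may renew
        self._data[(ns, rid)] = (iid, entry[1], now + lifetime)
        return True

    def lscan(self, ns):
        # All locally stored items in a namespace.
        return [v[1] for (n, _), v in self._data.items() if n == ns]
```

The lifetime/renew pattern is what lets the system shed state from departed publishers without explicit deletes.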

11 Architecture: PIER
Currently PIER consists only of the relational execution engine, which executes a pre-optimized query plan. A query plan is a box-and-arrow description of how to connect basic operators together: selection, projection, join, group-by/aggregation, and some DHT-specific operators such as rehash. Traditional DBs use an optimizer + catalog to take SQL and generate the query plan; those are just add-ons to PIER.
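The box-and-arrow idea can be illustrated with operators as iterators wired together; the plan is just the wiring. A hedged Python sketch (operator names follow the slide, the data is made up):

```python
# Each operator consumes an iterator of tuples and produces one;
# a "query plan" is simply the operators composed together.
def scan(table):
    yield from table

def selection(child, pred):
    return (t for t in child if pred(t))

def projection(child, cols):
    return (tuple(t[c] for c in cols) for t in child)

emps = [("alice", "db", 100), ("bob", "net", 90), ("carol", "db", 80)]

# Plan: scan -> select dept = 'db' -> project (name, salary)
plan = projection(selection(scan(emps), lambda t: t[1] == "db"), [0, 2])
result = list(plan)
```

In PIER the same wiring would also include DHT-specific boxes (e.g. rehash) whose arrows cross the network.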

12 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

13 Joins: The Core of Query Processing
A relational join can be used to:
- Calculate the intersection of two sets
- Correlate information
- Find matching data
Goal: get tuples that have the same value for a particular attribute(s) (the join attribute(s)) to the same site, then append the tuples together. The algorithms come from the existing database literature, with minor adaptations to use the DHT.

14 Joins: Symmetric Hash Join (SHJ)
Algorithm for each site:
- (Scan) Use two lscan calls to retrieve all data stored at that site from the source tables
- (Rehash) put a copy of each eligible tuple with the hash key based on the value of the join attribute
- (Listen) use newData to see the rehashed tuples
- (Compute) Run a standard one-site join algorithm on the tuples as they arrive
The Scan/Rehash steps must be run on all sites that store source data; the Listen/Compute steps can be run on fewer nodes by choosing the hash key differently.
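The Compute step's core, the symmetric hash join, keeps one hash table per input and probes the opposite table as each tuple arrives, so results stream out without waiting for either input to finish. A Python sketch of just that step (the DHT rehash/newData plumbing is elided; a single interleaved arrival stream stands in for it):

```python
from collections import defaultdict

def symmetric_hash_join(stream, key_r, key_s):
    """Pipelined symmetric hash join. 'stream' yields ('R', tuple) or
    ('S', tuple) in arrival order; joined tuples are emitted as soon as
    a match exists, with no blocking on either input."""
    r_table, s_table = defaultdict(list), defaultdict(list)
    for side, tup in stream:
        if side == 'R':
            k = tup[key_r]
            r_table[k].append(tup)        # remember for future S arrivals
            for match in s_table[k]:      # probe the other side's table
                yield tup + match
        else:
            k = tup[key_s]
            s_table[k].append(tup)
            for match in r_table[k]:
                yield match + tup

arrivals = [('R', (1, 'a')), ('S', (1, 'x')),
            ('S', (2, 'y')), ('R', (2, 'b'))]
joined = list(symmetric_hash_join(arrivals, key_r=0, key_s=0))
```

The symmetry is what makes it fit the DHT setting: rehashed tuples from either table can arrive in any order.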

15 Joins: Fetch Matches (FM)
Algorithm for each site:
- (Scan) Use lscan to retrieve all data from ONE table
- (Get) Based on the value of the join attribute, issue a get for the possible matching tuples from the other table
Note: one table (the one we issue the gets for) must already be hashed on the join attribute.
Big picture: SHJ is put-based; FM is get-based.
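Fetch Matches scans one table locally and issues a get per tuple against the other, already-hashed table. In the Python sketch below a plain dict stands in for those DHT gets; names and data are illustrative:

```python
def fetch_matches(scanned, hashed_table, key):
    """'scanned' is the locally lscan-ed table; 'hashed_table' stands in
    for the other table, already hashed on the join attribute, so each
    lookup below corresponds to one DHT get."""
    for tup in scanned:
        for match in hashed_table.get(tup[key], []):  # one 'get' per tuple
            yield tup + match

orders = [(1, 'book'), (2, 'pen'), (1, 'ink')]
customers_by_id = {1: [(1, 'alice')], 2: [(2, 'bob')]}  # hashed on id
out = list(fetch_matches(orders, customers_by_id, key=0))
```

No rehashing of the scanned table happens at all, which is why FM is the get-based counterpart to the put-based SHJ.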

16 Joins: Additional Strategies
Bloom Filters
- Bloom filters can be used to reduce the amount of data rehashed in the SHJ
Symmetric Semi-Join
- Run a SHJ on the source data projected to only the hash key and join attributes; use the results of this mini-join as the source for two FM joins to retrieve the other attributes for tuples that are likely to be in the answer set
Big picture: trade bandwidth (extra rehashing) against latency (time to exchange filters).
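The Bloom-filter strategy has one side summarize its join keys compactly; the other side then rehashes only tuples that pass the filter. False positives cost bandwidth but never correctness, and there are no false negatives. A minimal, self-contained Python sketch (parameters are arbitrary, not PIER's):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, key):
        # Derive 'hashes' independent bit positions from SHA-256.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos] = 1

    def might_contain(self, key):
        return all(self.array[pos] for pos in self._positions(key))

# Site A summarizes its join keys; site B rehashes only passing tuples.
bf = BloomFilter()
for k in [1, 2, 3]:
    bf.add(k)
candidates = [t for t in [(1, 'a'), (9, 'b')] if bf.might_contain(t[0])]
```

Exchanging the small bit array first is the latency paid to avoid rehashing tuples that cannot join.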

17 Group-By/Aggregation
A group-by/aggregation can be used to:
- Split data into groups based on a value
- Compute Max, Min, Sum, Count, etc.
Goal: get tuples that have the same value for a particular attribute(s) (the group-by attribute(s)) to the same site, then summarize the data (aggregation).

18 Group-By/Aggregation
At each site:
- (Scan) lscan the source table; determine the group each tuple belongs in and add the tuple's data to that group's partial summary
- (Rehash) for each group represented at the site, rehash the summary tuple with a hash key based on the group-by attribute
- (Combine) use newData to get the partial summaries; combine them and produce the final result after a specified time, number of partial results, or rate of input
Multiple layers of rehash/combine can be added to reduce fan-in, subdividing groups into subgroups by randomly appending a number to the group's key.
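The Scan and Combine steps above can be sketched as local partial aggregation followed by a merge of the rehashed summaries. A Python sketch computing a per-group average via (sum, count) partials; the rehash hop is elided and the data is made up:

```python
from collections import defaultdict

def scan_and_partial(site_tuples, group_col, val_col):
    """(Scan) Fold one site's tuples into per-group partial (sum, count)."""
    partial = defaultdict(lambda: [0, 0])
    for t in site_tuples:
        partial[t[group_col]][0] += t[val_col]
        partial[t[group_col]][1] += 1
    return partial

def combine(partials):
    """(Combine) Merge rehashed partial summaries into final averages."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for g, (s, c) in partial.items():
            total[g][0] += s
            total[g][1] += c
    return {g: s / c for g, (s, c) in total.items()}

site1 = [('db', 10), ('net', 4)]
site2 = [('db', 20)]
# (Rehash) would ship each group's partial to the node owning hash(group);
# here both partials simply arrive at one combiner.
averages = combine([scan_and_partial(s, 0, 1) for s in (site1, site2)])
```

Because (sum, count) partials merge associatively, extra rehash/combine layers can be stacked freely to cut fan-in.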

19 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

20 Why Would a DHT Query Processor be Helpful?
- Data is distributed → centralized processing is not efficient or not acceptable
- Correlation, intersection → joins
- Summarize, aggregate, compress → group-by/aggregation
Probably not as efficient as a custom-designed solution for any single particular problem, but a common infrastructure for fast application development/deployment.

21 Network Monitoring
- Lots of data, naturally distributed, almost always summarized → aggregation
- Intrusion detection usually involves correlating information from multiple sites → join
Data comes from many sources: nmap, snort, ganglia, firewalls, web logs, etc. PlanetLab is our natural test bed (Timothy, Brent, and Nick).

22 Enhanced File Searching
First step: take over Gnutella (Boon). Well, actually just make PlanetLab look like an UltraPeer on the outside, but run PIER on the inside.
Long term: value-added services
- Better searching, utilizing all of the MP3 ID tags
- Reputations
- Combining with network monitoring data to better estimate download times

23 i3-Style Services
Mobility and Multicast
- The sender is a publisher
- Receiver(s) issue a continuous query looking for new data
Service Composition
- Services issue a continuous query for data waiting to be processed
- After processing the data, they publish it back into the network

24 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

25 Codebase
Approximately 17,600 lines of NCSS Java code. The same code (overlay components/PIER) runs on the simulator or over a real network without changes.
- Runs simple simulations with up to 10k nodes. Limiting factor: 2GB addressable memory for the JVM (in Linux)
- Runs on Millennium and PlanetLab with up to 64 nodes. Limiting factor: available/working nodes & setup time
Code includes: basic implementations of Chord and CAN; selection, projection, joins (4 methods), and aggregation; non-continuous queries.

26 Simulations of 1 SHJ Join

27 1 SHJ Join on Millennium

28 Outline Motivation General Architecture Brief look at the Algorithms Potential Applications Current Status Future Research

29 Future Research
- Routing, Storage and Layering
- Catalogs and Query Optimization
- Hierarchical Aggregations
- Range Predicates
- Continuous Queries over Streams
- Semi-structured Data
- Applications, Applications, Applications…