Complex Queries in DHT-based Peer-to-Peer Networks Matthew Harren, Joe Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, Ion Stoica


Complex Queries in DHT-based Peer-to-Peer Networks Matthew Harren, Joe Hellerstein, Ryan Huebsch, Boon Thau Loo, Scott Shenker, Ion Stoica UC Berkeley, CS Division IPTPS 3/8/02

Slide 2: Outline
- Contrast P2P & DB systems
- Motivation
- Architecture
- DHT Requirements
- Query Processor
- Current Status
- Future Research

Slide 3: Uniting DHTs and Query Processing…
- Query Processor: SQL, joins, predicates, relational data, group by, aggregation
- DHT: CAN, Chord, Tapestry, Pastry

Slide 4: P2P & DB Systems
- P2P strengths: flexibility, decentralization, fault tolerance, lightweight
- DB strengths: strong semantics, powerful query facilities, transactions & concurrency control

Slide 5: P2P + DB = ?
- A P2P database? No! ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics; it is much too heavyweight a solution for the everyday user.
- Query processing on P2P! Both P2P systems and DBs do data location and movement, and the two can be naturally unified (with lessons in both directions).
- P2P brings scalability & flexibility; DB brings the relational model & query facilities.

Slide 6: P2P Query Processing
A (simple) example, Filesharing+:

  SELECT song, size, server, …
  FROM album, song
  WHERE album.ID = song.albumID
    AND album.name = "Rubber Soul"

- Keyword searching is ONE canned SQL query. Imagine what else you could do!

Slide 7: P2P Query Processing
A (simple) example, Filesharing+ with fuzzy matching:

  SELECT song, size, server, …
  FROM album-ngrams AN, song
  WHERE AN.ID = song.albumID
    AND AN.ngram IN …
  GROUP BY AN.ID
  HAVING COUNT(AN.ngram) >= …

- Keyword searching is ONE canned SQL query. Imagine what else you could do: fuzzy searching, resource discovery, enhanced DNS.

Slide 8: What this project IS and IS NOT about…
- IS NOT about absolute performance: in most situations a centralized solution could be faster.
- IS about decentralized features: no administrator, anonymity, shared resources, tolerates failures, resistant to censorship.
- IS NOT about replacing the RDBMS: centralized solutions still have their place for many applications (commercial records, etc.).
- IS about research synergies: unifying/morphing design principles and techniques from the DB and networking communities.

Slide 9: General Architecture
- Based on Distributed Hash Tables (DHTs) to get many good networking properties
- A query processor is built on top
- Note: the data is stored separately from the query engine, which is not standard DB practice!

Slide 10: DHT – API
Basic API:
- publish(RID, object)
- lookup(RID)
- multicast(object)
NOTE: applications can only fetch-by-name… a very limited query language!

Slide 11: DHT – API Enhancements I
Basic API with namespaces:
- publish(namespace, RID, object)
- lookup(namespace, RID)
- multicast(namespace, object)
Namespaces: subsets of the ID space, for logical and physical data partitioning.

Slide 12: DHT – API Enhancements II
Additions:
- lscan(namespace): retrieve the data stored locally from a particular namespace
- newData(namespace): receive a callback when new data is inserted into the local store for the namespace
This violates the abstraction of location independence.
- Why necessary? Parallel scanning of the base relations.
- Why acceptable? Access is limited to reading; applications cannot control the location of data.
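The API above can be sketched as a single-node, in-memory mock (a hypothetical class; in a real DHT, publish and lookup are routed to the node that owns the hash of (namespace, RID), while lscan and newData touch only the local store):

```python
from collections import defaultdict

class LocalDHTNode:
    """Hypothetical single-node mock of the enhanced DHT API.
    A real DHT routes publish/lookup by hashing (namespace, RID)
    to an owning node; lscan and newData are local-only by design."""

    def __init__(self):
        self.store = defaultdict(dict)      # namespace -> {RID: object}
        self.callbacks = defaultdict(list)  # namespace -> [callback]

    def publish(self, namespace, rid, obj):
        self.store[namespace][rid] = obj
        for fn in self.callbacks[namespace]:
            fn(rid, obj)                    # newData notification

    def lookup(self, namespace, rid):
        return self.store[namespace].get(rid)

    def lscan(self, namespace):
        # Retrieve the data stored locally for this namespace.
        return list(self.store[namespace].items())

    def newData(self, namespace, fn):
        # Register a callback for new inserts into the local store.
        self.callbacks[namespace].append(fn)
```

The local-only lscan/newData pair is exactly the abstraction leak the slide describes: reads see placement, but writers still cannot choose it.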

Slide 13: Query Processor (QP) Architecture
- The QP is just another application as far as the DHT is concerned
- DHT objects = QP tuples
- User applications can use the QP to query data with a subset of SQL: select, project, join, group by/aggregate
- Data can be metadata (for a file-sharing-type application) or entire records; the mechanisms are the same

Slide 14: Indexes
Indexes: the lifeblood of a database engine.
- The DHT's mapping of RID to object is equivalent to a primary index.
- Additional indexes are created by adding another key/value pair, where the key is the value of the indexed field(s) and the value is a 'pointer' to the object (the RID or primary key).
(Diagram: the primary index maps PKey to data in the primary DHT namespace; a secondary index, in its own namespace, maps the indexed key to a pointer back to the primary key.)
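The secondary-index idea can be sketched with a dict of dicts standing in for the DHT's namespaced store (the namespace and field names here are illustrative, not from the paper):

```python
# A dict of namespaces stands in for the DHT; each namespace maps a
# key to an object, exactly as publish/lookup would.
dht = {"albums": {}, "albums_by_name": {}}

def insert(pkey, record):
    # Primary index: the DHT's own RID -> object mapping.
    dht["albums"][pkey] = record
    # Secondary index: another key/value pair whose key is the indexed
    # field's value and whose value is a 'pointer' (the primary key).
    dht["albums_by_name"][record["name"]] = pkey

def lookup_by_name(name):
    pkey = dht["albums_by_name"].get(name)                # follow the pointer...
    return None if pkey is None else dht["albums"].get(pkey)  # ...fetch the object
```

A secondary lookup therefore costs two DHT round trips: one to resolve the pointer, one to fetch the object.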

Slide 15: Relational Algorithms
- Selection/Projection
- Join algorithms:
  - Symmetric hash: use lscan on tables R & S; republish tuples in a temporary namespace, using the join attributes as the RID; nodes in the temporary namespace perform mini-joins locally as tuples arrive and forward results to the requestor.
  - Fetch matches: if there is an index on the join attribute(s) for one table (say R), use lscan on the other table (say S) and then issue a lookup to probe for matches in R.
  - Semi-join-like algorithms
  - Bloom-join-like algorithms
- Group By (Aggregation)
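The fetch-matches strategy, for instance, can be sketched as follows (illustrative Python; a plain dict stands in for R's DHT index, and a list stands in for the local S fragment read via lscan):

```python
# Illustrative fetch-matches join: R is already indexed by the join
# key in the DHT, so we lscan the local S fragment and probe R with
# one lookup per S tuple.
def fetch_matches_join(r_index, s_fragment, join_key):
    results = []
    for s_tuple in s_fragment:                    # lscan over local S data
        r_tuple = r_index.get(s_tuple[join_key])  # lookup() probe into R
        if r_tuple is not None:
            results.append({**r_tuple, **s_tuple})
    return results
```

Note the cost model this implies: one DHT lookup per local S tuple, versus symmetric hash, which republishes both relations but needs no pre-existing index.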

Slide 16: Interesting note…
- The state of the join is stored in the DHT store.
- Rehashed data is automatically re-routed to the proper node if the coordinate space is adjusted.
- When a node splits (to accept a new node into the network), the data is also split, including previously delivered rehashed tuples.
- This allows the network to reorganize gracefully without interfering with ongoing operations.

Slide 17: Where we are…
- A working implementation of our query processor (currently named PIER) on top of a CAN simulator
- Initial work studying and analyzing algorithms; nothing really ground-breaking… YET!
- Analyzing the design space and which problems seem most interesting to pursue

Slide 18: Where to go from here?
Common issues:
- Caching, at both the DHT and QP levels
- Using replication for speed and fault tolerance (in both data and computation)
- Security
Database issues:
- Pre-computation of (intermediate) results
- Continuous queries/alerters
- Query optimization (is this like network routing?)
- More algorithms; distributed DBMSs have more tricks
- Performance metrics for P2P QP systems
- What are the new apps the system enables?

Additional Slides

Slide 20: Symmetric Hash Join
Example: "I want Hawaiian images that appeared in movies produced since 1970…"

  SELECT name, URL
  FROM images, movies
  WHERE image.ID = movie.ID AND …

1) Create a query request and use multicast to distribute it over the DHT network.
2) When each node receives the multicast, it uses lscan to read all data stored at the node. Each object or tuple is analyzed:
   - The tuple is checked against the predicates that apply to it (e.g., produced > 1970).
   - Unnecessary fields can be projected out.
   - The resulting tuple is re-inserted into the network using the join key value as the new RID, in a new temporary namespace (both tables use the same namespace).
3) As rehashed data arrives from BOTH tables, use a pipelined hash join to generate results and send them to the requestor.
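The pipelined core of step 3 can be sketched in a few lines (a local simulation; in PIER the two hash tables live in the temporary DHT namespace, partitioned across nodes by join-key value):

```python
from collections import defaultdict
from itertools import zip_longest

def symmetric_hash_join(r_stream, s_stream, key):
    """Illustrative pipelined symmetric hash join: each arriving tuple
    is inserted into its own side's hash table and immediately probed
    against the other side's, so results stream out as data arrives."""
    r_hash, s_hash = defaultdict(list), defaultdict(list)
    results = []

    def arrive(tup, mine, other):
        k = tup[key]
        mine[k].append(tup)
        for match in other[k]:        # probe the opposite hash table
            results.append((k, tup, match))

    # Simulate rehashed tuples from BOTH tables arriving interleaved.
    for r, s in zip_longest(r_stream, s_stream):
        if r is not None:
            arrive(r, r_hash, s_hash)
        if s is not None:
            arrive(s, s_hash, r_hash)
    return results
```

Because every tuple is both inserted and used as a probe, no side has to finish before results appear, which is what makes the join pipelined.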

Slide 21: N-grams…
A technique from information retrieval for inexact matching: I want "tyranny", but I can't spell… "tyrrany".
First, n-grams are created (bi-grams in this case). For Doc1, "tyranny" yields 8 bi-grams (the word is padded at its boundaries): "_t", "ty", "yr", "ra", "an", "nn", "ny", "y_".
Each bi-gram contains a pointer to the doc ID, so a database might look like:
  "_t" → Doc1   "_t" → Doc5   "_t" → Doc6
  "an" → Doc1   "an" → Doc3   "an" → Doc5
  "ba" → Doc2   "nn" → Doc1   "ny" → Doc1
  "ra" → Doc1   "ra" → Doc2   "rn" → Doc3
  "ty" → Doc1   "vu" → Doc2   "xr" → Doc4
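Bi-gram extraction is a one-liner; the boundary pad character below ('_') is an assumption, since the slide's original marker did not survive, but any out-of-alphabet symbol works:

```python
def bigrams(word, pad="_"):
    # Pad the word's boundaries so a word of length n yields n + 1
    # bi-grams; the pad character is an illustrative choice.
    w = pad + word + pad
    return [w[i:i + 2] for i in range(len(w) - 1)]
```

For example, `bigrams("tyranny")` gives the slide's 8 bi-grams for the 7-letter word.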

Slide 22: N-grams… continued
- Convert the search string to bi-grams.
- Intersect (which is a relational join) the list of bi-grams from the search word with the index of bi-grams/docIDs.
- Aggregate the results by docID, counting the number of n-gram matches for each docID. More n-gram matches = closer to the request.

  SELECT i.docID, COUNT(docID) AS matches
  FROM indexlist i
  WHERE i.ngram IN …
  GROUP BY i.docID
  ORDER BY matches DESC
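The same intersect-and-count ranking, expressed in Python over an in-memory (ngram -> docIDs) index as a stand-in for the SQL above:

```python
from collections import Counter

def rank_docs(index, query_ngrams):
    # Intersect the query's n-grams with the index (the relational
    # join), then count matches per docID (the GROUP BY / COUNT)
    # and rank with the most matches first.
    counts = Counter()
    for ng in query_ngrams:
        for doc in index.get(ng, ()):
            counts[doc] += 1
    return counts.most_common()
```

In PIER this intersection would itself run as one of the DHT join algorithms from Slide 15, with the bi-gram index partitioned across nodes.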

Slide 23: DHT API? Is it good?
The API isn't good for:
- Range queries
- Limited multicast: currently, all queries must be asked at all nodes; this is the same scaling problem as Gnutella & Freenet.
- Batch publish/lookup operations
How to properly extend the API if needed? The end-to-end argument is always in play: which layer (DHT, QP, or user application) should do what? We don't want to create special hooks in a particular DHT, since that would lose compatibility with other DHTs.

Slide 24: Performance Metrics
Currently we are considering:
- User's perspective: T, the time until the user is happy (first screenful of data?); accuracy (recall & precision)
- System's perspective: throughput (queries per second); storage overhead
- Network's perspective: link usage; link stress