Data Management in Large-scale P2P Systems

Name: Data Management in Large-scale P2P Systems
Uploaded: 2017-10-15T18:11:08+00:00
Duration: PTM12S44
Channel: Loren Waters
Description: Data Management in Large-scale P2P Systems

Data Management in Large-scale P2P Systems
Patrick Valduriez, Esther Pacitti
Atlas group, INRIA and LINA, University of Nantes, France

Motivations
- P2P systems
  - Decentralized control, large scale
  - Low-level, simple services: file sharing, computation sharing, communication sharing
- Distributed database systems
  - High-level data management services: queries, transactions, consistency, security, etc.
  - Centralized control, limited scale
- P2P + distributed database
  - Why? How?

Why high-level P2P data sharing?
- Professional community example
  - Medical doctors in a hospital may want to share (some of) their patient data for an epidemiological study
  - They have their own, independent patient descriptions
  - They want to ask queries such as “age and weight of male patients diagnosed with disease X …” over their own descriptions
  - They don’t want to create a database and buy a server

Problem definition
- P2P system
  - No centralized control, very large scale
  - Very dynamic: peers can join and leave the network at any time
  - Peers can be autonomous and unreliable
- Techniques designed for distributed data management no longer apply
  - Too static; they need to be decentralized, dynamic and self-adaptive

Outline
- Data management in distributed systems
- P2P systems
- Data management in P2P systems
- Data management in APPA

Data management basic principle
- Data independence
  - Hide implementation details
- Provision for high-level services
  - Schema
  - Queries (SQL, XQuery)
  - Automatic optimization
  - Transactions
  - Consistency
  - Access control
  - …
[Figure: applications access storage through a logical view (schema)]

Distributed database system (DDBS)
- Distribution transparency
- Global schema
  - Common data descriptions
  - Distributed data placement
  - Centralized control through global catalog
- Distributed functions
  - Schema mapping
  - Query processing
  - Transaction management
  - Access control
  - Etc.
[Figure: queries and transactions are submitted to a distributed database system spanning Site 1, Site 2 and Site 3, with local systems DBMS1 and DBMS2]

Scaling up DDBS
- Distributed database systems
  - Enterprise information systems
  - Scale up to tens of databases
- Data integration systems
  - Strong heterogeneity and autonomy of data sources (files, databases, XML documents, ...)
  - Limited functionality (queries)
  - Scale up to hundreds of data sources
- Parallel database systems
  - Focus on high performance and high availability
  - Strong homogeneity
  - Scale up to hundreds of data nodes

A generic P2P system
- A user at a peer may access sharable data at remote peers
[Figure: three peers, each running P2P software over its private and sharable data]

Potential benefits of P2P systems
- Scale up to very large numbers of peers
- Dynamic self-organization
- Load balancing
- Parallel processing
- High availability through massive replication

P2P vs DDBS

                      P2P                         DDBS
Joining the network   Upon peer’s initiative      Controlled by DBA
Queries               No schema, keyword-based    Global schema, static optimization
Query answers         Partial                     Complete
Content location      Using neighbors or DHT      Using directory

Requirements for P2P data management (1)
- Autonomy of peers
  - Peers should be able to join or leave at any time and to control their data with respect to other (trusted) peers
- Query expressiveness
  - Key lookup, keyword search, SQL-like
- Efficiency
  - Efficient use of bandwidth, computing power, storage

Requirements for P2P data management (2)
- Quality of service (QoS)
  - User-perceived efficiency: completeness of results, response time, data consistency, …
- Fault-tolerance
  - Efficiency and QoS despite failures
- Security
  - Data access control in the context of very open systems

P2P network topologies
- Unstructured systems
- Structured (DHT) systems
  - e.g. CAN, Chord
- Super-peer (hybrid) systems
  - e.g. Napster

P2P unstructured network
[Figure: four peers (peer 1 to peer 4), each holding data, linked by p2p connections]
- High autonomy (a peer needs to know a neighbor to log in)
- Searching by flooding the network: general, inefficient
- High fault tolerance with replication
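
The flooding search mentioned on this slide can be sketched as follows. This is a minimal illustration, not a real protocol: the four-peer overlay, the `data` placement, and the `ttl` (hop limit) parameter are all assumptions made for the example. It counts forwarded messages to show why flooding is general but costly.

```python
from collections import deque


def flood_search(neighbors, data, start, key, ttl):
    """Flood a query from `start`; return (peers holding `key`, messages sent)."""
    hits = set()
    visited = {start}
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        peer, hops_left = queue.popleft()
        if key in data.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue  # hop limit reached: do not forward further
        for nxt in neighbors.get(peer, []):
            messages += 1  # every forwarded copy costs one message
            if nxt not in visited:  # duplicates are received but dropped
                visited.add(nxt)
                queue.append((nxt, hops_left - 1))
    return hits, messages


# Hypothetical overlay and data placement (not taken from the slides).
neighbors = {
    "peer1": ["peer2", "peer3"],
    "peer2": ["peer1", "peer3", "peer4"],
    "peer3": ["peer1", "peer2", "peer4"],
    "peer4": ["peer2", "peer3"],
}
data = {"peer1": {"a"}, "peer2": {"b"}, "peer3": {"a", "c"}, "peer4": {"c"}}

near = flood_search(neighbors, data, "peer1", "c", ttl=1)
far = flood_search(neighbors, data, "peer1", "c", ttl=2)
```

With a hop limit of 1 the query misses the copy at peer 4, which illustrates why answers in unstructured networks may be partial; raising the limit finds it at the price of more messages.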

P2P structured network
- Distributed Hash Table (DHT)
[Figure: four peers; peer i stores d(ki), with h(k1) = p1, h(k2) = p2, h(k3) = p3, h(k4) = p4]
- Efficient exact-match search
  - O(log n) for put(key, value), get(key)
- Limited autonomy since a peer is responsible for a range of keys
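
The put(key, value) / get(key) interface and the idea that each peer is responsible for a range of keys can be sketched as below. This is a toy, single-process model under stated assumptions: the class name `ToyDHT`, the 16-bit identifier ring, and the use of SHA-1 as the hash function h are choices made for the example, not details from the slides. It does not simulate routing, so the O(log n) hop count of systems such as CAN or Chord is not reproduced here; the binary search only locates the responsible peer locally.

```python
import bisect
import hashlib


def h(value, bits=16):
    """Hash a string onto an identifier ring of size 2**bits (assumed hash)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """Hypothetical DHT model: each peer owns the keys hashing up to its id."""

    def __init__(self, peer_names):
        self.ring = sorted((h(name), name) for name in peer_names)
        self.ids = [peer_id for peer_id, _ in self.ring]
        self.store = {name: {} for name in peer_names}

    def responsible_peer(self, key):
        # First peer whose id is >= h(key), wrapping around the ring.
        i = bisect.bisect_left(self.ids, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)


dht = ToyDHT(["peer1", "peer2", "peer3", "peer4"])
dht.put("k1", "d(k1)")
```

Because placement is decided by the hash function rather than by the peer, a peer cannot choose which data it stores, which is the limited autonomy noted above.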

Super-peer network
[Figure: four peers (peer 1 to peer 4), each holding data and connected to super-peers by p2sp and sp2p links; super-peers are connected to each other by sp2sp links]
- Super-peers can perform complex functions (meta-data management, indexing, access control, etc.)
  - Efficiency and QoS
- Restricted autonomy
- SP = single point of failure => use several

P2P systems comparison

Requirements      Unstructured   DHT   Super-peer
Autonomy          high           low   avg
Query exp.
Efficiency
QoS
Fault-tolerance
Security

Data management in P2P systems
- Current research focuses on
  - Decentralized schema mappings
    - PeerDB: unstructured network, keyword search only
  - Extending DHT for complex querying
    - PIER: exact-match and join queries
  - Query reformulation
    - Edutella: super-peer, RDF-based schemas
    - Piazza: graph of pair-wise schema mappings
- Replication generally limited to static read-only files
  - P-Grid addresses updates in structured networks

Data management in APPA (Atlas P2P Architecture)
- Objectives
  - Scalability, availability and performance
- Main features
  - Network-independent architecture
  - Layered, service-based architecture
  - Replication with semantics-based reconciliation
  - Decentralized schema management
  - Schema-based query support and optimization
  - Peer data caching
- Prototype on JXTA
  - Network-independent P2P services

Network independent APPA
[Figure: layered architecture, top to bottom]
- Advanced Services: Query Processing, Replication, Cache Management, Security, ...
- Basic Services: Group Membership Management, Consensus Management, P2P Data Management, Peer Management, Peer Communication
- P2P Network: Key-based Storage and Retrieval, Peer ID Assignment, Peer Linking
- Internet
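
One way to picture network independence is to write the upper layers against an abstract key-based storage and retrieval interface and plug different networks in underneath. The sketch below is only an illustration of that layering idea: every class and method name (`P2PNetwork`, `HashPartitionedNetwork`, `SuperPeerNetwork`, `P2PDataManagement`, `publish`, `lookup`) is hypothetical and does not come from APPA or JXTA.

```python
from abc import ABC, abstractmethod


class P2PNetwork(ABC):
    """Hypothetical P2P network layer: key-based storage and retrieval."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...


class HashPartitionedNetwork(P2PNetwork):
    """Stand-in for a DHT: a key is hashed to one of n peer stores."""

    def __init__(self, n_peers):
        self.stores = [{} for _ in range(n_peers)]

    def _store_for(self, key):
        return self.stores[sum(key.encode("utf-8")) % len(self.stores)]

    def put(self, key, value):
        self._store_for(key)[key] = value

    def get(self, key):
        return self._store_for(key).get(key)


class SuperPeerNetwork(P2PNetwork):
    """Stand-in for a super-peer network: an index records who holds a key."""

    def __init__(self, peer_names):
        self.names = list(peer_names)
        self.peers = {name: {} for name in self.names}
        self.index = {}  # kept at the super-peer

    def put(self, key, value):
        if key not in self.index:
            self.index[key] = self.names[len(self.index) % len(self.names)]
        self.peers[self.index[key]][key] = value

    def get(self, key):
        holder = self.index.get(key)
        return None if holder is None else self.peers[holder].get(key)


class P2PDataManagement:
    """Basic service written only against the P2PNetwork interface."""

    def __init__(self, network):
        self.network = network

    def publish(self, name, obj):
        self.network.put("data:" + name, obj)

    def lookup(self, name):
        return self.network.get("data:" + name)


services = [
    P2PDataManagement(HashPartitionedNetwork(4)),
    P2PDataManagement(SuperPeerNetwork(["peer1", "peer2"])),
]
for service in services:
    service.publish("mapping", {"r": ["r1", "r2"]})
```

The service code is identical in both cases; only the object passed to its constructor changes.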

Different APPA architectures
[Figure: APPA deployed on a DHT network and on a super-peer network. Labels: Peer, Super-peer, Advanced services, Basic services, P2P network, P2P data, local data]

Schema management in APPA
- Takes advantage of the collaborative nature of the applications
- Peers that wish to cooperate agree on a Common Schema Description (CSD)
- Given 2 CSD relation definitions, an example of peer mapping at peer p is:
    p:r(A,B,D) ⊆ csd:r1(A,B,C), csd:r2(C,D,E)
- Peer mappings stored as P2P data
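
To make the peer mapping concrete, the sketch below evaluates it on tiny relations. It rests on two assumptions that the transcript does not state explicitly: the mapping is read as a containment (the peer relation r is contained in a view over the CSD relations), and the comma between csd:r1 and csd:r2 denotes a join on the shared attribute C. The function names and sample tuples are invented for the example.

```python
def join_project(r1, r2):
    """Join csd:r1(A,B,C) with csd:r2(C,D,E) on C and project onto (A,B,D)."""
    return {
        (a, b, d)
        for (a, b, c1) in r1
        for (c2, d, _e) in r2
        if c1 == c2
    }


def satisfies_mapping(peer_r, r1, r2):
    """True if every tuple of p:r(A,B,D) appears in the view (assumed reading)."""
    return set(peer_r) <= join_project(r1, r2)


# Hypothetical instances of the two CSD relations.
csd_r1 = {(1, "x", 10), (2, "y", 20)}
csd_r2 = {(10, "d1", "e1"), (20, "d2", "e2")}

view = join_project(csd_r1, csd_r2)
ok = satisfies_mapping({(1, "x", "d1")}, csd_r1, csd_r2)
bad = satisfies_mapping({(3, "z", "d9")}, csd_r1, csd_r2)
```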

Replication in APPA
- Small-world assumption: peers work in smaller groups with time locality
- Lazy multi-master replication
  - n peers can update the same replica
  - Improves read performance and availability
- Replica divergence solved by distributed log-based reconciliation
  - Exploit P2P data management service
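
The general shape of lazy multi-master replication with log-based reconciliation can be sketched as follows. This is a deliberately simplified stand-in, not APPA's algorithm: it orders logged updates by (logical clock, peer id) and lets the last one win, whereas the slides refer to semantics-based reconciliation. The `Replica` class and `reconcile` function are hypothetical names, and the reconciliation here runs in one place rather than being distributed.

```python
class Replica:
    """One peer's copy: updates apply locally at once and are logged."""

    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.state = {}
        self.log = []
        self.clock = 0  # simple per-peer logical clock (assumption)

    def update(self, key, value):
        self.clock += 1
        self.state[key] = value
        self.log.append((self.clock, self.peer_id, key, value))


def reconcile(replicas):
    """Merge all logs, replay them in one agreed order, install the result."""
    merged = sorted(
        {entry for r in replicas for entry in r.log},
        key=lambda entry: (entry[0], entry[1]),
    )
    state = {}
    for _clock, _peer, key, value in merged:
        state[key] = value  # later entries overwrite earlier ones
    latest = max((r.clock for r in replicas), default=0)
    for r in replicas:
        r.state = dict(state)
        r.log = list(merged)
        r.clock = latest
    return state


a, b = Replica("p1"), Replica("p2")
a.update("x", 1)  # both peers update the same item without coordination
b.update("x", 2)
b.update("y", 5)
diverged = a.state != b.state
final = reconcile([a, b])
```

Until reconciliation runs, the two replicas disagree on x, which is the divergence the slide refers to; afterwards both hold the same state.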

Query processing in APPA
- Given a SQL-like query on peer schema, performs
  - query reformulation
    - Maps the query on CSD schemas
  - query matching
    - Finds relevant peers
  - query optimization
    - Selects best peers, taking replication into account
  - query decomposition and execution
    - Exploits parallelism
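
The four phases listed above can be pictured as a small pipeline. The sketch below is a toy model under loud assumptions: the mapping table `PEER_TO_CSD`, the peer catalogue `PEERS` with its `group` (replica group) and `load` fields, and the rule "pick the least-loaded peer per replica group" are all invented for illustration and are not APPA's actual strategy. Queries are reduced to a relation name plus a row predicate.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping from a peer-schema relation to a CSD relation.
PEER_TO_CSD = {"patient": "csd:patient"}

# Hypothetical catalogue; peers in the same group hold replicas of the same data.
PEERS = {
    "peer1": {"relations": {"csd:patient"}, "group": "g1", "load": 0.9,
              "rows": [{"age": 40, "sex": "M"}, {"age": 35, "sex": "F"}]},
    "peer2": {"relations": {"csd:patient"}, "group": "g1", "load": 0.2,
              "rows": [{"age": 40, "sex": "M"}, {"age": 35, "sex": "F"}]},
    "peer3": {"relations": {"csd:patient"}, "group": "g2", "load": 0.5,
              "rows": [{"age": 62, "sex": "M"}]},
    "peer4": {"relations": {"csd:visit"}, "group": "g3", "load": 0.1,
              "rows": []},
}


def reformulate(peer_relation):
    """Query reformulation: map the peer-schema relation onto the CSD."""
    return PEER_TO_CSD[peer_relation]


def match(csd_relation):
    """Query matching: find the peers that share the CSD relation."""
    return [p for p, info in PEERS.items() if csd_relation in info["relations"]]


def optimize(candidates):
    """Query optimization: keep the least-loaded peer of each replica group."""
    best = {}
    for p in candidates:
        group = PEERS[p]["group"]
        if group not in best or PEERS[p]["load"] < PEERS[best[group]]["load"]:
            best[group] = p
    return sorted(best.values())


def execute(peers, predicate):
    """Decomposition and execution: run the sub-queries in parallel, then merge."""
    def run_at(peer):
        return [row for row in PEERS[peer]["rows"] if predicate(row)]

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(run_at, peers))  # map preserves peer order
    return [row for part in parts for row in part]


def process(peer_relation, predicate):
    return execute(optimize(match(reformulate(peer_relation))), predicate)


chosen = optimize(match(reformulate("patient")))
males = process("patient", lambda row: row["sex"] == "M")
```

Choosing one peer per replica group avoids returning the same replicated rows twice, which is one simple way of taking replication into account.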

Conclusion
- Advanced P2P applications will need high-level data management services
- Various P2P networks will improve
  - Network independence is crucial to exploit and combine them
- Many technical issues
- Important to characterize applications that can most benefit from P2P with respect to other distributed architectures
