Dr. Tony White Carleton University

P2P: An Overview Dr. Tony White Carleton University

Outline Introduction Evolution of Network Computing Applications
Definitions The Rise of Edge Computing Why Peer-to-Peer? What is it? Applications Cycle Sharing Content Delivery … Open Problems Summary

Evolution of Network Computing
Client/server: - Introduced inequalities - Required homogeneity Web introduced: - A common protocol: HTTP - A common document format: HTML - A universal client: the browser

P2P Definition Peer-to-peer computing is the location and sharing of computer resources and services by direct exchange between servents. A servent is a peer that can adopt the roles of both server and client when operating.

P2P Definition “P2P is a class of applications that takes advantage of resources -- storage, cycles, content, human presence -- available at the edges of the Internet. Because accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable IP addresses, P2P nodes must operate outside the DNS system and have significant or total autonomy from central servers.” Clay Shirky, February 2000

Definitions I Pure peer-to-peer is completely decentralized and characterized by lack of a central server or central entity; clients make direct contact with one another. Computational peer-to-peer uses P2P technology to disseminate computational tasks over multiple clients; peers do not have a direct connection to one another.

Definitions II Datacentric peer-to-peer is information and data residing on systems or devices that is accessible to others when users connect. It is sometimes called peer-assisted or grid-assisted delivery. Applications include distributed file and content sharing. Usercentric/hybrid peer-to-peer involves clients contacting others via a central server or entity to communicate, share data, or process data. Often used in collaboration applications.

What is a P2P network? It is an overlay network
Peer applications know IP addresses of other peer applications. Link between two nodes is actually an application-level connection.

What matters? Topology of overlay matters
Where content is stored matters Search protocol matters Gnutella results in: Poor performance Poor reliability

The Rise of Edge Computing …
In P2P, clients also are servers, hence are peers. Driving P2P is the abundance of: Computing power Non-volatile storage Network bandwidth (This seems to turn thin clients on their heads.) Sharing from the edge: Physical Resources: cycles, disk Information Resources: files, database access Services: code mobility implied

P2P Enables Complete Access
P2P file swapping is the obvious application Text, audio, video, executables, … Searching and sharing Resources Information Information processing capacity Searches More current than Google™ Indexing web logs (blogs, klogs …) More focused: search within a “peer group”

P2P Enables Complete Access …
Searching and sharing: Instant messaging locate user quickly independent of service provider. Buyers and sellers P2P auctions – compete with Ebay. Blogging Sharing of “self”. Edge-based multi-media streaming: Web radio Web TV Peer shells: Script complex P2P applications from simpler ones. Service creation using service composition.

P2P Enables Complete Access …
A New Style of Distributed Computing P2P applications tolerate peers coming/going. Result depends on which peers are available. High availability comes from probability that some peers are available. Not on load-balancing and fail over schemes. Must avoid “tragedy of the commons”.
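The availability claim above can be made concrete with a small calculation. This is a minimal sketch that assumes every peer is online independently with the same probability p; both the independence assumption and the function name `availability` are illustrative and not taken from the slides.

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n peers is online.

    ASSUMPTION: peers are online independently, each with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n peers are offline with probability (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Even unreliable peers give high availability in aggregate.
    for replicas in (1, 5, 10, 20):
        print(replicas, round(availability(0.3, replicas), 4))
```

Under these assumptions, a resource replicated on 20 peers that are each online only 30% of the time is reachable more than 99.9% of the time, without any load-balancing or fail-over machinery.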

Examples of Early P2P Some new Internet applications are different:
Instant messaging services (AIM, MS Messenger, …) P2P applications – no central authority/server. Napster – quasi-P2P Gnutella Freenet These applications are vertically integrated: Non-standard protocols Closed namespaces Stand alone

Problems I Topology Identity Security Bandwidth usage Fault tolerance
Search efficiency Identity Trust Anonymity Security Authorization Privacy

Problems II Namespaces Community Management Firewall traversal
Overlaps traditional enterprise groups Highly dynamic, user controlled Firewall traversal Political IT loses control of content distribution No control of information flow! Legal DRM

What is needed? Interoperability (common protocols & standards):
Communication protocols (e.g. JXTA, Jabber, …) Representation of identity (or not!) Semantic content (meta-data) Secure information exchange: Must be able to guarantee trust within a network Prevent unauthorized access to network Policy-based control of information exchange Ubiquity Buy-in from large groups of users

Securing Distributed Computations in a Commercial Environment
Philippe Golle, Stanford University Stuart Stubblebine, CertCo Improved results compared to what’s in pre-proceedings

Example of a Distributed Computation
580,000 active participants 565,800 years of CPU time since 1996 26.1 TeraFLOPs / sec What SETI does: analyze radio signals from space, volunteer model Hugely successful. Has demonstrated power of DC over Internet Intense interest in commercial model

Commercialization: supply
A dozen companies have recruited thousands of participants. $100 million in venture funding in 2000 (with

Commercialization: demand
Super-computing market: $2 billion / year Computationally intensive parallelizable projects: Drug design research Mathematical research Economic simulations Digital entertainment

Cheaters!
"Fifty percent of the project's resources have been spent dealing with security problems." "The really hard part has to do with verifying computational results." David Anderson, director of SETI@home

Cycle Sharing Participants
Trusted supervisor Maintains a pool of registered participants Bids for large computations Divides the computation into tasks that are assigned to participants Collects the results and distributes payment to the participants Example: Distributed.net, Entropia.com, etc… Untrusted participants May range from large companies to individual users Participants are anonymous (No “real world” leverage) Participants may collude. We distinguish between real-world entities (agents) and anonymous participants. Participants may leave the computation at any time, either temporarily or for good.

Organization Distribution of tasks The unit of computation is a task
Assumption: all tasks have the same size and can be run by any participant within the same time bounds. The supervisor runs a probabilistic algorithm to assign tasks to participants. The supervisor keeps track of who did what

Security Definition: a computation is secure if no rational, non-risk-seeking participant ever cheats. Collusion may occur only before tasks are assigned. A participant has 3 choices: Request a computation and do it Request a computation and NOT do it Take a leave Assumption: all errors are malicious

Utility function of an agent
α: Payment received per task
E: Benefit of defecting (E = e α)
L: Cost of getting caught cheating
P: Probability that cheating is undetected
Payoff per task:
Run the computation: α
Cheat and “guess” the result, cheating undetected: α + E
Cheat and “guess” the result, cheating detected: – L
Security condition: (α+E)P – L(1-P) < 0
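The security condition can be checked numerically. The sketch below evaluates the slide's expression (α+E)P – L(1-P) with E = e α; the function names and the sample numbers are illustrative assumptions.

```python
def cheating_gain(alpha: float, e: float, loss: float, p_undetected: float) -> float:
    """Expected utility of cheating on one task: (alpha + E) * P - L * (1 - P).

    alpha: payment per task; e: defection benefit factor, so E = e * alpha;
    loss: cost L of getting caught; p_undetected: probability P of not being caught.
    """
    benefit = e * alpha  # E = e * alpha, as defined on the slide
    return (alpha + benefit) * p_undetected - loss * (1.0 - p_undetected)


def is_secure(alpha: float, e: float, loss: float, p_undetected: float) -> bool:
    """True when the slide's security condition holds (cheating has negative expectation)."""
    return cheating_gain(alpha, e, loss, p_undetected) < 0
```

For example, with α = 1, e = 1 and L = 10, cheating that goes undetected half the time has expected utility 2(0.5) – 10(0.5) = –4, so the condition holds; if detection drops so that P = 0.9, the expectation becomes +0.8 and the condition fails.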

Basic scheme
Registration: Participant performs d+1 unpaid tasks. The supervisor verifies them (at limited cost). The participant is accepted iff all the results are correct.
Assignment of a task: A task is given to N participants chosen uniformly and independently at random. The number N is chosen according to the probability distribution
Payment: a constant amount α per task if all the results agree. If not, the task is re-assigned to a new set of participants.
Severance: a participant is paid an amount d·α
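The assignment and payment steps of the basic scheme can be sketched as follows. The transcript does not preserve the distribution used to choose N, so the geometric distribution in `draw_redundancy` is purely an assumption for illustration, as are the function names and the `max_rounds` safety limit.

```python
import random


def draw_redundancy(rng: random.Random, p_stop: float = 0.5) -> int:
    """Draw the number N of participants for one task.

    ASSUMPTION: a geometric distribution on N >= 1; the slide's actual
    distribution is not recoverable from the transcript.
    """
    n = 1
    while rng.random() > p_stop:
        n += 1
    return n


def run_task(task, participants, compute, rng, p_stop=0.5, max_rounds=100):
    """Assign a task redundantly until all returned results agree.

    compute(participant, task) stands in for the remote computation.
    Returns the agreed result and the participants to be paid alpha each.
    """
    for _ in range(max_rounds):
        n = draw_redundancy(rng, p_stop)
        # Uniform, independent draws: the same participant may be picked twice.
        chosen = [rng.choice(participants) for _ in range(n)]
        results = [compute(p, task) for p in chosen]
        if all(r == results[0] for r in results):
            return results[0], chosen
        # Disagreement: re-assign the task to a new set of participants.
    raise RuntimeError("no agreement reached within max_rounds")
```

A participant cannot tell whether N is 1 or larger for a given task, which is what makes guessing risky even though some tasks are never cross-checked.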

Properties
Security condition: (α+E)P – L(1-P) < 0
Computational overhead = … ; Overhead = … for “small” p
Table columns: Computational overhead, Setup time, Maximum coalition size, Maximum e. Surviving entries: 10% 10 1% 1 17% 46% 243% 100

Participants with varying computational resources
Until now, implicit assumption that all participants have the same computational resources. Unrealistic assumption Security threat: an adversary may briefly control a number of participants out of proportions with her real computational power Activity: a probability distribution over the pool of participants, which evolves dynamically over time Participants are drawn at random according to the Activity We define rules for updating the activity Security implications

Content Delivery Networks
Swarmcast/OnionNetworks File is stored in multiple locations Idea is to retrieve portions of file from separate hosts: File is split into small (32k) pieces Requests are random Space of packets bigger than file Only subset of packets required Technique is Forward Error Correction Kazaa/Morpheus MojoNation (HiveCache) Distributed backup and restore system
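The idea that the space of packets is bigger than the file, and that only a subset is required, can be illustrated with the simplest possible erasure code. The sketch below splits data into k blocks plus one XOR parity block, so any k of the k+1 packets rebuild the file. This single-parity code is a stand-in chosen for brevity; it is not claimed to be the code Swarmcast used, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k zero-padded blocks and append one XOR parity block."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [reduce(xor_bytes, blocks)]


def decode(packets: dict[int, bytes], k: int, length: int) -> bytes:
    """Rebuild the data from any k of the k+1 packets.

    packets maps packet index (0..k, where k is the parity) to its bytes.
    """
    missing = [i for i in range(k) if i not in packets]
    if len(missing) > 1 or (missing and k not in packets):
        raise ValueError("need at least k of the k+1 packets")
    blocks = dict(packets)
    if missing:
        # XOR of all other packets (data and parity) restores the lost block.
        others = [blocks[i] for i in range(k + 1) if i != missing[0]]
        blocks[missing[0]] = reduce(xor_bytes, others)
    return b"".join(blocks[i] for i in range(k))[:length]
```

A downloader can therefore request packets from several hosts at random and stop as soon as it holds enough of them, rather than waiting on one specific missing piece.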

Privacy Networks: Publius
Publishers: want to publish anonymously Servers: host random-looking content Storage The publisher takes the key, K that is used to encrypt the file and splits it into n shares, such that any k of them can reproduce the original K, but k-1 give no hints as to the key. Each server receives the encrypted Publius content and one of the shares. Retrieval A retriever must get the encrypted Publius content from some server and k of the shares. Content is tied to URL that is used to recover the data and the shares.
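The k-of-n key splitting described above can be sketched with Shamir's secret sharing, a standard way to obtain this property. The field prime, the function names, and the representation of the key as an integer are assumptions made for illustration.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the key must be smaller than this


def split(secret: int, n: int, k: int) -> list[tuple[int, int]]:
    """Split secret into n shares such that any k of them recover it."""
    if not 1 <= k <= n < PRIME:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]


def combine(shares: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In the Publius setting, each of the n servers would store the encrypted content together with one share, and a retriever would call `combine` on any k shares to obtain K before decrypting.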

Privacy Networks: Freehaven
Anonymity: Publishers that insert documents, Readers that retrieve documents, Servers that store documents. Uses a free, low-latency, two-way mixnet for forward-anonymous communication. Accountability: Reputation and micropayment schemes, which allow us to limit the damage done by servers that misbehave. Persistence: Publisher of a document determines its lifetime. Flexibility: System functions smoothly as peers dynamically join or leave

OceanStore Toward Global-Scale, Self-Repairing, Secure and Persistent Storage John Kubiatowicz University of California at Berkeley

OceanStore Context: Ubiquitous Computing
Computing everywhere: Desktop, Laptop, Palmtop Cars, Cellphones Shoes? Clothing? Walls? Connectivity everywhere: Rapid growth of bandwidth in the interior of the net Broadband to the home and office Wireless technologies such as CDMA, satellite, laser Where is persistent data????

Utility-based Infrastructure?
Pac Bell Sprint IBM AT&T Canadian OceanStore Utility-based Infrastructure? Data service provided by storage federation Cross-administrative domain Pay for Service

OceanStore: Everyone’s Data, One Big Utility “The data is just out there”
How many files in the OceanStore? Assume 10^10 people in world Say 10,000 files/person (very conservative?) So 10^14 files in OceanStore! If 1 gig files (ok, a stretch), get 1 mole of bytes! Truly impressive number of elements… … but small relative to physical constants Aside: new results: 1.5 Exabytes/year (1.5×10^18)
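The estimate can be checked with a few lines of arithmetic, reading the slide's figures as 10^10 people, 10^4 files per person, and 10^9 bytes per file. The variable names are illustrative.

```python
AVOGADRO = 6.02214076e23  # entities per mole

people = 10**10
files_per_person = 10_000
bytes_per_file = 10**9  # "1 gig files"

total_files = people * files_per_person
total_bytes = total_files * bytes_per_file
moles_of_bytes = total_bytes / AVOGADRO

if __name__ == "__main__":
    print(f"files: {total_files:.0e}, bytes: {total_bytes:.0e}, "
          f"moles of bytes: {moles_of_bytes:.2f}")
```

The total comes to 10^23 bytes, roughly a sixth of Avogadro's number, so "1 mole of bytes" holds as an order-of-magnitude statement.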

OceanStore Assumptions
Untrusted Infrastructure: The OceanStore is comprised of untrusted components Individual hardware has finite lifetimes All data encrypted within the infrastructure Responsible Party: Some organization (i.e. service provider) guarantees that your data is consistent and durable Not trusted with content of data, merely its integrity Mostly Well-Connected: Data producers and consumers are connected to a high-bandwidth network most of the time Exploit multicast for quicker consistency when possible Promiscuous Caching: Data may be cached anywhere, anytime

The Peer-To-Peer View: Irregular Mesh of “Pools”

Key Observation: Want Automatic Maintenance
Can’t possibly manage billions of servers by hand! System should automatically: Adapt to failure Exclude malicious elements Repair itself Incorporate new elements System should be secure and private Encryption, authentication System should preserve data over the long term (accessible for 1000 years): Geographic distribution of information New servers added from time to time Old servers removed from time to time Everything just works

Outline: Three Technologies and a Principle
Principle: ThermoSpective Systems Design Redundancy and Repair everywhere Structured, Self-Verifying Data Let the Infrastructure Know What is important Decentralized Object Location and Routing A new abstraction for routing Deep Archival Storage Long Term Durability

Attack Resistant P2P
Content can be compromised by: Attack by malicious agents, Censorship, Faulty nodes. Remember: Nodes have finite resources

Gnutella query

Morpheus/Kazaa (figure: peers attached to a super peer)

Examples Napster shut down by attacks on central server
Gnutella spammed by Flatplanet Removal of a few peers shatters Gnutella 63 from 1800 in figures

Performance After deletion of 2/3 of peers, 99% of remainder can still access 99% of the data items

DRN design [Jared Saia]
Topology based upon butterfly network (constant degree version of hypercube) Each vertex of butterfly called a supernode Each supernode represents a set of peers Each peer is in multiple supernodes

DRN Topology N peers, n supernodes
Each peer participates in C log n randomly chosen supernodes. Supernode X connected to supernode Y means all nodes in X are connected to all nodes in Y
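The membership rule and the meaning of a supernode link can be sketched as below. The butterfly wiring between supernodes is not reproduced here; supernode edges are supplied by the caller. The natural logarithm, the rounding, and all function names are assumptions for illustration.

```python
import math
import random


def assign_peers(num_peers: int, num_supernodes: int, c: float,
                 rng: random.Random) -> dict[int, set[int]]:
    """Place each peer in about C log n randomly chosen supernodes.

    ASSUMPTION: natural log of the supernode count n, rounded up and
    capped at n; the slide does not state the base or the rounding.
    """
    per_peer = min(num_supernodes,
                   max(1, math.ceil(c * math.log(num_supernodes))))
    membership: dict[int, set[int]] = {s: set() for s in range(num_supernodes)}
    for peer in range(num_peers):
        for s in rng.sample(range(num_supernodes), per_peer):
            membership[s].add(peer)
    return membership


def peer_links(membership: dict[int, set[int]],
               supernode_edges: list[tuple[int, int]]) -> set[tuple[int, int]]:
    """Expand supernode edges: X-Y connected means every peer in X links to every peer in Y."""
    links = set()
    for x, y in supernode_edges:
        for a in membership[x]:
            for b in membership[y]:
                if a != b:
                    links.add((min(a, b), max(a, b)))
    return links
```

Because every peer sits in several supernodes and every supernode holds many peers, removing a large fraction of peers tends to leave each supernode populated, which is the intuition behind the resilience figures quoted earlier.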

Conclusion P2P systems popular today
Limewire, Kazaa … Existing P2P systems vulnerable and inefficient Many challenges ahead: Search Resource Management Security and Privacy Lots of good research to be done …

Appendix I Open Problems in P2P Data Sharing

Open Problems in Data Sharing Peer-To-Peer Systems
Hector Garcia-Molina ICDT Conference, January 10, 2003 Contributors: Mayank Bawa, Brian Cooper, Arturo Crespo, Neil Daswani, Prasanna Ganesan, Sergio Marti, Qi Sun, Beverly Yang and others

P2P Challenges Search Resource Management Security & Privacy: not independent challenges!

Search Search Options Query Expressiveness Comprehensiveness Topology
Data Placement Message Routing

Comparison

Content Addressable Network (CAN)
Nodes 1 Data 2 A distributed hash table on Internet scales …
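A CAN maps keys to points in a coordinate space that is partitioned into zones, one per node; the node owning the zone that contains a key's point stores the data. The sketch below shows only that mapping, in two dimensions, with a linear scan instead of neighbor-to-neighbor routing. The hash choice, class name, and zone layout are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular zone [x0, x1) x [y0, y1) of the unit square owned by one node."""
    owner: str
    x0: float
    x1: float
    y0: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def key_to_point(key: str) -> tuple[float, float]:
    """Hash a key to a point in [0, 1) x [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits per coordinate so the float is exact and strictly below 1.
    x = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    y = (int.from_bytes(digest[8:16], "big") >> 11) / 2**53
    return x, y


def owner_of(key: str, zones: list[Zone]) -> str:
    """Return the node whose zone contains the key's point."""
    x, y = key_to_point(key)
    for zone in zones:
        if zone.contains(x, y):
            return zone.owner
    raise LookupError("zones do not cover the key's point")
```

In a real deployment a node knows only its neighboring zones and forwards a lookup greedily toward the target point; a joining node splits an existing zone in half.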

Comparison

Challenge: Exploring the Space
(figure: design space with axes autonomy, efficiency, and robustness, locating Gnutella, CAN, and the SIL model; a lot of research remains)

Search Index Link (SIL) Model
Link types: Forwarding search link (FSL), Non-forwarding search link (NSL), Forwarding index link (FIL), Non-forwarding index link (NIL)
(figure: example graph on nodes A to E with query Q)

SIL Model
Link types: Forwarding search link (FSL), Non-forwarding search link (NSL), Forwarding index link (FIL), Non-forwarding index link (NIL)
(figure: example graph on nodes A to H with query Q)

Super-Peer Network D E A H C core B G F

SIL Challenges Desirable graph properties Desirable features
Dynamic configuration

Example Property: Redundancy
Redundancy exists in a SIL graph if a link can be removed without reducing coverage (figure: nodes A, B, C)

Example: Undesirable Feature
One-index cycle: Node A has an index link to B, and there is a search path from B to A

Avoiding Undesirable Features
Node D is joining the system: What neighbors should it connect to? What type of links should it use? (figure: D joining a graph of nodes A, B, C, E)

Open Problems: Security
Availability (e.g., coping with DOS attacks) Authenticity Anonymity Access Control (e.g., IP protection, payments,...)

Authenticity ? title: origin of species author: charles darwin
date: 1859 body: In an island far, far away ... ...

More than Just File Integrity
title: origin of species author: charles darwin ? date: 1859 00 body: In an island far, far away ... checksum

More than Fetching One File
T=origin Y=? A=darwin B=? T=origin Y=1859 A=darwin B=abcd T=origin Y=1800 A=darwin Y=1859

Solutions
Authenticity Function A(doc): T or F. Evaluated at expert sites, or at all sites? Can use signature: expert, sig(doc), user
Voting Based: authentic is what majority says
Time Based: e.g., oldest version (available) is authentic
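The voting-based and time-based rules can be sketched in a few lines. Documents are represented here as (title, author, year) tuples and each time-based observation carries a first-seen timestamp; both representations, the tie-breaking behavior, and the function names are assumptions for illustration.

```python
from collections import Counter


def vote_authentic(copies: list[tuple]) -> tuple:
    """Voting based: the version reported by the most peers is taken as authentic.

    Ties go to the version encountered first.
    """
    if not copies:
        raise ValueError("no copies to vote on")
    return Counter(copies).most_common(1)[0][0]


def oldest_authentic(observations: list[tuple[float, tuple]]) -> tuple:
    """Time based: the oldest available version is taken as authentic.

    observations holds (first_seen_timestamp, document) pairs.
    """
    if not observations:
        raise ValueError("no observations")
    return min(observations, key=lambda obs: obs[0])[1]
```

Both rules are heuristics: colluding peers can outvote the genuine version, and a forger who publishes early defeats the time-based rule, which is part of why the slides treat authenticity as an open problem.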

Added Challenge: Efficiency
Example: Current music sharing. Everyone has authenticity function, but downloading files is expensive. Solution: Track peer behavior (good peers vs. bad peers)

How to Track Peer Behavior?
Trust Vector [ v1, v2, v3, v4 ] a b c d Single value between 0 and 1? Pair of values: [ total downloads, good downloads ] ?
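One of the options raised above, the pair [total downloads, good downloads], can be sketched as follows, with the single value between 0 and 1 derived from it. The class names, the neutral prior of 0.5 for unknown peers, and the selection rule are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PeerRecord:
    """Download history for one peer: [total downloads, good downloads]."""
    total: int = 0
    good: int = 0

    def record(self, authentic: bool) -> None:
        self.total += 1
        if authentic:
            self.good += 1

    def score(self, prior: float = 0.5) -> float:
        """Fraction of good downloads; ASSUMPTION: unknown peers get a neutral prior."""
        return prior if self.total == 0 else self.good / self.total


class TrustVector:
    """One node's local view of how trustworthy each other peer has been."""

    def __init__(self) -> None:
        self.records: dict[str, PeerRecord] = {}

    def record(self, peer: str, authentic: bool) -> None:
        self.records.setdefault(peer, PeerRecord()).record(authentic)

    def score(self, peer: str) -> float:
        return self.records.get(peer, PeerRecord()).score()

    def best(self, candidates: list[str]) -> str:
        """Pick the candidate with the highest score (first one wins ties)."""
        return max(candidates, key=self.score)
```

Keeping the pair rather than only the ratio lets a node distinguish a peer with one good download from a peer with a hundred, which a single value between 0 and 1 cannot express.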

Trust Operations update? a [1, .9, .5, 0, 0] .9 .5 b c
[1, 1, 0, .3, 1] [1, 0, 1, 1, .2] .3 1 .3 .2 d e

Issues Trust computations in dynamic system Overloading good nodes
Bad nodes provide good content sometimes Bad nodes can build up reputation Bad nodes can form collectives ...

Sample Results (plot: fraction of inauthentic downloads vs. fraction of malicious peers)

P2P Challenges Search Resource Management Security and Privacy

Resource Management
Nodes 1, 2, 3 with capacity = C1, C2, C3. Local work: (fraction) Ci. Remote work: (1 - fraction) Ci

Incentives for Remote Work
What is best value for the local-work fraction? How do I get remote nodes to work for me? (figure: nodes 1, 2, 3 with capacities C1, C2, C3; local work: (fraction) Ci, remote work: (1 - fraction) Ci)

Conclusion P2P systems popular today
Limewire, Kazaa … Existing P2P systems vulnerable and inefficient Many challenges ahead: Search Resource Management Security and Privacy Lots of good research to be done …

For Additional Information
Google: “Stanford Peers”, OceanStore, Tapestry, Chord

Appendix II P2P Architectures

Peer-to-Peer is Not Always Decentralized …when Centralization is Good
Nelson Minar

Talk Overview Topologies of distributed systems Strengths and weaknesses Conclusions Warning: Broad generalizations ahead

What is P2P Anyway?
Decentralized Systems: no. Popular Power fails test. Napster fails test. Most Instant Messaging fails test. Confuses topology with function.
Edge Resources: yes. Small computers on edges contribute back. All peers are active participants.

Distributed Systems Topologies
Get away from fundamentalism “Pure P2P”, “True P2P”, etc Focus instead on system architecture How do the pieces fit together? Concentrate on connection topology Which topology for which problem?

Centralized Client/server Web servers Databases Napster search
Instant Messaging Popular Power

Ring Fail-over clusters Simple load balancing Assumption Single owner

Hierarchical DNS NTP Usenet (sort of)

Decentralized Gnutella Freenet Hive Internet routing

Centralized + Centralized
N-tier apps Database heavy systems Web services gateways Grand Central

Centralized + Ring Serious web applications High availability servers

Centralized + Decentralized
Clip2 Gnutella Reflector FastTrack KaZaA Morpheus

What about other topologies?
Centralized + Hierarchical? Back end tree of information Caching architectures Decentralized + Ring? P2P network of fail-over clusters Decentralized + Hierarchical? Decentralized + Centralized?

Strengths and Weaknesses
Plenty of topologies to choose from What is each kind good for? Need a set of properties to measure Caution: What follows is very high level

Things to Measure
Manageability: How hard is it to keep working?
Information coherence: How authoritative is info? (Auditing, non-repudiation)
Extensibility: How easy is it to grow?
Fault tolerance: How well can it handle failures?
Security: How hard is it to subvert?
Resistance to legal or political intervention: How hard is it to shut down? (Can be good or bad)
Scalability: How big can it grow?

Centralized
Manageable: System is all in one place
Coherent: All information is in one place
Extensible: No one can add on to system
Fault Tolerant: Single point of failure
Secure: Simply secure one host
Lawsuit-proof: Easy to shut down
Scalable: One machine. But in practice?

Ring
Manageable: Simple rules for relationships
Coherent: Easy logic for state
Extensible: Only ring owner can add
Fault Tolerant: Fail-over to next host
Secure: As long as ring has one owner
Lawsuit-proof: Shut down owner
Scalable: Just add more hosts

Hierarchical
Manageable: Chain of authority
Coherent: Cache consistency
Extensible: Add more leaves, rebalance
Fault Tolerant: Root is vulnerable
Secure: Too easy to spoof links
Lawsuit-proof: Just shut down the root
Scalable: Hugely scalable – DNS

Decentralized
Manageable: Very difficult, many owners
Coherent: Difficult, unreliable peers
Extensible: Anyone can join in!
Fault Tolerant: Redundancy
Secure: Difficult, open research
Lawsuit-proof: No one to sue! (…but follow $)
Scalable: Theory – yes : Practice – no

Centralized + Ring Manageable Coherent Extensible Fault Tolerant Secure Lawsuit-proof Scalable Just manage the ring As coherent as ring No more than ring Ring is a huge win As secure as ring Still single place to shut down Common architecture for web applications

Centralized + Decentralized
Manageable Coherent Extensible Fault Tolerant Secure Lawsuit-proof Scalable Same as decentralized Better than decentralized Anyone can still join! Plenty of redundancy Still no one to sue Looking very hopeful Best architecture for P2P networks?

Centralized vs. Decentralized
Centralized is pretty good: Manageable, Coherent, Security
Decentralized is exciting: Extensible, Massive fault tolerance, Lawsuit-proof
Scalability is the big question

Conclusions
Centralized is easy to deal with: Major architecture for distributed systems; Combines well with rings
Decentralized is good, needs research: Coherence, Manageability, Security; Scalability
Hierarchical is overlooked
Combining architectures is powerful

Peer-to-Peer is Not Always Decentralized …when Centralization is Good
Nelson Minar Thanks to Marc Hedlund, Raffi Krikorian, Tony White

Appendix III P2P Industry

P2P Industry Outline “There’s no peer-to-peer market any more than there’s a client/server market” – Anne Manes, Sun Microsystems Peer-to-peer encompasses a wide range of technologies centered around decentralizing computing Business and revenue models are still fuzzy There are clear opportunities and research excitement

Distribution of P2P Companies
Category: Examples (Industry Share)
Distributed Computing: Entropia, United Devices (35%)
Collaboration / Knowledge Management: Groove Networks, Engenia (20%)
Content Distribution: Akamai, Proksim (10%)
Infrastructure / Platform: Akavi, Xdegrees
File Sharing: Kazaa, Napster
Distributed Search: OpenCola, Thinkstream (5%)
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng)

Major Features of P2P Industry
(From “P2P 101: An Overview of the P2P Landscape” by Larry Cheng) Lack of experienced, quality management teams Lack of detailed business models Skeptical investors 150+ active companies Estimated 95% failure rate “The elephant in the room is the fact that most companies here are not commercially viable.” - Heard from a speaker at O’Reilly

Current P2P Business Models
Sell P2P products to end-users No current revenue-generating business model Sometimes coupled with content-sale models Sell content through P2P Subscription-based – I buy content from you Sponsor-based – Someone pays you to give me content Ad-based – You give me content and sell ads

Current P2P Business Models
Sell something which lets others profit from P2P Solve a critical problem for decentralized applications Offer support and enhanced services for free tools Specialized packages for particular industries Tools and libraries for P2P infrastructure “The people most likely to make money during a Gold Rush are the ones selling pickaxes and shovels.” Andy Oram, The O’Reilly Network
