
1 School of Computing Science, Simon Fraser University
CMPT 765/408: P2P Systems
Instructor: Dr. Mohamed Hefeeda

2 P2P Computing: Definitions
 Peers cooperate to achieve desired functions
-Peers: end-systems (typically user machines), interconnected through an overlay network; peer ≡ like the others (similar, or behaving in a similar manner)
-Cooperate: share resources (e.g., data, CPU cycles, storage, bandwidth) and participate in protocols (e.g., routing, replication, …)
-Functions: file sharing, distributed computing, communications, content distribution, …
 Note: the P2P concept is much wider than file sharing

3 When Did P2P Start?
 Napster (late 1990s)
-Court shut Napster down in 2001
 Gnutella (2000)
 Then the killer FastTrack (Kazaa, …)
 BitTorrent, and many others
 Accompanied by significant research interest
 Claim
-P2P is much older than Napster!
 Proof
-The original Internet!
-Remember UUCP (unix-to-unix copy)?

4 What IS and IS NOT New in P2P?
 What is not new
-Concepts!
 What is new
-The term P2P (maybe!)
-New characteristics of the nodes that constitute the system we build

5 What IS NOT New in P2P?
 Distributed architectures
 Distributed resource sharing
 Node management (join/leave/fail)
 Group communications
 Distributed state management
 …

6 What IS New in P2P?
 Nodes (peers)
-Quite heterogeneous: several orders of magnitude difference in resources; compare the bandwidth of a dial-up peer versus a high-speed LAN peer
-Unreliable: failure is the norm!
-Offer limited capacity: load sharing and balancing are critical
-Autonomous: rational, i.e., they maximize their own benefit! Incentives should be provided so that peers cooperate in a way that optimizes system performance

7 What IS New in P2P? (cont'd)
 System
-Scale: huge number of peers (millions)
-Structure and topology: ad hoc (no control over peers joining/leaving), highly dynamic
-Membership/participation: typically open
-More security concerns: trust, privacy, data integrity, …
-Cost of building and running: a small fraction of same-scale centralized systems; how much would it cost to build and run a supercomputer with the processing power of those 3 million SETI@home PCs?

8 What IS New in P2P? (cont'd)
 So what?
 We need to design new, lighter-weight algorithms and protocols that scale to millions (or billions!) of nodes given the new characteristics
 Question: why now, not two decades ago?
-We did not have such abundant (and underutilized) computing resources back then!
-And network connectivity was very limited

9 Why is it Important to Study P2P?
 P2P traffic is a major portion of Internet traffic (50+%), the current killer app
 P2P traffic has exceeded web traffic (the former killer app)!
 Direct implications on the design, administration, and use of computer networks and network resources
-Think of ISP designers or campus network administrators
 Many potential distributed applications

10 Sample P2P Applications
 File sharing: Gnutella, Kazaa, BitTorrent, …
 Distributed cycle sharing: SETI@home, Genome@home, …
 File and storage systems: OceanStore, CFS, Freenet, Farsite, …
 Media streaming and content distribution: PROMISE, SplitStream, CoopNet, PeerCast, Bullet, Zigzag, NICE, …

11 P2P vs. its Cousin (Grid Computing)
 Common goal
-Aggregate resources (e.g., storage, CPU cycles, and data) into a common pool and provide efficient access to them
 Differences along five axes [Foster & Iamnitchi 03]
-Target communities and applications
-Type of shared resources
-Scalability of the system
-Services provided
-Software required

12 P2P vs. Grid Computing (cont'd)
 Communities and applications
-Grid: established communities (e.g., scientific institutions); computationally-intensive problems
-P2P: grass-root communities (anonymous); mostly file swapping
 Resources shared
-Grid: powerful and reliable machines, clusters, high-speed connectivity, specialized instruments
-P2P: PCs with limited capacity and connectivity; unreliable; very diverse

13 P2P vs. Grid Computing (cont'd)
 System scalability
-Grid: hundreds to thousands of nodes
-P2P: hundreds of thousands to millions of nodes
 Services provided
-Grid: sophisticated services (authentication, resource discovery, scheduling, access control, and membership control); members usually trust each other
-P2P: limited services (resource discovery); limited trust among peers
 Software required
-Grid: sophisticated suite, e.g., Globus, Condor
-P2P: simple ("screen saver"), e.g., Kazaa, SETI@home

14 P2P vs. Grid Computing: Discussion
 The differences mentioned are based on the traditional view of each paradigm
-It is expected that both paradigms will converge and complement each other [e.g., Butt et al. 03]
 Target communities and applications
-Grid is opening up to broader communities
 Type of shared resources
-P2P will include more varied and more powerful resources
 Scalability of the system
-Grid will grow its number of nodes
 Services provided
-P2P will provide authentication, data integrity, trust management, …

15 P2P Systems: Simple Model
 Software architecture model on a peer (layers, bottom to top): Hardware, Operating System, P2P Substrate, Middleware, P2P Application
 System architecture: peers form an overlay according to the P2P substrate

16 Overlay Network
 An abstract layer built on top of the physical network
 Neighbors in the overlay can be several hops away in the physical network
 Why do we need overlays?
-Flexibility in choosing neighbors, forming and customizing the topology to fit the application's needs (e.g., short delay, reliability, high bandwidth, …), and designing communication protocols among nodes
-Get around limitations in legacy networks
-Enable new (and old!) network services

17 Overlay Network (cont'd)
(figure: overlay links drawn on top of the physical network)

18 Overlay Network (cont'd)
 Some applications that use overlays
-Application-level multicast, e.g., ESM, Zigzag, NICE, …
-Reliable inter-domain routing, e.g., RON
-Content distribution networks (CDNs)
-Peer-to-peer file sharing
 Overlay design issues
-Select neighbors
-Handle node arrivals and departures
-Detect and handle failures (nodes, links)
-Monitor and adapt to network dynamics
-Match the overlay to the underlying physical network

19 Overlay Network (cont'd)
Recall: IP multicast (figure: multicast distribution tree rooted at the source)

20 Overlay Network (cont'd)
Application-Level Multicast (ALM) (figure: ALM distribution tree rooted at the source)

21 Peer Software Model
 A software client installed on each peer
 Three components:
-P2P substrate
-Middleware
-P2P application
(figure: software model on a peer, layers bottom to top: Hardware, Operating System, P2P Substrate, Middleware, P2P Application)

22 Peer Software Model (cont'd)
 P2P substrate (key component)
-Overlay management: construction; maintenance (peer join/leave/fail and network dynamics)
-Resource management: allocation (storage); discovery (routing and lookup)
 Ex: Pastry, CAN, Chord, …
 More on this later

23 Peer Software Model (cont'd)
 Middleware
-Provides auxiliary services to P2P applications: peer selection, trust management, data integrity validation, authentication and authorization, membership management, accounting (economics and rationality), …
-Ex: CollectCast, EigenTrust, micropayment schemes

24 Peer Software Model (cont'd)
 P2P application
-Potentially, there could be multiple applications running on top of a single P2P substrate
-Applications include file sharing, file and storage systems, distributed cycle sharing, and content distribution
-This layer provides functions and bookkeeping relevant to the target application: file assembly (file sharing), buffering and rate smoothing (streaming)
 Ex: PROMISE, Bullet, CFS

25 P2P Substrate
 Key component, which
-Manages the overlay
-Allocates and discovers objects
 P2P substrates can be structured or unstructured, classified by the flexibility of placing objects at peers

26 P2P Substrates: Classification
 Structured (or tightly controlled, DHT)
−Objects are rigidly assigned to specific peers
−Looks like a distributed hash table (DHT)
−Efficient search and a guarantee of finding
−Lacks partial-name and keyword queries
−Maintenance overhead
−Ex: Chord, CAN, Pastry, Tapestry, Kademlia (Overnet)
 Unstructured (or loosely controlled)
−Objects can be anywhere
−Supports partial-name and keyword queries
−Inefficient search and no guarantee of finding
−Some heuristics exist to enhance performance
−Ex: Gnutella, Kazaa (super node), GIA [Chawathe et al. 03]

27 Structured P2P Substrates
 Objects are rigidly assigned to peers
−Objects and peers have IDs (usually obtained by hashing some attributes)
−Objects are assigned to peers based on those IDs
 Peers in the overlay form a specific geometric shape, e.g., tree, ring, hypercube, butterfly network
 The shape (to some extent) determines
−How neighbors are chosen, and
−How messages are routed

28 Structured P2P Substrates (cont'd)
 The substrate provides a Distributed Hash Table (DHT)-like interface
−insertObject(key, value), findObject(key), …
−In the literature, many authors refer to structured P2P substrates as DHTs
 It also provides peer management (join, leave, fail) operations
 Most of these operations are done in O(log n) steps, where n is the number of peers
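To make the interface concrete, here is a minimal sketch in Python of what such a DHT-like substrate exposes. The class and method names are illustrative only and do not correspond to any particular system; in a real substrate, route() would forward through overlay neighbors in O(log n) hops instead of returning the local node.

```python
import hashlib

class DHTNode:
    """Illustrative sketch of a DHT-style substrate node (not a real system)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}       # portion of the hash table this peer is responsible for
        self.neighbors = []   # overlay neighbors, maintained by join/leave/fail handling

    @staticmethod
    def key_of(name):
        # Object IDs are usually obtained by hashing some attributes (here, the name).
        return int(hashlib.sha1(name.encode()).hexdigest(), 16)

    def route(self, key):
        # Placeholder: a real substrate (Chord, CAN, Pastry, ...) forwards through
        # overlay neighbors until it reaches the peer responsible for `key`.
        return self

    def insert_object(self, name, value):
        key = self.key_of(name)
        self.route(key).store[key] = value

    def find_object(self, name):
        key = self.key_of(name)
        return self.route(key).store.get(key)
```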

29 Structured P2P Substrates (cont'd)
 DHTs: efficient search and a guarantee of finding
 However,
−No partial-name and keyword queries
−Maintenance overhead; even O(log n) may be too much in very dynamic environments
 Ex: Chord, CAN, Pastry, Tapestry, Kademlia (Overnet)

30 Example: Content Addressable Network (CAN) [Ratnasamy 01]
−Nodes form an overlay in a d-dimensional space
−Node IDs are chosen randomly from the d-space
−Object IDs (keys) are chosen from the same d-space
−The space is dynamically partitioned into zones
−Each node owns a zone
−Zones are split and merged as nodes join and leave
−Each node stores
−The portion of the hash table that belongs to its zone
−Information about its immediate neighbors in the d-space
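As a rough illustration of zones and key placement, here is a Python sketch with made-up helper names; the 0-7 coordinate range mirrors the figures on the next slides.

```python
import hashlib

def hash_to_point(name, dims=2, side=8):
    """Map an object name to a point in the d-dimensional space (0..side-1 per axis)."""
    digest = hashlib.sha1(name.encode()).digest()
    return tuple(digest[i] % side for i in range(dims))

class Zone:
    """A node's zone: an axis-aligned box, [lo, hi) in each dimension."""
    def __init__(self, lo, hi):
        self.lo, self.hi = list(lo), list(hi)

    def contains(self, point):
        return all(self.lo[i] <= point[i] < self.hi[i] for i in range(len(point)))

    def volume(self):
        v = 1
        for lo, hi in zip(self.lo, self.hi):
            v *= hi - lo
        return v

# Example: a node owning the left half of an 8x8 space stores the keys hashing into it.
left_half = Zone((0, 0), (4, 8))
print(left_half.contains(hash_to_point("some-file.mp3")))
```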

31 2-d CAN: Dynamic Space Division
(figure: nodes n1-n5 partition the 0-7 x 0-7 coordinate space into zones)

32 2-d CAN: Key Assignment
(figure: keys K1-K4 are stored at the nodes whose zones contain their points)

33 2-d CAN: Routing (Lookup)
(figure: a lookup for K4 is routed across neighboring zones to the node that owns it)

34 CAN: Routing
−Nodes keep 2d = O(d) pieces of state (neighbor coordinates, IP addresses)
−Constant; does not depend on the number of nodes n
−Greedy routing
-Forward to the neighbor that is closest to the destination
-On average, a lookup takes O(n^(1/d)) hops, which is O(log n) when d = (log n)/2
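A sketch of one greedy routing step, reusing the Zone class from the sketch above (the peer class and field names are illustrative): forward to the neighbor whose zone center is closest to the destination point.

```python
import math
from dataclasses import dataclass, field

@dataclass(eq=False)
class CANPeer:
    zone: "Zone"                                  # Zone class from the previous sketch
    neighbors: list = field(default_factory=list)

def center(zone):
    return tuple((lo + hi) / 2.0 for lo, hi in zip(zone.lo, zone.hi))

def greedy_next_hop(peer, key_point):
    """If this peer owns the point, routing is done (return None); otherwise
    pick the neighbor whose zone center is closest to the destination."""
    if peer.zone.contains(key_point):
        return None
    return min(peer.neighbors,
               key=lambda nb: math.dist(center(nb.zone), key_point))
```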

35 CAN: Node Join
−The new node finds a node already in the CAN
−(bootstrap: one (or a few) dedicated nodes outside the CAN maintain a partial list of active nodes)
−It finds a node whose zone will be split
−Choose a random point P (which will be its ID)
−Forward a JOIN request to P through the existing node
−The node that owns P splits its zone and sends half of its routing table to the new node
−Neighbors of the split zone are notified
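The zone split on join might look like the following sketch (illustrative, reusing the Zone class from above; a real CAN follows a fixed ordering of split dimensions, while here we simply halve the widest one). The owner keeps one half and hands the other half, with the keys that fall in it, to the new node.

```python
def split_zone(zone):
    """Halve a zone along its widest dimension; return (kept_half, given_half)."""
    widths = [hi - lo for lo, hi in zip(zone.lo, zone.hi)]
    d = widths.index(max(widths))
    mid = (zone.lo[d] + zone.hi[d]) / 2.0

    kept = Zone(zone.lo, zone.hi)    # Zone's constructor copies the bounds
    kept.hi[d] = mid
    given = Zone(zone.lo, zone.hi)
    given.lo[d] = mid
    return kept, given

def handover_entries(store, new_zone):
    """Move the (point, value) entries that now fall inside the new node's zone."""
    moved = {p: v for p, v in store.items() if new_zone.contains(p)}
    for p in moved:
        del store[p]
    return moved
```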

36 CAN: Node Leave, Fail
−Graceful departure
−The leaving node hands over its zone to one of its neighbors
−Failure
−Detected by the absence of heartbeat messages that are sent periodically during regular operation
−Neighbors start takeover timers proportional to the volume of their zones
−The neighbor with the smallest timer takes over the zone of the dead node and notifies the other neighbors so they cancel their timers (some negotiation between neighbors may occur)
−Note: the (key, value) entries stored at the failed node are lost
−Nodes that insert (key, value) pairs periodically refresh (re-insert) them

37 CAN: Discussion
−Scalable
−O(log n) steps for operations
−State information is O(d) at each node
−Locality
−Nodes are neighbors in the overlay, not in the physical network
−Suggestion (for better routing): each node measures the RTT between itself and its neighbors and forwards each request to the neighbor with the maximum ratio of progress to RTT
−Maintenance cost
−Logarithmic, but may still be too much for very dynamic P2P systems
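The RTT-aware forwarding suggestion can be sketched as follows, reusing center() and the peer sketch from the CAN routing example; neighbor_rtts is a hypothetical list of (neighbor, measured RTT) pairs supplied by the caller.

```python
import math

def rtt_aware_next_hop(peer, key_point, neighbor_rtts):
    """Forward to the neighbor with the largest ratio of progress toward the
    destination (reduction in distance) to measured RTT."""
    here = math.dist(center(peer.zone), key_point)

    def score(item):
        nb, rtt = item
        progress = here - math.dist(center(nb.zone), key_point)
        return progress / rtt

    best_neighbor, _ = max(neighbor_rtts, key=score)
    return best_neighbor
```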

38 Unstructured P2P Substrates
−Objects can be anywhere: loosely-controlled overlays
−The loose control
−Makes the overlay tolerate the transient behavior of nodes; for example, when a peer leaves, nothing needs to be done because there is no structure to restore
−Enables the system to support flexible search queries: queries are sent in plain text and every node runs a mini-database engine
−But we lose on search
−Usually done by flooding, which is inefficient
−Some heuristics exist to enhance performance
−No guarantee of locating a requested object (e.g., rarely requested objects)
−Ex: Gnutella, Kazaa (super node), GIA [Chawathe et al. 03]

39 Example: Gnutella
−Peers are called servents
−All peers form an unstructured overlay
−Peer join
−Find an active peer already in Gnutella (e.g., contact known Gnutella hosts)
−Send a Ping message through the active peer
−Peers willing to accept new neighbors reply with a Pong
−Peer leave, fail
−Just drop out of the network!
−To search for a file
−Send a Query message to all neighbors with a TTL (= 7)
−Upon receiving a Query message: check the local database and reply with a QueryHit to the requester; decrement the TTL and forward to all neighbors if it is nonzero
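A toy simulation of the flooding search described above (a sketch, not the Gnutella wire protocol): peers are assumed to be plain objects with a `files` set and a `neighbors` list.

```python
from collections import deque

DEFAULT_TTL = 7

def flood_query(origin, keyword, ttl=DEFAULT_TTL):
    """Breadth-first flood: each visited peer checks its local files (the
    "QueryHit" step), then forwards to all neighbors while TTL > 0."""
    hits = []
    seen = {id(origin)}
    queue = deque([(origin, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        hits.extend(f for f in peer.files if keyword in f)
        if remaining == 0:
            continue
        for nb in peer.neighbors:
            if id(nb) not in seen:
                seen.add(id(nb))
                queue.append((nb, remaining - 1))
    return hits
```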

40 Flooding in Gnutella: Scalability Problem
(figure: a query flooded to all neighbors fans out across the overlay)

41 Heuristics for Searching [Yang and Garcia-Molina 02]
−Iterative deepening
−Multiple BFS passes with increasing TTLs
−Reduces traffic but increases response time
−Directed BFS
−Send to "good" neighbors (the subset of your neighbors that returned many results in the past); requires keeping history
−Local indices
−Keep a small index over the files stored on neighbors (within some number of hops)
−May answer queries on their behalf
−Saves the cost of sending queries over the network
−Open issue: keeping the indices current
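Iterative deepening can be sketched on top of the flooding routine above (the TTL schedule is illustrative): reissue the query with growing TTLs and stop as soon as any results come back, trading response time for reduced traffic.

```python
def iterative_deepening(origin, keyword, ttl_schedule=(2, 4, 7)):
    """Run successive floods with increasing TTLs; stop at the first depth
    that yields results, so nearby copies are found without a full flood."""
    for ttl in ttl_schedule:
        hits = flood_query(origin, keyword, ttl=ttl)
        if hits:
            return hits
    return []
```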

42 Heuristics for Searching: Super Node
−Used in Kazaa (its signaling protocols are encrypted)
−Studied in [Chawathe et al. 03]
−Relatively powerful nodes play a special role: they maintain indexes over other peers

43 Unstructured Substrates with Super Nodes
(figure: a two-tier overlay of Super Nodes (SN) and Ordinary Nodes (ON))

44 Example: FastTrack Networks (Kazaa)
−Most of the information and plots in the following slides are from "Understanding Kazaa" by Liang et al.
−The most popular P2P system (~3 million active users on a typical day), sharing 5,000 terabytes
−Kazaa traffic exceeds web traffic
−Two-tier architecture (with Super Nodes and Ordinary Nodes)
−Each SN maintains an index of the files stored at the ONs attached to it
−Each ON reports to its SN the following metadata for each file: file name, file size, ContentHash, and file descriptors (artist name, album name, …)

45 FastTrack Networks (cont'd)
−Mainly two types of traffic
−Signaling
−Handshaking, connection establishment, uploading metadata, …
−Encrypted! (some reverse-engineering efforts)
−Over TCP connections between SN-SN and SN-ON
−Analyzed in [Liang et al. 04]
−Content traffic
−Files exchanged; not encrypted
−All through HTTP between ON-ON
−Detailed analysis in [Gummadi et al. 03]

46 Kazaa (cont'd)
−File search
−The ON sends a query to its SN
−The SN replies with a list of IPs of ONs that have the file
−The SN may forward the query to other SNs
−Parallel downloads then take place between the supplying ONs and the receiving ON
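A minimal sketch of this two-tier lookup (names are made up; this is not Kazaa's actual, encrypted protocol): the ON queries its SN, which answers from the metadata index uploaded by its ONs and may also ask other SNs.

```python
class SuperNode:
    def __init__(self):
        self.index = {}       # file name -> set of IPs of ONs holding the file
        self.peer_sns = []    # other super nodes this SN is connected to

    def register(self, on_ip, file_names):
        """Called when an ON uploads its metadata (file name, size, ContentHash, ...)."""
        for name in file_names:
            self.index.setdefault(name, set()).add(on_ip)

    def query(self, file_name, forward=True):
        """Return IPs of ONs that have the file; optionally forward to other SNs."""
        results = set(self.index.get(file_name, set()))
        if forward:
            for sn in self.peer_sns:
                results |= sn.query(file_name, forward=False)
        return results
```

The requesting ON would then open parallel HTTP downloads to the returned addresses, as described on the slide.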

47 FastTrack Networks (cont'd)
−Measurement study of Liang et al.
−Hook three machines into Kazaa and wait until one of them is promoted to SN
−Connect the other two (ONs) to that SN
−Study several properties
−Topology structure and dynamics
−Neighbor selection
−Super node lifetime
−…

48 Kazaa: Topology Structure [Liang et al. 04]
−ON to SN: 100-160 connections per SN; since there are ~3M nodes, there are ~30,000 SNs
−SN to SN: 30-50 connections; each SN connects to ~0.1% of the total number of SNs
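As a rough consistency check on these figures: with about 3 million nodes and on the order of 100-160 ONs per SN, 3,000,000 / 100 ≈ 30,000 SNs; and 30-50 SN-SN connections out of ~30,000 SNs is indeed on the order of 0.1%.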

49 Kazaa: Topology Dynamics [Liang et al. 04]
−Average ON-SN connection duration
−~1 hour, after removing very short-lived connections (30 sec) used for "shopping" for SNs
−Average SN-SN connection duration
−23 min, which is short because of
−Connection shuffling between SNs to allow ONs to reach a larger set of objects
−SNs searching for other SNs with smaller loads
−SNs connecting to each other from time to time to exchange SN lists (each SN stores 200 other SNs in its cache)

50 Kazaa: Neighbor Selection [Liang et al. 04]
−When an ON first joins, it gets a list of 200 SNs
−The ON considers locality and SN workload in selecting its future SN
−Locality
−40% of ON-SN connections have RTT < 5 msec
−60% of ON-SN connections have RTT < 50 msec
−For comparison, the RTT from the eastern US to Europe is ~100 msec

51 Kazaa: Lifetime and Signaling Overhead [Liang et al. 04]
−Super node average lifetime is ~2.5 hours
−Overhead:
−161 Kb/s upstream
−191 Kb/s downstream
−Hence, most SNs are on high-speed connections (campus networks or cable)

52 Kazaa vs. Firewalls, NAT [Liang et al. 04]
−The default port WAS 1214
−Easy for firewalls to filter out Kazaa traffic
−Now Kazaa uses dynamic ports
−Each peer chooses its own random port
−An ON reports its port to its SN
−Ports of SNs are part of the SN refresh list exchanged among peers
−Too bad for firewalls!
−Network Address Translators (NAT)
−A requesting peer cannot establish a direct connection with a serving peer behind a NAT
−Solution: connection reversal
−Send to the SN of the NATed peer, which already has a connection with it
−The SN tells the NATed peer to establish a connection with the requesting peer!
−The transfer then occurs happily through the NAT
−Both peers behind NATs?

53 Kazaa: Lessons [Liang et al. 04]
−Distributed design
−Exploit heterogeneity
−Load balancing
−Locality in neighbor selection
−Connection shuffling
−If a peer searches for a file and does not find it, it may try later and get it!
−Efficient gossiping algorithms
−To learn about other SNs and perform shuffling
−Kazaa uses a "freshness" field in the SN refresh list, so a peer ignores stale data
−Consider peers behind NATs and firewalls
−They are everywhere!

54 Summary
 P2P is an active research area with many potential applications in industry and academia
 In the P2P computing paradigm, peers cooperate to achieve desired functions
 New characteristics: heterogeneity, unreliability, rationality, scale, ad hoc structure
-Hence, new and lighter-weight algorithms are needed
 Simple model for P2P systems:
-Peers form an abstract layer called an overlay
-A peer software client may have three components: P2P substrate, middleware, and P2P application
-Borders between components may be blurred

55 Summary (cont'd)
 P2P substrate: a key component, which
-Manages the overlay
-Allocates and discovers objects
 P2P substrates can be
-Structured (DHT), example: CAN
-Unstructured, examples: Gnutella, Kazaa

