Crawling Gnutella Network
By: Samer Al-Kiswany



2 Roadmap (EECE 411)
- Introduction
- Gnutella network structure
- Gnutella protocol overview
- Gnutella crawling protocol
- Crawling topology information
- Crawling node content

3 Introduction
Gnutella is a decentralized peer-to-peer system for file sharing.
- Originally created by Justin Frankel of Nullsoft
- Large scale: up to 4M nodes, 1000 TB of data, and 100M files today
- Fast growth in its early stages: more than 50-fold during the first half of 2001 (and 50-fold again from 2001 to 2006)
- Self-organizing network
- Open, simple, and flexible protocol


5 Gnutella Network Structure
Gnutella protocol 0.6 uses a two-tier architecture of ultrapeers and leaves.
[Figure: ultrapeers and leaves]


7 Basic Primitives for File Sharing
- Join: How do I begin participating?
- Publish: How do I advertise my file(s)?
- Search: How do I find a file?
- Fetch: How do I retrieve a file?

8 Gnutella Protocol Overview
- Join: on startup, the client contacts one or more ultrapeer nodes
- Publish: not needed
- Search:
  - Ask the ultrapeer node
  - The ultrapeer propagates the query to other ultrapeers and returns the answers
- Fetch: get the file directly from the peer (HTTP)


10 Crawling a Gnutella Node
When crawling, we are interested in two main pieces of information:
- With whom is the node connected? (topology information)
  The Gnutella protocol calls this "Crawling/Communicating Network Topology Information".
- What files is the node sharing with others?
  The Gnutella protocol calls this "Browsing Host".

11 Crawling Topology Information
Gnutella protocol 0.6 supports crawling network topology information.
[Figure: a "Topo crawl" request goes to the Gnutella network and "Topo information" comes back]
Topology information:
- Ultrapeers
- Leaves

12 Crawling Topology Information
Topo crawl (crawler to node):
    GNUTELLA CONNECT/0.6
    User-Agent: LimeWire (crawl)
    X-Ultrapeer: False
    Query-Routing: 0.1
    Crawler: 0.1
Topo information (node to crawler):
    GNUTELLA/0.6 200 OK
    User-Agent: BearShare
    Leaves: 127.0.0.1:6346,127.0.0.2:6346
    Peers: 127.0.0.4:6346,127.0.0.5:6346
Final handshake message:
    GNUTELLA/0.6 200 OK
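The handshake on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course crawler: the function names (build_crawl_request, parse_endpoints, parse_crawl_response) are hypothetical, and the code assumes the header block ends with an empty line, as in HTTP-style handshakes. It only builds and parses bytes; opening the socket is left out.

```python
# Sketch only: these helper names are hypothetical, not taken from the slides.
CRLF = "\r\n"


def build_crawl_request(user_agent="LimeWire (crawl)"):
    """Return the crawler handshake from the slide as bytes."""
    lines = [
        "GNUTELLA CONNECT/0.6",
        f"User-Agent: {user_agent}",
        "X-Ultrapeer: False",
        "Query-Routing: 0.1",
        "Crawler: 0.1",
    ]
    # Assumption: the header block is terminated by an empty line.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


def parse_endpoints(value):
    """Split 'ip:port,ip:port' into a list of (ip, port) tuples."""
    endpoints = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        if host and port.isdigit():
            endpoints.append((host, int(port)))
    return endpoints


def parse_crawl_response(raw):
    """Extract status, user agent, leaves, and peers from the node's reply."""
    text = raw.decode("ascii", errors="replace")
    head = text.split(CRLF + CRLF, 1)[0]
    lines = head.split(CRLF)
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return {
        "ok": lines[0].startswith("GNUTELLA/0.6 200"),
        "agent": headers.get("user-agent"),
        "leaves": parse_endpoints(headers.get("leaves", "")),
        "peers": parse_endpoints(headers.get("peers", "")),
    }
```

Feeding the reply shown on the slide to parse_crawl_response yields the agent name "BearShare" plus two leaves and two peers as (ip, port) pairs.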

13 Browsing Node Content
[Figure: a "Browse Host" request goes to a node in the Gnutella network and a list of files comes back]

14 Browsing Node Content
Browse Host request (crawler to node):
    GET / HTTP/1.1
    Host: Crawler_IP:PORT
    User-Agent: UBCECE
    Accept: application/x-gnutella-packets
    Connection: close
List of files (node to crawler):
    HTTP/1.1 200 OK
    Server: LimeWire/x.y
    Content-Type: application/x-gnutella-packets
    Connection: close
    (body: Query Hit message)
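A rough Python sketch of the Browse Host exchange follows. The helper names (build_browse_request, split_browse_response, is_query_hit_body) are made up for illustration, and the code assumes a plain, non-chunked response that has already been read in full from the socket. The Host value is passed through unchanged, so it can carry whatever the crawler puts there.

```python
# Sketch only: hypothetical helpers; no sockets are opened here.
GNUTELLA_PACKETS = "application/x-gnutella-packets"


def build_browse_request(host_header, user_agent="UBCECE"):
    """Return the Browse Host request from the slide as bytes."""
    lines = [
        "GET / HTTP/1.1",
        f"Host: {host_header}",
        f"User-Agent: {user_agent}",
        f"Accept: {GNUTELLA_PACKETS}",
        "Connection: close",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def split_browse_response(raw):
    """Return (status_code, headers, body) from a complete raw HTTP response."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def is_query_hit_body(headers):
    """True when the body is declared as Gnutella packets rather than, say, HTML."""
    content_type = headers.get("content-type", "")
    return content_type.split(";")[0].strip().lower() == GNUTELLA_PACKETS
```

Checking the Content-Type before parsing keeps the crawler from treating an HTML file listing as binary Query Hit data.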

15 Query Hit Parsing
Query Hit message layout: | 1 | 2 | A | B | C | D | E | F | 3 |
- 1: Gnutella message header. Important field: message length.
- 2: Query Hit header. Important field: number of files.
- A-F: list of shared files; each entry includes the file name and size.
- 3: other Gnutella protocol fields.
The HTTP response may contain more than one Query Hit message:
| 1 | 2 | A | B | C | D | E | F | 3 | 1 | 2 | A | B | C | D | E | F | 3 | ...
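The parsing steps above could look roughly like the Python sketch below. The byte offsets are assumptions that follow the commonly documented Gnutella 0.4 descriptor format (23-byte header with a little-endian payload length at offset 19; Query Hit payload type 0x81; each result is index, size, file name, and an extension block, the last two NUL-terminated). Verify them against the protocol specification before relying on them. The function names are hypothetical.

```python
import struct

# Assumed layout (Gnutella 0.4 descriptor format, to be verified against the spec):
HEADER_LEN = 23    # GUID(16) + payload type(1) + TTL(1) + hops(1) + payload length(4)
QUERY_HIT = 0x81   # payload type of a Query Hit message
HIT_PREFIX = 11    # hit count(1) + port(2) + IP(4) + speed(4)


def parse_one_query_hit(payload):
    """Return [(file_name, file_size), ...] from one Query Hit payload."""
    results = []
    if len(payload) < HIT_PREFIX:
        return results
    hit_count = payload[0]
    pos = HIT_PREFIX
    try:
        for _ in range(hit_count):
            _index, size = struct.unpack_from("<II", payload, pos)
            pos += 8
            name_end = payload.index(b"\x00", pos)
            name = payload[pos:name_end].decode("utf-8", errors="replace")
            # Skip the extension block, which ends at the next NUL byte.
            pos = payload.index(b"\x00", name_end + 1) + 1
            results.append((name, size))
    except (ValueError, struct.error):
        pass  # malformed or truncated result set: keep what was parsed so far
    return results


def parse_query_hits(body):
    """Walk concatenated Gnutella messages and collect files from Query Hits."""
    files = []
    offset = 0
    while offset + HEADER_LEN <= len(body):
        payload_type = body[offset + 16]
        (payload_len,) = struct.unpack_from("<I", body, offset + 19)
        start = offset + HEADER_LEN
        end = start + payload_len
        if end > len(body):
            break  # truncated message
        if payload_type == QUERY_HIT:
            files.extend(parse_one_query_hit(body[start:end]))
        offset = end  # the message length tells us where the next message starts
    return files
```

The message length in part 1 is what makes multi-message responses workable: it is the only way to find where the next header begins.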

16 Limitations: Does This Always Work?
- Topology crawling: some Gnutella protocol v0.4 implementations do not support topology information crawling.
- Host browsing: some Gnutella node implementations (BearShare, for instance) return the list of files as HTML and do not respond with a Query Hit message.


18 Single Gnutella-Node Crawler
A proof-of-concept implementation of a single Gnutella-node crawler is available at:
http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
The main class that implements the crawling protocol is the Crawler class:
- crawlpeers(ip_address, port)
- parsePeers(byte[])
- listFiles(ip_address, port)
- processQueryHit(byte[])

19 Project Phase II
Implement a single-node Gnutella network crawler.
Report:
- The active leaf nodes
- Information regarding the "agent" (i.e., the implementation: LimeWire, BearShare, etc.)
- The domain name corresponding to the node's IP address
Avoid cycles!
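One common way to avoid cycles is a breadth-first crawl with a visited set, sketched below in Python. This is an illustration under assumptions, not the required project design: crawl, reverse_dns, and the get_neighbors callback are hypothetical names, and get_neighbors is assumed to return a dict with "agent", "leaves", and "peers" keys (or None when the node is unreachable).

```python
import socket
from collections import deque


def reverse_dns(ip):
    """Return the domain name for an IP address, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def crawl(seed, get_neighbors, resolve=reverse_dns, max_nodes=1000):
    """Breadth-first crawl from seed; each (ip, port) is contacted at most once."""
    visited = {seed}          # every node ever queued, so cycles cannot re-add it
    queue = deque([seed])
    report = {}
    while queue and len(report) < max_nodes:
        node = queue.popleft()
        info = get_neighbors(node)   # hypothetical callback doing the topology crawl
        if info is None:
            continue                 # unreachable node: skip it
        report[node] = {
            "agent": info.get("agent"),
            "leaves": list(info["leaves"]),
            "domain": resolve(node[0]),
        }
        for neighbor in list(info["leaves"]) + list(info["peers"]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return report
```

Marking a node as visited when it is queued, rather than when it is crawled, prevents the same node from being queued twice by two different neighbors.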

20 Project Phase III
Implement a master/worker crawler with Java NIO sockets.
[Figure: Gnutella network; primary master holding "Crawled" and "To be Crawled" lists; messages "Crawl the following list: ..." and "Results: peer IPs, statistics"]
Problems? (Hint: failures)

21 Project Phase III
Implement a master/worker crawler with Java NIO sockets.
Adopt primary/backup replication for the manager.
[Figure: Gnutella network; primary master holding "Crawled" and "To be Crawled" lists; backup master; an X marks a failure]
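The bookkeeping behind primary/backup replication can be sketched in Python as follows. This is an in-process toy under stated assumptions, not the project solution: the Master class and its methods are hypothetical, replication is a direct method call (a real system would send the state over the network, for example with NIO sockets), and failure detection is left out.

```python
class Master:
    """Tracks crawl state and mirrors every change to an optional backup."""

    def __init__(self, seeds=(), backup=None):
        self.pending = list(seeds)   # "To be Crawled"
        self.assigned = {}           # node -> worker currently crawling it
        self.crawled = set()         # "Crawled"
        self.backup = backup
        self._replicate()

    def snapshot(self):
        return {
            "pending": list(self.pending),
            "assigned": dict(self.assigned),
            "crawled": set(self.crawled),
        }

    def restore(self, snap):
        self.pending = list(snap["pending"])
        self.assigned = dict(snap["assigned"])
        self.crawled = set(snap["crawled"])

    def _replicate(self):
        # Assumption: full-state copy after every change; simple but not efficient.
        if self.backup is not None:
            self.backup.restore(self.snapshot())

    def _known(self, node):
        return node in self.crawled or node in self.assigned or node in self.pending

    def next_batch(self, worker, size):
        """Hand the next `size` pending nodes to a worker."""
        batch, self.pending = self.pending[:size], self.pending[size:]
        for node in batch:
            self.assigned[node] = worker
        self._replicate()
        return batch

    def report(self, worker, node, discovered):
        """Record a finished node and queue any newly discovered peers."""
        self.assigned.pop(node, None)
        self.crawled.add(node)
        for peer in discovered:
            if not self._known(peer):
                self.pending.append(peer)
        self._replicate()

    def worker_failed(self, worker):
        """Put a failed worker's unfinished nodes back on the pending list."""
        lost = [n for n, w in self.assigned.items() if w == worker]
        for node in lost:
            del self.assigned[node]
        self.pending.extend(lost)
        self._replicate()
```

Because the backup receives a copy of the state after every change, it can start answering next_batch calls from the same point if the primary stops responding.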

22 Previous Years' Ideas, Part I
Programming languages / frameworks / protocols:
- Java (the vast majority)
- Scala
- Apache MINA framework
- Java RMI
- Jython
- XML-RPC
- SQL
- Python/Perl/Shell/cron jobs
Architecture:
- Master/worker (the majority)
- Hierarchical

23 Previous Years' Ideas, Part II
Design choices:
- NIO at both master and workers
- Careful load balancing
- Keep the workers always busy
- Bootstrapping new workers if old workers fail
Additional bells and whistles:
- GUI manager
- Statistics in real time through GUI and web page
- Graphviz

24 References
- Single Gnutella-Node Crawler: http://www.ece.ubc.ca/~samera/TA/project/sgnc.html
- Gnutella crawling protocol: http://www.ece.ubc.ca/~samera/TA/project/Gnuttela-Protocol.html
- Other references: http://gnutella-specs.rakjar.de/index.php/Main_Page and www.limewire.com

25 Thank you
www.ece.ubc.ca/~samera

