Dynamics of Peer-to-Peer Networks or Who is Going to be The Next Pop Star? Yuval Shavitt School of Electrical Engineering firstname.lastname@example.org http://www.eng.tau.ac.il/~shavitt
Credits Talk is based on the papers: Static and dynamic characterization of the Gnutella network [Shaked-Gish, S, Tankel, IPTPS 2007] How to predict the next pop star? [Koenigstein, S, Tankel, KDD 2008]
What are Peer-to-Peer Networks? The common computing paradigm is client-server – Server waits for requests (on a known port) – Client sends a request – Server serves the client – Examples: WWW, FTP, SMTP (e-mail), ... Peer-to-peer networks: – Each end-point is both client and server
Talk Outline Static and dynamic characterization of the Gnutella network [Shaked-Gish, S, Tankel, IPTPS 2007] How to predict the next pop star? [Koenigstein, S, Tankel, KDD 2008] Highlights from – Song Ranking Based on Piracy in Peer-to-Peer Networks. [Koenigstein & S, ISMIR 2009] – Predicting Billboard Success Using Data-Mining in P2P Networks. [Koenigstein, S, Zilberman, AdMIRe 2009] – Songs Clustering Using Peer-to-Peer Co-occurrences. [S, Weinsberg, AdMIRe 2009] – Estimating Peer Similarity using Distance of Shared Files [S, Weinsberg, Weinsberg, IPTPS 2010]
The Gnutella Network Gnutella: The most popular sharing network on the Internet According to the Digital Music News Research Group 40% market share in Q4 2007 Limewire: The most popular file sharing client in the world. Dominates the Gnutella network.
The Gnutella Protocol Originally – A distributed algorithm where all nodes are equally responsible for the routing. However, since regular users constantly connect and disconnect from the network, this caused an unstable topology. Originally – No IP in queries => Track back query results Today - a tiered system of ultrapeers and leaves. Regular home users (leaves) are kept on the edges of the graph. Stable nodes are promoted to ultrapeers, accepting leaf connections and taking responsibility for routing queries. Today - Queries carry OOB address: The originator’s address or, in most cases when the client is firewalled, the ultrapeer’s address
The Gnutella Protocol Originally: a flat peer-to-peer distributed protocol. – Churn caused instability Today: a 2-level tiered system – Stable nodes are promoted to become ultrapeers – Queries carry OOB address: The originator’s address or in most cases when the client is firewalled, this is the ultrapeer’s address
Locating the Origin IP address IP resolution Process: Detect the ultrapeer (UP) IP Discard queries with more than 2 hops Discard queries with 2 hops and same IP Intercept queries with 2 hops and different IPs (Diagram: peer, UP, listener, peer) Cancels the bias for rare queries Introduces bias against firewalled clients
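The IP resolution steps above can be sketched as a small filter. This is only an illustration under stated assumptions: the `Query` record and its field names are hypothetical, "same IP" is read as "the OOB address equals the address of the ultrapeer that forwarded the query", and queries with fewer than 2 hops (not covered by the slide) are simply dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical record for an intercepted query (field names are assumptions)."""
    text: str          # the query string
    hops: int          # number of hops the query has travelled
    oob_ip: str        # out-of-band reply address carried in the query
    ultrapeer_ip: str  # address of the ultrapeer that forwarded it to the listener


def origin_ip(q: Query) -> Optional[str]:
    """Return the originator's IP, or None when the query is discarded."""
    if q.hops > 2:
        # Too far from the origin: the forwarding ultrapeer is not the origin's ultrapeer.
        return None
    if q.hops == 2 and q.oob_ip == q.ultrapeer_ip:
        # The OOB address is the ultrapeer's own, so the client is firewalled.
        return None
    if q.hops == 2:
        # The OOB address differs from the ultrapeer's: treat it as the originator's.
        return q.oob_ip
    # Fewer than 2 hops is not described on the slide; dropped here by assumption.
    return None
```

Dropping firewalled clients in the second branch is where the stated bias against them would come from in this reading.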
Data Sets First study: – Jul 2006 - Nov 2006 – 665,000,000 world-wide geo-identified queries Second study – Oct 2006 – Jul 2007, Sundays only – 310,000,000 USA geo-identified queries A network crawl of 24 hours – 1.2M users – 533,000 different songs Largest studies ever performed in length and depth
How to Predict Artist’s Success? Noam Koenigstein, Y. Shavitt, and Tomer Tankel. Spotting Out Emerging Artists Using Geo-Aware Analysis of P2P Query Strings. The 2008 ACM SIGKDD Conference, August 2008, Las Vegas, NV, USA.
The Word of Mouth Effect Unsuccessful product: a uniform spatial distribution Successful innovation: formation of adopter clusters around early adopters The divergence can be used to predict a new product’s success probability [Garber et al., Marketing Science 2004]
The divergence When measured against the uniform distribution, the maximum is achieved when P is a delta function. – True for both Kullback-Leibler and Jensen-Shannon – This is the case when emerging artists are considered Non-uniform distribution of potential adopters:
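A minimal sketch of the two divergences measured against the uniform distribution, using the standard textbook definitions with base-2 logarithms. The function names and the four-location example are assumptions for illustration, not taken from the papers. Both measures are largest when all of the probability mass sits on a single location.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def divergence_from_uniform(p, measure=kl):
    """Divergence of a spatial distribution p from the uniform one over the same locations."""
    n = len(p)
    uniform = [1.0 / n] * n
    return measure(p, uniform)


# Hypothetical query shares over four locations.
concentrated = [1.0, 0.0, 0.0, 0.0]   # all queries come from one location
spread = [0.5, 0.5, 0.0, 0.0]         # queries split between two locations
flat = [0.25, 0.25, 0.25, 0.25]       # uniform: no geographic signal
```

With these inputs, `divergence_from_uniform(concentrated)` is the largest of the three under either measure, and `divergence_from_uniform(flat)` is zero.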
Party Like a Rockstar in 2007 Week 6: The string “party like a rockstar” is detected by the algorithm Week 8: Atlanta’s popularity chart in (Feb 18th) Week 15: Atlanta-based Shop Boyz sign a contract with Universal Recordings Week 18: The song first enters the Billboard Hot 100 (80th position) Week 23: Reached 2nd position on the Billboard Hot 100 Ranked only 10,156 on the global chart
Party Like a Rockstar Shop Boyz related queries in February 2007 Shop Boyz Popularity and Divergence in 2007
Soulja Boy Detected by our algorithm already in 2006. The string “soulja boy” entered the “Atlanta queries top 100” in October 2006 Entered the Bubbling Under R&B/Hip-Hop Singles on the 23rd of June 2007 Later ranked first in the following Billboard charts: Hot 100, Hot Rap Tracks, Hot Videoclip, Hot RingMasters and Hot Ringtones
Yung Berg Active in LA Week 2: Entered LA top 100 Week 15: First appeared on the Billboard charts Week 32: Reached 18 on the Billboard Top 100
The Detection Algorithm Input: A list of geo-identified P2P query strings Output: A list of locally popular query strings with a high probability of becoming globally popular Build local and global popularity charts Local popularity is detected using local and global popularity thresholds Look for local popularity growth trends from week to week Filtering: Non-music related content and already familiar artists are characterized by a uniform distribution
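The steps above can be sketched roughly as follows. This is not the authors' implementation: the threshold values, the rank-based reading of "local and global popularity thresholds", the simple week-over-week growth test, and all function and parameter names are assumptions made for illustration.

```python
from collections import Counter


def rank_chart(counts):
    """Map each string to its 1-based rank, ordered by descending query count."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {s: i + 1 for i, (s, _) in enumerate(ordered)}


def detect(queries, week, region, local_top=100, global_top=1000, known=frozenset()):
    """Flag strings that are popular in `region` but not yet popular globally.

    queries: iterable of (week, region, query_string) tuples.
    known:   strings to filter out (e.g. established artists, non-music content).
    """
    local_now, local_prev, global_now = Counter(), Counter(), Counter()
    for w, r, s in queries:
        if w == week:
            global_now[s] += 1
            if r == region:
                local_now[s] += 1
        elif w == week - 1 and r == region:
            local_prev[s] += 1

    local_rank = rank_chart(local_now)
    global_rank = rank_chart(global_now)

    hits = []
    for s, rank in local_rank.items():
        if s in known:
            continue                      # already familiar: filtered out
        if rank > local_top:
            continue                      # not locally popular
        if global_rank[s] <= global_top:
            continue                      # already globally popular
        if local_now[s] <= local_prev[s]:
            continue                      # no local growth since last week
        hits.append(s)
    return sorted(hits, key=local_rank.get)
```

The `known` set plays the role of the filtering step; in practice the thresholds would need tuning against real query volumes.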
Local Popularity Not all queries are “products”, thus divergence is not effective (e.g., rare typos) Detection is based on local popularity:
ATPL - All Times Popular List Initialization: All the strings that reached global popularity in 2006 Weekly aggregation Filters non-volatile strings: adult related, e.g., “porn” well established artists, e.g., “madonna”, “avril lavigne” Movies, software, etc.
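One way to picture the ATPL is as a growing set of strings that is extended every week and then used as a filter. The sketch below assumes that "weekly aggregation" means adding each week's globally top-ranked strings to the list; that reading, the `global_top` cut-off, and the function names are assumptions.

```python
def update_atpl(atpl, weekly_global_counts, global_top=100):
    """Return a new ATPL that also contains this week's globally popular strings."""
    top = sorted(weekly_global_counts,
                 key=lambda s: (-weekly_global_counts[s], s))[:global_top]
    return set(atpl) | set(top)


def filter_candidates(candidates, atpl):
    """Drop candidate strings that are already on the ATPL, keeping their order."""
    return [s for s in candidates if s not in atpl]
```

For example, starting from an ATPL seeded with the 2006 strings, each week's call to `update_atpl` would grow the list, and `filter_candidates` would then remove established names from that week's detections.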
Top Ranks – Hot 100 75% of the songs reached their highest rank in the P2P popularity chart before reaching their highest rank on the Billboard. On average a song reaches its top rank on the P2P network 2.39 weeks before reaching its top position on the Billboard.
Prediction Results Example: When a song enters the Billboard will it reach “top 20”? Precision: 89%, Recall: 80% On average songs pass the threshold 2.83 weeks before reaching top Billboard rank More details: Koenigstein, Shavitt, and Zilberman, AdMIRe 2009
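For readers unfamiliar with the two metrics quoted above, here is a small sketch of how precision and recall are computed for a "will it reach the top 20?" prediction. The function name is an assumption and the song labels in the usage note are hypothetical placeholders, not data from the study.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a set of predicted positives against the true positives."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of predicted top-20 songs that really reached the top 20.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of songs that reached the top 20 which were predicted to do so.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

With hypothetical labels, predicting `{"a", "b", "c"}` when the songs that actually reached the top 20 are `{"a", "b", "d", "e"}` gives a precision of 2/3 and a recall of 1/2.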
User Disk Content Disk content of over 1M users – Average of 130 songs per user disk User-User correlation – Recommendation systems Song-Song correlation – Identifying genres – Recommendation systems 2D FastMap
Current Work Clustering and Recommender Systems Time delay between countries
Summary Following activity on the Internet can help us detect trends before they are visible – P2P networks – Social networks – Blogs – Talk-backs – Searches More at http://www.eng.tau.ac.il/~shavitt