Social Networks Seminar on Advanced Internet Applications and Services Ilana Dreizis Eyal Bellisha.

2 Outline Introduction to UGC (User Generated Content) systems Analyzing statistics in UGC systems —Compare UGC systems with standard VoD systems —Analyze the popularity distributions from various categories of UGC services —Measure the level of content aliasing and of illegal content Efficiency proposals for UGC systems —Utilizing P2P —Using caches

3 Introduction to UGC systems In the past: —Video-on-Demand (VoD) systems were supplied by a limited number of producers —Content popularity was controllable through professional marketing campaigns —Niche products were hard to reach and often required a great deal of interest and motivation on the consumer’s part to be accessed Nowadays: —Hundreds of millions of Internet users are self-publishing consumers —UGC popularity is not well predicted by the traditional prediction models —There is an enormous variability in video content A better understanding of the popularity characteristics would help: —Overcome a few of the bottlenecks residing in today’s networks (poor search and recommendation engines) —Shape the strategies of marketing, advertising and search engines

5 Data Collection Our dataset: — – The World’s largest UGC site (2 categories: ‘Entertainment’ & ‘Science & Technology’) — – The most popular UGC service in Korea (has streaming videos at rates as high as 800kb/s) — – a popular online video rental store — – Europe’s largest online DVD rental store — – Online movie guide

6 UGC vs. non-UGC Content production rate: —IMDB carries 1,597,407 titles of movies and TV episodes produced between 1888 and 2015. —YouTube has 65,000 new uploads daily. It takes YouTube only about 25 days to produce the same number of videos! The average number of posts per publisher is similar for UGC and non-UGC: —90% of film directors publish fewer than 10 movies —90% of UGC publishers upload fewer than 30 videos to YouTube Length of videos: —Average length on YouTube is 2-4 minutes —Average length of a film is 94 minutes User participation: —Strong linear correlation:
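
The production-rate comparison on this slide is a single division. The short sketch below reproduces it using only the two figures quoted on the slide; the function name is illustrative.

```python
IMDB_TITLES = 1_597_407         # titles quoted on the slide
YOUTUBE_DAILY_UPLOADS = 65_000  # new uploads per day quoted on the slide


def days_to_match(total_titles: int, uploads_per_day: int) -> float:
    """Days of uploads needed to reach a given catalogue size."""
    return total_titles / uploads_per_day


days = days_to_match(IMDB_TITLES, YOUTUBE_DAILY_UPLOADS)
print(f"{days:.1f} days")  # prints "24.6 days"
```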

7 Power Law The power law model has been increasingly used to explain various statistics appearing in computer science and networking applications. In systems where many people are free to choose between many options, a small subset of the whole will get a disproportionate amount of traffic (or attention, or income), even if no members of the system actively work towards such an outcome. The very act of choosing, spread widely enough and freely enough, creates a power law distribution.

8 Power Law Examples Distribution of inlinks to 100,000 random web pages:

9 Power Law Examples Several hundred blogs ranked by number of inbound links. Wealth of investors in the Forbes 400 list of 2003 vs. their ranks (rich-get-richer).

10 Power Law Many distributions whose underlying mechanism is power law fail to show power law patterns at the two ends of the distribution: —Most popular items —Least popular items This could be the result of bottlenecks. The Netflix data shows a pattern which fits the power law distribution only for ranks 1 to 100. In this case there is an information bottleneck: users cannot easily discover niche content because it is not properly categorized.

11 How niche-centric is YouTube? The 10% most popular videos account for almost 80% of views. Requests on YouTube are highly skewed towards popular videos. Suggestion: caching the 10% of videos that are popular in the long term can serve 80% of the requests.
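
The skew statistic on this slide (share of views captured by the top fraction of videos) can be computed from any list of per-video view counts. The following is a minimal sketch; the function name and the sample counts are made up for illustration and are not taken from the traces.

```python
def top_share(views, fraction):
    """Share of all views that goes to the top `fraction` of videos."""
    ranked = sorted(views, reverse=True)
    top_count = max(1, int(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_count]) / total if total else 0.0


# Hypothetical catalogue of 10 videos, chosen so that one video dominates.
sample_views = [80, 5, 5, 2, 2, 2, 1, 1, 1, 1]
print(top_share(sample_views, 0.10))  # prints 0.8
```

The same number is an upper bound on the hit ratio of a cache that holds exactly those top videos, provided popularity stays stable over time.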

12 Popular Content Analysis All 4 popularity distributions analyzed exhibit power law behavior across more than 2 orders of magnitude (straight line) Best fit: Power law with an exponential cutoff
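
To make the two shapes concrete, the sketch below evaluates a pure power law and a power law with an exponential cutoff, and measures their slope on a log-log plot. The functional form x^(-alpha) * exp(-x / x_cut) is the usual textbook parameterisation and is an assumption here; the slide does not give the fitted parameters, so the values used are arbitrary.

```python
import math


def power_law(x, alpha):
    """Unnormalised power law: a straight line on log-log axes."""
    return x ** (-alpha)


def power_law_with_cutoff(x, alpha, x_cut):
    """Unnormalised power law that decays exponentially beyond x_cut."""
    return x ** (-alpha) * math.exp(-x / x_cut)


def loglog_slope(f, x1, x2):
    """Slope of f between x1 and x2 when both axes are logarithmic."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))


ALPHA, X_CUT = 2.0, 500.0  # arbitrary illustrative parameters
pure = loglog_slope(lambda x: power_law(x, ALPHA), 10, 1000)
cut = loglog_slope(lambda x: power_law_with_cutoff(x, ALPHA, X_CUT), 10, 1000)
print(pure, cut)  # the cutoff curve falls away faster than the straight line
```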

13 Popular Content Analysis Most categories (such as Daum Food) showed power law distributions with an exponential cutoff. A Yule process in YouTube can explain the power law distribution: if k users have already watched a video, then the rate at which other users watch the video will be proportional to k. What could account for the exponential cutoff in the most popular videos? Aging (network of actors) —Unlikely, since traces show that 80% of the videos requested on a given day are older than a month Information filtering: a user can only receive information from a limited number of sources —On the contrary, highly popular videos are prominently featured in VoD services to attract more viewers “Fetch at most once” behavior —Viewers are not likely to watch the same video multiple times
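
The rich-get-richer rule described above (view rate proportional to the number of views already received) can be simulated in a few lines. This is a simplified sketch with a fixed catalogue, not the full Yule process, which also adds new items over time. Each video is given one baseline "ticket" so that unwatched videos can still be picked; that baseline, the function name, and the sizes are assumptions made for illustration.

```python
import random


def simulate_rich_get_richer(num_videos, num_views, seed=0):
    """Assign views one by one, each with probability proportional to views + 1."""
    rng = random.Random(seed)
    views = [0] * num_videos
    tickets = list(range(num_videos))  # one baseline ticket per video
    for _ in range(num_views):
        video = rng.choice(tickets)    # uniform over tickets = proportional to views + 1
        views[video] += 1
        tickets.append(video)          # every view makes the next one more likely
    return views


views = simulate_rich_get_richer(num_videos=1000, num_views=20000)
ranked = sorted(views, reverse=True)
print(ranked[:5], ranked[-5:])  # a few heavy hitters and a long tail of rarely watched videos
```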

14 “Fetch At Most Once” in YouTube? R – Average number of requests per user U – Fixed number of users V – Number of videos Tail truncation is affected by R and the number of videos per category
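
The effect of "fetch at most once" on the head of the distribution can be illustrated with a small simulation in the slide's variables (U users, V videos, R requests per user). This is a sketch under assumptions not stated on the slide: underlying preferences follow a Zipf law with exponent 1, and a user who draws an already watched video simply draws again.

```python
import random
from itertools import accumulate


def simulate_views(num_users, num_videos, requests_per_user,
                   fetch_at_most_once, seed=0):
    """Return per-video view counts for U users making R requests each."""
    if fetch_at_most_once and requests_per_user > num_videos:
        raise ValueError("a user cannot fetch more distinct videos than exist")
    rng = random.Random(seed)
    videos = list(range(num_videos))
    # Zipf-like preference: the video of rank r is drawn with weight 1 / r.
    cum_weights = list(accumulate(1.0 / rank for rank in range(1, num_videos + 1)))
    views = [0] * num_videos
    for _ in range(num_users):
        seen = set()
        for _ in range(requests_per_user):
            video = rng.choices(videos, cum_weights=cum_weights)[0]
            while fetch_at_most_once and video in seen:
                video = rng.choices(videos, cum_weights=cum_weights)[0]
            seen.add(video)
            views[video] += 1
    return views


U, V, R = 200, 500, 20
once = simulate_views(U, V, R, fetch_at_most_once=True)
free = simulate_views(U, V, R, fetch_at_most_once=False)
print(max(once), max(free))  # with fetch-at-most-once no video can exceed U views
```

Because each user contributes at most one view per video, the most popular videos are capped at U views, which flattens the head of the curve relative to the unconstrained case.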

15 The Long Tail - Intro Problems with traditional retail: —An average movie theater will not show a film unless it can attract at least 1,500 people over two weeks’ time —An average record store needs to sell at least 4 copies of a CD per year to make it worth the rent for 1.3” of shelf space —The same goes for DVD rental shops, video-game stores, booksellers, radio stations etc. The reason for this problem is a limited local population. In the previous century there was a clear solution to this problem: hits. Theaters focus on blockbusters, CD stores focus on the top-100 singles charts, news stands focus on the top 30 newspapers and magazines etc.

16 The Long Tail - Intro Why is this a problem? Today, for example, the average Barnes & Noble superstore carries around 100,000 titles. Yet more than 25% of Amazon’s book sales come from outside the top 100,000. 20 of the top-selling albums of all time were produced between 1996 and 2000. The next 5 years produced only 2, ranked 92 and 95. Album sales dropped by 25% between 2001 and 2005, while hit album sales dropped by nearly 50% during the same years.

17 The Long Tail - Intro Yet, another Example If we take a look at Rhapsody’s (streaming service by RealNetworks) monthly statistics, we get a demand curve that looks like the one of any record store: (page 19) All the action seems to appear in a tiny number of tracks on the left – The hits

18 The Long Tail - Intro After a century of staring at the left-hand side of the curve, let’s have a look at the right-hand side: it’s not exactly zero! Not only that, but songs are being downloaded at an average of 250 a month. Since there are so many of them, their total sales quickly add up: 22 million DLs, almost 25% of Rhapsody’s total business. If we look even closer: 16 million DLs, a little more than 15% of DLs a month.

19 The Long Tail - Intro

20 The Long Tail Can UGC services benefit from the Long Tail? The truncated tail best fits power law with exp. cutoff A number of reasons for the truncated tail: —Natural shape: most videos are of low interest to most users —Sampling Biases or pre-filters: users tend to publish their most interesting videos, leaving the private ones unreachable —Information Filtering: search engines tend to favor a small number of popular items Removal of these bottlenecks would allow users to discover rare niche videos and offer new potential business opportunities

21 The Long Tail Potential gains: The benefit is reduced when the number of videos is smaller, since videos can be found more easily. These benefits may not hold if truncation is due to natural user behavior.

22 Popularity Distribution vs. Age After a day, 90% of videos are watched at least once and 40% are watched over 10 times. If a video did not get enough requests during its first days, it is unlikely to get many requests in the future (very slow decay in the horiz. log-scale axis). Is it possible to predict near-future popularity? YouTube Sci trace

23 Predicting Near-Future Popularity Main reasons: —Service providers may pre-populate videos within multiple proxies or caches —Content owners will have fast feedback on their content (e.g. movie trailers) Even for videos only 2 days old, the view counts correlate highly with those recorded 3 months later. How easy or hard is it for a video to become popular as a function of its age? Correlation coefficient of video views in two snapshots and the number of videos analyzed
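
The correlation coefficient between two snapshots can be computed as below. This sketch assumes the Pearson coefficient is meant (the slide does not name the variant), and the view counts in the example are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical view counts of the same four videos in an early and a late snapshot.
early_views = [10, 200, 35, 5]
late_views = [40, 900, 120, 30]
print(pearson(early_views, late_views))
```

A value close to 1 means the early snapshot already ranks videos much as the later one does, which is what makes early prediction useful for cache placement.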

24 Popularity Shifts Observations: —Young videos can change many rank positions very fast (unlike older videos): ranking classification for old videos is more stable —Old videos are able to become popular after a long time (maybe good recommendation engines) —The gap between the max. and the top 99% reflects that only a few young videos make large rank changes —There is a consistent min. line at about -4000 across all age groups (these videos did not receive any requests but were pushed back in ranking by popular videos)

25 Popularity Shifts Observation: —Videos that get many requests can get a minor rank change —Videos that get very few requests can have a large rank change Conclusion: considering the change in ranks is not enough There still are drastic popularity shifts for young videos (log-scale) Most old videos did not receive any significantly large number of requests

27 Efficient UGC System Design YouTube is estimated to carry 60% of all videos online. YouTube serves 100 million distinct videos daily. Goal: investigate the benefits of alternative distribution schemes: caching and P2P. Data used: daily traces of 6 consecutive days for 263,847 YouTube Sci. videos

28 Better Use Of Caching Assumptions: —Caches always redirect users to the right video —No assumptions are made about the location or the size of the caches —Each time a video is viewed the cache holds the full-length video (even if the user chose to watch it partially) Types of caches: —Static finite cache – at day 0 the cache is filled with long-term popular videos, and its content never changes —Dynamic infinite cache – at day 0 the cache is filled with all videos requested before day 0, and afterwards it stores any additionally requested videos —Hybrid finite cache – like the static cache, but can also hold the daily most popular videos The static and the hybrid caches hold 16% of the Sci. videos at day 0. The hybrid cache also has extra space for the daily top 10,000 videos. The 6-day trace is replayed under each one of these caches, counting hits and misses. Cache size is determined by the number of videos cached (video length does not vary much across files).

29 Better Use Of Caching Results: —The static cache uses 84% less space than the dynamic cache, yet still saves about 75% of the server’s load —The hybrid cache improves on the static cache by about 10%
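
The trace replay behind these results can be sketched as follows. One function covers the three cache types: a fixed initial set models the static cache, `dynamic=True` models the infinite cache that keeps everything it sees, and `daily_top_k` adds the hybrid cache's extra slots. The slides do not say which day's top videos the hybrid cache holds; this sketch assumes the previous day's top k, and the tiny trace is invented.

```python
from collections import Counter


def replay(days, static_videos=(), daily_top_k=0, dynamic=False):
    """Replay per-day request lists and return (hits, misses)."""
    cache = set(static_videos)   # content loaded at day 0
    daily_extra = set()          # hybrid slots, refreshed after each day
    hits = misses = 0
    for requests in days:
        for video in requests:
            if video in cache or video in daily_extra:
                hits += 1
            else:
                misses += 1
                if dynamic:
                    cache.add(video)  # the infinite cache stores every miss
        if daily_top_k:
            # Assumption: tomorrow's extra slots hold today's top-k videos.
            top = Counter(requests).most_common(daily_top_k)
            daily_extra = {video for video, _ in top}
    return hits, misses


trace = [["a", "b", "b", "c"], ["b", "b", "d", "a"]]  # hypothetical two-day trace
print(replay(trace, static_videos={"a"}))                 # static: (2, 6)
print(replay(trace, static_videos={"a"}, daily_top_k=1))  # hybrid: (4, 4)
print(replay(trace, dynamic=True))                        # dynamic: (4, 4)
```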

30 P2P VoD Consider a peer-assisted VoD distribution where users stream videos from VoD servers as well as from each other. P2P is effective only when there are enough online peers sharing content. How many files benefit from this approach? How much can the server workload be lowered?

31 P2P VoD P2P session cases: —A user shares only while watching a video —A user shares a video for the entire duration time he spends on YouTube —A user shares for 1 extra hour after he is done watching —A user shares for 1 extra day Average session time is currently 28 minutes Assumption: within a single day requests are exponentially distributed Variables: —Intensity of requests: —System time of a user: t —Number of concurrent users:
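
The formulas for these variables did not survive in the transcript and are not reconstructed here. As a rough illustration only, the sketch below assumes the standard queueing relation (Little's law): the average number of concurrent users of a video equals the request intensity multiplied by the time each user stays in the system. The request rates and sharing times used are hypothetical.

```python
def expected_concurrent_users(requests_per_hour, hours_in_system):
    """Little's law: average population = arrival rate x time in system."""
    return requests_per_hour * hours_in_system


# Hypothetical sharing times in hours, loosely following the session cases on the slide.
sharing_cases = {
    "for the 28-minute session": 28 / 60,
    "for 1 extra hour": 28 / 60 + 1,
    "for 1 extra day": 28 / 60 + 24,
}
for label, hours in sharing_cases.items():
    peers = expected_concurrent_users(requests_per_hour=2, hours_in_system=hours)
    print(f"sharing {label}: about {peers:.1f} concurrent peers")
```

Under this assumption a video benefits from P2P only when the product exceeds one, which is why longer sharing times help unpopular videos most.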

32 Content Aliasing In UGC there often exist multiple identical or very similar copies of a single popular event. Problem: —Multiple copies of a certain video dilute the popularity of a single event —This could affect the recommendation engine Data: a sample of 216 videos from the top 10,000 of YouTube’s Ent. category. Most videos have 1-4 aliases, while the maximum is 89.

33 Content Aliasing Time intervals of aliases: There is little or no decrease in the number of views over time

34 Illegal Uploads Videos derived from copyrighted content raise a serious legal dilemma for UGC service providers. According to Vidmeter’s report, nearly 10% of videos on YouTube are uploaded without the permission of the content owner. To measure the extent of illegal video content, the same list of videos is sampled at two different times: the discrepancy represents the deleted videos. From the first set of 1,687,506 videos (YouTube Ent.) only about 5% were deleted due to a violation of the copyright law, a far smaller share than Vidmeter’s.

35 Appendix Example for the network of actors:

36 Papers I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System —Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn and Sue Moon Classes Of Behavior Of Small-World Networks —L. Amaral, A. Scala, M. Barthelemy and H.E. Stanley The Long Tail: Why the Future of Business Is Selling Less of More —C. Anderson

37 Outline Introduction and basic concepts Introduce the general ideas of BUBBLE algorithm Introduce centralised community detection algorithms Evaluate the BUBBLE algorithm Show the possibility of a distributed implementation for BUBBLE

38 Introduction - Goals To improve our understanding of human mobility in terms of social structures —Community —Centrality To use these structures in the design of forwarding algorithms for Pocket Switched Networks (PSNs) —In a PSN, we do not try to find or build end-to-end paths. Data is forwarded hop-by-hop, taking advantage of any opportunities in the course of device mobility (local/global network connectivity). Whenever two PSN nodes come into contact, they must detect each other and determine what to transfer in each direction —PSN falls under the more general space of Delay Tolerant Networks (DTN)

39 DTN The existing TCP/IP model operates on a number of key assumptions: —an end-to-end path exists between a data source and its peer(s) —the maximum round-trip time between any node pairs in the network is not excessive —the end-to-end packet drop probability is small Challenged networks characteristics: —very significant link delay —non-existence of end-to-end routing paths —lack of large memory at end nodes In a DTN, routing is performed over time to achieve eventual delivery by employing long-term storage at the intermediate nodes

40 Introduction Some DTN routing algorithms provide forwarding by building and updating routing tables whenever mobility occurs. This is not cost-effective for a PSN: —mobility is often unpredictable —topology changes can be rapid Rather than exchange much control traffic to create unreliable routing structures, it is better to search for some characteristics of the network which are less volatile than mobility. A PSN is formed by people. Those people’s social relationships may vary much more slowly than the topology.

42 Why BUBBLE Algorithm? Previous work presented a “labeling strategy”: —Each node has a label that informs other nodes of its affiliation —Next-hop nodes are selected if they have the same label as the destination —Very little state information, merely an affiliation label, can already bring significant improvement in forwarding performance (delivery ratio, delay, cost) —This is a beginning of social-based forwarding in PSN, but it is without a concise concept of community and lacks mechanisms to move messages away from the source when the destinations are socially far away

43 Why BUBBLE Algorithm? BUBBLE combines the knowledge of community structure with the knowledge of node centrality to make forwarding decisions Two intuitions behind this algorithm: —People have varying roles and popularities in society, and these should be true also in the network —People form communities in their social lives, and this should also be observed in the network layer

44 BUBBLE RAP Algorithm Forwarding is carried out as follows: The source node first bubbles the message up the hierarchical ranking tree using the global ranking, until it reaches a node which is in the same community as the destination node. From then on, the local ranking system is used instead of the global ranking, and the message continues to bubble up through the local ranking tree until the destination is reached or the message expires. Requires every node to be able to compare the ranking of all other nodes in the system.
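
The forwarding rule described on this slide can be written as a single decision function evaluated whenever two nodes meet. This is a sketch of the description above, not the authors' code: it assumes each node belongs to exactly one community (the real traces allow overlap), and the node names and rank values are invented.

```python
def should_forward(current, encountered, destination,
                   community, global_rank, local_rank):
    """Decide whether `current` hands the message to `encountered`."""
    if encountered == destination:
        return True  # direct delivery
    dest_community = community[destination]
    if community[current] == dest_community:
        # Already inside the destination community: climb the local ranking only.
        return (community[encountered] == dest_community
                and local_rank[encountered] > local_rank[current])
    # Still outside: enter the destination community or climb the global ranking.
    return (community[encountered] == dest_community
            or global_rank[encountered] > global_rank[current])


# Hypothetical nodes: s and hub in community A; x, y and the destination d in B.
community = {"s": "A", "hub": "A", "x": "B", "y": "B", "d": "B"}
global_rank = {"s": 1, "hub": 5, "x": 2, "y": 3, "d": 1}
local_rank = {"s": 1, "hub": 2, "x": 1, "y": 4, "d": 2}

print(should_forward("s", "hub", "d", community, global_rank, local_rank))  # True
print(should_forward("hub", "x", "d", community, global_rank, local_rank))  # True
print(should_forward("y", "x", "d", community, global_rank, local_rank))    # False
```

Note that the function only compares the two nodes that are in contact, so each node needs its own rank values and community label plus those of its current peer.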

45 Illustration of the BUBBLE algorithm

46 Data Collection Our dataset: —Infocom05 - the iMotes were distributed to students attending the Infocom student workshop —Hong-Kong - the people carrying the iMotes were chosen independently in a Hong-Kong bar —Cambridge - the iMotes were distributed to students from the University of Cambridge Computer Laboratory —Infocom06 - the same as in Infocom05 except that the scale is larger —Reality - smart phones were deployed to students and staff at MIT

Experimental data set            Infocom05  Hong-Kong  Cambridge  Infocom06  Reality
Device                           iMote      iMote      iMote      iMote      Phone
Network type                     Bluetooth  Bluetooth  Bluetooth  Bluetooth  Bluetooth
Duration (days)                  3          5          11         3          246
Number of experimental devices   41         37         54         98         97
Number of internal contacts      22,459     560        10,873     191,336    54,667
Average # contacts/pair/day      4.6        0.084      0.345      6.7        0.024

47 Frequency of nodes as relays Shows the number of times a node falls on the shortest paths between all other node pairs (centrality of a node in the system). In order to design a more efficient forwarding strategy, we prefer to choose popular nodes as relays rather than unpopular ones.
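
Counting how often a node lies on shortest paths between other pairs can be sketched on a static graph. This is a simplification: it treats the contact graph as unweighted and time-independent and takes one breadth-first shortest path per pair, whereas the slides work with real time-varying contact traces. The example graph is invented.

```python
from collections import deque
from itertools import combinations


def shortest_path(adjacency, source, target):
    """One shortest path from source to target found by breadth-first search."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in adjacency[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None  # target unreachable


def relay_counts(adjacency):
    """Number of node pairs for which each node acts as an intermediate relay."""
    counts = {node: 0 for node in adjacency}
    for source, target in combinations(adjacency, 2):
        path = shortest_path(adjacency, source, target)
        if path:
            for relay in path[1:-1]:
                counts[relay] += 1
    return counts


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(relay_counts(chain))  # {'a': 0, 'b': 2, 'c': 2, 'd': 0}
```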

49 Are communities of nodes detectable in PSN traces? This requires a community detection algorithm. Criteria for choosing the algorithm: —Ability to uncover overlapping communities —A high degree of automation

50 Are communities of nodes detectable in PSN traces? K-CLIQUE method suits —but was designed for binary graphs, thus we must threshold the edges of the contact graphs in order to use it and it is difficult to choose an optimum threshold manually Weighted network analysis (WNA) can work on weighted graphs directly —but it cannot detect overlapping communities Chose to use both K-CLIQUE and WNA

51 K-CLIQUE Community Definition Union of all k-cliques (complete sub graphs of size k) reachable through a series of adjacent k-cliques [Palla et al] Two k-cliques are adjacent if they share k − 1 nodes A community corresponds to a maximal union of k-cliques in which we can reach any k-clique from any other k-clique through series of k-clique adjacencies

52 K-CLIQUE Community Definition Overlapping feature (a node can belong to several different k-clique clusters at the same time). As we increase k, the k-clique communities shrink, but on the other hand become more cohesive, since their member nodes have to be part of at least one k-clique. Was designed for binary graphs (undirected, unweighted).
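
The definition above translates directly into code: list every k-clique, link two cliques when they share k − 1 nodes, and take the union of each linked group. The sketch below is a brute-force version that enumerates all node subsets of size k, so it is only suitable for small graphs; it is not the optimised procedure of Palla et al. The example edges are invented.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return the k-clique communities of an undirected, unweighted graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Every k-subset whose members are all pairwise connected is a k-clique.
    cliques = [frozenset(subset)
               for subset in combinations(sorted(adjacency), k)
               if all(b in adjacency[a] for a, b in combinations(subset, 2))]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two k-cliques are adjacent when they share k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(group) for group in groups.values())


two_triangles = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
print(k_clique_communities(two_triangles, 3))             # [[1, 2, 3], [3, 4, 5]]
print(k_clique_communities(two_triangles + [(2, 4)], 3))  # [[1, 2, 3, 4, 5]]
```

In the first call node 3 belongs to both communities, which is the overlapping feature; adding the edge (2, 4) creates a bridging triangle and merges them.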

53 K-CLIQUE Community Detection Communities based on contact durations with weight threshold = 388800s (4.5days), 648000s (7.5days) and k=3,4 (Reality)

54 Weighted Network Analysis Communities detected by applying WNA on four datasets. Infocom06 – Qmax is low, agrees with the fact that in a conference the community boundary becomes blurred Cambridge – the two communities exactly matched the two groups (1st year and 2nd year) of students selected for the experiment Reality - Qmax is high, reflects the more diverse campus environment
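
The Q values quoted above are modularity scores. The sketch below computes the modularity of a given partition of a weighted, undirected graph using the standard formula Q = Σ_c [ w_in(c)/W − (s(c)/2W)² ], where W is the total edge weight, w_in(c) the weight of edges inside community c, and s(c) the summed strength of its nodes. It only scores a partition; the search for the partition that maximises Q is not shown, and the example graph is invented.

```python
def weighted_modularity(edges, community):
    """Modularity Q of a partition; edges are (u, v, weight), no self-loops."""
    total_weight = 0.0
    internal = {}   # weight of edges with both ends in the community
    strength = {}   # summed weighted degree of the community's nodes
    for u, v, weight in edges:
        total_weight += weight
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + weight
        strength[cv] = strength.get(cv, 0.0) + weight
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + weight
    if total_weight == 0:
        return 0.0
    return sum(internal.get(label, 0.0) / total_weight
               - (s / (2 * total_weight)) ** 2
               for label, s in strength.items())


# Two disconnected triangles with unit weights.
edges = [(1, 2, 1), (2, 3, 1), (1, 3, 1), (4, 5, 1), (5, 6, 1), (4, 6, 1)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
merged = {node: "A" for node in split}
print(weighted_modularity(edges, split))   # 0.5
print(weighted_modularity(edges, merged))  # 0.0
```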

55 Centralised community detection algorithms Give us rich information about the human social clustering Useful for offline data analysis on mobility traces collected Useful for exploring structures in the data and hence design useful forwarding strategies, security measures

57 Evaluations of Different Forwarding Algorithms Comparison metrics: —Delivery ratio - the proportion of messages that have been delivered out of the total unique messages created —Delivery cost - the total number of messages (including duplicates) transmitted across the air. To normalize this, we divide it by the total number of unique messages created
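
Both metrics are ratios over the number of unique messages created. A minimal sketch, with invented counts:

```python
def forwarding_metrics(created, delivered, transmissions):
    """Return (delivery ratio, normalised delivery cost)."""
    if created <= 0:
        raise ValueError("at least one message must have been created")
    delivery_ratio = delivered / created     # share of unique messages that arrived
    delivery_cost = transmissions / created  # transmissions, duplicates included, per message
    return delivery_ratio, delivery_cost


# Hypothetical run: 100 messages created, 60 delivered, 450 transmissions in total.
print(forwarding_metrics(created=100, delivered=60, transmissions=450))  # (0.6, 4.5)
```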

58 Evaluations of Different Forwarding Algorithms WAIT - Hold on to a message until the sender encounters the recipient directly (paths with a single hop) —lower bound for delivery and cost FLOOD - Messages are flooded throughout the entire system (length of the path is unlimited) —upper bound for delivery and cost MCP - Multiple-Copy-Multiple-Hop. We use the 4-copy-4-hop MCP scheme in most of the cases —paths of up to four hops are used (corresponding to a flooding algorithm with a Time-To-Live of 4 hops) LABEL - Messages are only forwarded to the nodes in the same community as the destination

59 Two-Community Case Cambridge data can be divided into two communities : —undergraduate year 1 (Group A) —year 2 (Group B) Centrality of nodes within each group: —traffic is created only between members of the same community —only members in the same community are chosen as relays for messages

60 Two-Community Case Figure (a) shows the individual node centrality when traffic is created from one group to another Figure (b) shows the correlation of node centrality within an individual group and inter-group centrality Points lie more or less around the diagonal line: —the inter- and intra- group centralities are quite well correlated —active nodes in a group are also active nodes for inter-group communication

61 Two-Community Case Comparisons of several algorithms on the Cambridge dataset, delivery and cost: —BUBBLE achieves almost the same delivery success rate as the 4-copy-4-hop MCP but with only 45% of its cost

62 Multiple-Community Case Use the Reality dataset —There are 8 groups in total within the whole dataset —Within each individual group, the node centralities demonstrate diversity similar to the Cambridge case —First isolate just one group, consisting of 16 nodes (single-group case): BUBBLE performs very similarly to MCP most of the time and even outperforms MCP when the TTL is set to be longer than 1 week (delivery success ratio). BUBBLE has only 55% of the cost of MCP.

63 Multiple-Community Case Comparisons of several algorithms on the Reality dataset, all groups: —flooding achieves the best delivery ratio, but its cost is 2.5 times that of MCP and 5 times that of BUBBLE —BUBBLE is very close in performance to MCP and even outperforms it when the TTL of the messages is allowed to be larger than 2 weeks —BUBBLE’s cost is only 50% that of MCP

64 Multiple-Community Case PROPHET - Uses the history of encounters and transitivity to calculate the probability that a node can deliver a message to a particular destination Comparisons of BUBBLE and PROPHET on Reality dataset: —BUBBLE achieves a similar delivery ratio to PROPHET, but with only half of the cost —Similar significant improvements by using BUBBLE are also observed in other datasets, these demonstrate the generality of the BUBBLE algorithm

66 DiBuBB Algorithm For practical applications we need BUBBLE to be implemented in a distributed way —Each device should be able to detect its own community and calculate its centrality values Use the distributed K-CLIQUE algorithm to detect the local community (detection accuracy up to 85% of the centralised one). Use C-Window to approximate its own global and local centrality values —C-Window – cumulative window: calculate the average value over all previous windows, such as from yesterday to now Besides that, it operates exactly like BUBBLE

67 Distributed BUBBLE RAP Trace analysis conclusions: —Total degree (unique nodes seen by a node throughout the experiment period) is not a good approximation of the node centrality —The degree per unit time (for example the number of unique nodes seen per 6 hours) and the node centrality have a high correlation value

68 Approximating Centrality For evaluation of distributed centrality compare RANK, S-Window and C-Window: —RANK - a component of BUBBLE, using only centrality information. Messages are pushed to nodes which have a higher ranking than the current node, until either they reach the destinations or they expire —S-Window - when two nodes meet each other, they compare how many unique nodes they have met in the previous unit-time slot (e.g. 6 hours)
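
The two window-based estimators can be sketched from a node's own contact log. S-Window uses the number of unique nodes seen in the most recent complete window; C-Window, as defined on the DiBuBB slide, averages that number over all previous windows. The log format, function names, and sample contacts are assumptions made for illustration.

```python
SIX_HOURS = 6 * 3600  # unit-time slot used as an example on the slides


def unique_contacts_per_window(contacts, window_seconds, num_windows):
    """Count the unique peers seen in each of the first `num_windows` windows."""
    seen = [set() for _ in range(num_windows)]
    for timestamp, peer in contacts:
        index = int(timestamp // window_seconds)
        if 0 <= index < num_windows:
            seen[index].add(peer)
    return [len(peers) for peers in seen]


def s_window(counts):
    """Single window: unique peers in the most recent complete window."""
    return counts[-1]


def c_window(counts):
    """Cumulative window: average of unique peers over all complete windows."""
    return sum(counts) / len(counts)


# Hypothetical log of (seconds since start, peer id) for one device.
log = [(0, "a"), (10, "b"), (20, "a"), (SIX_HOURS + 5, "c")]
counts = unique_contacts_per_window(log, SIX_HOURS, num_windows=2)
print(counts, s_window(counts), c_window(counts))  # [2, 1] 1 1.5
```

When two nodes meet, each would compare its own estimate with the peer's and treat the higher value as the higher rank.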

69 Approximating Centrality S-Window achieves at most a 4% improvement in delivery ratio over RANK, but at double the cost. C-Window does not achieve as good a delivery ratio as RANK (at most 10% less in terms of delivery), but it also has lower cost.

70 Approximating Centrality C-Window is easy to implement in reality and has similar delivery and cost to RANK (pre-calculated centrality), which is why it was chosen for DiBuBB. S-Window and C-Window can approximate the pre-calculated centrality quite well. Running a set of RANK emulations on more datasets, but using the centrality values of the Multiple-Community Case, showed that the delivery ratio and cost of RANK on the new datasets are as good as in the original dataset. These results imply some level of human mobility predictability, and show empirically that past contact information can be used in the future.

71 CONCLUSIONS It is possible to detect characteristic properties of social grouping in a decentralised fashion from a diverse set of real-world traces. It was demonstrated that such characteristics can be effectively used in forwarding decisions. The BUBBLE algorithm has a delivery ratio similar to, but much lower resource utilisation than, flooding, controlled flooding, and PROPHET.

72 Papers BUBBLE Rap: Social-based Forwarding in Delay Tolerant Networks —Pan Hui, Jon Crowcroft, Eiko Yoneki Analysis of weighted networks —M. E. J. Newman Uncovering the overlapping community structure of complex networks in nature and society —Gergely Palla, Imre Derényi, Illés Farkas & Tamás Vicsek Pocket Switched Networks and Human Mobility in Conference Environments —Pan Hui, Augustin Chaintreau, James Scott, Richard Gass, Jon Crowcroft and Christophe Diot How Small Labels create Big Improvements —Pan Hui, Jon Crowcroft A Delay-Tolerant Network Architecture for Challenged Internets —Kevin Fall Routing in a Delay Tolerant Network —Sushant Jain, Kevin Fall, Rabin Patra
