1 Diffusion of Information & Innovations in Online Social Networks Krishna Gummadi Networked Systems Research Group Max Planck Institute for Software Systems
2 My goals and methodology Goals: Understand & build complex systems –example: online social networks Methodology: Evolve the systems with feedback –observe deployed systems –extract insights –test new designs and architectural principles
3 My research: Enabling the Social Web Three fundamental trends & challenges in the social Web 1. User-generated content sharing –can we protect privacy of users sharing personal data? 2. Word-of-mouth based content exchange –can we understand & leverage word-of-mouth better? 3. Crowd-sourcing content rating and ranking –can we find trustworthy & relevant content sources?
4 Information discovery in Online Social Networks Discovering information on the Web –old method: Browsing from authoritative sources –new method: Word-of-mouth from friends Lots of theories & beliefs about viral propagation –but few are empirically derived or validated at scale! Large-scale empirical studies only possible recently
5 Research problems Understand dynamics of propagation –Temporal and spatial patterns of propagation –Role of social network, social systems, and user influence For different types of information and innovations –News, web URLs, conventions, and technology services With the ultimate goal of enabling better viral campaigns –Consumers: Help them get content they would not otherwise receive –Publishers: Help them spread their content more effectively
6 Why Twitter? One of the most popular social media Social links are the primary way information flows Users can follow any public messages, called tweets, they like Traditional media sources and word-of-mouth coexist Mainstream media sources (BBC, CNN, DowningStreet) Celebrities (Oprah Winfrey), politicians (Barack Obama) Ordinary users (like you and me!)
7 Dataset Crawled near-complete data from Twitter till August 2009 –asked Twitter to white-list 58 machines –crawled information about user profiles and all tweets ever posted starting from user ID of 0 to 80 million Gathered 54M users, 2B follow links, and 1.7B tweets –user profile includes join date, name, location, time zone –exact time stamp of tweets available
8 Studies of information diffusion How web URLs are discovered in Twitter [IMC ‘11] How news spreads in Twitter [ICWSM ‘11] The role of offline geography in Twitter [ICWSM 2012] How social conventions emerge in Twitter [ICWSM 2012] –social norms are fundamental to social psychology and social life –social conventions are like social norms, before they become tied to group identity and before deviant behavior is sanctioned
Macroscopic analysis: Who passes information to whom With Fabrício Benevenuto (UFOP) Hamed Haddadi (QMUL) Meeyoung Cha (KAIST)
10 High-level network characteristics 95% of users belong to the largest connected component (LCC) 5% were singletons and 0.2% formed 32K smaller components Low reciprocity (10%) Power-law node degree distribution with extremely large hubs Grassroots users, on average, have 37 followers (98% had <200 followers) 0.01% users had >100,000 followers
11 Theory of information flow Two-step flow of influence by Katz and Lazarsfeld (1940s) Not all people are equally influential A minority of opinion leaders influence everyone else Mass media influence the opinion leaders, hence the two-step flow
12 Interesting questions Can we identify the different groups in Twitter? What fraction of audience can each group reach?
13 How do we identify different groups? Grassroots 51M (98.6%) Evangelists 700,000 (1.4%) Mass media 8,000 (<0.01%)
14 Major news events studied Picked six major news topics in 2009 Used keywords to identify relevant tweets Limited study to a 2-month period 50-80% grassroots, 18-48% evangelists, <0.1% mass media All events reached an audience of millions
15 Audience reach: Sufficiency Sufficiency—Audience that can be reached by the top K spreaders rank 1 rank 2 rank 3 Spreader Audience
16 Sufficiency test in Iran election Mass media Evangelists Grassroots
17 Audience reach: Necessity Necessity: Audience that is still reachable after removing the top K spreaders; the rest of the audience would otherwise not be reachable rank 1 rank 2 rank 3 Spreader Audience
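The sufficiency and necessity tests can be illustrated with a minimal sketch. It assumes each spreader is mapped to the set of users who received the topic from them, and that spreaders are ranked by the size of that set; the slides do not state the ranking criterion, so that choice, the function names, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def rank_spreaders(audience_of: Dict[str, Set[str]]) -> List[str]:
    """Rank spreaders by audience size, largest first (assumed criterion)."""
    return sorted(audience_of, key=lambda s: (-len(audience_of[s]), s))


def reach(audience_of: Dict[str, Set[str]], spreaders: List[str]) -> Set[str]:
    """Return the union of the audiences of the given spreaders."""
    reached: Set[str] = set()
    for spreader in spreaders:
        reached |= audience_of[spreader]
    return reached


def sufficiency(audience_of: Dict[str, Set[str]], k: int) -> float:
    """Fraction of the total audience reached by the top-k spreaders alone."""
    ranked = rank_spreaders(audience_of)
    total = reach(audience_of, ranked)
    if not total:
        return 0.0
    return len(reach(audience_of, ranked[:k])) / len(total)


def necessity(audience_of: Dict[str, Set[str]], k: int) -> float:
    """Fraction of the total audience still reachable without the top-k spreaders."""
    ranked = rank_spreaders(audience_of)
    total = reach(audience_of, ranked)
    if not total:
        return 0.0
    return len(reach(audience_of, ranked[k:])) / len(total)


# Hypothetical toy data: three spreaders with overlapping audiences.
audience_of = {
    "media": {"u1", "u2", "u3", "u4"},
    "evangelist": {"u3", "u4", "u5"},
    "grassroots": {"u5", "u6"},
}
print(sufficiency(audience_of, 1), necessity(audience_of, 1))
```

A high sufficiency at small K means a few top spreaders can cover most of the audience; a low necessity at small K means the audience depends heavily on them.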
18 Necessity test in Iran election Mass media Evangelists Grassroots
19 Audience reach of popular topics Mass media alone reach the majority of all audience Evangelists increase the reach considerably Grassroots play marginal role
20 Audience reach of non-popular topics Evangelists consistently reach a large audience Mass media may not be present Grassroots play marginal role The evangelist group needs more attention in viral marketing Existing influence measures fail to appreciate their role
21 Summary of macroscopic analysis Teased out the roles of mass media, evangelist, and grassroots users in the spread of major and minor events Mass media are important for spreading popular topics Evangelists play a crucial role for both popular and non-popular topics Grassroots play a marginal role in all cases Studied information spreading patterns across groups Information flows in all directions unlike in the two-step flow theory
A closer look: Patterns of URL propagation With Tiago Rodrigues (UFMG) Fabrício Benevenuto (UFOP) Meeyoung Cha (KAIST)
23 Interesting questions What types of content are discovered by Word-of-Mouth? What are the structures of Word-of-Mouth propagation trees? How geographically distributed are the propagation trees?
24 Why URLs on Twitter? Ideal for studying Word-of-Mouth – Centered around the idea of spreading information – Easy to trace their propagation 208M URLs shared on Twitter from 2006 -- 2009
25 Modeling Information Cascades Hierarchical tree model Table (T, User, Tweet content) A C B D
26 Modeling Information Cascades Hierarchical tree model Table (T, User, Tweet content): (1, A, "Check this: http://www.example.com/") A B Initiator Receiver C D
27 Modeling Information Cascades Hierarchical tree model Table (T, User, Tweet content): (1, A, "Check this: http://www.example.com/"); (2, B, "http://www.example.com/ is interesting") A B D Initiator Spreader Receiver C
28 Modeling Information Cascades Hierarchical tree model Table (T, User, Tweet content): (1, A, "Check this: http://www.example.com/"); (2, B, "http://www.example.com/ is interesting"); (3, C, "Interesting link: http://www.example.com/") A C B D Initiator Spreader Receiver
29 Modeling Information Cascades Hierarchical tree model A C B D Initiator Spreader Receiver Audience
30 Modeling Information Cascades Hierarchical tree model – URL propagation pattern is a forest A C B D Initiator Spreader Receiver E F Initiator Spreader G Initiator I Receiver H
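The hierarchical tree model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a user's parent is the most recent earlier poster among the accounts they follow, that a poster with no such account is an initiator, and that receivers are non-posting followers of any poster. The function names and the follow graph are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def build_cascade_forest(
    tweets: List[Tuple[int, str]],
    follows: Dict[str, Set[str]],
) -> Dict[str, Optional[str]]:
    """Map each poster of a URL to its parent (None for an initiator)."""
    parent: Dict[str, Optional[str]] = {}
    posted_at: Dict[str, int] = {}
    for t, user in sorted(tweets):
        if user in parent:
            continue  # only a user's first post of the URL counts
        # Followees who posted the URL strictly earlier than this user.
        earlier = [
            u for u in follows.get(user, set())
            if u in posted_at and posted_at[u] < t
        ]
        # Assumed attribution rule: credit the most recent earlier poster.
        parent[user] = max(earlier, key=lambda u: (posted_at[u], u)) if earlier else None
        posted_at[user] = t
    return parent


def receivers(parent: Dict[str, Optional[str]], follows: Dict[str, Set[str]]) -> Set[str]:
    """Users who follow at least one poster but never posted the URL."""
    posters = set(parent)
    return {
        user for user, followees in follows.items()
        if user not in posters and followees & posters
    }


# Tweets (T, User) as in the slide table; the follow graph is made up.
tweets = [(1, "A"), (2, "B"), (3, "C")]
follows = {"B": {"A"}, "C": {"A", "B"}, "D": {"C"}}

forest = build_cascade_forest(tweets, follows)
initiators = [u for u, p in forest.items() if p is None]
spreaders = [u for u, p in forest.items() if p is not None]
print(forest, initiators, spreaders, receivers(forest, follows))
```

Because every initiator starts its own tree, the result for one URL is a forest rather than a single tree.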
31 Word-of-mouth can help popularize niche content What URLs are popularly shared on Twitter? Do they come from the popular domains in the Web?
32 Does all content, including those published by unpopular domains, benefit from Word-of-Mouth? Word-of-mouth gives all URLs and content (both popular and non-popular) a chance to become popular
33 How large is the largest Word-of-Mouth? URL popularity – Most popular: 426,820 spreaders and audience of 28M users – Average: 3 spreaders and audience of 843 users Word-of-mouth can incur extremely large cascades
34 What are the typical structures of propagation trees? Cascade trees are much wider than they are deep – 0.1% of the trees have width > 20 – 0.005% of the trees have height > 20 A C B D 3 2 14738,418
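Width and height can be computed per tree from a parent map. The sketch assumes height is the number of edges on the longest root-to-leaf path and width is the largest number of nodes at any single depth; the slides do not spell out either definition, and the example forest is hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def tree_shapes(parent: Dict[str, Optional[str]]) -> Dict[str, Tuple[int, int]]:
    """Return {root: (width, height)} for every tree in the forest.

    Assumes the parent map is acyclic and every parent is itself a key.
    """
    levels_by_root: Dict[str, Counter] = {}
    for node in parent:
        depth, current = 0, node
        # Walk up to the root, counting edges along the way.
        while parent[current] is not None:
            current = parent[current]
            depth += 1
        levels_by_root.setdefault(current, Counter())[depth] += 1
    return {
        root: (max(levels.values()), max(levels))
        for root, levels in levels_by_root.items()
    }


# Hypothetical forest: A -> {B, C}, B -> D, and a singleton tree G.
parent = {"A": None, "B": "A", "C": "A", "D": "B", "G": None}
print(tree_shapes(parent))
```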
35 What are the typical structures of propagation trees?
36 Twitter Cascades vs. E-mail Cascades D. Liben-Nowell and J. Kleinberg – Tracing Information Flow on a Global Scale using Internet Chain-Letter Data, PNAS, 2008 e-mail Twitter
37 How geographically distributed are the propagation trees? Users within a short geographical distance have a higher probability of posting the same URL A C B D
38 Summary: Patterns of URL propagation Large-scale analysis of URL propagation in Twitter – All contents have a chance to reach a large audience – Propagation trees on Twitter are wide and shallow Advertising – Content is consumed locally Caching design and recommendation
Microscopic analysis: Understanding news media landscape in Twitter With Jisun An (Cambridge Univ.) Meeyoung Cha (KAIST)
40 Interesting questions Does social interaction help media sources reach more audience? Do users follow diverse media sources? Does social interaction expose users to diverse media sources?
41 Methodology Focus on 80 media sources English-based media A total of 14M followers and their connections (1.2B links, 350,000 tweets)
Genre | Example account
News (40 sources) | cnnbrk, nytimes, TerryMoran
Technology (13) | BBCClick, mashable
Sports (7) | NBA, nfl
Music (3) | MTV
Politics (5) | nprpolitics,
Business (2) | davos
Fashion & Gossip (4) | peoplemag
42 Media exposure
43 Is social interaction helping media publishers reach more audience? Yes: Social interaction increases publisher’s audience On average, audience size increases by a factor of 28 2. nytimes 1.7M -> 6.7M 8. BBCClick 1.2M -> 12M 55. NASA (120K) 65. washingtonpost 30K -> 3.5M
44 Does a user follow multiple media sources? Direct Subs: 80% of users subscribe only to 2-3 media sources No: Users only follow a limited number of media sources.
45 Is social interaction exposing users to multiple media sources? Social Interaction: 80% of users hear from up to 27 media sources Yes: 8-fold increase in number of media sources Direct Subs: 80% of users subscribe only to 2-3 media sources
Does a user follow diverse media sources? Following multiple media sources does not necessarily imply exposure to diverse opinions Focus on political news
47 Does a user follow diverse media sources? Manually tagging political leanings of media sources Left-right.org ADA (Americans for Democratic Action) score Scale from 0 to 100, where 0 means ‘very conservative’ No: Out of 10M users, 7M users only follow one side of media sources Left-leaning (62.1%), center (37%), right-leaning (0.9%) I like to see diverse media sources
48 Is social interaction exposing users to diverse media sources? Yes: Users are exposed to diverse opinions through social interaction
49 Estimating closeness How “close” or “similar” two media sources are
50 Closeness measure Closeness(A, B_i): probability that a random follower of B_i also follows A Which one is closer to nytimes, Foxnews or washingtonpost? Followers of Foxnews (B_1) only: 435,222; of both Foxnews and NYTimes: 142,951; of NYTimes (A) only: 2,947,635 Followers of washingtonpost (B_2) only: 154,224; of both washingtonpost and NYTimes: 249,626; of NYTimes (A) only: 2,840,960 Closeness(NYTimes, Foxnews) = 143K/578K = 0.25 Closeness(NYTimes, washingtonpost) = 250K/404K = 0.62 Washingtonpost is closer to nytimes than Foxnews
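The closeness measure is a conditional probability and can be checked against the follower counts on the slide. A minimal sketch, with hypothetical function names; the counts are the ones shown in the slide's diagrams.

```python
from typing import Set


def closeness(followers_a: Set[str], followers_b: Set[str]) -> float:
    """Probability that a random follower of B also follows A."""
    if not followers_b:
        return 0.0
    return len(followers_a & followers_b) / len(followers_b)


def closeness_from_counts(both: int, only_b: int) -> float:
    """Same measure from overlap counts: |A and B| / (|A and B| + |B only|)."""
    followers_of_b = both + only_b
    return both / followers_of_b if followers_of_b else 0.0


# Counts taken from the slide (A = NYTimes).
nyt_fox = closeness_from_counts(both=142_951, only_b=435_222)
nyt_wapo = closeness_from_counts(both=249_626, only_b=154_224)
print(round(nyt_fox, 2), round(nyt_wapo, 2))
```

Note that the measure is asymmetric: Closeness(A, B) divides by the followers of B, so swapping the two sources generally gives a different value.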
51 Closeness of political media sources Picked political media sources Ranked other political media sources based on closeness value We can automatically infer political leaning of media sources nprpolitics (Left) close distant nytimes (Left) jdickerson (Left) Nightling (Left) nrpscottsismon (Left) GMA (Center) bbcbreaking (Center) foxnews (Right) washtimes (Right) close distant washingtonpost (Left) foxnews (Right) usnews (Right) bbcbreaking (Center) earlyshow (Left) nytimes (Left) arianhuff (Left) ObamaNews (Left) nprpolitics (Left)
52 Summary: Media landscape in Twitter Users only follow limited number of media sources. But they are exposed to 8x more media sources via social interaction Most users only follow political media with a certain bias Can automatically infer bias in media sources – Could be used for recommending content from diverse media sources
Emergence of social conventions With Farshad Kooti (MPI-SWS) Meeyoung Cha (KAIST) Winter Mason (Stevens Inst. of Tech.)
54 Interesting questions How do social conventions arise naturally? What is the context of their invention? How do they become widely accepted? Can we predict their adoption?
55 The retweeting variations o Searched for syntax token @username o “Adopter” refers to a user using the variation at least once
Variation | # of adopters | # of retweets
RT | 1,836 K | 53,221 K
via | 751 K | 5,367 K
Retweeting | 50 K | 296 K
Retweet | 36 K | 110 K
HT | 8 K | 22 K
R/T | 5 K | 28 K
 | 3 K | 18 K
Total | 2,059 K | 59,065 K
56 Why retweeting convention? o Information-sharing channels are explicit in Twitter o Specific to Twitter: exposures within the community o Contained in Twitter, hence capturing all usages
57 What are the very first use cases? Via Mar’07 Sep’08 RT Jan’08 R/T Jun’08 Retweeting Jan’08 Retweet Nov’07 HT Oct’07
58 Via started from natural language @JasonCalacanis (via @kosso) - new Nokia N-Series phones will do Flash, Video and YouTube
59 HT started from blog communities The Age Project: how old do I look? http://tweetl.com/21b ( HT @technosailor )
60 The first Twitter-specific variation Retweet @HealthyLaugh she is in the Boston Globe today, for a Stand up show she’s doing tonight. Add the funny lady on Tweeter!
61 RT was an adaptation to constraints RT @BreakingNewsOn: "LV Fire Department: No major injuries and the fire on the Monte Carlo west wing contained east wing nearly contained."
62 Some start from explicit discussions @ev of @biz re: twitterkeys ★ http://twurl.nl/fc6trd
63 Early adopters are more tech-savvy Random users Early adopters
64 Early adopters are more innovative
Feature | Early adopters | Random users
Has Bio | 94% | 25%
Profile Pic | 99% | 50%
Changed profile theme | 91% | 40%
Has Location | 95% | 36%
Has Lists | 57% | 4%
Has URL | 85% | 14%
65 Early adopters are more popular Much higher number of followers 80% of early adopters in top 1% based on PageRank
66 Defining the diffusion network o Each adopter is a node in the graph. o There is a link from A to B if A was exposed to the variation by B.
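The diffusion network definition can be sketched in a few lines. The slides do not define "exposed" precisely, so this example assumes A was exposed by B when A follows B and B first used the variation strictly before A did; the function name and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def diffusion_edges(
    first_use: Dict[str, int],
    follows: Dict[str, Set[str]],
) -> Set[Tuple[str, str]]:
    """Return edges (A, B) meaning adopter A was exposed to the variation by B."""
    edges: Set[Tuple[str, str]] = set()
    for a, t_a in first_use.items():
        for b in follows.get(a, set()):
            t_b = first_use.get(b)
            # Assumed exposure rule: B adopted before A, and A follows B.
            if t_b is not None and t_b < t_a:
                edges.add((a, b))
    return edges


# Hypothetical adopters with the time of their first use of a variation.
first_use = {"u1": 1, "u2": 2, "u3": 3}
follows = {"u1": {"u3"}, "u2": {"u1"}, "u3": {"u1", "u2"}}
print(sorted(diffusion_edges(first_use, follows)))
```

Here u1 follows u3, but u3 adopted later, so no edge is created from u1.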
67 Diffusion network of first 500 adopters of Retweet
68 Diffusion network of first 500 adopters of RT
69 Early adopter network o Average number of exposures: 2.9 – 6.4 o Average clustering coefficient: 0.233 - 0.320 o Criticality: fraction of users who were only exposed because of the most critical user: 0.5% - 4.9% Early adopters’ diffusion networks are dense and clustered. There is no single critical user.
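Two of these statistics can be approximated from the edge list of a diffusion network. This is a simplified sketch under stated assumptions: the average number of exposures is taken as edges per adopter, and a user counts as "only exposed because of" X when X is their single exposure source. Indirect, cascading effects are ignored, which may differ from the definition used in the study. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def average_exposures(edges: Set[Tuple[str, str]], adopters: Set[str]) -> float:
    """Average number of exposure links per adopter."""
    return len(edges) / len(adopters) if adopters else 0.0


def criticality(edges: Set[Tuple[str, str]], adopters: Set[str]) -> float:
    """Fraction of adopters whose only exposure source is the most critical user."""
    sources: Dict[str, Set[str]] = {}
    for exposed, source in edges:
        sources.setdefault(exposed, set()).add(source)
    # Count, per source, the adopters who had no other exposure.
    sole_source = Counter(
        next(iter(srcs)) for srcs in sources.values() if len(srcs) == 1
    )
    if not sole_source or not adopters:
        return 0.0
    return max(sole_source.values()) / len(adopters)


# Hypothetical diffusion network: edge (A, B) means A was exposed by B.
adopters = {"u1", "u2", "u3", "u4"}
edges = {("u2", "u1"), ("u3", "u1"), ("u3", "u2"), ("u4", "u1")}
print(average_exposures(edges, adopters), criticality(edges, adopters))
```

In a dense, clustered network most adopters have several exposure sources, so few depend on any single user and criticality stays low.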
70 Convention had different spread patterns from the URLs o URLs’ early adopters are not necessarily core users o The diffusion network is not dense and clustered o There are critical users in the process
72 Variations have different growth rates Some variations are growing and some dying at the end Only two variations became dominant RT via
73 Wide-spread vs. normal adoptions Successful variations reached peripheral users In tune with two-step flow theory
74 Summary o Conventions emerged in an organic, bottom-up manner o Early adopters are core members of the community: Active, tech-savvy, popular, and innovative o Social conventions start spreading through dense and clustered networks and there is no critical user o When variations got popular, they reached outside of the core community
75 Ongoing work: Convention prediction problem “Given a social network with records of users and their interactions, how reliably can we infer which variant of the convention a user U adopts at time T?”
76 Ongoing work: What features matter for prediction? Personal features – join date, in-/out-degrees, geo-location, # of tweets etc. Social features – number of exposures, number of adopter friends Global features – date of adoption, which is related to global popularity
77 Preliminary results: Prediction accuracy Baseline predicts adoption of dominant convention all the time Minimal improvement in prediction accuracy over baseline
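The baseline is a majority-class predictor: it always outputs the dominant convention. A minimal sketch of its accuracy, assuming the dominant convention is determined from the same labels it is scored on; the function name and the label counts are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_baseline_accuracy(adopted: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent convention."""
    if not adopted:
        return 0.0
    _, count = Counter(adopted).most_common(1)[0]
    return count / len(adopted)


# Hypothetical labels: which variation each user adopted.
skewed = ["RT"] * 8 + ["via"] * 2
balanced = ["RT"] * 5 + ["via"] * 5
print(majority_baseline_accuracy(skewed), majority_baseline_accuracy(balanced))
```

When one convention dominates, this baseline is already hard to beat, which is one way to read the minimal improvement reported here; with two equally frequent conventions it drops to 0.5.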
78 Preliminary results: Prediction accuracy without a dominant convention Baseline predicts adoption with 0.5 accuracy Improvement in prediction accuracy over baseline, especially for less popular conventions
79 Top-5 predictive features 1. Date of adoption: Global feature 2. # of exposures: Social feature 3. # of posted URLs: Personal feature 4. Join date of adopter: Personal feature 5. # of adopter friends: Social feature