Topical Semantics of Twitter Links MARCH 23, 2011 In-seok An SNU Internet Database Lab. Michael J. Welch, Uri Schonfeld Yahoo! Inc., UCLA Computer Science.

1 Topical Semantics of Twitter Links MARCH 23, 2011 In-seok An SNU Internet Database Lab. Michael J. Welch, Uri Schonfeld Yahoo! Inc., UCLA Computer Science Dept. WSDM '11

2 Outline  Introduction –Twitter  Modeling Twitter  Analysis of The Graph  Exploring Link Semantics  Experiments on Link Semantics  Conclusion 2

3 Introduction  Twitter –Microblogging site –Ranked 10th worldwide in total traffic –28 million unique monthly visitors –Provider of information for breaking news events 3

4 Introduction  Simple graphical modeling for Web –Text-based pages connected by hyperlinks ( directed edges ) –Applied to Twitter, it fails to capture all that this information has to offer –Produces less than ideal results  A rich graphical model for Twitter –Multiple semantic edges  Follow, RT, Mention, List –Not all edges are created equal  In this paper –Web graph vs. Twitter graph –Follow link vs. Retweet link 4

5 Introduction Twitter  Twitter –Blogging platform  Maximum of 140 characters  Micro-blogging platform –Multiple interfaces  Web, SMS, mobile application, instant messaging, etc. 5

6 Introduction Twitter  Dual role –Reader  A user may choose to follow another user’s posts –Accessible via a private stream ( timeline ) –Sorted by their publication timestamp  Friends / followers –Writer  Posting messages  Retweeting messages  Replying to or mentioning other Twitter users 6

7 Introduction Twitter  Mention –A user is referred to by their username prefixed with the character “@”  Retweet –A user chooses to repeat another user’s post –New style retweet –Old style retweet 7

8 Introduction Twitter  List –Added in late 2009 –Allows users to construct and organize a group of users referred to as a list –Helps a user focus on the posts of certain subsets of their friends  Two broad categories –Topical lists  Centered around the discussion of common interests or subjects  “politics” –Classification lists  Formed to group users who share a common trait  “Celebrities”, “professional athletes” –Lists generate meaningful manually-created categorizations of users 8

9 Outline  Introduction  Modeling Twitter –The Full Twitter Graph Model –Additional Twitter Information –The Simplified Twitter Graph  Analysis of The Graph  Exploring Link Semantics  Experiments on Link Semantics  Conclusion 9

10 Modeling Twitter  Web graph model –Nodes  Web pages –Edges  Hyperlinks connecting them –Enables the application of many graph analysis techniques  Inlink & outlink distributions  PageRank  N by N matrix M –The Web graph is commonly represented as a matrix –N is the number of pages on the web –$M_{ij} \neq 0$ if page $i$ links to page $j$, and $M_{ij} = 0$ otherwise 10

11 Modeling Twitter The Full Twitter Graph Model  The Twitter graph is inherently more complex –At least two different types of entities ( nodes )  Users and Tweets –At least four types of relationships ( edges )  Follows, Publish, Retweets and Mentions  Twitter Graph Edges –Follow edge  User a follows the posts of user b –Publish edge  Authorship of the post –Retweet edge  Post a is a retweet of post b –Mention edge  Post a mentions user b 11

12 Modeling Twitter The Full Twitter Graph Model  Matrix representation of the Twitter graph –Identical to the Web graph –|U| + |P| by |U| + |P| matrix  |U| : the number of users  |P| : the number of posts –A non-zero value in $M_{ij}$  Represents an edge between node i and node j 12
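A minimal sketch of the matrix representation described on this slide, written in Python for illustration. The integer edge-type codes, the choice to index users before posts, the direction of the publish edge (user to post), and the toy data are all assumptions made for this example; the slides only state that a non-zero entry marks an edge.

```python
# Edge-type codes are illustrative assumptions, not taken from the slides.
FOLLOW, PUBLISH, RETWEET, MENTION = 1, 2, 3, 4

def build_full_matrix(users, posts, edges):
    """Return (M, index) for a graph with user and post nodes.

    edges is an iterable of (source, target, edge_type) triples.
    M is a (|U| + |P|) x (|U| + |P|) list of lists; M[i][j] is non-zero
    when there is an edge from node i to node j.
    """
    nodes = list(users) + list(posts)  # users first, then posts (assumption)
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    M = [[0] * size for _ in range(size)]
    for source, target, edge_type in edges:
        M[index[source]][index[target]] = edge_type
    return M, index

# Hypothetical toy data.
users = ["alice", "bob"]
posts = ["p1", "p2"]
edges = [
    ("alice", "bob", FOLLOW),  # alice follows bob
    ("bob", "p1", PUBLISH),    # bob wrote p1 (direction assumed user -> post)
    ("alice", "p2", PUBLISH),  # alice wrote p2
    ("p2", "p1", RETWEET),     # p2 is a retweet of p1
    ("p2", "bob", MENTION),    # p2 mentions bob
]
M, index = build_full_matrix(users, posts, edges)
```

A dense list of lists is only workable for toy inputs; at the scale reported later in the deck a sparse representation would be needed.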

13 Modeling Twitter Additional Twitter Information  Time –Twitter includes timestamp information  When each post was written  When accounts were created –When a follow link was created  No explicit way to determine  Can be approximated with repeated crawling –Valuable for studying factors  Evolution of the graph  Charting popularity over time 13

14 Modeling Twitter Additional Twitter Information  Hyperlinks –Standard hyperlinks embedded in the posts –Third node type  Web page  Uniquely identified by a URL –Difficulty modeling hyperlinks in Twitter  Common use of URL shortening services –TinyURL and bit.ly  Prevents making use of keywords or other interesting artifacts the URL may contain directly  Makes additional processing of the data necessary 14

15 Modeling Twitter Additional Twitter Information  Post Content –Use the content of a post  To extract metadata –User name mention –Identification of retweets –Remaining textual content of a post  Determining the topics of interest to a user as well –Difficulties  Small size of the posts –Sparsity of data –Sparsity of tokens  Frequent use of nonstandard shorthand notation 15

16 Modeling Twitter The Simplified Twitter Graph  Simplified Twitter Graph –Only includes user nodes –Still captures the most important information  From the original representation as it pertains to the users –The user-user follow links remain  As they are in the Full Twitter graph –Add retweet edges to the Simplified Twitter Graph  If user a retweets user b at least one time –There is a retweet edge from user a to user b 16
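The reduction from the full graph to the user-only graph can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name `simplify` and the input shapes (a set of follow pairs, a post-to-author mapping, and a list of post-level retweet pairs) are assumptions.

```python
def simplify(follows, author_of, post_retweets):
    """Collapse the full graph to user nodes only.

    follows:       set of (a, b) pairs meaning user a follows user b
    author_of:     dict mapping each post id to the user who published it
    post_retweets: iterable of (post_a, post_b) pairs, post_a retweets post_b
    Returns (follow_edges, retweet_edges), both sets of user pairs.
    """
    retweet_edges = set()
    for post_a, post_b in post_retweets:
        # One retweet is enough to create the user-level edge; using a set
        # means repeated retweets of the same user do not add more edges.
        retweet_edges.add((author_of[post_a], author_of[post_b]))
    return set(follows), retweet_edges

# Hypothetical toy data.
follow_edges, retweet_edges = simplify(
    follows={("alice", "bob")},
    author_of={"p1": "bob", "p2": "alice", "p3": "alice"},
    post_retweets=[("p2", "p1"), ("p3", "p1")],
)
```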

17 Outline  Introduction  Modeling Twitter  Analysis of The Graph –Link Distributions –Graph Formation  Exploring Link Semantics  Experiments on Link Semantics  Conclusion 17

18 Analysis of The Graph  Data specification –Collected between October 2009 and January 2010 –1.1 million Twitter users –More than 273 million follow edges –2.9 million retweet edges  Crawling method –Beginning with an initial seed set of the top 1000 users in twitterholic.com –Crawling in a BFS manner –Traversing the follow links in a forward direction 18
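The crawling strategy on this slide is a standard breadth-first traversal. A minimal Python sketch follows, assuming a caller-supplied `get_friends` function (hypothetical, standing in for whatever API or scraper returned the accounts a user follows) and a `max_users` cap that is not mentioned in the slides.

```python
from collections import deque

def bfs_crawl(seeds, get_friends, max_users):
    """Visit users breadth-first along follow links in the forward direction.

    seeds:       iterable of starting user ids
    get_friends: callable returning the users that a given user follows
    max_users:   stop after visiting this many users (assumed cap)
    Returns the list of visited users in visit order.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        for friend in get_friends(user):
            if friend not in seen:  # enqueue each user at most once
                seen.add(friend)
                queue.append(friend)
    return visited

# Hypothetical in-memory follow graph used in place of a live crawl.
toy_graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
order = bfs_crawl(["s"], lambda u: toy_graph.get(u, []), max_users=10)
```

Because the traversal only follows links forward from popular seed accounts, users that nobody in the crawled set follows are never reached.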

19 Analysis of The Graph Link Distributions  Follow Edges –Power-law distribution –Two abnormal spikes in the outlink distribution  20-friend –Twitter provides an initial set of 20 “recommended” users to follow  2000-friend –The restrictions Twitter places on following more than 2000 users 19

20 Analysis of The Graph Link Distributions  Retweet Edges –Retweet Inlink  Power-law distribution –Retweet Outlink  Does not follow power-law distribution –While the number of friends one has is generally power-law, the number of users one finds truly interesting does not appear to scale in a similar fashion 20
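The inlink and outlink distributions discussed on these slides can be tabulated from an edge list as sketched below. This is an illustrative Python sketch with hypothetical names; it counts how many users have each degree, which is the histogram one would then plot on log-log axes to look for a power law. Users with degree zero do not appear in the edge list and are therefore omitted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (in_hist, out_hist) for a list of directed (source, target) edges.

    in_hist[k]  = number of users with exactly k incoming edges
    out_hist[k] = number of users with exactly k outgoing edges
    """
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return Counter(in_degree.values()), Counter(out_degree.values())

# Hypothetical toy edge list; the same function applies to follow or retweet edges.
in_hist, out_hist = degree_distribution([("a", "b"), ("a", "c"), ("b", "c")])
```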

21 Analysis of The Graph Link Distributions  Posting Frequency – 417,613 users who publish at least one tweet –Most recent 200 posts per user –58,000 users published only a single post during the month –A large number of users wrote more than 100 posts 21

22 Analysis of The Graph Graph Formation  Readers and Writers –Three potential scenarios  A user acts primarily as a reader –Few or no posts  A user frequently retweets posts –Writes little to no original content  A user contributes significant new content –User’s reading and writing behavior  Each dot : unique user  X-axis : # of posts published by friends  Y-axis : # of posts published by user  Shade : originality –The lighter shades indicate less originality  Size : PageRank of each user ( based on follow-edge ) 22

23 Analysis of The Graph Graph Formation  General trend –For users who post very frequently  A larger fraction of their posts are actually retweets –Many users retweeted at least one post which they did not read from one of their friends  Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads –Many websites are adding modules which display Twitter results 23

24 Outline  Introduction  Modeling Twitter  Analysis of The Graph  Exploring Link Semantics –Retweet vs. Follow based Ranking –Link Virality  Experiments on Link Semantics  Conclusion 24

25 Exploring Link Semantics  Web graph –A link from page a to page b  Endorsement of the quality of page b  Extent of its relevance to page a  Twitter graph –Follow link  Endorsement of quality or interest  The actual semantics of the link –User a, acting as a reader, is interested in user b acting as a writer –Retweet link  Endorsement of quality –User is interested in the topic –User expects his readers to be interested in this post  Retweet edge signifies a connection from user a as a writer to user b as a writer 25

26 Exploring Link Semantics Retweet vs. Follow based Ranking  PageRank based on two edges –Retweet-based  Simple power-law distribution –Follow-based  Two different segments with different power-law coefficients 26

27 Exploring Link Semantics Retweet vs. Follow based Ranking  PageRank over Retweet links vs. Follow links –Follow links  Twitter recommended celebrities ( barackobama ) –Rich get richer phenomenon  Top ranker has lower rank in RT-based PageRank –Retweet links  Tweetmeme –Social bookmarking site  Top ranker has lower rank in Follow-based PageRank 27

28 Exploring Link Semantics Retweet vs. Follow based Ranking  Follow-based –Public figures or celebrities  Retweet-based –News generating entities  Aplusk is the only user who appears in the top 10 for both rankings  These ranks can be affected by spam or marketing techniques –ddlovatoRT simply retweets all posts mentioning Demi Lovato –Twitter’s research team estimates that less than 1% of Tweets are now spam 28

29 Exploring Link Semantics Link Virality  Retweet Virality and Follow Virality are defined over the following sets –RoF(u) : the set of users from whom u has seen at least one post via a retweet –FoF(u) : the set of all users who are reachable by traversing exactly two directed follow edges –Fr(u) : the set of users whom user u follows  Retweet Virality is consistently higher than Follow Virality –Retweets demonstrate a stronger notion of importance or influence to users –Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends” 29
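The three sets named on this slide can be computed from the simplified graph as sketched below. This Python sketch covers only the set definitions, since the virality formulas themselves did not survive in the transcript. It also assumes that a user "sees" every retweet made by anyone they follow, which is an approximation introduced here; the function and variable names are hypothetical.

```python
def fr(follows, u):
    """Fr(u): the users whom u follows."""
    return set(follows.get(u, ()))

def fof(follows, u):
    """FoF(u): users reachable by exactly two directed follow edges.

    Taken literally, this can include u itself or direct friends of u.
    """
    return {w for v in follows.get(u, ()) for w in follows.get(v, ())}

def rof(follows, retweeted, u):
    """RoF(u): users whose posts reach u through a friend's retweet.

    Assumes u sees every retweet made by the users u follows.
    """
    return {w for v in follows.get(u, ()) for w in retweeted.get(v, ())}

# Hypothetical toy data: follows[x] is who x follows,
# retweeted[x] is whose posts x has retweeted.
follows = {"u": {"v"}, "v": {"w", "u"}}
retweeted = {"v": {"x"}}
```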

30 Outline  Introduction  Modeling Twitter  Analysis of The Graph  Exploring Link Semantics  Experiments on Link Semantics –Empirical Results –Topic Sensitive PageRank  Conclusion 30

31 Experiments on Link Semantics  Topical relevance –Follow links quickly diffuse into a broad range of topics –Retweet links remain more concentrated on the original topic  Data –1.1 million users –273 million follow edges –2.9 million retweet edges 31

32 Experiments on Link Semantics Empirical Results  Empirical evaluation –Starting from a seed set of users  Members of the same topical list –photography and design –Generate two sets of users  At least one seed member follows them  At least one seed member has retweeted one of their posts –Random sample of 25 users from each of these sets –Manually assessed them for topical relevance  Result –# of relevant users in the follow-generated samples were 4 and 5 –# of relevant users in the retweet-generated samples were 19 and 20 32

33 Experiments on Link Semantics Topic Sensitive PageRank  PageRank –Recursive ranking formula –A page is as important as the pages pointing to it  Topic Sensitive PageRank ( TSPR ) –Biased PageRank  Generates query-specific importance scores for pages at query time –We use topic sensitive PageRank to quantify the difference in topical relevance carried by follow and retweet links [1] [1] T. H. Haveliwala. Topic-sensitive PageRank. WWW 2002. 33

34 Experiments on Link Semantics Topic Sensitive PageRank  Experiments –Beginning with a topical Twitter list –Compute topic sensitive PageRank for  Follow edges  Retweet edges –If the links carry the topicality well  The high-ranking users are likely to be topically relevant to the original seed topic –Evaluate the resulting highest ranked users for relevance to the original topic with a user survey 34
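The experiment above can be illustrated with a small biased PageRank in Python. This is a sketch under stated assumptions, not the authors' implementation: it uses plain power iteration, teleports only to the seed users, sends the rank of dangling nodes back to the seeds, and uses a damping factor of 0.85 and 50 iterations, none of which are specified in the slides. The same function would be run once on the follow edges and once on the retweet edges.

```python
def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Biased PageRank over directed (source, target) edges.

    Teleportation lands only on the seed users, so scores measure
    importance relative to the seed topic. Returns {user: score}.
    """
    seeds = set(seeds)
    nodes = set(seeds)
    out_links = {}
    for source, target in edges:
        nodes.update((source, target))
        out_links.setdefault(source, []).append(target)

    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: return its rank to the seed set (assumption).
                for s in seeds:
                    new_rank[s] += damping * rank[n] / len(seeds)
        rank = new_rank
    return rank

def top_non_seed(rank, seeds, k):
    """Highest-ranked users that are not themselves seeds."""
    candidates = [n for n in rank if n not in set(seeds)]
    return sorted(candidates, key=lambda n: rank[n], reverse=True)[:k]

# Hypothetical toy graph: "c" is never reachable from the seed.
toy_edges = [("s", "a"), ("a", "b"), ("c", "b")]
scores = topic_sensitive_pagerank(toy_edges, seeds=["s"])
```

Because all teleport mass goes to the seeds, a user that no seed can reach through the chosen edge type keeps a score of zero, which is what makes the ranking sensitive to how well that edge type carries the topic.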

35 Experiments on Link Semantics Topic Sensitive PageRank  Experimental Setup –Collected 9 topical lists from listorious.com  19 ~ 437 users –Average 155, median 49  Seed users have an average of 14,284 followers –Compute personalized PageRank –Selected the 30 highest ranking non-seed users –Conduct a survey  Participants were shown a topic description and the 30 highest ranked users for either a follow-based or a retweet-based PageRank  Ordered randomly  Mixed with a random set of 10 of the seed users for that topic  Make a binary judgment of each user’s relevance  A total of 12 people participated in the survey  Each list was evaluated by at least 2 people 35

36 Experiments on Link Semantics Topic Sensitive PageRank  Accuracy of the highly ranked users –Precision  The average relevancy of a set of users –Relevance  The fraction of users who were judged relevant by at least one survey taker – the set of users from U judged relevant in evaluation k of a particular list 36
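The formulas for these two measures are missing from the transcript, so the Python sketch below is only one plausible reading of the verbal definitions: precision is taken as the fraction of U judged relevant, averaged over the evaluations of a list, and relevance as the fraction of U judged relevant in at least one evaluation. The function names and toy data are hypothetical.

```python
def precision(users, evaluations):
    """Average, over evaluations, of the fraction of `users` judged relevant.

    users:       set of ranked users U shown to survey takers
    evaluations: list of sets, one per evaluation k, each holding the
                 users judged relevant in that evaluation
    """
    users = set(users)
    per_evaluation = [len(set(r) & users) / len(users) for r in evaluations]
    return sum(per_evaluation) / len(per_evaluation)

def relevance(users, evaluations):
    """Fraction of `users` judged relevant by at least one survey taker."""
    users = set(users)
    judged = set().union(*evaluations) & users
    return len(judged) / len(users)

# Hypothetical toy data: four ranked users, two evaluations.
ranked = {"a", "b", "c", "d"}
judgments = [{"a", "b"}, {"b", "c", "d"}]
p = precision(ranked, judgments)
r = relevance(ranked, judgments)
```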

37 Experiments on Link Semantics Topic Sensitive PageRank  Result –Precision can be improved by simply using retweet links instead of follow links  Precision of the top ranked users improved by over 30% 37

38 Experiments on Link Semantics Topic Sensitive PageRank  Cohesiveness of Seed –To verify the seed users  Include 10 randomly selected seed users for each evaluation  Result –Average Precision : 0.931  Minimum of 0.838  Maximum of 1.0 –The seed users represented their topics well –Our survey takers understood and agreed upon the topic definitions 38

39 Conclusion  We have described a detailed model of Twitter as a graph –Key statistics about the graph –Provided some initial insights as to how the graph forms  Important distinctions between edge types in the graph –Follow and retweet –The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank –Retweet edges preserve topical relevance  Better than follow edges 39

