1 Extracting Relevant & Trustworthy Information from Microblogs
Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, Fabricio Benevenuto
MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil

2 My research: The big picture
Three fundamental trends & challenges in the social Web:
1. User-generated content sharing → can we protect the privacy of users sharing personal data?
2. Word-of-mouth based content exchange → can we understand & leverage word-of-mouth better?
3. Crowd-sourcing content rating and ranking → can we find trustworthy & relevant content sources?

3 Twitter microblogging site
- An important source for real-time Web content
  - 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
  - Anyone can post tweets
  - Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant & trustworthy content
  - Trustworthiness: Thwart spammers and their spam
  - Relevance: Identify authoritative experts on specific topics

4 Thwarting Spammers in Twitter [WWW 2012] Part 1

5 Background: How spammers operate
- Twitter spammers try to gain lots of followers
  - To promote spam directly
  - To gain influence in the network
- Search engines rank tweets based on how influential the user is
  - Most metrics depend on the user's network connectivity
  - More followers help a user to gain influence
This incentivizes spammers to acquire links to gain influence

6 Acquiring followers via link farming
- Unrelated users exchange links with each other
  - To gain more influence based on network connectivity
[Figure: example users Alice, Bob, Charlie, David]
Influence based on connectivity is improved

7 To thwart spammers
- We need to
  1. Understand link farming activity in Twitter
  2. Combat link farming activity in Twitter
- Prior works: Focused on detecting spammers
  - Via their characteristics, e.g., follower-to-following ratios
  - Rat race between spammers and spam fighters
- We focus on the spammer support network

8 Identifying spammers
- Used Twitter network gathered from a previous study [ICWSM '10]
  - Data collected in August 2009
  - 54M nodes, 1.9B links, 1.7B tweets
- Identified accounts suspended by Twitter
  - An account could be suspended for various reasons
- Found suspended users that posted blacklisted URLs
  - Includes 41,352 such spammers

9 Spammers farm links at large scale
- Spam-targets: 27% of all users are followed by at least one of ~40,000 spammers!
- Spam-followers: 82% of all followers have been targeted
- Spammers have more followers than random users
  - Avg follower count for spammers: 234, random users: 36

10 Who responds to links from spammers?
- A small number of followers respond most of the time
- Top 100k followers exhibit high reciprocation of 0.8 on avg.
- Top 100k users account for 60% of all links to spammers
We call these users link farmers

11 Are link farmers real users or spammers?
To find out if they are spammers or real users, we
1. Checked if they were suspended by Twitter
   - 76% of users not suspended, 235 of them verified by Twitter
2. Manually verified 100 random users
   - 86% of users are real, with legitimate links in their tweets
3. Analyzed their profiles
   - More active in updating their profiles than random users

12 Are link farmers lay or popular users?
- Conventional wisdom:
  - Lay users are more likely to follow back due to social etiquette
  - Popular users might be more conservative in following others
- Probability of following back increases with user popularity
Link farmers are popular users with lots of followers

13 Are link farmers lay or popular users?
- Top 5 link farmers according to Pagerank:
  1. Barack Obama: Obama 2012 campaign staff
  2. Britney Spears
  3. NPR Politics: Political coverage and conversation
  4. UK Prime Minister: PM's office
  5. JetBlue Airways
Link farmers include legitimate, popular users & organizations

14 What possibly motivates link farmers?
- One explanation:
  - Link farmers have similar incentives to spammers
  - They seek to amass social capital & influence in the network
- Link farmers rank among the top 5% most influential Twitter users
  - In terms of various metrics like Pagerank & Followerrank

15 Combating link farming
- Key challenge:
  - Real, popular and active users are involved in link farming
  - Detecting and suspending spammers alone will not help
- Insight:
  - Discourage users from following others carelessly
  - Penalize users following anyone found to be bad
  - Lower the influence scores of users following spammers
This incentivizes users to be more careful about who they link to

16 Collusionrank
- Borrows ideas from spam defense strategies for the Web [WWW '05]
- A low Collusionrank score for a user indicates
  - heavy linking to spammers or spam-followers
- Requires a seed set of known spammers
  - Twitter operator periodically identifies and updates spammers

17 Collusionrank
Algorithm:
1. Negatively bias the initial scores of the set of spammers
2. In Pagerank style, iteratively penalize users who follow spammers or those who follow spam-followers
Collusionrank is based on the scores of a user's followings, because a user is penalized based on whom they follow
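The two steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the damping factor `alpha`, the fixed iteration count, and the normalization by the follower count of the followed account are assumptions borrowed from common Pagerank-style formulations, and the names `collusionrank` and `follows` are hypothetical.

```python
from collections import defaultdict


def collusionrank(follows, spammers, alpha=0.85, iterations=50):
    """Sketch of a Pagerank-style penalty score.

    follows:  dict mapping each user to the set of users they follow.
    spammers: non-empty set of known spammers (the seed set).
    Returns a dict of scores; more negative means heavier linking to
    spammers or spam-followers. alpha and iterations are assumed values.
    """
    if not spammers:
        raise ValueError("a non-empty seed set of spammers is required")

    # Collect every user that appears as a follower or as a followed account.
    users = set(follows)
    for followed in follows.values():
        users |= set(followed)

    # Number of followers of each account (assumed normalization term).
    followers_count = defaultdict(int)
    for followed in follows.values():
        for v in followed:
            followers_count[v] += 1

    # Step 1: negatively bias the initial scores of the known spammers.
    bias = {u: (-1.0 / len(spammers) if u in spammers else 0.0) for u in users}
    score = dict(bias)

    # Step 2: iteratively propagate penalties from followed accounts
    # back to the users who follow them.
    for _ in range(iterations):
        new_score = {}
        for u in users:
            penalty = sum(score[v] / followers_count[v] for v in follows.get(u, ()))
            new_score[u] = alpha * penalty + (1 - alpha) * bias[u]
        score = new_score
    return score
```

In a toy graph where one user follows a seed spammer and a second user follows that spam-follower, both receive negative scores, while users with no path to a spammer keep a score of zero.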

18 Evaluating Collusionrank
- Goal:
  - To penalize spammers and spam-followers
  - Should not penalize users who are not following spammers
- Used a small subset of 600 spammers as seed set
- Compare ranks between
  - Pagerank
  - Pagerank + Collusionrank
    - Measures influence after accounting for link farming activity

19 Effect of Collusionrank on spammers
- 40% of spammers appear in the top 20% according to Pagerank
- Most of the spammers get pushed to the last 10% of positions based on Collusionrank

20 Effect on link farmers
- 87% of link farmers are in the top 2% of users according to Pagerank
- 98% of the link farmers get pushed to the last 10% of positions based on Collusionrank

21 Effect on normal users
- Focus on top 100,000 users according to Pagerank
- Analyze the percentile difference in ranks between
  - Pagerank (P) & Pagerank + Collusionrank (PC)
  - Percentile Difference = (|PC - P| / N) × 100
- Only 20% of users get demoted heavily
- Heavily demoted users follow many more spammers than others
Collusionrank selectively filters out spammers and spam-followers
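As a small worked example of the percentile-difference formula, the sketch below assumes that P and PC are a user's rank positions under the two schemes and that N is the total number of ranked users; the slide does not define these symbols explicitly, and the function name is hypothetical.

```python
def percentile_difference(rank_p, rank_pc, n_users):
    """Percentile Difference = (|PC - P| / N) x 100.

    rank_p:  rank position under Pagerank (assumed meaning of P).
    rank_pc: rank position under Pagerank + Collusionrank (assumed meaning of PC).
    n_users: total number of ranked users (assumed meaning of N).
    """
    if n_users <= 0:
        raise ValueError("n_users must be positive")
    return 100 * abs(rank_pc - rank_p) / n_users


# Hypothetical user who drops from position 500 to position 90,500
# among 100,000 ranked users.
shift = percentile_difference(500, 90_500, 100_000)
```

Under these assumptions, a user whose rank barely moves gets a value near 0, and a user pushed from the top to the bottom of the ranking gets a value near 100.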

22 Summary: Thwarting spammers
- Spammers infiltrate the Twitter network by farming links
  - Link farming helps them gain influence to promote spam
  - Search involves ranking users based on connectivity & influence
- Analyzed link farming in Twitter by studying spammers
  - Top link farmers are real, active and popular users
- Proposed an algorithm, Collusionrank, to limit link farming
  - Incentivizes users to be careful about who they connect with

23 Finding Topic Experts in Twitter [WOSN 2012] [SIGIR 2012] Part 2

24 Topic experts in Twitter
- Twitter is now an important source of current news
  - 500 million users post 400 million tweets daily
- Quality of tweets posted by different users varies widely
  - News, pointless babble, conversational tweets, spam, …
- Challenge: to find topic experts
  - Sources of authoritative information on specific topics

25 Identifying topic experts in Twitter
- Existing approaches
  - Research studies: Pal [WSDM 11], Weng [WSDM 10]
  - Application systems: Twitter Who-To-Follow, Wefollow, …
- Existing approaches primarily rely on information provided by the user herself
  - Bio, contents of tweets, network features, e.g., #followers
- We rely on the “wisdom of the Twitter crowd”
  - How do others describe a user?

26 Twitter Lists
- A feature to organize tweets received from the people whom a user is following
- Create a List, add a name & description, add Twitter users to the List
- List meta-data offers cues for who-is-who
- Tweets from all listed users will be available as a separate List stream

27

28 Mining Lists to infer expertise
- Collect Lists containing a given user U
- Identify U's topics from List meta-data
  - Basic NLP techniques
  - Extract nouns and adjectives
- Extracted words are collected to obtain a topic document for the user
  [movies tv hollywood stars entertainment celebrity hollywood …]
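The mining step can be illustrated with a short Python sketch. It is only an approximation: the slide says nouns and adjectives are extracted, which would need a part-of-speech tagger, whereas this sketch swaps in a simple stopword filter so that it stays self-contained. The stopword set, the input shape (pairs of List name and description), and the function name are all assumptions.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword set; a real system would use
# part-of-speech tagging to keep only nouns and adjectives.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for", "to",
    "i", "my", "who", "with", "people", "follow",
}


def topic_document(lists_meta):
    """Build a bag-of-words topic document for one user.

    lists_meta: iterable of (list_name, list_description) pairs taken from
    the Lists that contain the user (hypothetical input shape).
    Returns a Counter mapping each kept word to its frequency.
    """
    counts = Counter()
    for name, description in lists_meta:
        text = f"{name} {description}".lower()
        for token in re.findall(r"[a-z][a-z0-9\-]*", text):
            if len(token) > 1 and token not in STOPWORDS:
                counts[token] += 1
    return counts


# Hypothetical List meta-data for a single user.
doc = topic_document([
    ("Hollywood Stars", "Movies and TV celebrities"),
    ("entertainment", "hollywood movies"),
])
```

Words that recur across many Lists, such as "hollywood" in this toy input, end up with higher counts, which is what lets the topic document reflect how the crowd describes the user.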

29 Lists vs. other features
- Profile bio: [not captured in this transcript]
- Most common words from tweets: Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono
- Most common words from Lists: celeb, funny, humor, music, movies, laugh, comics, television, entertainers

30 Dataset
- Collected Lists of 55 million Twitter users who joined before or in 2009
- Our analysis infers topics for 1.3 million users who are included in 10 or more Lists

31 Evaluating inference quality
- Quality metrics
  - Is the inference accurate?
  - Is the inference informative?
- Evaluation of popular users
  - Celebrities, news media sources, US Senators
- Using user feedback

32 Popular users set 1: Celebrities
Biographical Tags | Topics of Expertise | Popular Perception
government, president, USA, democrat | politics, government | celebs, leader, famous, current events
sports, cyclist, athlete | tdf, triathlon, cancer | celebs, influential, famous, inspiration
- The inferred attributes accurately capture
  - Biographical information
  - Topics of expertise
  - Popular perception about the user

33 Popular users set 2: News media sources
- The inferred attributes indicate
  - Primary topics of the media source
  - Perceived political bias (verified using ADA scores)
Media | Biographical Tags | Topics of Expertise | Popular Perception
CNN | media, journalist, bloggers | politics, sports, tech, weather, current | influential outlets
The Nation | media, journalist, magazines, blogs | politics, government | progressive, liberal
Townhall.com | media, bloggers, commentary, journalists | politics | conservative, republican
GuardianFilm | journalists, reviews | movies, cinema, actors, theatre, hollywood | film critics

34 Popular users set 3: US Senators
- Out of the 100 US senators, 84 have Twitter accounts
- The inferred attributes correctly indicate
  - Their political party
  - The state represented by them
  - Their gender
    - ‘Female’ or ‘Women’ for all 15 female senators
  - Their political ideology
    - progressive/liberal/conservative/tea-party
  - The senate committees to which they belong

35 Popular users set 3: US Senators
Senator | Biographical Tags | Senate Committees | Perception
Chuck Grassley | politics, senator, republican, iowa, gop | health, food, agriculture | conservative
Claire McCaskill | politics, democrats, missouri, women | tech, security, power, health, commerce | progressive, liberal
Jim Inhofe | politics, congress, oklahoma, republican | army, energy, climate, foreign | conservative
John Kerry | politics, senate, democrats, boston | health, climate, tech | progressive

36 User feedback
Response | Accurate | Informative
Total evaluations | 345 | 342
Yes | 274 | 277
No | 18 | 20
Can't tell | 53 | 45
Ignoring "can't tell" responses: Accuracy 94%, Informative 93%
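The reported percentages follow from the counts on this slide once the "can't tell" responses are excluded from the denominator. The arithmetic, written out:

```latex
\[
\text{Accuracy} = \frac{274}{274 + 18} = \frac{274}{292} \approx 93.8\% \approx 94\%,
\qquad
\text{Informativeness} = \frac{277}{277 + 20} = \frac{277}{297} \approx 93.3\% \approx 93\%
\]
```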

37 Evaluating inference coverage
- What fraction of Twitter can our method of inference be applied to?
- A large fraction of popular Twitter users are covered

38 Evaluating inference coverage
- We could also infer attributes of less popular users
  - 6% of users with Follower Ranks between 1 and 10 million
  - They are often experts on niche topics
User: Twitter bio | Followers | Listed | Inferred Attributes
spacespin: news on robotic space exploration | 56 | 11 | science, space exploration, nasa, astronomy, planets
laithm: Al-jazeera network battle cameraman | 201 | 16 | jounalists, photographer, al-jazeera, media
HumphreysLab: Stem Cell, Regenrative Biology of Kidney | 119 | 17 | science, stem cell, genetics, cancer, physicians, biotech, nephrologist

39 Cognos
- Search system for topic experts in Twitter
- Given a query (topic)
  - Identify experts on the topic using Lists
  - Rank identified experts

40 Ranking experts
- Used a ranking scheme solely based on Lists
- Two components of ranking user U w.r.t. query Q
  - Relevance of user to query: cover density ranking between the topic document T_U of the user and Q
  - Popularity of user: number of Lists including the user
Ranking score: Topic relevance(T_U, Q) × log(#Lists including U)
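The shape of this ranking score can be sketched in Python. Note the substitution: the slide names cover density ranking for the relevance component, but this sketch uses a much simpler stand-in (the share of topic-document words that match a query term) purely to show how relevance and popularity are combined. The natural logarithm, the function names, and the candidate data are assumptions.

```python
import math


def simple_relevance(topic_doc, query):
    """Stand-in for cover density ranking (NOT the method named on the slide).

    topic_doc: dict mapping word -> frequency in the user's topic document.
    Returns the fraction of topic-document words that match a query term.
    """
    terms = query.lower().split()
    total = sum(topic_doc.values())
    if not terms or total == 0:
        return 0.0
    return sum(topic_doc.get(t, 0) for t in terms) / total


def rank_experts(candidates, query):
    """Rank users by relevance(T_U, Q) * log(#Lists including U).

    candidates: dict mapping user -> (topic_doc, number_of_lists).
    Returns a list of (user, score) pairs, highest score first.
    """
    scored = []
    for user, (topic_doc, n_lists) in candidates.items():
        popularity = math.log(n_lists) if n_lists >= 1 else 0.0
        scored.append((user, simple_relevance(topic_doc, query) * popularity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for the query "stem cell".
candidates = {
    "stemcell_lab": ({"stem": 5, "cell": 5, "science": 2}, 17),
    "news_outlet": ({"news": 10, "stem": 1}, 1000),
    "food_blog": ({"food": 8}, 500),
}
ranking = rank_experts(candidates, "stem cell")
```

The logarithm dampens the popularity term, so in this toy data a niche account listed 17 times can outrank a far more widely listed account whose topic document only weakly matches the query.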

41 Cognos results for “stem cell” Cognos: http://twitter-app.mpi-sws.org/whom-to-follow/

42 Evaluation of Cognos
- System deployed and evaluated ‘in the wild’
- Evaluators were students & researchers from the three home institutes of the authors

43 Cognos vs. Twitter Who-To-Follow

44 Cognos vs. Twitter Who-To-Follow
- Considering 27 distinct queries asked at least twice
  - Judgment by majority voting
- Cognos judged better on 12 queries
  - Computer science, Linux, Mac, Apple, Ipad, Internet, Windows phone, photography, political journalist, …
- Twitter Who-To-Follow judged better on 11 queries
  - Music, Sachin Tendulkar, Anjelina Jolie, Harry Potter, metallica, cloud computing, IIT Kharagpur, …

45 Results for query music

46 Summary: Finding topic experts in Twitter
- Developed and deployed Cognos
  - Uses Lists to infer topics of expertise and rank users
  - Competes favorably with Twitter Who-To-Follow
- Lists are vital in searching for topic experts in Twitter
- Future work
  - Make the inference methodology robust against List spam
  - Key insight: experts may follow non-expert users, but they do not List them

47 Twitter microblogging site
- An important source for real-time Web content
  - 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
  - Anyone can post tweets
  - Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant & trustworthy content
  - Trustworthiness: Thwart spammers and their spam
  - Relevance: Identify authoritative experts on specific topics

48 Higher-level takeaway
- Links mean different things in different real-world social networks
- In fact, every social network offers different types of links
  - They are backed by different social interactions
  - Many links are implicit
- Important to differentiate and leverage domain-specific usage of social links

49 Thank You
You can try Cognos at:
http://twitter-app.mpi-sws.org/whom-to-follow/
http://twitter-app.mpi-sws.org/who-is-who/

