Extracting Relevant & Trustworthy Information from Microblogs Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly,

Presentation transcript:

Extracting Relevant & Trustworthy Information from Microblogs
Joint work with Bimal Viswanath, Farshad Kooti, Saptarshi Ghosh, Naveen Sharma, Niloy Ganguly, Fabricio Benevenuto
MPI-SWS, Germany; IIT Kharagpur, India; UFOP, Brazil

My research: The big picture
Three fundamental trends & challenges in the social Web:
1. User-generated content sharing: can we protect the privacy of users sharing personal data?
2. Word-of-mouth based content exchange: can we understand & leverage word-of-mouth better?
3. Crowd-sourced content rating and ranking: can we find trustworthy & relevant content sources?

Twitter microblogging site
- An important source for real-time Web content
  - 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
  - Anyone can post tweets
  - Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant & trustworthy content
  - Trustworthy: Thwart spammers and their spam
  - Relevance: Identify authoritative experts on specific topics

Part 1: Thwarting Spammers in Twitter [WWW 2012]

Background: How spammers operate
- Twitter spammers try to gain lots of followers
  - To promote spam directly
  - To gain influence in the network
- Search engines rank tweets based on how influential the user is
  - Most metrics depend on the user's network connectivity
  - More followers help a user gain influence
- This incentivizes spammers to acquire links to gain influence

Acquiring followers via link farming
- Unrelated users exchange links with each other
  - To gain more influence based on network connectivity
- [Figure: Alice, Bob, Charlie and David exchange links; influence based on connectivity is improved]

To thwart spammers
- We need to
  1. Understand link farming activity in Twitter
  2. Combat link farming activity in Twitter
- Prior work focused on detecting spammers
  - Via their characteristics, e.g., follower-to-following ratios
  - Rat race between spammers and spam fighters
- We focus on the spammer support network

Identifying spammers
- Used the Twitter network gathered in a previous study [ICWSM '10]
  - Data collected in August 2009
  - 54M nodes, 1.9B links, 1.7B tweets
- Identified accounts suspended by Twitter
  - An account could be suspended for various reasons
- Found suspended users that posted blacklisted URLs
  - Includes 41,352 such spammers
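The filtering step on this slide can be sketched as a set intersection. This is only an illustration: the function name, the input shapes and the exact-match URL comparison are assumptions, since the slide does not say how blacklisted URLs were matched.

```python
def find_spammers(suspended, posted_urls, blacklist):
    """Return suspended accounts that posted at least one blacklisted URL.

    suspended:   set of suspended account names
    posted_urls: dict mapping account name -> set of URLs it posted
    blacklist:   set of blacklisted URLs (exact match is an assumption)
    """
    return {
        user
        for user in suspended
        if posted_urls.get(user, set()) & blacklist
    }


# Hypothetical toy data, not from the study.
suspended = {"acct_a", "acct_b"}
posted_urls = {
    "acct_a": {"http://bad.example/x"},
    "acct_b": {"http://ok.example/"},
    "acct_c": {"http://bad.example/x"},  # posted spam but was never suspended
}
spammers = find_spammers(suspended, posted_urls, {"http://bad.example/x"})
```

Requiring both conditions matters because an account can be suspended for reasons unrelated to spam.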

Spammers farm links at large scale
- Spam-targets: 27% of all users are followed by at least one of the ~40,000 spammers
- Spam-followers: 82% of the users who follow spammers have themselves been targeted by spammers
- Spammers have more followers than random users
  - Average follower count for spammers: 234; for random users: 36

Who responds to links from spammers?
- A small number of followers respond most of the time
  - The top 100k followers exhibit a high reciprocation of 0.8 on average
  - The top 100k users account for 60% of all links to spammers
- We call these users link farmers
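One way to read the reciprocation figure is as the fraction of spammers following a user whom that user follows back. The sketch below assumes that definition (the slide does not spell it out) and uses hypothetical names throughout.

```python
def spam_reciprocation(user, spammers, follows):
    """Fraction of the spammers following `user` that `user` follows back.

    follows: dict mapping each account to the set of accounts it follows.
    Returns 0.0 when no spammer follows the user.
    """
    # Spammers that have targeted (followed) this user.
    targeting = {s for s in spammers if user in follows.get(s, set())}
    if not targeting:
        return 0.0
    # Of those, the ones the user links back to.
    followed_back = targeting & follows.get(user, set())
    return len(followed_back) / len(targeting)


# Hypothetical toy network.
follows = {
    "s1": {"alice", "bob", "carol"},
    "s2": {"alice", "carol"},
    "alice": {"s1", "s2"},  # follows back every spammer
    "bob": set(),           # ignores spammers
    "carol": {"s1"},        # follows back one of two
}
spammers = {"s1", "s2"}
```

Under this reading, a value of 0.8 means a user follows back four out of every five spammers that follow them.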

Are link farmers real users or spammers?
To find out if they are spammers or real users, we
1. Checked if they were suspended by Twitter
   - 76% of users not suspended, 235 of them verified by Twitter
2. Manually verified 100 random users
   - 86% of users are real, with legitimate links in their tweets
3. Analyzed their profiles
   - More active in updating their profiles than random users

Are link farmers lay or popular users?
- Conventional wisdom:
  - Lay users are more likely to follow back due to social etiquette
  - Popular users might be more conservative in following others
- Observed: the probability of following back increases with user popularity
- Link farmers are popular users with lots of followers

Are link farmers lay or popular users?
- Top 5 link farmers according to Pagerank:
  1. Barack Obama: Obama 2012 campaign staff
  2. Britney Spears
  3. NPR Politics: Political coverage and conversation
  4. UK Prime Minister: PM's office
  5. JetBlue Airways
- Link farmers include legitimate, popular users & organizations

What possibly motivates link farmers?
- One explanation:
  - Link farmers have incentives similar to those of spammers
  - They seek to amass social capital & influence in the network
- Link farmers rank among the top 5% most influential Twitter users
  - In terms of various metrics like Pagerank & Followerrank

Combating link farming
- Key challenge:
  - Real, popular and active users are involved in link farming
  - Detecting and suspending spammers alone will not help
- Insight:
  - Discourage users from following others carelessly
  - Penalize users following anyone found to be bad
  - Lower the influence scores of users following spammers
- This incentivizes users to be more careful about who they link to

Collusionrank
- Borrows ideas from spam defense strategies for the Web [WWW '05]
- A low Collusionrank score for a user indicates heavy linking to spammers or spam-followers
- Requires a seed set of known spammers
  - Twitter operator periodically identifies and updates spammers

Collusionrank
Algorithm:
1. Negatively bias the initial scores of the set of spammers
2. In Pagerank style, iteratively penalize users who follow spammers or those who follow spam-followers
Collusionrank is based on the scores of a user's followings, because a user is penalized based on whom they follow
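The two steps above can be sketched as a biased, Pagerank-style iteration in which negative scores flow from a followed account back to its followers. The exact update rule, the damping factor `alpha` and the fixed iteration count are assumptions made for illustration; the slide only states the two steps.

```python
def collusionrank(follows, spammers, alpha=0.85, iterations=50):
    """Return a score <= 0 per user; more negative means closer ties to spammers.

    follows:  dict mapping each user to the set of users they follow
    spammers: non-empty seed set of known spammers
    alpha and iterations are assumed defaults, not values from the talk.
    """
    if not spammers:
        raise ValueError("need at least one seed spammer")
    users = set(follows) | {v for vs in follows.values() for v in vs}
    # Count followers so each account's penalty is shared among its followers.
    n_followers = {u: 0 for u in users}
    for followed in follows.values():
        for v in followed:
            n_followers[v] += 1
    # Step 1: negatively bias the initial scores of the seed spammers.
    bias = {u: (-1.0 / len(spammers) if u in spammers else 0.0) for u in users}
    score = dict(bias)
    # Step 2: a user inherits a share of the score of everyone they follow.
    for _ in range(iterations):
        score = {
            u: alpha * sum(score[v] / n_followers[v] for v in follows.get(u, ()))
            + (1 - alpha) * bias[u]
            for u in users
        }
    return score


# Hypothetical toy network: "farmer" follows a spammer, "fan" follows "farmer".
follows = {
    "spammer": set(),
    "farmer": {"spammer"},
    "fan": {"farmer"},
    "careful": {"friend"},
    "friend": set(),
}
scores = collusionrank(follows, spammers={"spammer"})
```

In this toy network the penalty weakens with distance from the seed spammer, and users with no path to a spammer keep a score of zero.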

Evaluating Collusionrank
- Goal:
  - To penalize spammers and spam-followers
  - Should not penalize users who are not following spammers
- Used a small subset of 600 spammers as the seed set
- Compare ranks between
  - Pagerank
  - Pagerank + Collusionrank
    - Measures influence after accounting for link farming activity

Effect of Collusionrank on spammers
- 40% of spammers appear in the top 20% according to Pagerank
- Most of the spammers get pushed to the last 10% of positions based on Collusionrank

Effect on link farmers
- 87% of link farmers are in the top 2% of users according to Pagerank
- 98% of the link farmers get pushed to the last 10% of positions based on Collusionrank

Effect on normal users
- Focus on the top 100,000 users according to Pagerank
- Analyze the percentile difference in ranks between Pagerank (P) and Pagerank + Collusionrank (PC)
  - Percentile difference = (|PC - P| / N) x 100
- Only 20% of users get demoted heavily
- Heavily demoted users follow many more spammers than others
- Collusionrank selectively filters out spammers and spam-followers
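The percentile difference is simple arithmetic on rank positions. The sketch below assumes P and PC are 1-based rank positions and N is the total number of ranked users; the helper names are hypothetical.

```python
def rank_positions(scores):
    """Map each user to a 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {user: position for position, user in enumerate(ordered, start=1)}


def percentile_difference(rank_p, rank_pc, n_users):
    """(|PC - P| / N) x 100, with ranks as positions among n_users."""
    return abs(rank_pc - rank_p) / n_users * 100


# A user ranked 10th by Pagerank and 510th by Pagerank + Collusionrank,
# out of 1,000 users, moves by 50 percentile points.
shift = percentile_difference(10, 510, 1000)
```

Because the absolute value is taken, the measure reports how far a user moved, not the direction of the move.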

Summary: Thwarting spammers
- Spammers infiltrate the Twitter network by farming links
  - Link farming helps them gain influence to promote spam
  - Search involves ranking users based on connectivity & influence
- Analyzed link farming in Twitter by studying spammers
  - Top link farmers are real, active and popular users
- Proposed an algorithm, Collusionrank, to limit link farming
  - Incentivizes users to be careful about who they connect with

Part 2: Finding Topic Experts in Twitter [WOSN 2012] [SIGIR 2012]

Topic experts in Twitter
- Twitter is now an important source of current news
  - 500 million users post 400 million tweets daily
- Quality of tweets posted by different users varies widely
  - News, pointless babble, conversational tweets, spam, …
- Challenge: to find topic experts
  - Sources of authoritative information on specific topics

Identifying topic experts in Twitter
- Existing approaches
  - Research studies: Pal [WSDM 11], Weng [WSDM 10]
  - Application systems: Twitter Who-To-Follow, Wefollow, …
- Existing approaches primarily rely on information provided by the user herself
  - Bio, contents of tweets, network features, e.g. #followers
- We rely on the "wisdom of the Twitter crowd"
  - How do others describe a user?

Twitter Lists
- A feature to organize tweets received from the people whom a user is following
- Create a List, add a name & description, add Twitter users to the List
  - List meta-data offers cues for who-is-who
- Tweets from all listed users will be available as a separate List stream

Mining Lists to infer expertise
- Collect Lists containing a given user U
- Identify U's topics from List meta-data
  - Basic NLP techniques
  - Extract nouns and adjectives
- Extracted words are collected to obtain a topic document for the user
  - Example: [movies tv hollywood stars entertainment celebrity hollywood …]
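A minimal sketch of building a topic document from List meta-data follows. It deliberately swaps the part-of-speech step (nouns and adjectives) for a small stopword filter so that it needs no NLP library; the stopword set, the function name and the sample Lists are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a stand-in for real noun/adjective extraction.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "my", "best"}


def topic_document(list_metadata):
    """Count the words found in the names and descriptions of a user's Lists.

    list_metadata: iterable of (list_name, list_description) pairs for the
    Lists that contain the user.
    """
    counts = Counter()
    for name, description in list_metadata:
        for word in re.findall(r"[a-z]+", f"{name} {description}".lower()):
            if word not in STOPWORDS and len(word) > 1:
                counts[word] += 1
    return counts


# Hypothetical Lists that include some user U.
lists_containing_u = [
    ("Hollywood stars", "Movies and TV celebrities"),
    ("entertainment", "the best of hollywood"),
]
doc = topic_document(lists_containing_u)
```

Keeping counts instead of a plain word set preserves how often independent List curators used the same term, which is the crowd signal the approach relies on.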

Lists vs. other features
- Profile bio: [figure]
- Most common words from tweets: Fallon, happy, love, fun, video, song, game, hope, #fjoln, #fallonmono
- Most common words from Lists: celeb, funny, humor, music, movies, laugh, comics, television, entertainers

Dataset
- Collected Lists of 55 million Twitter users who joined before or in 2009
- Our analysis infers topics for 1.3 million users who are included in 10 or more Lists

Evaluating inference quality
- Quality metrics
  - Is the inference accurate?
  - Is the inference informative?
- Evaluation of popular users
  - Celebrities, news media sources, US Senators
- Using user feedback

Popular users set 1: Celebrities
Biographical tags | Topics of expertise | Popular perception
government, president, USA, democrat | politics, government | celebs, leader, famous, current events
sports, cyclist, athlete | tdf, triathlon, cancer | celebs, influential, famous, inspiration
- The inferred attributes accurately capture
  - Biographical information
  - Topics of expertise
  - Popular perception about the user

Popular users set 2: News media sources
- The inferred attributes indicate
  - Primary topics of the media source
  - Perceived political bias (verified using ADA scores)
Media | Biographical tags | Topics of expertise | Popular perception
CNN | media, journalist, bloggers | politics, sports, tech, weather, current | influential outlets
The Nation | media, journalist, magazines, blogs | politics, government | progressive, liberal
Townhall.com | media, bloggers, commentary, journalists | politics | conservative, republican
GuardianFilm | journalists, reviews | movies, cinema, actors, theatre, hollywood | film critics

Popular users set 3: US Senators
- Out of the 100 US senators, 84 have Twitter accounts
- The inferred attributes correctly identify
  - Their political party
  - The state represented by them
  - Their gender
    - 'Female' or 'Women' for all 15 female senators
  - Their political ideology
    - progressive/liberal/conservative/tea-party
  - The senate committees to which they belong

Popular users set 3: US Senators
Senator | Biographical tags | Senate committees | Perception
Chuck Grassley | politics, senator, republican, iowa, gop | health, food, agriculture | conservative
Claire McCaskill | politics, democrats, missouri, women | tech, security, power, health, commerce | progressive, liberal
Jim Inhofe | politics, congress, oklahoma, republican | army, energy, climate, foreign | conservative
John Kerry | politics, senate, democrats, boston | health, climate, tech | progressive

User feedback
Response | Accurate | Informative
Total evaluations | |
Yes | |
No | 18 | 20
Can't tell | 53 | 45
Ignoring "can't tell" responses: accuracy 94%, informativeness 93%

Evaluating inference coverage
- What fraction of Twitter can our method of inference be applied to?
- A large fraction of popular Twitter users are covered

Evaluating inference coverage
- We could also infer attributes of less popular users
  - 6% of users with follower ranks between 1 and 10 million
  - They are often experts on niche topics
User: Twitter bio | Followers, Listed | Inferred attributes
spacespin: news on robotic space exploration | 5611 | science, space exploration, nasa, astronomy, planets
laithm: Al-jazeera network battle cameraman | | journalists, photographer, al-jazeera, media
HumphreysLab: Stem Cell, Regenerative Biology of Kidney | | science, stem cell, genetics, cancer, physicians, biotech, nephrologist

Cognos
- Search system for topic experts in Twitter
- Given a query (topic)
  - Identify experts on the topic using Lists
  - Rank identified experts

Ranking experts
- Used a ranking scheme solely based on Lists
- Two components of ranking user U w.r.t. query Q
  - Relevance of user to query: cover density ranking between the topic document T_U of the user and Q
  - Popularity of user: number of Lists including the user
- Score = Topic relevance(T_U, Q) × log(#Lists including U)
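The scoring formula can be sketched as follows. Note the substitution: instead of cover density ranking, the relevance function here is a much simpler stand-in (the fraction of query terms present in the topic document), chosen only to keep the example short. All names and numbers are hypothetical, and the natural logarithm is an assumption since the slide does not give a base.

```python
import math


def relevance(topic_doc, query):
    """Stand-in relevance: fraction of query terms found in the topic document.

    This is NOT cover density ranking; it is a simplified placeholder.
    """
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(1 for t in terms if topic_doc.get(t, 0) > 0) / len(terms)


def expert_score(topic_doc, num_lists, query):
    """Topic relevance(T_U, Q) x log(#Lists including U)."""
    if num_lists < 1:
        return 0.0
    return relevance(topic_doc, query) * math.log(num_lists)


def rank_experts(users, query):
    """users: dict name -> (topic_doc, num_lists). Returns names, best first."""
    scored = {u: expert_score(doc, n, query) for u, (doc, n) in users.items()}
    return sorted((u for u in scored if scored[u] > 0), key=scored.get, reverse=True)


# Hypothetical users with made-up topic documents and List counts.
users = {
    "expert_a": ({"stem": 5, "cell": 5, "science": 3}, 40),
    "expert_b": ({"space": 9, "nasa": 4}, 100),
    "expert_c": ({"stem": 1, "cell": 1}, 12),
}
ranking = rank_experts(users, "stem cell")
```

The logarithm dampens popularity, so a heavily listed account cannot outrank a relevant one on List count alone; the choice of base only rescales scores and does not change the ordering.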

Cognos results for “stem cell” Cognos:

Evaluation of Cognos
- System deployed and evaluated 'in-the-wild'
- Evaluators were students & researchers from the three home institutes of the authors

Cognos vs. Twitter Who-To-Follow

Cognos vs. Twitter Who-To-Follow
- Considering 27 distinct queries asked at least twice
  - Judgment by majority voting
- Cognos judged better on 12 queries
  - Computer science, Linux, Mac, Apple, Ipad, Internet, Windows phone, photography, political journalist, …
- Twitter Who-To-Follow judged better on 11 queries
  - Music, Sachin Tendulkar, Angelina Jolie, Harry Potter, metallica, cloud computing, IIT Kharagpur, …
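Majority voting over per-query judgments can be sketched as below. The vote labels, the tie handling and the function name are assumptions; the slide does not describe how ties or empty vote sets were treated.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the label with the most votes, or "tie" if there is no clear winner."""
    counts = Counter(votes).most_common()
    if not counts:
        return "tie"
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


# Hypothetical judgments for two queries, each asked more than once.
judgments = {
    "linux": ["cognos", "cognos", "who-to-follow"],
    "music": ["who-to-follow", "cognos", "who-to-follow"],
}
verdicts = {query: majority_verdict(v) for query, v in judgments.items()}
```

Restricting the comparison to queries asked at least twice, as the slide does, ensures every verdict rests on more than one evaluator's opinion.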

Results for query music

Summary: Finding topic experts in Twitter
- Developed and deployed Cognos
  - Uses Lists to infer topics of expertise and rank users
  - Competes favorably with Twitter Who-To-Follow
- Lists are vital in searching for topic experts in Twitter
- Future work
  - Make the inference methodology robust against List spam
  - Key insight: unlike with follow-links, experts do not List non-expert users

Twitter microblogging site
- An important source for real-time Web content
  - 500 million active users posting 400 million tweets daily
- Quality of tweets / content varies widely
  - Anyone can post tweets
  - Celebrities, politicians, news media, academics, spammers
- Challenge: Finding relevant & trustworthy content
  - Trustworthy: Thwart spammers and their spam
  - Relevance: Identify authoritative experts on specific topics

Higher-level takeaway
- Links mean different things in different real-world social networks
  - In fact, every social network offers different types of links
  - They are backed by different social interactions
  - Many links are implicit
- Important to differentiate and leverage domain-specific usage of social links

Thank You
You can try Cognos at: