Topical Authority Detection and Sentiment Analysis on Top Influencers
Machine Learning with Large Datasets Course Project (under the guidance of Prof. William W. Cohen)
Team Members: Manuel, Shubham and Soumya

Outline
- Introduction
- Related Work
- Problem Statement
- Methodology
- Results
- Evaluation plan
- Conclusion

Introduction
- Topical authority detection in social networks is an active research area.
- It is important for recommending relevant feeds to users interested in certain topics.
- Challenges: results should not be overly biased towards
  - popular authors (such as celebrities)
  - generic authorities (such as news channels)
- Relatively new users, who may not have been on the network before an event but post dedicatedly on the topic, should also be considered.

Related Work
- TwitterRank [2]: authority detection in Twitter using the idea of PageRank
  - Leverages topical similarity and link structure between users
  - Fails to filter out spammers, or celebrities who are not always influential
- Meeyoung Cha et al. [3] find that popular users with high in-degree are not necessarily influential in terms of spawning retweets or mentions
- Pal and Counts [1] (considered as the baseline):
  - Use clustering to identify influential vs. non-influential users on Twitter
  - Rank users in the influential cluster, considering various important features

Problem Statement
- Aim:
  - Perform authority detection on a collection of topics in Twitter for a time window
  - Sentiment analysis to determine the influence of top users tweeting on specific topics on their respective communities
- Period: June 6th 2010 to June 10th 2010
- Topics: Oil Spill, iPhone, World Cup

Methodology - User Metrics
- M = Mentions
  - M1: Number of mentions of other users by the author
  - M2: Number of unique users mentioned by the author
  - M3: Number of mentions of the author by others
  - M4: Number of unique users mentioning the author
- G = Graph characteristics (restricted by the availability of data)
  - G1: Number of topically active followers
  - G2: Number of topically active friends
  - G3: Number of followers tweeting on topic after the author
  - G4: Number of friends tweeting on topic before the author
- OT = Original tweets
  - OT1: Number of original tweets
  - OT2: Number of links shared
  - OT3: Self-similarity score
  - OT4: Number of keyword hashtags used
- CT = Conversational tweets
  - CT1: Number of conversational tweets
  - CT2: Number of conversational tweets where the conversation is initiated by the author
- RT = Repeated tweets
  - RT1: Number of retweets of others' tweets
  - RT2: Number of unique tweets retweeted by other users
  - RT3: Number of unique users who retweeted the author's tweets
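
As an illustration of how the mention metrics M1 to M4 could be counted, here is a minimal in-memory sketch. It is not the project's MapReduce job. The input shape (a list of `(author, text)` pairs), the `@name` regular expression, lowercasing of user names, and skipping self-mentions are all assumptions made for this example; it also does not separate retweet markers from ordinary mentions.

```python
import re
from collections import defaultdict

# Assumption: a mention is any "@name" token in the tweet text.
MENTION_RE = re.compile(r"@(\w+)")


def mention_metrics(tweets):
    """Compute M1..M4 per user from an iterable of (author, text) pairs."""
    made = defaultdict(list)      # author -> users they mentioned (with repeats)
    received = defaultdict(list)  # user -> authors who mentioned them (with repeats)
    for author, text in tweets:
        author = author.lower()
        for mentioned in MENTION_RE.findall(text):
            mentioned = mentioned.lower()
            if mentioned == author:
                continue  # assumption: self-mentions are ignored
            made[author].append(mentioned)
            received[mentioned].append(author)
    users = set(made) | set(received)
    return {
        u: {
            "M1": len(made.get(u, [])),           # mentions of others by u
            "M2": len(set(made.get(u, []))),      # unique users mentioned by u
            "M3": len(received.get(u, [])),       # mentions of u by others
            "M4": len(set(received.get(u, []))),  # unique users mentioning u
        }
        for u in users
    }


if __name__ == "__main__":
    sample = [
        ("alice", "@bob @carol oil spill update"),
        ("alice", "thanks @bob"),
        ("bob", "@alice agreed"),
        ("dave", "@alice @alice see this"),
    ]
    for user, metrics in sorted(mention_metrics(sample).items()):
        print(user, metrics)
```

The same per-user counting maps naturally onto a MapReduce job (emit one record per mention, then aggregate by user), which is presumably closer to how it would be done on the full dataset.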

Methodology - Features Extracted
- Topic Signal (TS)
- Signal Strength (SS)
- Non-Chat Signal (NCS)
- Retweet Impact (RI) - modified
- Mention Impact (MI)
- Information Diffusion (ID)
- Network Score (NS)
- URL Impact (UI)

Methodology - Features Formulae
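
The formulae on this slide did not survive in the text. As a rough guide, the sketch below computes the features from the user metrics using the definitions as recalled from the baseline paper [1]; these should be checked against that paper. The project's modified Retweet Impact and its URL Impact are not specified here, so RI is shown in its baseline form and UI is omitted. The function name, the `total_tweets` denominator, the default `lam` value, and the zero-guards on logarithms and divisions are assumptions of this example.

```python
import math


def safe_log(x):
    """log(x) for positive x, else 0.0 (assumption: guards against log(0))."""
    return math.log(x) if x > 0 else 0.0


def ratio(num, den):
    """num / den, or 0.0 when the denominator is zero (assumption)."""
    return num / den if den else 0.0


def topical_features(m, total_tweets, lam=0.05):
    """Baseline-style features from a dict of user metrics.

    m            -- metrics keyed by name (OT1, CT1, CT2, RT1, RT2, RT3, M1..M4, G1..G4)
    total_tweets -- all tweets by the author in the window (assumed denominator for TS)
    lam          -- weight of the conversation term in NCS (hypothetical default)
    """
    ot1, ct1, ct2, rt1 = m["OT1"], m["CT1"], m["CT2"], m["RT1"]
    return {
        # Share of the author's tweets that are on topic.
        "TS": ratio(ot1 + ct1 + rt1, total_tweets),
        # Original tweets relative to original tweets plus retweets of others.
        "SS": ratio(ot1, ot1 + rt1),
        # Original tweets relative to chat, discounting conversations started by others.
        "NCS": ratio(ot1, ot1 + ct1) + lam * (ct1 - ct2) / (ct1 + 1),
        # Baseline form only; the project's modified RI is not given in the text.
        "RI": m["RT2"] * safe_log(m["RT3"]),
        "MI": m["M3"] * safe_log(m["M4"]) - m["M1"] * safe_log(m["M2"]),
        "ID": math.log(m["G3"] + 1) - math.log(m["G4"] + 1),
        "NS": math.log(m["G1"] + 1) - math.log(m["G2"] + 1),
    }


if __name__ == "__main__":
    metrics = {"OT1": 6, "CT1": 2, "CT2": 1, "RT1": 2, "RT2": 3, "RT3": 1,
               "M1": 0, "M2": 0, "M3": 4, "M4": 1,
               "G1": 3, "G2": 3, "G3": 1, "G4": 0}
    for name, value in topical_features(metrics, total_tweets=20).items():
        print(f"{name}: {value:.4f}")
```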

Methodology - Steps
- Data in Twitter API format -> User Metrics
  - MapReduce (using Hadoop on AWS)
- Src-follows-Dest edge-list -> Adjacency Lists
- User Metrics and Adjacency Lists -> Features
- Features -> Clusters -> Influential Cluster
  - Using Gaussian Mixture Model and Expectation Maximization
- Influential Cluster -> Top 20 Influencers
  - Using Gaussian Ranking
- Sentiment Analysis and Visualization
  - Using Liu Hu Lexicon and Gephi
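
The ranking step is only named here, so the following is a sketch under the assumption that "Gaussian Ranking" follows the baseline [1]: fit one Gaussian per feature over the users in the influential cluster and score each user by the product of the per-feature Gaussian CDFs. The function names, the unweighted product, and the handling of zero-variance features are choices made for this example.

```python
import math
from statistics import mean, pstdev


def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x; 0.5 if sigma is zero (assumption: uninformative feature)."""
    if sigma == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_rank(users, top_k=20):
    """Rank users by the product of per-feature Gaussian CDFs.

    users -- dict mapping a user name to a list of feature values (same length for all)
    Returns up to top_k (name, score) pairs, best first.
    """
    names = list(users)
    n_features = len(users[names[0]])
    # Fit mean and standard deviation of each feature across the cluster.
    params = []
    for f in range(n_features):
        column = [users[name][f] for name in names]
        params.append((mean(column), pstdev(column)))
    scores = {}
    for name in names:
        score = 1.0
        for value, (mu, sigma) in zip(users[name], params):
            score *= gaussian_cdf(value, mu, sigma)
        scores[name] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    cluster = {"a": [0.9, 2.0], "b": [0.5, 1.0], "c": [0.1, 0.0]}
    for name, score in gaussian_rank(cluster):
        print(name, round(score, 4))
```

A user must score well on every feature to rank highly under a product of CDFs, which is one way to avoid rewarding accounts that are strong on a single signal such as raw popularity.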

Results - Authority Detection

Normalized:
60069699: sandiebanandie
17918561: LATenvironment
17918827: latimesgreen
14323791: dbiello
58315230: mrt7384
138775765: BPOilSpill
3554721: NWF
28657802: climateprogress
47739450: ByronYork
152315367: Oil_Spill_News
22024951: SwampSchool
19029137: BrentSpiner
14717197: TPM
139909476: USGulfOilSpill
15458181: kate_sheppard
48365916: Fertic
138761645: GulfOilCleanup
11856592: msnbcvideo
81696616: alabamainsider
9848: jimmybuffett

Not Normalized:
17918561: LATenvironment
138775765: BPOilSpill
3554721: NWF
14323791: dbiello
138761645: GulfOilCleanup
60069699: sandiebanandie
14192680: NOLAnews
139119046: BoycottBP
26642006: Alyssa_Milano
139909476: USGulfOilSpill
20582958: guardianeco
28657802: climateprogress
14293310: TIME
47739450: ByronYork
14138785: TelegraphNews
2467791: washingtonpost
58315230: mrt7384
139477825: BPOilNews
46969537: greenforyou
14511951: HuffingtonPost

Results - Sentiment Analysis
- dbiello: Negative Sentiment Influence
- LATenvironment: Neutral Sentiment Influence
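
To make the lexicon-based labelling concrete, here is a minimal sketch of counting positive and negative opinion words in a tweet and aggregating over a set of tweets. The two tiny word sets are hypothetical stand-ins, not the actual Liu Hu opinion lexicon, and the way the project aggregates tweet polarity into a user's "sentiment influence" on a community is not stated, so the summation below is an assumption.

```python
import re

# Hypothetical stand-ins for the positive and negative word lists of an opinion lexicon.
POSITIVE = {"good", "great", "clean", "safe", "hope"}
NEGATIVE = {"bad", "disaster", "toxic", "fail", "worst"}


def polarity_score(text):
    """Number of positive words minus number of negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives


def label(score):
    """Map a numeric score to 'positive', 'negative' or 'neutral'."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def aggregate_label(tweets):
    """Label a collection of tweets by the sign of their summed scores (assumption)."""
    return label(sum(polarity_score(t) for t in tweets))


if __name__ == "__main__":
    community = [
        "Worst disaster ever, toxic water",
        "Oil spill update at noon",
    ]
    for tweet in community:
        print(label(polarity_score(tweet)), "|", tweet)
    print("overall:", aggregate_label(community))
```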

Evaluation - Clustering, Ranking and Authority
- We randomly sample users from the “good” and “bad” clusters and ask people how relevant their tweets are to the topic.
- Using the ratings (1 to 5) that people assign to the top k Twitter users in our ranking, we compute NDCG to compare their relative ordering with ours.
- With a final survey, we plan to ask people to rank the authoritativeness of the top k users in our ranking, with both anonymized and non-anonymized tweets.
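
For reference, NDCG over human ratings can be sketched as follows. The input is the list of ratings (for example 1 to 5) given to the users in the order our ranking returned them. This uses the linear-gain form of DCG with a log2 position discount; which variant the evaluation will use is not stated, so treat that as an assumption.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rating at position i is divided by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the first k ratings divided by the DCG of the best possible ordering."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Ratings assigned by survey participants, listed in our ranking's order.
    ratings = [3, 2, 3]
    print(round(ndcg(ratings), 4))
```

A value of 1.0 means the ranking already orders users from highest to lowest rating; lower values indicate that highly rated users appear further down the list.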
Evaluation

Conclusion
- While the baseline returned more generic authorities, such as news Twitter accounts, our results show more topical authorities.
- We have also analyzed the sentiment influence of the top authorities, which has potential applications in formulating better marketing strategies for products and in influencing consumers.
- Further, we plan to include evaluation results in our final report and to improve the features related to the follower-following graph.

References
[1] Pal, Aditya, and Scott Counts. "Identifying topical authorities in microblogs." Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.
[2] Weng, Jianshu, et al. "TwitterRank: finding topic-sensitive influential twitterers." Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010.
[3] Cha, Meeyoung, et al. "Measuring user influence in Twitter: the million follower fallacy." Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM). 2010.
[4] Yoshida, M., and Y. Yamaguchi. Interactive Tagging Networks (Following/Followers and Tags on 1 million Twitter Users) [Data set]. Zenodo, 2015. http://doi.org/10.5281/zenodo.16267
[5] Page, Lawrence, et al. "The PageRank citation ranking: bringing order to the web." Technical report, Stanford InfoLab, 1999.
[6] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

Baseline Results
NWF
TIME
Huffingtonpost
NOLAnews
Reuters
CBSNews
LATenvironment
kate_sheppard
MotherNatureNet
mparent77772