Social Media Analytics: Digital Footprints
Sandhya Krishnan, Dr. Anupam Joshi
CHMPR IAB 2012, 12/18/12
Funded by:

1 Social Media Analytics: Digital Footprints
Sandhya Krishnan, Dr. Anupam Joshi

2 Introduction
• Social media has greatly changed the way we communicate. With approximately 3,000 tweets/sec (13K/sec around the Super Bowl) and 2.5 billion updates a day, it is an effective way to disseminate information to users across the world.
• However, such a tool can also be used to spread misinformation quickly and efficiently, which can have a harmful impact in scenarios such as national security or business/marketing cases, and hence needs to be kept in check.
• Our approach is to create a social footprint of users that can be used to distinguish real accounts from impostor or compromised accounts on social media.

3 Introduction
Social media is a great way to disseminate information to users across the world.
• 1.11 billion users as of May 2013
• 200 million active users and 340 million tweets/day (December 2012)
But what about disinformation (intentionally false or inaccurate information spread deliberately)?

4 Motivation
March 2013: two accounts both claim to be Pope Francis.
• Real: Twitter verified account
• Fake: account banned by Twitter
• Was tweeting against the Church’s anti-gay policy
February 2013: @flydeltaassist and @deltaassist both claim the same identity.
• Promised free tickets to the first several thousand followers

5 Motivation
@BarakObama, @BarackObama, @theUSpresident: which one is real?

6 Motivation
6 fake PMO India profiles, August 2012
• @pmoindia, @pm0india and @dryumyumsingh each claim to be PMO India
Tweeting content that was:
• Misrepresenting violence against Muslims in Burma
• Instigating riots in the North-Eastern region of India

7 Motivation: News/Business Scenarios
Hacked accounts: February 2013, April 2013

8 #Twithackery
Some recent hacking episodes, 2012-2013

9 Objective
@BarakObama, @theUSpresident, @Obamanews, @BarakObama__, @BarackObama44, @BarackObama, @ThePresObama
• Which one is real?
• Is this account compromised?

10 Success Criteria
• Build a prototype system that performs a joint content and network structure analysis, demonstrating the feasibility of distinguishing real and fake profiles.
• Achieve high accuracy in identifying real accounts of “famous people”.
• Evaluate further by filtering down the social media network to check the validity of accounts belonging to a layman.

11 Solution Overview
What is a digital footprint? For @barackobama, the digital footprint combines:
• Metadata: Created_at, verified account?, location, name, URLs
• Content: words in tweets, hashtags
• Network structure: followers, following, retweets, mentions, replies

12 Solution Overview: Create Digital Footprint
System: Content Module
• Query the Twitter User_timeline API for @barackobama.
• Extract tweets (content).
• Clean the text and create a bag-of-words model.
• For each word, compute the TF-IDF score.
• Compute two groups of words: frequently occurring and rarely occurring.
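
The content module steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual code: the slides do not give the TF-IDF formula or the rule for splitting terms, so the sketch treats each tweet as one document, scores a term by its mean TF-IDF over the tweets containing it, and splits at the mean score. The stopword list and the names `tokenize`, `tfidf_scores` and `split_rare_frequent` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to", "for"}

def tokenize(tweet):
    """Lower-case a tweet, strip URLs, and return its non-stopword tokens."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    return [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]

def tfidf_scores(tweets):
    """Map each term to its mean TF-IDF over the tweets that contain it."""
    bags = [Counter(tokenize(t)) for t in tweets]
    bags = [b for b in bags if b]  # drop tweets with no usable tokens
    doc_freq = Counter(term for bag in bags for term in bag)
    totals = Counter()
    for bag in bags:
        for term, count in bag.items():
            totals[term] += count * math.log(len(bags) / doc_freq[term])
    return {term: totals[term] / doc_freq[term] for term in doc_freq}

def split_rare_frequent(scores):
    """Split terms at the mean score (assumed rule): high TF-IDF = rare, low = frequent."""
    if not scores:
        return set(), set()
    cutoff = sum(scores.values()) / len(scores)
    rare = {t for t, s in scores.items() if s >= cutoff}
    return rare, set(scores) - rare

# Made-up sample tweets, for illustration only.
sample = [
    "Obama speaks on the economy",
    "Obama visits Ohio",
    "Obama signs a bill",
    "Explosions reported downtown",
]
rare_terms, frequent_terms = split_rare_frequent(tfidf_scores(sample))
```

A term that appears in most tweets gets a low score and lands in the frequent group, while a one-off term gets a high score and lands in the rare group.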

13 Solution Overview: Create Digital Footprint
System: Network Module
• Query the Twitter User_Timeline API for @barackobama.
• Extract users in ‘Re-Tweets’ and ‘Replies’.
• Extract users who ‘mention’ the current user.
• Form the Close Social Network.
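
A minimal sketch of how the Close Social Network could be assembled as a set of directed edges. The record layout is an assumption made for illustration: each tweet is a plain dict with hypothetical keys `retweet_of`, `reply_to` and `author`, not the actual Twitter API response format, and the handles are made up.

```python
def build_close_network(user, own_tweets, tweets_mentioning_user):
    """Return directed edges (source, target) around `user`.

    Outgoing edges come from the user's own retweets and replies;
    incoming edges come from other users who mention the user.
    """
    edges = set()
    for tweet in own_tweets:
        for field in ("retweet_of", "reply_to"):  # hypothetical field names
            other = tweet.get(field)
            if other and other != user:
                edges.add((user, other))
    for tweet in tweets_mentioning_user:
        author = tweet.get("author")  # hypothetical field name
        if author and author != user:
            edges.add((author, user))
    return edges

# Made-up handles, for illustration only.
own = [{"retweet_of": "newsdesk"}, {"reply_to": "colleague"}, {"text": "plain tweet"}]
mentions = [{"author": "fan_account"}, {"author": "colleague"}]
network = build_close_network("current_user", own, mentions)
```

Keeping the edges directed preserves the distinction between out-degree (whom the user engages with) and in-degree (who engages with the user).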

14 Solution Overview: Digital Footprint
For @barackobama, the System Content Module and the System Network Module together produce the Digital Signature/Footprint.

15 Solution Overview: Authenticate Digital Footprint
• What content is similar? Percentage of terms common between tweets and news articles.
• How similar are they? Average difference between the TF-IDF scores of such terms.
Both metrics are computed for rare and frequent terms in both contexts, tweets and news articles (rare and frequent terms are indicated by TF-IDF).
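
The two metrics can be sketched as follows, assuming each context has already been reduced to a dict mapping terms to TF-IDF scores. The slides do not state the denominator for the percentage, so this sketch assumes it is the number of distinct tweet terms; the function name and the sample scores are hypothetical.

```python
def compare_footprints(tweet_scores, news_scores):
    """Return (% of tweet terms also found in news, mean |TF-IDF difference| of shared terms)."""
    common = set(tweet_scores) & set(news_scores)
    pct_common = 100.0 * len(common) / len(tweet_scores) if tweet_scores else 0.0
    if not common:
        return pct_common, None  # no shared terms, so no difference to average
    avg_diff = sum(abs(tweet_scores[t] - news_scores[t]) for t in common) / len(common)
    return pct_common, avg_diff

# Made-up scores, for illustration only.
tweets = {"obama": 0.1, "economy": 1.0, "explosions": 3.9, "ohio": 2.0}
news = {"obama": 0.3, "economy": 1.4, "senate": 2.0}
pct, diff = compare_footprints(tweets, news)
```

In this toy case half of the tweet terms also occur in the news text, and the shared terms differ by about 0.3 on average. The same function can be applied separately to the rare group and the frequent group of terms.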

16 Solution Overview: Authenticate Digital Footprint
System: Network Module
Network characteristics of the Close Social Network:
• Number of nodes in the network
• Out-degree: from the user’s replies and retweets
• In-degree: the user’s @mentions, in addition to @replies directed to the user and @RT of the user’s tweets
To understand trust propagation in social networks, we record:
• Number of Twitter ‘verified’ users in the current user’s network
In some scenarios we also use:
• Network intersection with a trusted user
• Number of hops required to reach the current user from the trusted user in the network
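
These network characteristics can be computed from a set of directed edges, as in the sketch below. It is an illustration under assumptions: the slides do not say whether hops are counted along directed or undirected edges (directed is assumed here, via breadth-first search), and all function names and handles are hypothetical.

```python
from collections import deque

def network_features(user, edges, verified):
    """Node count, out-degree, in-degree and number of verified neighbours for `user`."""
    nodes = {node for edge in edges for node in edge}
    return {
        "nodes": len(nodes),
        "out_degree": sum(1 for src, _ in edges if src == user),
        "in_degree": sum(1 for _, dst in edges if dst == user),
        "verified": len((nodes - {user}) & set(verified)),
    }

def network_intersection(edges_a, edges_b):
    """Users that appear in both networks."""
    nodes_a = {node for edge in edges_a for node in edge}
    nodes_b = {node for edge in edges_b for node in edge}
    return nodes_a & nodes_b

def hops(edges, source, target):
    """Fewest directed hops from source to target (breadth-first search), or None."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Made-up network, for illustration only.
user_edges = {("trusted", "mid"), ("mid", "user"), ("user", "friend"), ("fan", "user")}
trusted_edges = {("trusted", "mid"), ("trusted", "other")}
features = network_features("user", user_edges, verified={"trusted", "mid"})
shared = network_intersection(user_edges, trusted_edges)
distance = hops(user_edges, "trusted", "user")
```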

17 Results
• Analysis done to identify real and fake profiles of “famous people”, “less famous people” and corporate accounts
• Analysis done to identify hacked/compromised accounts
• Ground truth: Twitter ‘verified’ real accounts; if that tagging is absent, manual observation of the account
• Analysis done for a specific time period or the 3500 most recent tweets, whichever is relevant

18 Results I: “Famous people”
“Famous People on Twitter”: people about whom enough information from reliable web sources is available on a day-to-day basis.
Digital Signature/Footprint: System Content Module and System Network Module

19 Results I: President Obama [1st May 2013]
System: Content Module (Graph 1, Graph 2)
Parameter                                 Value is
% common terms between tweets and web     Higher
Average difference in TF-IDF scores       Lower

20 Results I: President Obama
System: Network Module
Parameter                      Value is
In-degree                      Higher
Out-degree                     Higher
No. of nodes in network        Higher
Number of ‘verified’ users     Higher
System: @barackobama is real
Ground truth: @barackobama is the Twitter ‘verified’ real account

21 Results I: Conclusion (“Famous people”)
Total Twitter handles: 31
Number of real handles: 18
Number of fake handles: 13
                 Predicted real    Predicted fake
Actual real      18                0
Actual fake      1                 12
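
The slide reports only the raw counts. Reading the table with rows as the actual class and columns as the predicted class (an assumption, though it is the reading under which the rows sum to the stated 18 real and 13 fake handles), standard summary metrics follow directly. The function name is hypothetical, and these derived figures are not stated in the slides.

```python
def classification_metrics(real_as_real, real_as_fake, fake_as_real, fake_as_fake):
    """Accuracy, plus precision and recall for the 'real' class, from a 2x2 table."""
    total = real_as_real + real_as_fake + fake_as_real + fake_as_fake
    accuracy = (real_as_real + fake_as_fake) / total
    precision_real = real_as_real / (real_as_real + fake_as_real)
    recall_real = real_as_real / (real_as_real + real_as_fake)
    return accuracy, precision_real, recall_real

# Counts from the "Famous people" table: 18 real handles predicted real, 0 predicted fake;
# 1 fake handle predicted real, 12 predicted fake.
acc, prec, rec = classification_metrics(18, 0, 1, 12)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# prints: accuracy=0.968 precision=0.947 recall=1.000
```

Under this reading, 30 of the 31 handles are classified correctly, and the single error is a fake handle accepted as real.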

22 Results II: “Less famous people”
“Less Famous People on Twitter”: people about whom enough information from reliable web sources is not available on a regular day-to-day basis.
• Information may be available on some days or in spurts (when such users are in the news for a particular event, development, etc.).
• Continuous availability of web content about such users is not reliable, hence we look at the social network structure of such users.
Digital Signature/Footprint: System Content Module and System Network Module

23 Results II
• US Senators
• Members of Parliament, India
• Celebrities popular in the USA
• Celebrities popular in India
A good mix of highly sought users in music, acting, fashion, journalism, media and business.

24 Results II: Senators, USA
Trusted user: @barackobama
Digital Signature/Footprint: System Network Module
Parameter                                   Value is
In-degree                                   Higher
Out-degree                                  Higher
No. of nodes in network                     Higher
Number of ‘verified’ users                  Higher
Intersection of graph with trusted user     Higher
Hops from trusted user                      Lower

25 Results II: Senators, USA
System: @chuckgrassley is real
Ground truth: @chuckgrassley is the Twitter ‘verified’ real account

26 Results II: Celebrities, USA
Trusted users: @youtube, @justinbieber, @shakira, @kimkardashian and @cnnbrk
Digital Signature/Footprint: System Network Module
Parameter                                   Value is
In-degree                                   Higher
Out-degree                                  Higher
No. of nodes in network                     Higher
Number of ‘verified’ users                  Higher
Intersection of graph with trusted user     Higher
Hops from trusted user                      Lower

27 Results II: Celebrities, USA
(Close) Social Network Analysis: Graph 1, Graph 2

28 Results II: Celebrities, USA
System: @lindsaylohan is real
Ground truth: @lindsaylohan is the Twitter ‘verified’ real account

29 Results II: Conclusion (“Less famous people”)
Total Twitter handles: 350
Number of real handles: 278
Number of fake handles: 72
                 Predicted real    Predicted fake
Actual real      277               1
Actual fake      1                 71

30 Results III: “Corporate Accounts”
Digital Signature/Footprint: System Content Module and System Network Module
@bostonmarathon, @_bostonmarathon, @bostonmarathons

31 Results IV: Detect hacked/compromised accounts on Twitter
“Twitter Handle” Digital Signature/Footprint: System Content Module and System Network Module
• Phase I of evaluation
• Phase II of evaluation
Content comparison is also done between tweets of the compromised account and content from:
• Other similar Twitter accounts
• Previous content posted by the account over a significant period of time

32 ‘@AP’ hacked: Phase I Results
System: Content Module
“Breaking: Two Explosions in the White House and Barack Obama is injured”
The terms that are absent in news articles but present in the tweets of @AP:
Rare terms (avg TF-IDF 2.85)    TF-IDF score
Explosions                      3.91
Memorial                        3.91
embassy                         3.90
Bombings                        3.91
Argentina                       3.76
Canada                          2.81
RI                              2.61
court                           2.50

33 ‘@AP’ hacked: Phase I Results
System: Content Module
“Breaking: Two Explosions in the White House and Barack Obama is injured”
The terms that are common between tweets and news but have a high difference in TF-IDF scores (average difference is 0.6):
Terms           Avg diff in TF-IDF scores
Iraq            3.1
War             2.7
White House     1.8
Injured         1.5
Prisoner        1.2
Baucus          1.5

34 ‘@AP’ hacked: Phase II Results
On a regular day, how similar is @AP to @breakingnews, @cnn, @foxnews, @washingtonpost and @Nationnow?
Solution approach:
• Take the 3500 most recent tweets of each handle.
• Run the Content Analysis Module over this data set.
• Compute the % of common terms between @AP and the other account handles, and the average difference in TF-IDF scores between such terms.
Findings:
• 40-45% of the topics covered by these news channel accounts coincide.
• These topics showed very high similarity, i.e. a lower difference in TF-IDF scores.
• Uncommon topics were observed to be specific stories followed by the individual channels.

35 ‘@AP’ hacked: Phase II Results
“Breaking: Two Explosions in the White House and Barack Obama is injured”
Are the terms in this tweet mentioned by the majority of news channel accounts?
TF-IDF scores if the term is mentioned, by account (@AP, @washingtonPost, @breakingnews, @foxnews, @nationnow, @cnn):
Terms           Scores
Explosions      3.91, Absent
Barack          0.3, Absent
Obama           0.105, Absent
White House     0.48, Absent
Injured         2.81, Absent, 2.81, Absent
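
The majority question on this slide can be expressed as a simple count, sketched below. The slides do not define "majority", so a strict majority of the comparison accounts is assumed. The function name, the account labels and the term sets are made up for illustration and are not the values from the table.

```python
def majority_support(tweet_terms, account_terms):
    """For each term, count the comparison accounts that mention it and test for a strict majority.

    account_terms maps an account handle to the set of terms found in its tweets.
    """
    needed = len(account_terms) // 2 + 1
    support = {}
    for term in tweet_terms:
        count = sum(1 for terms in account_terms.values() if term in terms)
        support[term] = (count, count >= needed)
    return support

# Made-up term sets for five hypothetical comparison accounts.
others = {
    "news_a": {"senate", "budget", "obama"},
    "news_b": {"budget", "obama", "court"},
    "news_c": {"obama", "weather"},
    "news_d": {"court", "injured"},
    "news_e": {"budget"},
}
result = majority_support(["explosions", "obama", "injured"], others)
unsupported = sorted(term for term, (_, ok) in result.items() if not ok)
```

Terms from the suspicious tweet that no majority of peer accounts mentions (here `explosions` and `injured`) are the ones that would raise a flag.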

36 Other ‘Hacking’ Episodes: Successfully Caught
@48hours and @60minutes were caught accurately with Phase I and Phase II analysis identical to that used for @AP.

37 Other ‘Hacking’ Episodes: Successfully Caught
Compare tweets from the day of the attack with the past 10 days of tweets from the handle.
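
One simple way to set up this comparison is to collect the terms used on the day of the attack that never appear in the preceding 10-day window. This is a simplified stand-in for illustration, not the TF-IDF comparison the system itself runs; the function names, dates and tweets are made up.

```python
import re
from datetime import date, timedelta

def terms(text):
    """Lower-cased alphabetic tokens of a tweet."""
    return set(re.findall(r"[a-z]+", text.lower()))

def unseen_terms(tweets_by_day, attack_day, window_days=10):
    """Terms tweeted on attack_day that do not occur in the previous window_days days."""
    history = set()
    for offset in range(1, window_days + 1):
        for text in tweets_by_day.get(attack_day - timedelta(days=offset), []):
            history |= terms(text)
    today = set()
    for text in tweets_by_day.get(attack_day, []):
        today |= terms(text)
    return today - history

# Made-up sample data, for illustration only.
timeline = {
    date(2013, 4, 1): ["Explosions drill at training site"],  # outside the 10-day window
    date(2013, 4, 20): ["Senate votes on budget"],
    date(2013, 4, 22): ["Court hears budget appeal"],
    date(2013, 4, 23): ["Explosions reported, budget talks halted"],
}
novel = unseen_terms(timeline, date(2013, 4, 23))
```

A large share of previously unseen terms on a single day is a signal worth checking against the content-module metrics.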

38 Conclusion
The System Content Module and the System Network Module produce the Digital Signature/Footprint. Authenticate this footprint to flag the account as real or fake/compromised.

39 Conclusion
Applicability of the system demonstrated in three flavors:
I. Authenticating ‘famous’ Twitter users: the content and network analysis modules are both extremely useful.
II. Authenticating ‘less famous’ Twitter users: the network analysis module is more relevant.
III. Detecting if an existing account is hacked/compromised: only content analysis is relevant in this context. Content comparison for compromised accounts is done between tweets of the compromised account and content from:
• Reliable web sources
• Other similar Twitter accounts
• Content posted by the account over a significant period of time

40 Future Work
For the three flavors in which our system is usable, some immediate tasks planned are:
I. Authenticating ‘famous’ Twitter users: implement a sentiment analysis module in addition to the text analysis module.
II. Authenticating ‘less famous’ Twitter users: incorporate context to understand who is the “famous” and hence “trusted” user in the context of the current user.
III. Detecting if an existing account is hacked/compromised: build an online system which will:
• Constantly monitor accounts tweeting similar content
• Flag if one such account tweets content very different from the others

41 Future Work
• Gather larger data sets and perform evaluations in each of the above categories.
• Extend the system so that it is more applicable to differentiating a layman’s account as real or fake/compromised.


43 Thank you! Questions?

44 Questions?

