CrowdTarget: Target-based Detection of Crowdturfing in Online Social Networks. Authors: Jonghyuk Song, Sangho Lee, Jong Kim, Dept. of CSE, POSTECH.


1 CROWDTARGET: TARGET-BASED DETECTION OF CROWDTURFING IN ONLINE SOCIAL NETWORKS
Authors: Jonghyuk Song, Sangho Lee, Jong Kim, Dept. of CSE, POSTECH, Pohang, Republic of Korea
Dr. Murtuza Jadliwala, CS898AB. Sreymoch Don, April 25, 2016

2 AGENDA
1. Introduction
2. Some existing methods
3. High-level overview
4. Contribution
5. Malicious services background
6. Data collections for analysis
7. Analyze crowdturfing workers
8. Analyze crowdturfing targets
9. CrowdTarget
10. Analysis results
11. Feature robustness
12. Conclusion
13. Q & A

3 INTRODUCTION
- Crowdsourcing (crowdturfing): the process of outsourcing tasks to human workers.
- Detection challenges: bots vs. humans, and humans vs. humans.
- Proposes a new detection method, “CrowdTarget”:
  - Detects the target objects of crowdturfing (posts, pages, search keywords, URLs, ...).
  - Works even when tasks are performed by new accounts or casual workers.

4 SOME EXISTING METHODS
- Detection using account-based features:
  - Inspects characteristics of user accounts (number of friends, number of posts, account age, etc.).
  - Vulnerable to a simple evasion technique: performing malicious tasks while also behaving normally.
- Detection using synchronized group activities:
  - Bot accounts perform synchronized activities.
  - Humans perform tasks without a schedule or on a flexible schedule.

5 HIGH-LEVEL OVERVIEW

6 CONTRIBUTIONS
- New detection approach: introduces “CrowdTarget”, which analyzes the characteristics of crowdturfing targets.
- In-depth analysis: analyzes retweets generated by normal users, crowdturfing services, and black-market services.
- High accuracy: evaluation results show a true-positive rate of 0.98 at a false-positive rate of 0.01.

7 MALICIOUS SITES BACKGROUND
- Black-market sites:
  - Boost popularity (followers, likes, comments).
  - Operate a large number of bots to complete many tasks by their deadlines.
  - Charge according to various plans with deadlines.
  - Show synchronized group activities.
- Crowdturfing sites:
  - Boost popularity (followers, likes, comments).
  - Human workers perform the boosting tasks and are paid for them.
  - Show no synchronized group activities.
  - Validate task completion automatically.

8 DATA COLLECTION - 1
- Sources: Twitter, crowdturfing sites, black-market sites
- Tweets that received >= 50 retweets
- November 2014 – February 2015

9 DATA COLLECTION - 2
Tweet and retweet data collected for analysis:
- Normal: 1,044 Twitter accounts with >= 100,000 followers. Monitored selected verified Twitter accounts to collect tweets and retweets.
- Crowdturfing: 9 different crowdturfing sites. Registered at the crowdturfing sites and retrieved tasks requesting retweets.
- Black-market: 5 different black-market sites. Wrote 282 tweets, registered at the black-market sites, and purchased retweets for those tweets.

10 DATA COLLECTION METHOD
The Twitter REST API returns only up to the 100 latest retweets of a tweet. Two approaches were used to collect as many retweets as possible:
- Target tweets recently posted: use the Streaming API to monitor the retweets each tweet receives over the next 3 days.
- Target tweets posted in the past: use the Twitter search function to find as many retweets of the target as possible.

11 LEGAL & ETHICAL ISSUES
- Followed Thomas et al.’s approach to studying underground services ethically.
- Designed the data collection and experiments to follow guidelines from a formal Institutional Review Board (IRB) review.
- User account anonymity: deleted detailed personal information.
- Minimized effects on underground services:
  - Retrieved only public tasks posted on crowdturfing sites.
  - Bought a small number of retweets from black-market sites.
  - Avoided checking and contacting site operators.
- Avoided negative effects on Twitter and its users: deleted the fake accounts right after data collection was done.

12 ANALYZE CROWDTURFING WORKERS
Are they humans or bots?
- Account popularity:
  - Follower-to-following ratio
  - Number of received retweets per tweet
  - Klout score
- Synchronized group activity:
  - Following similarity
  - Retweet similarity

13 ACCOUNT POPULARITY – FOLLOWER-TO-FOLLOWING RELATIONSHIP
Percentage of accounts that have more followers than followings:
- Crowdturfing: 70%
- Normal: 37%
- Black-market: 20%

14 ACCOUNT POPULARITY – NUMBER OF RECEIVED RETWEETS PER TWEET
Percentage of each account's tweets that are retweeted more than once:
- Crowdturfing: 43%
- Normal: 5%
- Black-market: 4%

15 ACCOUNT POPULARITY – KLOUT SCORES
Median Klout scores:
- Crowdturfing: 41
- Normal: 33
- Black-market: 20

16 ACCOUNT POPULARITY - RESULT
Crowdturfing accounts have the highest values for all three features:
- Follower-to-following relationship
- Number of retweets per tweet
- Klout score
Based on these findings, crowdturfing accounts:
- Successfully boost their popularity by gaining followers and retweets from crowdturfing services.
- Differ from black-market accounts.
- Resemble influential users in OSNs.

17 SYNCHRONIZED GROUP ACTIVITY
[Figure panels: following similarity; retweet similarity]

18 SYNCHRONIZED GROUP ACTIVITY - RESULT
- The crowdturfing and normal account groups show the same pattern: low following similarity and low retweet similarity.
- The black-market group has the highest following and retweet similarities.
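The slides do not say how following similarity and retweet similarity are computed. As a rough sketch, assuming each is the Jaccard index between the sets of followed accounts (or retweeted tweets) for every pair of accounts in a group, the group-level value could be obtained as below. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(sets_by_account):
    """Average Jaccard index over all pairs of accounts in a group.

    ASSUMPTION: Jaccard is a stand-in metric; the slides do not name one.
    """
    pairs = list(combinations(sets_by_account.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical groups: account name -> set of followed account IDs.
bot_like = {"b1": {1, 2, 3, 4}, "b2": {1, 2, 3, 4}, "b3": {1, 2, 3, 5}}
human_like = {"h1": {1, 2}, "h2": {3, 4}, "h3": {5, 6}}

print(mean_pairwise_similarity(bot_like))    # high: accounts follow nearly the same set
print(mean_pairwise_similarity(human_like))  # 0.0: no overlap at all
```

The same function applies to retweet similarity if the sets hold retweeted tweet IDs instead of followed account IDs.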

19 ANALYZE CROWDTURFING TARGETS
Crowdturfing targets: tweets receiving artificial retweets generated by crowdturfing workers.
Characteristics:
- Retweet time distribution (mean, standard deviation, skewness, and kurtosis)
- Twitter application
- Unreachable retweeters
- Click information

20 RETWEET TIME DISTRIBUTION - 1
Count the number of retweets generated in each hour after a tweet is created.
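The hourly counting step can be sketched as follows. This is an illustration only: the function name and the timestamps are made up, and retweets are binned by whole hours elapsed since the tweet's creation time.

```python
from collections import Counter
from datetime import datetime


def hourly_retweet_counts(created_at, retweet_times):
    """Count retweets per whole hour elapsed since the tweet was created."""
    counts = Counter()
    for t in retweet_times:
        hours_elapsed = int((t - created_at).total_seconds() // 3600)
        counts[hours_elapsed] += 1
    return dict(sorted(counts.items()))


# Hypothetical tweet and retweet timestamps.
created = datetime(2015, 1, 1, 0, 0)
retweets = [
    datetime(2015, 1, 1, 0, 10),
    datetime(2015, 1, 1, 0, 50),
    datetime(2015, 1, 1, 1, 30),
    datetime(2015, 1, 1, 5, 0),
]
print(hourly_retweet_counts(created, retweets))  # {0: 2, 1: 1, 5: 1}
```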

21 RETWEET TIME DISTRIBUTION - 2
- Mean: the average time difference between posting and retweeting.
- Standard deviation: how tightly retweets cluster around the mean time.
- Skewness: when a tweet is mostly retweeted (early or late relative to the mean).
- Kurtosis: how sharply peaked the distribution is.
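A minimal sketch of these four features, computed from the delays (in hours) between posting and each retweet. It uses population moments and non-excess kurtosis (a normal distribution gives 3); the slides do not say which convention the authors used, so that choice is an assumption.

```python
import math


def distribution_features(delays_hours):
    """Mean, standard deviation, skewness, and kurtosis of retweet delays.

    ASSUMPTION: population (biased) moments and non-excess kurtosis.
    """
    n = len(delays_hours)
    if n == 0:
        raise ValueError("at least one retweet delay is required")
    mean = sum(delays_hours) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in delays_hours) / n
    m3 = sum((x - mean) ** 3 for x in delays_hours) / n
    m4 = sum((x - mean) ** 4 for x in delays_hours) / n
    if m2 == 0:
        skewness = kurtosis = 0.0  # all retweets at the same delay
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical delays: most retweets arrive early, one arrives late.
early_burst = distribution_features([0, 0, 0, 1, 10])
print(early_burst["skewness"] > 0)  # True: long tail toward later hours
```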

22 TWITTER APPLICATION & UNREACHABLE RETWEETERS
- Twitter application: for each tweet receiving retweets, compute the ratio of the number of retweets generated by the dominant application to the total number of retweets. Percentage of retweets generated by the application:
  - Crowdturfing: 90%
  - Black-market: 99%
  - Normal: 40%
- Unreachable retweeters:
  - Crowdturfing: for 80% of tweets, 80% of the retweeters are unreachable.
  - Normal: for fewer than 10% of tweets, 80% of the retweeters are unreachable.
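Both ratios can be sketched per tweet as below. Two assumptions are made for illustration: the "application" is the most frequent source application among a tweet's retweets, and a retweeter counts as unreachable when it does not follow the posting user (a simplification; the slides do not give the exact definition). All names and data are hypothetical.

```python
from collections import Counter


def dominant_application_ratio(retweet_sources):
    """Share of retweets generated by the most frequent source application."""
    if not retweet_sources:
        return 0.0
    _, top_count = Counter(retweet_sources).most_common(1)[0]
    return top_count / len(retweet_sources)


def unreachable_ratio(retweeters, author_followers):
    """Share of retweeters that do not follow the posting user.

    ASSUMPTION: "unreachable" is simplified to "not a follower of the author".
    """
    if not retweeters:
        return 0.0
    unreachable = [r for r in retweeters if r not in author_followers]
    return len(unreachable) / len(retweeters)


# Hypothetical tweet: nine retweets from one app, one from the web client.
sources = ["app_x"] * 9 + ["web"]
print(dominant_application_ratio(sources))  # 0.9

# Hypothetical retweeters; only "a" follows the author.
print(unreachable_ratio(["a", "b", "c", "d", "e"], {"a"}))  # 0.8
```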

23 CLICK INFORMATION
- Extract tweets containing bit.ly and goo.gl links from the database:
  - Normal: 6,024
  - Crowdturfing: 3,093
  - Black-market: 282
- Crawl the click analytics and extract the number of clicks:
  - Normal: over 80% of links receive more clicks than retweets.
  - Crowdturfing: about 90% of links receive fewer clicks than retweets.
  - Black-market: most tweets are never clicked.
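The comparison between clicks and retweets reduces to a single ratio per link. The sketch below uses made-up counts and a hypothetical function name; it does not fetch real click analytics.

```python
def click_to_retweet_ratio(clicks, retweets):
    """Ratio of link clicks to retweets for a tweet that contains a URL."""
    if retweets <= 0:
        raise ValueError("the tweet must have at least one retweet")
    return clicks / retweets


# Hypothetical counts: a normal-looking tweet vs. a boosted-looking tweet.
print(click_to_retweet_ratio(clicks=120, retweets=60))  # 2.0, more clicks than retweets
print(click_to_retweet_ratio(clicks=5, retweets=50))    # 0.1, fewer clicks than retweets
```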

24 CROWDTARGET
- Prepare the training and testing data:
  - Set the ratio of malicious tweets to 1% of all tweets.
  - Randomly duplicate normal tweets until there are 99 times as many normal tweets as malicious tweets.
- CrowdTarget (classifier) features:
  - Mean, standard deviation, skewness, and kurtosis of the retweet time distribution
  - Ratio of the dominant application used
  - Ratio of unreachable retweeters
  - Ratio of #clicks to #retweets for tweets containing URLs
- Normalize all feature values to the range 0 - 1.
- Test several classifiers using the scikit-learn library:
  - AdaBoost
  - Gaussian naïve Bayes
  - K-nearest neighbors
- Validate the classification results with 10-fold cross-validation.
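The two data-preparation steps, scaling features to 0 - 1 and duplicating normal tweets to reach a 99:1 ratio, can be sketched in plain Python as below. This is not the authors' code; the function names, the fixed random seed, and the toy feature rows are assumptions for illustration.

```python
import random


def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def oversample_normal(normal_rows, n_malicious, ratio=99, seed=0):
    """Randomly duplicate normal rows until there are ratio * n_malicious."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    target = ratio * n_malicious
    missing = max(0, target - len(normal_rows))
    return normal_rows + [rng.choice(normal_rows) for _ in range(missing)]


# Hypothetical feature rows: [mean delay, dominant application ratio].
normal = min_max_normalize([[0, 10], [5, 20], [10, 30]])
print(normal)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]

balanced_normal = oversample_normal(normal, n_malicious=2)
print(len(balanced_normal))  # 198, so malicious tweets are 2 of 200 (1%)
```

The resulting rows could then be passed to any of the classifiers listed on the slide.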

25 BASIC CLASSIFICATION: NORMAL VS. MALICIOUS TWEETS WITHOUT CLICK INFORMATION
True-positive rate (TPR) and area under the receiver operating characteristic (ROC) curve (AUC):
- Gaussian naïve Bayes: TPR 0.87, AUC 0.99
- AdaBoost: TPR 0.95, AUC 0.994
- K-nearest neighbors: TPR 0.96, AUC 0.991
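For reference, the true-positive rate and false-positive rate follow directly from the confusion counts. The sketch below uses made-up labels, with 1 meaning malicious and 0 meaning normal; it is not tied to the reported results.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical labels: 4 malicious tweets, 6 normal tweets.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr)  # 0.75: 3 of 4 malicious tweets detected
print(fpr)  # 1 of 6 normal tweets flagged
```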

26 CLASSIFICATION WITH CLICK INFORMATION
- Extract tweets containing bit.ly and goo.gl links from the dataset.
- Classify them with an added link-based feature (#clicks / #retweets).
- Test with only the K-nearest neighbors algorithm.
- Result: TPR 0.98, AUC 0.993

27 ERROR ANALYSIS
- False negatives:
  - Crowdturfing tweets that received a small number of retweets were misjudged.
  - 50% of undetected crowdturfing tweets were mostly retweeted by reachable accounts.
  - A few links in undetected crowdturfing tweets received more clicks than retweets.
- False positives:
  - A few tweets from verified accounts were retweeted by automated applications.
  - Tweets embedded in websites were classified as malicious.

28 FEATURE ROBUSTNESS
Retweet time distribution
- Arranging a retweet schedule that resembles a normal retweet time distribution is impossible because crowdturfing workers act independently.
- Controlling every boosting task by installing a program on workers' devices is not desirable.
- Services that handle every boosting task on their own servers can be detected by OSNs through the shared IP addresses.
- Using bot accounts to perform tasks secretly can be costly.

29 TWITTER APPLICATION
- By assigning different applications to different groups of workers, services can eliminate the dominant application.
- However, they cannot create an arbitrarily large number of Twitter applications because Twitter restricts the number of applications created per day and per account.
- It is difficult to control the ratio of the most dominant application exactly because workers can retweet any tweet at any time.

30 UNREACHABLE RETWEETERS
It is impossible to have crowdturfing workers follow the posting user of a tweet:
- Workers would keep receiving the posting user's future tweets.
- Increasing the number of followings can decrease a worker's popularity on Twitter.
- Workers cannot follow the posting user when they have few followers or have recently followed many accounts.

31 CLICK INFORMATION
It is impossible to require crowdturfing workers to click on a link in a tweet while retweeting it:
- Workers do not want to click on a link that might be malicious.
- Artificial clicks are expected to differ in time, geographical location, user agent, etc.

32 CONCLUSION
- The manipulation patterns of target objects persist regardless of the evasion techniques crowdturfing accounts use.
- Through this observation, the authors can distinguish tweets that received retweets from crowdturfing sites from tweets that received retweets from normal Twitter users.
- Evaluation results show that CrowdTarget detects crowdturfing retweets on Twitter with a TPR of 0.98 at an FPR of 0.01.

33 QUESTIONS? COMMENTS? CONCERNS? SUGGESTIONS?

