1. A Quantitative Study of Forum Spamming Using Context-Based Analysis
Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu*
*UC Davis, ^Microsoft Research


2. A Look at the Web
Figure labels: User, Spammer

3. Why do we care about spam?
- Users want to:
  - Look at quality pages on the web
  - Interact without the trouble of moderation
  - Surf safely
- Search engines want to:
  - Provide good search results
  - Profit from ads
- We want to investigate the landscape of the problem
- Popular battleground: web forums

4. Why Web Forums?
- Open communities: wikis, forums, blogs
- Increasingly easy to contribute

5. Why Web Forums?

6. How Spammers Operate
Diagram entities: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain
Diagram arrow labels:
1. Creates
2. Writes Splog URLs
3. Propagates Splog URL
4. Sends User to Doorway URL
5. Redirects User
(unnumbered) Returns

7. How to deal with the problem?
- Content-based approach
  - Constrained by the content retrieved
  - May be deceived by tricks like cloaking and redirection
- We propose: context-based analysis

8. Context-based Analysis
- Consists of redirection analysis and cloaking analysis
  - See dynamic content not served to crawlers
- Use the Strider URL Tracer
  - Flag large numbers of doorway pages that lead to spam domains
- Based on the intuition that:
  - Publishing links is necessary to increase popularity
  - We must see the destination URL eventually

9. Doorways & Redirections
Google search: Coach handbag

10. Redirection Analysis
- Fed URLs to the Strider URL Tracer, which records all pages visited
- Ranked the top third-party domains by number of redirections
  - Seeded with a known spammer domain
- Identified doorway pages based on their association with spammer domains
- Manually investigated unknown domains to expand the blacklist
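The ranking and flagging steps on this slide can be sketched roughly as follows. This is an illustration only, not the Strider URL Tracer: it assumes the tracer's output is already available as a mapping from each posted URL to the list of URLs visited, and the function names `rank_third_party_domains` and `flag_doorways` are hypothetical. Hosts are compared by full hostname, which is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many posted URLs led to it.

    traces: dict mapping a posted URL to the list of URLs visited
    while loading it (assumed tracer output format).
    """
    counts = Counter()
    for doorway, visited in traces.items():
        own = host(doorway)
        # Each posted URL counts at most once per third-party host.
        third_party = {host(u) for u in visited} - {own, ""}
        counts.update(third_party)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return posted URLs whose trace touches a blacklisted host."""
    return sorted(
        doorway
        for doorway, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )
```

Hosts that rank highly but are not yet on the blacklist would be the candidates for the manual investigation step.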

11. Cloaking Analysis
- Diff-based check
  - Run each URL twice: once with anti-cloaking, once without
- Crawler-browser cloaking (User-Agent, scripting on/off)
- Click-through cloaking (Referer)
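A minimal sketch of the diff-based check, under several assumptions: `fetch` is a hypothetical callable supplied by the caller that returns page text for a URL and a set of request headers, the header values are placeholders, and the 0.9 similarity threshold is arbitrary. Only header-based variation is modeled here; the scripting on/off comparison would need a real browser.

```python
import difflib

# Placeholder header values (assumptions, not taken from the slides).
CRAWLER_UA = "ExampleCrawler/1.0"
BROWSER_UA = "Mozilla/5.0 (ExampleBrowser)"
SEARCH_REFERER = "http://search.example/?q=ringtones"


def differs(a, b, threshold=0.9):
    """True if two page bodies are less similar than the threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() < threshold


def check_cloaking(url, fetch, threshold=0.9):
    """Fetch one URL under three request profiles and diff the results.

    fetch(url, headers) -> str is a caller-supplied function (hypothetical).
    """
    as_crawler = fetch(url, {"User-Agent": CRAWLER_UA})
    as_browser = fetch(url, {"User-Agent": BROWSER_UA})
    as_click = fetch(url, {"User-Agent": BROWSER_UA, "Referer": SEARCH_REFERER})
    return {
        # Page differs between a crawler and a browser User-Agent.
        "crawler_browser": differs(as_crawler, as_browser, threshold),
        # Page differs only when arriving via a search-result click.
        "click_through": differs(as_browser, as_click, threshold),
    }
```

Passing `fetch` in as a parameter keeps the comparison logic testable without network access.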

12. Crawler-Browser Cloaking
Google search: ringtones download
- www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript disabled)
- www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript enabled)

13. Crawler-Browser Cloaking

14. Click-Through Cloaking
Figure labels:
- Cached page / scripting off / crawler view
- Advertising page from click-throughs
- Directly visiting the page
- Cached page / scripting off / crawler view

15. Three Perspectives
Diagram entities: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain, Search User, Webhost
Diagram arrow labels:
1. Creates
2. Writes Splog URLs
3. Propagates Splog URL
4. Sends User to Doorway URL
5. Redirects User
(unnumbered) Returns

16. Search User

17. Search User
- Chose 9 popular forum software packages: written in Perl/PHP, hosted/unhosted
  - WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin
- Compiled popular tags and common spam terms: a list of 190 keywords
  - "Myspace, jewelry, casino, shopping, baseball…"
- Searched for all pairs in Google & MSN
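The "all pairs" search can be sketched as a Cartesian product of forum software names and keywords. The query format (software name followed by keyword) is an assumption, and only the five keywords quoted on the slide are used here; the full 190-keyword list is not reproduced.

```python
from itertools import product

# Forum software names as listed on the slide.
FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]

# Only the sample keywords quoted on the slide; the real list had 190.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(software, keywords):
    """Pair every software name with every keyword (assumed query format)."""
    return [f"{s} {k}" for s, k in product(software, keywords)]


queries = build_queries(FORUM_SOFTWARE, SAMPLE_KEYWORDS)
# With the full keyword list this would be 9 * 190 = 1710 queries per engine.
```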

18. Search User
- Search terms returned spammed forums in the top 20 results from both Google and MSN
  - Only exception is "palm-texas-holdem-game"
- Top 5 most spammed forums:

Forum                                                  Pages   Keywords
http://fs.fed.us/...mm/get/mmforumA.html               175     102
http://www.comm.fsu.edu/interactive/forum/             134     82
http://www.usra.edu/phorum                             119     94
http://classicauthors.net/messageboard/list.php?f=1    117     97
http://samba.eecs.umich.edu/phorum/list.php?2          105     79

19. Honeyblogs
- Spammers:
  - Create their own doorway pages, and
  - Promote the doorways by posting to other people's pages
- Honeyblogs lure the spammer in:
  - No moderation, default accept-all policy
  - Pinged blog aggregators with every post
  - Abandoned within three months

20. Honeyblogs
- 41,100 comments collected over 339 days
- 19,297 comments received in the last month
  - Ilium: 930/1432
  - Litlog: 3734/5714
- Spammer activity got me kicked off my hosting server
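As a quick check on the scale of these figures, the totals on this slide work out to roughly 121 comments per day on average, with about 47% of all comments arriving in the last month. The variable names below are illustrative; only the three numbers come from the slide.

```python
# Figures taken from the slide.
TOTAL_COMMENTS = 41_100
DAYS = 339
LAST_MONTH_COMMENTS = 19_297

# Average daily volume over the whole collection period.
avg_per_day = TOTAL_COMMENTS / DAYS

# Fraction of all comments that arrived in the final month.
last_month_share = LAST_MONTH_COMMENTS / TOTAL_COMMENTS

print(f"{avg_per_day:.1f} comments/day on average")
print(f"{last_month_share:.1%} of all comments arrived in the last month")
```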

21. Honeyblog Activity

22. Honeyblog Activity
Figure label: 3142

23. Webhost Perspective
- Focus on splog doorways

Blog Host     Examined URLs   Spam URLs       URLs Using Cloaking
Blogspot      13,389          1,091 (8.1%)    652
Blogspoint    4,714           3,535 (75%)     131
Blogstudio    369             198 (54%)       0
Blogsharing   99              82 (83%)        0

- Above numbers are lower bounds
  - Consider only pages using cloaking & redirection
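The percentages in the table can be recomputed from the raw counts. The sketch below uses only the numbers on the slide; the dictionary layout and function names are illustrative.

```python
# (examined URLs, spam URLs, spam URLs using cloaking) per blog host,
# copied from the table on the slide.
HOSTS = {
    "Blogspot": (13_389, 1_091, 652),
    "Blogspoint": (4_714, 3_535, 131),
    "Blogstudio": (369, 198, 0),
    "Blogsharing": (99, 82, 0),
}


def spam_rate(examined, spam):
    """Percentage of examined URLs that were spam."""
    return 100 * spam / examined


def cloaking_share(spam, cloaked):
    """Percentage of spam URLs that used cloaking."""
    return 100 * cloaked / spam


for name, (examined, spam, cloaked) in HOSTS.items():
    print(f"{name}: {spam_rate(examined, spam):.1f}% spam, "
          f"{cloaking_share(spam, cloaked):.1f}% of spam cloaked")
```

For Blogspot the cloaking share comes out at about 59.8%, consistent with the "60% of splogs used cloaking" figure on the following slide.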

24. Webhost Perspective
- Blogspot: 1,091 splogs
  - Most popular
  - Randomly sampled 1% of profile pages created in July and extracted all blog links: 13,389
  - 60% of splogs used cloaking
  - 24% of splogs redirected to filldirect.com

25. Webhost Perspective
- Blogspoint: 3,535 splogs
  - 2,166 redirected to finance-web-search.com
  - 917 redirected to casino-web-search.com
- Blogstudio: 198 splogs
  - 130 redirected to finance-web-search.com
  - 54 redirected to casino-web-search.com
- Blogsharing: 82 splogs
  - Plumber-related link spamming in splogs
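To make the concentration of redirect targets concrete, the counts on this slide can be tallied as below. The data layout and the function name `share_to_two_domains` are illustrative; the numbers are taken from the slide.

```python
# Splog counts and redirect destinations per host, from the slide.
REDIRECTS = {
    "Blogspoint": {
        "splogs": 3_535,
        "finance-web-search.com": 2_166,
        "casino-web-search.com": 917,
    },
    "Blogstudio": {
        "splogs": 198,
        "finance-web-search.com": 130,
        "casino-web-search.com": 54,
    },
}


def share_to_two_domains(host):
    """Percentage of a host's splogs redirecting to either listed domain."""
    data = REDIRECTS[host]
    redirected = data["finance-web-search.com"] + data["casino-web-search.com"]
    return 100 * redirected / data["splogs"]


for name in REDIRECTS:
    print(f"{name}: {share_to_two_domains(name):.1f}% of splogs "
          "redirect to the two domains")
```

On these counts, about 87% of Blogspoint splogs and about 93% of Blogstudio splogs lead to the same two domains.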

26. Also of note…
- Malicious URLs
  - Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities
  - We tested 8 known malicious URLs for presence on the web
  - Found 5 spammed in forums, 2 in link farms, 1 in referrer logs
- Universal redirectors
  - Redirect the user to any URL (sometimes the destination is obfuscated):
    - www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here]
    - http://tinyurl.com/3c7twl -> http://www.canadianpharmacyltd.com/group.php?id=59&aid=860
  - Could be used to serve malicious URLs, particularly those on .edu and .gov sites
[1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
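For redirectors that carry the destination in plain sight, as in the `click.cgi?num=2&url=...` example, the target can be read straight out of the query string. The sketch below is an assumption-laden illustration: the function name `redirect_targets` and the default parameter name `url` are hypothetical, and obfuscated redirectors such as URL shorteners are not handled because their destination is not in the URL.

```python
from urllib.parse import parse_qs, urlparse


def redirect_targets(url, param_names=("url",)):
    """Return absolute http(s) URLs carried in the named query parameters."""
    query = parse_qs(urlparse(url).query)
    targets = []
    for name in param_names:
        for value in query.get(name, []):
            # Keep only values that are themselves absolute web URLs.
            if urlparse(value).scheme in ("http", "https"):
                targets.append(value)
    return targets
```

A URL with no such parameter simply yields an empty list, so a shortener link would have to be resolved by actually following it.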

27. Related Work (Part 1)
- Diff-based cloaking
  - Wu & Davison: diff-based cloaking combined with content-based analysis
  - Our approach detects click-through cloaking
- Content-based approaches
  - Fetterly, Manasse and Najork: URL properties, clustering pages of similar content
  - Mishne, Carmel, Lempel: compared statistical models of comments & target pages against post content
  - Kolari, Finin and Joshi: meta tag text, anchor text, URLs
  - Our approach is complementary to content-based approaches

28. Related Work (Part 2)
- Measurements of trust
  - Metaxas et al: defined trust neighborhoods
  - Benczur et al: SpamRank, which identifies outliers by looking at the PageRank of the site and its "supporters"
  - Similarly, our approach propagates distrust by following redirections
- Plugins to aid moderating forums/blogs
  - Akismet, Bad Behavior, Spam Karma
  - Our approach does not require cooperation from forum owners

29. Conclusions
- Context-based approach successfully detects advanced cloaking & redirection-based spam
- Spammers are pervasive
  - 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN
  - Same spammer redirecting to two domains on blogspoint and blogstudio

30. Future work
- There is hope!
- Economic solution
  - Identifies middlemen in online advertising
- Read our WWW07 paper [1]
- http://wwwcsif.cs.ucdavis.edu/~niu
- http://research.microsoft.com/csm/strider/
[1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.

