1 A Quantitative Study of Forum Spamming Using Context-Based Analysis Yi-Min Wang^ Ming Ma^ Yuan Niu* Hao Chen* Francis Hsu* *UC Davis, ^Microsoft Research.

2 A Look at the Web (figure labels: User, Spammer)

3 Why do we care about spam? Users want to look at quality pages on the web, interact without the trouble of moderation, and surf safely. Search engines want to provide good search results and profit from ads. We want to investigate the landscape of the problem. A popular battleground: web forums.

4 Why Web Forums? Open communities: wikis, forums, blogs. Increasingly easy to contribute.

5 Why Web Forums?

6 How Spammers Operate (diagram). Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain. Edge labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User; Returns (unnumbered).

7 How to deal with the problem? Content-based approach: constrained by the content retrieved; may be deceived by tricks like cloaking and redirection. We propose context-based analysis.

8 Context-based Analysis. Consists of redirection analysis and cloaking analysis. See dynamic content not served to crawlers. Use the Strider URL Tracer. Flag large numbers of doorway pages leading to spam domains. Based on the intuition that publishing links is necessary to increase popularity, and that we must see the destination URL eventually.

9 Doorways & Redirections Google search: Coach handbag

10 Redirection Analysis. Fed URLs to the Strider URL Tracer, which records all pages visited. Ranked the top third-party domains by redirections. Seeded with a known spammer domain. Identified doorway pages based on association with spammer domains. Manually investigated unknown domains to expand the blacklist.
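The ranking and flagging steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the Strider URL Tracer itself: the `traces` format (doorway URL mapped to the list of URLs visited while loading it), the function names, and the example hosts are all assumptions, and hostnames stand in for registered domains.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lowercased hostname of a URL ('' if it has none)."""
    return (urlparse(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many doorway URLs led to it.

    `traces` is a hypothetical format: doorway URL -> URLs visited.
    """
    counts = Counter()
    for doorway, visited in traces.items():
        first_party = host(doorway)
        # A set, so each doorway counts at most once per destination host.
        third_party = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_party)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return doorway URLs whose traffic reached a blacklisted host."""
    return sorted(
        doorway
        for doorway, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Made-up traces for illustration only.
traces = {
    "http://blog-a.example/post1": ["http://spam-hub.example/landing"],
    "http://blog-b.example/post7": [
        "http://blog-b.example/style.css",
        "http://spam-hub.example/landing",
    ],
    "http://blog-c.example/about": ["http://cdn.example/lib.js"],
}
ranking = rank_third_party_domains(traces)
flagged = flag_doorways(traces, {"spam-hub.example"})
```

Hosts near the top of `ranking` that are not yet on the blacklist would be the candidates for the manual investigation step.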

11 Cloaking Analysis. Diff-based check: run each URL twice, once with anti-cloaking and once without. Crawler-browser cloaking (User-agent, scripting on/off). Click-through cloaking (Referer).

12 Crawler-Browser Cloaking. Google search: ringtones download (panels: JavaScript disabled, JavaScript enabled)
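A diff-based check of this kind might look like the sketch below. It covers only the header-based variants (User-Agent and Referer); the scripting on/off comparison would need a real browser. The header values, the `fetch(url, headers)` callable, the 0.9 similarity threshold, and the fake fetcher used for the demonstration are all assumptions, and pages with legitimately dynamic content would need a more tolerant comparison.

```python
from difflib import SequenceMatcher

# Header sets are illustrative assumptions, not the study's configuration.
BROWSER = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
CRAWLER = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
CLICK_THROUGH = dict(
    BROWSER, Referer="http://www.google.com/search?q=ringtones+download"
)


def similarity(a, b):
    """Similarity ratio in [0, 1] between two page bodies."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cloaking_report(fetch, url, threshold=0.9):
    """Compare a direct browser visit with crawler and click-through views.

    `fetch(url, headers) -> str` is supplied by the caller.
    """
    baseline = fetch(url, BROWSER)
    crawler_view = fetch(url, CRAWLER)
    click_view = fetch(url, CLICK_THROUGH)
    return {
        "crawler_browser": similarity(baseline, crawler_view) < threshold,
        "click_through": similarity(baseline, click_view) < threshold,
    }


def fake_fetch(url, headers):
    """Stand-in fetcher: serves ad content only to search click-throughs."""
    if "Referer" in headers:
        return "<html>" + "buy ringtones now " * 20 + "</html>"
    return "<html>" + "my personal diary entry " * 20 + "</html>"


report = cloaking_report(fake_fetch, "http://doorway.example/")
```

Passing the fetcher in as a parameter keeps the comparison logic testable without network access; a real run would supply a function built on an HTTP client.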

13 Crawler-Browser Cloaking

14 Click-Through Cloaking (figure labels: cached page / scripting off / crawler view; advertising page from click-throughs; directly visiting the page)

15 Three Perspectives (the "How Spammers Operate" diagram, annotated with the labels Search User and Webhost). Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain. Edge labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User; Returns (unnumbered).

16 Search User

17 Search User. Chose 9 popular forum software packages, written in Perl/PHP, hosted and unhosted: WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin. Compiled popular tags and common spam terms into a list of 190 keywords: “Myspace, jewelry, casino, shopping, baseball…” Searched for all pairs in Google & MSN.
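The query set can be sketched as below, assuming "all pairs" means each forum package combined with each keyword. Only the five keywords quoted on the slide are used, since the full list of 190 is not given, and the query string format is an assumption.

```python
from itertools import product

FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]
# Only the keywords quoted on the slide; the remaining 185 are not listed.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(software, keywords):
    """Pair every forum package with every keyword (hypothetical format)."""
    return [f'"{s}" {k}' for s, k in product(software, keywords)]


queries = build_queries(FORUM_SOFTWARE, SAMPLE_KEYWORDS)
# With the full keyword list, each engine would receive 9 * 190 queries.
full_run_size = len(FORUM_SOFTWARE) * 190
```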

18 Search User. 189 of 190 search terms returned spammed forums in the top 20 results from both Google and MSN; the only exception is “palm-texas-holdem-game”. Top 5 most spammed forums (columns: Forum, Pages, Keywords): /interactive/forum/ /phorum /phorum/list.php?

19 Honeyblogs. Spammers create their own doorway pages and promote the doorways by posting to other people’s pages. Honeyblogs lure the spammer in: no moderation, default accept-all policy. Pinged blog aggregators with every post. Abandoned within three months.

20 Honeyblogs. 41,100 comments collected over 339 days; 19,297 comments received in the last month. Ilium – 930/1432. Litlog – 3734/5714. Spammer activity got me kicked off my hosting server.

21 Honeyblog Activity

22 Honeyblog Activity 3142

23 Webhost Perspective. Focus on splog doorways.
Blog host: examined URLs; spam URLs; URLs using cloaking
Blogspot: 13,389; 1,091 (8.1%); 652
Blogspoint: 4,714; 3,535 (75%); 131
Blogstudio: (missing); 198 (54%); 0
Blogsharing: 99; 82 (83%); 0
The numbers above are lower bounds: only pages using cloaking and redirection are considered.

24 Webhost Perspective. Blogspot: 1,091 splogs. Most popular host. Randomly sampled 1% of profile pages created in July and extracted all blog links: 13,389. 60% of splogs used cloaking. 24% of splogs redirected to filldirect.com.

25 Webhost Perspective
Blogspoint: 3,535 splogs; 2,166 redirected to finance-web-search.com and 917 to casino-web-search.com.
Blogstudio: 198 splogs; 130 redirected to finance-web-search.com and 54 to casino-web-search.com.
Blogsharing: 82 splogs, with plumber-related link spamming in the splogs.
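As a quick consistency check, the percentages on this slide can be recomputed from the counts. The sketch assumes the flattened Blogsharing entry "9982 (83%)" reads as 99 examined URLs and 82 spam URLs, and it omits Blogstudio because its examined-URL count is missing from the transcript.

```python
# (examined URLs, spam URLs, URLs using cloaking), as read from the slide.
HOSTS = {
    "Blogspot": (13389, 1091, 652),
    "Blogspoint": (4714, 3535, 131),
    "Blogsharing": (99, 82, 0),  # assumed split of the flattened "9982"
}


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of examined URLs that were spam, per host.
spam_rates = {
    name: round(percent(spam, examined), 1)
    for name, (examined, spam, _cloaked) in HOSTS.items()
}

# Share of Blogspot splogs that used cloaking.
_, blogspot_spam, blogspot_cloaked = HOSTS["Blogspot"]
blogspot_cloaking_share = round(percent(blogspot_cloaked, blogspot_spam))
```

The recomputed rates round to the 8.1%, 75%, and 83% shown in the table, and the Blogspot cloaking share rounds to 60%.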

26 Also of note…
Malicious URLs: previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities. We tested 8 known malicious URLs for presence on the web and found 5 spammed in forums, 2 in link farms, and 1 in referrer logs.
Universal redirectors: redirect the user to any URL (sometimes the destination is obfuscated): url here] → 60. Could be used to serve malicious URLs, particularly those on .edu and .gov sites.
[1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
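One simple way to spot candidate universal redirectors is to look for query parameters whose value is an absolute URL on a different host. The sketch below is a heuristic assumed for illustration, not the method used in the study: the function name and example URLs are hypothetical, it handles only plain and percent-encoded destinations (not other obfuscation), and a hit would still need to be confirmed by requesting the URL and observing the redirect.

```python
from urllib.parse import parse_qsl, urlparse


def redirect_targets(url):
    """Return (parameter, target host) pairs for off-site URL parameters."""
    parsed = urlparse(url)
    own_host = (parsed.hostname or "").lower()
    hits = []
    # parse_qsl percent-decodes values, so encoded destinations are covered.
    for name, value in parse_qsl(parsed.query):
        target = urlparse(value)
        target_host = (target.hostname or "").lower()
        is_web_url = target.scheme in ("http", "https")
        if is_web_url and target_host and target_host != own_host:
            hits.append((name, target_host))
    return hits


# Hypothetical URLs for illustration only.
candidate = (
    "http://www.university.example/go"
    "?url=http%3A%2F%2Fspam-hub.example%2Fpills"
)
hits = redirect_targets(candidate)
benign = redirect_targets("http://www.university.example/search?q=casino")
```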

27 Related Work (Part 1)
Diff-based cloaking: Wu & Davison combined diff-based cloaking detection with content-based analysis. Our approach also detects click-through cloaking.
Content-based approaches: Fetterly, Manasse and Najork (URL properties, clustering pages of similar content); Mishne, Carmel, Lempel (compared statistical models of comments and target pages against post content); Kolari, Finin and Joshi (meta tag text, anchor text, URLs). Our approach is complementary to content-based approaches.

28 Related Work (Part 2)
Measurements of trust: Metaxas et al. defined trust neighborhoods; Benczur et al. proposed SpamRank, which identifies outliers by looking at the PageRank of the site and its “supporters”. Similarly, our approach propagates distrust by following redirections.
Plugins to aid moderating forums/blogs: Akismet, Bad Behavior, Spam Karma. Our approach does not require cooperation from forum owners.

29 Conclusions. The context-based approach successfully detects advanced cloaking- and redirection-based spam. Spammers are pervasive: 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN. The same spammer redirects to two domains on both Blogspoint and Blogstudio.

30 Future work. There is hope! Economic solution: identify the middlemen in online advertising. Read our WWW07 paper: Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.