© 2006 CrawlWall.com ‘Bot Obedience Taking Control of Your Site Transitioning from free-for-all ‘bot abuse to tightly controlled site access Bill Atchison CrawlWall.com “The Bot Stops Here!”

© 2006 CrawlWall.com Rogue Spiders Go On Rampage My website was under constant ‘bot attack –Scraping accounted for 10% or more of daily page views, not counting spiders from Google, Yahoo! and MSN –Copyrighted material was stolen and scattered all over the web –High-speed scrapers overloaded the server for extended periods, stopping visitors and the major search engines from accessing the site This was unacceptable and had to be stopped!

© 2006 CrawlWall.com What are Bad ‘Bots? Defining a good ‘bot vs. a bad ‘bot The motives behind why bad ‘bots exist The various types of ‘bots, ranging from mild nuisances to very harmful Stealth ‘bots vs. visible ‘bots How scraper ‘bots utilize content

© 2006 CrawlWall.com What Good ‘Bots Do Obey Internet standards like robots.txt Don’t crawl your server abusively fast Return to get fresh content in a reasonable timeframe Provide traffic in return for crawling your site
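The polite behavior described above can be sketched with Python's standard-library robots.txt parser. This is a minimal illustration, not the author's tooling; the rules, paths and the "ExampleBot" agent name are assumptions for the example:

```python
# Sketch of "good 'bot" behavior: consult robots.txt before fetching
# and honor the Crawl-delay directive instead of hammering the server.
from urllib import robotparser

# Illustrative robots.txt content (normally fetched from the site).
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_can_fetch(path, agent="ExampleBot"):
    """Return True only if robots.txt allows this agent to fetch the path."""
    return rp.can_fetch(agent, path)

print(polite_can_fetch("/index.html"))          # True
print(polite_can_fetch("/private/data.html"))   # False
print(rp.crawl_delay("ExampleBot"))             # 5 (seconds between requests)
```

A well-behaved crawler would sleep for the reported crawl delay between requests and skip any disallowed path.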

© 2006 CrawlWall.com What Bad ‘Bots Do Will go to any length to get your content –Ignore Internet standards like robots.txt –Spoof ‘bot names used by major search engines –Change the User Agent randomly to avoid filters –Masquerade as humans (stealth) to completely bypass filters –Crawl as fast as possible to avoid being stopped –Crawl as slow as possible to slide under the radar –Crawl from as many IPs as possible to avoid detection –Return often to get your new content and get indexed first Violate your copyrights and repackage your site Hijack your search engine positions Provide no value in return for crawling
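The name-spoofing tactic above can be caught with a double reverse-DNS check: resolve the claimed crawler's IP to a hostname, confirm the hostname belongs to the search engine, then resolve it forward again. A minimal sketch with the DNS lookups passed in as functions so the logic is testable without live DNS (in production they would be socket.gethostbyaddr and socket.gethostbyname); the IPs and hostnames are illustrative:

```python
# Sketch of exposing 'bots that spoof "Googlebot" in their User-Agent.
def is_real_googlebot(ip, reverse_dns, forward_dns):
    """Genuine only if the IP reverse-resolves to a googlebot.com/google.com
    host that forward-resolves back to the same IP."""
    host = reverse_dns(ip)
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    return forward_dns(host) == ip

# Illustrative lookup tables: one genuine crawler, one spoofer.
reverse = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com",
           "203.0.113.9": "scraper.example.net"}
forward = {"crawl-66-249-66-1.googlebot.com": "66.249.66.1"}

print(is_real_googlebot("66.249.66.1", reverse.get, lambda h: forward.get(h)))  # True
print(is_real_googlebot("203.0.113.9", reverse.get, lambda h: forward.get(h)))  # False
```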

© 2006 CrawlWall.com What Motivates Bad ‘Bots? They want to get something for nothing! –To build websites using your content –To mine information using your content –To get traffic using your content –To make money using your content Got the picture? You build it and parasites profit off your hard work.

© 2006 CrawlWall.com Who Are All These ‘Bots? Intelligence-gathering Spybots –Copyright Compliance –Branding Compliance –Corporate Security Monitoring –Media Monitoring (mp3, mpeg, etc.) –A myriad of Safe-Site Monitoring solutions Content Scrapers (pure theft) Data Aggregators Link Checkers Privacy Checkers Web Copiers/Downloaders Offline Web Browsers An explosion of open-source crawlers like Nutch and Heritrix And many more…

© 2006 CrawlWall.com Stealth ‘Bots vs. Visible* ‘Bots *Visible ‘bots excluding major search engines like Google, Yahoo! and MSN A sample of daily page requests made by unwelcome ‘bots shows that stealth activity, which can’t be blocked by user agent filtering, exceeds that of easily identifiable ‘bots.

© 2006 CrawlWall.com The Wild Wild Web

© 2006 CrawlWall.com How Scraper ‘Bots Use Your Content The following examples will show how content is used by scrapers and hijackers building websites that feed off your text and keywords to drive clicks to their customers. See how these scrapers were fed crumbs of data that linked them back to their ‘bots that crawled the website.

© 2006 CrawlWall.com Scrapers Use Your Keywords This web site is not about CrawlWall, but they’re using my site name and scraped content in an attempt to get traffic to click their links.

© 2006 CrawlWall.com Scrapers Scramble Your Content Scraped pages are scrambled together to make new content and avoid duplicate content penalties. The suspected scraper was fed its own ‘bot IP address for later identification.

© 2006 CrawlWall.com Scrapers’ Methods Used Against Them Scrapers can be fed their own information in order to link the scraper ‘bot to the scraper website. The suspected scraper was fed its own ‘bot User Agent, which shows this was a stealth crawler.

© 2006 CrawlWall.com Scraper Site Linked to ‘Bot Origins A quick check of the log file archives reveals that this scraper used a proxy on a dedicated server and got only a couple of error messages seeded with crumbs instead of content, as the proxy was already being blocked. Note that Googlebot tried crawling through the proxy server, which can lead to hijacked pages in the search engine.

© 2006 CrawlWall.com Cloaked Scrapers Hide Your Content Note the bot IP address was again fed back to the suspected scraper This is what the cloaked site shows search engines to get traffic, this is never seen by visitors to their site.

© 2006 CrawlWall.com Cloaked Scrapers Show Links That Pay Totally unrelated to the scraped content that brings the traffic, the cloaked scraper shows visitors this page to earn money.

© 2006 CrawlWall.com Search Engine Scraping by Proxy Here are a couple of examples from Google showing how proxy servers attempt to get traffic. Proxy sites don’t have spiders of their own; they use the search engines as unwitting scrapers by cloaking links that entice Googlebot and others to crawl via their proxy. If Googlebot weren’t restricted by IP address, the actual site content would have been crawled and indexed, and the proxy hijacking could appear near, or even above, my site listing.

© 2006 CrawlWall.com Scrapers Damage Reputations Scraper activity can directly damage the reputation of both you and your customers when content from your website appears in disreputable locations. There can be backlash from customers who, unaware of the scraper situation, think you might somehow be responsible for these promotions on seedy websites.

© 2006 CrawlWall.com Stopping ‘bots doesn’t take a genius

© 2006 CrawlWall.com How to Get ‘Bots Under Control OPT-IN vs. OPT-OUT ‘bot blocking strategies OPT-IN Traffic Analysis Profiling and detecting Stealth ‘Bots vs. Visitors Setting spider traps and using natural traps Avoiding search engine pitfalls Protecting your site

© 2006 CrawlWall.com OPT-OUT ‘Bot Blocking Fails Robots.txt only works for well-behaved ‘bots, as most bad ‘bots ignore robots.txt except when trying to avoid spider traps User Agent blacklist filters fail because new bad ‘bots appear daily, periodically change their names or use random names to avoid being blocked IP blocking in the firewall can create lists so large that firewall processing degrades server performance

© 2006 CrawlWall.com OPT-IN ‘Bot Paradigm Shift Authorize good ‘bots only, no more blacklists as everything else is blocked by default Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites Authorize browsers explicitly such as Internet Explorer, Firefox, Opera and mobile devices
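The OPT-IN gatekeeper described above can be sketched as a default-deny check: only explicitly listed crawlers get in, and each is pinned to its known IP range to defeat spoofing. The crawler names, IP range and browser tokens below are illustrative assumptions, not an authoritative allowlist:

```python
# Sketch of OPT-IN access control: deny by default, allow listed
# crawlers only from their own IP ranges, plus recognized browsers.
import ipaddress

ALLOWED_CRAWLERS = {
    # Illustrative: Googlebot pinned to a historical Google crawl range.
    "Googlebot": ipaddress.ip_network("66.249.64.0/19"),
}
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")  # illustrative browser allowlist

def allow_request(user_agent, ip):
    for name, network in ALLOWED_CRAWLERS.items():
        if name in user_agent:
            # Crawler name must come from its own range (anti-spoofing).
            return ipaddress.ip_address(ip) in network
    # Everything else gets in only with a recognized browser user agent.
    return any(tok in user_agent for tok in BROWSER_TOKENS)

print(allow_request("Googlebot/2.1", "66.249.66.1"))        # True
print(allow_request("Googlebot/2.1", "203.0.113.9"))        # False (spoofed)
print(allow_request("EvilScraper/1.0", "198.51.100.2"))     # False (not opted in)
```

The key property is that an unknown ‘bot requires no new blacklist entry: it simply never matches the allow rules.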

© 2006 CrawlWall.com Can OPT-IN Harm My Traffic? Blocking traffic is risky with either the OPT-IN or OPT-OUT method, so caution is always advised. Review traffic analysis reports to verify that all beneficial sources of traffic are still being allowed access. Google Analytics is an excellent tool that uses JavaScript to track traffic, thus eliminating most ‘bots from the reports.

© 2006 CrawlWall.com Detecting Stealth - Visitor vs. ‘Bot Challenge stealth with a CAPTCHA or something only a human can respond to once sufficient ‘bot-like criteria have been met. –Some ‘bots use cookies –Very few ‘bots execute JavaScript –‘Bots hardly ever examine CSS files –Rarely do ‘bots download images –Monitor the speed and duration of site access –Observe the quantity of page requests –Watch for access to robots.txt and other spider traps –Validate page requests for HTML/SGML errors –Verify that the User Agents are valid –Check for IPs coming from bad online neighborhoods, like web hosts whose networks contain only servers, not human visitors
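The criteria above naturally combine into a score, with a CAPTCHA challenge triggered past a threshold. This is a sketch only; the signals, weights and threshold are illustrative assumptions that a real profiler would tune against observed traffic:

```python
# Sketch of 'bot-likeness scoring over the stealth-detection criteria.
def bot_score(session):
    """Accumulate evidence that a session is a 'bot rather than a visitor."""
    score = 0
    if not session.get("ran_javascript"):             score += 2
    if not session.get("fetched_css"):                score += 1
    if not session.get("fetched_images"):             score += 1
    if session.get("requests_per_minute", 0) > 60:    score += 2
    if session.get("hit_robots_txt"):                 score += 3  # browsers never fetch it
    if session.get("hit_spider_trap"):                score += 5  # humans can't reach it
    return score

CHALLENGE_THRESHOLD = 5  # above this, present a CAPTCHA instead of content

human = {"ran_javascript": True, "fetched_css": True,
         "fetched_images": True, "requests_per_minute": 4}
stealth = {"ran_javascript": False, "fetched_css": False,
           "fetched_images": False, "requests_per_minute": 120,
           "hit_robots_txt": True}

print(bot_score(human))    # 0
print(bot_score(stealth))  # 9 -> challenge
```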

© 2006 CrawlWall.com Set Spider Traps Robots.txt is itself a spider trap, because stealth crawlers reading this file expose themselves while trying to avoid spider traps. Create a spider trap page with a hidden link in your web pages that is inaccessible via browser navigation. Disallow: /spidertrap.html Natural spider traps are files humans rarely read, like privacy and legal pages, which can be monitored for potential ‘bot traffic.
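The trap logic above can be sketched as a simple request classifier: a hit on the hidden trap page is hard evidence, while hits on natural traps merely flag the IP for watching. The file names are illustrative assumptions:

```python
# Sketch of spider-trap handling for incoming requests.
TRAP_PATHS = {"/spidertrap.html"}                     # hidden, humans can't navigate here
NATURAL_TRAPS = {"/robots.txt", "/privacy.html", "/legal.html"}  # humans rarely read these

trapped_ips = set()

def check_trap(ip, path):
    if path in TRAP_PATHS:
        trapped_ips.add(ip)   # hard evidence of a crawler; remember the IP
        return "block"
    if path in NATURAL_TRAPS:
        return "watch"        # suspicious but not conclusive; monitor this IP
    return "block" if ip in trapped_ips else "allow"

print(check_trap("203.0.113.9", "/spidertrap.html"))  # block
print(check_trap("203.0.113.9", "/index.html"))       # block (IP now known)
print(check_trap("198.51.100.2", "/index.html"))      # allow
```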

© 2006 CrawlWall.com Avoid Search Engine Pitfalls Don’t allow search engines to archive pages, as the search engine cache is also a scraping target. Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives per page. Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps. Search engine translation tools and other services are also used as proxies to scrape websites, so they should be dynamically monitored.
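Dynamically inserting per-page no-crawl directives might look like the sketch below, using the standard robots meta tag: allowed engines get "noarchive" so no scrapeable cache copy exists, and everything else gets a full noindex/nofollow. The client classification is assumed to come from an opt-in check like the one the OPT-IN slide describes:

```python
# Sketch of per-page robots meta directives chosen at render time.
def robots_meta_tag(client_class):
    """Return the robots meta tag to emit for this request's classification."""
    if client_class == "allowed_engine":
        # Indexing allowed, but no cached copy for scrapers to harvest.
        return '<meta name="robots" content="noarchive">'
    # Unauthorized robots are told not to index or follow anything.
    return '<meta name="robots" content="noindex, nofollow, noarchive">'

print(robots_meta_tag("allowed_engine"))
print(robots_meta_tag("unknown_bot"))
```

The same directives can be sent as an X-Robots-Tag HTTP header for non-HTML resources.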

© 2006 CrawlWall.com Ways to Protect Your Site Use a script to dynamically display robots.txt, showing proper information to allowed ‘bots while all others see “Disallow: /”. Use User Agent filtering and blocking with the rules structured as an OPT-IN ALLOW list, which is easier to maintain and more secure since everything else is blocked by default. Block entire IP ranges for web hosts that host or facilitate access for scraper sites, unwanted ‘bots or proxy servers, since humans don’t typically browse via dedicated servers anyway. For blocking large lists of IPs, such as proxy lists, use PHP and a database like MySQL to avoid firewall performance problems. Use scripts like Robert Plank’s AntiCrawl to stop and challenge most stealth crawlers that User Agent filters can’t control.
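The dynamic robots.txt idea can be sketched in a few lines: recognized ‘bots receive the real crawl rules (including the spider-trap disallow), while everyone else is told the whole site is off limits. The allowlist names are illustrative assumptions, not the author's actual configuration:

```python
# Sketch of serving a different robots.txt depending on who is asking.
ALLOWED_BOTS = {"Googlebot", "Slurp", "msnbot"}  # illustrative opt-in list

def robots_txt_for(user_agent):
    """Return the robots.txt body to serve for this request."""
    if any(bot in user_agent for bot in ALLOWED_BOTS):
        # Real rules for welcome crawlers (trap page stays disallowed).
        return "User-agent: *\nDisallow: /spidertrap.html\n"
    # Everyone else is told the entire site is off limits.
    return "User-agent: *\nDisallow: /\n"

print(robots_txt_for("Googlebot/2.1"))
print(robots_txt_for("EvilScraper/1.0"))
```

In production this function would sit behind a rewrite rule mapping /robots.txt to the script, ideally combined with the IP-range verification shown earlier so a spoofed name doesn't unlock the real rules.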

© 2006 CrawlWall.com Summary Tighten Site Access: OPT-IN spiders instead of building blacklists Set spider traps to snare stealth crawlers Profile stealth ‘bots and challenge them with scripts Eliminate 3rd-party scraping sources such as search engine archives and proxy servers Get Better Results: Tighter control of copyrighted content Improved search engine rankings after removing unwanted competition Better server performance for visitors and legitimate search engine crawls

© 2006 CrawlWall.com Thank You!