‘Bot Obedience: Taking Control of Your Site
Transitioning from free-for-all ‘bot abuse to tightly controlled site access
Bill Atchison
© 2006 CrawlWall.com

Rogue Spiders Go On Rampage
My website was under constant ‘bot attack
–Scraping was 10% or more of daily page views, not counting spiders from Google, Yahoo! and MSN
–Copyrighted material was stolen and scattered all over the web
–High-speed scrapers overloaded the server for extended periods of time, stopping visitors and major search engines from accessing the site
This was unacceptable and had to be stopped!

What are Bad ‘Bots?
Defining a good ‘bot vs. a bad ‘bot
The motives behind why bad ‘bots exist
Various types of ‘bots ranging from a mild nuisance to very bad and harmful
Stealth ‘bots vs. visible ‘bots
How scraper ‘bots utilize content

What Good ‘Bots Do
Obey Internet standards like robots.txt
Don’t crawl your server abusively fast
Return to get fresh content in a reasonable timeframe
Provide traffic in return for crawling your site

What Bad ‘Bots Do
Will go to any length to get your content
–Ignore Internet standards like robots.txt
–Spoof ‘bot names used by major search engines
–Change the User Agent randomly to avoid filters
–Masquerade as humans (stealth) to completely bypass filters
–Crawl as fast as possible to avoid being stopped
–Crawl as slow as possible to slide under the radar
–Crawl from as many IPs as possible to avoid detection
–Return often to get your new content and get indexed first
Violate your copyrights and repackage your site
Hijack your search engine positions
Provide no value in return for crawling

What Motivates Bad ‘Bots?
They want to get something for nothing!
–To build websites using your content
–To mine information using your content
–To get traffic using your content
–To make money using your content
Got the picture? You build it and parasites profit off your hard work.

Who Are All These ‘Bots?
Intelligence gathering Spybots
–Copyright Compliance
–Branding Compliance
–Corporate Security Monitoring
–Media Monitoring (mp3, mpeg, etc.)
–Myriad of Safe-Site Monitoring solutions
Content Scrapers (pure theft)
Data Aggregators
Link Checkers
Privacy Checkers
Web Copiers/Downloaders
Offline Web Browsers
Explosion of open-source crawlers like Nutch and Heritrix
And many more…

Stealth ‘Bots vs. Visible* ‘Bots
A sample of daily page requests made by unwelcome ‘bots shows that stealth activity, which can’t be blocked by User Agent filtering, exceeds the activity of easily identifiable ‘bots.
*Visible ‘bots exclude major search engines like Google, Yahoo! and MSN

How Scraper ‘Bots Use Your Content
The following examples show how content is used by scrapers and hijackers building websites that feed off your text and keywords to drive clicks to their customers.
See how these scrapers were fed crumbs of data that linked them back to the ‘bots that crawled the website.

Scrapers Scramble Your Content
Scraped pages are scrambled together to make new content and avoid duplicate content penalties.
The suspected scraper was fed their own ‘bot IP address for later identification.

Scrapers’ Methods Used Against Them
Scrapers can be fed their own information back to them in order to link the scraper ‘bot to the scraper website.
The suspected scraper was fed their own ‘bot User Agent, which shows this was a stealth crawler.

Scraper Site Linked to ‘Bot Origins
A quick check in the log file archives reveals:
–This scraper used a proxy on a dedicated server and only got a couple of error messages seeded with crumbs instead of content, as the proxy was already being blocked.
–Googlebot tried crawling through the proxy server, which can lead to hijacked pages in the search engine.

Cloaked Scrapers Hide Your Content
This is what the cloaked site shows search engines to get traffic; visitors to the site never see it.
Note the ‘bot IP address was again fed back to the suspected scraper.

Search Engine Scraping by Proxy
Here are a couple of examples from Google showing how proxy servers attempt to get traffic.
Proxy sites don’t have spiders, but they use the search engines as unwitting scrapers by cloaking links that entice Googlebot and others to crawl via their proxy.
If Googlebot weren’t being restricted by IP address, the actual site content would have been crawled and indexed, and the proxy hijacking would possibly appear near, or even above, my site listing.

Scrapers Damage Reputations
Scraper activity can directly damage the reputation of both you and your customers when content from your website appears in disreputable locations.
There can be backlash from customers who are unaware of the scraper situation and think you might somehow be responsible for these promotions on seedy websites.

How to Get ‘Bots Under Control
OPT-IN vs. OPT-OUT ‘bot blocking strategies
OPT-IN Traffic Analysis
Profiling and detecting Stealth ‘Bots vs. Visitors
Setting spider traps and using natural traps
Avoiding search engine pitfalls
Protecting your site

OPT-OUT ‘Bot Blocking Fails
Robots.txt only works for well-behaved ‘bots, as most bad ‘bots ignore robots.txt except when trying to avoid spider traps
User Agent blacklist filters fail because new bad ‘bots appear daily, periodically change their names or use random names to avoid being blocked
IP blocking in the firewall can create lists so large that the firewall processing degrades server performance

OPT-IN ‘Bot Paradigm Shift
Authorize good ‘bots only; no more blacklists, as everything else is blocked by default
Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites
Authorize browsers explicitly, such as Internet Explorer, Firefox, Opera and mobile devices
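The OPT-IN rules above could be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the names `ALLOWED_BOTS`, `ALLOWED_BROWSER_TOKENS` and `is_allowed` are hypothetical, `examplebot` is a made-up crawler, and the IP networks are reserved documentation ranges rather than real search engine ranges.

```python
import ipaddress

# Hypothetical allow list: crawler name -> networks it is permitted to come from.
# These are RFC 5737 documentation ranges, NOT real search engine IP ranges.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
}

# Illustrative substrings for explicitly authorized browsers.
ALLOWED_BROWSER_TOKENS = ("msie", "firefox", "opera")


def is_allowed(user_agent: str, remote_ip: str) -> bool:
    """Return True only for explicitly authorized 'bots and browsers."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)

    # A request claiming to be a known 'bot must also come from that 'bot's
    # IP range; this is what defeats User Agent spoofing and proxy crawling.
    for bot_name, networks in ALLOWED_BOTS.items():
        if bot_name in ua:
            return any(ip in network for network in networks)

    # Anything else must look like an authorized browser; all other
    # User Agents (including empty ones) are blocked by default.
    return any(token in ua for token in ALLOWED_BROWSER_TOKENS)
```

Browser tokens are easy to spoof, so a check like this only handles visible ‘bots; stealth ‘bots that masquerade as browsers still need the profiling described on the following slides.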

Can OPT-IN Harm My Traffic?
Blocking traffic is risky with either the OPT-IN or the OPT-OUT method, so caution is always advised.
Review traffic analysis reports to verify that all beneficial sources of traffic are being allowed access.
Google Analytics is an excellent tool that uses JavaScript to track traffic, thus eliminating most ‘bots from the reports.

Detecting Stealth - Visitor vs. ‘Bot
Challenge stealth with a captcha or something only a human can respond to when sufficient ‘bot-like criteria have been met.
–Some ‘bots use cookies
–Very few ‘bots execute JavaScript
–‘Bots hardly ever examine CSS files
–Rarely do ‘bots download images
–Monitor speed and duration of site access
–Observe the quantity of page requests
–Watch for access to robots.txt and other spider traps
–Validate page requests for HTML/SGML errors
–Verify if the User Agents are valid
–Check IPs coming from bad online neighborhoods like web hosts which only have servers
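One way to combine several of these signals is a simple score that triggers a challenge once enough ‘bot-like criteria are met. The sketch below is an assumption-laden illustration: the `SessionStats` fields, the weights, the 5-page minimum and `CHALLENGE_THRESHOLD` are all hypothetical values chosen for the example, not figures from the presentation.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Per-visitor counters gathered from the access log (hypothetical fields)."""
    page_requests: int = 0
    image_requests: int = 0
    css_requests: int = 0
    ran_javascript: bool = False
    fetched_robots_txt: bool = False
    seconds_active: float = 0.0


# Illustrative threshold; a real site would tune this against its own traffic.
CHALLENGE_THRESHOLD = 4


def bot_score(stats: SessionStats) -> int:
    """Add up 'bot-like signals; a higher score means more 'bot-like."""
    score = 0
    # Several pages viewed without ever downloading an image.
    if stats.page_requests >= 5 and stats.image_requests == 0:
        score += 1
    # Several pages viewed without ever fetching a CSS file.
    if stats.page_requests >= 5 and stats.css_requests == 0:
        score += 1
    # Very few 'bots execute JavaScript.
    if not stats.ran_javascript:
        score += 1
    # Browsers do not normally read robots.txt.
    if stats.fetched_robots_txt:
        score += 2
    # Sustained rate above one page per second.
    if stats.seconds_active > 0 and stats.page_requests / stats.seconds_active > 1.0:
        score += 2
    return score


def should_challenge(stats: SessionStats) -> bool:
    """Decide whether to show a captcha or similar human-only challenge."""
    return bot_score(stats) >= CHALLENGE_THRESHOLD
```

Scoring instead of blocking on any single signal keeps false positives down: a human with images disabled trips one criterion, while a fast crawler that reads robots.txt trips several.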

Set Spider Traps
Robots.txt is a spider trap because stealth crawlers reading this file expose themselves while trying to avoid spider traps.
Create a spider trap page with a hidden link in your web pages that is inaccessible via browser navigation, and disallow it in robots.txt:
    Disallow: /spidertrap.html
Natural spider traps are files humans rarely read, like privacy and legal pages, which can be monitored for potential ‘bot traffic.
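The server side of such a trap might look like the sketch below. It assumes the trap page lives at /spidertrap.html as on the slide; the `SpiderTrap` class, its in-memory set of blocked IPs and the `allow` method are hypothetical names for illustration only.

```python
class SpiderTrap:
    """Block any client that requests a page no human can reach (sketch)."""

    def __init__(self, trap_paths=("/spidertrap.html",)):
        self.trap_paths = set(trap_paths)
        # Kept in memory for the example; a real site would persist this.
        self.blocked_ips = set()

    def allow(self, remote_ip: str, path: str) -> bool:
        """Return True if the request may be served."""
        # Good 'bots obey the Disallow rule and humans cannot see the hidden
        # link, so any hit on a trap path marks the client as a bad 'bot.
        if path in self.trap_paths:
            self.blocked_ips.add(remote_ip)
        return remote_ip not in self.blocked_ips
```

A short usage example, with documentation-range IP addresses standing in for real clients:

```python
trap = SpiderTrap()
trap.allow("203.0.113.9", "/index.html")       # served
trap.allow("203.0.113.9", "/spidertrap.html")  # refused, IP now blocked
trap.allow("203.0.113.9", "/index.html")       # refused from here on
```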

Avoid Search Engine Pitfalls
Don’t allow search engines to archive pages, as the search engine cache is also a scraping target.
Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives per page.
Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps.
Search engine translation tools and other services are also used as proxies to scrape websites, so they should be dynamically monitored.
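The first two points can be illustrated with the robots meta tag, which major search engines honor for `noarchive`, `noindex` and `nofollow`. This is only a sketch: the function name `robots_meta_tag` is hypothetical, and how a site decides that a requester is an authorized ‘bot is assumed to happen elsewhere.

```python
def robots_meta_tag(is_authorized_bot: bool) -> str:
    """Build the per-page robots meta tag to insert into the HTML head."""
    if is_authorized_bot:
        # Authorized search engines may index the page but not cache it.
        directives = ["noarchive"]
    else:
        # Everyone else is told that crawling and indexing are forbidden.
        directives = ["noindex", "nofollow", "noarchive"]
    return '<meta name="robots" content="{}">'.format(", ".join(directives))
```

Bad ‘bots will ignore the tag, of course; its value is that an unauthorized crawler fetching pages through a proxy or translation service is told not to index or cache what it sees.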

Ways to Protect Your Site
Use a script to dynamically display robots.txt so that allowed ‘bots see the proper information and all others see “DISALLOW: /” (http://www.leekillough.com/robots.html)
Use User Agent filtering and blocking with the rules structured as an OPT-IN ALLOW list, which is easier to maintain and more secure because everything else is blocked by default.
Block entire IP ranges for web hosts that host or facilitate access for scraper sites, unwanted ‘bots or proxy servers, since humans don’t typically browse via dedicated servers anyway.
For blocking large lists of IPs, such as proxy lists, use PHP and a database like MySQL to avoid firewall performance problems.
Use scripts like Robert Plank’s AntiCrawl to stop and challenge most stealth crawlers that User Agent filters can’t control. (http://www.anticrawl.com)
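The database approach to large IP block lists can be sketched as below. Note the substitution: the slide recommends PHP with MySQL, while this example uses Python's built-in sqlite3 module purely so that it is self-contained. The table layout, the function names `open_blocklist` and `is_blocked`, and the sample range are assumptions for illustration, and only IPv4 is handled.

```python
import ipaddress
import sqlite3


def open_blocklist(cidr_ranges):
    """Create an in-memory table of blocked IPv4 ranges stored as integers."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blocked (first_ip INTEGER, last_ip INTEGER)")
    db.execute("CREATE INDEX idx_blocked ON blocked (first_ip, last_ip)")
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr)
        db.execute(
            "INSERT INTO blocked VALUES (?, ?)",
            (int(network.network_address), int(network.broadcast_address)),
        )
    db.commit()
    return db


def is_blocked(db, remote_ip):
    """Return True if remote_ip falls inside any blocked range."""
    ip = int(ipaddress.ip_address(remote_ip))
    row = db.execute(
        "SELECT 1 FROM blocked WHERE first_ip <= ? AND last_ip >= ? LIMIT 1",
        (ip, ip),
    ).fetchone()
    return row is not None
```

Storing each range as a pair of integers means one indexed query per request regardless of how many ranges are listed, which is the point of moving the list out of the firewall.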

Summary
Tighten Site Access:
–OPT-IN spiders instead of building blacklists
–Set spider traps to snare stealth crawlers
–Stealth ‘bot profiling and challenge scripts
–Eliminate 3rd-party scraping sources such as search engine archives and proxy servers
Get Better Results:
–Tighter controls on copyrighted content
–Improve search engine ranking after removing unwanted competition
–Better server performance for visitors and legit search engine crawls
