Published by Zaire Brixey
Modified about 1 year ago

© 2006 CrawlWall.com
‘Bot Obedience: Taking Control of Your Site
Transitioning from free-for-all ‘bot abuse to tightly controlled site access
Bill Atchison, CrawlWall.com
“The Bot Stops Here!”

Rogue Spiders Go On Rampage
My website was under constant ‘bot attack
  – Scraping was 10% or more of daily page views, not counting spiders from Google, Yahoo! and MSN
  – Copyrighted material was stolen and scattered all over the web
  – High-speed scrapers overloaded the server for extended periods of time, stopping visitors and major search engines from accessing the site
This was unacceptable and had to be stopped!

What are Bad ‘Bots?
Defining a good ‘bot vs. a bad ‘bot
The motives behind why bad ‘bots exist
Various types of ‘bots, ranging from a mild nuisance to very bad and harmful
Stealth ‘bots vs. visible ‘bots
How scraper ‘bots utilize content

What Good ‘Bots Do
Obey Internet standards like robots.txt
Don’t crawl your server abusively fast
Return to get fresh content in a reasonable timeframe
Provide traffic in return for crawling your site

What Bad ‘Bots Do
Will go to any length to get your content
  – Ignore Internet standards like robots.txt
  – Spoof ‘bot names used by major search engines
  – Change the User Agent randomly to avoid filters
  – Masquerade as humans (stealth) to completely bypass filters
  – Crawl as fast as possible to avoid being stopped
  – Crawl as slow as possible to slide under the radar
  – Crawl from as many IPs as possible to avoid detection
  – Return often to get your new content and get indexed first
Violate your copyrights and repackage your site
Hijack your search engine positions
Provide no value in return for crawling

What Motivates Bad ‘Bots?
They want to get something for nothing!
  – To build websites using your content
  – To mine information using your content
  – To get traffic using your content
  – To make money using your content
Got the picture? You build it and parasites profit off your hard work.

Who Are All These ‘Bots?
Intelligence-gathering spybots
  – Copyright Compliance
  – Branding Compliance
  – Corporate Security Monitoring
  – Media Monitoring (mp3, mpeg, etc.)
  – Myriad of Safe-Site Monitoring solutions
Content Scrapers (pure theft)
Data Aggregators
Link Checkers
Privacy Checkers
Web Copiers/Downloaders
Offline Web Browsers
Explosion of open-source crawlers such as Nutch and Heritrix
And many more…

Stealth ‘Bots vs. Visible* ‘Bots
*Visible ‘bots excluding major search engines like Google, Yahoo! and MSN
A sample of daily page requests made by unwelcome ‘bots shows that stealth activity, which can’t be blocked by user agent filtering, exceeds the activity of easily identifiable ‘bots.

The Wild Wild Web

How Scraper ‘Bots Use Your Content
The following examples show how content is used by scrapers and hijackers who build websites that feed off your text and keywords to drive clicks to their customers.
See how these scrapers were fed crumbs of data that linked their websites back to the ‘bots that crawled the site.

Scrapers Use Your Keywords
This web site is not about CrawlWall, but they’re using my site name and scraped content in an attempt to get traffic to click their links.

Scrapers Scramble Your Content
The suspected scraper was fed their own ‘bot IP address for later identification.
Scraped pages are scrambled together to make new content and avoid duplicate content penalties.
© 2006 CrawlWall.com Scrapers’ Methods Used Against Them The suspected scraper was fed their own ‘bot User Agent, which shows this was a stealth crawler Scrapers can be fed their own information back to them in order to link the scraper ‘bot to the scraper website.

Scraper Site Linked to ‘Bot Origins
A quick check in the log file archives reveals:
This scraper used a proxy on a dedicated server and only got a couple of error messages seeded with crumbs instead of content, as the proxy was already being blocked.
Note that Googlebot tried crawling through the proxy server, which can lead to hijacked pages in the search engine.

Cloaked Scrapers Hide Your Content
Note that the ‘bot IP address was again fed back to the suspected scraper.
This is what the cloaked site shows search engines to get traffic; it is never seen by visitors to their site.

Cloaked Scrapers Show Links That Pay
Totally unrelated to the scraped content that brings traffic, the cloaked scraper shows visitors this page to earn money.

Search Engine Scraping by Proxy
Here are a couple of examples from Google showing how proxy servers attempt to get traffic.
Proxy sites don’t have spiders, but they use the search engines as unwitting scrapers by cloaking links that entice Googlebot and others to crawl via their proxy.
If Googlebot weren’t being restricted by IP address, the actual site content would have been crawled and indexed, and the proxy hijacking would possibly appear near, or even above, my site listing.

Scrapers Damage Reputations
Scraper activity can directly damage the reputation of both you and your customers when content from your website appears in disreputable locations.
There can be backlash from customers who are unaware of the scraper situation and think you might somehow be responsible for these promotions on seedy websites.

Stopping ‘bots doesn’t take a genius

How to Get ‘Bots Under Control
OPT-IN vs. OPT-OUT ‘bot blocking strategies
OPT-IN Traffic Analysis
Profiling and detecting Stealth ‘Bots vs. Visitors
Setting spider traps and using natural traps
Avoiding search engine pitfalls
Protecting your site

OPT-OUT ‘Bot Blocking Fails
Robots.txt only works for well-behaved ‘bots, as most bad ‘bots ignore robots.txt except when trying to avoid spider traps
User Agent blacklist filters fail because new bad ‘bots appear daily, and existing ones periodically change their names or use random names to avoid being blocked
IP blocking in the firewall can create lists so large that the firewall processing degrades server performance

OPT-IN ‘Bot Paradigm Shift
Authorize good ‘bots only; no more blacklists, as everything else is blocked by default
Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites
Authorize browsers explicitly, such as Internet Explorer, Firefox, Opera and mobile devices
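A minimal sketch of the OPT-IN idea, assuming a hypothetical crawler named `examplebot` and documentation-only IP ranges. The allowlist contents and the function name are illustrative assumptions, not CrawlWall's rules; real search engine ranges would have to be looked up and maintained.

```python
import ipaddress

# Hypothetical allowlist: crawler name token -> IP networks it may crawl from.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
}

# Browser tokens that are explicitly authorized (illustrative, not exhaustive).
ALLOWED_BROWSER_TOKENS = ("msie", "firefox", "opera")


def is_allowed(user_agent: str, remote_ip: str) -> bool:
    """Allow only listed bots (from their own IP ranges) and listed browsers."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)
    for bot_token, networks in ALLOWED_BOTS.items():
        if bot_token in ua:
            # A request claiming to be an authorized bot must also come from
            # that bot's IP range; otherwise it is treated as a spoof.
            return any(ip in net for net in networks)
    # Everything that is not an explicitly authorized browser is blocked.
    return any(token in ua for token in ALLOWED_BROWSER_TOKENS)
```

A filter like this does not stop stealth ‘bots that copy a browser User Agent; those still need traps and profiling.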

Set Spider Traps
Robots.txt is a spider trap because stealth crawlers reading this file expose themselves while trying to avoid spider traps.
Create a spider trap page with a hidden link in your web pages that is inaccessible via browser navigation.
  Disallow: /spidertrap.html
Natural spider traps are files humans rarely read, like privacy and legal pages, which can be monitored for potential ‘bot traffic.
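One way the trap logic might look, as a sketch only. The in-memory `flagged_ips` set, the `claims_browser` flag, and the rule that a self-declared browser fetching robots.txt is suspicious are assumptions for illustration; a real site would persist the flags and tune the rule, since a person can occasionally open robots.txt by hand.

```python
# Path of the hidden trap page, which robots.txt disallows.
TRAP_PATHS = {"/spidertrap.html"}

# IP addresses that have exposed themselves (kept in memory for this sketch).
flagged_ips = set()


def record_request(remote_ip: str, path: str, claims_browser: bool) -> bool:
    """Record the request and return True if it should still be served."""
    if path in TRAP_PATHS:
        # No visible link leads here, so only a crawler ignoring
        # robots.txt would request it.
        flagged_ips.add(remote_ip)
    elif path == "/robots.txt" and claims_browser:
        # A client that says it is a browser but reads robots.txt is
        # likely a stealth crawler (assumed heuristic).
        flagged_ips.add(remote_ip)
    return remote_ip not in flagged_ips
```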

Avoid Search Engine Pitfalls
Don’t allow search engines to archive pages, as the search engine cache is also a scraping target.
Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives per page.
Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps.
Search engine translation tools and other services are also used as a proxy to scrape websites, so they should be dynamically monitored.
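The per-page directives could be generated along these lines. This is a sketch under the assumption that the site already knows whether the requester is an authorized crawler; the function name is hypothetical. The `noarchive`, `noindex` and `nofollow` values are standard robots meta directives.

```python
def robots_meta(authorized_crawler: bool) -> str:
    """Return the robots meta tag to insert into the page head."""
    if authorized_crawler:
        # Authorized search engines may index the page but not cache it.
        return '<meta name="robots" content="index, follow, noarchive">'
    # Everyone else is told that crawling and indexing are forbidden.
    return '<meta name="robots" content="noindex, nofollow, noarchive">'
```

Only well-behaved robots honor these directives, so they complement blocking instead of replacing it.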

Ways to Protect Your Site
Use a script to display robots.txt dynamically, so that allowed ‘bots see the proper information and all others see “DISALLOW: /” (http://www.leekillough.com/robots.html)
Use User Agent filtering and blocking with the rules structured as an OPT-IN ALLOW list, which is easier to maintain and more secure because everything else is blocked by default.
Block entire IP ranges for web hosts that host or facilitate access for scraper sites, unwanted ‘bots or proxy servers, since humans don’t typically browse via dedicated servers anyway.
For blocking large lists of IPs, such as proxy lists, use PHP and a database like MySQL to avoid firewall performance problems.
Use scripts like Robert Plank’s AntiCrawl to stop and challenge most stealth crawlers that User Agent filters can’t control. (http://www.anticrawl.com)
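Two of these measures can be sketched together: a dynamic robots.txt and a database-backed IP range lookup. The slide recommends PHP and MySQL; this sketch swaps in Python with the standard-library `sqlite3` module purely for illustration. The table layout, function names, and robots.txt rules are assumptions, and the lookup handles IPv4 only.

```python
import ipaddress
import sqlite3

# Hypothetical rules shown to authorized 'bots.
ALLOWED_ROBOTS_TXT = "User-agent: *\nDisallow: /spidertrap.html\n"
# Everyone else is told to stay out entirely.
DENY_ALL_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def robots_txt_for(is_authorized_bot: bool) -> str:
    """Return the robots.txt body for this requester."""
    return ALLOWED_ROBOTS_TXT if is_authorized_bot else DENY_ALL_ROBOTS_TXT


def open_blocklist(cidr_ranges):
    """Load blocked IPv4 ranges into a database table as integer bounds."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE blocked (start_ip INTEGER, end_ip INTEGER)")
    for cidr in cidr_ranges:
        net = ipaddress.ip_network(cidr)
        db.execute(
            "INSERT INTO blocked VALUES (?, ?)",
            (int(net.network_address), int(net.broadcast_address)),
        )
    db.commit()
    return db


def is_blocked(db, remote_ip: str) -> bool:
    """Check whether the address falls inside any blocked range."""
    ip_as_int = int(ipaddress.ip_address(remote_ip))
    row = db.execute(
        "SELECT 1 FROM blocked WHERE ? BETWEEN start_ip AND end_ip LIMIT 1",
        (ip_as_int,),
    ).fetchone()
    return row is not None
```

Storing each range as a pair of integers lets one query test an address against a very large list, which is the point of moving the list out of the firewall.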

Summary
Tighten Site Access:
  – OPT-IN spiders instead of building blacklists
  – Set spider traps to snare stealth crawlers
  – Stealth ‘bot profiling and challenge scripts
  – Eliminate 3rd-party scraping sources such as search engine archives and proxy servers
Get Better Results:
  – Tighter controls on copyrighted content
  – Improve search engine ranking after removing unwanted competition
  – Better server performance for visitors and legit search engine crawls

Thank You!