© 2006 CrawlWall.com

‘Bot Obedience: Taking Control of Your Site
Transitioning from free-for-all ‘bot abuse to tightly controlled site access
Bill Atchison, CrawlWall.com
“The Bot Stops Here!”
Rogue Spiders Go On Rampage
My website was under constant ‘bot attack:
– Scraping accounted for 10% or more of daily page views, not counting spiders from Google, Yahoo! and MSN
– Copyrighted material was stolen and scattered all over the web
– High-speed scrapers overloaded the server for extended periods, stopping visitors and major search engines from accessing the site
This was unacceptable and had to be stopped!
What Are Bad ‘Bots?
– Defining a good ‘bot vs. a bad ‘bot
– The motives behind why bad ‘bots exist
– Various types of ‘bots, ranging from mild nuisance to very bad and harmful
– Stealth ‘bots vs. visible ‘bots
– How scraper ‘bots utilize content
What Good ‘Bots Do
– Obey Internet standards like robots.txt
– Don’t crawl your server abusively fast
– Return to get fresh content in a reasonable timeframe
– Provide traffic in return for crawling your site
What Bad ‘Bots Do
They will go to any length to get your content:
– Ignore Internet standards like robots.txt
– Spoof ‘bot names used by major search engines
– Change the User-Agent randomly to avoid filters
– Masquerade as humans (stealth) to completely bypass filters
– Crawl as fast as possible to avoid being stopped
– Crawl as slowly as possible to slide under the radar
– Crawl from as many IPs as possible to avoid detection
– Return often to get your new content and get it indexed first
They also:
– Violate your copyrights and repackage your site
– Hijack your search engine positions
– Provide no value in return for crawling
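A spoofed crawler name can be exposed with a reverse-then-forward DNS check: resolve the visiting IP to a hostname, confirm the hostname belongs to the claimed search engine, then resolve that hostname back and confirm it matches the original IP. A minimal Python sketch (not from the original deck; the function name and injectable resolvers are assumptions for illustration):

```python
import socket

def is_genuine_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname):
    """Verify a visitor claiming to be Googlebot.

    Reverse-resolve the IP, check the hostname is under googlebot.com
    or google.com, then forward-resolve that hostname and confirm it
    maps back to the same IP. A spoofer controls neither lookup.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward_lookup(hostname) == ip
    except OSError:
        return False
```

The resolver arguments exist only so the check can be exercised without live DNS; in production the `socket` defaults do the real lookups.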
What Motivates Bad ‘Bots?
They want to get something for nothing:
– To build websites using your content
– To mine information using your content
– To get traffic using your content
– To make money using your content
Got the picture? You build it, and parasites profit from your hard work.
Who Are All These ‘Bots?
– Intelligence-gathering spybots: copyright compliance, branding compliance, corporate security monitoring, media monitoring (mp3, mpeg, etc.), and a myriad of safe-site monitoring solutions
– Content scrapers (pure theft)
– Data aggregators
– Link checkers
– Privacy checkers
– Web copiers/downloaders
– Offline web browsers
– An explosion of open-source crawlers such as Nutch and Heritrix
– And many more…
Stealth ‘Bots vs. Visible ‘Bots*
*Visible ‘bots excluding major search engines like Google, Yahoo! and MSN
A sample of daily page requests made by unwelcome ‘bots shows that stealth activity, which can’t be blocked by User-Agent filtering, exceeds that of easily identifiable ‘bots.
The Wild Wild Web
How Scraper ‘Bots Use Your Content
The following examples show how content is used by scrapers and hijackers building websites that feed off your text and keywords to drive clicks to their customers. See how these scrapers were fed crumbs of data that linked them back to the ‘bots that crawled the website.
Scrapers Use Your Keywords
This website is not about CrawlWall, but they’re using my site name and scraped content in an attempt to get traffic to click their links.
Scrapers Scramble Your Content
Scraped pages are scrambled together to make new content and avoid duplicate-content penalties. The suspected scraper was fed their own ‘bot IP address for later identification.
Scrapers’ Methods Used Against Them
Scrapers can be fed their own information back to them in order to link the scraper ‘bot to the scraper website. Here the suspected scraper was fed their own ‘bot User-Agent, which shows this was a stealth crawler.
Scraper Site Linked to ‘Bot Origins
A quick check of the log file archives reveals: this scraper used a proxy on a dedicated server and got only a couple of error messages seeded with crumbs instead of content, as the proxy was already being blocked. Note that Googlebot tried crawling through the proxy server, which can lead to hijacked pages in the search engine.
Cloaked Scrapers Hide Your Content
This is what the cloaked site shows search engines to get traffic; it is never seen by visitors to their site. Note that the ‘bot IP address was again fed back to the suspected scraper.
Cloaked Scrapers Show Links That Pay
Totally unrelated to the scraped content that brings traffic, the cloaked scraper shows visitors this page to earn money.
Search Engine Scraping by Proxy
Here are a couple of examples from Google showing how proxy servers attempt to get traffic. Proxy sites don’t have spiders; they use the search engines as unwitting scrapers by cloaking links that entice Googlebot and others to crawl via their proxy. If Googlebot weren’t restricted by IP address, the actual site content would have been crawled and indexed, and the proxy hijacking could appear near, or even above, my site listing.
Scrapers Damage Reputations
Scraper activity can directly damage the reputation of both you and your customers when content from your website appears in disreputable locations. There can be backlash from customers who are unaware of the scraper situation and think you might somehow be responsible for these promotions on seedy websites.
Stopping ‘bots doesn’t take a genius
How to Get ‘Bots Under Control
– OPT-IN vs. OPT-OUT ‘bot blocking strategies
– OPT-IN traffic analysis
– Profiling and detecting stealth ‘bots vs. visitors
– Setting spider traps and using natural traps
– Avoiding search engine pitfalls
– Protecting your site
OPT-OUT ‘Bot Blocking Fails
– Robots.txt only works for well-behaved ‘bots; most bad ‘bots ignore robots.txt except when trying to avoid spider traps
– User-Agent blacklist filters fail because new bad ‘bots appear daily, periodically change their names, or use random names to avoid being blocked
– IP blocking in the firewall can create lists so large that firewall processing degrades server performance
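The blacklist failure mode is easy to demonstrate: a filter can only match names it already knows, so a scraper that randomizes its User-Agent string slips past every time. A toy Python illustration (the blacklist entries and UA format are invented for the example):

```python
import random

# A blacklist only matches names it already knows.
BLACKLIST = {"BadBot/1.0", "EvilScraper/2.2", "SiteSucker/3.0"}

def blacklist_blocks(user_agent):
    return user_agent in BLACKLIST

def random_user_agent():
    # Scrapers often mimic a browser string with a randomized version
    # number, so no two visits present the same name.
    return "Mozilla/5.0 (compatible; Crawler/%d.%d)" % (
        random.randint(1, 9), random.randint(0, 99))

# Every randomized UA sails straight through the blacklist filter.
evaded = sum(not blacklist_blocks(random_user_agent()) for _ in range(100))
```

The known names are blocked, but all 100 randomized requests get through, which is why the deck argues for an opt-in allow list instead.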
OPT-IN ‘Bot Paradigm Shift
– Authorize good ‘bots only; no more blacklists, as everything else is blocked by default
– Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites
– Authorize browsers explicitly, such as Internet Explorer, Firefox, Opera and mobile devices
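The three rules above can be sketched as one opt-in gate (a minimal Python illustration; the UA prefixes and the IP range shown are assumptions for the example, not an authoritative list of published crawler ranges):

```python
import ipaddress

# OPT-IN policy: only explicitly authorized clients get through.
ALLOWED_BROWSER_PREFIXES = ("Mozilla/", "Opera/")
SEARCH_ENGINE_RANGES = {
    "Googlebot": [ipaddress.ip_network("66.249.64.0/19")],  # illustrative
}

def opt_in_allows(user_agent, ip):
    addr = ipaddress.ip_address(ip)
    # A claimed search-engine crawler must come from its known IP
    # range, which defeats name spoofing and proxy hijacking.
    for bot, networks in SEARCH_ENGINE_RANGES.items():
        if bot in user_agent:
            return any(addr in net for net in networks)
    # Ordinary browsers are authorized by explicit UA prefix.
    if user_agent.startswith(ALLOWED_BROWSER_PREFIXES):
        return True
    # Everything else is blocked by default -- no blacklist needed.
    return False
```

Note the default at the bottom: an unrecognized ‘bot needs no entry anywhere to be blocked, which is the whole point of the paradigm shift.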
Set Spider Traps
– Robots.txt is itself a spider trap, because stealth crawlers reading the file expose themselves while trying to avoid spider traps
– Create a spider trap page with a hidden link in your web pages that is inaccessible via browser navigation: Disallow: /spidertrap.html
– Natural spider traps are files humans rarely read, like privacy and legal pages, which can be monitored for potential ‘bot traffic
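The hidden-link trap reduces to a few lines of request handling: no human ever reaches the trap page, so any IP that requests it is a ‘bot and gets banned from then on. A minimal Python sketch (the handler shape and status-code pairs are assumptions for illustration):

```python
# /spidertrap.html is linked only from an invisible anchor and is
# disallowed in robots.txt, so no human browser and no obedient
# crawler should ever request it. Any IP that does is a 'bot.
TRAP_PATH = "/spidertrap.html"
banned_ips = set()

def handle_request(path, ip):
    """Return an (HTTP status, body) pair for a request."""
    if path == TRAP_PATH:
        banned_ips.add(ip)        # springing the trap bans the IP
    if ip in banned_ips:
        return 403, "Forbidden"   # banned IPs get nothing, ever
    return 200, "page content"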
Avoid Search Engine Pitfalls
– Don’t allow search engines to archive pages, as the search engine cache is also a scraping target
– Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives per page
– Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps
– Search engine translation tools and other services are also used as proxies to scrape websites, so they should be dynamically monitored
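Dynamically inserting per-page directives can be as simple as choosing the robots meta tag based on who is asking: authorized engines may index but not cache, everyone else is told to stay out entirely. A small Python sketch (the function name and policy split are assumptions for illustration):

```python
def robots_meta_tag(is_authorized_bot):
    """Build a per-page robots meta tag.

    Authorized search engines may index the page but not archive a
    cached copy (the cache is itself a scraping target); everyone
    else is told not to index or follow at all.
    """
    if is_authorized_bot:
        content = "noarchive"
    else:
        content = "noindex, nofollow, noarchive"
    return '<meta name="robots" content="%s">' % content
```

A page template would call this once per request and drop the result into the document head.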
Ways to Protect Your Site
– Use a script to dynamically display robots.txt: show the proper information to allowed ‘bots while all others see “Disallow: /” (http://www.leekillough.com/robots.html)
– Structure User-Agent filtering and blocking rules as an OPT-IN allow list, which is easier to maintain and more secure since everything else is blocked by default
– Block entire IP ranges of web hosts that host or facilitate access for scraper sites, unwanted ‘bots or proxy servers; humans don’t typically browse from dedicated servers anyway
– For blocking large lists of IPs, such as proxy lists, use PHP and a database like MySQL to avoid firewall performance problems
– Use scripts like Robert Plank’s AntiCrawl to stop and challenge most stealth crawlers that User-Agent filters can’t control (http://www.anticrawl.com)
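The first bullet, a dynamic robots.txt, can be sketched in a few lines: allowed crawlers receive the real crawl rules, including the spider-trap disallow, while everyone else is told the whole site is off limits. A Python illustration (the allow list names are examples, not a complete list; the deck's linked Lee Killough script does the same in another language):

```python
# Crawler names granted the real robots.txt; everything else sees
# a full disallow. Names here are illustrative examples.
ALLOWED_BOTS = ("Googlebot", "Slurp", "msnbot")

def robots_txt(user_agent):
    """Serve a robots.txt body tailored to the requesting crawler."""
    if any(bot in user_agent for bot in ALLOWED_BOTS):
        # Real rules, including the spider-trap disallow line.
        return "User-agent: *\nDisallow: /spidertrap.html\n"
    # Everyone else: the entire site is off limits.
    return "User-agent: *\nDisallow: /\n"
```

A stealth crawler that fetches this file and then obeys the full disallow never crawls; one that ignores it walks into the traps.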
Summary
Tighten site access:
– OPT-IN spiders instead of building blacklists
– Set spider traps to snare stealth crawlers
– Profile stealth ‘bots and use challenge scripts
– Eliminate 3rd-party scraping sources such as search engine archives and proxy servers
Get better results:
– Tighter control over copyrighted content
– Improved search engine rankings after removing unwanted competition
– Better server performance for visitors and legitimate search engine crawls
Thank You!