Published by Frederick York. Modified over 9 years ago.
1
Wasim Rangoonwala, ID# 00506259, CS-460 Computer Security
“Privacy is the claim of individuals, groups or institutions to determine for themselves when, how, and to what extent information about them is communicated to others” - Alan Westin, Privacy and Freedom, 1967
3
What are WWW Robots?
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all documents that it references. Web robots are also called Web Wanderers, Web Crawlers, Spiders, or Bots.
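The traversal described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web, `fetch_links` is a made-up helper that returns the links found in a page, and the loop uses a queue (breadth-first) rather than literal recursion so that pages linking back to each other are visited only once.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
PAGES = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/news/story1.html", "/about.html"],
    "/news/story1.html": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return PAGES.get(url, [])


def crawl(start, get_links):
    """Visit every page reachable from `start`, each exactly once."""
    seen = {start}          # URLs already scheduled, to avoid loops
    queue = deque([start])  # URLs waiting to be retrieved
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/index.html", fetch_links)
print(visited)
```

A real robot would replace `fetch_links` with an HTTP download plus HTML link extraction; the `seen` set is what keeps the traversal from looping forever.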
4
Web Spiders / Robots Collecting Data
5
Controlling how search engines access and index your website
Google refers to its spiders as Googlebot and Googlebot-Image. Google has a set of computers that continually crawl the web; together these machines are known as Googlebot. In general you want Googlebot to access your site so that people searching on Google can find your web pages.
6
Controlling how search engines access and index your website
One key question is: how does Google know which parts of a website the site owner wants to show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages appear in search results and which pages are kept private.
Answer: the robots.txt file
7
Controlling how search engines access and index your website
Making Use of the Robots.txt File
1. Robots.txt has been an industry standard for many years. It lets a site owner control how search engines access their web site.
2. The robots.txt file lists the pages and directories that well-behaved search engine crawlers are asked not to access.
3. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of the site.
8
Controlling how search engines access and index your website
Making Use of the Robots.txt File (continued)
Examples of pages you may want to keep private from search engines:
1. A directory that contains internal logs.
2. News articles that require payment to access.
3. The administration area of the website, including pages that expose database configuration strings, stored passwords, or credit card details.
4. Images that you want to keep private.
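As a rough sketch of how such a file could be produced, the helper below turns a mapping of crawler names to path prefixes into robots.txt text. The function name `build_robots_txt` and the example paths (`/logs/`, `/admin/`, `/private-images/`) are assumptions made for illustration; they are not taken from any real site.

```python
def build_robots_txt(rules):
    """Build robots.txt text.

    `rules` maps a User-agent name to the list of path prefixes
    that this crawler is asked not to fetch.
    """
    groups = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        groups.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


# Hypothetical private areas, addressed to every crawler ("*").
robots_txt = build_robots_txt({"*": ["/logs/", "/admin/", "/private-images/"]})
print(robots_txt)
```

The resulting text would be saved as `robots.txt` in the site's root directory. Keep in mind that the file is publicly readable and is only a request to cooperating crawlers, so truly sensitive pages still need real access control such as a login.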
9
Achieving Privacy through the Robots.txt File
Example of a Robots.txt File
# robots.txt file
# Currently disallow all images to the Google Image bot
User-agent: Googlebot-Image
Disallow: /

# Keep Googlebot out of the admin and account pages (put at end of file)
User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /address_book.html
Disallow: /checkout_payment.html
Disallow: /cookie_usage.html
Disallow: /login.html
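To see how a crawler might interpret rules like these, the sketch below feeds a shortened copy of the example to `urllib.robotparser` from the Python standard library. The host `www.example.com`, the tested paths, and the `allowed` helper are assumptions for illustration only, and a production crawler such as Googlebot may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the example rules (assumed content, for illustration).
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow: /admin/
Disallow: /account_password.html
Disallow: /login.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access


def allowed(agent, path):
    """Return True if `agent` may fetch `path` on the hypothetical host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("Googlebot-Image", "/photos/cat.jpg"))   # image bot is blocked everywhere
print(allowed("Googlebot", "/admin/settings.html"))    # under /admin/, so blocked
print(allowed("Googlebot", "/index.html"))             # no matching rule, so allowed
```

Each `Disallow` value is treated as a path prefix, which is why `/admin/` also covers every page beneath that directory.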
10
Privacy through the Robots META Tag
You can use a special HTML META tag to tell robots not to index the content of a page, and/or not to scan it for links to follow.
Example of the tag: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
The "NAME" attribute must be "ROBOTS". Valid values for the "CONTENT" attribute are "INDEX", "NOINDEX", "FOLLOW", and "NOFOLLOW". Multiple comma-separated values are allowed, but only some combinations make sense. If there is no robots tag, the default is "INDEX,FOLLOW", so there is no need to spell that out.
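The NAME and CONTENT rules described above can be sketched from the robot's side using `html.parser` from the Python standard library. The class `RobotsMetaParser` and the function `robots_policy` are hypothetical names introduced for this example; the logic simply applies the stated default of "INDEX,FOLLOW" when no robots tag is present.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the CONTENT values of a <meta name="robots"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for value in content.split(","):
                if value.strip():
                    self.directives.add(value.strip().upper())


def robots_policy(html):
    """Return whether a robot may index the page and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "NOINDEX" not in parser.directives,
        "follow": "NOFOLLOW" not in parser.directives,
    }


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(robots_policy(page))
print(robots_policy("<html><head><title>No robots tag</title></head></html>"))
```

Because the absence of "NOINDEX" or "NOFOLLOW" is treated as permission, a page without the tag falls back to the default behaviour automatically.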
11
Search Engine Web Spider Names
Yahoo! Search: Yahoo Slurp
AltaVista: Scooter
Ask Jeeves: Ask Jeeves/Teoma
MSN Search: MSNbot
For more details on search engine web spider names, visit http://www.robotstxt.org/db.html
12
Bonus
13
Google: Anatomy
Google Crawlers (Googlebot)
Multiple distributed crawlers
Each keeps its own DNS cache
300 connections open at once
Fetched pages are sent to the Store Server
Originally written in Python
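Why a crawler would keep its own DNS cache can be illustrated with a minimal sketch: the same host names recur constantly during a crawl, so remembering earlier answers avoids repeated lookups. Everything here is an assumption for illustration: `CachingResolver` and `fake_lookup` are hypothetical names, the addresses come from a documentation-only range, and the sketch ignores record expiry. It is not a description of Google's actual implementation.

```python
class CachingResolver:
    """Remember host-to-address answers so each host is looked up once."""

    def __init__(self, lookup):
        self._lookup = lookup  # function that performs the real lookup
        self._cache = {}
        self.misses = 0        # how many times the real lookup was needed

    def resolve(self, host):
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._lookup(host)
        return self._cache[host]


def fake_lookup(host):
    """Stand-in for a real DNS query (for example socket.gethostbyname)."""
    table = {"www.example.com": "192.0.2.10", "news.example.com": "192.0.2.20"}
    return table[host]


resolver = CachingResolver(fake_lookup)
hosts = ["www.example.com", "www.example.com", "news.example.com", "www.example.com"]
addresses = [resolver.resolve(host) for host in hosts]
print(addresses)
print(resolver.misses)
```

Four requests are served with only two real lookups; with hundreds of connections open at once, that saving is what makes a local cache worthwhile.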
14
Google: Technology
PageRank™ Algorithm
Hypertext-Matching Analysis
15
Google Webmaster Central
Webmaster Central offers services that let you:
see which parts of a site Googlebot had problems crawling
upload an XML Sitemap file
analyze and generate robots.txt files
remove URLs already crawled by Googlebot
specify the preferred domain
identify issues with title and description meta tags
understand the top searches used to reach a site
get a glimpse of how Googlebot sees pages
remove unwanted sitelinks that Google may use in results
16
Protect your privacy on the Web
When surfing the internet, avoid “free” offers and protect your information!
Chatting: guard your information unless you are 100% sure who you are chatting with.
Cookies aren’t just for eating; they may be sending your personal information to others.
Protect your passwords like you would your wallet or car keys. Make them complicated!
E-mail is not secure and should never be thought of as private.
Don’t even open spam; download a spam buster!
Beware of phishing: fake e-mails sent to try to obtain your personal and financial information.
18
http://www.google.com/support/webmasters/bin/answer.py?answer=80553
http://www.google.com/bot.html
http://www.googleguide.com
http://www.searchengineposition.com
http://www.google-watch.org
http://www.robotstxt.org/db.html
http://www.googleblog.blogspot.com
For more details visit http://techwasim.blogspot.com