Presentation on theme: "Web Robots, Crawlers, & Spiders Webmaster- Fort Collins, CO Copyright © XTR Systems, LLC Introduction to Web Robots, Crawlers & Spiders Instructor: Joseph."— Presentation transcript:

1 Web Robots, Crawlers, & Spiders
Webmaster – Fort Collins, CO
Copyright © XTR Systems, LLC
Introduction to Web Robots, Crawlers & Spiders
Instructor: Joseph DiVerdi, Ph.D., MBA

2 Web Robot Defined
A Web Robot Is a Program
– That Automatically Traverses the Web Using Hypertext Links
– Retrieving a Particular Document, Then Recursively Retrieving All Documents That Are Referenced
"Recursive" Doesn't Limit the Definition to Any Specific Traversal Algorithm
– Even If a Robot Applies Some Heuristic to the Selection & Order of Documents to Visit & Spaces Out Requests Over a Long Time Period, It Is Still a Robot
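The automatic, recursive traversal described above can be sketched as a small breadth-first crawler. This is an illustrative sketch, not any particular robot: the `fetch` callable stands in for a real HTTP client, and the URLs used in the usage example are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Retrieve a document, then retrieve every document it references,
    recursively (breadth-first).  `fetch` maps a URL to an HTML string;
    `max_pages` keeps the sketch from traversing forever."""
    seen, queue, order = set(), [start_url], []
    while queue and len(order) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links against the current document.
            queue.append(urljoin(url, href))
    return order
```

Passing a dictionary-backed `fetch` (instead of a network client) makes the traversal order easy to inspect: the robot visits the start page, then each page it references, skipping URLs it has already seen.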

3 Web Robot Defined
Normal Web Browsers Are Not Robots
– They Are Operated by a Human
– They Don't Automatically Retrieve Referenced Documents Other Than Inline Images

4 Web Robot Defined
Sometimes Referred to As
– Web Wanderers
– Web Crawlers
– Spiders
These Names Are a Bit Misleading
– They Give the Impression the Software Itself Moves Between Sites Like a Virus
– This Is Not the Case: A Robot Visits Sites by Requesting Documents From Them

5 Agent Defined
The Term Agent Is (Over)Used These Days
Specific Agents Include:
– Autonomous Agent
– Intelligent Agent
– User-Agent

6 Autonomous Agent Defined
An Autonomous Agent Is a Program
– That Automatically Travels Between Sites
– Makes Its Own Decisions About When To Move & When To Stay
– Is Limited to Travel Between Selected Sites
– Currently Not Widespread on the Web

7 Intelligent Agent Defined
An Intelligent Agent Is a Program
– That Helps Users With Certain Activities
  Choosing a Product
  Filling Out a Form
  Finding Particular Items
– Generally Has Little to Do With Networking
– Usually Created & Maintained by an Organization To Assist Its Own Viewers

8 User-Agent Defined
A User-Agent Is a Program
– That Performs Networking Tasks for a User
Web User-Agents
– Navigator
– Internet Explorer
– Opera
Email User-Agents
– Eudora
FTP User-Agents
– HTML-Kit
– Fetch
– CuteFTP

9 Search Engine Defined
A Search Engine Is a Program That Examines a Database
– Upon Request or Automatically
– Delivers Results or Creates a Digest
In the Context of the Web, a Search Engine Is a Program That Examines Databases of HTML Documents
– Databases Gathered by a Robot
– Upon Request, Delivers Results Via an HTML Document

10 Robot Purposes
Robots Are Used for a Number of Tasks
– Indexing, Just Like a Book Index
– HTML Validation
– Link Validation: Searching for Broken Links
– What's New Monitoring
– Mirroring: Making a Copy of a Primary Web Site on a Separate Server
  More Local to Some Users
  Shares the Work Load With the Primary Server

11 Other Popular Names
All Names for the Same Sort of Program, With Slightly Different Connotations
– Web Spiders: Sounds Cooler in the Media
– Web Crawlers: WebCrawler Is a Specific Robot
– Web Worms: A Worm Is a Replicating Program
– Web Ants: Distributed Cooperating Robots

12 Robot Ethics
Robots Have Enjoyed a Checkered History
– Certain Robot Programs Can, and in the Past Have, Overloaded Networks & Servers With Numerous Requests
– This Happens Especially With Programmers Just Starting to Write a Robot Program
These Days There Is Sufficient Information on Robots to Prevent Many of These Mistakes
– But Does Everyone Read It?

13 Robot Ethics
Robots Have Enjoyed a Checkered History
– Robots Are Operated by Humans, Who
  Can Make Mistakes in Configuration
  Don't Consider the Implications of Their Actions
This Means
– Robot Operators Need to Be Careful
– Robot Authors Need to Make It Difficult for Operators to Make Mistakes With Bad Effects

14 Robot Ethics
Robots Have Enjoyed a Checkered History
– Indexing Robots Build a Central Database of Documents, Which Doesn't Always Scale Well to Millions of Documents on Millions of Sites
– Many Different Problems Occur
  Missing Sites & Links
  High Server Loads
  Broken Links

15 Robot Ethics
Robots Have Enjoyed a Checkered History
– The Majority of Robots Are Well Designed, Professionally Operated, Cause No Problems & Provide a Valuable Service
Robots Aren't Inherently Bad, Nor Are They Inherently Brilliant
– They Just Need Careful Attention

16 Robot Visitation Strategies
Robots Generally Start From a Historical URL List
– Especially Documents With Many or Certain Links
  Server Lists
  What's New Pages
  Most Popular Sites on the Web
Other Sources for URLs Are Used
– Scans Through USENET Postings
– Published Mailing List Archives
The Robot Selects URLs to Visit, Index, & Parse, & Uses Them As a Source for New URLs

17 Robot Indexing Strategies
If an Indexing Robot Is Aware of a Document
– The Robot May Decide to Parse the Document
– & Insert the Document's Content Into the Robot's Database
The Decision Depends on the Robot; Some Robots Index
– HTML Titles
– The First Few Paragraphs
– The Entire HTML, Indexing All Words, With Weightings Depending on HTML Constructs
– The META Tag or Other Special Internal Tags
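The title-and-META indexing strategy described above can be sketched with Python's standard html.parser. The sample document and the choice of fields to extract are illustrative assumptions, not the behavior of any specific robot.

```python
from html.parser import HTMLParser


class IndexParser(HTMLParser):
    """Pull out two pieces an indexing robot commonly stores:
    the <title> text and the content of a description <meta> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "description":
                self.description = d.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Feed a small hypothetical document to the parser.
p = IndexParser()
p.feed('<html><head><title>Robots 101</title>'
       '<meta name="description" content="An intro to web robots">'
       '</head><body>...</body></html>')
```

A heavier robot would also tokenize the body text and weight words by the HTML construct they appear in (headings versus plain paragraphs), but the extraction mechanism is the same.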

18 Robot Visitation Strategies
Many Indexing Services Also Allow Web Developers to Submit URLs Manually
– Each Submission Is Queued & Visited by the Robot
The Exact Process Depends on the Robot Service
– Many Services Have a Link to a URL Submission Form on Their Search Page
Certain Aggregators Exist Which Purport to Submit to Many Robots at Once
– http://www.submit-it.com/

19 Determining Robot Activity
Examine Server Logs
– Examine the User-Agent, If Available
– Examine the Host Name or IP Address
– Check for Many Accesses in a Short Time Period
– Check for Access to the Robot Exclusion Document, Found at /robots.txt

20 Apache Access Log Snippet
"GET /robots.txt HTTP/1.0" 200 0 "-" "Scooter-3.2.EX"
"GET / HTTP/1.0" 200 4591 "-" "Scooter-3.2.EX"
"GET /robots.txt HTTP/1.0" 200 64 "-" "ia_archiver"
"GET / HTTP/1.1" 200 4205 "-" "libwww-perl/5.63"
"GET /robots.txt HTTP/1.0" 200 64 "-" "FAST-WebCrawler/3.5 (atw-crawler at fast dot no; http://fast.no/support.php?c=faqs/crawler)"
"GET /robots.txt HTTP/1.0" 200 64 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
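Log lines like those above can be scanned programmatically: a request for /robots.txt is a strong hint that the client is a robot. A sketch, assuming the Apache combined-log convention in which the request line and the User-Agent are the first and last quoted fields:

```python
import re

# Combined-log pattern reduced to the two fields we need: the quoted
# request line ("GET /path HTTP/1.0"), the status and size, the quoted
# referer, and the final quoted User-Agent string.
LOG_RE = re.compile(r'"(?P<request>[^"]*)" \d+ \d+ "[^"]*" "(?P<agent>[^"]*)"')


def robot_agents(lines):
    """Return the User-Agent of every request for /robots.txt."""
    agents = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("request").startswith("GET /robots.txt"):
            agents.append(m.group("agent"))
    return agents
```

Run against the snippet above, this reports Scooter, ia_archiver, FAST-WebCrawler, and Slurp, while skipping the plain page fetches.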

21 After Robot Visitation
Some Webmasters Panic After Being Visited
– Generally Not a Problem
– Generally a Benefit
– No Relation to Viruses
– Little Relation to Hackers
– Close Relation to Lots of Visits

22 Controlling Robot Access
Excluding Robots Is Feasible Using Server Authentication Techniques
– .htaccess File & Directives
  Deny From 0.0.0.0 (IP Address)
  SetEnvIf User-Agent Robot is_a_robot
– Can Increase Server Load
– Seldom Required; More Often (Mis)Desired
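The directives named above might be combined in an .htaccess file like this. A sketch assuming Apache with mod_setenvif and the classic Order/Allow/Deny access control; the "Robot" substring and the documentation IP address are placeholders, not real robot identifiers.

```apache
# Tag any client whose User-Agent contains the substring "Robot".
SetEnvIf User-Agent "Robot" is_a_robot

# Allow everyone by default, then refuse tagged clients
# and one specific (placeholder) IP address.
Order Allow,Deny
Allow from all
Deny from env=is_a_robot
Deny from 192.0.2.1
```

Because every request must now be matched against these rules, this is the server-load cost the slide warns about.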

23 Robot Exclusion Standard
A Robot Exclusion Standard Exists
– Consists of a Single Site-wide File, /robots.txt, Containing Directives, Comment Lines, & Blank Lines
– Not a Locked Door; More of a "No Entry" Sign
– Represents a Declaration of the Owner's Wishes
– May Be Ignored by Incoming Traffic, Much Like a Red Traffic Light
– If Everyone Follows the Rules, the World's a Better Place

24 Sample robots.txt File
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

25 Exclusion Standard Syntax
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
Lines Beginning With '#' Are Comments & Are Ignored
– Comments May Not Appear Mid-Line

26 Exclusion Standard Syntax
User-agent: webcrawler
Disallow:
Specifies That the Robot Named 'webcrawler' Has Nothing Disallowed
– It May Go Anywhere on This Site

27 Exclusion Standard Syntax
User-agent: lycra
Disallow: /
Specifies That the Robot Named 'lycra' Has All URLs Starting With '/' Disallowed
– It May Go Nowhere on This Site, Because All URLs on This Server Begin With a Slash

28 Exclusion Standard Syntax
User-agent: *
Disallow: /tmp
Disallow: /logs
Specifies That All Other Robots Have URLs Starting With '/tmp' & '/logs' Disallowed
– They May Not Access Any URLs Beginning With Those Strings
Note: The '*' Is a Special Token Meaning "Any Other User-agent"
– Regular Expressions Cannot Be Used

29 Exclusion Standard Syntax
Two Common Configuration Errors
– Wildcards Are Not Supported: Do Not Use 'Disallow: /tmp/*'; Use 'Disallow: /tmp'
– Put Only One Path on Each Disallow Line
This May Change in a Future Version of the Standard
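Rather than hand-rolling an exclusion-file parser, a robot written in Python can use the standard library's urllib.robotparser. Here it is fed the sample file from the earlier slide directly, instead of fetching it over the network:

```python
from urllib.robotparser import RobotFileParser

# The sample /robots.txt from the slide above.
rules = """\
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
```

With these rules loaded, `rp.can_fetch(agent, url)` answers the question a polite robot must ask before every request: 'webcrawler' may go anywhere, 'lycra' nowhere, and any other agent anywhere except paths beginning with /tmp or /logs.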

30 robots.txt File Location
The Robot Exclusion File Must Be Placed at the Server's Document Root
For Example:
Site URL                  -> Corresponding robots.txt URL
http://www.w3.org/        -> http://www.w3.org/robots.txt
http://www.w3.org:80/     -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/   -> http://www.w3.org:1234/robots.txt
http://w3.org/            -> http://w3.org/robots.txt

31 Common Mistakes
URLs Are Case Sensitive
– '/robots.txt' Must Be All Lower-Case
Pointless robots.txt URLs
– http://www.w3.org/admin/robots.txt
– http://www.w3.org/~timbl/robots.txt
On a Server With Multiple Users, Like linus.ulltra.com
– robots.txt Cannot Be Placed in Individual Users' Directories
– It Must Be Placed in the Server Root by the Server Administrator

32 For Non-System Administrators
Sometimes Users Have Insufficient Authority to Install a /robots.txt File
– Because They Don't Administer the Entire Server
Use the META Tag in Individual HTML Documents to Exclude Robots
– Prevents the Document From Being Indexed
– Prevents the Document's Links From Being Followed
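The per-document exclusion described above uses the robots META convention: 'noindex' asks robots not to index the page, and 'nofollow' asks them not to follow its links. Placed in the head of an individual document, it might look like this:

```html
<html>
  <head>
    <title>A Page Excluded From Robots</title>
    <!-- Ask robots not to index this page and not to follow its links -->
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>...</body>
</html>
```

Like /robots.txt, this is a request, not an enforcement mechanism; a robot must choose to honor it.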

33 Bottom Line
Use Robot Exclusion to Prevent Time-Variant Content From Being Improperly Indexed
Don't Use It to Exclude Visitors
Don't Use It to Secure Sensitive Content
– Use Authentication If It's Important
– Use SSL If It's Really Important

