
Slide 1: Indexing your web server(s)
Helen Varley Sargan
Institutional Webmasters Workshop, 7-9 September 1999
University of Cambridge Computing Service

Slide 2: Why create an index?
- Helps users (and webmasters) to find things... but isn't a substitute for good navigation
- Gives cohesion to a group of unrelated servers
- Observation of logs gives information on what people are looking for - and what they are having trouble finding
- You are already being part-indexed by many search engines, unless you have taken specific action against it

Slide 3: Current situation

  Name        Total
  ht://Dig       25
  Excite         19
  Microsoft      12
  Harvest         8
  Ultraseek       7
  SWISH           5
  Webinator       4
  Netscape        3
  wwwwais         3
  FreeFind        2
  Other          13
  None           59

Based on a UKOLN survey of the search engines used in 160 UK HEIs, carried out in July/August 1999. Report to be published in Ariadne issue 21.

Slide 4: Current situation questions
- Is the version of Muscat used by Surrey the free version that was available for a time (but is no longer)?
- Are the users of Excite quite happy with its security record, and with the fact that development seems to have ceased?
- Are users of local search engines that don't use robots.txt happy with what other search engines can index on their sites? (You have got a robots.txt file, haven't you? See the example below.)
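For illustration, a minimal robots.txt file of the kind referred to above. The syntax is the standard robots exclusion format; the directory names and the robot name are invented examples, not taken from the talk:

  # Served from the root of the web server as /robots.txt
  # Rules for all robots
  User-agent: *
  # Keep robots out of these (hypothetical) areas
  Disallow: /cgi-bin/
  Disallow: /internal/

  # A particular robot can be shut out completely
  User-agent: ExampleBot
  Disallow: /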

Slide 5: Types of tool
- External services are robots
- Tools you install yourself fall into two main categories (some will work both ways):
  – direct indexes of the local and/or networked file structure
  – robot- or spider-based, following instructions from the robots.txt file on each web server indexed
- The programs either come in a form you have to compile yourself or precompiled for your OS, or they are written in Perl or Java and so need a Perl or Java runtime to function

Slide 6: Controlling robot access 1
- All of our web servers are being part-indexed by external robots
- Control of external robots and of a local robot-mediated indexer is by the same route:
  – a robots.txt file to give access information
  – meta tags for robots in each HTML file, granting or refusing indexing and link-following (see the example below)
  – meta tags in each HTML file giving a description and keywords
- The first two controls are observed by all the major search engines. Some search engines do not observe description and keyword meta tags.
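A sketch of the per-page controls listed above, using standard HTML meta tags; the description and keyword text is made up for illustration:

  <!-- In the <head> of each HTML file -->
  <!-- Ask robots not to index this page, but still to follow its links -->
  <meta name="robots" content="noindex,follow">

  <!-- Description and keywords, honoured by some (not all) search engines -->
  <meta name="description" content="Minutes of the web liaison group meetings">
  <meta name="keywords" content="web, indexing, search, minutes">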

Slide 7: Controlling robot access 2
- There is some patchy support for Dublin Core metadata (example below)
- Access to branches of the server can be limited by the server software - by combining access control with metadata you can give limited information to some users and more to others
- If you don't want people to read files, either password-protect that section of the server or remove the files
- Limiting robot access to a directory can make nosy users flock to look at what's inside
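For completeness, a small example of Dublin Core metadata embedded as HTML meta tags, following the common DC.* naming convention; the values shown are invented for illustration:

  <meta name="DC.Title" content="Indexing your web server(s)">
  <meta name="DC.Creator" content="Helen Varley Sargan">
  <meta name="DC.Date" content="1999-09-08">
  <meta name="DC.Subject" content="web indexing; search engines">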

Slide 8: Security
- There has been a security problem with indexing software (the free version of Excite, in 1998)
- Remember the security of the OS the indexing software is running under - keep all machines up to date with security patches, whether they are causing trouble or not
- Seek help with security if you are not an expert in the OS, particularly with Unix or Windows NT

Slide 9: What tool to use? 1
- Find out first whether any money, hardware and/or staff are available for the project
- Make a shopping list of your requirements and conditions:
  – hosting the index (where)?
  – platform (available and desirable)?
  – how many servers (and/or pages) will I index?
  – is the indexed data very dynamic?
  – what types of files do I want indexed?
  – what kind of search (keyword, phrase, natural language, constrained)?
- Are you concerned about how you are indexed by others?

Slide 10: What tool to use? 2
- Equipped with the answers to the previous questions, you will be able to select a suitable category of tool
- If you are concerned about how others index your site, install a local robot- or spider-based indexer and look at indexer control measures
- Free externally hosted services suit very small needs
- Free tools (mainly Unix-based) suit the technically literate, or come built in to some server software
- Commercial tools cover a range of platforms and pocket depths but vary enormously in features

Slide 11: Free externally hosted services
- Will be limited in the number of pages indexed and possibly the number of times the index is accessed, and may be deleted if not used for a certain number of days (5-7)
- Very useful for small sites and/or those with little technical experience or few resources
- Access is prey to Internet traffic (most services are in the US) and server availability, and for UK users the incoming transatlantic traffic will be charged for
- You may have to carry advertising on your search page as a condition of use

Slide 12: Free tools - built in
- Microsoft, Netscape, WebStar, WebTen and WebSite Pro all come with built-in indexers (others may too)
- With any of these there may be problems indexing some other servers, since they all use vendor-specific APIs (they may receive responses from other servers that they can't interpret)
- Problems are more likely the more numerous and varied the server types being indexed

Slide 13: Free tools - installed
- The most active current development is on SWISH (both E and ++), Webglimpse, ht://Dig and Alkaline
- Alkaline is a new product; all the others have been through long periods of inactivity, and all depend on volunteer effort
- All of these are now robot-based, but may have other means of looking at directories as well
- Alkaline is available on Windows NT; all the others are Unix. Some need to be compiled (a sample ht://Dig configuration follows)
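As a rough sketch of what configuring one of these tools involves, here is a fragment of an ht://Dig configuration file (htdig.conf). The directive names are standard ht://Dig attributes, but the hostnames, paths and contact address are invented for illustration; a real installation should follow the ht://Dig documentation:

  # Where the index databases are written
  database_dir:   /opt/htdig/db
  # The crawl starts here and follows links
  start_url:      http://www.example.ac.uk/
  # Stay within these URL prefixes
  limit_urls_to:  http://www.example.ac.uk/ http://docs.example.ac.uk/
  # Skip CGI output and image directories
  exclude_urls:   /cgi-bin/ /images/
  # Contact address sent with the robot's requests
  maintainer:     webmaster@example.ac.uk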

Slide 14: Commercial tools
- Most have specialisms - sort out your requirements very carefully before you select a shortlist
- Prices vary from US$250 to £10,000+ (possibly with additional yearly maintenance), depending on the product
- The cost of most will be on a sliding scale depending on the size of the index
- Bear in mind that Java-based tools will require the user to be running a Java-enabled browser

Slide 15: Case study 1 - Essex
Platform: Windows NT
Number of servers searched: 16
Number of entries: approx. 11,500
File types indexed: Office files, HTML and text; filters available for other formats
Index updating: Configured with the Windows task scheduler; incremental updates possible
Constrained searches possible: Yes
Configuration: Follows robots.txt but can take a 'back door' route as well; obeys the robots meta tag
Logs and reports: Creates reports on crawling progress; log analysis not included but can be written as add-ons (ASP scripts)
Pros: Free of charge with Windows NT.
Cons: Needs a high level of Windows NT expertise to set up and run effectively. May run into problems indexing servers running diverse server software. Not compatible with Microsoft Index Server (a single-server product). Creates several catalog files, which may create network problems when indexing many servers.

Slide 16: Case study 2 - Oxford
Platform: Unix
Number of servers searched: 131
Number of entries: approx. 43,500 (indexing goes at most 9 levels down on any server)
File types indexed: Office files, HTML and text; filters available for other formats
Index updating: Configured to reindex after a set time period; incremental updates possible
Constrained searches possible: Yes, but they need to be configured on the ht://Dig server
Configuration: Follows robots.txt but can take a 'back door' route as well
Logs and reports: None generated in an obvious manner, but probably available somehow
Pros: Free of charge. A wide range of configuration options available.
Cons: Needs a high level of Unix expertise to set up and run effectively. Index files are very large.

Slide 17: Case study 3 - Cambridge
Platform: Unix
Number of servers searched: 232
Number of entries: approx. 188,000
File types indexed: Many formats, including PDF, HTML and text
Index updating: Intelligent incremental reindexing dependent on the frequency of file updates - can be given a permitted schedule; manual incremental updates are easily done
Constrained searches possible: Yes, easily configured by users; a known constrained search can also be added to the configuration
Configuration: Follows robots.txt and meta tags. Configurable weighting given to terms in the title and meta tags. A thesaurus add-on is available to give user-controlled alternatives
Logs and reports: Logs and reports available for every aspect of use - search terms, number of terms, servers searched, etc.
Pros: Very easy to install and maintain. Gives extremely good results in a problematic environment. Excellent technical support.
Cons: Relatively expensive.

Slide 18: Recommendations
- Choosing an appropriate search engine is wholly dependent on your particular needs and circumstances
- Sort out all your robot-based indexing controls when you install your local indexer
- Do review your indexing software regularly - even if it's trouble-free, it still needs maintaining

