
1 Advanced Archive-It Application Training: Crawl Scoping

2 Agenda
– Basic Crawl Scoping and Seed Types
– What to look for in your reports
– How to Change your Crawl Scope
  – Crawl Limits
  – Expand Scope
  – Host Rules
  – Actionable Hosts Report

3 Archive-It Crawling Scope
Scope is how you specify what you want to archive: in-scope URLs will be archived; out-of-scope URLs are not. The scope of a crawl is determined by the seed URLs added to a collection and by any scoping rules specified for your collection.

4 Archive-It Crawling Scope
The crawler starts with your seed URL and follows links within your seed site to archive pages. Only links associated with your seeds will be archived, but all embedded content on an in-scope page is captured, even when it lives on another host.
Example seed: www.archive.org
– Link: www.archive.org/about.html is in scope
– Link: www.ca.gov is NOT in scope
– Embedded image: www.ala.org/logo.jpg is in scope
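As a rough illustration of the seed-based check (a minimal sketch in Python; Archive-It itself runs on Heritrix, and the function and seed list here are hypothetical):

```python
from urllib.parse import urlparse

def is_in_scope(url, seeds):
    """A URL is in scope when its host matches a seed's host."""
    host = urlparse(url).netloc
    return any(host == urlparse(seed).netloc for seed in seeds)

seeds = ["http://www.archive.org/"]
print(is_in_scope("http://www.archive.org/about.html", seeds))  # True: same host
print(is_in_scope("http://www.ca.gov/", seeds))                 # False: different host
# Embedded content (e.g. www.ala.org/logo.jpg) is captured without this test
# whenever it appears on an in-scope page.
```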

5 Archive-It Crawling Scope
Seed URLs can limit the crawl to a single directory of a site (ex: www.archive.org/about/). A / at the end of your URL can have a big effect on scope: parts of the site outside your seed directory will NOT be archived. Without the trailing slash, the seed does not act as a directory limit.
Example seed: www.archive.org/about/
– Link: www.archive.org/webarchive.html is NOT in scope
Example seed: www.archive.org/about
– Link: www.archive.org/webarchive.html IS in scope
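A hypothetical prefix check makes the trailing-slash effect concrete. The rule sketched here (directory-prefix scope when the seed ends in /, host-wide scope otherwise) mirrors the examples above but is an assumption about the behavior, not the crawler's actual implementation:

```python
from urllib.parse import urlparse

def in_scope(url, seed):
    # A seed ending in "/" scopes the crawl to that directory;
    # otherwise the whole host is treated as in scope.
    if seed.endswith("/"):
        return url.startswith(seed)
    return urlparse(url).netloc == urlparse(seed).netloc

print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about/"))  # False: outside the directory
print(in_scope("http://www.archive.org/webarchive.html",
               "http://www.archive.org/about"))   # True: same host
```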

6 Archive-It Crawling Scope
Sub-domains are divisions of a larger site, named to the left of the host name (ex: crawler.archive.org). Sub-domains of seed URLs are NOT automatically in scope. To crawl sub-domains, either:
– Add individual sub-domains as separate seed URLs, or
– Add an 'Expand Scope' rule to allow all or specific sub-domains
Example seed: www.archive.org
– Link: crawler.archive.org is NOT in scope
Example seed: archive.org
– Link: crawler.archive.org IS in scope
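One way to picture the www versus bare-domain difference is a suffix check on the host; this is an illustrative sketch, not the product's exact matching logic:

```python
from urllib.parse import urlparse

def host_in_scope(url, seed_host):
    host = urlparse(url).netloc
    # A bare-domain seed like "archive.org" covers every sub-domain;
    # a fully qualified seed like "www.archive.org" covers only itself.
    return host == seed_host or host.endswith("." + seed_host)

print(host_in_scope("http://crawler.archive.org/", "www.archive.org"))  # False
print(host_in_scope("http://crawler.archive.org/", "archive.org"))      # True
```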

7 Seed Types
– Default – Used for the majority of seeds and the universal setting for most crawls. Captures all links that are in scope.
– Crawl One Page Only – Captures just your seed URL and its embedded content.
– RSS/News Feed – Captures any pages linked from your seed URL as one page only.

8 Analyzing Crawl Scope
How to analyze the scope of your crawls:
– Run a test crawl on new collections or seeds.
– Review the reports of the test crawl (or, for an existing collection, review the reports of an actual crawl).
– Based on the reports, add the appropriate scoping rules.
– Run another test crawl with your scoping rules in place to ensure they are correct.
– Note: running test crawls is just the first step; you may need additional tests to perfect your scoping rules.

9 Hosts Report
Are there numbers in the "Queued" or "Robots.txt Blocked" columns?
– Check the URL lists to see whether you want to capture these URLs.
Are there hosts with fewer or more archived URLs than you expected?
– Fewer: are any expected URLs "Out of Scope"?
– More: are there parts of the site or specific URLs you want to block?

10 Common Reasons to Limit Crawl Scope
– Crawler traps (ex: calendars)
– "Duplicate" URLs (ex: print-version URLs)
– Certain areas of the site you do not care about or do not want to archive
– You just want a snapshot of the site and don't need to crawl it to completion
– You only want to capture one page of a site

11 Modify Crawl Scope

12 Crawl Limits

13 2 Different Types of Rules
Host Constraints
– Ignore robots.txt
– Block a host
– Limit the kinds of URLs from a specific host
  – by text match
  – by regular expression
Expand Scope
– Include URLs in a crawl that would not be in scope by default (a short sketch of SURT follows below)
  – by text match
  – by regular expression
  – by SURT
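Since SURT may be unfamiliar: it is the Sort-friendly URI Reordering Transform used by the Heritrix crawler, which reverses a URL's host labels so related hosts sort and match together as prefixes. A simplified sketch of the transform (the real implementation also handles schemes, ports, and other edge cases):

```python
from urllib.parse import urlparse

def to_surt(url):
    """Simplified SURT: reverse the host labels, comma-separated."""
    parts = urlparse(url)
    reversed_host = ",".join(reversed(parts.netloc.split(".")))
    return "http://(" + reversed_host + ",)" + parts.path

print(to_surt("http://www.archive.org/about/"))
# http://(org,archive,www,)/about/
# A SURT prefix rule such as "http://(org,archive," matches archive.org and
# all of its sub-domains, which is what makes SURT useful for scoping rules.
```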

14 Host Constraints
Host constraints are specific to a host:
– http://www.facebook.com/archiveitorg is a URL
– www.facebook.com is the HOST
– facebook.com is also a host, and a constraint on it applies to all subdomains, including photos.facebook.com
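Extracting the host from a URL is a one-liner in most languages; in Python, for example:

```python
from urllib.parse import urlparse

url = "http://www.facebook.com/archiveitorg"
print(urlparse(url).netloc)  # www.facebook.com -- the HOST part of the URL
```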

15 Adding Host Constraints

16 Adding Host Constraints

17 Adding Host Constraints

18 Adding Host Constraints

19 Actionable Hosts Report
Available in 5.0 Reports: allows you to quickly add and review the rules in place for specific hosts, as well as run a patch crawl for URLs blocked by robots.txt.

20 Actionable Hosts Report

21 Actionable Hosts Report

22 Expand Crawl Scope
How do you know you need to expand your scope?
– Review the 'Out of Scope' column in the Hosts Report.
– While clicking around your archived site, you find 'Not in Archive' patterns that could be addressed by an expand scope rule.

23 Expand Crawl Scope

24 Expand Crawl Scope
– Include all (or only specific) subdomains
– Include certain parts of the site that would not be in scope based on the seed URL
Ex: the seed URL is http://mgahouse.maryland.gov/, but you also want to archive pages such as http://files.maryland.gov/House/report.pdf

25 Expand Crawl Scope
Solution: add an expand scope rule to include URLs that contain "files.maryland.gov".
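In effect, a text-match expand rule adds a second test after the default seed-based check; a minimal sketch with illustrative names (not Archive-It's API), keeping in mind that the crawler still has to discover a link to the URL before the rule can admit it:

```python
from urllib.parse import urlparse

def in_scope(url, seed_host, expand_rules):
    if urlparse(url).netloc == seed_host:
        return True
    # Expand scope: text-match rules admit otherwise out-of-scope URLs.
    return any(rule in url for rule in expand_rules)

rules = ["files.maryland.gov"]
print(in_scope("http://files.maryland.gov/House/report.pdf",
               "mgahouse.maryland.gov", rules))  # True, via the expand rule
```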

26 Expand Crawl Scope WARNING
Expanding scope is a powerful tool, and the more specific the rule, the better. Expand scope rules do not help the crawler discover URLs; they only admit URLs the crawler has already found.
Common mistake scenario: "I'm responsible for archiving amazinguniversity.edu, so I'm going to create an expand scope rule to include any URL containing amazinguniversity.edu." A rule that broad can also pull in pages on other hosts whose URLs merely mention the domain, as the example below shows.
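A quick demonstration of why the substring rule is too broad; the third-party URLs here are hypothetical:

```python
rule = "amazinguniversity.edu"

urls = [
    "http://www.amazinguniversity.edu/admissions",                     # intended
    "http://twitter.com/share?url=http://amazinguniversity.edu/news",  # not intended
    "http://translate.example.com/?page=amazinguniversity.edu",        # not intended
]
for url in urls:
    print(rule in url, url)  # the substring rule matches all three
```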

27 Play It Safe
1. Run test crawls.
2. Deactivate rules when appropriate.

28 Q&A

