1 Advanced Archive-It Application Training: Crawl Scoping.

Presentation transcript:

1 Advanced Archive-It Application Training: Crawl Scoping

2 Agenda
Basic Crawl Scoping and Seed Types
What to look for in your reports
How to Change your Crawl Scope
– Crawl Limits
– Expand scope
– Host rules
– Actionable Host report

3 Archive-It Crawling Scope
Scope: how you specify what you want to archive.
– In-scope URLs will be archived.
– Out-of-scope URLs are not archived.
The scope of a crawl is determined by:
– the seed URLs added to a collection
– any scoping rules specified for your collection

4 Archive-It Crawling Scope
The crawler will start with your seed URL and follow links within your seed site to archive pages.
Only links associated with your seeds will be archived.
All embedded content on an in-scope page is captured.
Example seed:
– Link: www.archive.org/about.html is in scope
– Link: www.ca.gov is NOT in scope
– Embedded image: www.ala.org/logo.jpg is in scope

5 Archive-It Crawling Scope
Seed URLs can limit the crawl to a single directory of a site.
ex:
* A / at the end of your URL can have a big effect on scope
* Parts of the site not included in your seed directory will NOT be archived
Example seed:
– Link: www.archive.org/webarchive.html is NOT in scope
Example seed:
– Link: www.archive.org/webarchive.html IS in scope
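
As a rough illustration of why the trailing slash matters, the sketch below treats a seed's scope as everything under the seed's directory. It is a simplified model, not Archive-It's actual implementation. The function names and the seeds `www.archive.org/about/` and `www.archive.org/about` are hypothetical stand-ins, because the example seeds on this slide were not preserved. It assumes that a seed without a trailing slash is read as a file, so its scope falls back to the enclosing directory.

```python
from urllib.parse import urlsplit


def _split(url: str):
    """Parse a URL, tolerating a missing scheme (slides omit 'http://')."""
    return urlsplit(url if "://" in url else "http://" + url)


def scope_prefix(seed: str) -> str:
    """Return host + directory of the seed (assumed scoping model)."""
    parts = _split(seed)
    path = parts.path or "/"
    # Keep everything up to and including the last '/', dropping any file name.
    directory = path[: path.rfind("/") + 1]
    return parts.netloc.lower() + directory


def in_scope(seed: str, url: str) -> bool:
    """True if the URL falls under the seed's host and directory."""
    parts = _split(url)
    candidate = parts.netloc.lower() + (parts.path or "/")
    return candidate.startswith(scope_prefix(seed))


link = "www.archive.org/webarchive.html"
print(in_scope("www.archive.org/about/", link))  # False: scope is the /about/ directory
print(in_scope("www.archive.org/about", link))   # True: scope falls back to the whole site
```

Under this model the two hypothetical seeds differ by a single character, yet one excludes the link and the other includes it.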

6 Archive-It Crawling Scope
Sub-domains are divisions of a larger site named to the left of the host name (ex. crawler.archive.org).
Sub-domains of seed URLs are NOT automatically in scope.
To crawl sub-domains, either:
– Add individual sub-domains as separate seed URLs
– Or add an ‘Expand Scope’ rule to allow all or specific sub-domains
Example seed:
– Link: crawler.archive.org is NOT in scope
Example seed: archive.org
– Link: crawler.archive.org IS in scope
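
The two examples above can be modelled with a simple host comparison. The sketch below is an assumption about the behaviour the slide describes, not Archive-It's code. The helper names are hypothetical, and `www.archive.org` is a guessed stand-in for the first example seed, which was not preserved.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Extract the lower-cased host, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def subdomain_in_scope(seed: str, link: str) -> bool:
    """A link's host is in scope if it equals the seed host or sits beneath it."""
    seed_host = host_of(seed)
    link_host = host_of(link)
    return link_host == seed_host or link_host.endswith("." + seed_host)


# Hypothetical seed 'www.archive.org': crawler.archive.org is a sibling, not a child.
print(subdomain_in_scope("www.archive.org", "crawler.archive.org"))  # False
# Seed 'archive.org': crawler.archive.org sits beneath it.
print(subdomain_in_scope("archive.org", "crawler.archive.org"))      # True
```

The leading dot in the suffix check keeps an unrelated host such as `notarchive.org` from matching the seed `archive.org`.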

7 Seed Types
Default – Used for the majority of seeds and the standard setting for most crawls. Captures all links that are in scope.
Crawl One Page Only – Captures just your seed URL and its embedded content.
RSS/News Feed – Captures every page linked from your seed URL as one page only.

8 Analyzing Crawl Scope
How to analyze the scope of your crawls:
– Run a test crawl on new collections or seeds
– Review the reports of the test crawl (or, for an existing crawl, the reports of the actual crawl)
– Based on the reports, add the appropriate scoping rules
– It is a good idea to run a test crawl with your scoping rules in place to ensure they are correct
– Note: Running a test crawl is just the first step. You may need to run additional tests to perfect your scoping rules.

9 Hosts Report
Are there numbers in the “Queued” or “Robots.txt Blocked” column?
– Check the URL lists to see if you want to capture these URLs or not
Are there hosts with fewer or more archived URLs than you expected?
– Fewer: Are any expected URLs “Out of Scope”?
– More: Are there parts of the site or specific URLs you want to block?

10 Common Reasons to Limit Crawl Scope
Crawler traps (ex: calendars)
“Duplicate” URLs (ex: print version URLs)
If there are certain areas of the site you do not care about or do not want to archive
If you just want a snapshot of the site, and don’t necessarily want to crawl it to completion
If you only want to capture one page of a site
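
Crawler traps and duplicate URLs are typically recognised by a pattern in the URL. The sketch below shows how two such patterns could be tried out against sample URLs before a rule is entered. The `/calendar/` path segment, the `print=1` parameter and the `example.org` URLs are all hypothetical, and the sketch assumes a pattern must match the whole URL, which should be checked against Archive-It's own documentation.

```python
import re

# Hypothetical patterns: real sites use their own path and parameter names.
CALENDAR = re.compile(r"^.*/calendar/.*$")        # endless next-month links
PRINT_VERSION = re.compile(r"^.*[?&]print=1.*$")  # printer-friendly duplicates


def should_block(url: str) -> bool:
    """True if the whole URL matches any blocking pattern."""
    return any(p.fullmatch(url) for p in (CALENDAR, PRINT_VERSION))


samples = [
    "http://example.org/calendar/2013/10/",
    "http://example.org/news/story.html?print=1",
    "http://example.org/news/story.html",
]
for url in samples:
    print(url, should_block(url))
```

Testing a pattern on a handful of known-good and known-bad URLs first makes it less likely that a rule blocks content you wanted to keep.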

11 Modify Crawl Scope

12 Crawl Limits

13 2 Different Types of Rules
Host Constraints
– Ignore Robots.txt
– Block a host
– Limit the kinds of URLs from a specific host
  -by text match
  -by Regular Expression
Expand Scope
– Include URLs in a crawl that would not be in scope by default
  -by text match
  -by regular expression
  -by SURT
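
A SURT (Sort-friendly URI Reordering Transform) writes the host name in reverse order so that a host and all of its sub-domains share one prefix. The sketch below is a deliberately simplified conversion that ignores ports, query strings and other details of the full format. The function name and the prefix check are illustrative assumptions, not Archive-It's implementation.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT: scheme, reversed comma-separated host labels, then path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{parts.scheme}://({reversed_host},){parts.path}"


# A SURT prefix left open after 'archive,' covers archive.org and every sub-domain.
prefix = "http://(org,archive,"

print(to_surt("http://crawler.archive.org/index.html"))
# http://(org,archive,crawler,)/index.html
print(to_surt("http://crawler.archive.org/index.html").startswith(prefix))  # True
print(to_surt("http://www.ca.gov/").startswith(prefix))                     # False
```

Because the host is reversed, a rule can be anchored to a domain and cannot be satisfied by the same text appearing later in a path or query string.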

14 Host Constraints
Specific to a host
is a URL
is the HOST
facebook.com is a host, and a constraint on it applies to all subdomains, including photos.facebook.com
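
One way to picture a constraint on facebook.com also covering photos.facebook.com is a lookup that walks from the full host name up through its parent domains. This is only a sketch of that idea. The function name, the dictionary layout and the `"block"` label are hypothetical and do not reflect how Archive-It stores its rules.

```python
from typing import Dict, Optional


def find_constraint(host: str, constraints: Dict[str, str]) -> Optional[str]:
    """Return the constraint for the host or its nearest parent domain, if any."""
    labels = host.lower().split(".")
    # Try 'photos.facebook.com', then 'facebook.com', then 'com'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in constraints:
            return constraints[candidate]
    return None


constraints = {"facebook.com": "block"}  # hypothetical rule table

print(find_constraint("photos.facebook.com", constraints))  # block
print(find_constraint("facebook.com", constraints))         # block
print(find_constraint("notfacebook.com", constraints))      # None
```

Matching on whole labels rather than on a substring keeps a host such as `notfacebook.com` from picking up the rule by accident.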

15 Adding Host Constraints

16 Adding Host Constraints

17 Adding Host Constraints

18 Adding Host Constraints

19 Actionable Hosts Report
Available in 5.0 Reports: Allows you to quickly add and review rules that were in place for specific hosts, as well as run a patch crawl for URLs blocked by Robots.txt.

20 Actionable Hosts Report

21 Actionable Hosts Report

22 Expand Crawl Scope
How do you know you need to expand your scope?
– Review the ‘Out of Scope’ column in the Hosts Report.
– Click around your archived site and look for ‘Not in Archive’ trends that could be addressed by an expand scope rule.

23 Expand Crawl Scope

24 Expand Crawl Scope
– Include all (or only specific) subdomains
– Include certain parts of the site that may not have been included based on the seed URL
Ex: seed URL is:
But you also want to archive pages such as:

25 Expand Crawl Scope
Solution: Add an expand scope rule to include URLs that contain: “files.maryland.gov”

26 Expand Crawl Scope WARNING
Expanding scope is a powerful tool, and the more specific the better.
Expand scope rules do not help the crawler discover URLs.
Common mistake scenario: I’m responsible for archiving amazinguniversity.edu, so I’m going to create an expand scope rule to include any URL with amazinguniversity.edu.
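
The slide does not spell out why this scenario is a mistake. One plausible reason is that a text-match rule looks at the whole URL, so it also matches pages on other sites that merely mention the domain. The sketch below illustrates that under the assumption that a text match is a plain substring test. The function names and all sample URLs (including `social.example.com`) are hypothetical.

```python
from urllib.parse import urlsplit


def expand_by_text(url: str, text: str) -> bool:
    """Assumed text-match behaviour: the rule text appears anywhere in the URL."""
    return text in url


def expand_by_host(url: str, domain: str) -> bool:
    """Narrower alternative: only the URL's host is compared with the domain."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


share_link = "http://social.example.com/share?u=http://www.amazinguniversity.edu/about"

# The broad text rule pulls in a third-party page that only mentions the domain.
print(expand_by_text(share_link, "amazinguniversity.edu"))  # True
# Comparing hosts keeps the rule on the university's own sites.
print(expand_by_host(share_link, "amazinguniversity.edu"))  # False
print(expand_by_host("http://www.amazinguniversity.edu/about", "amazinguniversity.edu"))  # True
```

A more specific rule, such as one tied to a single sub-domain or a SURT prefix, keeps the extra scope limited to the content you actually intend to archive.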

27 Play it safe
1. Run Test Crawls
2. Deactivate Rules when appropriate.

28 Q&A