1 Advanced Archive-It Application Training: Reviewing Reports and Crawl Scoping.

Presentation transcript:

1 Advanced Archive-It Application Training: Reviewing Reports and Crawl Scoping

2 Agenda
– Basic Crawl Scoping
– What to look for in your reports
– How to change your crawl scope
– Scope-It
– Live examples

3 Archive-It Crawling Scope
– Scope: how you specify what you want to archive
– In-scope URLs are archived; out-of-scope URLs are not
– The scope of a crawl is determined by:
  – the seed URLs added to a collection
  – any scoping rules specified for the collection

4 Archive-It Crawling Scope
– The crawler starts with your seed URL and follows links within your seed site to archive pages
– Only links associated with your seeds are archived
– All embedded content on an in-scope page is captured
– Example seed:
– Link: www.archive.org/about.html is in scope
– Link: www.ca.gov is NOT in scope
– Embedded image: www.ala.org/logo.jpg is in scope

5 Archive-It Crawling Scope
– Seed URLs can limit the crawl to a single directory of a site. ex:
– A "/" at the end of your URL can have a big effect on scope
– Parts of the site not included in your seed directory will NOT be archived
– Example seed: NOT in scope
– Example seed: IS in scope
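The effect of the trailing "/" can be sketched in code. This is a minimal illustration, not Archive-It's implementation: it assumes the crawler derives a scope prefix by cutting the seed's path after its last "/", and the example.org URLs and function names are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlsplit


def scope_prefix(seed: str) -> Tuple[str, str]:
    """Return (host, path prefix) for a seed, cutting the path after its last '/'.

    Assumption: scope is a simple prefix test on host + path.
    """
    parts = urlsplit(seed)
    path = parts.path or "/"
    prefix = path[: path.rfind("/") + 1]
    return parts.hostname or "", prefix


def in_scope(seed: str, url: str) -> bool:
    """True if url is on the seed's host and under the seed's path prefix."""
    host, prefix = scope_prefix(seed)
    target = urlsplit(url)
    target_path = target.path or "/"
    return (target.hostname or "") == host and target_path.startswith(prefix)


# Seed ends in "/": only the /about/ directory is in scope.
print(in_scope("http://example.org/about/", "http://example.org/about/staff.html"))
print(in_scope("http://example.org/about/", "http://example.org/news/"))
# Seed without the trailing "/": the prefix falls back to the site root.
print(in_scope("http://example.org/about", "http://example.org/news/"))
```

Under this assumption, the first and third calls report in scope and the second does not, which is why it is worth checking each seed for a trailing slash before crawling.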

6 Archive-It Crawling Scope
– Sub-domains are divisions of a larger site named to the left of the host name (ex. crawler.archive.org)
– Sub-domains of seed URLs are NOT automatically in scope
– To crawl sub-domains, either:
  – Add individual sub-domains as separate seed URLs
  – Or add an 'Expand Scope' rule to allow all or specific sub-domains
– Example seed:
– Link: crawler.archive.org NOT in scope
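The sub-domain rule can be sketched as a host comparison. This is an illustrative model only: the seed http://archive.org/ is an assumed example, and the `allow_subdomains` flag is a hypothetical stand-in for an 'Expand Scope' rule.

```python
from urllib.parse import urlsplit


def host_in_scope(seed: str, url: str, allow_subdomains: bool = False) -> bool:
    """Compare the link's host with the seed's host.

    By default only an exact host match is in scope; with allow_subdomains
    (a stand-in for an 'Expand Scope' rule) any sub-domain of the seed host
    is accepted as well.
    """
    seed_host = urlsplit(seed).hostname or ""
    host = urlsplit(url).hostname or ""
    if host == seed_host:
        return True
    return allow_subdomains and host.endswith("." + seed_host)


seed = "http://archive.org/"  # assumed seed for illustration
link = "http://crawler.archive.org/index.html"
print(host_in_scope(seed, link))                         # default behaviour
print(host_in_scope(seed, link, allow_subdomains=True))  # with an expand rule
```

The alternative named on the slide, adding crawler.archive.org as its own seed, corresponds to calling the function with that sub-domain as `seed`.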

7 Analyzing Crawl Scope
How to analyze the scope of your crawls:
– Run a test crawl on new collections or seeds
– Review the reports of the test crawl (or, for an existing crawl, the reports of the actual crawl)
– Based on the reports, add the appropriate scoping rules
– Run another test crawl with your scoping rules in place to ensure they are correct
– Note: This can be a trial-and-error process, so be patient

8 Reviewing Reports
How to make the most of your time reviewing reports:
– Review high-level reports first (Seed Status and Seed Source) for seed-level issues
– Then review more detailed reports (Hosts report and file-type-specific reports)
– Run a QA Report to see if any embedded content on your seed pages was not captured

9 Seed Status Report
Are there any seeds not being crawled?
– Double check your seed URLs are correct
– Ignore robots.txt

10 Seed Source Report
Are there any seeds that are capturing far fewer or far more URLs than others?
– Fewer: Was the seed "Not Crawled" in the Seed Status report?
– More: Check the Hosts report for any obvious area to limit your crawl
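One way to spot seeds that capture far fewer or far more URLs than the others is to compare each seed with the median. The sketch below is only an illustration: the seed names and counts are made up, and the factor of 10 is an arbitrary threshold, not an Archive-It setting.

```python
from statistics import median
from typing import Dict


def flag_outlier_seeds(url_counts: Dict[str, int], factor: float = 10.0) -> Dict[str, str]:
    """Label seeds whose URL count is far below or far above the median."""
    mid = median(url_counts.values())
    flags = {}
    for seed, count in url_counts.items():
        if count * factor < mid:
            flags[seed] = "fewer"   # check the Seed Status report
        elif count > mid * factor:
            flags[seed] = "more"    # check the Hosts report
    return flags


# Hypothetical per-seed URL counts copied from a Seed Source report.
counts = {
    "http://example.org/": 1200,
    "http://example.com/": 5,
    "http://example.net/": 1500,
    "http://example.edu/": 90000,
    "http://example.info/": 1100,
}
print(flag_outlier_seeds(counts))
```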

11 Hosts Report
Are there numbers in the "Queued" or "Robots.txt Blocked" column?
– Check the URL lists to see if you want to capture these URLs or not
Are there hosts with fewer or more archived URLs than you expected?
– Fewer: Are any expected URLs "Out of Scope"?
– More: Are there parts of the site or specific URLs you want to block?
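If the Hosts report is copied into a spreadsheet or CSV file, the first check above can be automated. The sketch below assumes a hypothetical CSV layout: the column names `host`, `queued` and `robots_blocked` and the sample rows are invented for illustration and are not the actual export format.

```python
import csv
import io

# Hypothetical export of a Hosts report; column names are assumptions.
SAMPLE_REPORT = """\
host,archived,queued,robots_blocked,out_of_scope
example.org,1200,0,0,35
cdn.example.org,40,0,18,0
events.example.org,5000,25000,0,2
"""


def hosts_needing_review(report_csv: str):
    """Return (host, reasons) for hosts with queued or robots.txt-blocked URLs."""
    flagged = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        reasons = [col for col in ("queued", "robots_blocked") if int(row[col]) > 0]
        if reasons:
            flagged.append((row["host"], reasons))
    return flagged


for host, reasons in hosts_needing_review(SAMPLE_REPORT):
    print(host, reasons)
```

The flagged hosts are the ones whose URL lists are worth opening to decide whether those URLs should be captured.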

12 QA Report
Is there embedded content on your seed pages that was not captured?
– Run a Patch Crawl!

13 File Type/PDF/Videos Reports
Are there file types you expected to archive that were not archived?
– Check the "Out of Scope" column of the Hosts report for files not captured

14 Changing Crawl Scope
The default Archive-It crawl settings can be adjusted:
– Use Modify Crawl Scope options to limit or expand scope for specific websites
– Use Seed Types other than the default to change the scope of a seed in specific ways
– Use Scope-It to refine the scope of your collection

15 Common Reasons to Limit Crawl Scope
– Crawler traps (ex: calendars)
– "Duplicate" URLs (ex: print version URLs)
– There are certain areas of the site you do not care about or do not want to archive
– You just want a snapshot of the site, and don't necessarily want to crawl it to completion
– You only want to capture one page of a site
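Crawler traps and duplicate URLs are typically excluded with pattern rules. The sketch below tests candidate regular expressions against sample URLs before they are entered as block rules. The patterns and URLs are hypothetical, and the sketch assumes a rule must match the whole URL, so each pattern starts with `.*`.

```python
import re

# Hypothetical block patterns; each must match the entire URL.
BLOCK_PATTERNS = [
    re.compile(r".*/calendar/.*"),                # calendar pages that link on forever
    re.compile(r".*[?&]print=(1|true)(&.*)?"),    # print versions of regular pages
]


def is_blocked(url: str) -> bool:
    """True if any block pattern matches the full URL."""
    return any(pattern.fullmatch(url) for pattern in BLOCK_PATTERNS)


samples = [
    "http://example.org/calendar/2031/05/",
    "http://example.org/story.html?print=1",
    "http://example.org/story.html",
]
for url in samples:
    print(url, "blocked" if is_blocked(url) else "kept")
```

Trying a pattern on a handful of URLs taken from a report is a cheap way to catch a rule that blocks too much or too little before spending crawl budget on it.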

16 Changing Crawl Scope
How do you know you need to limit your scope?
– You are using up more of your document budget than you want to or expected
– Reviewing the Queued Docs in the Hosts report shows many URLs that you do not want or need

17 Common Reasons to Expand Crawl Scope
– Include all (or only specific) subdomains
– Include certain parts of the site that may not have been included based on the seed URL
  – Ex: seed URL is: /default.aspx
  – But you also want to archive pages such as

18 Changing Crawl Scope
How do you know you need to expand your scope?
– Review the 'Out of Scope' column in the Hosts report for a real or test crawl. If you see URLs you would like to be archived, make the appropriate scoping rule
– If, while clicking around your archived site, you find 'Not in Archive' pages that you want captured, make the appropriate scoping rule and recrawl

19 Changing Crawl Scope “Not in Archive” example for seed

20 Changing Crawl Scope
Other ways to expand scope:
– Ignoring robots.txt blocks
  – Not available by default, but the feature can be turned on by request for a partner
  – Can ignore robots.txt blocks on a per-host basis
  – Can be helpful for capturing social media sites and stylesheets, as well as sites not in your organization's domain
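To see why a robots.txt block can keep stylesheets out of an archive, it helps to evaluate a robots.txt file by hand. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the URLs and the user-agent name are hypothetical, and the sketch only shows how a block is detected, not how Archive-It ignores one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides the stylesheet directory from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked_by_robots(url: str, user_agent: str = "example-crawler") -> bool:
    """True if the robots.txt rules above forbid fetching this URL."""
    return not parser.can_fetch(user_agent, url)


print(blocked_by_robots("http://example.org/css/site.css"))  # stylesheet
print(blocked_by_robots("http://example.org/index.html"))    # ordinary page
```

With this file the page itself can be fetched but its stylesheet cannot, which is the kind of case where ignoring robots.txt for that host improves the archived copy.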

21 Modify Crawl Scope
– Host constraints
  – Block completely
  – Block certain URLs (URL contains, regular expression)
  – Limit host to maximum number of URLs (documents)
  – (optional) Ignore robots.txt block
– Crawl Limits
  – Limit by number of URLs (documents) or amount of data
  – Crawl PDFs only
  – Change maximum crawl duration
– Expand Scope Rules
  – Crawl certain URLs (URL contains, regular expression, SURTs)
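A SURT (Sort-friendly URI Reordering Transform) writes the host name in reverse order, so that a single prefix can cover a site and all of its sub-domains. The sketch below is a simplified conversion for illustration only: it ignores ports, query strings and other details a real crawler handles, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: scheme, reversed comma-separated host, then path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


print(to_surt("http://crawler.archive.org/index.html"))
# prints: http://(org,archive,crawler,)/index.html

# A prefix ending after "archive," covers archive.org and every sub-domain of it.
prefix = "http://(org,archive,"
print(to_surt("http://crawler.archive.org/index.html").startswith(prefix))
print(to_surt("http://www.ca.gov/").startswith(prefix))
```

Because the most general part of the host comes first, the SURT option suits rules such as "all sub-domains of this site", while "URL contains" and regular expressions suit rules about paths.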

22 Host Constraints

23 Crawl Limits

24 Expand Crawl Scope

25 Seed Types
– Crawl One Page Only: Capture just your seed URL and embedded content
– RSS/News Feed: Capture any linked pages from your seed URL as one page only

26 Scope-It
A tool for limiting the scope of new or existing collections

27 Why Use Scope-It?
For existing collections/crawls:
– Analyze completed crawls in existing collections and revise the scope for future crawls
– View host report information and add rules to your collection from the same screen
– Add the same scoping rules to multiple collections at once

28 Why Use Scope-It?
To test and scope new seeds before creating a collection:
– Run a "Scope-It" test crawl on a set of seeds that are not yet part of a collection
– Analyze the results of the crawl and potentially create a new collection with scoping rules in place

29 Changing Crawl Scope And now for some real-life examples...

30 Thank you! Any Questions, Discussion and/or Feedback? Please take our quick survey: