Goals
Use the tools within the Archive-It web application effectively to get the best-quality capture possible of your archived content, including the embedded resources necessary to the display and functionality of all in-scope content.
See the recorded training videos for more detailed information about crawl scoping.
Quality Assurance Tips
1. Prioritize crawls and websites within your collection to use your time effectively
2. Review reports, including the QA report
3. Browse your websites
   - Wayback QA
   - Proxy Mode
Reviewing Reports
How to make the most of your time reviewing reports:
- Review the high-level reports first (Seed Status and Seed Source) for seed-level issues
- Then review the more detailed reports (Hosts report and file-type-specific reports)
- Run a QA Report to see whether any embedded content on your seed pages was not captured
Seed Status Report
Are there any seeds not being crawled?
- Double-check that your seed URLs are correct
- Ignore robots.txt (if robots.txt is what is blocking the seed)
Seed Source Report
Are there any seeds that are capturing far fewer or far more URLs than others?
- Fewer: Was the seed “Not Crawled” in the Seed Status report?
- More: Check the Hosts report for any obvious area where you could limit your crawl
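If a collection has many seeds, the "far fewer or far more" comparison can be sketched in a few lines of Python. This is only an illustration, not an Archive-It feature: it assumes you have copied the per-seed URL counts out of the Seed Source report by hand, and the function name, the sample seeds and the factor-of-10 threshold are all hypothetical.

```python
from statistics import median

def flag_outlier_seeds(url_counts, factor=10):
    """Split seeds into 'fewer' and 'more' lists relative to the median URL count.

    url_counts: dict mapping seed URL -> number of URLs captured from that seed
    factor: hypothetical threshold; a seed is flagged when its count is more
            than `factor` times below or above the median
    """
    if not url_counts:
        return [], []
    mid = median(url_counts.values())
    fewer = sorted(s for s, n in url_counts.items() if n * factor < mid)
    more = sorted(s for s, n in url_counts.items() if n > mid * factor)
    return fewer, more

# Made-up counts, as if copied from a Seed Source report.
counts = {
    "http://example.org/": 1200,
    "http://example.org/news/": 950,
    "http://example.com/blog/": 3,
    "http://example.net/calendar/": 480000,
    "http://example.edu/": 1500,
}
fewer, more = flag_outlier_seeds(counts)
print("Far fewer URLs than other seeds:", fewer)
print("Far more URLs than other seeds:", more)
```

Seeds in the "fewer" list are the ones to look up in the Seed Status report; seeds in the "more" list are the ones to look up in the Hosts report.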
Hosts Report
Are there numbers in the “Queued” or “Robots.txt Blocked” column?
- Check the URL lists to see whether you want to capture these URLs or not
Are there hosts with fewer or more archived URLs than you expected?
- Fewer: Are any expected URLs “Out of Scope”?
- More: Are there parts of the site or specific URLs you want to block?
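The first Hosts report check can be illustrated with a short Python sketch. It assumes the report rows have been transcribed into dictionaries; the key names ("host", "queued", "robots_blocked") and the sample hosts are hypothetical and do not reflect any actual Archive-It export format.

```python
def hosts_to_review(rows):
    """Return (host, queued, robots_blocked) for hosts with a nonzero count in either column."""
    flagged = []
    for row in rows:
        queued = int(row.get("queued", 0))
        blocked = int(row.get("robots_blocked", 0))
        if queued > 0 or blocked > 0:
            flagged.append((row["host"], queued, blocked))
    return flagged

# Made-up rows, as if transcribed from a Hosts report.
rows = [
    {"host": "example.org", "archived": 5400, "queued": 0, "robots_blocked": 0},
    {"host": "cdn.example.org", "archived": 120, "queued": 0, "robots_blocked": 37},
    {"host": "calendar.example.org", "archived": 90000, "queued": 2500, "robots_blocked": 0},
]
for host, queued, blocked in hosts_to_review(rows):
    print(f"{host}: queued={queued}, robots.txt blocked={blocked}")
```

Each flagged host is a prompt to open its URL list and decide whether those URLs belong in the collection; the sketch does not make that decision for you.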
File Type/PDF/Videos Reports
Are there file types you expected to archive that were not archived?
- Check the “Out of Scope” column of the Hosts report for files that were not captured
QA Report
Is there embedded content on your seed pages that was not captured?
- Run a patch crawl!
QA Report
The Reports menu shows at a glance which crawls you have already run a QA report for.
Wayback QA
Why is this helpful?
- Wayback QA allows you to perform automated quality assurance work as you browse through your archived pages in Wayback.
- Wayback QA notes any files missing from the pages you view and lets you run a patch crawl to capture these files and improve the display of your archived pages.
Wayback QA - Tips
- Browse through all of the sites that you would like to QA before running a patch crawl; you can do one patch crawl across your entire collection.
- Wayback QA can sometimes be an iterative process.
- Ignoring robots.txt for a patch crawl does not change the crawl settings for future crawls.
Wayback QA vs. QA Report
Wayback QA
- Immediately checks for missing resources
- Can be conducted on any page
- Occurs while browsing in Wayback
- Patch crawl: selective
QA Report
- Takes 24 hours to generate after content is in Wayback
- Includes initial seed pages
- Tied to a specific crawl report
- Patch crawl: all or nothing
Potential Workflow
1. After the crawl completes, log in to the web application.
2. Analyze the reports. Any surprises?
3. Check pages in Wayback. Any surprises?
4. Request a QA Report and run a patch crawl.
5. In archive mode, run Wayback QA on the necessary seeds, as well as on some linked content or pages that may not have archived well. Optional: compare sites in Proxy Mode versus archive mode.
6. Run a patch crawl from Wayback QA.
7. Check for improvements to the archived content.
8. Use the “Submit a Question” link to get further help and guidance for difficult-to-archive sites.
What is your workflow like?
Questions?
Please take our quick survey to let us know what you thought about today’s training and to share any suggestions or ideas you have for further Archive-It trainings!
http://www.surveymonkey.com/s/FHVCVP6 (see Webex chat for link)