Workshop on Web Archiving

Slides:



Advertisements
Similar presentations
Don’t Type it! OCR it! How to use an online OCR..
Advertisements

Harvesting and archiving the Web Nordunet2000, Juha Hakala Helsinki University Library.
Recent developments in digital archiving and preservation Jan Fullerton Director General National Library of Australia.
Using SD K12 SharePoint ®. What is SharePoint? Microsoft SharePoint Components Web Browser Collaboration functions Process management modules Search modules.
Using SD K12 SharePoint®.
1 What is the Internet Archive We are a Digital Library Mission Statement: Universal access to human knowledge Founded in 1996 by Brewster Kahle in San.
Bibliothèque nationale de France Tallinn, BnF update: production and development priorities in 2015.
Digital & Preservation Resources Managing the digital collection life cycle.
BUILDING DIGITAL WEB ARCHIVES FOR FUTURE SCHOLARS Jani Stenvall
1 Archiving and Preserving the Web Kristine Hanna Internet Archive July 2008.
SOC 229D-A: Cultural Anthropology Research Project Proposal / Bibliography Dr. Adams - Fall 2010 Alison Gregory
27 APRIL 2015 STUDYING A NATION’S WEB DOMAIN OVER TIME: ANALYTICAL AND METHODOLOGICAL CONSIDERATIONS NIELS BRÜGGER, ASSOCIATE PROFESSOR, HEAD OF CENTRE.
1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006.
1 Archive-It Training University of Maryland July 12, 2007.
Annick Le Follic Bibliothèque nationale de France Tallinn,
Danish Legal Deposit Experiences & the Need for Adjustments by Birgit N. Henriksen Head of Digitization and Web Department The Royal Library, Denmark.
OARE Module 3: OARE Portal.
Joanne Archer University of Maryland Kate Odell Archive-It Abbie Grotke Library of Congress Tessa Fallon Columbia University Creating and Maintaining Web.
WebArchiv Czech Web Archive IIPC 2007, Paris.
1 Archiving and Preserving the Web Dan Avery Kristine Hanna Merrilee Proffitt Internet Archive RLG April 2006.
Marty Harris aka TEXT QUERY SYSTEM Marty Harris Mgr TRD.
Web Capture team Office of strategic initiatives February 27, 2006 Selecting Content from the Web: Challenges and Experiences of the Library of Congress.
Dreamweaver MX Unit A CIS 205—Web Site Design & Development.
Cataloguing Electronic resources Prepared by the Cataloguing Team at Charles Sturt University.
Plans for 2015 Tallinn, Jan 29 th, 2015 Ditte Laursen, Sabine Schostag,
Office of Strategic Initiatives All Hands Meeting-March 2010 Challenges in Web Archiving: Library of Congress Edition Abbie Grotke, Web Archiving Team.
1 Archive-It: Archiving and Preserving Born Digital Content NDIIPP June 2009 Molly Bragg Partner Specialist Internet Archive.
A historical perspective of Digital Preservation at The Royal Library, Denmark.
Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.
Netarkivet RESAW seminar, Dec 2-3, 2013 Day 1. Who are we today □Birgit N. Henriksen, head of digital preservation, KB □Bjarne Andersen, head of digital.
Metadata Extraction & Web Archives: Automating the Record Creation Process Abbie Grotke / Gina Jones /
1 BCS, Oxfordshire, 19 February, 2004 WEB ARCHIVING issues and challenges Deborah Woodyard Digital Preservation Coordinator.
Chapter 8 Using the Web for Teaching and Learning.
1 NetarchiveSuite Workshop Paris November , 2011.
Video Active Presentation Agenda: –Demonstration of videoactive.eu Frontend and Backend fiatifta.dk Copenhagen September 2008.
Strategies for archiving the Danish web space Bjarne Andersen Head of Digital Resources State and University Library, Aarhus
ACES User Interface Workshop #1 Prototype Inspection 22. November 2011.
Reference Management Module I: Introduction By Rehema Chande-Mallya(PhD)
The Web Web Design. 3.2 The Web Focus on Reading Main Ideas A URL is an address that identifies a specific Web page. Web browsers have varying capabilities.
Web Archiving Workshop Mark Phillips Texas Conference on Digital Libraries June 4, 2008.
Knowledge Hub Walkthrough August
Knowledge Hub Walkthrough August
Copenhagen 11 March 2015 Dias 1 Theme 2a: Media Tools — NetLab, a Research Infrastructure for Internet Studies Niels Brügger, Aarhus University Advisory.
4.01 How Web Pages Work.
Workshop on Web Archiving
4.01 How Web Pages Work.
Archiving & Preserving Digital Content
Institution update KB DK
Virtual Research Environments
Databases vs the Internet
Joanne Archer University of Maryland Libraries
Research4Life Programmes: Similarities and Differences!
Table of Contents: Part B
Extraction, aggregation and classification at Web Scale
Documentation as part of curation in web archiving.
4Shared Clone | File Sharing php Script from PHP Script directory.
Wisconsin County and Municipal Government Collections in Archive-It
Fast, free, fun Weebly web sites.
Márton Németh – László Drótos (National Széchényi Library, Hungary)
From Big Bang to beta An overview of the 3R Project
Márton Németh – László Drótos How to catalogue a web archive?
Global Legal Information Network
Research4Life Programmes: Similarities and Differences! (Part A)
PubMed Database Interface (Basic Course: Module 4)
Research4Life Programmes: Similarities and Differences!
Web archives as a research subject
4.01 How Web Pages Work.
Internet Vocabulary Terms
Table of Contents – Part B
Metadata supported full-text search in a web archive
Presentation transcript:

Workshop on Web Archiving MODULE 2: EXISTING WEB ARCHIVES Janne Nielsen Asger Harlung Ulrich Karstoft Have

Module 2: Existing Web Collections Introduction to web archives The Danish Netarkivet Internet Archive Library of Congress Other (US) web archives Ideas for NetLab workspace

Introduction to Web Archives Focus on: The collection, including strategies Access Search Documentation

Netarkivet The collection, including strategies Access Search Documentation

Netarkivet is run by the State and University Library (Aarhus) and the Royal Library (National Library of Denmark, Copenhagen). The Danish part of the Internet is defined as cultural heritage in the Legal Deposit Act (Act no. 1439 of 22.12.2004), effective from June 1st, 2005 The ”Danish part of the Internet” = all Internet content in Danish or meant for Danes  the top level domain .dk and danica (e.g. sites in Danish or addressing Danes on other domains such as .com, .eu, .nu, etc.) .dk domain names: 607.000 in July 2005, 960.000 in January 2013 Dead .dk domains from July 2005 to January 2013: 741.838 2011: Roughly 222 TB; 6 m objects, most common file types are html, jpeg, gif and png 2013: Most common file types are html, jpeg, pdf and mp4 (video) 2014: On July 27 the data in Netarkivet amounted to 501 TB 2015: On November 15 the data comprised 654 TB

From http://netarkivet.dk/om-netarkivet Netarkivet 2005  Strategies: Broad/bulk Selective Event Special From http://netarkivet.dk/om-netarkivet Broad Coverage Time Event Selective E

Netarkivet The collection, including strategies Access Search options Documentation Access is restricted to: researchers (online) thesis students (on-site) No-one else can get access.

Netarkivet The collection, including strategies Access Search options Documentation Single URL search using the wayback interface

Netarkivet The collection, including strategies Access Search options Documentation Single URL search using the wayback interface Free text search NetLab is working on: multiple URL search file type search

Netarkivet The collection, including strategies Access Search options Documentation Manual documentation: At collection level (netarkivet.dk, word-dokument) Curators (wiki) Automated documentation: Harvesting data (metadata) Crawl logs, but not accessible yet

Internet Archive The collection, including strategies Access Search Documentation

The Internet Archive: american non-profit from 1996 not based on national legislation in general based on cumulative archiving, following hyperlinks from what was already archived the worlds largest collection of archived web more than 491 billion web pages, collects app. 1 billion pages per week quality is erratic — often only top level(s) heterogenious collection, no overall strategy, including donations…

Internet Archive The collection Access Search Documentation Free online access for everyone

Internet Archive The collection Access Search Documentation Search for individual URLs, displayed via Open Wayback interface

Internet Archive The collection Access Search Documentation No accessible documentation for the URL except harvest time General documentation about how the Internet Archive harvests (FAQ)

Exercise in Web Archives Open Internet Archive on https://archive.org Find one or more websites in the Wayback Machine. Move around on the website by clicking hyperlinks. Are elements missing, or do you notice anything else? If you have access to Netarkivet, you can choose to do the excercise in Netarkivet: https://netarkiv-wayback.kb.dk

Funny observations?

Internet Archive Archive-It — the Internet Archive’s subscription web archiving service A number of collections from their partners, including event collections Full-text searchable Archive-It Research Services (ARS) — provides access to data sets extracted from collections (metadata, link graphs, named entities, other data). https://archive-it.org/

Library of Congress The collection, including strategies Access Search Documentation

Library of Congress web archive: from 2000 curated, topic based and selective collections harvested by the Internet Archive (not Archive-It) 763TB

Library of Congress The collection, including strategies Access Search Documentation Free online access for everyone, via LoC Wayback https://www.loc.gov/websites/collections/ In many cases only ‘flat’ image

Library of Congress The collection, including strategies Access Search Documentation Search for individual URL, displayed via Open Wayback interface Full-text search in meta data

Library of Congress The collection, including strategies Access Search Documentation Very well documented and curated Documentation about each collection, and about each website

Other (US) Web Archives

Other Web Archives IIPC Member Archives http://netpreserve.org/resources/member-archives List of Web archiving initiatives, https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives Truman, G. (2016). WebArchiving Environmental Scan. Harvard Library Report.http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314

Ideas for NetLab workspace

The Four Phases in Research Corpus creation Analysis Dissemination Storage Search Duplicates Select Isolate Identifify Evaluate Select/remove/combine

Ideas for NetLab Workspace Challenges: Large amounts of data How to distinguish between the many versions? No visual representation Needs: Different ways of filtering content Choosing and ‘bookmarking’ pages Isolation/extraction of corpus Flexible interface to present different metadata

Inspiration: LARM.fm

Inspiration: Trello

Inspiration: Papers 2

Ideas for NetLab Workspace

Ideas for NetLab Workspace

Ideas for NetLab Workspace

Ideas for NetLab Workspace

Ideas for NetLab Workspace

Ideas for NetLab Workspace