Workshop on Web Archiving

Module 2: Existing Web Archives
Janne Nielsen, Asger Harlung, Ulrich Karstoft Have

Module 2: Existing Web Collections
- Introduction to web archives
- The Danish Netarkivet
- Internet Archive
- Library of Congress
- Other (US) web archives
- Ideas for NetLab workspace

Introduction to Web Archives
Focus on:
- The collection, including strategies
- Access
- Search
- Documentation

Netarkivet
- The collection, including strategies
- Access
- Search
- Documentation

- Netarkivet is run by the State and University Library (Aarhus) and the Royal Library (National Library of Denmark, Copenhagen).
- The Danish part of the Internet is defined as cultural heritage in the Legal Deposit Act (Act no of ), effective from June 1st, 2005.
- The "Danish part of the Internet" = all Internet content in Danish or meant for Danes, i.e. the top level domain .dk and danica (e.g. sites in Danish or addressing Danes on other domains such as .com, .eu, .nu, etc.).
- .dk domain names: in July 2005, in January 2013
- Dead .dk domains from July 2005 to January 2013:
- 2011: Roughly 222 TB; 6 m objects; the most common file types are html, jpeg, gif and png.
- 2013: The most common file types are html, jpeg, pdf and mp4 (video).
- 2014: On July 27 the data in Netarkivet amounted to 501 TB.
- 2015: On November 15 the data comprised 654 TB.

From http://netarkivet.dk/om-netarkivet
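As a rough illustration of how fast the collection grows, the two most recent size figures above (501 TB on July 27, 2014 and 654 TB on November 15, 2015) can be turned into an average growth rate. This is only a back-of-the-envelope sketch: it assumes linear growth between the two dates, and the function name is made up for the example.

```python
from datetime import date


def average_growth_tb_per_year(start: date, start_tb: float,
                               end: date, end_tb: float) -> float:
    """Average growth in TB per year, assuming linear growth between two dates."""
    days = (end - start).days
    return (end_tb - start_tb) / days * 365


# Figures taken from the slide; the linear-growth assumption is ours.
days_between = (date(2015, 11, 15) - date(2014, 7, 27)).days
rate = average_growth_tb_per_year(date(2014, 7, 27), 501.0,
                                  date(2015, 11, 15), 654.0)
print(f"{days_between} days, about {rate:.0f} TB per year")
```

Harvesting is not evenly spread over the year (broad crawls run in batches), so the result is only an order-of-magnitude indication.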
Netarkivet, from 2005
Strategies:
- Broad/bulk
- Selective
- Event
- Special
[Figure: the broad, event and selective strategies shown by coverage over time]

Netarkivet: Access
Access is restricted to:
- researchers (online)
- thesis students (on-site)
No one else can get access.

Netarkivet: Search
- Single URL search using the Wayback interface
- Free text search
NetLab is working on:
- multiple URL search
- file type search

Netarkivet: Documentation
Manual documentation:
- At collection level (netarkivet.dk, Word document)
- Curators (wiki)
Automated documentation:
- Harvesting data (metadata)
- Crawl logs, but not accessible yet

Internet Archive
- The collection, including strategies
- Access
- Search
- Documentation

The Internet Archive:
- American non-profit, founded in 1996
- not based on national legislation
- in general based on cumulative archiving, following hyperlinks from what was already archived
- the world's largest collection of archived web
- more than 491 billion web pages; collects approx. 1 billion pages per week
- quality is erratic: often only the top level(s) of a site are archived
- heterogeneous collection with no overall strategy, including donations

Internet Archive: Access
Free online access for everyone

Internet Archive: Search
Search for individual URLs, displayed via the Open Wayback interface
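A single-URL lookup comes down to combining a timestamp with the original address. The sketch below only builds the addresses and makes no network requests. It assumes the public Wayback Machine at web.archive.org and its availability endpoint at archive.org/wayback/available; the helper names are invented for the example, and Netarkivet's own Wayback installation is not covered.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed public endpoints of the Internet Archive.
WAYBACK_BASE = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDhhmmss stamp used in Wayback URLs."""
    return moment.strftime("%Y%m%d%H%M%S")


def capture_url(url: str, moment: datetime) -> str:
    """Address of the capture closest to `moment` for the given URL."""
    return f"{WAYBACK_BASE}/{wayback_timestamp(moment)}/{url}"


def calendar_url(url: str) -> str:
    """Address of the overview listing all captures of the given URL."""
    return f"{WAYBACK_BASE}/*/{url}"


def availability_query(url: str, moment: datetime) -> str:
    """Query string asking which capture is closest to `moment` (JSON response)."""
    params = {"url": url, "timestamp": wayback_timestamp(moment)}
    return f"{AVAILABILITY_API}?{urlencode(params)}"


print(capture_url("http://netarkivet.dk/", datetime(2013, 1, 1)))
print(calendar_url("http://netarkivet.dk/"))
print(availability_query("netarkivet.dk", datetime(2013, 1, 1)))
```

The Wayback Machine redirects to the nearest capture it actually holds, so the timestamp in the address that finally loads may differ from the one requested.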

Internet Archive: Documentation
- No accessible documentation for the URL except harvest time
- General documentation about how the Internet Archive harvests (FAQ)

Exercise in Web Archives
- Open the Internet Archive.
- Find one or more websites in the Wayback Machine.
- Move around on the website by clicking hyperlinks.
- Are elements missing, or do you notice anything else?
- If you have access to Netarkivet, you can choose to do the exercise in Netarkivet instead.

Funny observations?

Internet Archive: Archive-It
- Archive-It is the Internet Archive's subscription web archiving service
- A number of collections from their partners, including event collections
- Full-text searchable
- Archive-It Research Services (ARS) provides access to data sets extracted from collections (metadata, link graphs, named entities, other data)

Library of Congress
- The collection, including strategies
- Access
- Search
- Documentation

Library of Congress web archive:
- from 2000
- curated, topic-based and selective collections
- harvested by the Internet Archive (not Archive-It)
- 763 TB

Library of Congress: Access
- Free online access for everyone, via LoC Wayback
- In many cases only a 'flat' image

Library of Congress: Search
- Search for individual URLs, displayed via the Open Wayback interface
- Full-text search in metadata

Library of Congress: Documentation
- Very well documented and curated
- Documentation about each collection, and about each website

Other (US) Web Archives

Other Web Archives
- IIPC Member Archives
- List of Web archiving initiatives
- Truman, G. (2016). Web Archiving Environmental Scan. Harvard Library Report.

Ideas for NetLab workspace

The Four Phases in Research
- Corpus creation
- Analysis
- Dissemination
- Storage
Steps shown in the diagram: Search, Duplicates, Select, Isolate, Identify, Evaluate, Select/remove/combine

Ideas for NetLab Workspace
Challenges:
- Large amounts of data
- How to distinguish between the many versions?
- No visual representation
Needs:
- Different ways of filtering content
- Choosing and 'bookmarking' pages
- Isolation/extraction of corpus
- Flexible interface to present different metadata
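To make the filtering and versioning needs above more concrete, here is a minimal sketch of how a workspace might narrow a list of captures down to a corpus. Everything in it is hypothetical: the `Capture` record, its fields and the two helper functions are assumptions made for illustration and do not describe Netarkivet's or NetLab's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Capture:
    """Hypothetical record for one archived version of a URL."""
    url: str
    timestamp: str   # YYYYMMDDhhmmss, so string order equals time order
    mime_type: str
    digest: str      # checksum of the archived content


def filter_by_mime(captures: Iterable[Capture], mime_type: str) -> List[Capture]:
    """Keep only captures of one file type, e.g. 'text/html'."""
    return [c for c in captures if c.mime_type == mime_type]


def distinct_versions(captures: Iterable[Capture]) -> List[Capture]:
    """Keep the earliest capture of each distinct content digest per URL."""
    seen = set()
    result = []
    for capture in sorted(captures, key=lambda c: (c.url, c.timestamp)):
        key = (capture.url, capture.digest)
        if key not in seen:
            seen.add(key)
            result.append(capture)
    return result


# Invented sample data: three captures of one page, two with identical content.
sample = [
    Capture("http://example.dk/", "20130101000000", "text/html", "aaa"),
    Capture("http://example.dk/", "20130201000000", "text/html", "aaa"),
    Capture("http://example.dk/", "20130301000000", "text/html", "bbb"),
    Capture("http://example.dk/logo.png", "20130101000000", "image/png", "ccc"),
]
corpus = distinct_versions(filter_by_mime(sample, "text/html"))
print([c.timestamp for c in corpus])
```

Comparing checksums only detects byte-identical versions; pages that differ in a date stamp or an advertisement would still count as separate versions and would need a looser comparison.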

Inspiration: LARM.fm

Inspiration: Trello

Inspiration: Papers 2

Ideas for NetLab Workspace
