Workshop on Web Archiving

1 Workshop on Web Archiving
MODULE 2: EXISTING WEB ARCHIVES
Janne Nielsen, Asger Harlung, Ulrich Karstoft Have

2 Module 2: Existing Web Collections
Introduction to web archives
The Danish Netarkivet
Internet Archive
Library of Congress
Other (US) web archives
Ideas for NetLab workspace

3 Introduction to Web Archives
Focus on: the collection, including strategies; access; search; documentation

4 Netarkivet The collection, including strategies Access Search
Documentation

5 Netarkivet is run by the State and University Library (Aarhus) and the Royal Library (National Library of Denmark, Copenhagen).
The Danish part of the Internet is defined as cultural heritage in the Legal Deposit Act (Act no of ), effective from June 1st, 2005.
The "Danish part of the Internet" = all Internet content in Danish or meant for Danes: the top-level domain .dk and Danica (e.g. sites in Danish or addressing Danes on other domains such as .com, .eu, .nu, etc.)
.dk domain names: in July 2005, in January 2013
Dead .dk domains from July 2005 to January 2013:
2011: Roughly 222 TB; 6 m objects; most common file types are html, jpeg, gif and png
2013: Most common file types are html, jpeg, pdf and mp4 (video)
2014: On July 27 the data in Netarkivet amounted to 501 TB
2015: On November 15 the data comprised 654 TB

6 Netarkivet, from 2005
Strategies: broad/bulk, selective, event, special
[Figure: harvesting strategies (broad, selective, event) plotted by coverage over time. From http://netarkivet.dk/om-netarkivet]

7 Netarkivet The collection, including strategies Access Search options
Documentation Access is restricted to: researchers (online); thesis students (on-site). No one else can get access.

8 Netarkivet The collection, including strategies Access Search options
Documentation Single URL search using the wayback interface

11 Netarkivet The collection, including strategies Access Search options
Documentation Single URL search using the wayback interface Free text search NetLab is working on: multiple URL search file type search

12 Netarkivet The collection, including strategies Access Search options
Documentation Manual documentation: at collection level (netarkivet.dk, Word document); curators (wiki). Automated documentation: harvesting data (metadata); crawl logs, but not accessible yet

13 Internet Archive The collection, including strategies Access Search
Documentation

14 The Internet Archive:
American non-profit from 1996
not based on national legislation
in general based on cumulative archiving, following hyperlinks from what was already archived
the world's largest collection of archived web
more than 491 billion web pages; collects approx. 1 billion pages per week
quality is erratic: often only the top level(s)
heterogeneous collection, no overall strategy, including donations…

15 Internet Archive The collection Access Search
Documentation Free online access for everyone

16 Internet Archive The collection Access Search
Documentation Search for individual URLs, displayed via Open Wayback interface
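As a rough illustration of what a single-URL lookup involves, the sketch below builds the two addresses such a search typically relies on: a replay address for a capture near a given date, and a query asking whether any capture exists. The `web.archive.org/web/<timestamp>/<url>` pattern and the `archive.org/wayback/available` endpoint are assumptions about the public Wayback Machine and are not taken from the slides; the function names are hypothetical. No network request is made.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed public Wayback Machine endpoints (not stated in the slides).
WAYBACK_REPLAY = "https://web.archive.org/web"
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def wayback_timestamp(when: datetime) -> str:
    """Format a date as the 14-digit YYYYMMDDhhmmss string used in Wayback URLs."""
    return when.strftime("%Y%m%d%H%M%S")


def replay_url(url: str, when: datetime) -> str:
    """Address that asks the Wayback interface for the capture closest to `when`."""
    return f"{WAYBACK_REPLAY}/{wayback_timestamp(when)}/{url}"


def availability_query(url: str, when: datetime) -> str:
    """Address that asks whether a capture of `url` exists near `when`."""
    params = urlencode({"url": url, "timestamp": wayback_timestamp(when)})
    return f"{WAYBACK_AVAILABILITY}?{params}"
```

For example, `replay_url("netarkivet.dk", datetime(2015, 11, 15))` gives an address that can be pasted into a browser; the archive then redirects to the nearest capture it actually holds, which may be far from the requested date.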

19 Internet Archive The collection Access Search
Documentation No accessible documentation for the URL except harvest time General documentation about how the Internet Archive harvests (FAQ)

20 Exercise in Web Archives
Open Internet Archive on
Find one or more websites in the Wayback Machine.
Move around on the website by clicking hyperlinks.
Are elements missing, or do you notice anything else?
If you have access to Netarkivet, you can choose to do the exercise in Netarkivet:

21 Funny observations?

22 Internet Archive
Archive-It: the Internet Archive's subscription web archiving service
A number of collections from their partners, including event collections
Full-text searchable
Archive-It Research Services (ARS): provides access to data sets extracted from collections (metadata, link graphs, named entities, other data).

23 Library of Congress The collection, including strategies Access Search
Documentation

24 Library of Congress web archive:
from 2000
curated, topic-based and selective collections
harvested by the Internet Archive (not Archive-It)
763 TB

25 Library of Congress The collection, including strategies Access Search
Documentation Free online access for everyone, via LoC Wayback In many cases only ‘flat’ image

26 Library of Congress The collection, including strategies Access Search
Documentation Search for individual URLs, displayed via Open Wayback interface. Full-text search in metadata

27 Library of Congress The collection, including strategies Access Search
Documentation Very well documented and curated Documentation about each collection, and about each website

28 Other (US) Web Archives

29 Other Web Archives
IIPC Member Archives
List of Web archiving initiatives
Truman, G. (2016). Web Archiving Environmental Scan. Harvard Library Report.

30 Ideas for NetLab workspace

31 The Four Phases in Research
Corpus creation, Analysis, Dissemination, Storage
Search, Duplicates, Select, Isolate, Identify, Evaluate, Select/remove/combine

34 Ideas for NetLab Workspace
Challenges:
Large amounts of data
How to distinguish between the many versions?
No visual representation
Needs:
Different ways of filtering content
Choosing and 'bookmarking' pages
Isolation/extraction of corpus
Flexible interface to present different metadata
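One way to picture the filtering and version-distinguishing needs listed here is a minimal sketch over capture records. Everything in it is an assumption made for illustration: the `Capture` record, its fields (URL, 14-digit harvest timestamp, MIME type, content digest) and both helper functions are hypothetical, and it does not describe how Netarkivet or NetLab actually store or expose harvest metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """Hypothetical harvest record; field names are illustrative only."""
    url: str
    timestamp: str   # 14-digit YYYYMMDDhhmmss harvest time
    mime_type: str   # e.g. "text/html"
    digest: str      # checksum of the harvested content


def filter_by_mime(captures, mime_type):
    """Keep only captures of one file type (a simple 'file type search')."""
    return [c for c in captures if c.mime_type == mime_type]


def distinct_versions(captures):
    """Keep the earliest capture of each (url, digest) pair.

    Captures of the same URL with an identical digest are treated as
    duplicates of one version, so only content changes remain.
    """
    seen = set()
    versions = []
    for c in sorted(captures, key=lambda c: (c.url, c.timestamp)):
        key = (c.url, c.digest)
        if key not in seen:
            seen.add(key)
            versions.append(c)
    return versions
```

Treating an identical digest as "the same version" is a deliberate simplification: pages that differ only in a timestamp or an advertisement would still count as separate versions under this rule.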

35 Inspiration: LARM.fm

36 Inspiration: Trello

37 Inspiration: Papers 2

38 Ideas for NetLab Workspace

